
Author:

Tang, Haoyu [1] | Hu, Yupeng [2] | Wang, Yunxiao [3] | Zhang, Shuaike [4] | Xu, Mingzhu [5] | Zhu, Jihua [6] | Zheng, Qinghai [7] (Scholar: Zheng Qinghai)

Indexed by:

EI; Scopus; SCIE

Abstract:

In the era of smart cities, the advent of Internet of Things technology has catalyzed the proliferation of multimodal sensor data, presenting new challenges in cross-modal event detection, particularly in audio event detection via textual queries. This paper focuses on the novel task of text-to-audio grounding (TAG), which aims to precisely localize the sound segments within an untrimmed audio clip that correspond to events described in a textual query. This challenging task requires both multimodal (acoustic and linguistic) information fusion and reasoning about the cross-modal semantic matching between the given audio and the textual query. Unlike conventional methods, which often overlook the nuanced interactions between and within modalities, we introduce the Cross-modal Graph Interaction (CGI) model. This approach leverages a language graph to model complex semantic relationships between query words, enhancing the understanding of textual queries. Additionally, a cross-modal attention mechanism generates snippet-specific query representations, facilitating fine-grained semantic matching between audio segments and textual descriptions. A cross-gating module further refines this process by emphasizing relevant features across modalities and suppressing irrelevant information, optimizing multimodal information fusion. Our comprehensive evaluation on the AudioGrounding benchmark dataset not only demonstrates the CGI model's superior performance over existing methods, but also underscores the significance of sophisticated multimodal interaction in improving the efficacy of TAG in smart cities.
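The three components named in the abstract (language graph over query words, snippet-specific cross-modal attention, and cross-gating) can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the fully connected similarity-based word graph, the residual update, the feature dimensions, and the omission of learned projection matrices are all simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_graph_layer(words):
    """One propagation step over a fully connected word graph:
    soft adjacency from pairwise similarity, then message passing
    with a residual connection."""
    adj = softmax(words @ words.T, axis=-1)   # (L, L) edge weights
    return words + adj @ words                # residual update

def snippet_specific_queries(audio, words):
    """For each audio snippet, attend over the query words to build a
    snippet-specific query representation.
    audio: (T, d) snippet features; words: (L, d) word features."""
    alpha = softmax(audio @ words.T, axis=-1)  # (T, L) attention weights
    return alpha @ words                       # (T, d) attended queries

def cross_gate(audio, queries):
    """Mutual sigmoid gating: each modality elementwise scales the
    other's features, damping irrelevant dimensions."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    return audio * sig(queries), queries * sig(audio)

rng = np.random.default_rng(0)
T, L, d = 8, 5, 16                  # snippets, query words, feature dim
audio = rng.normal(size=(T, d))
words = language_graph_layer(rng.normal(size=(L, d)))
q = snippet_specific_queries(audio, words)
gated_audio, gated_q = cross_gate(audio, q)
```

In a trained model each matrix product would pass through learned projections, and per-snippet grounding scores would be predicted from the gated features; the sketch only shows the information flow between the three modules.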

Keyword:

Cross-modal learning; Graph neural network; Multimodal information fusion; Smart city; Text-to-audio grounding

Community:

  • [ 1 ] [Tang, Haoyu]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 2 ] [Hu, Yupeng]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 3 ] [Wang, Yunxiao]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 4 ] [Zhang, Shuaike]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 5 ] [Xu, Mingzhu]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 6 ] [Zhu, Jihua]Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China
  • [ 7 ] [Zheng, Qinghai]Fuzhou Univ, Sch Software Engn, Fuzhou 350108, Peoples R China

Reprint Author's Address:

  • [Hu, Yupeng]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China



Source:

INFORMATION FUSION

ISSN: 1566-2535

Year: 2024

Volume: 110

Impact Factor: 14.800 (JCR@2023)

ESI Highly Cited Papers on the List: 0

