
Author:

Tang, Haoyu [1] | Hu, Yupeng [2] | Wang, Yunxiao [3] | Zhang, Shuaike [4] | Xu, Mingzhu [5] | Zhu, Jihua [6] | Zheng, Qinghai [7] (Scholar: Zheng Qinghai)

Indexed by:

EI; Scopus; SCIE

Abstract:

In the era of smart cities, the advent of Internet of Things technology has catalyzed the proliferation of multimodal sensor data, presenting new challenges in cross-modal event detection, particularly in audio event detection via textual queries. This paper focuses on the novel task of text-to-audio grounding (TAG), which aims to precisely localize the sound segments within an untrimmed audio clip that correspond to events described in a textual query. This challenging task requires both multimodal (acoustic and linguistic) information fusion and reasoning about the cross-modal semantic matching between the given audio and the textual query. Unlike conventional methods, which often overlook the nuanced interactions between and within modalities, we introduce the Cross-modal Graph Interaction (CGI) model. This approach leverages a language graph to model complex semantic relationships between query words, enhancing the understanding of textual queries. Additionally, a cross-modal attention mechanism generates snippet-specific query representations, facilitating fine-grained semantic matching between audio segments and textual descriptions. A cross-gating module further refines this process by emphasizing relevant features across modalities and suppressing irrelevant information, optimizing multimodal information fusion. Our comprehensive evaluation on the AudioGrounding benchmark dataset not only demonstrates the CGI model's superior performance over existing methods, but also underscores the significance of sophisticated multimodal interaction in improving the efficacy of TAG in smart cities.
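The three components named in the abstract (language graph over query words, snippet-specific cross-modal attention, and cross-gating) can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the fully connected similarity-based word graph, the residual update, the feature dimensions, and the omission of learned projection matrices are all simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_graph_layer(words):
    """One propagation step over a fully connected word graph:
    soft adjacency from pairwise similarity, then message passing
    with a residual connection."""
    adj = softmax(words @ words.T, axis=-1)   # (L, L) edge weights
    return words + adj @ words                # residual update

def snippet_specific_queries(audio, words):
    """For each audio snippet, attend over the query words to build a
    snippet-specific query representation.
    audio: (T, d) snippet features; words: (L, d) word features."""
    alpha = softmax(audio @ words.T, axis=-1)  # (T, L) attention weights
    return alpha @ words                       # (T, d) attended queries

def cross_gate(audio, queries):
    """Mutual sigmoid gating: each modality elementwise scales the
    other's features, damping irrelevant dimensions."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    return audio * sig(queries), queries * sig(audio)

rng = np.random.default_rng(0)
T, L, d = 8, 5, 16                  # snippets, query words, feature dim
audio = rng.normal(size=(T, d))
words = language_graph_layer(rng.normal(size=(L, d)))
q = snippet_specific_queries(audio, words)
gated_audio, gated_q = cross_gate(audio, q)
```

In a trained model each matrix product would pass through learned projections, and per-snippet grounding scores would be predicted from the gated features; the sketch only shows the information flow between the three modules.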

Keyword:

Cross-modal learning; Graph neural network; Multimodal information fusion; Smart city; Text-to-audio grounding

Community:

  • [ 1 ] [Tang, Haoyu]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 2 ] [Hu, Yupeng]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 3 ] [Wang, Yunxiao]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 4 ] [Zhang, Shuaike]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 5 ] [Xu, Mingzhu]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China
  • [ 6 ] [Zhu, Jihua]Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China
  • [ 7 ] [Zheng, Qinghai]Fuzhou Univ, Sch Software Engn, Fuzhou 350108, Peoples R China

Reprint Author's Address:

  • [Hu, Yupeng]Shandong Univ, Sch Software Engn, Jinan 250101, Peoples R China



Source:

INFORMATION FUSION

ISSN: 1566-2535

Year: 2024

Volume: 110

Impact Factor: 14.800 (JCR@2023)

ESI Highly Cited Papers on the List: 0

