Abstract:
Multimodal emotion recognition in conversations aims to detect emotions accurately by integrating audio, text, and video modalities, and plays an important role in various systems. Existing approaches use convolutional and recurrent networks to learn short-term emotional information from individual modalities, or employ graph and attention mechanisms to integrate long-term emotional information across multiple modalities. These methods effectively combine the emotional information of conversational content in the time domain. However, psychological research shows that emotional information is conveyed not only in the time domain but also in the frequency domain (e.g., pitch and speech rate). To capture emotions from a more comprehensive perspective, we propose TF-MERC, a framework that integrates both the time and frequency domains. TF-MERC uses a multi-domain alignment module to learn modality information within the time and frequency domains, and then employs FATransformer to deeply integrate the multimodal associations between the two domains, providing a more comprehensive basis for emotion prediction. Experimental results show that TF-MERC outperforms state-of-the-art methods, achieving superior performance across multiple datasets. © 2025 ACM.
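The record gives no implementation details, so the sketch below is only a rough, hypothetical illustration of the time-frequency fusion idea described in the abstract (PyTorch assumed; the module name TimeFrequencyFusion, the feature sizes, and the use of an FFT plus cross-attention are all invented for illustration and are not the authors' FATransformer or TF-MERC code).

```python
import torch
import torch.nn as nn

class TimeFrequencyFusion(nn.Module):
    """Illustrative sketch: fuse a time-domain utterance-feature sequence
    with a frequency-domain view of the same sequence via cross-attention.
    (Hypothetical module, not the TF-MERC reference implementation.)"""

    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 7):
        super().__init__()
        self.freq_proj = nn.Linear(2 * dim, dim)      # real+imag parts -> dim
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) time-domain features, one row per utterance
        spec = torch.fft.rfft(x, dim=1)               # FFT along the time axis
        freq = torch.cat([spec.real, spec.imag], dim=-1)
        freq = self.freq_proj(freq)                   # (batch, seq_len//2+1, dim)
        # time-domain queries attend to frequency-domain keys/values
        fused, _ = self.cross_attn(x, freq, freq)
        return self.classifier(fused.mean(dim=1))     # conversation-level logits

# toy usage: 2 conversations, 24 utterances each, 256-dim features
feats = torch.randn(2, 24, 256)
logits = TimeFrequencyFusion()(feats)
print(logits.shape)                                   # torch.Size([2, 7])
```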
Year: 2025
Page: 126-134
Language: English