Abstract:
Multimodal sentiment analysis is an actively growing research area that utilizes language, acoustic, and visual signals to predict sentiment tendency. Compared with language, acoustic and visual features carry a more pronounced personal style, which may degrade the model's generalization capability. This issue is exacerbated in a speaker-independent setting, where the model encounters samples from unseen speakers during testing. To mitigate the impact of personal style, we propose SIMR, a framework for learning speaker-independent multimodal representations. The framework separates the nonverbal inputs into a style encoding and a content representation with the aid of informative cross-modal correlations. Moreover, when integrating complementary cross-modal information, classical transformer-based approaches are inherently inclined to discover compatible cross-modal interactions while ignoring incompatible ones. In contrast, we propose to locate both simultaneously through an enhanced cross-modal transformer module. Experimental results show that the proposed model achieves state-of-the-art performance on several datasets.
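To make the notion of modeling both compatible and incompatible cross-modal interactions concrete, the sketch below shows one way a dual-branch cross-modal attention block could look. It is a minimal illustration only: the module name, the dual-softmax scoring (softmax over the similarity scores and over their negation), and all dimensions are assumptions for exposition, not the actual SIMR architecture described in the paper.

```python
# Minimal sketch (NOT the paper's SIMR module): cross-modal attention that
# scores both "compatible" (similar) and "incompatible" (dissimilar)
# query-key pairs. All names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualCrossModalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # queries from the target modality
        self.k_proj = nn.Linear(dim, dim)   # keys from the source modality
        self.v_proj = nn.Linear(dim, dim)   # values from the source modality
        self.out = nn.Linear(2 * dim, dim)  # fuse the two interaction branches

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, len_t, dim), source: (batch, len_s, dim)
        q = self.q_proj(target)
        k = self.k_proj(source)
        v = self.v_proj(source)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        compat = F.softmax(scores, dim=-1) @ v     # attends to similar source positions
        incompat = F.softmax(-scores, dim=-1) @ v  # attends to dissimilar source positions
        return self.out(torch.cat([compat, incompat], dim=-1))

# Usage example: attend from language features to acoustic features.
block = DualCrossModalAttention(dim=64)
lang = torch.randn(8, 20, 64)      # language sequence
acoustic = torch.randn(8, 50, 64)  # acoustic sequence
fused = block(lang, acoustic)      # shape: (8, 20, 64)
```

In this reading, the second branch surfaces the most dissimilar source positions, which is one plausible way to expose "incompatible" interactions that a standard cross-modal transformer would down-weight; the paper's enhanced module may differ.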
Source: INFORMATION SCIENCES
ISSN: 0020-0255
Year: 2023
Volume: 628
Pages: 208-225
ESI Discipline: COMPUTER SCIENCE;
ESI HC Threshold:32
CAS Journal Grade:1
Cited Count:
WoS CC Cited Count: 5
SCOPUS Cited Count: 5
ESI Highly Cited Papers on the List: 0