Abstract:
Remote sensing (RS) audio-visual cross-modal retrieval is a challenging task in the search for meaningful RS information. The impact of multiscale features, and of the redundant information they carry in RS images, cannot be overlooked in the retrieval task. In addition, handling the completely different physical expressions of the two modalities is crucial for cross-modal retrieval. To tackle these issues, we propose a Scale-aware Adaptive Refinement and Cross-Interaction (SARCI) network. The Quaternion-attention Dominated Multiscale Visual Refinement (QDMVR) module in SARCI learns multiscale visual features and further refines features that contain redundant information at each scale. To better integrate channel attention and spatial attention for adaptive learning of meaningful visual semantics, we propose a symmetric quaternion attention (SQA) mechanism within the QDMVR module to enhance RS visual features. SQA acts on both high-level and low-level features to capture salient RS visual information across scales. To let information from different modalities interact more effectively, we propose an Instruction-based Cross-Learning Module (ICLM) that performs cross-modal feature interaction based on the characteristics of the two modalities. The SARCI network achieves state-of-the-art performance on three public RS audio-visual cross-modal datasets: Sydney, UCM, and RSICD. The code is available at: https://github.com/WUTCM-Lab/SARCI.
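The abstract describes SQA only at a high level, as a mechanism that fuses channel and spatial attention over visual feature maps. Purely as an illustration of that general idea (not the authors' actual quaternion formulation, whose details are not given here), a toy NumPy sketch of symmetrically combining a channel gate and a spatial gate might look like this; all function names and the averaging scheme are assumptions:

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    # x: (C, H, W). Global average pooling per channel yields a (C,)
    # descriptor; a sigmoid turns it into a per-channel gate.
    gate = _sigmoid(x.mean(axis=(1, 2)))          # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    # Mean over channels yields an (H, W) saliency map; a sigmoid
    # turns it into a per-location gate.
    gate = _sigmoid(x.mean(axis=0))               # (H, W)
    return x * gate[None, :, :]

def symmetric_attention(x):
    # Apply both branches and average them, so channel-wise and
    # spatial cues refine the feature map symmetrically.
    return 0.5 * (channel_attention(x) + spatial_attention(x))

# Toy multiscale feature map: 8 channels on a 4x4 grid.
feat = np.random.randn(8, 4, 4)
refined = symmetric_attention(feat)
print(refined.shape)  # (8, 4, 4): attention gates preserve the shape
```

In SARCI this kind of gating would be applied at both high-level and low-level feature scales; the sketch above only shows a single scale.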
Source: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
ISSN: 0196-2892
Year: 2024
Volume: 62
Impact Factor: 7.500 (JCR@2023)
ESI Highly Cited Papers on the List: 0