Abstract:
Remote-sensing image-text (RSIT) retrieval uses either textual descriptions or remote-sensing images (RSIs) as queries to retrieve relevant RSIs or corresponding text descriptions. Many traditional cross-modal RSIT retrieval methods overlook the importance of capturing salient information and of establishing the prior similarity between RSIs and texts, which degrades cross-modal retrieval performance. In this article, we address these challenges with a novel approach, multiscale salient image-guided text alignment (MSITA), which learns salient information by aligning text with images for effective cross-modal RSIT retrieval. MSITA first incorporates a multiscale fusion module and a salient learning module to extract salient information. It then introduces an image-guided text alignment (IGTA) mechanism that uses image information to guide the alignment of texts, capturing fine-grained correspondences between RSI regions and textual descriptions. Beyond these components, a novel loss function enhances the similarity across modalities and reinforces the prior similarity between RSIs and texts. Extensive experiments on four widely adopted RSIT datasets confirm that MSITA significantly outperforms other state-of-the-art methods in cross-modal RSIT retrieval.
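The record contains no implementation details, but as a rough illustration of the IGTA idea described in the abstract, the sketch below uses image region features to guide a cross-attention re-weighting of text tokens. It is a minimal, hypothetical rendering: the class name, dimensions, and the cosine-similarity readout are assumptions for illustration, not the authors' released code.

# Hypothetical sketch of an image-guided text alignment (IGTA) step:
# image region features act as queries that re-weight text token
# features via cross-attention. All names and sizes are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageGuidedTextAlignment(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # image regions -> queries
        self.k_proj = nn.Linear(dim, dim)  # text tokens  -> keys
        self.v_proj = nn.Linear(dim, dim)  # text tokens  -> values

    def forward(self, img_regions, text_tokens):
        # img_regions: (B, R, D) region features; text_tokens: (B, T, D)
        q = self.q_proj(img_regions)                   # (B, R, D)
        k = self.k_proj(text_tokens)                   # (B, T, D)
        v = self.v_proj(text_tokens)                   # (B, T, D)
        # scaled dot-product attention: each region attends over tokens
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        # text representation aligned to each image region: (B, R, D)
        return attn @ v

# Toy usage: region/token counts and the similarity readout are assumed.
igta = ImageGuidedTextAlignment(dim=512)
img = torch.randn(2, 36, 512)   # e.g., 36 region features per image
txt = torch.randn(2, 20, 512)   # e.g., 20 token features per caption
aligned = igta(img, txt)
# Region-wise cosine similarity, averaged per pair; a contrastive-style
# loss over matched/unmatched pairs could be built on such a score.
sim = F.cosine_similarity(aligned, img, dim=-1).mean(dim=1)  # (B,)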
Source:
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
ISSN: 0196-2892
Year: 2024
Volume: 62
Impact Factor: 7.500 (JCR@2023)