Abstract:
Remote sensing image-text retrieval (RSITR) is a cross-modal task that integrates visual and textual information and has attracted significant attention in remote sensing research. Remote sensing images typically contain complex scenes with abundant detail, which makes accurate semantic alignment between images and texts difficult; despite advances in the field, achieving precise alignment in such intricate contexts remains a major hurdle. To address this challenge, this article introduces a novel context-aware local-global semantic alignment (CLGSA) method consisting of two key modules: the local key feature alignment (LKFA) module and the cross-sample global semantic alignment (CGSA) module. The LKFA module incorporates a local image masking and reconstruction task to improve the alignment between image and text features: it masks certain regions of the image and uses textual context to guide reconstruction of the masked areas, enhancing local semantic alignment and enabling more accurate retrieval of region-specific content. The CGSA module employs a hard sample triplet loss to improve global semantic consistency; by prioritizing difficult samples during training, it refines the feature space distribution, helping the model better capture global semantics across entire image-text pairs. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves an mR score of 32.07% on the RSICD dataset and 46.63% on the RSITMD dataset, outperforming baseline methods and confirming the robustness and accuracy of the approach.
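As a concrete illustration of the hard sample triplet loss mentioned in the abstract, the sketch below implements the widely used hardest-negative-in-batch variant (as popularized by VSE++). The paper's exact formulation is not reproduced here; the cosine similarity and the margin value of 0.2 are illustrative assumptions, not the authors' stated choices.

```python
# Minimal sketch of a hard-sample triplet loss for cross-modal retrieval,
# in the spirit of the CGSA module. Hardest-in-batch variant; the margin
# and cosine similarity are assumptions for illustration only.
import torch
import torch.nn.functional as F

def hard_triplet_loss(img_emb: torch.Tensor,
                      txt_emb: torch.Tensor,
                      margin: float = 0.2) -> torch.Tensor:
    """img_emb, txt_emb: (B, D) embeddings; row i of each tensor
    corresponds to the same matched image-text pair."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.t()               # (B, B) cosine similarities
    pos = sim.diag().view(-1, 1)              # matched-pair similarities

    # Mask the diagonal so positives cannot be selected as negatives.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim_masked = sim.masked_fill(mask, float('-inf'))

    # Hardest negative caption per image (rows) and
    # hardest negative image per caption (columns).
    hard_txt = sim_masked.max(dim=1).values.view(-1, 1)
    hard_img = sim_masked.max(dim=0).values.view(-1, 1)

    # Hinge losses: matched pairs must beat hardest negatives by `margin`.
    loss_i2t = F.relu(margin + hard_txt - pos)
    loss_t2i = F.relu(margin + hard_img - pos)
    return (loss_i2t + loss_t2i).mean()
```

Focusing the hinge on the single hardest negative in each mini-batch, rather than summing over all negatives, is what makes the loss prioritize difficult samples during training.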
Source:
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
ISSN: 0196-2892
Year: 2025
Volume: 63
Impact Factor: 7.500 (JCR@2023)
ESI Highly Cited Papers on the List: 0