Abstract:
Multimodal prompt learning has emerged as an effective strategy for adapting vision-language models such as CLIP to downstream tasks. However, conventional approaches typically operate at the input level, forcing learned prompts to propagate through a sequence of frozen Transformer layers. This indirect adaptation introduces cumulative geometric distortions, a limitation that we formalize as the indirect learning dilemma (ILD), leading to overfitting on base classes and reduced generalization to novel classes. To overcome this challenge, we propose the Multimodal Self-Attention Prompt (MSP) framework, which shifts adaptation into the semantic core of the model by injecting learnable prompts directly into the key and value sequences of attention blocks. This direct modulation preserves the pretrained embedding geometry while enabling more precise downstream adaptation. MSP further incorporates distance-aware optimization to maintain semantic consistency with CLIP's original representation space, and partial prompt learning via stochastic dimension masking to improve robustness and prevent over-specialization. Extensive evaluations across 11 benchmarks demonstrate the effectiveness of MSP. It achieves a state-of-the-art harmonic mean accuracy of 80.67%, with 77.32% accuracy on novel classes (a 2.18% absolute improvement over prior methods), while requiring only 0.11M learnable parameters. Notably, MSP surpasses CLIP's zero-shot performance on 10 out of 11 datasets, establishing a new paradigm for efficient and generalizable prompt-based adaptation.
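The abstract describes injecting learnable prompts directly into the key and value sequences of frozen attention blocks, combined with stochastic dimension masking. The PyTorch sketch below is a rough illustration of that idea under stated assumptions, not the authors' implementation: the class name, prompt counts, dimensions, and the particular masking scheme are hypothetical choices made for clarity.

```python
import torch
import torch.nn as nn


class KVPromptedAttention(nn.Module):
    """Illustrative multi-head self-attention with learnable prompts
    concatenated to the key/value sequences. The base projection weights
    stand in for a frozen pretrained (e.g. CLIP) attention block; only the
    prompt parameters are trainable. Sizes and names are assumptions."""

    def __init__(self, dim=512, num_heads=8, num_prompts=4, mask_ratio=0.5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.mask_ratio = mask_ratio

        # Frozen "pretrained" projections: gradients are disabled.
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        for p in (*self.qkv.parameters(), *self.proj.parameters()):
            p.requires_grad = False

        # Learnable prompt tokens injected into the key and value sequences.
        self.k_prompt = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.v_prompt = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Randomly mask a subset of prompt dimensions during training:
        # a simplified stand-in for "partial prompt learning" via
        # stochastic dimension masking mentioned in the abstract.
        k_p, v_p = self.k_prompt, self.v_prompt
        if self.training and self.mask_ratio > 0:
            keep = (torch.rand(C, device=x.device) > self.mask_ratio).float()
            k_p, v_p = k_p * keep, v_p * keep

        # Concatenate prompts to keys and values only; queries and the
        # frozen projections are untouched, so the prompts modulate
        # attention rather than rewriting the input token sequence.
        k = torch.cat([k_p.expand(B, -1, -1), k], dim=1)
        v = torch.cat([v_p.expand(B, -1, -1), v], dim=1)

        def split(t):
            return t.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the queries and the pretrained weights remain frozen, the prompts only add extra keys and values for the existing tokens to attend to, which is consistent with the abstract's claim that direct key/value injection preserves the pretrained embedding geometry.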
Source:
IEEE TRANSACTIONS ON IMAGE PROCESSING
ISSN: 1057-7149
Year: 2025
Volume: 34
Page: 5978-5988
Impact Factor: 10.800 (JCR@2023)