Abstract:
Multimodal prompt learning has emerged as an effective strategy for adapting vision-language models such as CLIP to downstream tasks. However, conventional approaches typically operate at the input level, forcing learned prompts to propagate through a sequence of frozen Transformer layers. This indirect adaptation introduces cumulative geometric distortions, a limitation that we formalize as the indirect learning dilemma (ILD), leading to overfitting on base classes and reduced generalization to novel classes. To overcome this challenge, we propose the Multimodal Self-Attention Prompt (MSP) framework, which shifts adaptation into the semantic core of the model by injecting learnable prompts directly into the key and value sequences of attention blocks. This direct modulation preserves the pretrained embedding geometry while enabling more precise downstream adaptation. MSP further incorporates distance-aware optimization to maintain semantic consistency with CLIP's original representation space, and partial prompt learning via stochastic dimension masking to improve robustness and prevent over-specialization. Extensive evaluations across 11 benchmarks demonstrate the effectiveness of MSP. It achieves a state-of-the-art harmonic mean accuracy of 80.67%, with 77.32% accuracy on novel classes (a 2.18% absolute improvement over prior methods), while requiring only 0.11M learnable parameters. Notably, MSP surpasses CLIP's zero-shot performance on 10 out of 11 datasets, establishing a new paradigm for efficient and generalizable prompt-based adaptation.
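The abstract describes injecting learnable prompts directly into the key and value sequences of frozen attention blocks, combined with stochastic dimension masking. The PyTorch sketch below is a rough illustration of that idea under stated assumptions, not the authors' implementation: the class name, prompt counts, dimensions, and the particular masking scheme are hypothetical choices made for clarity.

```python
import torch
import torch.nn as nn


class KVPromptedAttention(nn.Module):
    """Illustrative multi-head self-attention with learnable prompts
    concatenated to the key/value sequences. The base projection weights
    stand in for a frozen pretrained (e.g. CLIP) attention block; only the
    prompt parameters are trainable. Sizes and names are assumptions."""

    def __init__(self, dim=512, num_heads=8, num_prompts=4, mask_ratio=0.5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.mask_ratio = mask_ratio

        # Frozen "pretrained" projections: gradients are disabled.
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        for p in (*self.qkv.parameters(), *self.proj.parameters()):
            p.requires_grad = False

        # Learnable prompt tokens injected into the key and value sequences.
        self.k_prompt = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.v_prompt = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Randomly mask a subset of prompt dimensions during training:
        # a simplified stand-in for "partial prompt learning" via
        # stochastic dimension masking mentioned in the abstract.
        k_p, v_p = self.k_prompt, self.v_prompt
        if self.training and self.mask_ratio > 0:
            keep = (torch.rand(C, device=x.device) > self.mask_ratio).float()
            k_p, v_p = k_p * keep, v_p * keep

        # Concatenate prompts to keys and values only; queries and the
        # frozen projections are untouched, so the prompts modulate
        # attention rather than rewriting the input token sequence.
        k = torch.cat([k_p.expand(B, -1, -1), k], dim=1)
        v = torch.cat([v_p.expand(B, -1, -1), v], dim=1)

        def split(t):
            return t.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the queries and the pretrained weights remain frozen, the prompts only add extra keys and values for the existing tokens to attend to, which is consistent with the abstract's claim that direct key/value injection preserves the pretrained embedding geometry.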
Source:
IEEE TRANSACTIONS ON IMAGE PROCESSING
ISSN: 1057-7149
Year: 2025
Volume: 34
Page: 5978-5988
Impact Factor: 10.800 (JCR@2023)