Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Abstract

Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We also observe that (i) neither higher ImageNet accuracy nor a larger backbone reliably translates into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
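The frozen-backbone-plus-connector layout described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the dimensions, the two-layer ReLU connector, and the `frozen_backbone` stand-in (which simply averages each patch) are hypothetical. The sketch only shows how per-patch visual features might be projected into the language model's embedding space and concatenated with text embeddings.

```python
import random

# Illustrative sizes only; the paper's actual dimensions are not given here.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12


def init_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight, bias):
    """Compute y[j] = sum_i x[i] * weight[i][j] + bias[j] for one vector."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def frozen_backbone(image_patches):
    """Hypothetical stand-in for a frozen ViT or SSM encoder.

    Returns one VISION_DIM feature vector per patch. It has no trainable
    state, mirroring a backbone whose weights are never updated.
    """
    return [[sum(patch) / len(patch)] * VISION_DIM for patch in image_patches]


class Connector:
    """Assumed lightweight connector: a two-layer MLP with ReLU."""

    def __init__(self, rng):
        self.w1 = init_matrix(VISION_DIM, HIDDEN_DIM, rng)
        self.b1 = [0.0] * HIDDEN_DIM
        self.w2 = init_matrix(HIDDEN_DIM, LLM_DIM, rng)
        self.b2 = [0.0] * LLM_DIM

    def __call__(self, features):
        tokens = []
        for feature in features:
            hidden = [max(0.0, h) for h in linear(feature, self.w1, self.b1)]
            tokens.append(linear(hidden, self.w2, self.b2))
        return tokens


rng = random.Random(0)
patches = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]  # placeholder text tokens

visual_tokens = Connector(rng)(frozen_backbone(patches))
llm_input = visual_tokens + text_embeddings  # visual tokens precede the text
print(len(llm_input), len(llm_input[0]))
```

In this layout, swapping a ViT for an SSM encoder changes only `frozen_backbone`; the connector and the language model see the same interface, which is what makes a controlled comparison of backbones possible.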
