Thank you for the wonderful work on Mono-InternVL! I have a few questions and would be very grateful for your answers.
Firstly, regarding the convolutional architecture: can the dynamic resolution strategy used in the InternVL series [1] improve performance? I have observed that directly increasing the input resolution without dynamic resolution, as in ConvLLaVA, keeps token growth slow while performance remains normal. However, other encoder-free models such as Mono-InternVL [2] and HoVLE [3] do employ dynamic resolution. In your opinion, should encoder-free models use dynamic resolution?
Secondly, in both encoder-free and encoder-based models, the attention maps in the first few layers typically show relatively weak interaction between user prompt tokens and vision tokens [2]. Do you think this is a key factor limiting the performance of encoder-free models?
[1] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
[2] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
[3] HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding