Thank you for the wonderful work on Mono-InternVL! I have a few questions and would be very grateful for your answers.
Firstly, regarding the convolutional architecture: can the dynamic resolution strategy used in the InternVL series [1] improve performance? I have observed that directly increasing the input resolution without dynamic resolution, as in ConvLLaVA, keeps token growth slow while performance remains normal. However, other encoder-free models such as Mono-InternVL [2] and HoVLE [3] do employ dynamic resolution. In your opinion, should encoder-free models use dynamic resolution?
Secondly, in both encoder-free and encoder-based models, the attention maps in the first few layers typically show relatively weak interaction between user prompt tokens and vision tokens [2]. Do you think this is a key factor limiting the performance of encoder-free models?
[1] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
[2] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
[3] HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding