Conditional visual token pruning for QWen-VL models. #3084
base: master
Conversation
- Implement CDPruner interface with FastDPP algorithm
- Add OpenCL acceleration for GPU processing
- Support multi-frame video pruning with chunking
- Add comprehensive performance optimizations
- Integrate with InputsEmbedder pipeline
- Add configuration parameters and error handling
- Include comprehensive testing and documentation
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.
```cpp
}

// Sort final result to maintain order
std::sort(merged_selection.begin(), merged_selection.end());
```
Copilot AI · Dec 1, 2025
The merged selection is sorted after combining first and second half results. Since both halves are already sorted from DPP selection, consider using std::merge instead of std::sort for O(n) complexity instead of O(n log n).
```cpp
/// @brief Whether to apply negative mean for relevance calculation
/// This is needed for CLIP-based models (like LLaVA) due to counterintuitive similarity values
bool use_negative_relevance = false;
```
Does this PR support CLIP-based models?
This PR currently supports Qwen2-VL only. CDPruner execution depends on model-specific interface functions to extract text and visual token features. For Qwen2-VL, this parameter should be set to false.
Support for other models (e.g., LLaVA) will be implemented in follow-up PRs, where this parameter will be configured according to each model's specific requirements.
```cpp
std::vector<int64_t> instruction_tokens =
    extract_instruction_tokens(input_ids, image_pad_token_id, vision_start_token_id, vision_end_token_id);

if (instruction_tokens.empty()) {
```
Does it mean that CDPruner focuses only on image diversity if there is no text prompt?
```cpp
                          size_t split_point) {
    // Distribute tokens to keep between both halves
    size_t tokens_first_half = num_tokens_to_keep / 2;
    size_t tokens_second_half = num_tokens_to_keep - tokens_first_half;
```
Document the semantics of the split DPP variant (two independent calls + merge != full-kernel DPP).
Right now a user might assume it’s equivalent to running DPP on the full kernel, which it isn’t.
In a single DPP call, the algorithm may select all top-K tokens from the second half, while in the split variant we force an equal number of tokens to be taken from each half (in the CPU version). This constraint can change the selection set and potentially degrade accuracy.
Have you checked the accuracy impact of this split strategy compared to running DPP on the full kernel?
```cpp
// Sort final result to maintain order
std::sort(merged_selection.begin(), merged_selection.end());
merged_selection.erase(std::unique(merged_selection.begin(), merged_selection.end()), merged_selection.end());
```
This suggests we expect duplicates, which is suspicious for a DPP greedy selection.
Fixed the selection bug in the OpenCL DPP version.
```cpp
// convert splited selected_batches to merged_selection
for (size_t idx = 0; idx < selected_tokens.size(); ++idx) {
    if (idx < num_tokens_to_keep / 2) {
        merged_selection.push_back(selected_tokens[idx]);
    } else {
        merged_selection.push_back(selected_tokens[idx] + split_point);
    }
}
```
The mapping from selected_tokens to merged_selection looks incorrect, because it is based on the position in the vector (idx) instead of the value of the selected index (selected_tokens[idx]). This assumes that the OpenCL select returns the first num_tokens_to_keep / 2 elements from the first half and the rest from the second half (which is only true when doing two separate calls as in the CPU version). With the merged kernel, the DPP selector is free to interleave indices from both halves in any order or even return all top-num_tokens_to_keep indices from one half.
If I understand your concern correctly, the position-based mapping works correctly because the OpenCL kernel processes each batch in independent work-groups, ensuring:
- Batch 0 writes to output_ids[0...selected_token_num-1]
- Batch 1 writes to output_ids[selected_token_num...end]

This is guaranteed by the pointer offset calculation in the OpenCL kernel:

```c
__global int* output_ids_data = output_ids + batch_idx * selected_token_num;
```
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
tests/cpp/test_cdpruner_dpp.cpp:1
- Missing documentation comment for the cleanup_opencl() private method. Add a brief description of what this method does.

```cpp
// Copyright (C) 2023-2025 Intel Corporation
```
```cmake
# Force AVX2 only (disable AVX512)
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-mavx2" "-mno-avx512f")
message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ}")
```
Copilot AI · Dec 4, 2025
[nitpick] The compiler flags -mavx2 and -mno-avx512f force AVX2 and disable AVX-512, which may reduce performance on newer processors that support AVX-512. Consider using -march=native or conditionally enabling AVX-512 based on target architecture to allow the compiler to optimize for the actual CPU capabilities.
Suggested change:

```diff
-# Force AVX2 only (disable AVX512)
-target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-mavx2" "-mno-avx512f")
-message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ}")
+# Use native CPU optimization for SIMD instructions
+target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-march=native")
+message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ} with -march=native")
```
```cpp
    GENAI_WARN("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
#elif defined(__SSE2__)
    GENAI_WARN("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
#else
    GENAI_WARN("[CDPruner] Using scalar operations (no SIMD acceleration)");
```
Copilot AI · Dec 4, 2025
Using GENAI_WARN for informational messages about SIMD instruction set usage is misleading. These are informational messages, not warnings. Use GENAI_INFO or GENAI_DEBUG instead.
Suggested change:

```diff
-    GENAI_WARN("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
+    GENAI_INFO("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
 #elif defined(__SSE2__)
-    GENAI_WARN("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
+    GENAI_INFO("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
 #else
-    GENAI_WARN("[CDPruner] Using scalar operations (no SIMD acceleration)");
+    GENAI_INFO("[CDPruner] Using scalar operations (no SIMD acceleration)");
```
```cpp
ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

// Deep-copy necessary: Returned InferRequest's internal memory will be reused in
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
```
Copilot AI · Dec 4, 2025
Corrected spelling of 'cdpruner' to 'pruning' in the comment describing the function name.
```diff
-// extract_text_features_for_cdpruner() that acquires a request from the same queue.
+// extract_text_features_for_pruning() that acquires a request from the same queue.
```
```cpp
float eis_j = cis_data[cis_idx];
// Subtract the squared orthogonal component
if (std::isnan(eis_j)) {
    di2s_data[j] = -std::numeric_limits<float>::max();
```
Copilot AI · Dec 4, 2025
The special handling of NaN values by setting to -max() instead of -infinity() is inconsistent with line 417 which uses -infinity() for selected tokens. This inconsistency could lead to unexpected behavior. Consider documenting why different sentinel values are used or unifying the approach.
Suggested change:

```diff
-    di2s_data[j] = -std::numeric_limits<float>::max();
+    di2s_data[j] = -std::numeric_limits<float>::infinity();
```
```cpp
// Handle single aggregated vector case
OPENVINO_ASSERT(kept_indices_per_image.size() == 1 && region_count > 1,
                "Kept token indices layout does not match vision regions");
```
Copilot AI · Dec 4, 2025
The assertion message 'Kept token indices layout does not match vision regions' doesn't clearly indicate what the expected layout should be. Consider enhancing the message to explain the expected relationship between kept_indices_per_image.size() and region_count.
Suggested change:

```diff
-                "Kept token indices layout does not match vision regions");
+                "Kept token indices layout does not match vision regions: expected kept_indices_per_image.size() == region_count (" +
+                std::to_string(region_count) +
+                ") or kept_indices_per_image.size() == 1, but got kept_indices_per_image.size() == " +
+                std::to_string(kept_indices_per_image.size()) + ".");
```
l-bat left a comment
Could you please validate algorithm on https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/visual_language_chat/milebench_eval_vlm.py and attach accuracy results for:
- Origin FP model
- CDPruner: CPU single pass
- CDPruner: CPU split version
- CDPruner: OpenCL single pass
- CDPruner: OpenCL split version
You should modify CB config (add CDPruner) and run:
```shell
python milebench_eval_vlm.py --model_dir converted_qwen_model_dir --subset DocVQA --data_dir milebench_data
```
This is a clean rebase of PR#2714 to simplify the review process.
- Paper: CDPruner (arXiv)
- Code: GitHub Repository
Tickets: CVS-173220
Related PRs: