Conversation

@yangwang201911 yangwang201911 commented Nov 28, 2025

This is a clean rebase of PR #2714 to simplify the review process.

  • Implement conditional visual token pruning for Qwen-VL models.
    -- Paper: CDPruner (arXiv)
    -- Code: GitHub Repository
  • Add configurations to benchmark.py and WWB tools

Tickets: CVS-173220
Related PRs:

@github-actions github-actions bot added labels on Nov 28, 2025: category: visual language (Visual language pipeline), category: continuous batching (Continuous batching), category: sampling (Sampling / Decoding algorithms), category: cmake / build (Cmake scripts), category: Python API (Python API for GenAI), category: CPP API (Changes in GenAI C++ public headers), no-match-files, category: GGUF (GGUF file reader)
- Implement CDPruner interface with FastDPP algorithm
- Add OpenCL acceleration for GPU processing
- Support multi-frame video pruning with chunking
- Add comprehensive performance optimizations
- Integrate with InputsEmbedder pipeline
- Add configuration parameters and error handling
- Include comprehensive testing and documentation
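As a reference point for reviewers, the FastDPP greedy selection named above can be sketched as follows. This is a simplified CPU sketch of fast greedy MAP inference for a DPP (the algorithm CDPruner builds on), not the PR's actual implementation; the variable names (`di2s`, `cis`, `eis`) mirror the notation used in the quoted snippets below, but the function signature is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Greedy MAP inference for a DPP: repeatedly pick the item with the largest
// remaining marginal gain, then shrink every other item's gain by its squared
// component along the chosen item's (orthogonalized) kernel row.
// kernel is a row-major n*n PSD similarity matrix; returns k selected indices.
std::vector<size_t> fast_greedy_dpp(const std::vector<float>& kernel, size_t n, size_t k) {
    std::vector<float> di2s(n);                // remaining squared marginal gains
    for (size_t i = 0; i < n; ++i) di2s[i] = kernel[i * n + i];
    std::vector<std::vector<float>> cis;       // orthogonalized rows of chosen items
    std::vector<size_t> selected;
    while (selected.size() < k) {
        size_t j = 0;                          // argmax of remaining gains
        for (size_t i = 1; i < n; ++i) if (di2s[i] > di2s[j]) j = i;
        float dj = std::sqrt(di2s[j]);
        std::vector<float> ci(n);
        for (size_t i = 0; i < n; ++i) {
            float dot = 0.f;
            for (const auto& c : cis) dot += c[j] * c[i];
            float eis = (kernel[j * n + i] - dot) / dj;  // orthogonal component
            ci[i] = eis;
            di2s[i] -= eis * eis;              // shrink remaining gains
        }
        di2s[j] = -std::numeric_limits<float>::infinity();  // never reselect j
        cis.push_back(std::move(ci));
        selected.push_back(j);
    }
    return selected;
}
```

With a kernel where items 0 and 1 are identical and item 2 is orthogonal, selecting two items picks one duplicate and then the orthogonal item, which is the diversity behavior the pruner relies on.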
@yangwang201911 yangwang201911 force-pushed the ywang2/vlm-cdpruner-clean branch from 09ad8c9 to bf3ea74 on November 28, 2025 07:26
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings December 1, 2025 01:04
Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.


}

// Sort final result to maintain order
std::sort(merged_selection.begin(), merged_selection.end());
Copilot AI Dec 1, 2025
The merged selection is sorted after combining first and second half results. Since both halves are already sorted from DPP selection, consider using std::merge instead of std::sort for O(n) complexity instead of O(n log n).

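The `std::merge` suggestion can be sketched like this (a minimal illustration; `first_half` and `second_half` stand in for the two sorted per-half DPP selections, with second-half indices already offset by `split_point`):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Merge two already-sorted index lists in O(n) instead of sorting the
// concatenation in O(n log n).
std::vector<int> merge_sorted_selections(const std::vector<int>& first_half,
                                         const std::vector<int>& second_half) {
    std::vector<int> merged;
    merged.reserve(first_half.size() + second_half.size());
    std::merge(first_half.begin(), first_half.end(),
               second_half.begin(), second_half.end(),
               std::back_inserter(merged));
    return merged;
}
```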
Comment on lines +38 to +40
/// @brief Whether to apply negative mean for relevance calculation
/// This is needed for CLIP-based models (like LLaVA) due to counterintuitive similarity values
bool use_negative_relevance = false;
Contributor

Does this PR support CLIP-based models?

Contributor Author

This PR currently supports Qwen2-VL only. CDPruner execution depends on model-specific interface functions to extract text and visual token features. For Qwen2-VL, this parameter should be set to false.

Support for other models (e.g., LLaVA) will be implemented in follow-up PRs, where this parameter will be configured according to each model's specific requirements.

std::vector<int64_t> instruction_tokens =
extract_instruction_tokens(input_ids, image_pad_token_id, vision_start_token_id, vision_end_token_id);

if (instruction_tokens.empty()) {
Contributor

Does it mean that CDPruner focuses only on image diversity if there is no text prompt?

size_t split_point) {
// Distribute tokens to keep between both halves
size_t tokens_first_half = num_tokens_to_keep / 2;
size_t tokens_second_half = num_tokens_to_keep - tokens_first_half;
@l-bat l-bat (Contributor) Dec 1, 2025

Document the semantics of the split DPP variant (two independent calls + merge is not equivalent to full-kernel DPP). Right now a user might assume it is equivalent to running DPP on the full kernel, which it isn't: in a single DPP call the algorithm may select all top-K tokens from the second half, while the split variant (in the CPU version) forces an equal number of tokens to be taken from each half. This constraint can change the selection set and potentially degrade accuracy.

Have you checked the accuracy impact of this split strategy compared to running DPP on the full kernel?


// Sort final result to maintain order
std::sort(merged_selection.begin(), merged_selection.end());
merged_selection.erase(std::unique(merged_selection.begin(), merged_selection.end()), merged_selection.end());
Contributor

This suggests we expect duplicates, which is suspicious for a DPP greedy selection

Contributor Author

Fixed the selection bug in the OpenCL DPP version.

Comment on lines 352 to 359
// convert split selected_batches to merged_selection
for (size_t idx = 0; idx < selected_tokens.size(); ++idx) {
if (idx < num_tokens_to_keep / 2) {
merged_selection.push_back(selected_tokens[idx]);
} else {
merged_selection.push_back(selected_tokens[idx] + split_point);
}
}
Contributor

The mapping from selected_tokens to merged_selection looks incorrect, because it is based on the position in the vector (idx) instead of the value of the selected index (selected_tokens[idx]). This assumes that the OpenCL select returns the first num_tokens_to_keep / 2 elements from the first half and the rest from the second half (which is only true when doing two separate calls as in the CPU version). With the merged kernel, the DPP selector is free to interleave indices from both halves in any order or even return all top-num_tokens_to_keep indices from one half.

@yangwang201911 (Contributor Author) Dec 3, 2025

If I understand your concern correctly, the position-based mapping works correctly because the OpenCL kernel processes each batch in independent work-groups, ensuring:

  1. Batch 0 writes to output_ids[0...selected_token_num-1]
  2. Batch 1 writes to output_ids[selected_token_num...end]

This is guaranteed by the pointer offset calculation in the OpenCL kernel:

__global int* output_ids_data = output_ids + batch_idx * selected_token_num;
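Given that per-batch output layout, the position-based mapping can be checked with a small host-side sketch (the function name and the numbers in the test are illustrative, not from the PR):

```cpp
#include <cstddef>
#include <vector>

// Rebuild global token indices from the kernel's per-batch local indices,
// assuming batch b writes its results to output_ids[b * selected_token_num ..],
// i.e. first-half selections come first, then second-half selections.
std::vector<int> map_to_global(const std::vector<int>& selected_tokens,
                               std::size_t num_tokens_to_keep,
                               std::size_t split_point) {
    std::vector<int> merged_selection;
    merged_selection.reserve(selected_tokens.size());
    for (std::size_t idx = 0; idx < selected_tokens.size(); ++idx) {
        if (idx < num_tokens_to_keep / 2)
            merged_selection.push_back(selected_tokens[idx]);  // first half: local == global
        else
            merged_selection.push_back(selected_tokens[idx] +
                                       static_cast<int>(split_point));  // second half: offset
    }
    return merged_selection;
}
```

Note this only holds while the two halves are processed by separate work-groups; a merged kernel that interleaves indices would break the assumption, as the reviewer points out.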

Copilot AI review requested due to automatic review settings December 4, 2025 09:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

tests/cpp/test_cdpruner_dpp.cpp:1

// Copyright (C) 2023-2025 Intel Corporation

  • Missing documentation comment for the cleanup_opencl() private method. Add a brief description of what this method does.


Comment on lines +227 to +229
# Force AVX2 only (disable AVX512)
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-mavx2" "-mno-avx512f")
message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ}")
Copilot AI Dec 4, 2025

[nitpick] The compiler flags -mavx2 and -mno-avx512f force AVX2 and disable AVX-512, which may reduce performance on newer processors that support AVX-512. Consider using -march=native or conditionally enabling AVX-512 based on target architecture to allow the compiler to optimize for the actual CPU capabilities.

Suggested change
# Force AVX2 only (disable AVX512)
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-mavx2" "-mno-avx512f")
message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ}")
# Use native CPU optimization for SIMD instructions
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-march=native")
message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ} with -march=native")

Comment on lines +433 to +437
GENAI_WARN("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
#elif defined(__SSE2__)
GENAI_WARN("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
#else
GENAI_WARN("[CDPruner] Using scalar operations (no SIMD acceleration)");
Copilot AI Dec 4, 2025

Using GENAI_WARN for informational messages about SIMD instruction set usage is misleading. These are informational messages, not warnings. Use GENAI_INFO or GENAI_DEBUG instead.

Suggested change
GENAI_WARN("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
#elif defined(__SSE2__)
GENAI_WARN("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
#else
GENAI_WARN("[CDPruner] Using scalar operations (no SIMD acceleration)");
GENAI_INFO("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
#elif defined(__SSE2__)
GENAI_INFO("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
#else
GENAI_INFO("[CDPruner] Using scalar operations (no SIMD acceleration)");

ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

// Deep-copy necessary: Returned InferRequest's internal memory will be reused in
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
Copilot AI Dec 4, 2025

Corrected spelling of 'cdpruner' to 'pruning' in the comment describing the function name.

Suggested change
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
// extract_text_features_for_pruning() that acquires a request from the same queue.

float eis_j = cis_data[cis_idx];
// Subtract the squared orthogonal component
if (std::isnan(eis_j)) {
di2s_data[j] = -std::numeric_limits<float>::max();
Copilot AI Dec 4, 2025

The special handling of NaN values by setting to -max() instead of -infinity() is inconsistent with line 417 which uses -infinity() for selected tokens. This inconsistency could lead to unexpected behavior. Consider documenting why different sentinel values are used or unifying the approach.

Suggested change
di2s_data[j] = -std::numeric_limits<float>::max();
di2s_data[j] = -std::numeric_limits<float>::infinity();

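A small check of the two sentinels' behavior under argmax (illustrative only; `argmax_score` is a hypothetical helper, not from the PR):

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Index of the best-scoring candidate. Both -FLT_MAX and -infinity lose every
// comparison against finite scores, so either works as an "excluded" marker
// for argmax purposes. They are not equal to each other, though, so any
// equality check against one sentinel misses slots marked with the other,
// which is why mixing them is risky.
std::size_t argmax_score(const std::vector<float>& scores) {
    return static_cast<std::size_t>(
        std::max_element(scores.begin(), scores.end()) - scores.begin());
}
```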

// Handle single aggregated vector case
OPENVINO_ASSERT(kept_indices_per_image.size() == 1 && region_count > 1,
"Kept token indices layout does not match vision regions");
Copilot AI Dec 4, 2025

The assertion message 'Kept token indices layout does not match vision regions' doesn't clearly indicate what the expected layout should be. Consider enhancing the message to explain the expected relationship between kept_indices_per_image.size() and region_count.

Suggested change
"Kept token indices layout does not match vision regions");
"Kept token indices layout does not match vision regions: expected kept_indices_per_image.size() == region_count (" +
std::to_string(region_count) +
") or kept_indices_per_image.size() == 1, but got kept_indices_per_image.size() == " +
std::to_string(kept_indices_per_image.size()) + ".");

@l-bat l-bat (Contributor) left a comment

Could you please validate the algorithm on https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/visual_language_chat/milebench_eval_vlm.py and attach accuracy results for:

  1. Original FP model
  2. CDPruner: CPU single pass
  3. CDPruner: CPU split version
  4. CDPruner: OpenCL single pass
  5. CDPruner: OpenCL split version

You should modify the CB config (add CDPruner) and run:

python milebench_eval_vlm.py --model_dir converted_qwen_model_dir --subset DocVQA --data_dir milebench_data 
