Conversation

@yangwang201911 yangwang201911 commented Nov 28, 2025

This is a clean rebase of PR #2714 to simplify the review process.

  • Implement conditional visual token pruning for Qwen-VL models.
    -- Paper: CDPruner (arXiv)
    -- Code: GitHub Repository
  • Add configurations to benchmark.py and WWB tools

Tickets: CVS-173220
Related PRs:

@github-actions github-actions bot added labels on Nov 28, 2025: category: visual language (Visual language pipeline), category: continuous batching (Continuous batching), category: sampling (Sampling / Decoding algorithms), category: cmake / build (Cmake scripts), category: Python API (Python API for GenAI), category: CPP API (Changes in GenAI C++ public headers), no-match-files, category: GGUF (GGUF file reader)
- Implement CDPruner interface with FastDPP algorithm
- Add OpenCL acceleration for GPU processing
- Support multi-frame video pruning with chunking
- Add comprehensive performance optimizations
- Integrate with InputsEmbedder pipeline
- Add configuration parameters and error handling
- Include comprehensive testing and documentation
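As a reference point for reviewers, the FastDPP greedy selection named above can be sketched as follows. This is a simplified CPU sketch of fast greedy MAP inference for a DPP (the algorithm CDPruner builds on), not the PR's actual implementation; the variable names (`di2s`, `cis`, `eis`) mirror the notation used in the quoted snippets below, but the function signature is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Greedy MAP inference for a DPP: repeatedly pick the item with the largest
// remaining marginal gain, then shrink every other item's gain by its squared
// component along the chosen item's (orthogonalized) kernel row.
// kernel is a row-major n*n PSD similarity matrix; returns k selected indices.
std::vector<size_t> fast_greedy_dpp(const std::vector<float>& kernel, size_t n, size_t k) {
    std::vector<float> di2s(n);                // remaining squared marginal gains
    for (size_t i = 0; i < n; ++i) di2s[i] = kernel[i * n + i];
    std::vector<std::vector<float>> cis;       // orthogonalized rows of chosen items
    std::vector<size_t> selected;
    while (selected.size() < k) {
        size_t j = 0;                          // argmax of remaining gains
        for (size_t i = 1; i < n; ++i) if (di2s[i] > di2s[j]) j = i;
        float dj = std::sqrt(di2s[j]);
        std::vector<float> ci(n);
        for (size_t i = 0; i < n; ++i) {
            float dot = 0.f;
            for (const auto& c : cis) dot += c[j] * c[i];
            float eis = (kernel[j * n + i] - dot) / dj;  // orthogonal component
            ci[i] = eis;
            di2s[i] -= eis * eis;              // shrink remaining gains
        }
        di2s[j] = -std::numeric_limits<float>::infinity();  // never reselect j
        cis.push_back(std::move(ci));
        selected.push_back(j);
    }
    return selected;
}
```

With a kernel where items 0 and 1 are identical and item 2 is orthogonal, selecting two items picks one duplicate and then the orthogonal item, which is the diversity behavior the pruner relies on.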
@yangwang201911 yangwang201911 force-pushed the ywang2/vlm-cdpruner-clean branch from 09ad8c9 to bf3ea74 on November 28, 2025 07:26
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings December 1, 2025 01:04
Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.


}

// Sort final result to maintain order
std::sort(merged_selection.begin(), merged_selection.end());
Copilot AI Dec 1, 2025
The merged selection is sorted after combining first and second half results. Since both halves are already sorted from DPP selection, consider using std::merge instead of std::sort for O(n) complexity instead of O(n log n).

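The `std::merge` suggestion can be sketched like this (a minimal illustration; `first_half` and `second_half` stand in for the two sorted per-half DPP selections, with second-half indices already offset by `split_point`):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Merge two already-sorted index lists in O(n) instead of sorting the
// concatenation in O(n log n).
std::vector<int> merge_sorted_selections(const std::vector<int>& first_half,
                                         const std::vector<int>& second_half) {
    std::vector<int> merged;
    merged.reserve(first_half.size() + second_half.size());
    std::merge(first_half.begin(), first_half.end(),
               second_half.begin(), second_half.end(),
               std::back_inserter(merged));
    return merged;
}
```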
Comment on lines +38 to +40
/// @brief Whether to apply negative mean for relevance calculation
/// This is needed for CLIP-based models (like LLaVA) due to counterintuitive similarity values
bool use_negative_relevance = false;
Contributor

Does this PR support CLIP-based models?

Contributor Author

This PR currently supports Qwen2-VL only. CDPruner execution depends on model-specific interface functions to extract text and visual token features. For Qwen2-VL, this parameter should be set to false.

Support for other models (e.g., LLaVA) will be implemented in follow-up PRs, where this parameter will be configured according to each model's specific requirements.

std::vector<int64_t> instruction_tokens =
extract_instruction_tokens(input_ids, image_pad_token_id, vision_start_token_id, vision_end_token_id);

if (instruction_tokens.empty()) {
Contributor

Does it mean that CDPruner focuses only on image diversity if there is no text prompt?

size_t split_point) {
// Distribute tokens to keep between both halves
size_t tokens_first_half = num_tokens_to_keep / 2;
size_t tokens_second_half = num_tokens_to_keep - tokens_first_half;
@l-bat l-bat (Contributor) Dec 1, 2025

Document the semantics of the split DPP variant (two independent calls + merge is not equivalent to full-kernel DPP). Right now a user might assume it is equivalent to running DPP on the full kernel, which it isn't: in a single DPP call the algorithm may select all top-K tokens from the second half, while the split variant (in the CPU version) forces an equal number of tokens to be taken from each half. This constraint can change the selection set and potentially degrade accuracy.

Have you checked the accuracy impact of this split strategy compared to running DPP on the full kernel?


// Sort final result to maintain order
std::sort(merged_selection.begin(), merged_selection.end());
merged_selection.erase(std::unique(merged_selection.begin(), merged_selection.end()), merged_selection.end());
Contributor

This suggests we expect duplicates, which is suspicious for a DPP greedy selection

Contributor Author

Fixed the selection bug in the OpenCL DPP version.

Comment on lines 352 to 359
// convert split selected_batches to merged_selection
for (size_t idx = 0; idx < selected_tokens.size(); ++idx) {
if (idx < num_tokens_to_keep / 2) {
merged_selection.push_back(selected_tokens[idx]);
} else {
merged_selection.push_back(selected_tokens[idx] + split_point);
}
}
Contributor

The mapping from selected_tokens to merged_selection looks incorrect, because it is based on the position in the vector (idx) instead of the value of the selected index (selected_tokens[idx]). This assumes that the OpenCL select returns the first num_tokens_to_keep / 2 elements from the first half and the rest from the second half (which is only true when doing two separate calls as in the CPU version). With the merged kernel, the DPP selector is free to interleave indices from both halves in any order or even return all top-num_tokens_to_keep indices from one half.

@yangwang201911 (Contributor Author) Dec 3, 2025

If I understand your concern correctly, the position-based mapping works correctly because the OpenCL kernel processes each batch in independent work-groups, ensuring:

  1. Batch 0 writes to output_ids[0...selected_token_num-1]
  2. Batch 1 writes to output_ids[selected_token_num...end]

This is guaranteed by the pointer offset calculation in the OpenCL kernel:

__global int* output_ids_data = output_ids + batch_idx * selected_token_num;
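Given that per-batch output layout, the position-based mapping can be checked with a small host-side sketch (the function name and the numbers in the test are illustrative, not from the PR):

```cpp
#include <cstddef>
#include <vector>

// Rebuild global token indices from the kernel's per-batch local indices,
// assuming batch b writes its results to output_ids[b * selected_token_num ..],
// i.e. first-half selections come first, then second-half selections.
std::vector<int> map_to_global(const std::vector<int>& selected_tokens,
                               std::size_t num_tokens_to_keep,
                               std::size_t split_point) {
    std::vector<int> merged_selection;
    merged_selection.reserve(selected_tokens.size());
    for (std::size_t idx = 0; idx < selected_tokens.size(); ++idx) {
        if (idx < num_tokens_to_keep / 2)
            merged_selection.push_back(selected_tokens[idx]);  // first half: local == global
        else
            merged_selection.push_back(selected_tokens[idx] +
                                       static_cast<int>(split_point));  // second half: offset
    }
    return merged_selection;
}
```

Note this only holds while the two halves are processed by separate work-groups; a merged kernel that interleaves indices would break the assumption, as the reviewer points out.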

Copilot AI review requested due to automatic review settings December 4, 2025 09:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

tests/cpp/test_cdpruner_dpp.cpp:1

// Copyright (C) 2023-2025 Intel Corporation

  • Missing documentation comment for the cleanup_opencl() private method. Add a brief description of what this method does.


Comment on lines +227 to +229
# Force AVX2 only (disable AVX512)
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-mavx2" "-mno-avx512f")
message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ}")
Copilot AI Dec 4, 2025

[nitpick] The compiler flags -mavx2 and -mno-avx512f force AVX2 and disable AVX-512, which may reduce performance on newer processors that support AVX-512. Consider using -march=native or conditionally enabling AVX-512 based on target architecture to allow the compiler to optimize for the actual CPU capabilities.

Suggested change
# Force AVX2 only (disable AVX512)
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-mavx2" "-mno-avx512f")
message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ}")
# Use native CPU optimization for SIMD instructions
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-march=native")
message(STATUS "Enable SIMD compilation for ${TARGET_NAME_OBJ} with -march=native")

Comment on lines +433 to +437
GENAI_WARN("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
#elif defined(__SSE2__)
GENAI_WARN("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
#else
GENAI_WARN("[CDPruner] Using scalar operations (no SIMD acceleration)");
Copilot AI Dec 4, 2025

Using GENAI_WARN for informational messages about SIMD instruction set usage is misleading. These are informational messages, not warnings. Use GENAI_INFO or GENAI_DEBUG instead.

Suggested change
GENAI_WARN("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
#elif defined(__SSE2__)
GENAI_WARN("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
#else
GENAI_WARN("[CDPruner] Using scalar operations (no SIMD acceleration)");
GENAI_INFO("[CDPruner] Using AVX SIMD instructions for vector operations (8 floats/operation)");
#elif defined(__SSE2__)
GENAI_INFO("[CDPruner] Using SSE2 SIMD instructions for vector operations (4 floats/operation)");
#else
GENAI_INFO("[CDPruner] Using scalar operations (no SIMD acceleration)");

ov::Tensor tmp_embeds = m_embedding->infer(req, input_ids);

// Deep-copy necessary: Returned InferRequest's internal memory will be reused in
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
Copilot AI Dec 4, 2025

Corrected spelling of 'cdpruner' to 'pruning' in the comment describing the function name.

Suggested change
// extract_text_features_for_cdpruner() that acquires a request from the same queue.
// extract_text_features_for_pruning() that acquires a request from the same queue.

float eis_j = cis_data[cis_idx];
// Subtract the squared orthogonal component
if (std::isnan(eis_j)) {
di2s_data[j] = -std::numeric_limits<float>::max();
Copilot AI Dec 4, 2025

The special handling of NaN values by setting to -max() instead of -infinity() is inconsistent with line 417 which uses -infinity() for selected tokens. This inconsistency could lead to unexpected behavior. Consider documenting why different sentinel values are used or unifying the approach.

Suggested change
di2s_data[j] = -std::numeric_limits<float>::max();
di2s_data[j] = -std::numeric_limits<float>::infinity();

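A small check of the two sentinels' behavior under argmax (illustrative only; `argmax_score` is a hypothetical helper, not from the PR):

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Index of the best-scoring candidate. Both -FLT_MAX and -infinity lose every
// comparison against finite scores, so either works as an "excluded" marker
// for argmax purposes. They are not equal to each other, though, so any
// equality check against one sentinel misses slots marked with the other,
// which is why mixing them is risky.
std::size_t argmax_score(const std::vector<float>& scores) {
    return static_cast<std::size_t>(
        std::max_element(scores.begin(), scores.end()) - scores.begin());
}
```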

// Handle single aggregated vector case
OPENVINO_ASSERT(kept_indices_per_image.size() == 1 && region_count > 1,
"Kept token indices layout does not match vision regions");
Copilot AI Dec 4, 2025

The assertion message 'Kept token indices layout does not match vision regions' doesn't clearly indicate what the expected layout should be. Consider enhancing the message to explain the expected relationship between kept_indices_per_image.size() and region_count.

Suggested change
"Kept token indices layout does not match vision regions");
"Kept token indices layout does not match vision regions: expected kept_indices_per_image.size() == region_count (" +
std::to_string(region_count) +
") or kept_indices_per_image.size() == 1, but got kept_indices_per_image.size() == " +
std::to_string(kept_indices_per_image.size()) + ".");

@l-bat l-bat (Contributor) left a comment

Could you please validate the algorithm on https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/visual_language_chat/milebench_eval_vlm.py and attach accuracy results for:

  1. Original FP model
  2. CDPruner: CPU single pass
  3. CDPruner: CPU split version
  4. CDPruner: OpenCL single pass
  5. CDPruner: OpenCL split version

You should modify the CB config (add CDPruner) and run:

python milebench_eval_vlm.py --model_dir converted_qwen_model_dir --subset DocVQA --data_dir milebench_data 
