Feature request
A built-in way to cap how many tokens a reasoning model spends inside its `<think> … </think>` block. Today, we can only control the total response length via `max_new_tokens`. No parameter limits the internal reasoning segment when `enable_thinking=True`.
Motivation
- Reasoning models (e.g., Qwen3 series) often produce very long thought blocks, which can blow past latency budgets before the final answer starts.
- Users need a simple, model-agnostic control to bound that “thinking” cost without disabling reasoning entirely.
- The Qwen docs (https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html#thinking-budget) already describe a brute-force approach (two-step generation) to implement “thinking budgets”.
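The two-step flow the Qwen docs describe can be sketched as follows. This is a minimal stand-in, not the docs' actual code: `generate` is a placeholder for a call to `model.generate` that returns the full sequence (prompt plus new tokens), and the token-id lists abstract away the tokenizer.

```python
def generate_with_thinking_budget(generate, prompt_ids, budget,
                                  end_think_ids, answer_tokens):
    """Brute-force thinking budget via two generate calls.

    `generate(ids, max_new_tokens)` stands in for model.generate and is
    assumed to return prompt + newly generated token ids.
    """
    # Step 1: let the model think for at most `budget` new tokens.
    draft = generate(prompt_ids, budget)
    # If the budget ran out before the model emitted </think>,
    # close the thinking span manually.
    tail = draft[len(draft) - len(end_think_ids):]
    if tail != end_think_ids:
        draft = draft + end_think_ids
    # Step 2: continue from the closed thinking span to the final answer.
    return generate(draft, answer_tokens)
```

The drawback motivating this request is visible here: two full `generate` calls, with the prompt re-processed in between, rather than a single pass.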
Your contribution
I want to submit a PR that:
- Extends `GenerationConfig` with:
  - `max_thinking_tokens`: integer budget for reasoning tokens.
  - `begin_thinking_token_id` / `end_thinking_token_id`: marker IDs so generation knows where the thinking span begins/ends.
- Adds a `MaxThinkingTokensLogitsProcessor` that watches the active `<think>` block. Once the budget is reached, it forces `end_thinking_token_id`, ensuring the model exits reasoning and continues with the final response.
- Documents the new parameter in reasoning-model guides (EXAONE, CWM, etc.) and shows how to wire the thinking-token IDs until configs do it automatically.
- Provides unit coverage so `_get_logits_processor` injects the new processor whenever the config is fully specified.
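To make the proposed behavior concrete, here is a rough sketch of the processor logic. The class name and parameters come from this proposal, not from any existing transformers API, and it uses plain Python lists (a list of token ids and a flat list of per-token scores) rather than the batched torch tensors a real `LogitsProcessor` would receive:

```python
import math

class MaxThinkingTokensLogitsProcessor:
    """Sketch: force end-of-thinking once the token budget is spent."""

    def __init__(self, max_thinking_tokens, begin_thinking_token_id,
                 end_thinking_token_id):
        self.max_thinking_tokens = max_thinking_tokens
        self.begin_id = begin_thinking_token_id
        self.end_id = end_thinking_token_id

    def __call__(self, input_ids, scores):
        # Locate the most recent begin-of-thinking marker, if any.
        try:
            start = len(input_ids) - 1 - input_ids[::-1].index(self.begin_id)
        except ValueError:
            return scores  # no <think> span has started
        span = input_ids[start + 1:]
        if self.end_id in span:
            return scores  # thinking span already closed on its own
        if len(span) >= self.max_thinking_tokens:
            # Budget exhausted: mask everything except end_thinking_token_id,
            # so the next sampled token must close the span.
            scores = [-math.inf] * len(scores)
            scores[self.end_id] = 0.0
        return scores
```

Unlike the two-step workaround, this runs inside a single `generate` call, which is why it would be injected via `_get_logits_processor` whenever all three config fields are set.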