
Add thinking-budget support (max_thinking_tokens) for reasoning-capable chat models #42111

@AndresAlgaba

Description

Feature request

A built-in way to cap how many tokens a reasoning model spends inside its <think> … </think> block. Today, we can only control the total response length via max_new_tokens. No parameter limits the internal reasoning segment when enable_thinking=True.

Motivation

  • Reasoning models (e.g., Qwen3 series) often produce very long thought blocks, which can blow past latency budgets before the final answer starts.
  • Users need a simple, model-agnostic control to bound that “thinking” cost without disabling reasoning entirely.
  • The Qwen docs (https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html#thinking-budget) already describe a brute-force approach (two-step generation) to implement “thinking budgets”.
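For context, the brute-force two-step approach could be sketched roughly as follows. This is a minimal illustration, not the exact recipe from the Qwen quickstart (which handles prompt text and chat templating differently); `generate_with_thinking_budget` and all of its arguments are illustrative names, and `generate` stands in for any callable that maps `(token_ids, max_new_tokens)` to newly generated token ids:

```python
def generate_with_thinking_budget(generate, prompt_ids, budget,
                                  end_think_ids, closer_ids, max_new_tokens):
    """Two-pass 'thinking budget' sketch (illustrative, not an existing API).

    Pass 1 spends at most `budget` tokens on reasoning; if the model never
    emitted the closing tag, we append `closer_ids` (e.g. the tokenized
    "</think>" plus a short lead-in) ourselves. Pass 2 then generates the
    final answer from the patched sequence.
    """
    first = generate(prompt_ids, budget)
    ids = prompt_ids + first
    if not _contains(first, end_think_ids):
        ids = ids + closer_ids  # manually close the unfinished <think> block
    return ids + generate(ids, max_new_tokens - budget)

def _contains(seq, sub):
    # True if `sub` appears as a contiguous subsequence of `seq`.
    return any(seq[i:i + len(sub)] == sub for i in range(len(seq) - len(sub) + 1))
```

The downside this proposal addresses: the two-step approach needs an extra decoding round-trip and prompt surgery per request, whereas a logits processor enforces the budget inside a single `generate` call.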

Your contribution

I want to submit a PR that:

  • Extends GenerationConfig with:
    • max_thinking_tokens: integer budget for reasoning tokens.
    • begin_thinking_token_id / end_thinking_token_id: marker IDs so generation knows where the thinking span begins and ends.
  • Adds a MaxThinkingTokensLogitsProcessor that watches the active <think> block. Once the budget is reached, it forces end_thinking_token_id, ensuring the model exits reasoning and continues with the final response.
  • Documents the new parameter in the reasoning-model guides (EXAONE, CWM, etc.) and shows how to wire up the thinking-token IDs until model configs provide them automatically.
  • Provides unit coverage so that _get_logits_processor injects the new processor whenever the config is fully specified.
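To make the processor bullet concrete, here is a minimal, framework-free sketch of the budget logic. Class and parameter names follow this proposal (they are not an existing transformers API), and plain Python lists stand in for tensors; the real implementation would subclass transformers' LogitsProcessor and operate on torch tensors in `(input_ids, scores)` form:

```python
import math

class MaxThinkingTokensLogitsProcessor:
    """Sketch: once `max_thinking_tokens` tokens have been generated inside
    an open <think> span, force all probability mass onto
    `end_thinking_token_id` so the model exits reasoning."""

    def __init__(self, max_thinking_tokens, begin_thinking_token_id,
                 end_thinking_token_id):
        self.max_thinking_tokens = max_thinking_tokens
        self.begin_id = begin_thinking_token_id
        self.end_id = end_thinking_token_id

    def __call__(self, input_ids, scores):
        # input_ids: generated token ids so far; scores: one logit per vocab id.
        if self.begin_id not in input_ids:
            return scores  # thinking has not started
        # Locate the most recent begin-thinking marker.
        start = len(input_ids) - 1 - input_ids[::-1].index(self.begin_id)
        span = input_ids[start + 1:]
        if self.end_id in span:
            return scores  # this thinking block is already closed
        if len(span) >= self.max_thinking_tokens:
            # Budget exhausted: make the closing token the only viable choice.
            scores = [-math.inf] * len(scores)
            scores[self.end_id] = 0.0
        return scores
```

The sketch also shows why begin_thinking_token_id is needed at all: the processor must know where the current span starts to count only reasoning tokens, rather than all generated tokens.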
