
@ddh0 (Contributor) commented Dec 11, 2025

This PR implements a new sampler that reshapes token probability distributions to favor tokens near a configurable target probability, rather than selecting from the highest-probability candidates. The technique is called Power Law sampling and it was originally described and implemented by @MrJackSpade here.

Theory

Traditional samplers operate on a simple principle: select from the most probable tokens. Power Law sampling takes a fundamentally different approach: select tokens whose probability falls near a configurable target value.

This treats probability space as navigable terrain, allowing you to intentionally sample from specific regions of the model's probability distribution rather than always defaulting to the top candidates.

How it works

  1. Compute original softmax probabilities
  2. Calculate the target probability (optionally adaptive based on recent history)
  3. Reshape the distribution using a power law transform that peaks at the target
  4. Sample from the reshaped distribution
  5. Record the original probability of the selected token for adaptive targeting

The power law transform assigns new logits based on distance from the target:

new_logit = 3.0 / (1 + (|p - target| / 0.2)^3)

Tokens near the target get high logits; tokens far from it get low logits.
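
As a rough illustration (not the PR's actual code; the function and variable names here are assumptions), the transform can be sketched in C++ as follows:

#include <cmath>
#include <vector>
// Sketch only: reshape original softmax probabilities so that tokens whose
// probability lies near `target` receive the highest new logits.
std::vector<float> power_law_transform(const std::vector<float> & probs, float target) {
    const float distribution_width = 0.2f; // how quickly logits fall off with distance from the target
    const float peak_logit_value   = 3.0f; // logit assigned when p == target
    std::vector<float> new_logits(probs.size());
    for (size_t i = 0; i < probs.size(); ++i) {
        const float d = std::fabs(probs[i] - target) / distribution_width;
        new_logits[i] = peak_logit_value / (1.0f + d * d * d);
    }
    return new_logits; // these are re-softmaxed before sampling
}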

Advantages

The sampler is designed to promote "mid-range" tokens - ones the model considers plausible but not dominant. This can help with:

  • Reducing repetitive, predictable outputs
  • Exploring creative alternatives that are still coherent
  • Maintaining variety over long generations via adaptive targeting

For example, with target=0.10, tokens in the 5-15% probability range get boosted, while dominant tokens at 60%+ get suppressed. Unlike pure temperature scaling, the sampler still respects the model's own confidence structure, so it avoids boosting actual nonsense.
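
Plugging numbers into the formula above with target = 0.10: a token at p = 0.10 receives the full peak logit of 3.0, a token at p = 0.22 still receives roughly 2.5, while a token at p = 0.65 drops to roughly 0.14, so the dominant token ends up heavily suppressed after re-softmaxing.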

Parameters

  • --power-law-target (float, 0.0-1.0, default 0.5): The probability value to favor.
  • --power-law-target-range (float, default 0.5): Adaptive range around the target. The actual target can shift within target ± range based on history. Set to 0.0 for a fixed target. The range is clamped internally to [0.0, 1.0].
  • --power-law-window-size (int, default 10): Rolling window size for adaptive targeting. When > 0, the sampler tracks the original probabilities of recent selections and nudges the target to maintain the desired average (see the sketch below). Set to 0 for a fixed target. The default of 10 works well in practice.
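
A rough sketch of what the adaptive targeting could look like is below (illustration only; the struct, field names, and the exact nudge formula are assumptions, not the PR's actual logic):

#include <algorithm>
#include <deque>
// Sketch only: track the original probabilities of recently selected tokens and
// nudge the effective target so the running average drifts back toward the
// configured base target. The real adjustment in the PR may differ.
struct power_law_state_sketch {
    float base_target  = 0.5f; // --power-law-target
    float target_range = 0.5f; // --power-law-target-range
    int   window_size  = 10;   // --power-law-window-size
    std::deque<float> history; // original probabilities of recent selections
    float effective_target() const {
        if (window_size == 0 || target_range == 0.0f || history.empty()) {
            return base_target; // fixed target
        }
        float avg = 0.0f;
        for (float p : history) { avg += p; }
        avg /= (float) history.size();
        // if recent picks were more probable than desired, aim lower (and vice versa),
        // never shifting outside target ± range or outside [0, 1]
        const float shift = std::clamp(base_target - avg, -target_range, target_range);
        return std::clamp(base_target + shift, 0.0f, 1.0f);
    }
    void record(float orig_prob) {
        history.push_back(orig_prob);
        while ((int) history.size() > window_size) {
            history.pop_front();
        }
    }
};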

Usage

Like greedy, dist, or mirostat, this sampler selects a token rather than just filtering candidates, so it must be the final sampler in the chain. Light filtering beforehand (such as a mild min-p) can help remove garbage tokens.

This sampler is intentionally not part of the default sampler chain. To enable it, add power_law (or power-law) to your sampler chain, e.g. with --samplers "top_k;min_p;power_law".
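
For instance, a full invocation might look like the following (the model path and the specific values are placeholders, not tuned recommendations):

llama-cli -m /path/to/model.gguf \
    --samplers "top_k;min_p;power_law" \
    --min-p 0.02 \
    --power-law-target 0.10 \
    --power-law-target-range 0.05 \
    --power-law-window-size 10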

@ddh0 (Contributor, Author) commented Dec 11, 2025

I think this is more or less ready for review now. Also pinging @MrJackSpade in case he'd like to chime in.

ddh0 marked this pull request as ready for review December 11, 2025 23:59
ddh0 requested a review from ggerganov as a code owner December 11, 2025 23:59
@ddh0 (Contributor, Author) commented Dec 12, 2025

Nevermind, sorry, I think we want to do a little more testing. I'm going to mark this as draft again temporarily.

ddh0 marked this pull request as draft December 12, 2025 02:55
@pnb (Contributor) left a comment:

This looks very interesting! I wish the original write-up had compared it to XTC, since the goals seem highly similar.

As an aside, I am curious if there is some way to make it work without selecting a token (i.e., only steps 1-3). I see why token selection is necessary, given the need to save the original probability to the history for the adaptive adjustment part. But, for example, maybe it would suffice instead to save the original probability of the highest-probability token after transforming, regardless of which one is eventually selected by a downstream sampler.


// fixed power law transform parameters (from original implementation)
const float distribution_width = 0.2f;
const float peak_logit_value = 3.0f;
pnb (Contributor) commented on this snippet:

Should these parameters be configurable like in the original implementation? There is probably a tradeoff with feature creep, having too many options for users to control, but some of these seem potentially important (especially distribution_width). Also, I noticed peak_logit_value is outside the range suggested in the original implementation; is that intentional?

ddh0 (Contributor Author) replied:

The original author and I are discussing the parameters over the next few days. I agree that the current implementation is probably not ideal, which is why I marked it back as draft.

I will post a comment in the main thread with an update once we've got it more figured out. Thank you!
