implement Power Law sampling #17927
Conversation
I think this is more or less ready for review now. Also pinging @MrJackSpade in case he'd like to chime in.

Nevermind, sorry, I think we want to do a little more testing. I'm going to mark this as draft again temporarily.
pnb left a comment:
This looks very interesting! I wish the original had compared it to XTC, since the goals seem highly similar.
As an aside, I am curious if there is some way to make it work without selecting a token (i.e., only steps 1-3). I see why token selection is necessary, given the need to save the original probability to the history for the adaptive adjustment part. But, for example, maybe it would suffice instead to save the original probability of the highest-probability token after transforming, regardless of which one is eventually selected by a downstream sampler.
src/llama-sampling.cpp (Outdated)

```cpp
// fixed power law transform parameters (from original implementation)
const float distribution_width = 0.2f;
const float peak_logit_value = 3.0f;
```
Should these parameters be configurable like in the original implementation? There is probably a tradeoff with feature creep, having too many options for users to control, but some of these seem potentially important (especially distribution_width). Also, I noticed peak_logit_value is outside the range suggested in the original implementation; is that intentional?
The original author and I are discussing the parameters over the next few days. I agree that the current implementation is probably not ideal, which is why I marked it back as draft.
I will post a comment in the main thread with an update once we've got it more figured out. Thank you!
This PR implements a new sampler that reshapes token probability distributions to favor tokens near a configurable target probability, rather than selecting from the highest-probability candidates. The technique is called Power Law sampling and it was originally described and implemented by @MrJackSpade here.
Theory
Traditional samplers operate on a simple principle: select from the most probable tokens. Power Law sampling takes a fundamentally different approach: select tokens whose probability falls near a configurable target value.
This treats probability space as navigable terrain, allowing you to intentionally sample from specific regions of the model's probability distribution rather than always defaulting to the top candidates.
How it works
The power law transform assigns new logits based on distance from the target:
Tokens near the target get high logits; tokens far from it get low logits.
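To make that concrete, here is a minimal sketch of one plausible shape for such a reweighting. The `candidate` struct, the function name, and the exact decay formula are illustrative assumptions rather than the PR's actual code; only `distribution_width` and `peak_logit_value` come from the snippet quoted in the review comments above.

```cpp
// Illustrative sketch only: overwrite each candidate's logit based on how far its
// original probability is from the target. The decay shape is an assumption;
// distribution_width and peak_logit_value mirror the fixed parameters quoted above.
#include <cmath>
#include <vector>

struct candidate {
    float p;      // original (softmax) probability
    float logit;  // logit to be overwritten by the transform
};

static void power_law_transform(std::vector<candidate> & cands, float target) {
    const float distribution_width = 0.2f;
    const float peak_logit_value   = 3.0f;
    for (auto & c : cands) {
        // normalized distance from the target probability
        const float d = std::fabs(c.p - target) / distribution_width;
        // tokens at the target receive peak_logit_value; the logit decays with distance
        c.logit = peak_logit_value / (1.0f + d * d);
    }
}
```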
Advantages
The sampler is designed to promote "mid-range" tokens - ones the model considers plausible but not dominant.
For example, with `target=0.10`, tokens in the 5-15% probability range get boosted, while the dominant 60%+ tokens get suppressed. The model still respects its own confidence structure (unlike pure temperature scaling), so you avoid boosting actual nonsense.
Parameters
- `--power-law-target` (float, 0.0-1.0, default 0.5): The probability value to favor.
- `--power-law-target-range` (float, default 0.5): Adaptive range around the target. The actual target can shift within `target ± range` based on history. Set to 0.0 for a fixed target. The range is clamped internally to `[0.0, 1.0]`.
- `--power-law-window-size` (int, default 10): Rolling window size for adaptive targeting. When > 0, the sampler tracks the original probabilities of recent selections and nudges the target to maintain the desired average. Set to 0 for a fixed target. The default of 10 is a good value.
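As a rough illustration of the adaptive-targeting behaviour described above, the sketch below keeps a rolling window of the original probabilities of selected tokens and nudges the effective target so their running average moves toward the configured target. The struct layout, field names, and nudging rule are assumptions for illustration, not the PR's implementation.

```cpp
// Illustrative sketch of adaptive targeting (not the PR's actual code).
#include <algorithm>
#include <cstddef>
#include <deque>

struct power_law_state {
    float  target;       // --power-law-target
    float  range;        // --power-law-target-range
    size_t window_size;  // --power-law-window-size
    std::deque<float> history;  // original probabilities of recently selected tokens

    // Target actually used for the transform: shifted against the recent average,
    // clamped to [target - range, target + range] (and to valid probabilities).
    float effective_target() const {
        if (window_size == 0 || history.empty()) {
            return target;
        }
        float avg = 0.0f;
        for (float p : history) { avg += p; }
        avg /= history.size();
        const float nudged = target + (target - avg);
        return std::clamp(nudged, std::max(0.0f, target - range), std::min(1.0f, target + range));
    }

    // Record the pre-transform probability of the token that was selected.
    void accept(float original_p) {
        if (window_size == 0) {
            return;
        }
        history.push_back(original_p);
        if (history.size() > window_size) {
            history.pop_front();
        }
    }
};
```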
Usage
This sampler selects a token rather than just filtering candidates, like `greedy`, `dist`, or `mirostat`. It must be the final sampler in the chain. Light filtering beforehand (like a mild `min-p`) can help remove garbage tokens.
This sampler is intentionally not part of the default sampler chain. To enable it, add `power_law` (or `power-law`) to your sampler chain, e.g. with `--samplers "top_k;min_p;power_law"`.
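For example, a possible invocation might look like the following; the model path, prompt, and parameter values are illustrative, while the flags are the ones documented above.

```bash
# Illustrative: enable Power Law as the final sampler after light top-k/min-p filtering.
./llama-cli -m model.gguf \
    --samplers "top_k;min_p;power_law" \
    --power-law-target 0.10 \
    --power-law-target-range 0.05 \
    --power-law-window-size 10 \
    -p "Once upon a time"
```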