Code accompanying the paper Optimistic Dual Averaging Unifies Modern Optimizers.
- `soda.py`: Contains the `SODA` reference implementation along with various norm choices.
- `moda.py`: Contains `MODA`, the Modernized Optimistic Dual Averaging special case for which the implementation simplifies.
- `soda_wrapper.py`: A lightweight wrapper that adds the SODA averaging step on top of an existing optimizer without weight decay.
- `modded-nanogpt/`: Example usage in nanoGPT experiments.
The SODA optimizer comes with the following hyperparameters:
| Hyperparameter | Meaning | Default in `soda.py` | Common setting / comment |
|---|---|---|---|
| `lr` | Learning rate | `1e-4` | Typically uses a schedule. |
| `dual_momentum1` | Dual averaging parameter | `0.05` | This is `1 - usual_momentum` in PyTorch SGD-style notation. Same default as in Muon. |
| `dual_momentum2` | Optimistic parameter | `0.05` | Uses the same `1 - usual_momentum` convention and default. Set to `0` to disable optimism. |
| `primal_momentum1` | Averaging parameter | `1 / (k + 2)` | Uniform averaging over the primal iterates. |
| `primal_momentum2` | Extrapolation parameter | `0` | Default disables primal extrapolation, so gradients and evaluation use the same iterate. This is the MODA special case. |
| `norm` | Norm choice | `"Spectral"` | Supported: `Spectral`, `SpectralConv`, `Sign`, `ColNorm`, `RowNorm`, `BiasRMS`, `Sinkhorn`. |
| `norm_kwargs` | Arguments for the selected norm | `{}` | For example, `{"steps": 5}` for `Spectral` or `SpectralConv`. |
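The two conventions in the table can be checked with a few lines of plain Python. This is a minimal sketch, not code from `soda.py`: the helper names `to_dual_momentum`, `primal_momentum1`, and `running_average` are hypothetical, and the indexing (starting the average at the first iterate and counting `k` from 0) is an assumption made for illustration.

```python
import math


def to_dual_momentum(usual_momentum: float) -> float:
    """Convert a PyTorch SGD-style momentum (e.g. 0.95) to the 1 - momentum convention."""
    return 1.0 - usual_momentum


def primal_momentum1(k: int) -> float:
    """Averaging weight 1 / (k + 2) from the table, for k = 0, 1, 2, ..."""
    return 1.0 / (k + 2)


def running_average(iterates):
    """Apply x <- (1 - c_k) * x + c_k * z with c_k = 1 / (k + 2), starting at iterates[0]."""
    x = iterates[0]
    for k, z in enumerate(iterates[1:]):
        c = primal_momentum1(k)
        x = (1.0 - c) * x + c * z
    return x


# A usual momentum of 0.95 corresponds to dual_momentum1 = 0.05 (up to rounding).
assert math.isclose(to_dual_momentum(0.95), 0.05)

# With weights 1 / (k + 2), the recursion reproduces the uniform mean of the iterates.
zs = [4.0, 1.0, 7.0, 2.0]
assert math.isclose(running_average(zs), sum(zs) / len(zs))
```

Under these assumptions, the `1 / (k + 2)` weight is exactly what turns the recursive update into an unweighted mean, which is why the table describes it as uniform averaging.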
**Note:** For evaluation with standalone `SODA`, call `optimizer.eval()` before validation and `optimizer.train()` before resuming training. This switches between the averaged iterate $x^k$, where performance is evaluated, and the extrapolated iterate $y^k$, where gradients are computed. With the default $\bar\lambda_k=0$, this simplifies to $x^k=y^k$ (see `MODA`).
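The call order described in the note can be sketched as a small loop. This is an illustrative helper, not part of the repository: `train_with_periodic_eval`, `train_step`, and `validate` are hypothetical names, and the only thing assumed about `optimizer` is that it exposes `train()` and `eval()` as stated above.

```python
def train_with_periodic_eval(optimizer, train_step, validate, num_steps, eval_every):
    """Run `num_steps` training steps and validate every `eval_every` steps.

    `train_step` is a caller-supplied callable that performs one full update
    (forward, backward, optimizer step); `validate` returns a metric.
    """
    results = []
    optimizer.train()  # make sure gradients are computed at the training iterate
    for step in range(1, num_steps + 1):
        train_step()
        if step % eval_every == 0:
            optimizer.eval()            # switch to the averaged iterate for validation
            results.append(validate())
            optimizer.train()           # switch back before resuming training
    return results
```

Keeping the mode switch inside one helper makes it harder to forget the `optimizer.train()` call after validation.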
Usage:
```python
from soda import SODA  # assumes soda.py from this repository is on the import path

# `model` is an existing nanoGPT-style model with `transformer.h` and `lm_head`.
optim_groups = [{
    "params": model.transformer.h.parameters(),
    "lr": 50 * 2**-12,
    "norm": "Spectral",
    "norm_kwargs": {"steps": 5},
}, {
    "params": model.lm_head.parameters(),
    "lr": 3000 * 2**-12,
    "norm": "Sign",
    "norm_kwargs": {},
}]
optimizer = SODA(
    optim_groups,
    dual_momentum1=0.05,
    dual_momentum2=0.05,
)
```

These parameter choices are based on the hyperparameters of uScion, which surprisingly work out of the box even with the 1/k schedule in SODA.
`MODA` is the Modernized Optimistic Dual Averaging special case of `SODA`, obtained by setting the primal extrapolation parameter to zero (`primal_momentum2=0`).
In this configuration, performance is evaluated and gradients are computed at the same iterate, which simplifies the code substantially because the `optimizer.train()` and `optimizer.eval()` calls are no longer needed.
Usage:
```python
from moda import MODA  # assumes moda.py from this repository is on the import path

optimizer = MODA(
    optim_groups,
    dual_momentum1=0.05,
    dual_momentum2=0.05,
)
```

`SODAWrapper` has no additional hyperparameters. It wraps a base optimizer and applies the SODA averaging step after the base optimizer update. The wrapper should be used with weight decay disabled in the base optimizer.
An existing optimizer can be wrapped as:
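```python
from soda_wrapper import SODAWrapper  # assumes soda_wrapper.py is on the import path

# `Muon` stands for any existing Muon implementation available in your project.
base_optimizer = Muon(params, lr=lr, weight_decay=0)
optimizer = SODAWrapper(base_optimizer)
```

**Note:** The purpose of `SODAWrapper` is to quickly test the averaging step on any given base optimizer. It is not memory-optimized; for production, we recommend folding the wrapper logic into the base optimizer.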
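To make the idea of "averaging after the base update" concrete, here is a toy sketch in plain Python. It is not the code in `soda_wrapper.py`: the classes `ToyGradientDescent` and `ToyAveragingWrapper` are hypothetical, parameters are a plain list of floats, and the sketch assumes that the wrapper keeps a separate copy of the base iterate and averages with weight `1 / (k + 2)`.

```python
class ToyGradientDescent:
    """Hypothetical stand-in for a base optimizer acting on a list of floats."""

    def __init__(self, params, lr):
        self.params = params  # updated in place
        self.lr = lr

    def step(self, grads):
        for i, g in enumerate(grads):
            self.params[i] -= self.lr * g


class ToyAveragingWrapper:
    """Illustrative only: stores the base iterate z and exposes the average x in params."""

    def __init__(self, base):
        self.base = base
        self.z = list(base.params)  # extra copy of the parameters (the memory overhead)
        self.k = 0

    def step(self, grads):
        x = list(self.base.params)    # averaged iterate, where grads were computed
        self.base.params[:] = self.z  # restore the base iterate
        self.base.step(grads)         # base optimizer update (weight decay assumed off)
        self.z = list(self.base.params)
        c = 1.0 / (self.k + 2)        # assumed uniform averaging weight
        self.base.params[:] = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, self.z)]
        self.k += 1


params = [1.0]
opt = ToyAveragingWrapper(ToyGradientDescent(params, lr=0.5))
for _ in range(3):
    grads = [2.0 * p for p in params]  # gradient of f(p) = p**2 at the averaged iterate
    opt.step(grads)
```

The extra copy `z` in this sketch illustrates why a generic wrapper costs additional memory, and why folding the averaging into the base optimizer can avoid it.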
If you find this work useful, please cite it as follows:
```bibtex
@article{pethick2026optimistic,
  title={Optimistic Dual Averaging Unifies Modern Optimizers},
  author={Pethick, Thomas and Xie, Wanyun and Machacek, Roman and Cevher, Volkan},
  journal={arXiv preprint arXiv:2605.11172},
  year={2026}
}
```