## 🆕 What's New

* [2025/11] AutoRound has landed in **LLM-Compressor**: [*Usage*](https://github.com/vllm-project/llm-compressor/tree/main/examples/autoround/README.md).
* [2025/11] An **enhanced GGUF** quantization algorithm is available via `--enable_alg_ext` (see the sketch after this list): [*Accuracy*](./docs/gguf_alg_ext_acc.md).
* [2025/10] AutoRound has been integrated into **SGLang**: [*Usage*](), [*LMSYS Blog*](https://lmsys.org/blog/2025-11-13-AutoRound/), [*X post*](https://x.com/lmsysorg/status/1991977019220148650?s=20), [*LinkedIn*](https://www.linkedin.com/feed/update/urn:li:activity:7397742859354857472).
* [2025/10] A **mixed-precision** algorithm is available to generate quantization schemes in minutes: [*Usage*](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme), [*Accuracy*](./docs/auto_scheme_acc.md).
* [2025/09] **MXFP4** and **NVFP4** dtypes are available: [*Accuracy*](./docs/mxnv_acc.md).
* [2025/08] An **improved INT2** algorithm is available via `--enable_alg_ext`: [*Accuracy*](./docs/alg_202508.md).
* [2025/07] **GGUF** format is supported: [*Usage*](./docs/step_by_step.md#gguf-format).
* [2025/05] AutoRound has been integrated into **vLLM**: [*Usage*](https://docs.vllm.ai/en/latest/features/quantization/auto_round/), [*Blog*](https://medium.com/@NeuralCompressor/accelerating-vllm-and-sglang-deployment-using-autoround-45fdc0b2683e).
* [2025/05] AutoRound has been integrated into **Transformers**: [*Blog*](https://huggingface.co/blog/autoround).
* [2025/03] The INT2-mixed **DeepSeek-R1** model (~200GB) retains 97.9% accuracy: [*Model*](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc).
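
Two of the items above hinge on the `--enable_alg_ext` flag. The snippet below is a minimal sketch of where that option plugs in via the AutoRound Python API; the model name and output directory are placeholders, and `enable_alg_ext=True` is an assumed keyword mirror of the CLI flag.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Placeholder model; any Hugging Face causal LM can be used here.
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Standard 4-bit, group-size-128 weight-only settings; enable_alg_ext=True is
# assumed to mirror the CLI's --enable_alg_ext (the enhanced algorithms above).
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, enable_alg_ext=True)
autoround.quantize_and_save("./qmodel", format="auto_round")
```
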
## ✨ Key Features
### Transformers (CPU/Intel GPU/Gaudi/CUDA)
AutoRound supports 10+ backends and automatically selects the best available backend based on the installed libraries, prompting the user to install additional libraries when a better backend is found.

**Please avoid manually moving the quantized model to a different device** (e.g., `model.to('cpu')`) during inference, as this may cause unexpected exceptions.

Support for the Gaudi device is limited.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: substitute any model quantized and saved in the AutoRound format.
model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

Special thanks to open-source low-precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.

> **Note**:
> For all publications and events, please see the [Publication List](./docs/publication_list.md).

## 🌟 Support Us
If you find AutoRound helpful, please ⭐ star the repo and share it with your community!

## 2025 (3)

* Blog in LMSYS: [AutoRound Meets SGLang: Enabling Quantized Model Inference with AutoRound](https://lmsys.org/blog/2025-11-13-AutoRound/) (Nov 2025)
* Blog in Medium: [Accelerating vLLM and SGLang Deployment using AutoRound](https://medium.com/@NeuralCompressor/accelerating-vllm-and-sglang-deployment-using-autoround-45fdc0b2683e) (Oct 2025)
* Blog in HuggingFace: [What is AutoRound?](https://huggingface.co/blog/autoround) (Apr 2025)

## 2024 (1)

* EMNLP: [Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLM](https://aclanthology.org/2024.findings-emnlp.662/) (Oct 2024)

## 2023 (2)

* arXiv: [TEQ: Trainable Equivalent Transformation for Quantization of LLMs](https://arxiv.org/abs/2310.10944) (Oct 2023)
* Blog in Medium: [Effective Post-Training Quantization for Large Language Models](https://medium.com/intel-analytics-software/effective-post-training-quantization-for-large-language-models-with-enhanced-smoothquant-approach-93e9d104fb98) (Apr 2023)