77 commits
c2b83e2
Split up enhancement and features in release notes template (#984)
NathanHB Oct 14, 2025
090630e
Fixing mixeval (#1006)
clefourrier Oct 14, 2025
3af8925
Fix nltk import failing (#1013)
clefourrier Oct 14, 2025
70acb85
Fix 999: always provide parameters in the metric name to allow using …
clefourrier Oct 14, 2025
e7d885c
added fallback for incomplete configs for vlm models launched as llms…
clefourrier Oct 14, 2025
161d47c
Fixing naming for sample evals + adding reqs in aime24 (#989)
clefourrier Oct 14, 2025
bf8b547
add translation literals indic (#1015)
rpm000 Oct 21, 2025
3cd31fd
Move tasks to individual files (#1016)
NathanHB Oct 29, 2025
880bebe
Adds inspectai (#1022)
NathanHB Nov 3, 2025
fa4860f
adds mmlu-pro (#1031)
NathanHB Nov 4, 2025
17e024b
Fix inspect reasoning effort (#1033)
NathanHB Nov 4, 2025
97303ac
Update huggingface-cli login to use newer hf auth login (#1034)
xeophon Nov 4, 2025
5aa09c5
add openai and inspect ai lower bound (#1035)
NathanHB Nov 4, 2025
b5cbd91
fix `lighteval task inspect` command and tiny bench task (#992)
NathanHB Nov 5, 2025
5b7ca62
run all hf providers with `:all` (#1039)
NathanHB Nov 5, 2025
31433cc
remove suites and make fewshot optional (#1038)
NathanHB Nov 5, 2025
566a7be
put lower bound on typer to use literal type (#1042)
NathanHB Nov 6, 2025
d04e4f9
remove suites from serbian_eval.py (#1044)
S-Y-A-N Nov 12, 2025
cd91dde
neater bundle and logdir (#1043)
NathanHB Nov 12, 2025
2247df7
not forcing use_logits at True (#1050)
f14-bertolotti Nov 12, 2025
6524c6a
wrong attribute self.k -> self.n (#1049)
f14-bertolotti Nov 12, 2025
35babcb
Remove suites in task configs example and fix task with hf_filters (#…
NathanHB Nov 12, 2025
b8ccd20
add a task dump in registry for better documentation of tasks (#1052)
NathanHB Nov 12, 2025
cb97d5c
Fix set using wrong syntax (#1057)
f14-bertolotti Nov 13, 2025
391d5b4
Fix: correct argument order in MajAtN.compute (#1058)
WeiKangda Nov 14, 2025
af6b5b4
Update LiteLLM configuration for hosted_vllm provider (#1060)
abhiram1809 Nov 14, 2025
ad58fed
use correct hf subset for ifbench multiturn (#1061)
sam-paech-liquid Nov 17, 2025
babeec9
One file one task definition (#1059)
NathanHB Nov 17, 2025
d9ea404
adding starred tag for frontend
NathanHB Nov 18, 2025
5425c33
Adding AA Omniscience task (#1066)
NathanHB Nov 18, 2025
99162f1
Fix task config metric typing to accept Metric enums (#1018)
emmanuel-ferdman Nov 20, 2025
2236e17
removed duplicate code, useless function, added stronger deletion of …
clefourrier Nov 20, 2025
1496355
[FT] Add `py.typed` to `lighteval` (#1071)
akshathmangudi Nov 20, 2025
7f50228
Add style bot and other quality of life (#1076)
NathanHB Nov 21, 2025
d59ce25
Add style bot (#1077)
NathanHB Nov 21, 2025
a64541c
Add style bot (#1078)
NathanHB Nov 21, 2025
66ce47e
Add style bot (#1079)
NathanHB Nov 21, 2025
943c4c3
batched metric was not aggregated properly (#1067)
f14-bertolotti Nov 24, 2025
9009723
add to inspect (#1065)
NathanHB Nov 24, 2025
5803818
bumping version
NathanHB Nov 24, 2025
99ef5b9
Fix the quickstart description? (#1091)
julien-c Nov 28, 2025
98ac1bf
Add starred attribute to gpqa.py metadata
NathanHB Dec 4, 2025
48dcd83
Update available-tasks.mdx (#1088)
bram-pramono Dec 4, 2025
6889901
[DOC] Fix dev dependencies install command (#1085)
jgyasu Dec 4, 2025
557b8d5
update hf_revision hash for multilingual hellaswag (#1084)
rpm000 Dec 4, 2025
2c98b54
feat: Add Kyrgyz LLM Bench multilingual tasks (#1070)
golden-ratio Dec 8, 2025
c6e2ce7
aime_avg was not added to TASKS_TABLE (#1098)
francesco-bertolotti Dec 8, 2025
6d1f147
Comment out PR Style Bot workflow configuration
paulinebm Dec 9, 2025
a5d13a4
Refactor PR Style Bot workflow configuration
paulinebm Dec 10, 2025
22aa98c
Add TEST secret environment variable
paulinebm Dec 10, 2025
0a8b90a
Comment out PR Style Bot workflow configuration
paulinebm Dec 10, 2025
8238d3e
Refactor PR Style Bot workflow configuration
paulinebm Dec 11, 2025
f48af0b
Comment out PR Style Bot workflow configuration
paulinebm Dec 11, 2025
7fba130
Update comment bot secrets in workflow (#1107)
paulinebm Dec 15, 2025
6496d62
Refactor PR Style Bot workflow with new inputs (#1105)
NathanHB Dec 15, 2025
03d8c4e
Enable loading data sets from files for custom tasks (#1083)
davebiagioni Jan 6, 2026
36b3e6c
refactor: adding api_key param to litellm (#1114)
pjavanrood Jan 8, 2026
d9a9401
multi challenge (#1120)
NathanHB Jan 13, 2026
845c989
refactor: add formatted response to litellm (#1116)
pjavanrood Jan 13, 2026
e7048c3
Mathvista (#1118)
NathanHB Jan 13, 2026
61c547b
long horizon execution (#1119)
NathanHB Jan 14, 2026
f888858
bbeh (#1124)
NathanHB Jan 14, 2026
6ce93d5
Fix typo in few_shots_select option error message. ('fbalanced' -> 'b…
jayminban Jan 14, 2026
06aee5b
add eval results tip (#1126)
burtenshaw Jan 21, 2026
0a74a17
Upgrade vLLM from 0.10.1.1 to 0.14.1 (#1173)
NathanHB Mar 4, 2026
33acf35
Add test on main branch of vllm (#1175)
NathanHB Mar 4, 2026
e274b37
🔒 Pin GitHub Actions to commit SHAs (#1201)
paulinebm Apr 7, 2026
34889df
Add support for vllm >= 0.19.0 (#1211)
lewtun Apr 13, 2026
10b9104
chore: bump doc-builder SHA for PR upload workflow (#1213)
rtrompier Apr 15, 2026
d1cf663
Merge upstream huggingface/lighteval main into merge_hf_main
Jeronymous Apr 22, 2026
180975c
Fix ruff style and lint after merge
Jeronymous Apr 22, 2026
2466d64
Solve version incompatibility in project install
Jeronymous Apr 22, 2026
68494ca
less differences with the upstream branch
Jeronymous Apr 22, 2026
9ca1f4b
Add copyright
Jeronymous Apr 22, 2026
6ee2a9e
less differences with the upstream branch
Jeronymous Apr 22, 2026
d9fe736
do not build doc on fork
Jeronymous Apr 22, 2026
379ed71
Add safety / red-teaming benchmarks
Jeronymous Apr 22, 2026
5 changes: 4 additions & 1 deletion .github/release.yml
@@ -5,7 +5,10 @@ changelog:
categories:
- title: New Features 🎉
labels:
- feature/enhancement
- feature
- title: Enhancement ⚙️
labels:
- enhancement
- title: Documentation 📚
labels:
- documentation
3 changes: 2 additions & 1 deletion .github/workflows/doc-build.yml
100644 → 100755
@@ -9,7 +9,8 @@ on:

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
if: github.repository == 'huggingface/lighteval'
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
with:
commit_sha: ${{ github.sha }}
package: lighteval
3 changes: 2 additions & 1 deletion .github/workflows/doc-pr-build.yml
100644 → 100755
@@ -9,7 +9,8 @@ concurrency:

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
if: github.repository == 'huggingface/lighteval'
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
6 changes: 4 additions & 2 deletions .github/workflows/doc-pr-upload.yml
100644 → 100755
@@ -8,9 +8,11 @@ on:

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
if: github.repository == 'huggingface/lighteval'
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@9ad2de8582b56c017cb530c1165116d40433f1c6 # main
with:
package_name: lighteval
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
comment_bot_app_id: ${{ secrets.COMMENT_BOT_APP_ID }}
comment_bot_secret_pem: ${{ secrets.COMMENT_BOT_SECRET_PEM }}
16 changes: 16 additions & 0 deletions .github/workflows/pr_style_bot.yaml
@@ -0,0 +1,16 @@
name: PR Style Bot

on:
issue_comment:
types: [created]

permissions:
pull-requests: write

jobs:
style:
uses: huggingface/huggingface_hub/.github/workflows/style-bot-action.yml@e000c1c89c65aee188041723456ac3a479416d4c # main
with:
python_quality_dependencies: "[quality]"
secrets:
bot_token: ${{ secrets.HF_STYLE_BOT_ACTION }}
4 changes: 2 additions & 2 deletions .github/workflows/quality.yaml
@@ -16,9 +16,9 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v2
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Setup Python environment
uses: actions/setup-python@v2
uses: actions/setup-python@e9aba2c848f5ebd159c070c61ea2c4e2b122355e # v2
with:
python-version: '3.10'
- name: Install dependencies
44 changes: 40 additions & 4 deletions .github/workflows/slow_tests.yaml
@@ -25,21 +25,57 @@ jobs:
fi

- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
lfs: true

- name: Install uv
uses: astral-sh/setup-uv@v5
uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5
with:
enable-cache: true

- name: Install the project
run: uv sync --extra dev
run: uv sync --extra dev-gpu

- name: Install Python development headers
run: sudo apt-get update && sudo apt-get install -y python3.12-dev

- name: Cache CUDA Toolkit
id: cache-cuda
uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830 # v4
with:
path: /usr/local/cuda-12.8
key: cuda-toolkit-12-8-${{ runner.os }}

- name: Install CUDA Toolkit
if: steps.cache-cuda.outputs.cache-hit != 'true'
run: |
# Add NVIDIA package repositories
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install CUDA toolkit 12.8 to match nvidia-cuda-runtime-cu12==12.8.90
sudo apt-get install -y cuda-toolkit-12-8

- name: Verify CUDA installation
run: |
ls -la /usr/local/cuda-12.8/bin/nvcc || echo "WARNING: nvcc not found at /usr/local/cuda-12.8/bin/nvcc"
if [ -f /usr/local/cuda-12.8/bin/nvcc ]; then
/usr/local/cuda-12.8/bin/nvcc --version
fi

- name: Setup CUDA environment
run: |
export CUDA_HOME=/usr/local/cuda-12.8
export PATH="/usr/local/cuda-12.8/bin:$PATH"
echo "CUDA_HOME=/usr/local/cuda-12.8" >> $GITHUB_ENV
echo "/usr/local/cuda-12.8/bin" >> $GITHUB_PATH

- name: run nvidia-smi
run: nvidia-smi

- name: Run tests
run: uv run pytest --disable-pytest-warnings --runslow tests/slow_tests/
run: |
export CUDA_HOME=/usr/local/cuda-12.8
export PATH="/usr/local/cuda-12.8/bin:$PATH"
uv run pytest --disable-pytest-warnings --runslow -v -s tests/slow_tests/
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -46,7 +46,7 @@ jobs:
enable-cache: true

- name: Install the project
run: uv sync --extra dev
run: uv sync --extra dev-gpu

- name: Ensure cache directories exist
run: mkdir -p cache/models cache/datasets
4 changes: 2 additions & 2 deletions .github/workflows/trufflehog.yml
@@ -11,10 +11,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- name: Secret Scanning
uses: trufflesecurity/trufflehog@main
uses: trufflesecurity/trufflehog@6bd2d14f7a4bc1e569fa3550efa7ec632a4fa67b # main
with:
extra_args: --only-verified
79 changes: 79 additions & 0 deletions .github/workflows/vllm_main_tests.yaml
@@ -0,0 +1,79 @@
name: vLLM Main Branch Tests

on:
schedule:
- cron: '0 2 * * 1' # Every Monday at 2 AM UTC
workflow_dispatch:

permissions:
contents: read

jobs:
test_vllm_main:
name: Test with vLLM main branch
runs-on: 'aws-g4dn-2xlarge-use1-public-80'
continue-on-error: true

steps:
- name: Install Git LFS
run: |
if ! command -v git-lfs &> /dev/null; then
sudo apt-get update && sudo apt-get install -y git-lfs
git lfs install
fi

- name: Checkout repository
uses: actions/checkout@v4
with:
lfs: true

- name: Install uv
uses: astral-sh/setup-uv@v5
with:
enable-cache: true

- name: Install the project
run: uv sync --extra dev-gpu

- name: Install Python development headers
run: sudo apt-get update && sudo apt-get install -y python3.12-dev

- name: Cache CUDA Toolkit
id: cache-cuda
uses: actions/cache@v4
with:
path: /usr/local/cuda-12.8
key: cuda-toolkit-12-8-${{ runner.os }}

- name: Install CUDA Toolkit
if: steps.cache-cuda.outputs.cache-hit != 'true'
run: |
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-8

- name: Setup CUDA environment
run: |
echo "CUDA_HOME=/usr/local/cuda-12.8" >> $GITHUB_ENV
echo "/usr/local/cuda-12.8/bin" >> $GITHUB_PATH

- name: Verify CUDA
run: |
nvidia-smi
nvcc --version

- name: Install vLLM from main branch
run: |
uv pip uninstall -y vllm || true
uv pip install git+https://github.com/vllm-project/vllm.git@main

- name: Get vLLM version
id: vllm-info
run: |
VERSION=$(uv run python -c "import vllm; print(vllm.__version__)")
echo "version=$VERSION" >> $GITHUB_OUTPUT
echo "Testing vLLM version: $VERSION"

- name: Run tests
run: uv run pytest --disable-pytest-warnings --runslow -v -s tests/slow_tests/test_vllm_model.py
29 changes: 20 additions & 9 deletions README.md
@@ -25,6 +25,9 @@
<a href="https://huggingface.co/docs/lighteval/main/en/index" target="_blank">
<img alt="Documentation" src="https://img.shields.io/badge/Documentation-4F4F4F?style=for-the-badge&logo=readthedocs&logoColor=white" />
</a>
<a href="https://huggingface.co/spaces/OpenEvals/open_benchmark_index" target="_blank">
<img alt="Open Benchmark Index" src="https://img.shields.io/badge/Open%20Benchmark%20Index-4F4F4F?style=for-the-badge&logo=huggingface&logoColor=white" />
</a>
</p>

---
@@ -39,7 +42,10 @@ sample-by-sample results* to debug and see how your models stack up.

## Available Tasks

Lighteval supports **7,000+ evaluation tasks** across multiple domains and languages. Here's an overview of some *popular benchmarks*:
Lighteval supports **1000+ evaluation tasks** across multiple domains and
languages. Use [this
space](https://huggingface.co/spaces/OpenEvals/open_benchmark_index) to find what
you need, or read on for an overview of some *popular benchmarks*:


### 📚 **Knowledge**
@@ -62,7 +68,7 @@ Lighteval supports **7,000+ evaluation tasks** across multiple domains and langu

### 🌍 **Multilingual Evaluation**
- **Cross-lingual**: XTREME, Flores200 (200 languages), XCOPA, XQuAD
- **Language-specific**:
- **Language-specific**:
- **Arabic**: ArabicMMLU
- **Filipino**: FilBench
- **French**: IFEval-fr, GPQA-fr, BAC-fr
@@ -71,6 +77,7 @@ Lighteval supports **7,000+ evaluation tasks** across multiple domains and langu
- **Turkic**: TUMLU (9 Turkic languages)
- **Chinese**: CMMLU, CEval, AGIEval
- **Russian**: RUMMLU, Russian SQuAD
- **Kyrgyz**: Kyrgyz LLM Benchmark
- **And many more...**

### 🧠 **Core Language Understanding**
Expand All @@ -94,13 +101,14 @@ If you want to push results to the **Hugging Face Hub**, add your access token a
an environment variable:

```shell
huggingface-cli login
hf auth login
```
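For non-interactive environments such as CI, the token can also be supplied via the `HF_TOKEN` environment variable read by `huggingface_hub` — a minimal sketch, with a placeholder value:

```shell
# Non-interactive alternative to `hf auth login`:
# export the token directly (the value below is a placeholder).
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"
```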

## 🚀 Quickstart

Lighteval offers the following entry points for model evaluation:

- `lighteval eval`: Evaluate models using [inspect-ai](https://inspect.aisi.org.uk/) as a backend (preferred).
- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
Accelerate](https://github.com/huggingface/accelerate)
- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️
@@ -117,12 +125,10 @@ Lighteval offers the following entry points for model evaluation:
Did not find what you need? You can always create a custom model API by following [this guide](https://huggingface.co/docs/lighteval/main/en/evaluating-a-custom-model)
- `lighteval custom`: Evaluate custom models (can be anything)

Here's a **quick command** to evaluate using the *Accelerate backend*:
Here's a **quick command** to evaluate using a remote inference service:

```shell
lighteval accelerate \
"model_name=gpt2" \
"leaderboard|truthfulqa:mc|0"
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
```

Or use the **Python API** to run a model *already loaded in memory*!
Expand All @@ -136,7 +142,7 @@ from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "lighteval|gsm8k|0"
BENCHMARKS = "gsm8k"

evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
@@ -181,7 +187,12 @@ If you're adding a **new feature**, please *open an issue first*.
If you open a PR, don't forget to **run the styling**!

```bash
pip install -e .[dev]
# For basic development (code quality, tests)
pip install -e ".[dev]"

# Or for GPU/vllm development and slow tests
pip install -e ".[dev-gpu]"

pre-commit install
pre-commit run --all-files
```