Changes from all commits
68 commits
fd58a24
Merge branch
Oligou Oct 14, 2025
80fb9cd
skip task if no documents
Oligou Oct 16, 2025
acd19f1
Change default use_chat_template when loading the tokenizer fails
Jeronymous Oct 23, 2025
3cc6315
Take HF_HOME env variable into account (if set)
Jeronymous Oct 23, 2025
f0f7162
Fix MGSM evals
Jeronymous Oct 28, 2025
df19f29
fix reshape bug
Jeronymous Oct 31, 2025
646d657
Remove padding from response
Jeronymous Oct 31, 2025
8c07847
add ruler metric and prompt
Oligou Nov 20, 2025
ed1718b
Add RULER in metrics
Oligou Nov 20, 2025
58d0ccf
make FLORES translation benchmark work with datasets v2 (parquet vers…
Jeronymous Dec 9, 2025
1deed74
Fix possible failure around stop_sequences
Jeronymous Dec 12, 2025
769a575
Fix failure reported in https://github.com/huggingface/lighteval/issu…
Jeronymous Dec 12, 2025
2d001dd
Do not use GPT as a judge
Jeronymous Dec 12, 2025
e7069e2
Fix IFBench subset
Jeronymous Dec 12, 2025
628d2b0
Fix IFEval-fr dataset repo
Jeronymous Dec 12, 2025
2d1f146
limit the model length to avoid error "ValueError: The model's max se…
Jeronymous Dec 15, 2025
b7cf5ff
make cache string independent of function random address
Jeronymous Dec 15, 2025
9436e15
Do not take version of transformers that is buggy w.r.t. OFFLINE behaviour
Jeronymous Dec 15, 2025
4c9e90c
Fix use of sets in eval code
Jeronymous Dec 15, 2025
bc164c1
Fix corner case
Jeronymous Dec 15, 2025
cb2da29
Misc fixes in RULER evaluation
Jeronymous Dec 16, 2025
82805ab
Change the code to make it work with more recent versions of vllm
Jeronymous Dec 18, 2025
41dec9a
Fix vllm call in LLM as a judge
Jeronymous Dec 18, 2025
2e968b2
Fix error in logprob computation with vllm >= 0.12, because of prefix…
Jeronymous Jan 6, 2026
d9af025
Fix GPQA-French benchmark (original dataset cannot be found anymore, …
Jeronymous Jan 20, 2026
a7e4591
Fix for Mistral tokenizer, that does not have eos_token attribute (bu…
Jeronymous Jan 20, 2026
45ba41e
Fix corner cases
Jeronymous Jan 20, 2026
9ba96b0
Fix corner case on IFBench
Jeronymous Feb 11, 2026
e74e9c0
override max_position_embedding with max_length passed by the user, t…
Jeronymous Feb 11, 2026
ddce778
add COMET and MetricX metrics to lighteval
Jeronymous Feb 17, 2026
b8532b6
Add COMET and MetricX to FLORES benchmarks
Jeronymous Feb 17, 2026
48ee2dc
Add new dependencies
Jeronymous Feb 17, 2026
9730191
COMET/MetricX : add options for device and batch size
Jeronymous Feb 17, 2026
cb1d040
Fix MetricX
Jeronymous Feb 17, 2026
be22ae1
Fix serialization of metric
Jeronymous Feb 17, 2026
121c6a2
Fix corner case
Jeronymous Feb 18, 2026
ee69f36
Merge pull request #1 from OpenLLM-France/comet
Jeronymous Feb 18, 2026
aafd3db
Fix mix of data and pipeline parallelism
Jeronymous Feb 19, 2026
e3fd675
Add support of context parallelism for versions of VLLM that support …
Jeronymous Feb 20, 2026
637d2ef
remove unnecessary deps (already there)
Jeronymous Feb 20, 2026
1167c70
Merge pull request #2 from OpenLLM-France/parallelism
Jeronymous Feb 20, 2026
33968ce
fix corner case
Jeronymous Mar 2, 2026
7ab7fa0
tune generation_size for math tasks
Jeronymous Mar 2, 2026
6a5c942
larger limit for gsm_plus
Jeronymous Mar 2, 2026
e8ac11b
add an option enable_thinking
Jeronymous Mar 4, 2026
0d59c8d
Add MathAlea benchmark for French math multiple-choice evaluation
Lduignan1 Feb 17, 2026
7859993
Fix gold index retrieval in prompt_mathalea function
Lduignan1 Feb 18, 2026
3354541
Update MathAlea metadata with detailed description, language, and tags
Lduignan1 Mar 6, 2026
e372a0f
Fix dataset reference in MathAlea metadata
Lduignan1 Mar 6, 2026
d42f5fd
Refactor MathAlea dataset configuration and prompt generation functions
Lduignan1 Mar 11, 2026
ce6848f
add system prompts in french and english
Lduignan1 Mar 23, 2026
1db696e
Make GPQA-fr a generative benchmark, not a MCQ
Jeronymous Apr 7, 2026
2d55527
Implement MMLU pro eval, with generative style (for instruct models)
Jeronymous Apr 8, 2026
91d9639
Merge pull request #3 from Lduignan1/mathalea
Jeronymous Apr 9, 2026
02757f7
Add Red Teaming benchmark based on AvgBench
Jeronymous Apr 9, 2026
7138a21
Allow to have non-numeric results (ex: judge textual output, for details
Jeronymous Apr 9, 2026
280f450
Make results deterministic. Add the judgement in the details
Jeronymous Apr 9, 2026
8d5c991
Also add another judgement where the judge does not see the question
Jeronymous Apr 9, 2026
da058f2
Add possibility to avoid running evaluation
Jeronymous Apr 22, 2026
481d9bd
Merge pull request #4 from OpenLLM-France/advbench
Jeronymous Apr 22, 2026
d1cf663
Merge upstream huggingface/lighteval main into merge_hf_main
Jeronymous Apr 22, 2026
180975c
Fix ruff style and lint after merge
Jeronymous Apr 22, 2026
2466d64
Solve version incompatibility in project install
Jeronymous Apr 22, 2026
68494ca
less differences with the upstream branch
Jeronymous Apr 22, 2026
9ca1f4b
Add copyright
Jeronymous Apr 22, 2026
6ee2a9e
less differences with the upstream branch
Jeronymous Apr 22, 2026
d9fe736
do not build doc on fork
Jeronymous Apr 22, 2026
379ed71
Add safety / red-teaming benchmarks
Jeronymous Apr 22, 2026
1 change: 1 addition & 0 deletions .github/workflows/doc-build.yml
100644 → 100755
@@ -9,6 +9,7 @@ on:

jobs:
build:
if: github.repository == 'huggingface/lighteval'
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
with:
commit_sha: ${{ github.sha }}
1 change: 1 addition & 0 deletions .github/workflows/doc-pr-build.yml
100644 → 100755
@@ -9,6 +9,7 @@ concurrency:

jobs:
build:
if: github.repository == 'huggingface/lighteval'
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
1 change: 1 addition & 0 deletions .github/workflows/doc-pr-upload.yml
100644 → 100755
@@ -8,6 +8,7 @@ on:

jobs:
build:
if: github.repository == 'huggingface/lighteval'
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@9ad2de8582b56c017cb530c1165116d40433f1c6 # main
with:
package_name: lighteval
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -125,6 +125,9 @@ multilingual = [
"pyvi", # for vietnamese tokenizer
]
math = ["latex2sympy2_extended==1.0.6"]
# Disabled: unbabel-comet pins numpy<2 (all versions through 2.2.7), which conflicts with the base numpy>=2 pin.
# To use the COMET metric, install unbabel-comet manually
# translation = ["unbabel-comet>=2.2.0"]
wandb = ["wandb"]
trackio = ["trackio"]

4 changes: 3 additions & 1 deletion src/lighteval/logging/info_loggers.py
@@ -343,7 +343,9 @@ def aggregate(self, task_dict: dict[str, LightevalTask], bootstrap_iters: int =
# The metric is in a subset which has already been computed and saved
continue

aggregation = task.aggregation()[metric_name]
aggregation = task.aggregation().get(metric_name)
if aggregation is None:
continue

try:
metric_result = aggregation(metric_values)
57 changes: 57 additions & 0 deletions src/lighteval/metrics/imports/metricx_model.py
@@ -0,0 +1,57 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""MetricX model wrapper using MT5ForConditionalGeneration from transformers.

Instead of vendoring the custom MT5ForRegression class (which has compatibility
issues with newer transformers versions), we load the weights into the standard
MT5ForConditionalGeneration model and extract the regression prediction
(logit at vocab position 250089, clamped to [0, 25]) in the same way MetricX does.
"""

import torch
from transformers import MT5ForConditionalGeneration


class MetricXModel:
"""Wrapper that loads a MetricX checkpoint and performs regression inference."""

def __init__(self, model_name: str, device: str = "cpu"):
self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
self.model.to(device)
self.model.eval()
self.device = device

def predict(self, input_ids: torch.LongTensor, attention_mask: torch.LongTensor) -> torch.FloatTensor:
"""Run MetricX regression inference.

Args:
input_ids: Tokenized input (batch, seq_len), with EOS already removed.
attention_mask: Attention mask (batch, seq_len), with EOS already removed.

Returns:
Prediction scores (batch,), clamped to [0, 25]. Lower is better.
"""
batch_size = input_ids.size(0)
decoder_input_ids = torch.zeros(batch_size, 1, dtype=torch.long, device=self.device)

with torch.no_grad():
output = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
decoder_input_ids=decoder_input_ids,
)

# 250089 = <extra_id_10>, the token MetricX uses for regression output
predictions = output.logits[:, 0, 250089]
return torch.clamp(predictions, 0, 25)
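For context, a minimal usage sketch of this wrapper. It mirrors how the MetricXMetric class in metrics_sample.py (further down in this diff) drives the model: the tokenizer name, the packed input format and the 1024-token truncation are taken from that class, and the example sentences are purely illustrative.

```python
# Illustrative sketch: drive MetricXModel the same way MetricXMetric does.
from transformers import AutoTokenizer

from lighteval.metrics.imports.metricx_model import MetricXModel

tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")
model = MetricXModel("google/metricx-24-hybrid-large-v2p6", device="cpu")

# MetricX expects candidate, reference and source packed into a single string.
text = "candidate: Bonjour le monde reference: Bonjour, monde source: Hello, world"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

# Strip the EOS token appended by the tokenizer before running the regression head.
input_ids = inputs["input_ids"][:, :-1]
attention_mask = inputs["attention_mask"][:, :-1]

score = model.predict(input_ids, attention_mask).item()
print(score)  # in [0, 25], lower is better
```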
32 changes: 31 additions & 1 deletion src/lighteval/metrics/metrics.py
@@ -42,9 +42,11 @@
BLEURT,
MRR,
ROUGE,
RULER,
AccGoldLikelihood,
AvgAtN,
BertScore,
COMETMetric,
ExactMatches,
Extractiveness,
F1_score,
@@ -53,6 +55,7 @@
JudgeLLMSimpleQA,
LoglikelihoodAcc,
MajAtN,
MetricXMetric,
PassAtK,
Recall,
StringDistance,
@@ -207,7 +210,6 @@ class Metrics(Enum):
corpus_level_fn=np.mean,
higher_is_better=True,
)

bleurt = SampleLevelMetric(
metric_name="bleurt",
sample_level_fn=BLEURT(),
@@ -236,6 +238,13 @@ class Metrics(Enum):
corpus_level_fn=CorpusLevelTranslationMetric("chrf++"),
higher_is_better=True,
)
comet = SampleLevelMetric(
metric_name="comet",
sample_level_fn=COMETMetric(),
category=SamplingMethod.GENERATIVE,
corpus_level_fn=np.mean,
higher_is_better=True,
)
copyright = SampleLevelMetricGrouping(
metric_name=["longest_common_prefix_length", "edit_distance", "edit_similarity"],
sample_level_fn=StringDistance(
@@ -445,6 +454,13 @@ class Metrics(Enum):
corpus_level_fn=MatthewsCorrCoef(),
higher_is_better=True,
)
metricx = SampleLevelMetric(
metric_name="metricx",
sample_level_fn=MetricXMetric(),
category=SamplingMethod.GENERATIVE,
corpus_level_fn=np.mean,
higher_is_better=False,
)
mrr = SampleLevelMetric(
metric_name="mrr",
sample_level_fn=MRR(),
@@ -550,6 +566,20 @@ class Metrics(Enum):
corpus_level_fn=np.mean,
higher_is_better=True,
)
ruler_match_any = SampleLevelMetric(
metric_name="ruler_match",
sample_level_fn=RULER("any"),
category=SamplingMethod.GENERATIVE,
corpus_level_fn=np.mean,
higher_is_better=True,
)
ruler_match_all = SampleLevelMetric(
metric_name="ruler_match",
sample_level_fn=RULER("all"),
category=SamplingMethod.GENERATIVE,
corpus_level_fn=np.mean,
higher_is_better=True,
)
simpleqa_judge = SampleLevelMetricGrouping(
metric_name=["simpleqa_judge"],
higher_is_better={"simpleqa_judge": True},
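The two new RULER entries only differ in how multiple gold answers are aggregated ("any" vs "all"). The sample-level logic, defined in metrics_sample.py below, reduces to a case-insensitive substring check over the golds; here is a standalone sketch of that rule (it does not call the lighteval classes, and the example strings are illustrative):

```python
def ruler_score(golds: list[str], prediction: str, aggregation: str = "any") -> float:
    # Mirrors RULER.compute: case-insensitive substring match of each gold in the prediction.
    hits = [1.0 if gold.lower() in prediction.lower() else 0.0 for gold in golds]
    if aggregation == "any":
        return max(hits)  # 1.0 as soon as one gold is retrieved
    return sum(hits) / len(hits)  # fraction of golds retrieved ("all")


# Two needles, only one retrieved by the model:
print(ruler_score(["alpha", "bravo"], "The passage mentions ALPHA only.", "any"))  # 1.0
print(ruler_score(["alpha", "bravo"], "The passage mentions ALPHA only.", "all"))  # 0.5
```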
152 changes: 151 additions & 1 deletion src/lighteval/metrics/metrics_sample.py
@@ -71,7 +71,7 @@ def __str__(self):
attr_strs = []
for k, v in attrs.items():
if callable(v):
val_str = v.__name__
val_str = getattr(v, "__name__", type(v).__name__)
else:
val_str = str(v)
attr_strs.append(f"{k}={val_str}")
@@ -762,6 +762,39 @@ def compute(self, doc: Doc, model_response: ModelResponse, **kwargs) -> dict[str
return self.summac.score_one(inp, prediction)["score"]


class RULER(SampleLevelComputation):
def __init__(
self,
aggregation_method="any",
):
"""RULER exact match class.

Args:
aggregation_method (str, optional): Method to aggregate multiple golds. Can be 'any' or 'all'. Defaults to 'any'.
"""
if aggregation_method not in ["any", "all"]:
raise ValueError(f"aggregation_method must be one of 'any' or 'all'. Was {aggregation_method} instead.")
self.aggregation_method = aggregation_method

def compute(self, doc: Doc, model_response: ModelResponse, **kwargs) -> float:
"""Computes the metric over a list of golds and predictions for one single sample.

Args:
doc (Doc): The document containing gold references.
model_response (ModelResponse): The model's response containing predictions.
**kwargs: Additional keyword arguments.

Returns:
float: Aggregated score over the current sample's items.
"""
golds = doc.get_golds()
predictions = model_response.final_text
if self.aggregation_method == "any":
return max(1.0 if r.lower() in predictions[0].lower() else 0.0 for r in golds)
elif self.aggregation_method == "all":
return sum(1.0 if r.lower() in predictions[0].lower() else 0.0 for r in golds) / len(golds)


class BLEURT(SampleLevelComputation):
def __init__(self):
"""Creates a BLEURT scorer using a light bleurt-tiny-512 model.
@@ -1454,3 +1487,120 @@ def metric_names(self):

def num_samples(self):
return self.n if self.n is not None else self.k


class COMETMetric(SampleLevelComputation):
def __init__(
self,
model_name: str = "Unbabel/wmt22-comet-da",
source_column: str = "source",
batch_size: int = 8,
gpus: int = 0,
accelerator: str = "cpu",
):
"""COMET metric for machine translation evaluation.

Args:
model_name (str): Name of the COMET model to use.
source_column (str): Key in doc.specific containing the source text.
batch_size (int): Batch size for COMET model inference.
gpus (int): Number of GPUs to use (0 for CPU-only).
accelerator (str): Accelerator to use ("cpu" or "cuda"). MPS is not supported.
"""
if accelerator == "mps":
raise ValueError("MPS is not supported for COMET")

self.model_name = model_name
self.source_column = source_column
self.batch_size = batch_size
self.gpus = gpus
self.accelerator = accelerator
self._model = None

def compute(self, doc: Doc, model_response: ModelResponse, **kwargs) -> float:
"""Computes the COMET score for a single translation.

Args:
doc (Doc): The document containing gold references and source text in doc.specific.
model_response (ModelResponse): The model's response containing predictions.
**kwargs: Unused; kept for compatibility with the metric compute signature.

Returns:
float: COMET score scaled to 0-100 (higher is better).
"""
if self._model is None:
from comet import download_model, load_from_checkpoint

logger.info(f"Loading COMET model {self.model_name}...")
model_path = download_model(self.model_name)
self._model = load_from_checkpoint(model_path)

source = doc.specific[self.source_column]
prediction = model_response.final_text[0]
reference = doc.get_golds()[0]

data = [{"src": source, "mt": prediction, "ref": reference}]
output = self._model.predict(
data,
batch_size=self.batch_size,
gpus=self.gpus,
accelerator=self.accelerator,
)
return output.scores[0] * 100


class MetricXMetric(SampleLevelComputation):
def __init__(
self,
model_name: str = "google/metricx-24-hybrid-large-v2p6",
tokenizer_name: str = "google/mt5-large",
source_column: str = "source",
batch_size: int = 8,
device: str = "cpu",
):
"""MetricX metric for machine translation evaluation.

Args:
model_name (str): Name of the MetricX model to use.
tokenizer_name (str): Name of the tokenizer to use.
source_column (str): Key in doc.specific containing the source text.
batch_size (int): Batch size for tokenization.
device (str): Device to run inference on ("cpu", "cuda").
"""
self.model_name = model_name
self.tokenizer_name = tokenizer_name
self.source_column = source_column
self.batch_size = batch_size
self.device = device
self._model = None
self._tokenizer = None

def compute(self, doc: Doc, model_response: ModelResponse, **kwargs) -> float:
"""Computes the MetricX score for a single translation.

Args:
doc (Doc): The document containing gold references and source text in doc.specific.
model_response (ModelResponse): The model's response containing predictions.
**kwargs: Unused; kept for compatibility with the metric compute signature.

Returns:
float: MetricX score (lower is better, typically 0-25).
"""
if self._model is None:
from lighteval.metrics.imports.metricx_model import MetricXModel

logger.info(f"Loading MetricX model {self.model_name}...")
self._model = MetricXModel(self.model_name, device=self.device)
self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name)

source = doc.specific[self.source_column]
prediction = model_response.final_text[0]
reference = doc.get_golds()[0]

input_text = f"candidate: {prediction} reference: {reference} source: {source}"
inputs = self._tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
# MetricX requires removing the EOS token appended by the tokenizer
input_ids = inputs["input_ids"][:, :-1].to(self.device)
attention_mask = inputs["attention_mask"][:, :-1].to(self.device)

return self._model.predict(input_ids, attention_mask).item()
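For reference, a minimal sketch of the underlying unbabel-comet calls that COMETMetric wraps (the package must be installed manually, as noted in the pyproject.toml change above; the example sentences are illustrative):

```python
# Illustrative sketch of the comet API used by COMETMetric.
# Requires a manual `pip install unbabel-comet` (see the pyproject.toml note above).
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Hello, world",      # source text (doc.specific["source"])
    "mt": "Bonjour le monde",   # model translation (model_response.final_text[0])
    "ref": "Bonjour, monde",    # gold reference (doc.get_golds()[0])
}]

output = model.predict(data, batch_size=8, gpus=0, accelerator="cpu")
print(output.scores[0] * 100)  # COMETMetric reports the 0-1 score scaled to 0-100
```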