@fxmarty-amd (Contributor)

As per title.

Fixes #3404 (GPQA strict-match regex pattern does not match the fewshot response template); see the context there.

Running

```shell
LOGLEVEL=debug CUDA_VISIBLE_DEVICES=0 lm_eval \
  --model hf \
  --model_args '{"pretrained":"/models/openai_gpt-oss-20b","dtype":"auto","chat_template_args":{"reasoning_effort":"low"},"enable_thinking": true,"think_end_token":200008}' \
  --device "cuda" \
  --gen_kwargs max_gen_toks=4048 \
  --tasks gpqa_diamond_generative_n_shot \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --log_samples \
  --output_path debug_gpqa \
  --num_fewshot 5 \
  --batch_size 16
```

on main gives:

|            Tasks             |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_generative_n_shot|      2|flexible-extract|     5|exact_match|↑  |0.5455|±  |0.0355|
|                              |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|

and with this fix gives:

|            Tasks             |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_generative_n_shot|      2|flexible-extract|     5|exact_match|↑  |0.5303|±  |0.0356|
|                              |       |strict-match    |     5|exact_match|↑  |0.5000|±  |0.0356|

NOTE: I also modified the task description, as I noticed that certain models (e.g. gpt-oss-20b) do not necessarily follow the format of the previous questions exactly, even though the n-shot answer format itself is correct.
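
To make the failure mode concrete, here is a minimal sketch of the difference between the two filters (the regexes below are hypothetical stand-ins for illustration, not the actual patterns used by the harness): a strict-match filter anchors on the full answer template from the few-shot examples, so a completion that deviates from that wording scores 0, while a flexible-extract filter still recovers the answer letter.

```python
import re

# Hypothetical patterns for illustration only -- not the exact regexes from
# lm-evaluation-harness. "Strict" anchors on the full answer template;
# "flexible" grabs any answer-like letter in parentheses.
STRICT = re.compile(r"(?<=The answer is )\(([A-D])\)")
FLEXIBLE = re.compile(r"\(([A-D])\)")

# A model completion that deviates slightly from the few-shot template
# (no "The answer is" prefix), as described in the note above.
completion = "Based on the options, the correct choice is (B)."

print(STRICT.search(completion))   # None -> strict-match scores 0
m = FLEXIBLE.search(completion)
print(m.group(1) if m else None)   # 'B' -> flexible-extract still scores
```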

cc @baberabb
