@fxmarty-amd (Contributor)

As per title.

Fixes #3404 (GPQA strict-match regex pattern does not match the fewshot response template); see the context there.

Running

```shell
LOGLEVEL=debug CUDA_VISIBLE_DEVICES=0 lm_eval \
  --model hf \
  --model_args '{"pretrained":"/models/openai_gpt-oss-20b","dtype":"auto","chat_template_args":{"reasoning_effort":"low"},"enable_thinking": true,"think_end_token":200008}' \
  --device "cuda" \
  --gen_kwargs max_gen_toks=4048 \
  --tasks gpqa_diamond_generative_n_shot \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --log_samples \
  --output_path debug_gpqa \
  --num_fewshot 5 \
  --batch_size 16
```

on main gives:

|            Tasks             |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_generative_n_shot|      2|flexible-extract|     5|exact_match|↑  |0.5455|±  |0.0355|
|                              |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|

and with this fix gives:

|            Tasks             |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_generative_n_shot|      2|flexible-extract|     5|exact_match|↑  |0.5303|±  |0.0356|
|                              |       |strict-match    |     5|exact_match|↑  |0.5000|±  |0.0356|

NOTE: I also modified the task description, as I noticed that certain models (e.g. gpt-oss-20b) do not necessarily follow the format of the previous questions exactly, even though the n-shot answer format itself is correct.
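
To make the failure mode concrete, here is a minimal sketch of the difference between the two filters (the regexes below are hypothetical stand-ins for illustration, not the actual patterns used by the harness): a strict-match filter anchors on the full answer template from the few-shot examples, so a completion that deviates from that wording scores 0, while a flexible-extract filter still recovers the answer letter.

```python
import re

# Hypothetical patterns for illustration only -- not the exact regexes from
# lm-evaluation-harness. "Strict" anchors on the full answer template;
# "flexible" grabs any answer-like letter in parentheses.
STRICT = re.compile(r"(?<=The answer is )\(([A-D])\)")
FLEXIBLE = re.compile(r"\(([A-D])\)")

# A model completion that deviates slightly from the few-shot template
# (no "The answer is" prefix), as described in the note above.
completion = "Based on the options, the correct choice is (B)."

print(STRICT.search(completion))   # None -> strict-match scores 0
m = FLEXIBLE.search(completion)
print(m.group(1) if m else None)   # 'B' -> flexible-extract still scores
```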

cc @baberabb
