GPQA strict-match regex pattern does not match the fewshot response template #3404

@fxmarty-amd

Hi,

As per title.

In

description: "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n"
doc_to_text: "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nLet's think step by step: "
doc_to_target: answer
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "(?<=The answer is )(.*)(?=.)"
      - function: "take_first"
and
description: "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n"
doc_to_text: "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:"
doc_to_target: answer
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "(?<=The answer is )(.*)(?=.)"
      - function: "take_first"
we have the instruction "Answer the final question yourself, following the format of the previous questions exactly", while the expected output format is given by regex_pattern: "(?<=The answer is )(.*)(?=.)".

However, when using e.g. --num_fewshot 5, the answers in the prompt are formatted as follows:

[Image: screenshot of the rendered few-shot prompt]

This is not the format the regex expects, since "The answer is" is missing. As a result, the few-shot examples do not steer the model toward the format the regex matches, and the strict-match score ends up at 0.

Only flexible-extract is decent.
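
To illustrate the mismatch, here is a minimal sketch that applies the same pattern with Python's re module. It is not lm-eval's own filter code, and the helper name strict_match is hypothetical. The bare "(B)" string is an assumption about what a few-shot answer looks like when doc_to_target is just the answer; the screenshot above is the authoritative reference.

```python
import re

# Same pattern as regex_pattern in the task configs quoted above.
STRICT_MATCH = re.compile(r"(?<=The answer is )(.*)(?=.)")


def strict_match(response):
    """Return the captured answer, or None when the pattern does not match.

    Hypothetical helper for illustration only; lm-eval's regex filter
    may post-process matches differently.
    """
    match = STRICT_MATCH.search(response)
    return match.group(1) if match else None


# A response in the format the regex expects.
print(strict_match("The answer is (B)."))  # (B)

# Assumed few-shot style answer: no "The answer is " prefix, so no match.
print(strict_match("(B)"))  # None

# The trailing (?=.) accepts any character, not only a literal period,
# so a response without the final "." loses its last character.
print(strict_match("The answer is (B)"))  # (B
```

If the few-shot targets really look like the second case, a model that follows them exactly can never satisfy strict-match, which would be consistent with the score of 0 reported here.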

One can reproduce with:

CUDA_VISIBLE_DEVICES=0 nohup lm_eval \
  --model hf \
  --model_args '{"pretrained":"openai/gpt-oss-20b","dtype":"auto","chat_template_args":{"reasoning_effort":"low"},"enable_thinking": true}' \
  --device "cuda" \
  --gen_kwargs max_gen_toks=4048 \
  --tasks gpqa_diamond_generative_n_shot \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --limit 1 \
  --num_fewshot 5 \
  --batch_size 1

Thank you!
