[EVALS] gpt-5.4-nano reports ~3% accuracy in SciFact tutorial due to fragile prompt template + extract_solution regex #258

@kamran-rapidfireAI

Description

TL;DR: the prompt template and the answer-extraction regex are a fragile combination. gpt-5.4-nano falls into a trap that gemini-2.5-flash-lite mostly avoids, so end users are misled into thinking gpt-5.4-nano is broken or unsuitable for the task, when in reality the model is producing correct answers that the extractor throws away.

The tutorial_notebooks/rag-contexteng/rf-tutorial-scifact-full-evaluation.ipynb notebook ships with api_config=List([openai_config, gemini_config]) where openai_config targets gpt-5.4-nano. When the notebook runs out of the box, the OpenAI-driven pipelines report an Accuracy of roughly 0.02–0.04, while the gemini-driven pipelines on the same queries with the same retrieval report ~0.66–0.73.

This is misleading. An end user looking at the result table would reasonably conclude that gpt-5.4-nano is unsuitable for the task, or that something is broken. In fact the model is producing correct answers; they are lost between the prompt template and the answer-extraction regex, a trap that gemini-2.5-flash-lite mostly avoids.

Potential root cause

The INSTRUCTIONS system prompt contains:

You will output your final answer after reasoning through the evidence. The final answer should be one of the three options and should be formatted as follows:

Reasoning for the answer #### ANSWER

Here is an example: ... Response: ... the claim is contradicted. #### CONTRADICT

The literal phrase #### ANSWER is meant as a placeholder, but gpt-5.4-nano interprets it as part of the expected output and copies it verbatim, then writes the real verdict on a separate line afterward.

Live reproduction against the OpenAI API with one SciFact claim and the same INSTRUCTIONS:

Reasoning for the answer #### ANSWER

The evidence states that CK4 and CK13 are broadly expressed in normal
esophageal mucosa but are markedly decreased in ESCC... This supports
their use as biomarkers for ESCC...

#### ANSWER  
SUPPORT

The notebook's extractor is:

def extract_solution(answer):
    solution = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)", answer, re.IGNORECASE)
    if solution is None:
        return "INVALID"
    return solution.group(1).upper()

The regex looks for #### followed (after optional whitespace) by one of the three verdict words. In the output above, both occurrences of #### are followed by the literal word ANSWER, and the real verdict only appears on the line after the second one. Because ANSWER is not in the allowed set, re.search finds no match anywhere in the reply, extract_solution returns "INVALID", and every gpt-5.4-nano reply that follows this template-copy pattern is graded wrong regardless of whether the verdict at the end was correct.
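The failure can be checked without calling any API. The sketch below copies the notebook's extractor and runs it on an abbreviated version of the reply reproduced above, on the `#### ANSWER: CONTRADICT` shape, and on a well-formed reply. The variable names and the shortened reply strings are illustrative, not taken from the notebook.

```python
import re

def extract_solution(answer):
    # Extractor as shipped in the notebook
    solution = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)", answer, re.IGNORECASE)
    if solution is None:
        return "INVALID"
    return solution.group(1).upper()

# Abbreviated copy of the gpt-5.4-nano reply shown above (reasoning shortened)
template_copy = (
    "Reasoning for the answer #### ANSWER\n\n"
    "The evidence states that CK4 and CK13 are broadly expressed ...\n\n"
    "#### ANSWER  \n"
    "SUPPORT\n"
)
# Shape seen in the spot-check, where "ANSWER:" sits before the verdict
answer_colon = "... the claim is contradicted. #### ANSWER: CONTRADICT"
# Shape the prompt example intends
well_formed = "... the claim is contradicted. #### CONTRADICT"

print(extract_solution(template_copy))  # INVALID
print(extract_solution(answer_colon))   # INVALID
print(extract_solution(well_formed))    # CONTRADICT
```

Only the reply that places the verdict word directly after #### is extracted; the other two are discarded even though each contains a verdict.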

A spot-check shows that even gemini-2.5-flash-lite occasionally falls into the same pattern, producing #### ANSWER: CONTRADICT, which also fails the regex because "ANSWER:" sits between #### and the verdict word. Gemini just falls into the trap less often, so its measured Accuracy is much higher; it is not actually immune.

Impact on end users

Anyone running this tutorial unchanged will see a result table where the OpenAI pipelines look near-random and the Gemini pipelines look competent. The natural reading is "the choice of generator matters enormously" or "gpt-5.4-nano is broken on this task". The actual story is invisible: the prompt template confuses some models into copying a placeholder, and the extractor is too strict to recover. This undermines the credibility of the tutorial's apparent finding and risks misleading users who take this notebook as a starting template for their own evaluations.

Suggested fix (either is sufficient; both are stronger)

A — make extract_solution more forgiving so it falls back to any trailing verdict word:

import re

def extract_solution(answer):
    # Preferred form: "####" directly followed by a verdict word
    m = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)\b", answer, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: take the last verdict-word that appears anywhere in the reply
    matches = re.findall(r"\b(SUPPORT|CONTRADICT|NOINFO)\b", answer, re.IGNORECASE)
    if matches:
        return matches[-1].upper()
    return "INVALID"
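One trade-off of the fallback in A is that it matches a verdict word anywhere in the reply, so a sentence such as "the evidence does not contradict the claim" in the reasoning could be read as a verdict when no final verdict is given. A more conservative variant, sketched here as an alternative and not taken from the notebook, limits the fallback to the last non-empty line. The name extract_solution_last_line is hypothetical.

```python
import re

VERDICTS = r"(SUPPORT|CONTRADICT|NOINFO)"

def extract_solution_last_line(answer):
    # Preferred form: "####" directly followed by a verdict word
    m = re.search(r"####\s*" + VERDICTS + r"\b", answer, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: only inspect the last non-empty line of the reply
    lines = [ln.strip() for ln in answer.splitlines() if ln.strip()]
    if lines:
        matches = re.findall(r"\b" + VERDICTS + r"\b", lines[-1], re.IGNORECASE)
        if matches:
            return matches[-1].upper()
    return "INVALID"

# Template-copy shape: verdict on its own line after "#### ANSWER"
print(extract_solution_last_line("Reasoning ...\n\n#### ANSWER  \nSUPPORT\n"))  # SUPPORT
# "ANSWER:" between "####" and the verdict
print(extract_solution_last_line("... #### ANSWER: CONTRADICT"))  # CONTRADICT
# Verdict word only inside the reasoning, no final verdict
print(extract_solution_last_line("The evidence does not contradict the claim.\nI cannot decide."))  # INVALID
```

This recovers both failure shapes described above while still returning "INVALID" for replies that never state a final verdict. Whether that stricter behavior is preferable depends on how often models end with a trailing remark after the verdict.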

B — rewrite the prompt example so the format description doesn't contain the literal token ANSWER adjacent to #### (which is what models are copying). For example:

End your reply with the final verdict on its own line, prefixed with ####.
Example: #### CONTRADICT

Either fix should bring gpt-5.4-nano's Accuracy into the same range as gemini-2.5-flash-lite and remove the misleading model-comparison story from the result table.

Reproduce

rapidfireai==0.16.0rc3, evals-mode install. Run the notebook with OPENAI_API_KEY and GOOGLE_API_KEY exported. The results_df will show ~0.03 Accuracy for the four gpt-5.4-nano rows and ~0.70 Accuracy for the four gemini-2.5-flash-lite rows, despite (near-)identical retrieval metrics across matching (embedding, search) pairs.
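To tell extractor failures apart from genuinely wrong verdicts, it may help to tally how many raw replies the shipped extractor maps to "INVALID". The sketch below assumes the raw generator replies are available as a plain list of strings; how to obtain them from the notebook's results is not covered here, and verdict_breakdown and the sample replies are hypothetical.

```python
import re
from collections import Counter

def extract_solution(answer):
    # Extractor as shipped in the notebook
    solution = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)", answer, re.IGNORECASE)
    if solution is None:
        return "INVALID"
    return solution.group(1).upper()

def verdict_breakdown(replies):
    # Count extracted labels and report the share of replies graded INVALID
    counts = Counter(extract_solution(r) for r in replies)
    total = sum(counts.values())
    invalid_rate = counts["INVALID"] / total if total else 0.0
    return counts, invalid_rate

# Hypothetical sample: two template-copy replies, one "ANSWER:" reply, one well-formed reply
sample_replies = [
    "Reasoning for the answer #### ANSWER\n\n...\n\n#### ANSWER  \nSUPPORT",
    "Reasoning for the answer #### ANSWER\n\n...\n\n#### ANSWER  \nNOINFO",
    "... #### ANSWER: CONTRADICT",
    "... the claim is contradicted. #### CONTRADICT",
]

counts, invalid_rate = verdict_breakdown(sample_replies)
print(counts["INVALID"], invalid_rate)  # 3 0.75
```

A high "INVALID" share on the gpt-5.4-nano rows would be consistent with the template-copy explanation; a low share would point to a different cause.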
