[EVALS] gpt-5.4-nano reports ~3% accuracy in SciFact tutorial due to fragile prompt template + extract_solution regex

> **TL;DR: the prompt template and the answer-extraction regex form a fragile combination. `gpt-5.4-nano` falls into a trap that `gemini-2.5-flash-lite` mostly avoids, so the result table suggests `gpt-5.4-nano` is broken or unsuitable for the task, when in reality the model is producing correct answers that the extractor throws away.**

The `tutorial_notebooks/rag-contexteng/rf-tutorial-scifact-full-evaluation.ipynb` notebook ships with `api_config=List([openai_config, gemini_config])`, where `openai_config` targets `gpt-5.4-nano`. When the notebook runs out of the box, the OpenAI-driven pipelines report an `Accuracy` of roughly **0.02–0.04**, while the Gemini-driven pipelines report roughly **0.66–0.73** on the same queries with the same retrieval.

An end user looking at the result table would reasonably conclude that `gpt-5.4-nano` is unsuitable for the task, or that something is broken. In fact the model is producing correct answers. They are lost because the model copies a placeholder out of the prompt and the extraction regex cannot recover the verdict that follows it.

### Potential root cause

The `INSTRUCTIONS` system prompt contains:

> You will output your final answer after reasoning through the evidence. The final answer should be one of the three options and should be formatted as follows:
>
> Reasoning for the answer #### ANSWER
>
> Here is an example: ... Response: ... the claim is contradicted. #### CONTRADICT

The literal phrase `#### ANSWER` is meant as a placeholder, but `gpt-5.4-nano` interprets it as part of the expected output and **copies it verbatim**, then writes the real verdict on a separate line afterward.

Live reproduction against the OpenAI API with one SciFact claim and the same `INSTRUCTIONS`:

```
Reasoning for the answer #### ANSWER

The evidence states that CK4 and CK13 are broadly expressed in normal
esophageal mucosa but are markedly decreased in ESCC... This supports
their use as biomarkers for ESCC...

#### ANSWER  
SUPPORT
```

The notebook's extractor is:

```python
def extract_solution(answer):
    solution = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)", answer, re.IGNORECASE)
    if solution is None:
        return "INVALID"
    return solution.group(1).upper()
```

The regex requires `####` to be followed (after optional whitespace) by one of the three verdict words. `re.search` tries every position in the reply, but in the output above both occurrences of `####` are followed by `ANSWER`, which is not in the allowed set, and the real verdict only appears on the line after that token. No position matches, so `extract_solution` returns `"INVALID"`, and every `gpt-5.4-nano` reply that follows this template-copy pattern is graded wrong regardless of whether the verdict at the end was correct.

A spot-check shows that even `gemini-2.5-flash-lite` occasionally falls into the same pattern, producing `#### ANSWER: CONTRADICT`, which also fails the regex because `ANSWER:` sits between `####` and the verdict. Gemini falls into the trap less often, so its measured Accuracy is much higher, but it is not immune.
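Both failure patterns can be checked offline, without calling either API. The snippet below is a minimal sketch: it copies the notebook's extractor as quoted above and feeds it an abbreviated version of the live reproduction, the `#### ANSWER: CONTRADICT` variant, and a well-formed reply. The well-formed reply and the variable names are made up for illustration.

```python
import re


def extract_solution(answer):
    # Extractor as quoted from the notebook above
    solution = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)", answer, re.IGNORECASE)
    if solution is None:
        return "INVALID"
    return solution.group(1).upper()


# Abbreviated from the live gpt-5.4-nano reproduction above
template_copy_reply = (
    "Reasoning for the answer #### ANSWER\n"
    "\n"
    "The evidence states that CK4 and CK13 are broadly expressed in normal\n"
    "esophageal mucosa but are markedly decreased in ESCC...\n"
    "\n"
    "#### ANSWER  \n"
    "SUPPORT\n"
)

# Pattern occasionally seen from gemini-2.5-flash-lite
colon_reply = "The abstract reports the opposite effect. #### ANSWER: CONTRADICT"

# Hypothetical reply that follows the intended format
well_formed_reply = "The abstract reports the opposite effect. #### CONTRADICT"

for name, reply in [
    ("template copy", template_copy_reply),
    ("colon variant", colon_reply),
    ("well formed", well_formed_reply),
]:
    print(f"{name}: {extract_solution(reply)}")
```

Only the last reply yields a verdict; the first two come back as `"INVALID"` even though each contains an unambiguous verdict word.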

### Impact on end users

Anyone running this tutorial unchanged will see a result table where the OpenAI pipelines look near-random and the Gemini pipelines look competent. The natural reading is "the choice of generator matters enormously" or "gpt-5.4-nano is broken on this task". The actual story — the prompt template confuses some models into copying a placeholder, and the extractor is too strict to recover — is invisible. This undermines the credibility of the tutorial's apparent finding and risks misguiding users who use this notebook as a starting template for their own evaluations.
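One way to make this failure visible in a result table is to report the share of unparseable replies next to the accuracy. The sketch below is hypothetical: `summarize_extraction` and its dictionary keys are not part of the notebook or of `rapidfireai`, and it assumes that gold labels are never `"INVALID"`.

```python
def summarize_extraction(predictions, labels):
    """Report accuracy together with the share of replies the extractor rejected.

    `predictions` are the outputs of extract_solution; `labels` are gold verdicts.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    total = len(predictions)
    if total == 0:
        return {"accuracy": 0.0, "invalid_rate": 0.0, "accuracy_on_parsed": 0.0}

    invalid = sum(p == "INVALID" for p in predictions)
    correct = sum(p == y for p, y in zip(predictions, labels))
    parsed = total - invalid

    return {
        "accuracy": correct / total,
        "invalid_rate": invalid / total,
        # Accuracy restricted to replies that produced a verdict at all
        "accuracy_on_parsed": correct / parsed if parsed else 0.0,
    }


# Toy data: three replies were rejected by the extractor, one was parsed correctly
summary = summarize_extraction(
    ["INVALID", "INVALID", "INVALID", "SUPPORT"],
    ["SUPPORT", "CONTRADICT", "NOINFO", "SUPPORT"],
)
print(summary)
```

A high `invalid_rate` alongside a low `accuracy` points at the extraction step rather than at the model's reasoning.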

### Suggested fix (either is sufficient; both are stronger)

**A: make `extract_solution` more forgiving** so that, when no verdict directly follows `####`, it falls back to the last verdict word that appears anywhere in the reply:

```python
import re


def extract_solution(answer):
    # Preferred path: a verdict word directly after the "####" marker
    m = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)\b", answer, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: take the last verdict-word that appears anywhere in the reply
    matches = re.findall(r"\b(SUPPORT|CONTRADICT|NOINFO)\b", answer, re.IGNORECASE)
    if matches:
        return matches[-1].upper()
    return "INVALID"
```
```
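A quick check of proposal A against the failure patterns described earlier. The function is repeated so the snippet runs on its own; the reply strings are abbreviated from the reproduction or, in the last case, invented to probe a limitation of the fallback.

```python
import re


def extract_solution(answer):
    # Proposal A, repeated verbatim so this snippet is self-contained
    m = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)\b", answer, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    matches = re.findall(r"\b(SUPPORT|CONTRADICT|NOINFO)\b", answer, re.IGNORECASE)
    if matches:
        return matches[-1].upper()
    return "INVALID"


# Abbreviated from the live gpt-5.4-nano reproduction
template_copy_reply = (
    "Reasoning for the answer #### ANSWER\n"
    "\n"
    "This supports their use as biomarkers for ESCC...\n"
    "\n"
    "#### ANSWER  \n"
    "SUPPORT\n"
)

# Pattern occasionally seen from gemini-2.5-flash-lite
colon_reply = "The abstract reports the opposite effect. #### ANSWER: CONTRADICT"

# Invented reply: the verdict is followed by prose that reuses a verdict word
misleading_reply = (
    "Reasoning for the answer #### ANSWER\n"
    "CONTRADICT, because the abstract does not support the claim."
)

print(extract_solution(template_copy_reply))  # recovered by the fallback
print(extract_solution(colon_reply))          # recovered by the fallback
print(extract_solution(misleading_reply))     # fallback picks the wrong word
```

The first two replies are recovered. Note that `\b` keeps inflected forms such as "supports" from matching, so the reasoning text in the first reply does not interfere. The third reply shows the trade-off: when trailing prose contains a bare verdict word, the last-match heuristic can return the wrong label, which is one reason to pair A with B.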

**B — rewrite the prompt example so the format description doesn't contain the literal token `ANSWER` adjacent to `####`** (which is what models are copying). For example:

> End your reply with the final verdict on its own line, prefixed with `####`.
> Example: `#### CONTRADICT`

Either fix should bring gpt-5.4-nano's Accuracy into the same range as gemini-2.5-flash-lite and remove the misleading model-comparison story from the result table.
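If proposal B is adopted, a small check can guard against a placeholder creeping back into the prompt. This is only a sketch: `find_placeholder_markers` and `VERDICTS` are hypothetical names, and the two prompt strings are fragments quoted from this issue rather than the notebook's full `INSTRUCTIONS`.

```python
import re

VERDICTS = ("SUPPORT", "CONTRADICT", "NOINFO")

# "####" followed by a word that is NOT one of the allowed verdicts
_PLACEHOLDER = re.compile(
    r"####\s*(?!(?:%s)\b)(\w+)" % "|".join(VERDICTS),
    re.IGNORECASE,
)


def find_placeholder_markers(prompt):
    """Return every non-verdict word that directly follows a '####' marker."""
    return _PLACEHOLDER.findall(prompt)


# Format description as currently quoted from INSTRUCTIONS
old_format = (
    "Reasoning for the answer #### ANSWER\n"
    "\n"
    "Here is an example: ... Response: ... the claim is contradicted. #### CONTRADICT"
)

# Format description from proposal B
new_format = (
    "End your reply with the final verdict on its own line, prefixed with `####`.\n"
    "Example: `#### CONTRADICT`"
)

print(find_placeholder_markers(old_format))
print(find_placeholder_markers(new_format))
```

The current wording is flagged because of `#### ANSWER`, while the proposed wording produces no hits.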

### Reproduce

Use `rapidfireai==0.16.0rc3` (evals-mode install) and run the notebook with `OPENAI_API_KEY` and `GOOGLE_API_KEY` exported. `results_df` will show ~0.03 Accuracy for the four `gpt-5.4-nano` rows and ~0.70 Accuracy for the four `gemini-2.5-flash-lite` rows, despite (near-)identical retrieval metrics across matching (embedding, search) pairs.



