TL;DR: the prompt template and the answer-extraction regex are a fragile combination, and gpt-5.4-nano falls into a trap that gemini-2.5-flash-lite mostly avoids. The result misleads end users into thinking gpt-5.4-nano is broken or unsuitable for the task, when in reality the model is producing correct answers that the extractor throws away.
The tutorial_notebooks/rag-contexteng/rf-tutorial-scifact-full-evaluation.ipynb notebook ships with api_config=List([openai_config, gemini_config]) where openai_config targets gpt-5.4-nano. When the notebook runs out of the box, the OpenAI-driven pipelines report an Accuracy of roughly 0.02–0.04, while the gemini-driven pipelines on the same queries with the same retrieval report ~0.66–0.73.
This is misleading. An end user looking at the result table would reasonably conclude that gpt-5.4-nano is unsuitable for the task, or that something is broken. In fact the model is producing correct answers — the prompt template + answer-extraction regex are a fragile combination, and gpt-5.4-nano falls into a trap that gemini-2.5-flash-lite mostly avoids.
Potential root cause
The INSTRUCTIONS system prompt contains:
You will output your final answer after reasoning through the evidence. The final answer should be one of the three options and should be formatted as follows:
Reasoning for the answer #### ANSWER
Here is an example: ... Response: ... the claim is contradicted. #### CONTRADICT
The literal phrase #### ANSWER is meant as a placeholder, but gpt-5.4-nano interprets it as part of the expected output and copies it verbatim, then writes the real verdict on a separate line afterward.
Live reproduction against the OpenAI API with one SciFact claim and the same INSTRUCTIONS:
Reasoning for the answer #### ANSWER
The evidence states that CK4 and CK13 are broadly expressed in normal
esophageal mucosa but are markedly decreased in ESCC... This supports
their use as biomarkers for ESCC...
#### ANSWER
SUPPORT
The notebook's extractor is:
def extract_solution(answer):
    solution = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)", answer, re.IGNORECASE)
    if solution is None:
        return "INVALID"
    return solution.group(1).upper()
The regex requires #### to be followed (after optional whitespace) by one of the three verdict words. In the output above, both occurrences of #### are followed by ANSWER, which is not in the allowed set, and the real verdict on the last line is separated from #### by that copied placeholder. The regex therefore finds no match anywhere in the reply, extract_solution returns "INVALID", and every gpt-5.4-nano reply that follows this template-copy pattern is graded wrong regardless of whether the real verdict at the end was correct.
A spot-check shows even gemini-2.5-flash-lite occasionally falls into the same pattern, producing #### ANSWER: CONTRADICT — which also fails the regex. Gemini just falls into the trap less often, so its measured Accuracy is much higher; it is not actually immune.
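The failure can be checked offline, without calling either API. The following is a minimal sketch that runs the extractor as quoted above on three strings: an abbreviated copy of the live gpt-5.4-nano reproduction, the Gemini pattern from the spot-check, and the format the prompt's own example demonstrates. The variable names are illustrative and not taken from the notebook.

```python
import re


def extract_solution(answer):
    # Extractor as quoted from the notebook.
    solution = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)", answer, re.IGNORECASE)
    if solution is None:
        return "INVALID"
    return solution.group(1).upper()


# Abbreviated from the live gpt-5.4-nano reproduction.
nano_reply = (
    "Reasoning for the answer #### ANSWER\n"
    "The evidence states that CK4 and CK13 are broadly expressed in normal "
    "esophageal mucosa but are markedly decreased in ESCC...\n"
    "#### ANSWER\n"
    "SUPPORT"
)

# Pattern seen in the gemini-2.5-flash-lite spot-check.
gemini_reply = "... the claim is contradicted. #### ANSWER: CONTRADICT"

# Format shown by the example inside the prompt.
expected_reply = "... the claim is contradicted. #### CONTRADICT"

print(extract_solution(nano_reply))      # INVALID
print(extract_solution(gemini_reply))    # INVALID
print(extract_solution(expected_reply))  # CONTRADICT
```

Only the third string parses, even though all three replies contain an unambiguous verdict.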
Impact on end users
Anyone running this tutorial unchanged will see a result table where the OpenAI pipelines look almost always wrong and the Gemini pipelines look competent. An Accuracy of 0.02–0.04 is well below the roughly 0.33 that uniform guessing among three verdicts would give, which itself hints at an extraction failure rather than a weak model, but nothing in the table says so. The natural reading is "the choice of generator matters enormously" or "gpt-5.4-nano is broken on this task". The actual story (the prompt template confuses some models into copying a placeholder, and the extractor is too strict to recover) is invisible. This undermines the credibility of the tutorial's apparent finding and risks misguiding users who use this notebook as a starting template for their own evaluations.
Suggested fix (either is sufficient; both are stronger)
A — make extract_solution more forgiving so it falls back to any trailing verdict word:
def extract_solution(answer):
    m = re.search(r"####\s*(SUPPORT|CONTRADICT|NOINFO)\b", answer, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: take the last verdict-word that appears anywhere in the reply
    matches = re.findall(r"\b(SUPPORT|CONTRADICT|NOINFO)\b", answer, re.IGNORECASE)
    if matches:
        return matches[-1].upper()
    return "INVALID"
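One caveat with the fallback in A: it is case-insensitive and matches anywhere, so a reply with no verdict at all can still be scored if the reasoning happens to contain the word "support". The sketch below compares fix A with a narrower alternative that only tolerates a copied ANSWER placeholder and an optional colon between #### and the verdict. The name extract_solution_strict and its pattern are hypothetical, offered as one possible trade-off and not as the notebook's or the library's method.

```python
import re

VERDICTS = r"(SUPPORT|CONTRADICT|NOINFO)"


def extract_solution(answer):
    # Fix A: strict match first, then the last verdict word anywhere.
    m = re.search(r"####\s*" + VERDICTS + r"\b", answer, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    matches = re.findall(r"\b" + VERDICTS + r"\b", answer, re.IGNORECASE)
    return matches[-1].upper() if matches else "INVALID"


def extract_solution_strict(answer):
    # Hypothetical variant: allow "#### ANSWER", "#### ANSWER:" or bare "####"
    # before the verdict, but never pick up words from the reasoning text.
    pattern = r"####\s*(?:ANSWER\b)?\s*:?\s*" + VERDICTS + r"\b"
    m = re.search(pattern, answer, re.IGNORECASE)
    return m.group(1).upper() if m else "INVALID"


copied = "Reasoning for the answer #### ANSWER\nThe evidence states that ...\n#### ANSWER\nSUPPORT"
colon = "... the claim is contradicted. #### ANSWER: CONTRADICT"
no_verdict = "The abstracts do not support or refute the claim in any way."

for reply in (copied, colon, no_verdict):
    print(extract_solution(reply), extract_solution_strict(reply))
# SUPPORT SUPPORT
# CONTRADICT CONTRADICT
# SUPPORT INVALID
```

Both versions recover the two failure patterns reported above. They differ only on the third reply, where fix A returns a verdict the model never gave.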
B — rewrite the prompt example so the format description doesn't contain the literal token ANSWER adjacent to #### (which is what models are copying). For example:
End your reply with the final verdict on its own line, prefixed with ####.
Example: #### CONTRADICT
Either fix should bring gpt-5.4-nano's Accuracy into the same range as gemini-2.5-flash-lite and remove the misleading model-comparison story from the result table.
Reproduce
rapidfireai==0.16.0rc3, evals-mode install. Run the notebook with OPENAI_API_KEY and GOOGLE_API_KEY exported. The results_df will show ~0.03 Accuracy for the four gpt-5.4-nano rows and ~0.70 Accuracy for the four gemini-2.5-flash-lite rows, despite (near-)identical retrieval metrics across matching (embedding, search) pairs.
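To confirm that the gap comes from extraction and not from answer quality, it may help to measure how many raw replies per generator parse to "INVALID". The sketch below assumes the raw reply strings have already been collected into a plain Python list; how to obtain them from the notebook is not covered here. The helper invalid_rate and the sample list are hypothetical and are not real notebook output.

```python
import re
from collections import Counter

VERDICT_RE = re.compile(r"####\s*(SUPPORT|CONTRADICT|NOINFO)", re.IGNORECASE)


def extract_solution(answer):
    # Same behavior as the notebook's extractor quoted in this report.
    m = VERDICT_RE.search(answer)
    return m.group(1).upper() if m else "INVALID"


def invalid_rate(replies):
    """Return the fraction of replies the extractor cannot parse."""
    if not replies:
        return 0.0
    counts = Counter(extract_solution(reply) for reply in replies)
    return counts["INVALID"] / len(replies)


# Hypothetical sample for illustration only.
sample = [
    "... #### SUPPORT",
    "Reasoning for the answer #### ANSWER\n...\n#### ANSWER\nSUPPORT",
    "... #### ANSWER: CONTRADICT",
    "... #### NOINFO",
]

print(invalid_rate(sample))  # 0.5
```

If the invalid rate for the gpt-5.4-nano rows is close to 1 while the Gemini rows sit much lower, that would be consistent with the diagnosis above.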