Add SCT (Script Concordance Test) benchmark #5
Add SCT (Script Concordance Test) Benchmark
Summary
This PR adds a new benchmark for evaluating AI clinical reasoning using Script Concordance Tests (SCTs).
Examples are from the public SCT-Bench/sctpublic repository.
Paper: McCoy et al., NEJM AI 2025
What is SCT?
Script Concordance Testing is a validated assessment method that measures clinical reasoning by evaluating how new information affects diagnostic or therapeutic hypotheses. Unlike multiple-choice questions, SCT captures the nuanced, probabilistic nature of clinical decision-making.
Changes
benchmarks/sct/
- prompt.md - Prompt template for SCT questions
- schema.json - JSON schema for response validation
- validator.py - API validation script
- inputs/ - 5 example test cases
- outputs/ - Reference outputs for examples
- README.md - Benchmark documentation
Response Format
Models respond with a JSON object containing a rating (-2 to +2) and rationale:
{ "Rating": 1, "Rationale": "Brief clinical justification" }
Example Cases
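As a rough illustration of the response format described above, a model reply could be checked along these lines. This is only a sketch and not the contents of validator.py or schema.json: the function name `parse_sct_response` is hypothetical, and it assumes ratings are whole numbers from -2 to +2 and that the rationale must be a non-empty string.

```python
import json

# Assumption: ratings are the five integer points of the -2..+2 scale.
ALLOWED_RATINGS = {-2, -1, 0, 1, 2}


def parse_sct_response(raw: str) -> dict:
    """Parse a model reply and check it against the assumed response shape.

    Hypothetical helper for illustration; raises ValueError on any problem.
    """
    # json.JSONDecodeError is a subclass of ValueError, so malformed JSON
    # surfaces as a ValueError as well.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    rating = data.get("Rating")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        raise ValueError("Rating must be an integer")
    if rating not in ALLOWED_RATINGS:
        raise ValueError("Rating must be between -2 and +2")

    rationale = data.get("Rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("Rationale must be a non-empty string")

    return {"Rating": rating, "Rationale": rationale.strip()}


if __name__ == "__main__":
    reply = '{ "Rating": 1, "Rationale": "Brief clinical justification" }'
    print(parse_sct_response(reply))
```

A plain-Python check like this avoids any dependency on a schema library; the JSON schema shipped with the benchmark remains the authoritative definition of a valid response.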
Five calibration examples are included, covering the full rating scale:
Full Benchmark
The complete benchmark includes 750 validated questions from 10 international medical institutions across multiple specialties (internal medicine, emergency medicine, neurology, pediatrics, physiotherapy).
Testing