@liamgmccoy liamgmccoy commented Jan 3, 2026

Add SCT (Script Concordance Test) Benchmark

Summary

This PR adds a new benchmark for evaluating AI clinical reasoning using Script Concordance Tests (SCTs).

Examples are from the public SCT-Bench/sctpublic repository.

Paper: McCoy et al., NEJM AI 2025

What is SCT?

Script Concordance Testing is a validated assessment method that measures clinical reasoning by evaluating how new information affects diagnostic or therapeutic hypotheses. Unlike multiple-choice questions, SCT captures the nuanced, probabilistic nature of clinical decision-making.

Changes

  • New benchmark: benchmarks/sct/
    • prompt.md - Prompt template for SCT questions
    • schema.json - JSON schema for response validation
    • validator.py - API validation script
    • inputs/ - 5 example test cases
    • outputs/ - Reference outputs for examples
    • README.md - Benchmark documentation
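A quick way to confirm that a checkout contains the entries listed above is sketched below. This is a minimal sketch, not part of the PR: the function name `missing_entries` is hypothetical, and it only checks the top-level names from the list, not the contents of inputs/ or outputs/.

```python
from pathlib import Path

# Top-level entries named in the "Changes" list for benchmarks/sct/.
EXPECTED_FILES = ["prompt.md", "schema.json", "validator.py", "README.md"]
EXPECTED_DIRS = ["inputs", "outputs"]


def missing_entries(root):
    """Return the listed files/directories that are absent under `root`.

    Hypothetical helper for illustration; directories are reported with a
    trailing slash so they are easy to tell apart from files.
    """
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

Calling `missing_entries("benchmarks/sct")` from the repository root should return an empty list if the layout matches the description.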

Response Format

Models respond with a JSON object containing two keys, Rating (from -2 to +2) and Rationale (a brief clinical justification):

{
  "Rating": 1,
  "Rationale": "Brief clinical justification"
}
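The format above can be checked with a few lines of standard-library Python. This is a minimal sketch and not the PR's validator.py or schema.json: the function name `parse_sct_response` is hypothetical, and it assumes that ratings are integers, that the rationale must be non-empty, and that extra keys are ignored.

```python
import json

# Assumption: only the five integer points of the -2..+2 scale are valid.
ALLOWED_RATINGS = {-2, -1, 0, 1, 2}


def parse_sct_response(raw):
    """Parse a model reply and check it against the described format.

    Raises ValueError (json.JSONDecodeError is a subclass) on any problem.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    rating = data.get("Rating")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int) or rating not in ALLOWED_RATINGS:
        raise ValueError(f"Rating must be an integer from -2 to +2, got {rating!r}")

    rationale = data.get("Rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("Rationale must be a non-empty string")

    return {"Rating": rating, "Rationale": rationale}
```

For example, `parse_sct_response('{"Rating": 1, "Rationale": "Brief clinical justification"}')` returns the parsed dictionary, while a rating of 3 or a missing rationale raises `ValueError`.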

Example Cases

Five calibration examples are included, covering the full rating scale:

Example  Clinical Context                    Expected
001      Otitis externa vs oral antibiotics  -2
002      Pediatric diarrhea + fever          -1
003      Pregnancy test + denial              0
004      Atopic dermatitis + fever/rash      +1
005      Trisomy 21 + petechiae              +2
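As a rough illustration of how model ratings could be compared with the expected values for these five examples, the sketch below computes exact-match rate and mean absolute error. Both metrics and the function name `score_examples` are assumptions chosen for illustration; this is not the scoring method of the PR or the paper, and comparing against a single expected value per item is a simplification.

```python
# Expected ratings copied from the example table (IDs 001-005).
EXPECTED = {"001": -2, "002": -1, "003": 0, "004": 1, "005": 2}


def score_examples(predicted):
    """Compare predicted ratings (dict of example ID -> int) with EXPECTED.

    Returns the exact-match rate and the mean absolute error over the
    five examples. Raises KeyError if a prediction is missing.
    """
    exact = 0
    abs_err = 0
    for ex_id, expected in EXPECTED.items():
        rating = predicted[ex_id]
        exact += int(rating == expected)
        abs_err += abs(rating - expected)
    n = len(EXPECTED)
    return {"exact_match": exact / n, "mean_abs_error": abs_err / n}
```

With hypothetical predictions `{"001": -2, "002": 0, "003": 0, "004": 1, "005": 1}`, three of the five ratings match exactly and the other two are each off by one point.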

Full Benchmark

The complete benchmark includes 750 validated questions from 10 international medical institutions across multiple specialties (internal medicine, emergency medicine, neurology, pediatrics, physiotherapy).

Testing

python benchmarks/sct/validator.py example_001

Adds a new benchmark for evaluating AI clinical reasoning using
Script Concordance Tests from McCoy et al., NEJM AI 2025.

- prompt.md: SCT prompt template
- schema.json: JSON response validation schema
- validator.py: API validation script
- inputs/: 5 example test cases (from SCT-Bench/sctpublic)
- outputs/: Reference outputs for examples

Paper: https://ai.nejm.org/doi/full/10.1056/AIdbp2500120
@vishnuravi vishnuravi merged commit 0c24004 into HealthRex:main Jan 21, 2026