@cantemizyurek
Collaborator

@cantemizyurek cantemizyurek commented Nov 26, 2025

Per-eval scoreThreshold override

Adds the ability to set scoreThreshold per-eval, overriding the global threshold when specified.

Usage

evalite("Eval that requires high precision", {
  data: [...],
  task: ...,
  scoreThreshold: 95, // Requires 95%
});

evalite("Baseline Eval", {
  data: [...],
  task: ...,
  scoreThreshold: 80, // Only requires 80%
});

Behavior

  • Per-eval threshold overrides global when set
  • Falls back to global threshold when not specified
  • If ANY eval fails its threshold, exit code is 1
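The three rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `EvalResult` and `ThresholdFailure` shapes and the function names `resolveThreshold`, `findThresholdFailures`, and `exitCodeFor` are hypothetical, and scores are assumed to already be percentages (0-100).

```typescript
// Hypothetical result shape for illustration; not Evalite's internal types.
interface EvalResult {
  name: string;
  score: number; // assumed: average score as a percentage (0-100)
  scoreThreshold?: number; // per-eval override, if the eval set one
}

interface ThresholdFailure {
  name: string;
  score: number;
  threshold: number;
}

// The per-eval threshold wins when set; otherwise fall back to the global one.
// `??` is used so that an explicit threshold of 0 is still respected.
function resolveThreshold(
  evalThreshold: number | undefined,
  globalThreshold: number | undefined,
): number | undefined {
  return evalThreshold ?? globalThreshold;
}

// Collect every eval whose score is below its effective threshold.
// Evals with no threshold at all (neither per-eval nor global) never fail.
function findThresholdFailures(
  results: EvalResult[],
  globalThreshold?: number,
): ThresholdFailure[] {
  const failures: ThresholdFailure[] = [];
  for (const result of results) {
    const threshold = resolveThreshold(result.scoreThreshold, globalThreshold);
    if (threshold !== undefined && result.score < threshold) {
      failures.push({ name: result.name, score: result.score, threshold });
    }
  }
  return failures;
}

// If ANY eval fails its threshold, the exit code is 1.
function exitCodeFor(failures: ThresholdFailure[]): 0 | 1 {
  return failures.length > 0 ? 1 : 0;
}
```

With a global threshold of 80, an eval scoring 85 with no override would pass, while an eval scoring 85 with `scoreThreshold: 95` would fail and push the exit code to 1.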

Watch mode output

When one or more thresholds fail, watch mode lists the evals that failed:

 FAIL  Score threshold not met. Watching for file changes...
       - High-Quality Eval: 75% < 95%
       - Baseline Eval: 60% < 80%
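As a rough sketch of how a message like the one above could be assembled from a list of failures (the `ThresholdFailure` shape and `formatThresholdFailures` name are assumptions for illustration, not taken from the PR's diff):

```typescript
// Hypothetical failure record; scores and thresholds are assumed to be percentages.
interface ThresholdFailure {
  name: string;
  score: number;
  threshold: number;
}

// Build the watch-mode failure message: one header line,
// then one indented "- name: score% < threshold%" line per failed eval.
function formatThresholdFailures(failures: ThresholdFailure[]): string {
  const header = " FAIL  Score threshold not met. Watching for file changes...";
  const lines = failures.map(
    (f) => `       - ${f.name}: ${Math.round(f.score)}% < ${f.threshold}%`,
  );
  return [header, ...lines].join("\n");
}

console.log(
  formatThresholdFailures([
    { name: "High-Quality Eval", score: 75, threshold: 95 },
    { name: "Baseline Eval", score: 60, threshold: 80 },
  ]),
);
```

Listing each eval with its own threshold makes it clear whether a failure came from a per-eval override or from the global fallback.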

Solves: #347

- Introduced per-eval score thresholds to enhance evaluation control.

@changeset-bot

changeset-bot bot commented Nov 26, 2025

🦋 Changeset detected

Latest commit: 73636d9

The changes in this PR will be included in the next version bump.

@vercel

vercel bot commented Nov 26, 2025

@cantemizyurek is attempting to deploy a commit to the Skill Recordings Team on Vercel.

A member of the Team first needs to authorize it.

@mattpocock
Owner

Keeping this on ice for now, will put this in the post-v1 milestone.

@mattpocock mattpocock added this to the Post-v1 milestone Dec 3, 2025