An adversarial post-execute evaluation skill for pi.
After a complex execution, you're staring at a large diff and don't know where to look. pi-evaluate reads your contract (what you asked for) and your outputs (what was built), then tells you exactly where to focus — and what you can safely skip.
Inspired by the GAN discriminator pattern: a second agent that sees only the contract and the output, never the implementation plan, and returns a structured verdict.
pi-evaluate acts as an adversarial discriminator:
- Reads your contract — brief + specs (reespec), or freeform text you paste in
- Reads your actual outputs — files, test results, documents
- Returns a structured verdict per capability: ✅ SATISFIED / ⚠️ PARTIAL / ❌ UNSATISFIED / ❓ UNCLEAR
- Produces a triage summary: safe to skip, worth a look, human call
It does NOT read tasks.md, design.md, or any implementation intent. It is blind to the "how" — it only judges whether the "what" was delivered.
It does NOT fix gaps. It reports them. You decide what to do.
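The contract-in, verdict-out boundary can be sketched as a pair of types. This is a minimal illustration only; the interface and field names below are assumptions, not the skill's actual API.

```typescript
// Minimal sketch of the evaluator's boundary.
// All names here are illustrative assumptions, not the skill's real API.

// What the discriminator is allowed to see:
interface EvaluatorInput {
  contract: string;    // brief + specs, or freeform pasted text
  outputs: string[];   // files, test results, documents
  // deliberately absent: tasks.md, design.md, implementation intent
}

// What it returns per capability:
interface CapabilityVerdict {
  capability: string;
  verdict: "SATISFIED" | "PARTIAL" | "UNSATISFIED" | "UNCLEAR";
  reason: string;      // evidence found, or noted as missing
  focus?: string;      // where a human should look, if anywhere
}

// Example verdict mirroring the output shown below:
const example: CapabilityVerdict = {
  capability: "user-auth",
  verdict: "PARTIAL",
  reason: "OAuth handler found; no password login handler in src/auth/",
  focus: "src/auth/",
};
```

Note that `EvaluatorInput` has no field for implementation intent: blindness to the "how" is enforced by what the evaluator is given, not by asking it to look away.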
```
npm install pi-evaluate
```

Then restart pi or run /reload. The evaluate skill will appear in your available skills.
If you use reespec, pi-evaluate detects your project automatically.
After completing an execute phase, invoke the skill:
```
/skill:evaluate
```
The evaluator will:
- Detect your active reespec request
- Load `brief.md` and `specs/` as the contract silently
- Scan your outputs
- Return a verdict per spec capability + triage summary
Example output:

```
Evaluating request: my-feature

### user-auth-capability
verdict: ⚠️ PARTIAL
reason: brief says "support OAuth and password login" — found OAuth handler,
no password login handler found in src/auth/
focus: src/auth/ — password login handler is missing

### error-handling-capability
verdict: ✅ SATISFIED
reason: all error paths covered in tests/errors.test.mjs

## Triage
✅ Safe to skip: error-handling, logging
⚠️ Worth a look: user-auth (password login missing)
```
No reespec? No problem. The skill works with any project.
Invoke it:
```
/skill:evaluate
```
You'll be asked:
"What's the contract? Paste your original ask, acceptance criteria, or whatever defines done."
Paste anything — a paragraph, a bullet list, a copied ticket, a Slack message. No structure required.
Example:

```
What's the contract?
> Build a user settings page. It should let users change their email and password.
> There should be a confirmation dialog before saving. Mobile-friendly. No external
> auth libraries.

(contract: user-supplied)

### change-email
verdict: ✅ SATISFIED
reason: src/settings/email.tsx exists, email change form found with validation

### change-password
verdict: ⚠️ PARTIAL
reason: password field found but no confirmation dialog present in src/settings/
focus: src/settings/ — confirmation dialog before save is missing

### mobile-friendly
verdict: ❓ UNCLEAR
reason: contract says "mobile-friendly" but no breakpoints or responsive tests defined —
cannot verify without clearer criteria
focus: human call — define what mobile-friendly means for this project

## Triage
✅ Safe to skip: change-email
⚠️ Worth a look: change-password (missing confirmation dialog)
❓ Human call: mobile-friendly (underspecified)
```
GANs (Generative Adversarial Networks) pit two neural networks against each other: a generator that creates fake data, and a discriminator that judges whether the data is real or fake. The discriminator never sees how the generator made the data — it only sees the output and the training data (what "real" looks like).
pi-evaluate borrows this pattern:
| GAN | pi-evaluate |
|---|---|
| Generator | Your agent (execute phase) |
| Discriminator | The evaluator skill |
| Training data ("real") | The contract (brief + specs) |
| Generated output ("fake") | The implementation |
| "Is this real?" | "Does this satisfy the contract?" |
The key insight: the discriminator is blind to implementation intent. It can't be charitable about what the generator "meant to do" — it only sees what exists. This is what makes it useful. A self-review by the same agent that built the thing will always be biased. A blind discriminator won't.
| Label | Meaning |
|---|---|
| ✅ SATISFIED | All requirements for this capability are clearly present |
| ⚠️ PARTIAL | Some requirements present, some missing |
| ❌ UNSATISFIED | No evidence of this capability in the outputs |
| ❓ UNCLEAR | Contract is too underspecified to judge — flag for human |
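The triage buckets follow mechanically from these labels. A sketch of that mapping (the function name and shapes are assumptions for illustration; the skill's internals may differ):

```typescript
// Illustrative mapping from per-capability verdicts to triage buckets.
// Names are assumptions, not the skill's actual code.
type Verdict = "SATISFIED" | "PARTIAL" | "UNSATISFIED" | "UNCLEAR";

interface Triage {
  safeToSkip: string[];  // ✅ fully satisfied
  worthALook: string[];  // ⚠️ / ❌ evidence partly or wholly missing
  humanCall: string[];   // ❓ contract too vague to judge
}

function triage(reports: { capability: string; verdict: Verdict }[]): Triage {
  const t: Triage = { safeToSkip: [], worthALook: [], humanCall: [] };
  for (const r of reports) {
    if (r.verdict === "SATISFIED") t.safeToSkip.push(r.capability);
    else if (r.verdict === "UNCLEAR") t.humanCall.push(r.capability);
    else t.worthALook.push(r.capability); // PARTIAL or UNSATISFIED
  }
  return t;
}
```

The key design choice visible here: UNCLEAR routes to a human, never to "worth a look" — an underspecified contract is your problem to fix, not the agent's.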
- Optional — never a hard gate. You decide what to do with the verdict.
- Adversarial — looks for gaps, not confirmation. Absence of evidence is flagged.
- Focused — the triage summary is the primary output. The human reads this first.
- Honest about uncertainty — UNCLEAR is not failure. It means your contract needs more detail.
MIT
Made with reespec and ♥ in the EU