Skip to content

jdubdevs/do-better

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Do Better

When your AI coding agent keeps making you step in, Do Better helps find the pattern and suggest one practical change.

You're using an AI agent (Claude Code, Cursor, Codex) and it's mostly great — but you keep correcting it: redoing the same ask, re-explaining something you already said, reining it in when it over-builds. Do Better reads your own session, finds those moments, names the pattern, and coaches you to steer it better — in plain language, fit to how you work. Not a dashboard; a human answer.

Built for Claude Code (works out of the box). On Cursor/Codex, point your own agent at this repo and it can set itself up — see Other agents.

Do Better — the across-sessions trend: where you keep correcting your agent, and whether you're getting better at steering it

Illustrative — the wording, numbers, and pattern adapt to your own sessions. Run it on one session, a past session, or across your recent ones (shown here).

What you get

Example of what do-better hands back (illustrative — your wording adapts to you):

Where it kept tripping you up: three times this session you pulled it back for doing more than you asked — once it rewrote a file you didn't mention, twice it added structure to a one-function request. The pattern: it over-builds when the ask is open-ended. One change: state the scope up front — "just this function, don't touch other files." That single constraint would've prevented all three.

The numbers under the hood are evidence; that answer is the deliverable.

Why this exists

The Claude Code ecosystem is full of great tools — but they all tune the agent:

tool what it does
ponytail ★66.8k · caveman ★77.9k make it write less (anti-over-engineering, fewer tokens)
karpathy-skills ★184k · gstack ★118k make it behave better (opinionated setups + rules)
claude-mem ★85k give it memory across sessions
ccusage ★16.7k track what it costs

Every one of them improves the tool. None looks at your loop with it — where you keep stepping in, why, and what to change about how you drive it. That's the gap Do Better fills: a mirror on the human↔agent partnership, read from your own sessions. The others make the agent better; this makes you better at running it. (And when the fix really is "make it write less," Do Better will point you at ponytail.)

How it works — facts are deterministic, judgment is the model

your session log
   │
   ├─▶  extract facts     CLI · deterministic · free   — your turns + token cost (no hallucination)
   ├─▶  classify intent   the model · temp 0 · self-tested — where & why you corrected it
   └─▶  coach you         your own AI                   — a plain answer, fit to your setup + words

The rigor lives in the engine; the warmth lives in the conversation.

Why a model, not keywords

(Honest scope: the classifier is validated against a labeled fixture set, not benchmarked at scale, and is bring-your-own-model.)

A keyword flag is wrong more often than it's right. The bundled self-test shows a naive keyword baseline scoring 33% against the labeled fixtures and tripping on the obvious traps — it reads "sorry, let me rephrase" (you fixing your own prompt) and "nope, this works" (approval) as corrections:

$ python3 engine/classify/selftest.py --selftest-offline
  TRAP FAIL [sc-1] gold=self_clarifying pred=correcting_claude :: 'sorry, let me rephrase that'
  TRAP FAIL [po-1] gold=positive       pred=correcting_claude :: 'nope, this works — ship it'
  class agreement: 33%        # reproduce it yourself — no API key needed

A rubric-driven model read is meant to catch those cases before scoring real data — and the classifier won't touch your data until it passes that self-test (it must label the traps correctly first). In one external prior validation (a 156-turn sample), the model read flagged 5.1% of turns as real corrections vs 1.3% for the keyword baseline — about 4× as many labeled correction turns.

The suite

skill what it answers when to use it
do-better runs the right check and coaches you start here — "do better" / "this is driving me nuts"
intent-check where you keep correcting it, why — for this session, a past one, or a trend across recent sessions (are you improving) you're frustrated and want to know what keeps going wrong — or whether your steering is getting better over time
usage-check what it's costing you (plain spend) "am I burning money on this?"
post-mortem a deeper pass on a recurring pattern the same thing keeps happening — offered when the quick check isn't enough

Quick start (Claude Code)

git clone https://github.com/jdubdevs/do-better && cd do-better
# install the skills: copy/symlink skills/* into ~/.claude/skills/, or load as a plugin

Then mid-session, when it frustrates you: /do-better. After the skills are installed, your AI can walk through the rest of the onboarding. Or try the engine directly:

python3 engine/correction_tracker.py ~/.claude/projects/**/*.jsonl --out turns.json
python3 engine/classify/selftest.py --selftest-offline   # the gate, no key needed
engine/tokens.sh                                          # your spend (via ccusage)

Across sessions (the trend): pass several transcripts and roll the labels into a correction-rate-over-time read — is the same thing recurring, and are you getting better at steering it?

# newest 10 sessions -> extract (carries session + date) -> classify -> trend
python3 engine/correction_tracker.py $(ls -t ~/.claude/projects/**/*.jsonl | head -10) --out turns.json
python3 engine/classify/selftest.py --run turns.json      # writes engine/classify/labels.json
python3 engine/trend.py turns.json                        # per-session rate + recurring type + direction

Other agents

Claude-Code-first, but portable:

  • Spend already works elsewhereusage-check uses ccusage, which covers Codex, OpenCode, and Amp (engine/tokens.sh codex).
  • Cursor / Codex / other: the only CC-specific piece is where the session log lives + its JSONL shape (engine/correction_tracker.py reads ~/.claude/projects/**/*.jsonl). The engine is 89 lines of stdlib + one prompt, so the easiest path is to point your own agent at this repo and say "set this up for my agent" — it can read the extractor, find your log location, and adapt the parser. The classifier prompt + self-test are agent-agnostic as-is.

(New to this? Claude Code is Anthropic's AI coding agent that runs in your terminal and keeps a local session log — Do Better reads that log.)

Security & privacy

  • What runs where: correction_tracker.py reads your local session logs and writes the extracted turns (first 600 chars each) + token cost to a local turns.json. Classification sends those excerpts to the model you choose (e.g. the Anthropic API); labels are written to a local labels.json. Nothing else leaves your machine.
  • Your session data stays local + uncommitted: turns.json, labels.json, and .selftest-passed are gitignored — they're never committed if you fork this.
  • Third-party: usage-check / tokens.sh fetches and runs ccusage via npx at @latest. For supply-chain stability, pin a version (npx ccusage@<version>).
  • Untrusted transcript text: the classifier treats every turn as quoted data to classify, never as instructions to follow — prompt-injection-hardened and locked in by a self-test fixture.

Author / license

Built by John Weng — practitioner-scholar, founder of Bricolas and creator of Aperture — who got tired of re-correcting his own AI and measured it instead (on his sessions, each turn re-read roughly half a million tokens of context — the tax this is built to cut). MIT.

About

When your AI coding agent keeps making you step in — Do Better finds the pattern and one practical fix, read from your own sessions. Claude-Code-first, portable.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors