Visual debugging tool for patch generation traces
The Trace Viewer is a Streamlit-based UI for analyzing LLM interactions, failures, and performance metrics captured during patch generation.
```bash
pip install -e ".[observability]"
```

This installs:

- `streamlit` - Web UI framework
- `plotly` - Interactive charts (future use)
- `pandas` - Data analysis (future use)
If you don't have traces yet, run PatchPro with agentic mode:
```bash
# From patchpro-bot-agent directory
patchpro analyze-pr --base main --head HEAD --with-llm
```

This creates `.patchpro/traces/`:

- `traces.db` - SQLite database (queryable)
- `*.json` - Individual trace files (human-readable; see the loading sketch below)
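Since the `*.json` files are plain JSON, you can skim them without launching the viewer. A minimal sketch, assuming field names like `status`, `rule_id`, and `cost_usd` (open one real trace file to confirm the exact schema):

```python
import json
from pathlib import Path

# Minimal sketch: skim raw trace files without the UI.
# Field names are assumptions based on what the viewer displays.
trace_dir = Path(".patchpro/traces")

for trace_file in sorted(trace_dir.glob("*.json")):
    trace = json.loads(trace_file.read_text())
    print(trace_file.name, trace.get("status"), trace.get("rule_id"), trace.get("cost_usd"))
```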
```bash
# From patchpro-bot-agent directory
streamlit run trace_viewer.py
```

This opens a browser at http://localhost:8501.
The UI shows:
- Summary Metrics: Total traces, success rate, avg cost, avg latency
- Filters: Rule ID, status, strategy, text search
- Trace Cards: Expandable cards with full details for each attempt
Expandable Card Header:
```
🟢 F401 - example.py:15 - Attempt 1
```

- Status: 🟢 success, 🔴 failed, ⚠️ exhausted_retries
- Rule ID: Ruff/Semgrep rule being fixed
- File and line number
- Retry attempt number
Card Details:
- Metadata: Strategy, model, file type, complexity, tokens, cost, latency
- Finding: The original issue message
- Prompt: System + user prompts sent to LLM (collapsed)
- LLM Response: Raw LLM output (collapsed)
- Generated Patch: The unified diff patch (if generated)
- Validation Errors: Git apply errors (if failed)
- Previous Errors: Errors from earlier attempts (if retry)
Top Banner:
- Total traces logged
- Success rate (% patches that validated)
- Average cost per patch
- Average latency per patch
- Total cost across all attempts
- Average retry attempt number
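The same banner numbers can be recomputed straight from `traces.db`. A rough sketch, assuming a `traces` table with `status`, `cost_usd`, and `latency_ms` columns (verify the real schema with `.schema` in the sqlite3 CLI first):

```python
import sqlite3

# Sketch: recompute the summary banner from the SQLite database.
# Table and column names are assumptions; adjust to the actual schema.
conn = sqlite3.connect(".patchpro/traces/traces.db")
total, success_rate, avg_cost, avg_latency = conn.execute(
    """
    SELECT COUNT(*),
           AVG(CASE WHEN status = 'success' THEN 1.0 ELSE 0.0 END),
           AVG(cost_usd),
           AVG(latency_ms)
    FROM traces
    """
).fetchone()
conn.close()

print(f"traces={total}  success={success_rate:.0%}  "
      f"avg_cost=${avg_cost:.4f}  avg_latency={avg_latency:.0f}ms")
```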
Search/Filter By:
- Rule ID: Focus on specific rule types (F401, D100, etc.)
- Status: success, failed, exhausted_retries
- Strategy: generate_single_patch, generate_batch_patch
- Text Search: Find by message text or file path
Scenario: You see a low success rate in the summary metrics.
Steps:
- Filter by Status = failed
- Open the failed traces
- Look at the Validation Errors section
- Common patterns:
  - "patch does not apply" → Wrong line numbers
  - "malformed patch" → LLM output formatting issue
  - "unexpected end of file" → Multi-line string corruption
Action: Use error patterns to improve prompts or add post-processing.
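To reproduce a validation failure outside the pipeline, or to prototype a post-processing check, you can dry-run a patch yourself. A sketch (the pipeline's own validation step may differ):

```python
import subprocess
import tempfile

def patch_applies(patch_text: str, repo_dir: str = ".") -> tuple[bool, str]:
    """Dry-run a generated patch with `git apply --check`.

    Returns (ok, stderr) so the error text can be compared against the
    Validation Errors shown in the trace card.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".diff", delete=False) as f:
        f.write(patch_text)
        patch_path = f.name
    result = subprocess.run(
        ["git", "apply", "--check", patch_path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr.strip()
```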
Scenario: Want to see if retries actually help.
Steps:
- Search for same file/line with different attempt numbers
- Compare LLM Response between attempts
- Check Previous Errors section in retry attempts
- See if LLM learned from feedback
What to look for:
- Does attempt 2/3 fix issues from attempt 1?
- Are errors repeated (LLM stuck)?
- Does retry cost justify success rate improvement?
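You can also do this comparison in bulk from the raw trace files. A sketch, assuming each trace records `rule_id`, `file`, `line`, `attempt`, and `status` (adjust to the actual schema):

```python
import json
from collections import defaultdict
from pathlib import Path

# Sketch: group traces by finding and check whether later attempts succeeded.
by_finding = defaultdict(list)
for path in Path(".patchpro/traces").glob("*.json"):
    trace = json.loads(path.read_text())
    key = (trace.get("rule_id"), trace.get("file"), trace.get("line"))
    by_finding[key].append((trace.get("attempt"), trace.get("status")))

for key, attempts in by_finding.items():
    if len(attempts) > 1:  # only findings that were retried
        print(key, sorted(attempts))  # e.g. [(1, 'failed'), (3, 'success')]
```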
Scenario: Want to optimize token usage.
Steps:
- Look at Avg Cost in summary
- Open traces with highest cost
- Check Tokens Used and Prompt length
- Identify whether certain rules get unusually verbose prompts
Action: Shorten prompts for high-volume rules.
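A quick way to find those rules is to sum cost per rule across the trace files. A sketch (field names assumed, as before):

```python
import json
from collections import Counter
from pathlib import Path

# Sketch: total spend per rule, to spot rules worth a shorter prompt.
cost_by_rule = Counter()
for path in Path(".patchpro/traces").glob("*.json"):
    trace = json.loads(path.read_text())
    cost_by_rule[trace.get("rule_id")] += trace.get("cost_usd") or 0.0

for rule, cost in cost_by_rule.most_common(10):
    print(f"{rule}: ${cost:.4f}")
```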
Scenario: Should you use batch or single patch mode?
Steps:
- Filter by Strategy = generate_batch_patch → check success rate
- Filter by Strategy = generate_single_patch → check success rate
- Compare costs, latency, and success rate
Current expectation: Batch likely fails more (0% in some cases).
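The same comparison can be scripted against `traces.db`. A sketch assuming `strategy`, `status`, and `cost_usd` columns:

```python
import sqlite3

# Sketch: success rate and average cost per strategy.
conn = sqlite3.connect(".patchpro/traces/traces.db")
rows = conn.execute(
    """
    SELECT strategy,
           COUNT(*),
           AVG(CASE WHEN status = 'success' THEN 1.0 ELSE 0.0 END),
           AVG(cost_usd)
    FROM traces
    GROUP BY strategy
    """
).fetchall()
conn.close()

for strategy, n, success_rate, avg_cost in rows:
    print(f"{strategy}: n={n} success={success_rate:.0%} avg_cost=${avg_cost:.4f}")
```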
Scenario: Building fine-tuning dataset.
Steps:
- Filter by Status = success
- Filter by Rule ID = F401 (or your target rule)
- Open traces with clean patches
- Click "Save as Good Example" (feature coming soon)
Future: Saved examples export to fine-tuning JSON format.
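Until that feature lands, you can roll a rough export yourself. A sketch, assuming each trace stores `system_prompt`, `prompt`, and `patch` fields and targeting a chat-style JSONL format (the built-in export may use a different schema):

```python
import json
from pathlib import Path

# Sketch: turn successful F401 traces into chat-style fine-tuning examples.
with open("finetune_examples.jsonl", "w") as out:
    for path in Path(".patchpro/traces").glob("*.json"):
        trace = json.loads(path.read_text())
        if trace.get("status") != "success" or trace.get("rule_id") != "F401":
            continue
        example = {
            "messages": [
                {"role": "system", "content": trace.get("system_prompt", "")},
                {"role": "user", "content": trace.get("prompt", "")},
                {"role": "assistant", "content": trace.get("patch", "")},
            ]
        }
        out.write(json.dumps(example) + "\n")
```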
Goal: Understand why batch patches fail and fix them.
```
# Filter for batch patch failures
Filter: Strategy = generate_batch_patch, Status = failed
```

Open the first 10 failed batch traces. Look for patterns:
Hypothesis 1: Wrong line numbers in multi-hunk diffs
- Check patches: Are the `@@` hunk headers correct?
- Check validation errors: "patch does not apply to line X"
- Pattern: Second/third hunk has wrong line numbers
Hypothesis 2: LLM corrupts file content
- Check patches: Are unchanged lines modified?
- Check validation errors: "unexpected content at line X"
- Pattern: LLM hallucinates code that wasn't there
Hypothesis 3: Prompt too complex
- Check prompts: How many findings in one request?
- Pattern: Batch of 5+ findings → higher failure rate
Based on hypothesis, implement fix:
- Hypothesis 1 Fix: Add post-processing to recalculate hunk headers (a sketch follows this list)
- Hypothesis 2 Fix: Add instruction "DO NOT modify unchanged lines"
- Hypothesis 3 Fix: Limit batch size to 3 findings max
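For reference, the Hypothesis 1 fix boils down to recounting each hunk's lines and rewriting its header. A sketch (not the pipeline's actual post-processor):

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@(.*)$")

def fix_hunk_counts(patch: str) -> str:
    """Recompute hunk line counts from the hunk body, keeping the start lines.

    Sketch of the Hypothesis 1 fix: a miscounted `@@ -a,b +c,d @@` header
    no longer makes `git apply` reject an otherwise-valid patch.
    """
    lines = patch.splitlines()
    fixed = []
    i = 0
    while i < len(lines):
        match = HUNK_RE.match(lines[i])
        if not match:
            fixed.append(lines[i])
            i += 1
            continue
        # Collect the hunk body: everything up to the next hunk or file header.
        body = []
        j = i + 1
        while j < len(lines) and not HUNK_RE.match(lines[j]) and not lines[j].startswith("--- "):
            body.append(lines[j])
            j += 1
        old_count = sum(1 for line in body if line.startswith((" ", "-")))
        new_count = sum(1 for line in body if line.startswith((" ", "+")))
        fixed.append(f"@@ -{match.group(1)},{old_count} +{match.group(2)},{new_count} @@{match.group(3)}")
        fixed.extend(body)
        i = j
    return "\n".join(fixed) + "\n"
```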
```bash
# Re-run with fix
patchpro analyze-pr --base main --head HEAD --with-llm
```

```
# Compare success rates
Old batch success rate: 0%
New batch success rate: ??%
```

If the success rate improved but is still below the target (70%), repeat the process.
If you have traces from CI workflow run:
```bash
# 1. Download artifact from GitHub Actions
# (Currently: artifact upload has path issue, but traces ARE created)

# 2. Extract to local directory
unzip patchpro-traces.zip -d /path/to/traces

# 3. Launch viewer with custom path
streamlit run trace_viewer.py -- --trace-dir /path/to/traces
```

Live example: Workflow run 18263485405 in patchpro-demo-repo
What to look for:
- Navigate to Actions → run 18263485405
- Check logs for "Agentic mode: True"
- See debug output listing trace files:
  - `.patchpro/traces/F841_example.py_9_1_*.json`
  - `.patchpro/traces/F841_example.py_9_3_*.json`
- Notice `attempt_1` and `attempt_3` for the same finding → retry worked!
To view locally (once artifact upload fixed):
```bash
# Download artifact, then:
streamlit run trace_viewer.py -- --trace-dir ./downloaded-traces
```

Keyboard shortcuts:

- `R` - Rerun app (refresh data)
- `Ctrl+C` in terminal - Stop server
- Browser refresh - Reload page
Cause: Haven't run PatchPro with agentic mode yet.
Fix:

```bash
patchpro analyze-pr --base main --head HEAD --with-llm
```

Cause: Streamlit caches data.
Fix: Press R to rerun app, or refresh browser.
Cause: Observability dependencies not installed.
Fix:

```bash
pip install -e ".[observability]"
```

Cause: Traces in custom directory.

Fix:

```bash
streamlit run trace_viewer.py -- --trace-dir /path/to/traces
```

Planned features:

- Automatic clustering of similar failures
- "Top 5 failure modes" section
- Pattern recognition using embeddings
- Interactive charts (Plotly)
- Cost trends over time
- Cost by rule category
- Latency distribution histograms
- Export selected traces to fine-tuning JSON
- One-click dataset curation
- Human labeling interface
- Agreement scoring with LLM-as-judge
Don't dive into individual traces immediately. Check:
- What's the overall success rate? (Below 50% → systemic issue)
- What's the retry rate? (High → validation often fails)
- What's the cost? (High → prompts too verbose)
Don't look at all traces. Focus on:
- First: Failed traces for most common rule
- Second: Successful traces for same rule (compare)
- Third: Exhausted retries (hardest cases)
One failed trace = edge case. Ten failed traces with same error = systemic issue.
Most bugs are in the LLM's interpretation of the prompt:
- Is the prompt clear?
- Does it include enough context?
- Does the response follow instructions?
If retry fails again, did the LLM:
- Ignore previous error feedback?
- Misunderstand the error?
- Make the same mistake differently?
- PATH_TO_MVP.md: Overall roadmap and Phase 2 plan
- TELEMETRY_PR_TEST_PLAN.md: How telemetry was tested in CI
- DEMO_EVALUATION_GUIDE.md: How to show traces to judges/stakeholders
How do I share traces with the team?
→ Commit `.patchpro/traces/*.json` to git (or zip and share)
Can I query traces programmatically?
→ Yes! Use the `telemetry.PatchTracer.query_traces()` API
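If you'd rather not depend on internal APIs, the SQLite database works for ad-hoc analysis too (column names assumed, as above):

```python
import sqlite3
import pandas as pd

# Sketch: pull all traces into a DataFrame for ad-hoc slicing.
conn = sqlite3.connect(".patchpro/traces/traces.db")
df = pd.read_sql("SELECT * FROM traces", conn)
conn.close()

print(df.groupby("rule_id")["status"].value_counts())
```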
Can I use this in CI?
→ Not yet (Streamlit needs an interactive browser), but you can query the SQLite database in CI scripts

Where's the clustering feature?
→ Coming in Phase 2.2 (current focus: manual exploration)
Last Updated: 2025-10-06
Status: Phase 2.1 Complete
Next: Phase 2.2 (Failure Clustering)