DetTrace finds where execution first diverged — not where it eventually failed.
20-scenario I/O transport corpus. 10,000+ trace validations. 0 failures. Root-cause confidence: 0.93.
C++17 · CMake · GoogleTest · Python · Swift · FastAPI
Problem: Debugging concurrent and distributed failures is asymmetric: failures are easy to observe, hard to locate. Logs record state changes after they happen. The connection pool exhaustion at t=0.2s that caused everything may not be logged at all — it happened before anything was visibly wrong.
Impact: Engineers debug the terminal symptom, not the root cause. Fixes address what failed last, not what failed first. The same incident recurs.
Proof: Binary search over trace replay isolates the first divergence index — not the last visible failure. 20 scenarios, 10,000+ validations, 0.93 root-cause confidence, 0 false positives in validation corpus.
- Latest report: `reports/latest/root_cause_report.json`
- Tests: `make test`
- Demo: `make demo`
```json
{
  "replay_scenarios": 20,
  "validation_runs": 10000,
  "root_cause_confidence": 0.93,
  "first_divergence_isolated": true,
  "false_positives_in_validation_corpus": 0,
  "top_root_cause": "RetryAmplification"
}
```

Generated by validation suite. Source: `reports/latest/root_cause_report.json`
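A summary like this lends itself to an automated gate. The sketch below is one assumed way to consume it, not part of DetTrace: `check_report` and both thresholds are hypothetical, and only the field names come from the report shown above.

```python
import json

# Hypothetical thresholds for illustration; the project does not document a gate.
MIN_CONFIDENCE = 0.9
MAX_FALSE_POSITIVES = 0


def check_report(report: dict) -> list[str]:
    """Return human-readable problems; an empty list means the report passes."""
    problems = []
    if not report.get("first_divergence_isolated", False):
        problems.append("first divergence was not isolated")
    if report.get("root_cause_confidence", 0.0) < MIN_CONFIDENCE:
        problems.append("root-cause confidence below threshold")
    if report.get("false_positives_in_validation_corpus", 0) > MAX_FALSE_POSITIVES:
        problems.append("false positives present in validation corpus")
    return problems


# Inline copy of the summary above, so the sketch runs without the repository.
SAMPLE = json.loads("""
{
  "replay_scenarios": 20,
  "validation_runs": 10000,
  "root_cause_confidence": 0.93,
  "first_divergence_isolated": true,
  "false_positives_in_validation_corpus": 0,
  "top_root_cause": "RetryAmplification"
}
""")

if __name__ == "__main__":
    print(check_report(SAMPLE) or "report passes")
```

In a checkout of the repository, the same function could be fed `json.load()` output from `reports/latest/root_cause_report.json` instead of the inline sample.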
| Signal | Value |
|---|---|
| Replay Scenarios | 20 |
| Validation Runs | 10,000+ |
| Root Cause Confidence | 0.93 |
| GoogleTest | 47 passing · ASan clean · UBSan clean |
Generated from the latest report artifact in reports/latest/.
✓ Debugging
✓ Systems Engineering
✓ Failure Analysis
✓ Tooling
Figures (images not included): Control Loop, Delayed Sensor · Actuator Saturation
Expected vs observed — divergence at index 5:
```text
Index:    0    1    2    3    4   [5]   6    7    8    9
                                   ▲
Expected: ●────●────●────●────●────●────●────●────●────●
                                   │ ← first divergence
Observed: ●────●────●────●────●────✕────✕────✕────✕────✕

Events 0–4:  correct execution
Event 5:     root cause              ← debug here
Events 6–9:  downstream consequence  ← what logs show
```
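The partition in the diagram can be expressed directly in code. This is a minimal sketch using a plain linear scan, assuming traces are lists of event strings. `build_report` is a hypothetical helper, and `downstream_events` is an illustrative field that only counts the events after the divergence; it does not reproduce DetTrace's confidence scoring.

```python
from typing import Optional, Sequence


def build_report(expected: Sequence[str], observed: Sequence[str]) -> Optional[dict]:
    """Locate the first index where the traces differ and summarize it."""
    common = min(len(expected), len(observed))
    idx = next((i for i in range(common) if expected[i] != observed[i]), None)
    if idx is None:
        if len(expected) == len(observed):
            return None  # traces are identical, nothing to report
        idx = common  # one trace ended early; the divergence is the first missing event
    return {
        "first_divergence_index": idx,
        "expected_event": expected[idx] if idx < len(expected) else None,
        "actual_event": observed[idx] if idx < len(observed) else None,
        # Everything after the divergence is treated as consequence, not cause.
        "downstream_events": max(len(observed) - idx - 1, 0),
    }


if __name__ == "__main__":
    expected = [f"TASK_DEQUEUED task={i} worker=0" for i in range(10)]
    # Same first five events, then every later event differs (as in the diagram).
    observed = expected[:5] + [f"TASK_DEQUEUED task={i + 1} worker=0" for i in range(5, 10)]
    print(build_report(expected, observed))
```

With the ten-event traces in the usage block, the scan stops at index 5 and counts four downstream events, matching the picture above.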
```json
{
  "first_divergence_index": 5,
  "expected_event": "TASK_DEQUEUED task=1 worker=0",
  "actual_event": "TASK_DEQUEUED task=2 worker=0",
  "divergence_type": "ordering_divergence",
  "root_cause_confidence": 0.93,
  "downstream_events_explained": 4,
  "debug_recommendation": "Investigate event at index 5. Events 6–9 are downstream consequences."
}
```

```text
I/O trace input (SPI · I2C · UART · GPIO · OTEL spans · JSONL)
        │
        ▼
DetTrace Replay Engine (C++17)
  1. Generate expected trace (deterministic baseline)
  2. Run divergent execution (failure scenario)
  3. Guarded replay + invariant checking
  4. Binary search → first divergence index
        │
        ▼
Swift analysis layer (async/await · actor isolation)
  concurrent processing of large corpora — no analysis-time races
        │
        ▼
Output artifacts
  divergence_report.json · timeline.html · operator_runbook.md
        │
        ▼
DetTrace++ API → /ingest/otel · /timeline/<id> · /search
```
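Step 4 of the replay engine, the binary search, can be sketched in a few lines. This is an illustration in Python, not the C++17 engine's code. It assumes divergence is monotone (once a prefix fails to replay cleanly, every longer prefix fails too) and that the empty prefix always replays cleanly. `prefix_ok` is a hypothetical stand-in for a guarded replay of the first `k` events, and `make_probe` is a toy version that just compares prefixes.

```python
from typing import Callable, Optional, Sequence


def first_divergence(n_events: int, prefix_ok: Callable[[int], bool]) -> Optional[int]:
    """Binary-search the smallest event index whose inclusion breaks the replay.

    prefix_ok(k) must report whether the first k events replay cleanly, and it
    must be monotone: once a prefix fails, every longer prefix fails too.
    """
    if prefix_ok(n_events):
        return None  # the whole trace replays cleanly
    lo, hi = 0, n_events  # invariant: prefix_ok(lo) is True, prefix_ok(hi) is False
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid
        else:
            hi = mid
    return lo  # the prefix of length lo is clean, so event lo is the first bad one


def make_probe(expected: Sequence[str], observed: Sequence[str]):
    """Toy stand-in for guarded replay: compare prefixes and count probes."""
    calls = {"count": 0}

    def prefix_ok(k: int) -> bool:
        calls["count"] += 1
        return list(expected[:k]) == list(observed[:k])

    return prefix_ok, calls


if __name__ == "__main__":
    expected = [f"EVENT_{i}" for i in range(10)]
    observed = expected[:5] + [f"BAD_{i}" for i in range(5, 10)]
    probe, calls = make_probe(expected, observed)
    print(first_divergence(len(expected), probe), calls["count"])
```

The toy probe compares whole prefixes, so it gains nothing over a linear scan. The search pays off when each probe is an expensive replay, because the number of probes grows logarithmically with trace length.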
| Transport | Scenario | First Divergence |
|---|---|---|
| SPI | Transfer timeout during init | Index 4 — SPI_TRANSFER_TIMEOUT |
| I2C | ACK failure on sensor read | Index 7 — I2C_NACK |
| UART | Framing error corrupts command | Index 2 — UART_FRAMING_ERROR |
| GPIO | Interrupt race on shared pin | Index 5 — GPIO_INTERRUPT_RACE |
| Distributed | Retry storm, auth service | Index 3 — connection_pool_exhausted |
| Control loop | Delayed sensor | Step 38 / 3.9s |
| Control loop | Actuator saturation | Step 53 / 5.4s |
```text
dettrace/
├── src/                 C++17 replay engine
├── dettrace-swift/      Swift async/await analysis layer
├── protocol_diag/       I/O transport scenario traces
├── tui/                 CLI replay explorer
├── dettrace_platform/   FastAPI API + /timeline endpoint
├── docs/                Screenshots · architecture · case study
└── reports/             Divergence reports + timelines
```
```bash
git clone https://github.com/kritibehl/dettrace && cd dettrace
make demo      # full corpus + visual timeline
make test      # 47 GoogleTest cases · 0 failures
make report    # → reports/latest/
```

```bash
make test
# → 47 tests passing
# Root cause confidence: 0.93
# False positives in validation corpus: 0
# AddressSanitizer: clean
# UndefinedBehaviorSanitizer: clean
```

Determinism required. Expected trace generation assumes deterministic execution. Non-deterministic systems would require consensus baselines from multiple runs.
False positive rate: ~7%. In 7% of cases the identified first-divergence event is a coincidental deviation rather than causal root cause. Downstream confidence scoring catches most of these.
Firmware replay is trace-driven, not hardware-level. The SPI/I2C/UART/GPIO scenarios replay event sequences — not register-level hardware simulation.
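The consensus baseline mentioned under the determinism limitation is not implemented in DetTrace as far as this README states. One way it might look is a per-index majority vote across several runs. The sketch below is purely illustrative: `consensus_baseline` and its `quorum` parameter are hypothetical, and it assumes runs are already aligned index by index, which real non-deterministic traces may not be.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_baseline(
    runs: Sequence[Sequence[str]], quorum: float = 0.5
) -> list[Optional[str]]:
    """Per index, keep the event that strictly more than `quorum` of runs agree on.

    Indices without sufficient agreement become None, marking positions where
    no trustworthy expected event exists.
    """
    if not runs:
        return []
    length = min(len(run) for run in runs)  # compare only the shared prefix
    baseline: list[Optional[str]] = []
    for i in range(length):
        event, votes = Counter(run[i] for run in runs).most_common(1)[0]
        baseline.append(event if votes / len(runs) > quorum else None)
    return baseline


if __name__ == "__main__":
    runs = [
        ["A", "B", "C"],
        ["A", "B", "D"],
        ["A", "X", "C"],
    ]
    print(consensus_baseline(runs))
```

A divergence search against such a baseline would have to skip `None` positions, since a mismatch there says nothing about the root cause.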
- Firmware scenarios are trace simulations — not driver, kernel, or embedded firmware implementations
- Swift layer performs trace analysis — not a CoreBluetooth or IOKit integration
- DetTrace++ is a proof-of-concept API — not a production incident management system
- Control-loop scenarios are replay debugging — not avionics, GNC, or safety-critical control systems
- `docs/case_study.md`: Problem · Design · Validation · Tradeoffs
- `docs/interview_walkthrough.md`: 60s · 3min · 10min explanations
---
## Selected Outputs
**Timeout cascade scenario:**
```json
{
"scenario": "timeout_cascade",
"first_divergence_step": 12,
"first_divergence_timestamp": "1.2s",
"root_cause_class": "LatencyInflation",
"confidence": 0.95,
"propagation_path": ["edge-service", "api-service", "auth-service"],
"artifact": "artifacts/timeout_cascade_divergence.json"
}