DetTrace

DetTrace finds where execution first diverged — not where it eventually failed.

20-scenario I/O transport corpus. 10,000+ trace validations. 0 failures. Root-cause confidence: 0.93.

C++17 · CMake · GoogleTest · Python · Swift · FastAPI


Why This Exists

Problem: Debugging concurrent and distributed failures is asymmetric: failures are easy to observe, hard to locate. Logs record state changes after they happen. The connection pool exhaustion at t=0.2s that caused everything may not be logged at all — it happened before anything was visibly wrong.

Impact: Engineers debug the terminal symptom, not the root cause. Fixes address what failed last, not what failed first. The same incident recurs.

Proof: Binary search over trace replay isolates the first divergence index — not the last visible failure. 20 scenarios, 10,000+ validations, 0.93 root-cause confidence, 0 false positives in validation corpus.


Latest Project Output

{
  "replay_scenarios": 20,
  "validation_runs": 10000,
  "root_cause_confidence": 0.93,
  "first_divergence_isolated": true,
  "false_positives_in_validation_corpus": 0,
  "top_root_cause": "RetryAmplification"
}

Generated by validation suite. Source: reports/latest/root_cause_report.json


Live Metrics Snapshot

| Signal | Value |
| --- | --- |
| Replay Scenarios | 20 |
| Validation Runs | 10,000+ |
| Root Cause Confidence | 0.93 |
| GoogleTest | 47 passing · ASan clean · UBSan clean |

Generated from the latest report artifact in reports/latest/.


What This Proves

✓ Debugging
✓ Systems Engineering
✓ Failure Analysis
✓ Tooling

Screenshots

[Screenshot: control loop scenarios (delayed sensor, actuator saturation)]

Expected vs observed — divergence at index 5:

Index:  0    1    2    3    4   [5]   6    7    8    9
                                 ▲
Expected: ●────●────●────●────●────●────●────●────●────●
                                 │  ← first divergence
Observed: ●────●────●────●────●────✕────✕────✕────✕────✕

  Events 0–4:  correct execution
  Event  5:    root cause  ← debug here
  Events 6–9:  downstream consequence  ← what logs show
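
The search for the first divergence can be sketched in a few lines of Python. This is a minimal illustration, not DetTrace's C++17 engine: `prefix_digests` and `first_divergence` are hypothetical names, and comparing hashed prefixes is an assumed way to make the "traces still match here" check monotone so that binary search applies. Every event other than the two at index 5 is made up for the example.

```python
import hashlib
from typing import List, Optional


def prefix_digests(trace: List[str]) -> List[bytes]:
    """digests[k] hashes the first k events; digests[0] covers the empty prefix."""
    h = hashlib.sha256()
    digests = [h.digest()]
    for event in trace:
        h.update(event.encode("utf-8") + b"\n")
        digests.append(h.digest())
    return digests


def first_divergence(expected: List[str], observed: List[str]) -> Optional[int]:
    """Return the smallest index where the traces differ, or None if identical."""
    exp, obs = prefix_digests(expected), prefix_digests(observed)
    n = min(len(expected), len(observed))
    if exp[n] == obs[n]:
        # Shared prefix matches entirely; a length mismatch diverges at index n.
        return None if len(expected) == len(observed) else n
    lo, hi = 0, n  # prefix of length lo matches, prefix of length hi differs
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exp[mid] == obs[mid]:
            lo = mid
        else:
            hi = mid
    return lo  # the event at index lo is the first one that differs


# Hypothetical 10-event traces that part ways at index 5.
expected = [f"STEP {i}" for i in range(5)] + [
    "TASK_DEQUEUED task=1 worker=0", "STEP 6", "STEP 7", "STEP 8", "STEP 9",
]
observed = expected[:5] + [
    "TASK_DEQUEUED task=2 worker=0", "BAD 6", "BAD 7", "BAD 8", "BAD 9",
]
print(first_divergence(expected, observed))  # → 5
```

Once a prefix differs, every longer prefix differs too, so the check flips exactly once and bisection is valid. In this sketch the digests are built in a linear pass, so the search only pays off when prefix digests or replay checkpoints are already available.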

Decision Contract

{
  "first_divergence_index": 5,
  "expected_event": "TASK_DEQUEUED task=1 worker=0",
  "actual_event":   "TASK_DEQUEUED task=2 worker=0",
  "divergence_type": "ordering_divergence",
  "root_cause_confidence": 0.93,
  "downstream_events_explained": 4,
  "debug_recommendation": "Investigate event at index 5. Events 6–9 are downstream consequences."
}
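
A report of this shape could be assembled from the two traces roughly as follows. This is a sketch under stated assumptions: `build_contract` is a hypothetical helper, the `ordering_divergence` test (the observed event appears elsewhere in the expected trace) is a guessed heuristic, `value_divergence` is an invented fallback label, and `downstream_events_explained` is assumed to count every observed event after the divergence. `root_cause_confidence` is left out because the source does not say how it is computed.

```python
import json
from typing import Dict, List, Optional


def build_contract(expected: List[str], observed: List[str]) -> Optional[Dict[str, object]]:
    """Build a decision-contract dict, or None if the shared prefix never diverges."""
    # Linear scan is enough for a small illustration.
    idx = next((i for i, (e, o) in enumerate(zip(expected, observed)) if e != o), None)
    if idx is None:
        return None
    actual = observed[idx]
    # Assumed heuristic: an event that exists in the baseline but arrived
    # at the wrong position is treated as a reordering.
    kind = "ordering_divergence" if actual in expected else "value_divergence"
    downstream = len(observed) - idx - 1
    recommendation = f"Investigate event at index {idx}."
    if downstream > 0:
        recommendation += (
            f" Events {idx + 1}\u2013{len(observed) - 1} are downstream consequences."
        )
    return {
        "first_divergence_index": idx,
        "expected_event": expected[idx],
        "actual_event": actual,
        "divergence_type": kind,
        "downstream_events_explained": downstream,
        "debug_recommendation": recommendation,
    }


# Hypothetical scheduler trace; only the two index-5 events come from the report above.
expected = [
    "TASK_ENQUEUED task=0", "TASK_ENQUEUED task=1", "TASK_ENQUEUED task=2",
    "TASK_DEQUEUED task=0 worker=0", "TASK_DONE task=0 worker=0",
    "TASK_DEQUEUED task=1 worker=0", "TASK_DONE task=1 worker=0",
    "TASK_DEQUEUED task=2 worker=0", "TASK_DONE task=2 worker=0",
    "WORKER_IDLE worker=0",
]
observed = expected[:5] + [
    "TASK_DEQUEUED task=2 worker=0", "TASK_DONE task=2 worker=0",
    "TASK_DEQUEUED task=1 worker=0", "TASK_DONE task=1 worker=0",
    "WORKER_IDLE worker=0",
]
print(json.dumps(build_contract(expected, observed), indent=2, ensure_ascii=False))
```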

Architecture

I/O trace input  (SPI · I2C · UART · GPIO · OTEL spans · JSONL)
      │
      ▼
DetTrace Replay Engine (C++17)
  1. Generate expected trace  (deterministic baseline)
  2. Run divergent execution  (failure scenario)
  3. Guarded replay + invariant checking
  4. Binary search → first divergence index
      │
      ▼
Swift analysis layer  (async/await · actor isolation)
  concurrent processing of large corpora — no analysis-time races
      │
      ▼
Output artifacts
  divergence_report.json  ·  timeline.html  ·  operator_runbook.md
      │
      ▼
DetTrace++ API  →  /ingest/otel  ·  /timeline/<id>  ·  /search
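
Step 3 of the replay engine, guarded replay with invariant checking, might look like the following in miniature. The invariant (a task is dequeued only after it was enqueued, and only once), the `TASK_ENQUEUED` event name, and the function `first_invariant_violation` are all assumptions for illustration; the source only shows `TASK_DEQUEUED` events.

```python
from typing import List, Optional, Set


def first_invariant_violation(trace: List[str]) -> Optional[int]:
    """Replay events in order and return the index of the first invariant breach."""
    enqueued: Set[str] = set()
    dequeued: Set[str] = set()
    for i, event in enumerate(trace):
        parts = event.split()
        if not parts:
            continue  # skip blank lines in the trace
        kind = parts[0]
        # Parse "key=value" fields; tokens without "=" are ignored.
        fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        task = fields.get("task", "")
        if kind == "TASK_ENQUEUED":
            enqueued.add(task)
        elif kind == "TASK_DEQUEUED":
            # Assumed invariant: dequeue only what was enqueued, and only once.
            if task not in enqueued or task in dequeued:
                return i
            dequeued.add(task)
    return None


good = ["TASK_ENQUEUED task=1", "TASK_DEQUEUED task=1 worker=0"]
bad = [
    "TASK_ENQUEUED task=1",
    "TASK_DEQUEUED task=2 worker=0",  # task 2 was never enqueued at this point
    "TASK_ENQUEUED task=2",
]
print(first_invariant_violation(good))  # → None
print(first_invariant_violation(bad))   # → 1
```

An invariant check like this needs no baseline trace, so it can flag a bad event even when no expected trace is available for comparison.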

Validation

| Transport | Scenario | First Divergence |
| --- | --- | --- |
| SPI | Transfer timeout during init | Index 4: SPI_TRANSFER_TIMEOUT |
| I2C | ACK failure on sensor read | Index 7: I2C_NACK |
| UART | Framing error corrupts command | Index 2: UART_FRAMING_ERROR |
| GPIO | Interrupt race on shared pin | Index 5: GPIO_INTERRUPT_RACE |
| Distributed | Retry storm, auth service | Index 3: connection_pool_exhausted |
| Control loop | Delayed sensor | Step 38 / 3.9s |
| Control loop | Actuator saturation | Step 53 / 5.4s |

Repository Structure

dettrace/
├── src/               C++17 replay engine
├── dettrace-swift/    Swift async/await analysis layer
├── protocol_diag/     I/O transport scenario traces
├── tui/               CLI replay explorer
├── dettrace_platform/ FastAPI API + /timeline endpoint
├── docs/              Screenshots · architecture · case study
└── reports/           Divergence reports + timelines

Run Locally

git clone https://github.com/kritibehl/dettrace && cd dettrace
make demo    # full corpus + visual timeline
make test    # 47 GoogleTest cases · 0 failures
make report  # → reports/latest/

Tests

make test
# → 47 tests passing
# Root cause confidence: 0.93
# False positives in validation corpus: 0
# AddressSanitizer: clean
# UndefinedBehaviorSanitizer: clean

Tradeoffs

Determinism required. Expected trace generation assumes deterministic execution. Non-deterministic systems would require consensus baselines from multiple runs.
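
One way to build such a consensus baseline is a per-index majority vote across runs. The sketch below assumes runs are aligned by event index and that a strict majority is the quorum; `consensus_baseline` and its `quorum` parameter are hypothetical, not part of DetTrace.

```python
from collections import Counter
from typing import List, Optional


def consensus_baseline(runs: List[List[str]], quorum: float = 0.5) -> List[Optional[str]]:
    """Majority vote per index; None marks positions where no event exceeds the quorum."""
    if not runs:
        return []
    length = min(len(run) for run in runs)  # only compare the shared prefix
    baseline: List[Optional[str]] = []
    for i in range(length):
        event, count = Counter(run[i] for run in runs).most_common(1)[0]
        baseline.append(event if count / len(runs) > quorum else None)
    return baseline


runs = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "X", "C"],
]
print(consensus_baseline(runs))  # → ['A', 'B', 'C']
```

Positions that come back as `None` have no stable expected event, so a divergence reported there would deserve less trust than one at a position where all runs agree.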

False positive rate: ~7%. In about 7% of cases, the identified first-divergence event is a coincidental deviation rather than the causal root cause. Downstream confidence scoring catches most of these.

Firmware replay is trace-driven, not hardware-level. The SPI/I2C/UART/GPIO scenarios replay event sequences — not register-level hardware simulation.


What This Project Does Not Claim

  • Firmware scenarios are trace simulations — not driver, kernel, or embedded firmware implementations
  • Swift layer performs trace analysis — not a CoreBluetooth or IOKit integration
  • DetTrace++ is a proof-of-concept API — not a production incident management system
  • Control-loop scenarios are replay debugging — not avionics, GNC, or safety-critical control systems


---

## Selected Outputs

**Timeout cascade scenario:**

```json
{
  "scenario": "timeout_cascade",
  "first_divergence_step": 12,
  "first_divergence_timestamp": "1.2s",
  "root_cause_class": "LatencyInflation",
  "confidence": 0.95,
  "propagation_path": ["edge-service", "api-service", "auth-service"],
  "artifact": "artifacts/timeout_cascade_divergence.json"
}
```
