## Context

No existing ADF mechanism measures whether agents genuinely understand the codebases they operate on, or merely execute surface-level patterns. Theory of Code Space (ToCS, arXiv:2603.00601) provides a 4-dimension evaluation framework for exactly this question.
## Proposal

Implement ToCS-inspired evaluation to measure ADF agent effectiveness across four dimensions:
## Evaluation Dimensions

| Dimension | What It Measures | Metric |
|---|---|---|
| Construct | Does the agent build an accurate dependency map? | Edge F1 by type (IMPORTS, CALLS_API, REGISTRY_WIRES, DATA_FLOWS_TO) |
| Revise | Does the agent update beliefs when code changes? | Belief revision score (delta accuracy after a code change) |
| Exploit | Can the agent predict the impact of changes? | Counterfactual probe accuracy |
| Constraints | Does the agent discover architectural rules? | Invariant discovery F1 vs CLAUDE.md/domain model rules |
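As a sketch of how the Construct metric could be scored, assuming both the agent's probe and the ground truth are normalised to `(source, target, type)` triples (the triple representation is an assumption for illustration, not part of ToCS itself):

```python
def edge_f1_by_type(predicted, truth):
    """Compute precision/recall/F1 per edge type.

    Edges are (source, target, type) triples, e.g.
    ("mod_a", "mod_b", "IMPORTS").
    """
    types = {e[2] for e in predicted} | {e[2] for e in truth}
    scores = {}
    for t in types:
        p = {e for e in predicted if e[2] == t}
        g = {e for e in truth if e[2] == t}
        tp = len(p & g)  # edges the agent got exactly right for this type
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        scores[t] = {"precision": prec, "recall": rec, "f1": f1}
    return scores
```

Scoring per type (rather than pooled) matters because, per the ToCS findings below, IMPORTS edges dominate the edge population and would otherwise mask weakness on the semantically harder CALLS_API and DATA_FLOWS_TO edges.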
## Implementation Phases
- Phase 0: Run ToCS benchmark against terraphim-ai workspace with current agents (baseline)
- Phase 1: Add periodic cognitive map probing -- every N tool calls, externalise understanding as structured JSON
- Phase 2: Compare probes against ground truth (KG-derived dependency graph) to compute scores
- Phase 3: Feed scores to NightwatchMonitor as new signal type (alert on degradation)
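Phase 3's degradation alert could be as simple as comparing each new dimension score against the best score in a recent window. A minimal sketch, assuming a per-dimension detector; the class name, threshold, and window size are illustrative, and the actual NightwatchMonitor signal interface is not specified here:

```python
class DegradationDetector:
    """Flags when a ToCS dimension score drops sharply vs its recent best."""

    def __init__(self, drop_threshold=0.15, window=5):
        self.drop_threshold = drop_threshold  # absolute drop that triggers an alert
        self.window = window                  # how many past scores to compare against
        self.scores = []

    def observe(self, score):
        """Record a score; return True if it should raise a degradation alert."""
        recent = self.scores[-self.window:]
        self.scores.append(score)
        if not recent:
            return False  # first observation: nothing to compare against
        return (max(recent) - score) >= self.drop_threshold
```

One detector per (agent, dimension) pair would let the monitor distinguish, say, a collapse in Revise scores from stable Construct scores.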
## Cognitive Map Probing

- Injected via `PreToolUse` hooks (Agent SDK) or system messages (subprocess)
- Agent outputs structured JSON: nodes (modules), edges (dependencies, typed), confidence scores
- Compared against ground truth from terraphim KG + tree-sitter analysis
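A minimal sketch of what a probe payload and its normalisation might look like. The JSON field names, the crate paths, and the `min_confidence` cutoff are illustrative assumptions, not a defined ADF schema:

```python
import json

# Hypothetical probe an agent might emit when asked to externalise its
# cognitive map (module paths and field names are illustrative only).
probe = json.loads("""
{
  "nodes": [
    {"id": "crates/terraphim_service", "kind": "module"},
    {"id": "crates/terraphim_types", "kind": "module"}
  ],
  "edges": [
    {"src": "crates/terraphim_service",
     "dst": "crates/terraphim_types",
     "type": "IMPORTS",
     "confidence": 0.9}
  ]
}
""")

def edges_as_set(probe, min_confidence=0.5):
    """Keep only edges the agent asserts with reasonable confidence,
    normalised to (source, target, type) triples for comparison."""
    return {
        (e["src"], e["dst"], e["type"])
        for e in probe["edges"]
        if e.get("confidence", 1.0) >= min_confidence
    }
```

Normalising to triples makes the probe directly comparable (set intersection) against the KG + tree-sitter ground truth graph.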
## Key Insights from ToCS Research
- Aho-Corasick automata cover ~67% of edges (IMPORTS level)
- CALLS_API (~17%) and DATA_FLOWS_TO (~7%) require semantic understanding
- Some models show "catastrophic belief collapse" -- losing knowledge between probes
- Evaluation framework should be built BEFORE KG enrichment (measure first, improve later)
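Belief stability between successive probes can be approximated as edge retention; a score falling toward zero is the collapse pattern. A sketch, again assuming edges normalised to `(source, target, type)` triples:

```python
def belief_stability(prev_edges, curr_edges):
    """Fraction of previously asserted edges still present in the next probe.

    A sharp drop between successive probes is the "catastrophic belief
    collapse" pattern: the agent silently loses knowledge it once held.
    """
    if not prev_edges:
        return 1.0  # nothing asserted before, so nothing could be lost
    return len(prev_edges & curr_edges) / len(prev_edges)
```

Note this deliberately ignores newly added edges; gaining knowledge is scored by the Construct metric, while this metric only measures what was lost.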
## Sub-issues (to be created during design phase)
- Run ToCS baseline against terraphim-ai workspace
- Implement cognitive map probe injection and collection
- Implement belief stability monitoring (successive probe comparison)
- Integrate evaluation scores with NightwatchMonitor
- KG enrichment with tree-sitter call graph (after baseline confirms gap)
## References
- cto-executive-system/knowledge/external/context-engineering/tocs-theory-of-code-space-benchmark.md
- cto-executive-system/plans/tocs-terraphim-ai-evaluation-plan.md
- cto-executive-system/plans/adf-architecture-improvements.md (item 3.1)