Merged
15 changes: 15 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,21 @@ Format follows [Keep a Changelog](https://keepachangelog.com/). Versions follow

---

## [0.1.4] - 2026-03-19

### Added
- Phase-1 retrieval ranking pipeline with reciprocal rank fusion (RRF), recency boost, and importance weighting controls.
- Retrieval config keys and environment overrides for `rrfK`, `recencyBoost`, `recencyHalfLifeHours`, and `importanceWeight`.
- Foundation and regression coverage for RRF scoring behavior and phase-1 ranking config defaults/overrides.
- New OpenSpec capability `memory-retrieval-ranking-phase1` with archived implementation change record.
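One plausible reading of the new `recencyBoost` / `recencyHalfLifeHours` controls is exponential half-life decay; the actual formula is not specified in this changelog, so treat this sketch as an assumption:

```typescript
// Hypothetical recency decay: a memory's recency factor halves every
// `halfLifeHours` (default 72, matching the documented config default).
function recencyFactor(ageHours: number, halfLifeHours = 72): number {
  return Math.pow(0.5, ageHours / halfLifeHours);
}

// A fresh memory keeps a factor of 1.0; a 72-hour-old memory keeps 0.5.
```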

### Changed
- Hybrid retrieval ranking now fuses vector and BM25 channels via rank-based RRF instead of direct weighted-score summation.
- Main specs for `memory-auto-capture-and-recall` and `memory-provider-config` now include phase-1 ranking requirements.
- Validation and operations docs now include low-feedback interpretation guidance and proxy-metric review workflows.
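The rank-based fusion described above can be sketched as follows; this is an illustration of RRF, not the plugin's actual implementation. Each channel contributes `1 / (rrfK + rank)` per document:

```typescript
// Illustrative RRF fusion: each channel is a best-first list of document ids.
// Ranks are 1-based, so a top hit contributes 1 / (k + 1).
function rrfFuse(channels: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranked of channels) {
    ranked.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}
```

A document that appears in both the vector and BM25 lists accumulates a contribution from each, which is why two mid-rank hits can outrank a single top hit without any score normalization — the property that distinguishes rank-based RRF from direct weighted-score summation.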

---

## [0.1.3] - 2026-03-17

### Added
Expand Down
38 changes: 35 additions & 3 deletions README.md
Expand Up @@ -41,7 +41,11 @@ If you already use other plugins, keep them and append `"lancedb-opencode-pro"`.
"mode": "hybrid",
"vectorWeight": 0.7,
"bm25Weight": 0.3,
"minScore": 0.2
"minScore": 0.2,
"rrfK": 60,
"recencyBoost": true,
"recencyHalfLifeHours": 72,
"importanceWeight": 0.4
},
"includeGlobalScope": true,
"minCaptureChars": 80,
Expand Down Expand Up @@ -173,7 +177,11 @@ Create `~/.config/opencode/lancedb-opencode-pro.json`:
"mode": "hybrid",
"vectorWeight": 0.7,
"bm25Weight": 0.3,
"minScore": 0.2
"minScore": 0.2,
"rrfK": 60,
"recencyBoost": true,
"recencyHalfLifeHours": 72,
"importanceWeight": 0.4
},
"includeGlobalScope": true,
"minCaptureChars": 80,
Expand Down Expand Up @@ -216,6 +224,10 @@ Supported environment variables:
- `LANCEDB_OPENCODE_PRO_VECTOR_WEIGHT`
- `LANCEDB_OPENCODE_PRO_BM25_WEIGHT`
- `LANCEDB_OPENCODE_PRO_MIN_SCORE`
- `LANCEDB_OPENCODE_PRO_RRF_K`
- `LANCEDB_OPENCODE_PRO_RECENCY_BOOST`
- `LANCEDB_OPENCODE_PRO_RECENCY_HALF_LIFE_HOURS`
- `LANCEDB_OPENCODE_PRO_IMPORTANCE_WEIGHT`
- `LANCEDB_OPENCODE_PRO_INCLUDE_GLOBAL_SCOPE`
- `LANCEDB_OPENCODE_PRO_MIN_CAPTURE_CHARS`
- `LANCEDB_OPENCODE_PRO_MAX_ENTRIES_PER_SCOPE`
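For example, the new ranking knobs can be overridden per shell session; the variable names come from the list above, though the exact value parsing is an assumption:

```shell
# Override phase-1 ranking settings for one session (values are illustrative).
export LANCEDB_OPENCODE_PRO_RRF_K=40
export LANCEDB_OPENCODE_PRO_RECENCY_BOOST=true
export LANCEDB_OPENCODE_PRO_RECENCY_HALF_LIFE_HOURS=24
export LANCEDB_OPENCODE_PRO_IMPORTANCE_WEIGHT=0.5
```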
Expand Down Expand Up @@ -298,6 +310,22 @@ Key fields:
- `feedback.falsePositiveRate`: wrong-memory reports divided by stored memories.
- `feedback.falseNegativeRate`: missing-memory reports relative to capture attempts.
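The two ratios above can be sketched directly; the field names follow the list, but the exact denominators used by the plugin are assumptions here:

```typescript
// Hypothetical rate computations mirroring the documented definitions.
function falsePositiveRate(wrongReports: number, storedMemories: number): number {
  return storedMemories === 0 ? 0 : wrongReports / storedMemories;
}

function falseNegativeRate(missingReports: number, captureAttempts: number): number {
  return captureAttempts === 0 ? 0 : missingReports / captureAttempts;
}
```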

### Interpreting Low-Feedback Results

In real OpenCode usage, auto-capture and recall happen in the background, so explicit `memory_feedback_*` events are often sparse.

- Treat `capture.*` and `recall.*` as system-health metrics: they show whether the memory pipeline is running.
- Treat repeated-context reduction, clarification-burden reduction, manual memory rescue rate, correction-signal rate, and sampled recall usefulness as product-value signals: they show whether memory actually helped the user.
- Treat `feedback.* = 0` as insufficient evidence, not proof that memory quality is good.
- Treat a high `recall.hitRate` or `recall.injectionRate` as recall availability only; those values do not prove usefulness by themselves.

Recommended review order in low-feedback environments:

1. Check `capture.successRate`, `capture.skipReasons`, `recall.hitRate`, and `recall.injectionRate` for operational health.
2. Review whether users repeated background context less often or needed fewer clarification turns.
3. Check whether users still needed manual rescue through `memory_search` or issued correction-like responses.
4. Run a bounded audit of recalled memories or skipped captures before concluding the system is helping.
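The review order above can be sketched as a small triage helper; the summary shape mirrors the documented fields, while the threshold values are illustrative assumptions, not project defaults:

```typescript
// Hypothetical step-1 triage over a memory_effectiveness-style summary.
interface EffectivenessSummary {
  capture: { successRate: number; skipReasons: Record<string, number> };
  recall: { hitRate: number; injectionRate: number };
  feedback: { missing: number; wrong: number; useful: number };
}

function triage(s: EffectivenessSummary): string[] {
  const notes: string[] = [];
  if (s.capture.successRate < 0.9) notes.push("capture pipeline degraded"); // assumed threshold
  if (s.recall.hitRate === 0) notes.push("recall returned nothing");
  const labels = s.feedback.missing + s.feedback.wrong + s.feedback.useful;
  if (labels === 0) notes.push("no explicit labels: run a proxy-metric review or sample audit");
  return notes.length > 0 ? notes : ["system health OK; product value still unproven"];
}
```

Note that even the all-clear message stops at "product value still unproven" — steps 2 through 4 remain manual review.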

## OpenAI Embedding Configuration

Default behavior stays on Ollama. To use OpenAI embeddings, set `embedding.provider` to `openai` and provide API key + model.
Expand All @@ -318,7 +346,11 @@ Example sidecar:
"mode": "hybrid",
"vectorWeight": 0.7,
"bm25Weight": 0.3,
"minScore": 0.2
"minScore": 0.2,
"rrfK": 60,
"recencyBoost": true,
"recencyHalfLifeHours": 72,
"importanceWeight": 0.4
},
"includeGlobalScope": true,
"minCaptureChars": 80,
Expand Down
15 changes: 15 additions & 0 deletions docs/VALIDATION_README.md
Expand Up @@ -114,6 +114,21 @@ Documentation & error messages
| Helpful recall rate | Reported | 7 | User feedback |
| False-positive / false-negative counts | Reported | 7 | User feedback |

### Low-Feedback Proxy Metrics
| Metric | Target | Phase | Reference |
|--------|--------|-------|-----------|
| Repeated-context reduction | Reviewed | 7 | Low-feedback framework |
| Clarification burden reduction | Reviewed | 7 | Low-feedback framework |
| Manual memory rescue rate | Reviewed | 7 | Low-feedback framework |
| Correction-signal rate | Reviewed | 7 | Low-feedback framework |
| Sampled recall usefulness | Reviewed | 7 | Low-feedback framework |

Interpretation rules:

- High `recall.hitRate` indicates retrieval availability, not proven usefulness.
- Zero explicit feedback counts indicate missing labels, not confirmed quality, unless a proxy-metric review or sample audit provides evidence otherwise.
- Release review should pair runtime summaries with manual proxy-metric inspection whenever feedback volume is sparse.

---

## 🔍 Critical Tests (Must Pass Before Release)
Expand Down
6 changes: 6 additions & 0 deletions docs/acceptance-checklist.md
Expand Up @@ -22,6 +22,12 @@
- [ ] Users can report whether recalled memory was helpful.
- [ ] Operators can inspect machine-readable effectiveness summary output.

## Low-Feedback Evaluation

- [ ] Operators separate system-health metrics from product-value conclusions.
- [ ] Zero explicit feedback is treated as insufficient signal, not as proof of good memory quality.
- [ ] Proxy metrics or sampled audits are reviewed when explicit feedback is sparse.

## Build And Packaging

- [ ] `docker compose build --no-cache && docker compose up -d` succeeds.
Expand Down
44 changes: 43 additions & 1 deletion docs/memory-validation-checklist.md
Expand Up @@ -348,6 +348,49 @@
- Acceptance: Message explains problem + suggests fix
- Example: "Vector dimension mismatch: expected 384, got 768. Run memory_clear to reset."

### 7.3 Runtime Effectiveness Summary
- [ ] **System-Health Metrics Are Reported**
- Test: Run `memory_effectiveness` after a realistic write/recall workflow
- Measurement: Verify capture success, skip reasons, recall hit rate, and recall injection rate are present
- Acceptance: Summary includes all runtime fields needed to judge operational health

- [ ] **Zero Feedback Is Treated As Unknown Quality**
- Test: Review a summary with sparse or zero `feedback.*` counts
- Measurement: Confirm release guidance does not treat zero counts as success
- Acceptance: Review docs require proxy metrics or sample audits before claiming usefulness

### 7.4 Low-Feedback Proxy Metrics
- [ ] **Repeated-Context Reduction Review**
- Test: Compare follow-up sessions before/after memory use
- Measurement: Whether users repeat less project context manually
- Acceptance: Review process documents whether context repetition decreases, stays flat, or worsens

- [ ] **Clarification Burden Review**
- Test: Inspect conversations after recall injection
- Measurement: Count reminder or context-recovery questions that should have been avoided
- Acceptance: Review process can identify whether memory reduced clarification turns

- [ ] **Manual Memory Rescue Review**
- Test: Inspect whether operators still need `memory_search` after automatic recall
- Measurement: Manual search frequency relative to recall-heavy workflows
- Acceptance: Review process can describe whether automatic recall still required manual rescue

- [ ] **Correction-Signal Review**
- Test: Inspect `memory_feedback_wrong`, `memory_feedback_missing`, and correction-like conversation turns
- Measurement: Frequency of stale, wrong, or irrelevant recall corrections
- Acceptance: Review process can identify whether memory introduced prompt contamination or stale context

### 7.5 Sample Audit Workflow
- [ ] **Sampled Recall Audit**
- Test: Review 10-20 recent recall injections from one active project scope
- Measurement: Classify each as relevant, neutral noise, or misleading
- Acceptance: Audit result is documented before release claims are made in sparse-feedback environments

- [ ] **Sampled Skipped-Capture Audit**
- Test: Review 10-20 skipped captures, especially `no-positive-signal`
- Measurement: Determine whether durable decisions, facts, or preferences were missed
- Acceptance: Audit result identifies whether capture heuristics are too strict for real usage

---

## IMPLEMENTATION ROADMAP
Expand Down Expand Up @@ -478,4 +521,3 @@ async function profileLatency(fn: () => Promise<any>, iterations: number) {
- **Monitor Tail Performance**: p99 latency matters more than average for interactive tools
- **Scope Isolation is Critical**: Multi-project support depends on bulletproof scope enforcement
- **Embedding Provider Abstraction**: Design tests to support future providers (OpenAI, local models, etc.)

29 changes: 29 additions & 0 deletions docs/operations.md
Expand Up @@ -27,6 +27,13 @@
- User feedback is recorded through `memory_feedback_missing`, `memory_feedback_wrong`, and `memory_feedback_useful`.
- Operators can inspect the aggregated machine-readable summary with `memory_effectiveness` for the active project scope.

### System Health vs Product Value

- **System health metrics**: `capture.successRate`, `capture.skipReasons`, `recall.hitRate`, and `recall.injectionRate`.
- **Product value metrics**: repeated-context reduction, clarification burden reduction, manual memory rescue rate, correction-signal rate, and sampled recall usefulness.
- High recall availability means the store can return something; it does not prove that the injected memory helped the conversation.
- Zero `feedback.*` counts mean the workflow lacks direct labels, not that memory quality is confirmed.

### Example Workflow

```text
Expand All @@ -43,3 +50,25 @@ Expected summary fields:
- `recall.requested`, `recall.returnedResults`, `recall.injected`
- `feedback.missing`, `feedback.wrong`, `feedback.useful`
- `feedback.falsePositiveRate`, `feedback.falseNegativeRate`
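Assuming the fields listed above, a summary might look like the following; the shape and values are illustrative, not guaranteed tool output:

```json
{
  "capture": {
    "successRate": 0.93,
    "skipReasons": { "no-positive-signal": 12 }
  },
  "recall": {
    "requested": 40,
    "returnedResults": 31,
    "injected": 28
  },
  "feedback": {
    "missing": 1,
    "wrong": 0,
    "useful": 2,
    "falsePositiveRate": 0.0,
    "falseNegativeRate": 0.02
  }
}
```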

### Low-Feedback Proxy Metrics

Use these proxy metrics when users rarely submit `memory_feedback_*` commands:

| Proxy metric | What it means | Current evidence source |
|---|---|---|
| Repeated-context reduction | Users repeat less project context across sessions or follow-up turns | Manual conversation review; not instrumented yet |
| Clarification burden | Agent asks fewer reminder or context-recovery questions | Manual conversation review; not instrumented yet |
| Manual memory rescue rate | Users still need `memory_search` after automatic recall | Search activity + session review; not instrumented as a dedicated rate |
| Correction-signal rate | Users say the recalled context is wrong, stale, or irrelevant | `memory_feedback_wrong`, `memory_feedback_missing`, or conversation review |
| Sampled recall usefulness | Audited recalled memories appear relevant and actually help move work forward | Sample audit of recalled memories |

### Sample Audit Workflow

When explicit feedback is sparse, run a bounded audit instead of assuming quality:

1. Sample 10-20 recent recall injections from the same project scope.
2. For each sample, inspect the recalled memory text and the next assistant reply.
3. Mark whether the memory was relevant, neutral noise, or misleading.
4. Sample 10-20 skipped captures, especially `no-positive-signal`, and check whether important durable knowledge was missed.
5. Treat the audit as release input alongside `memory_effectiveness`, not as a replacement for runtime metrics.
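Step 3's classification can be tallied with a small helper; this is a sketch for reviewers, not shipped tooling:

```typescript
// Hypothetical audit tally: count labels and expose a misleading-recall rate.
type AuditLabel = "relevant" | "neutral" | "misleading";

function auditSummary(labels: AuditLabel[]) {
  const counts: Record<AuditLabel, number> = { relevant: 0, neutral: 0, misleading: 0 };
  for (const label of labels) counts[label] += 1;
  const misleadingRate = labels.length === 0 ? 0 : counts.misleading / labels.length;
  return { counts, misleadingRate };
}
```

Recording the tally alongside `memory_effectiveness` output gives release reviewers a bounded, repeatable artifact instead of an unstructured impression.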
16 changes: 16 additions & 0 deletions docs/release-readiness.md
Expand Up @@ -77,6 +77,7 @@ Still manual or not yet automated:
- FTS degradation fault injection validation
- embedding-backend-unavailable fault-path validation
- broader phase items outside current change scope (phase 2/5+/scalability extremes)
- low-feedback proxy metrics remain documentation-driven and require reviewer judgment or sampling

## Manual-Only Items (Current)

Expand All @@ -85,6 +86,20 @@ Before archive/ship, retain these as explicit manual checks:
1. Force an FTS-index failure scenario and verify retrieval fallback behavior.
2. Force embedding backend outage and verify hook-level graceful behavior.
3. Run real OpenCode directory-switch scenario end-to-end to validate scope transition behavior in live integration.
4. If explicit `memory_feedback_*` counts are sparse, review proxy metrics or run a bounded audit of recalled memories and skipped captures.

## Low-Feedback Evaluation Guidance

Interpret `memory_effectiveness` in two layers:

- **System health**: capture success, skip reasons, recall hit rate, and recall injection rate.
- **Product value**: repeated-context reduction, clarification burden reduction, manual memory rescue rate, correction-signal rate, and sampled recall usefulness.

Review rules:

- Zero feedback counts are insufficient evidence, not proof of zero defects.
- High `recall.hitRate` or `recall.injectionRate` means memory was available, not necessarily useful.
- When feedback volume is sparse, release reviewers should document either proxy-metric observations or the outcome of a sampled audit.

## Archive / Ship Gate

Expand All @@ -93,3 +108,4 @@ Treat release as ready when all conditions are true:
1. `docker compose exec app npm run verify:full` passes.
2. No new failing items in the manual-only checklist above.
3. Any unresolved manual-only item is explicitly documented in release notes.
4. Sparse-feedback releases include a low-feedback interpretation note or sample-audit outcome.
3 changes: 2 additions & 1 deletion docs/validation-priority-summary.md
Expand Up @@ -276,6 +276,8 @@ async function profileLatency(fn: () => Promise<any>, iterations: number) {
| p99 latency > 1000ms | Profile search algorithm + index structure |
| Scope isolation failure | STOP - data privacy issue, fix before release |
| Vector dimension mismatch | STOP - data integrity issue, fix before release |
| Recall hit rate is high but feedback is near zero | Treat as insufficient evidence; review proxy metrics or run a sample audit |
| Users still repeat background context after recall | Investigate product-value gap even if system-health metrics look good |

---

Expand All @@ -286,4 +288,3 @@ See `memory-validation-checklist.md` for:
- Complete measurement methodology
- Implementation roadmap (4 sprints)
- Success criteria for v0.1.0

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-19
77 changes: 77 additions & 0 deletions openspec/changes/add-low-feedback-memory-evaluation/design.md
@@ -0,0 +1,77 @@
## Context

The current project records capture, recall, and explicit feedback events, then exposes those aggregates through `memory_effectiveness`. This is useful for system-health visibility, but the interaction model is heavily background-driven: auto-capture happens without explicit user action, recall injection happens inside the system prompt, and users usually do not see which memory ids were involved unless they inspect raw memory output. As a result, explicit feedback counts are structurally sparse.

The design challenge is not to replace event metrics, but to redefine how maintainers interpret them. In a low-feedback environment, the project needs a framework that treats explicit feedback as optional high-value evidence while using behavior-based proxy metrics and periodic sample audits as the primary source of product-value assessment.

## Goals / Non-Goals

**Goals:**
- Define a low-feedback evaluation model that separates system-health metrics from user-value metrics.
- Establish proxy metrics that can be reviewed even when `memory_feedback_*` usage is near zero.
- Define sample-audit workflows so maintainers can validate capture and recall quality without requiring continuous user labeling.
- Clarify summary interpretation rules so zero feedback is treated as unknown quality, not success.

**Non-Goals:**
- Redesigning the existing event schema in this change.
- Guaranteeing fully automatic ground-truth measurement of memory usefulness.
- Replacing explicit feedback commands; they remain useful when available.
- Implementing dashboards or analytics infrastructure in this design-only change.

## Decisions

### Decision: Split effectiveness interpretation into system health and product value
The framework will define two separate evaluation layers.

- **System health**: capture success, skip reasons, recall hit rate, recall injection rate.
- **Product value**: repeated-context reduction, clarification burden reduction, manual memory rescue rate, correction-signal rate, and sampled recall usefulness.

Rationale:
- Existing metrics already describe whether the memory pipeline is operational.
- Users need a distinct lens for judging whether memory changed interaction cost in a beneficial way.
- This separation prevents high recall-hit rates from being misread as evidence of usefulness.

Alternatives considered:
- Continue treating a single `memory_effectiveness` summary as a complete quality signal: rejected because it overstates certainty when user feedback is sparse.

### Decision: Treat explicit feedback as sparse high-confidence evidence, not as the main KPI source
Explicit feedback commands remain important, but low feedback volume must be interpreted as insufficient signal.

Rationale:
- Background auto-capture and background recall mean most users cannot easily observe storage or injection moments.
- Sparse feedback is therefore expected even in healthy usage.
- When explicit feedback does exist, it is still high-value evidence and should influence quality review.

Alternatives considered:
- Ignore explicit feedback entirely: rejected because it is the strongest direct signal when present.
- Treat zero feedback as zero defects: rejected because it collapses missing observability into false confidence.

### Decision: Use proxy metrics and sample audits as the default low-feedback evaluation method
Maintainers will review proxy metrics and periodic sampled sessions or events instead of waiting for large volumes of user feedback.

Rationale:
- Proxy metrics can be collected passively from real usage.
- Sample audits allow teams to inspect actual recall usefulness and skipped-capture quality with bounded effort.
- This is more realistic for a background memory system than requiring constant manual annotation.

Alternatives considered:
- Require users to rate every memory interaction: rejected as too disruptive and unlikely to succeed in CLI workflows.

## Risks / Trade-offs

- [Proxy metrics are less direct than explicit labels] -> Mitigation: keep proxy metrics paired with periodic human sample review.
- [Teams may over-interpret high recall hit rates] -> Mitigation: explicitly document that recall availability does not prove usefulness.
- [Sample audits may be inconsistent across reviewers] -> Mitigation: define a lightweight review rubric with fixed questions for captured and recalled examples.
- [Low-feedback evaluation could drift into qualitative opinions] -> Mitigation: anchor reviews in repeatable proxy metrics plus explicit audit checklists.

## Migration Plan

1. Add OpenSpec requirements and design guidance for low-feedback evaluation.
2. Update docs and reporting guidance so maintainers classify metrics into system-health and product-value layers.
3. If needed later, extend runtime tooling to compute or expose the proxy metrics defined here.

## Open Questions

- Which proxy metrics can be derived from existing event streams without adding new runtime instrumentation?
- Should sampled audits focus first on recalled memories, skipped captures, or both?
- What minimum sample size should release reviewers use before drawing quality conclusions in low-feedback projects?