
feat(memory): ADR-147 entity arm + signal provenance in hybridSearch #2327

Merged
ruvnet merged 1 commit into main from feat/adr-147-entity-signal-2317
Jun 8, 2026
@ruvnet commented Jun 8, 2026


Closes part of #2317 and references #2324.

What this lands

Implements the actual gap from the ADR-147 multi-signal retrieval proposal: the third RRF arm (entity matching) + per-result signal provenance.

The ADR's stated P1 ("wire FTS5 + RRF fusion") turned out to be already shipped: controller-registry.ts:713 already runs semanticSearch() + searchKeyword() in parallel, fuses via applyRRF(k=60), and diversifies via applyMMR(λ=0.7). The dream-cycle author missed this. The actual gap was the entity arm (ADR P2) and the missing signals field.
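For readers unfamiliar with the fusion step, here is a minimal sketch of Reciprocal Rank Fusion with k = 60. The names `rrfFuse` and `Ranked` are hypothetical and chosen for illustration; this is not the repository's applyRRF, whose exact signature is not shown in this PR.

```typescript
// Minimal Reciprocal Rank Fusion sketch. `rrfFuse` and `Ranked` are
// hypothetical names for illustration, not the repository's applyRRF.
type Ranked = { id: string };

function rrfFuse(arms: Ranked[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const arm of arms) {
    arm.forEach((item, index) => {
      // Ranks are 1-based, so the top hit of an arm contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1);
      scores.set(item.id, (scores.get(item.id) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

const dense: Ranked[] = [{ id: "a" }, { id: "b" }, { id: "c" }];
const sparse: Ranked[] = [{ id: "b" }, { id: "d" }];
const fused = rrfFuse([dense, sparse]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'd', 'c' ]
```

An id that appears in several arms accumulates one contribution per arm, which is why "b" outranks "a" here even though "a" is the top dense hit.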

Changes

| File | Lines | What |
| --- | --- | --- |
| src/entity-tagger.ts (new) | +71 | Regex extractor for emails, URLs, file paths (POSIX + Windows), quoted phrases, proper-noun 2-grams |
| src/entity-tagger.test.ts (new) | +94 | 12 unit tests pinning conservatism: false negatives OK, false positives bad |
| src/controller-registry.ts | +63 / −21 | hybridSearch controller gains the entity arm (per-entity searchKeyword in parallel) + builds signals set membership before RRF |
| src/graceful-retrieval.test.ts | +55 / −0 | Provenance assertion on existing test + new needle-in-haystack test (Alice Smith in 30 generic auth entries) |

Design notes

  • Conservative tagger. False negatives are fine (the dense + sparse arms cover them); false positives would dilute RRF. Tests pin generic prose → empty, single capital words → empty, and/or → empty. Key bug found and fixed: the original quote regex paired the closing " of one phrase with the opening " of the next (for the input "a" over "b", it captured the text between the two phrases, " over "). Fixed with (?<!\w)..(?!\w) lookarounds.
  • Empty arm bypass. If extractEntities(query) returns nothing, the entity arm is dropped from the RRF input entirely rather than passed as []. Avoids diluting the fusion when there are no entities.
  • Provenance via pre-fusion set membership. Build denseIds, sparseIds, entityIds from the candidates BEFORE RRF, then stamp signals[] on each fused output by checking which sets contain the candidate id. Doesn't require modifying applyRRF.
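The quote-pairing bug in the first note can be reproduced with a small sketch. The two patterns below, including the two-character minimum on the phrase body, are assumptions made for illustration; the regexes actually shipped in src/entity-tagger.ts are not shown in this PR. Only the lookaround shape `(?<!\w)..(?!\w)` comes from the description.

```typescript
// Illustrative only: these patterns and the 2-character minimum are
// assumptions, not the regexes shipped in src/entity-tagger.ts.
const NAIVE_QUOTED = /"([^"]{2,})"/g;
// An opening quote may not follow a word character, and a closing quote
// may not precede one.
const GUARDED_QUOTED = /(?<!\w)"([^"]{2,})"(?!\w)/g;

function quotedPhrases(text: string, pattern: RegExp): string[] {
  // Work on a fresh copy so lastIndex state is never shared between calls.
  const re = new RegExp(pattern.source, "g");
  const phrases: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = re.exec(text)) !== null) {
    phrases.push(match[1]);
  }
  return phrases;
}

const input = '"a" over "b"';
const naive = quotedPhrases(input, NAIVE_QUOTED);
const guarded = quotedPhrases(input, GUARDED_QUOTED);
const positive = quotedPhrases('find "hello world" now', GUARDED_QUOTED);

console.log(naive); // [ ' over ' ]  (closing quote of "a" paired with opening quote of "b")
console.log(guarded); // []
console.log(positive); // [ 'hello world' ]
```

In this sketch the naive pattern skips the too-short "a", then restarts at its closing quote and swallows the text up to the next quote. The lookbehind rejects that restart because the closing quote follows the word character `a`.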

Validation

  • npx vitest run src/entity-tagger.test.ts src/graceful-retrieval.test.ts: 16/16 pass
  • Full memory suite — 416/420 pass. The 4 failures are pre-existing Windows-environment issues in unrelated files (agent-memory-scope.test.ts path separators, benchmark.test.ts perf budget). My branch doesn't touch any of those files.

What this defers

  • Entity index in SQLite (ADR P2 stretch goal) — current implementation runs per-entity searchKeyword calls. Fine for typical query entity counts (1–3) but unbounded if a query mentions 20 entities. A dedicated entity index would cap this; deferred to a follow-up.
  • Async writes by default (ADR P3) — orthogonal concern; the existing consolidator already handles HNSW background rebuild.
  • LoCoMo benchmark publication (ADR P4) — requires harness wiring + dataset access; punted to a separate workstream.

Test plan

  • entity-tagger.test.ts 12/12 pass
  • graceful-retrieval.test.ts 4/4 pass (2 existing + 2 new ADR-147 ones)
  • npm run build clean (no TS errors)
  • Full memory suite has no regression from this branch

🤖 Generated with RuFlo

…2317)

Adds the third signal that the multi-signal retrieval ADR (#2317)
called out as the actual gap, plus per-result provenance.

The ADR's stated P1 — "wire FTS5 + RRF fusion" — turned out to be already
shipped: `controller-registry.ts:713` already runs `semanticSearch()` +
`searchKeyword()` in parallel, fuses via `applyRRF(k=60)`, diversifies via
`applyMMR(λ=0.7)`. What was actually missing is the entity arm (ADR P2)
and the `signals` field on each fused result.

This change:

1. `entity-tagger.ts` — regex-based extractor for emails, URLs, file
   paths (POSIX + Windows), quoted phrases, and proper-noun 2-grams.
   Deliberately conservative: false negatives are fine (dense + sparse
   cover the rest), false positives would dilute the RRF score.
   `(?<!\w)..(?!\w)` lookarounds on the quote patterns stop the regex
   from pairing a closing quote of one phrase with the opening of the
   next (the classic `"a" over "b"` bug). 12 unit tests.

2. `controller-registry.ts hybridSearch` — extracts entities from the
   query; if any, runs `searchKeyword(entity, fanOut/n)` per entity in
   parallel, flattens, and adds as a third RRF arm. Empty entity set
   bypasses the arm entirely so it doesn't dilute fusion.

3. `signals: ('vector'|'bm25'|'entity')[]` on every returned result.
   Computed by candidate-id set membership in each arm's pre-fusion
   result. Lets callers debug which arms surfaced an entry without
   re-running the search.

4. `graceful-retrieval.test.ts` — extends the existing hybridSearch
   test with provenance assertion + a needle-in-haystack test
   (30 generic "authentication" entries + 1 "Alice Smith"; query
   "Alice Smith authentication" surfaces the Alice entry with
   `signals.includes('entity')`).
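
Items 2 and 3 can be sketched as follows. This is a simplified illustration under assumptions, not the code in `controller-registry.ts`: the helper names `armsForFusion` and `stampSignals` and the `Hit` type are hypothetical, and the fusion step itself is left out.

```typescript
type Signal = "vector" | "bm25" | "entity";
type Hit = { id: string };

// Empty-arm bypass (hypothetical helper): an empty entity arm is dropped
// rather than passed to fusion as [].
function armsForFusion(dense: Hit[], sparse: Hit[], entity: Hit[]): Hit[][] {
  return entity.length > 0 ? [dense, sparse, entity] : [dense, sparse];
}

// Provenance (hypothetical helper): build id sets from each arm BEFORE
// fusion, then stamp every fused result with the arms that contained it.
function stampSignals<T extends Hit>(
  fused: T[],
  dense: Hit[],
  sparse: Hit[],
  entity: Hit[],
): (T & { signals: Signal[] })[] {
  const denseIds = new Set(dense.map((h) => h.id));
  const sparseIds = new Set(sparse.map((h) => h.id));
  const entityIds = new Set(entity.map((h) => h.id));
  return fused.map((result) => {
    const signals: Signal[] = [];
    if (denseIds.has(result.id)) signals.push("vector");
    if (sparseIds.has(result.id)) signals.push("bm25");
    if (entityIds.has(result.id)) signals.push("entity");
    return { ...result, signals };
  });
}

// Toy data: "alice" is found by all three arms, "auth-1" by two.
const denseArm: Hit[] = [{ id: "alice" }, { id: "auth-1" }];
const sparseArm: Hit[] = [{ id: "auth-1" }, { id: "alice" }];
const entityArm: Hit[] = [{ id: "alice" }];
const fusedOrder: Hit[] = [{ id: "alice" }, { id: "auth-1" }];

const stamped = stampSignals(fusedOrder, denseArm, sparseArm, entityArm);
console.log(stamped[0].signals); // [ 'vector', 'bm25', 'entity' ]
console.log(stamped[1].signals); // [ 'vector', 'bm25' ]
console.log(armsForFusion(denseArm, sparseArm, []).length); // 2
```

Because the sets are built from the arms' own outputs, the fusion function needs no changes to report provenance.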

Memory test suite: 416/420 pass. The 4 failures are pre-existing
Windows-environment issues in unrelated files (agent-memory-scope
path separator + benchmark.test.ts perf budget).

Co-Authored-By: RuFlo <ruv@ruv.net>
@ruvnet merged commit b099b70 into main Jun 8, 2026
97 of 98 checks passed
ruvnet added a commit that referenced this pull request Jun 8, 2026
#2327)

@claude-flow/memory 3.0.0-alpha.19 → 3.0.0-alpha.20. Adds the entity arm
to hybridSearch alongside the existing dense + sparse RRF fusion, plus
per-result signals: ('vector'|'bm25'|'entity')[] provenance.

End-to-end capability smoke against built dist confirmed: Alice needle in
31-doc corpus ranks #1 with all three signals; runner-up has only
vector+bm25 — RRF score gap of ~47%.

@claude-flow/cli, claude-flow, ruflo 3.10.38 → 3.10.39. CLI also pins
@claude-flow/memory to ^3.0.0-alpha.20 so the wrapper users pick up the
entity arm automatically.

All four packages published with latest+alpha+v3alpha aligned.
Lockfile regen included (lesson from #2311 — bumping a workspace dep
without regenerating v3/pnpm-lock.yaml breaks frozen-lockfile CI).

Co-Authored-By: RuFlo <ruv@ruv.net>