dimknaf · dimknaf · May 24, 2026 · May 17, 2026 · May 17, 2026 · May 17, 2026
diff --git a/.env.example b/.env.example
@@ -35,6 +35,48 @@ AGENT_MODEL=
 # (visible via `docker logs braindb_api -f`). Response payload unchanged.
 AGENT_VERBOSE=false
 
+# Agent turn budget — how many tool-call turns the general /agent/query
+# is allowed before the SDK forces termination. Default 20. Lowering
+# this below ~15 degrades deep-research models (notably local Qwen via
+# vLLM); raising it costs more LLM calls per query. The wiki maintainer
+# / writer and the ingest watcher pass their own per-call values
+# (30/30/40/30) and are unaffected by this default.
+# AGENT_MAX_TURNS=20
+
+# How many turns from the end of the run the agent gets a synthetic
+# "start wrapping up" reminder injected as a user message. Default 8.
+# Set to 0 to disable the reminder entirely (the SDK will still
+# terminate at max_turns, but the model gets no warning). The reminder
+# tone is automatic: soft "start wrapping up" when max_turns > 5,
+# hard "call final_answer NOW" when max_turns <= 5 (which covers the
+# Layer 4 retry path).
+# AGENT_COUNTDOWN_THRESHOLD=8
+
 # Ingest watcher poll interval (seconds) — how often the watcher sidecar
 # scans data/sources/ for new files to ingest.
 INGEST_POLL_INTERVAL=7
+
+# Wiki scheduler HTTP read-timeout (seconds) on /wiki/maintain and
+# /wiki/write calls. Default 1200 (20 min). Local quantised models
+# (Qwen 27B AWQ-INT4 on vLLM) routinely take 6-15 min for a full wiki
+# body; setting this below ~600 caused the scheduler to give up while
+# the api kept working — queue drained slower than reality. Raise if
+# you see "Read timed out" in the scheduler log AND the corresponding
+# write actually committed (check `wikis_ext.revision`); lower only if
+# you specifically want quicker scheduler turnover. The api itself is
+# unbounded by this; this only controls the scheduler's patience.
+# WIKI_AGENT_TIMEOUT=1200
+
+# Per-wiki cooldown on attach claims (seconds). Default 300 (5 min).
+# Once the OLDEST pending attach for a given wiki is this old, the
+# writer claims ALL pending attaches for that wiki in a single batch.
+# Below the cooldown, fresh attaches keep accumulating — they don't
+# trigger a writer fire. Lets the writer fire once per cooldown window
+# instead of once per attach job; on a hot subject like a high-volume
+# person/topic wiki, this collapses 5-10 separate full-body
+# regenerations into 1 per window — ~80% LLM cost reduction on the
+# pattern we observed today. Self-limiting: each fire scoops up the
+# whole pending queue for that wiki. Set to 0 to disable (revert to
+# the old "fire on every attach" behaviour). Affects ATTACH only;
+# consolidate and create paths are unchanged.
+# WIKI_ATTACH_COOLDOWN_SECONDS=300
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -0,0 +1,79 @@
+name: tests
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  validator-tests:
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+
+    services:
+      postgres:
+        image: pgvector/pgvector:pg16
+        env:
+          POSTGRES_PASSWORD: password
+          POSTGRES_DB: braindb
+        ports:
+          - 5432:5432
+        options: >-
+          --health-cmd pg_isready
+          --health-interval 5s
+          --health-timeout 5s
+          --health-retries 10
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Enable required postgres extensions
+        run: |
+          PGPASSWORD=password psql -h localhost -U postgres -d braindb \
+            -c "CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE EXTENSION IF NOT EXISTS vector;"
+
+      - name: Configure .env for the CI stack
+        run: |
+          cat > .env <<'EOF'
+          DATABASE_URL=postgresql://postgres:password@host.docker.internal:5432/braindb
+          API_PORT=8000
+          LLM_PROFILE=deepinfra
+          DEEPINFRA_API_KEY=ci-placeholder-key-not-used
+          AGENT_VERBOSE=false
+          WIKI_ENABLED=false
+          EOF
+
+      - name: Create the local-network the compose file expects
+        run: docker network create local-network
+
+      - name: Bring up the stack
+        run: docker compose up -d --build
+
+      - name: Wait for /health
+        run: |
+          for i in $(seq 1 60); do
+            if curl -sf http://localhost:8000/health > /dev/null; then
+              echo "API healthy after ${i} attempts"
+              curl -s http://localhost:8000/health
+              exit 0
+            fi
+            sleep 2
+          done
+          echo "API failed to become healthy"
+          docker logs braindb_api --tail 100
+          exit 1
+
+      - name: Install pytest into the api container
+        run: docker exec braindb_api pip install pytest pytest-asyncio --quiet
+
+      - name: Run validator + handoff unit tests
+        run: |
+          docker exec braindb_api python -m pytest \
+            tests/test_final_answer_rename.py \
+            tests/test_handoff_hooks.py \
+            -v
+
+      - name: Dump api logs on failure
+        if: failure()
+        run: docker logs braindb_api --tail 200
diff --git a/.gitignore b/.gitignore
@@ -36,3 +36,6 @@ Thumbs.db
 data/sources/*
 !data/sources/.gitkeep
 !data/sources/README.md
+
+# Wiki review exports — generated, read-only inspection output
+data/wiki_review/
diff --git a/BRAINDB_GUIDE.md b/BRAINDB_GUIDE.md
@@ -22,20 +22,25 @@ The API runs at **http://localhost:8000**. Everything is done via HTTP calls.
 ### Before answering anything non-trivial, always call:
 ```
 POST /api/v1/memory/context
-{"queries": ["topic 1", "topic 2"], "max_depth": 3, "max_results": 15}
+{"queries": ["bare-keyword-1", "bare-keyword-2", "one broader phrase"], "max_depth": 3}
 ```
 This returns:
-- Direct matches (fuzzy + full-text) across all queries, merged by best score
+- Direct matches (keyword-mediated fuzzy + keyword-mediated embedding) across all queries
 - Graph-connected entities up to 3 hops away (relevance fades: 100% -> 60% -> 30%)
+- Two-level diversity quota applied: per-search-term reservation (each query gets a guaranteed share) + per-keyword halving cap on the open remainder
 - Always-on rules (always injected regardless of query)
 
-Each item has a `final_rank` score. Trust higher-ranked items more.
+Each item has a `final_rank` score. Trust higher-ranked items more. `max_results` defaults to 30; the scoring pool internally considers up to 500 candidates per query so narrow keywords aren't excluded before they're evaluated.
+
+**Query strategy.** Prefer **multiple narrow queries** (single keywords, bare names) over one long sentence. Keywords are short, so a short query matches them at high pg_trgm similarity; a long phrase dilutes the trigram set and pushes narrow-subject facts down the ranking. Examples:
 
-You can also pass a single query for backward compatibility:
 ```
-{"query": "single topic", "max_depth": 3}
+GOOD:  "queries": ["Petros", "Selonda Saronikos fish farm", "Dimitrios manager"]
+BAD:   "queries": ["Petros person identity profile relation to Dimitris"]
 ```
 
+The per-search-term quota reserves slots for each query you pass, so the bare-keyword query is guaranteed to surface its specific facts even when paired with broader angles. Single `query` (string) still works for backward compatibility.
+
 ### After learning something new, save it:
 ```
 POST /api/v1/entities/facts      — for objective facts
@@ -74,9 +79,19 @@ curl "http://localhost:8000/api/v1/entities?entity_type=fact&source=user-stated&
 Query parameters: `entity_type`, `keyword`, `source`, `min_importance` (0-1), `limit` (1-200, default 50), `offset` (default 0).
 
 ### Get Entity by ID
+The **only full-content read**. Multi-item calls (context/search/list) return
+~1K previews ending `--truncated … get_entity("<id>")`; come here for the
+whole body.
 ```bash
 curl http://localhost:8000/api/v1/entities/<UUID>
+# Large body? page it (don't pull it whole):
+curl "http://localhost:8000/api/v1/entities/<UUID>?offset=0&limit=8000"
 ```
+With `offset`/`limit` the response adds `content_meta`:
+`{total_chars, offset, returned, next_offset}` — keep fetching `next_offset`
+until it is `null`. Default (no params) = full body, unchanged. For big
+documents, prefer delegating the read to a subagent via `/api/v1/agent/query`
+so the content never floods the caller's context.
 
 ### Delete Entity
 ```bash
@@ -226,14 +241,23 @@ curl -X POST http://localhost:8000/api/v1/memory/search \
 curl -X POST http://localhost:8000/api/v1/memory/context \
   -H "Content-Type: application/json" \
   -d '{
-    "queries": ["user profile expertise", "project architecture decisions"],
+    "queries": ["user-profile", "expertise", "project-decision"],
     "max_depth": 3,
-    "max_results": 15,
     "include_always_on_rules": true
   }'
 ```
 
-Each query runs fuzzy + full-text search independently. Seeds are merged keeping the **best score** per entity. One graph expansion runs on the combined seed set.
+Each query runs through TWO keyword-mediated pathways in parallel:
+- **Fuzzy** — `pg_trgm similarity(content, query)` over keyword entities.
+- **Embedding** — Qwen3-Embedding-0.6B (1024-dim) cosine similarity between the query and keyword-entity embeddings.
+
+Entities surface via `tagged_with` from the matched keywords. Per-entity score = `max(matched-keyword similarity)` on each pathway. Both signals are merged with the geometric mean (configurable `missing_signal_penalty` when only one signal fires).
+
+After scoring, **two diversity quotas** apply:
+1. **Per-search-term** — each query in `queries[]` reserves `ceil(max_results × per_query_share / num_queries)` slots filled from its own top-ranked entities. Knob: `per_query_share` (default 0.5; set to 0 to disable).
+2. **Per-keyword (halving)** — walking the remaining slots in `final_rank`-desc order, each new dominant keyword gets a halving allowance (50% / 25% / 12.5% ..., floor 1). Knob: `keyword_quota_halving` (default 0.5; set to 1.0 to disable).
+
+`max_results` defaults to 30 (LLM-visible cap). The internal scoring pool considers up to 500 keyword neighbours per query (`scoring_pool_keyword_neighbors`) and up to 500 fuzzy candidates (`scoring_pool_fuzzy`) — cheap pure-SQL/vector work, so narrow keywords aren't excluded before they're evaluated. None of these knobs are env-driven; tune them in [`braindb/config.py`](braindb/config.py) if needed.
 
 **Single query** (backward-compatible):
 ```bash
@@ -277,8 +301,14 @@ curl "http://localhost:8000/api/v1/memory/log?since=2026-04-08T00:00:00Z"
 
 Response includes: `id`, `timestamp`, `operation`, `entity_type`, `entity_id`, `details`, `context_note`.
 
-### Read-only SQL
-For ad-hoc exploration. Only `SELECT` and `WITH` queries; 5s timeout; 1000 row limit.
+### Read-only SQL — EXCEPTION tool, not for recall
+
+⚠ This is **not** a recall/discovery path. A flat SELECT has no embeddings, no
+graph, no ranking — it discards everything BrainDB is built for. Default to
+`POST /api/v1/memory/context` (and delegated `/api/v1/agent/query`) for all
+recall, discovery, and understanding. Use `/memory/sql` **only** for a
+specific structured/aggregate question those cannot express (counts, GROUP BY,
+activity-log joins). Only `SELECT` and `WITH` queries; 5s timeout; 1000 row limit.
 
 ```bash
 curl -X POST http://localhost:8000/api/v1/memory/sql \
@@ -306,18 +336,17 @@ curl -X POST http://localhost:8000/api/v1/entities/datasources/ingest \
 
 ### BrainDB Agent — natural language queries
 
-`POST /api/v1/agent/query` — instead of orchestrating individual API calls, send a plain English request and let BrainDB's internal agent handle it. The agent uses the OpenAI Agents SDK with LiteLLM (provider pluggable via `LLM_PROFILE` — default `deepinfra`, `nim` also supported) and has access to all 21 BrainDB operations as function tools.
+`POST /api/v1/agent/query` — instead of orchestrating individual API calls, send a plain English request and let BrainDB's internal agent handle it. The agent uses the OpenAI Agents SDK with LiteLLM (provider pluggable via `LLM_PROFILE` — **`deepinfra` with `google/gemma-4-31B-it` is the recommended default**; `nim` and local vLLM are also supported) and has access to all 21 BrainDB operations as function tools.
 
 ```bash
 curl -X POST http://localhost:8000/api/v1/agent/query \
   -H "Content-Type: application/json" \
-  -d '{
-    "query": "What do you know about the user role and recent projects?",
-    "max_turns": 15
-  }'
-# {"answer": "The user is ...", "max_turns": 15}
+  -d '{"query": "What do you know about the user role and recent projects?"}'
+# {"answer": "The user is ...", "max_turns": 20}
 ```
 
+(`max_turns` is optional; the default — currently 20 — is used when omitted.)
+
 **Save via the agent**:
 ```bash
 curl -X POST http://localhost:8000/api/v1/agent/query \
@@ -332,12 +361,12 @@ curl -X POST http://localhost:8000/api/v1/agent/query \
   -d '{"query":"Delegate to a subagent: find near-duplicate facts and return top 10 pairs with their IDs."}'
 ```
 
-The agent has these tools internally: `recall_memory`, `quick_search`, `save_fact`, `save_thought`, `save_source`, `save_rule`, `ingest_file`, `get_entity`, `list_entities`, `update_entity`, `delete_entity`, `create_relation`, `view_entity_relations`, `delete_relation`, `view_tree`, `search_sql`, `view_log`, `get_stats`, `generate_embeddings`, `delegate_to_subagent`, `submit_result`.
+The agent has these tools internally: `recall_memory`, `quick_search`, `save_fact`, `save_thought`, `save_source`, `save_rule`, `ingest_file`, `get_entity`, `list_entities`, `update_entity`, `delete_entity`, `create_relation`, `view_entity_relations`, `delete_relation`, `view_tree`, `search_sql`, `view_log`, `get_stats`, `generate_embeddings`, `delegate_to_subagent`, `final_answer`.
 
 **Setup (pick a provider)**:
-- **DeepInfra (default)**: set `LLM_PROFILE=deepinfra` and `DEEPINFRA_API_KEY=...` in `.env`. Get a key at https://deepinfra.com/
-- **NVIDIA NIM**: set `LLM_PROFILE=nim` and `NVIDIA_NIM_API_KEY=...` in `.env`. Get a key at https://build.nvidia.com/
-- **Self-hosted vLLM**: set `LLM_PROFILE=vllm_workstation` for a vLLM server bound to the Docker host's loopback at `:8002`. No API key needed if the server runs without auth. See [CONTRIBUTING.md](CONTRIBUTING.md) for how to add your own self-hosted profile.
+- **DeepInfra — recommended default**: set `LLM_PROFILE=deepinfra` and `DEEPINFRA_API_KEY=...` in `.env`. Fast (5–30s per agent call), cheap, validated end-to-end. Get a key at https://deepinfra.com/
+- **NVIDIA NIM** (free-tier alternative): set `LLM_PROFILE=nim` and `NVIDIA_NIM_API_KEY=...` in `.env`. Get a key at https://build.nvidia.com/
+- **Self-hosted vLLM** (advanced / offline / requires GPU workstation): set `LLM_PROFILE=vllm_workstation` (or `..._qwen`, `..._gemma`) — points at a vLLM server bound to the Docker host's loopback at `:8002` / `:8010` / `:8009` respectively. Reach it from the docker network via an SSH tunnel if the GPU is on a remote machine. No API key needed if the server runs without auth. See [CONTRIBUTING.md](CONTRIBUTING.md) for how to add your own self-hosted profile.
 - Profiles live in `braindb/config.py::_LLM_PROFILES`. Add new providers there (e.g. `together`, `openai`) by adding a dict entry — no code change required.
 - Optional override: set `AGENT_MODEL=` in `.env` to use a non-default model for the active profile.
 
@@ -384,13 +413,19 @@ This is complementary to `source_entity_id` (on facts — links to a specific so
 
 ## How Search Works
 
-The search uses a 4-tier scoring system:
+Two different paths, two different scoring models:
+
+**`POST /api/v1/memory/search`** (and the `quick_search` agent tool) — **content-matching** with a 4-tier score against entity content directly:
 1. **Full-text AND match** (all query words match) — highest weight (1.0)
 2. **Full-text OR match** (any query word matches) — lower weight (0.3)
 3. **Content trigram similarity** — fuzzy character matching (0.5)
 4. **Title trigram similarity** — fuzzy title matching (0.3)
 
-This means specific queries with terms that appear in stored content work best. Vague queries with stop words ("everything about X") may return fewer results. If you get 0 results, reformulate with more specific terms.
+This is for "find me entities whose CONTENT mentions these terms" — useful for arbitrary text matching, but it dilutes when the query is much longer than what's in the entity.
+
+**`POST /api/v1/memory/context`** (the sophisticated path) — **keyword-mediated**. Both the fuzzy and embedding pathways match the query against keyword entities (not entity bodies); entities surface via `tagged_with`. Then graph traversal, decay, two-level diversity quota, ranking. See the "Context" section above for the full pipeline.
+
+Use `/memory/search` for raw text matching; use `/memory/context` for everything that involves *understanding* a subject. If you get 0 results from either, reformulate with more specific terms.
 
 ---
 
@@ -422,7 +457,7 @@ The `final_rank` in context results already accounts for decay.
 3. **Notes are a log** — use `notes` on any entity to record how your understanding evolved
 4. **always_on rules are limited to 10** — keep them high-signal; use on-demand rules for specifics
 5. **access_count reinforces memory** — things you retrieve often stay important longer
-6. **Multi-query for better recall** — use `queries` (array) instead of `query` (single) to search multiple angles at once
+6. **Multi-query for better recall** — use `queries` (array) instead of `query` (single) AND prefer multiple **narrow** queries (single keywords / bare names) over one long phrase. Each query in `queries[]` reserves a share of result slots, so a bare keyword is guaranteed to surface its facts. `max_results` defaults to 30.
 7. **Content should be concise** — 1-2 sentences, standalone, using full terms (not abbreviations)
 8. **Use the tree endpoint** to explore how an entity connects to others: `GET /memory/tree/<id>`
 9. **Use the list endpoint** to browse entities: `GET /entities?entity_type=fact&limit=50`