Open scope. I thought through the design but won't be building it — leaving the notes here in case someone wants to pick it up.
The problem
POST /api/import is hard-gated to empty wikis: any user-owned page returns 409 wiki_not_empty (app/src/app/api/import/route.ts:53-72).
Users with an existing wiki who want to merge an export from another instance currently have to wipe local data first, or re-ingest sources through onboarding (which loses provenance, drafts, page plans, and chat history).
A mode=merge path that unions an incoming .kompl.zip into a non-empty wiki would close this. It is non-trivial: every reconciliation problem listed under "Edge cases" below needs an explicit decision.
Precondition: latent bug worth fixing on its own
The current import's INSERT OR IGNORE strategy is partially aspirational. Two tables have no conflict key for OR IGNORE to act on:
| Table | Today's PK | Effect |
| --- | --- | --- |
| provenance | id AUTOINCREMENT, plain indexes on source_id/page_id (schema.sql:34-41,195-196) | OR IGNORE is a no-op: re-import duplicates rows |
| aliases | id AUTOINCREMENT, plain index on alias (schema.sql:61-67,202) | Same: duplicate alias rows accumulate |
The empty-wiki gate masks this in production (you can never re-import), but merge mode would hit it on every run. A v21 migration adding UNIQUE(source_id, page_id, contribution_type) on provenance and UNIQUE(alias) on aliases is a prerequisite — and stands on its own without merge mode.
Tables with sound dedup keys today: extractions (PK source_id), entity_mentions (PK (canonical_name, source_id)), relationship_mentions (PK (from_canonical, to_canonical, relationship_type, source_id)), page_links (UNIQUE (source_page_id, target_page_id, link_type)), pages, sources, drafts, page_plans, compile_progress (all *_id PK).
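The no-op behaviour described above can be reproduced in isolation. Below is a minimal sketch using Python's stdlib sqlite3 against a simplified stand-in for the provenance table. Only the three column names come from these notes; the column types, the index name, and the sample values are assumptions, and the real schema.sql will differ.

```python
import sqlite3


def count_after_reimport(with_unique: bool) -> int:
    """Insert the same provenance row twice and return the resulting row count."""
    db = sqlite3.connect(":memory:")
    # Simplified stand-in: surrogate AUTOINCREMENT key and no natural key,
    # so OR IGNORE has nothing to conflict on unless the index below exists.
    db.execute(
        "CREATE TABLE provenance ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " source_id TEXT NOT NULL,"
        " page_id TEXT NOT NULL,"
        " contribution_type TEXT NOT NULL)"
    )
    if with_unique:
        # The kind of constraint the proposed migration would add
        # (index name is hypothetical).
        db.execute(
            "CREATE UNIQUE INDEX provenance_natural_key"
            " ON provenance (source_id, page_id, contribution_type)"
        )
    row = ("src_1", "page_1", "created")  # hypothetical sample values
    for _ in range(2):  # simulate importing the same export twice
        db.execute(
            "INSERT OR IGNORE INTO provenance"
            " (source_id, page_id, contribution_type) VALUES (?, ?, ?)",
            row,
        )
    (count,) = db.execute("SELECT COUNT(*) FROM provenance").fetchone()
    db.close()
    return count


print(count_after_reimport(with_unique=False))  # 2: the duplicate is inserted
print(count_after_reimport(with_unique=True))   # 1: the duplicate is ignored
```

One thing a real migration would have to handle: SQLite refuses to create a unique index over rows that already violate it, so any existing duplicates need to be collapsed before the index is added.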
Edge cases
The cases that make merge hard. None has an obvious right answer. This list is what I came up with — there are probably edge cases I missed, and some of these might turn out to be over-thought once someone sits down with the code.
- Source-level dedup. Two instances ingest the same URL. Or the same content via different URLs. Or the same URL but the content has since changed. Three different "duplicate" semantics — pick which to honor (URL match? content-hash match? both?) and how to handle the version-changed case.
- Page-level merge: whose summary wins? Both instances compiled the same source independently → byte-different page bodies, same page_id. Last-write-wins, recompile-from-union, keep both as variants?
- Entity resolution at the page level (the "React" problem). Two instances each have a page titled "React" with different page_ids. Title equality is not entity equality: these may be semantically different entities (the JS framework vs the chemistry concept). Auto-merge by title? Keep separate? How does the sidebar disambiguate?
- Entity resolution at the alias level. Local: ('React', 'React (JS framework)', page_Y). Incoming: ('React', 'React (chemistry)', page_X). Same alias string, conflicting canonical. Local wins? Incoming wins? Defer to the user? Auto-merging silently mis-routes lookups, which is hard to undo.
- Cross-reference graph — orphans. Page X (in zip) links to Z. Z is not in the zip and not in the local wiki. With FKs on (app/src/lib/db.ts:59), the insert throws — an orphan-edge policy is mandatory. Drop with a counter? Stub a placeholder page? Refuse the import?
- Provenance graph union. Each provenance row records a real compile event. Union is the obvious answer — but only if the precondition migration above is in place.
- Schema / category drift. Instance A categorises a page as "Cryptocurrency", instance B as "Crypto". Normalize at the import layer? Recompile after merge and let the local LLM pick? Accept transient sidebar duplication?
- Vector store merge. Local uses all-MiniLM-L6-v2 (nlp-service/services/vector_store.py:4). The zip's vectors.json may come from a different model or a different chunker, or may be stale relative to a later recompile. Trust them? Force a rebuild? Add a compatibility check?
- Schema version mismatch. The manifest already records schema_version (app/src/app/api/export/route.ts:265) but the import path ignores it. Reject when incoming > local (migrate.py only runs forward on a DB)? Tolerate incoming < local (older exports)?
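To make the three "duplicate" semantics from the source-level dedup bullet concrete, here is a rough sketch of a classifier that checks URL first and content hash second. Everything in it is hypothetical: the Source shape, the SHA-256 choice, and the outcome labels are illustration only, and no URL normalization is attempted. It shows one possible precedence, not a recommendation.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Source:
    """Hypothetical minimal view of a source row."""
    source_id: str
    url: Optional[str]
    content: str

    @property
    def content_hash(self) -> str:
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


def classify(incoming: Source, local: Sequence[Source]) -> str:
    """Return 'same', 'url_changed', 'same_content_other_url', or 'new'."""
    by_url = {s.url: s for s in local if s.url}
    local_hashes = {s.content_hash for s in local}

    match = by_url.get(incoming.url) if incoming.url else None
    if match is not None:
        # Same URL: identical content is a true duplicate; otherwise the
        # page behind the URL changed between the two ingests.
        if match.content_hash == incoming.content_hash:
            return "same"
        return "url_changed"
    if incoming.content_hash in local_hashes:
        # Different (or missing) URL, byte-identical content.
        return "same_content_other_url"
    return "new"
```

Usage, with made-up URLs: given a local source at https://a.example/x containing "hello", an incoming source with the same URL and content classifies as "same", the same URL with different content as "url_changed", and a different URL with the same content as "same_content_other_url". What to do with each outcome is still the open decision.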
Design decisions any implementation has to commit to
- Mode opt-in. ?mode=merge query param vs auto-detect on non-empty wiki vs separate route?
- Source dedup precedence. URL-then-hash vs hash-only vs both-required?
- Page conflict resolution. Last-write, recompile, keep-as-variants?
- Title collisions. Auto-merge by title, never auto-merge, prompt user?
- Alias conflicts. Local wins, incoming wins, defer-and-log?
- Orphan edges. Drop, stub, reject?
- Vector handling. Trust zip vectors, force backfill, compat-check?
- Schema-too-new. 422 reject vs best-effort?
- Atomicity. Single transaction (same as current import) vs staged commit (sources first, then pages, etc.)?
- UI. Reuse the import button, add a separate "Merge" button, gate behind a setting?
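As an illustration of how one of these choices could be made explicit in code, here is a sketch of the three orphan-edge options (drop, stub, reject) applied before any insert touches the database. The names OrphanPolicy, Link, LinkPlan, and plan_links are all hypothetical, and known_page_ids is assumed to be the union of local page ids and page ids present in the zip.

```python
from enum import Enum
from typing import AbstractSet, Iterable, List, NamedTuple, Set


class OrphanPolicy(Enum):
    DROP = "drop"      # skip the edge and count it
    STUB = "stub"      # keep the edge, create placeholder pages first
    REJECT = "reject"  # abort the whole import


class Link(NamedTuple):
    source_page_id: str
    target_page_id: str
    link_type: str


class LinkPlan(NamedTuple):
    links: List[Link]  # edges that are safe to insert
    stubs: Set[str]    # page ids that need a placeholder page first
    dropped: int       # number of edges skipped under DROP


def plan_links(
    incoming: Iterable[Link],
    known_page_ids: AbstractSet[str],
    policy: OrphanPolicy,
) -> LinkPlan:
    """Split incoming edges according to the chosen orphan policy."""
    keep: List[Link] = []
    stubs: Set[str] = set()
    dropped = 0
    for link in incoming:
        missing = {link.source_page_id, link.target_page_id} - known_page_ids
        if not missing:
            keep.append(link)
        elif policy is OrphanPolicy.DROP:
            dropped += 1
        elif policy is OrphanPolicy.STUB:
            stubs |= missing
            keep.append(link)
        else:
            raise ValueError(
                f"orphan edge {link} references unknown pages {sorted(missing)}"
            )
    return LinkPlan(keep, stubs, dropped)
```

Resolving orphans in a planning pass like this keeps the foreign-key failure out of the transaction itself, and the dropped counter or stub list can be reported back to the caller whichever policy is chosen.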
Out of scope (separate enhancements)
- TF-IDF "similar but not identical content" source matching.
- Three-way conflict UI for alias divergence.
- Cross-instance vector restore with model_name compatibility check.
- Streaming buffer reads for >5k page imports (current implementation pre-reads every .gz blob into memory at app/src/app/api/import/route.ts:95-106; fine for v1 sizes, will need rework at scale).
- Sidebar disambiguation UX for same-title distinct-page_id pages.