Wiki-to-wiki merge import (mode=merge) #54

@tuirk

Open scope. I thought through the design but won't be building it — leaving the notes here in case someone wants to pick it up.

The problem

POST /api/import is hard-gated to empty wikis: if the wiki contains any user-owned page, the request fails with 409 wiki_not_empty (app/src/app/api/import/route.ts:53-72).

Users with an existing wiki who want to merge an export from another instance currently have to wipe local data first, or re-ingest sources through onboarding (which loses provenance, drafts, page plans, and chat history).

A mode=merge path that unions an incoming .kompl.zip into a non-empty wiki would close this gap. It is non-trivial: the nine reconciliation problems below all need explicit decisions.

Precondition: latent bug worth fixing on its own

The current import's INSERT OR IGNORE strategy is partially aspirational. Two tables have no conflict key for OR IGNORE to act on:

| Table | Today's PK | Effect |
|---|---|---|
| provenance | id AUTOINCREMENT, plain indexes on source_id/page_id (schema.sql:34-41,195-196) | OR IGNORE is a no-op: re-import duplicates rows |
| aliases | id AUTOINCREMENT, plain index on alias (schema.sql:61-67,202) | Same: duplicate alias rows accumulate |

The empty-wiki gate masks this in production (you can never re-import), but merge mode would hit it on every run. A v21 migration adding UNIQUE(source_id, page_id, contribution_type) on provenance and UNIQUE(alias) on aliases is a prerequisite — and stands on its own without merge mode.
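The no-op behaviour and the effect of the proposed constraint can be reproduced against an in-memory SQLite database. This is a minimal sketch, not the actual v21 migration: the column types, the index names, and the dedupe-before-index step are assumptions, and only the provenance column names come from the notes above. The aliases table would presumably follow the same pattern with UNIQUE(alias).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed shape of today's provenance table: surrogate PK plus plain indexes.
conn.executescript("""
CREATE TABLE provenance (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id TEXT NOT NULL,
    page_id TEXT NOT NULL,
    contribution_type TEXT NOT NULL
);
CREATE INDEX idx_provenance_source ON provenance(source_id);
CREATE INDEX idx_provenance_page ON provenance(page_id);
""")

INSERT_SQL = (
    "INSERT OR IGNORE INTO provenance (source_id, page_id, contribution_type) "
    "VALUES (?, ?, ?)"
)
row = ("src_1", "page_1", "created")  # hypothetical values

# Without a UNIQUE key there is no conflict for OR IGNORE to act on,
# so importing the same row twice stores it twice.
conn.execute(INSERT_SQL, row)
conn.execute(INSERT_SQL, row)
count_before = conn.execute("SELECT COUNT(*) FROM provenance").fetchone()[0]

# Sketch of the migration: collapse existing duplicates first (otherwise
# CREATE UNIQUE INDEX fails), then add the unique key.
conn.execute("""
DELETE FROM provenance
WHERE id NOT IN (
    SELECT MIN(id) FROM provenance
    GROUP BY source_id, page_id, contribution_type
)
""")
conn.execute("""
CREATE UNIQUE INDEX ux_provenance_triplet
ON provenance(source_id, page_id, contribution_type)
""")

# With the unique index in place, a re-import of the same row is ignored.
conn.execute(INSERT_SQL, row)
count_after = conn.execute("SELECT COUNT(*) FROM provenance").fetchone()[0]

print(count_before, count_after)
```

The dedupe step matters for any database that has already accumulated duplicate rows; on a clean database it deletes nothing.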

Tables with sound dedup keys today: extractions (PK source_id), entity_mentions (PK (canonical_name, source_id)), relationship_mentions (PK (from_canonical, to_canonical, relationship_type, source_id)), page_links (UNIQUE (source_page_id, target_page_id, link_type)), pages, sources, drafts, page_plans, compile_progress (all *_id PK).

Edge cases

The cases that make merge hard. None has an obvious right answer. This list is what I came up with — there are probably edge cases I missed, and some of these might turn out to be over-thought once someone sits down with the code.

  1. Source-level dedup. Two instances ingest the same URL. Or the same content via different URLs. Or the same URL but the content has since changed. Three different "duplicate" semantics — pick which to honor (URL match? content-hash match? both?) and how to handle the version-changed case.
  2. Page-level merge — whose summary wins? Both instances compiled the same source independently → byte-different page bodies, same page_id. Last-write-wins, recompile-from-union, keep both as variants?
  3. Entity resolution at the page level (the "React" problem). Two instances each have a page titled "React" with different page_ids. Title equality is not entity equality — these may be semantically different entities (the JS framework vs the chemistry concept). Auto-merge by title? Keep separate? How does the sidebar disambiguate?
  4. Entity resolution at the alias level. Local: ('React', 'React (JS framework)', page_Y). Incoming: ('React', 'React (chemistry)', page_X). Same alias string, conflicting canonical. Local wins? Incoming wins? Defer to the user? Auto-merging mis-routes lookups silently — hard to undo.
  5. Cross-reference graph — orphans. Page X (in zip) links to Z. Z is not in the zip and not in the local wiki. With FKs on (app/src/lib/db.ts:59), the insert throws — an orphan-edge policy is mandatory. Drop with a counter? Stub a placeholder page? Refuse the import?
  6. Provenance graph union. Each provenance row records a real compile event. Union is the obvious answer — but only if the precondition migration above is in place.
  7. Schema / category drift. Instance A categorises a page as "Cryptocurrency", instance B as "Crypto". Normalize at the import layer? Recompile after merge and let the local LLM pick? Accept transient sidebar duplication?
  8. Vector store merge. Local uses all-MiniLM-L6-v2 (nlp-service/services/vector_store.py:4). The zip's vectors.json may come from a different model, different chunker, or stale-relative-to-recompile. Trust them? Force a rebuild? Add a compatibility check?
  9. Schema version mismatch. Manifest already records schema_version (app/src/app/api/export/route.ts:265) but the import path ignores it. Reject when incoming > local (migrate.py only runs forward on a DB)? Tolerate incoming < local (older exports)?
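To make case 5 concrete, here is a minimal sketch of the "drop with a counter" orphan-edge policy, one of the three options listed. The table shapes are assumptions apart from the page_links column names and its UNIQUE key; the function name import_links_dropping_orphans and the link_type value 'wikilink' are hypothetical. It also shows why OR IGNORE alone does not help: SQLite's conflict clause does not cover foreign-key violations.

```python
import sqlite3


def import_links_dropping_orphans(conn, links):
    """Insert only edges whose endpoints both exist; return how many were dropped."""
    known = {r[0] for r in conn.execute("SELECT page_id FROM pages")}
    kept = [link for link in links if link[0] in known and link[1] in known]
    conn.executemany(
        "INSERT OR IGNORE INTO page_links "
        "(source_page_id, target_page_id, link_type) VALUES (?, ?, ?)",
        kept,
    )
    return len(links) - len(kept)


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # mirrors the FKs-on setting noted above
conn.executescript("""
CREATE TABLE pages (page_id TEXT PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE page_links (
    source_page_id TEXT NOT NULL REFERENCES pages(page_id),
    target_page_id TEXT NOT NULL REFERENCES pages(page_id),
    link_type TEXT NOT NULL,
    UNIQUE (source_page_id, target_page_id, link_type)
);
INSERT INTO pages VALUES ('page_X', 'X'), ('page_Y', 'Y');
""")

# page_Z is neither in the zip nor in the local wiki.
links = [
    ("page_X", "page_Y", "wikilink"),
    ("page_X", "page_Z", "wikilink"),
]

# A naive insert of the orphan edge raises even with OR IGNORE.
try:
    conn.execute("INSERT OR IGNORE INTO page_links VALUES (?, ?, ?)", links[1])
    naive_insert_failed = False
except sqlite3.IntegrityError:
    naive_insert_failed = True

dropped = import_links_dropping_orphans(conn, links)
stored = conn.execute("SELECT COUNT(*) FROM page_links").fetchone()[0]
print(naive_insert_failed, dropped, stored)
```

The returned counter is what an import summary could surface to the user. The stub-placeholder and refuse-import policies would replace the filter step rather than the insert.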

Design decisions any implementation has to commit to

  • Mode opt-in. ?mode=merge query param vs auto-detect on non-empty wiki vs separate route?
  • Source dedup precedence. URL-then-hash vs hash-only vs both-required?
  • Page conflict resolution. Last-write, recompile, keep-as-variants?
  • Title collisions. Auto-merge by title, never auto-merge, prompt user?
  • Alias conflicts. Local wins, incoming wins, defer-and-log?
  • Orphan edges. Drop, stub, reject?
  • Vector handling. Trust zip vectors, force backfill, compat-check?
  • Schema-too-new. 422 reject vs best-effort?
  • Atomicity. Single transaction (same as current import) vs staged commit (sources first, then pages, etc.)?
  • UI. Reuse the import button, add a separate "Merge" button, gate behind a setting?
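The source dedup precedence options differ only in which match counts as "the same source". A small sketch makes the difference visible; the Source fields url and content_hash, the policy names, and find_duplicate are all hypothetical and only illustrate the three choices named above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    source_id: str
    url: str
    content_hash: str  # assumed: some stable digest of the source content


def find_duplicate(incoming: Source, local: List[Source], policy: str) -> Optional[Source]:
    """Return the local source that `incoming` duplicates under `policy`, if any."""
    url_hit = next((s for s in local if s.url == incoming.url), None)
    hash_hit = next((s for s in local if s.content_hash == incoming.content_hash), None)
    if policy == "url_then_hash":
        # A URL match wins even if the content has changed since.
        return url_hit if url_hit is not None else hash_hit
    if policy == "hash_only":
        # Same content via different URLs collapses; changed content does not.
        return hash_hit
    if policy == "both_required":
        # Only an exact URL-and-content match counts as a duplicate.
        return next(
            (s for s in local
             if s.url == incoming.url and s.content_hash == incoming.content_hash),
            None,
        )
    raise ValueError(f"unknown policy: {policy}")


local = [
    Source("s1", "https://example.com/a", "h1"),
    Source("s2", "https://example.com/b", "h2"),
]
changed = Source("in1", "https://example.com/a", "h9")   # same URL, new content
mirrored = Source("in2", "https://example.com/c", "h2")  # new URL, same content

for policy in ("url_then_hash", "hash_only", "both_required"):
    print(policy, find_duplicate(changed, local, policy), find_duplicate(mirrored, local, policy))
```

Under url_then_hash the version-changed case is silently treated as a duplicate, so that policy still needs a separate decision on whether the incoming content replaces the local copy.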

Out of scope (separate enhancements)

  • TF-IDF "similar but not identical content" source matching.
  • Three-way conflict UI for alias divergence.
  • Cross-instance vector restore with model_name compatibility check.
  • Streaming buffer reads for >5k page imports (current implementation pre-reads every .gz blob into memory at app/src/app/api/import/route.ts:95-106 — fine for v1 sizes, will need rework at scale).
  • Sidebar disambiguation UX for same-title distinct-page_id pages.

Labels: enhancement (New feature or request), up-for-grabs (Maintainer isn't actively working on this; open for anyone to pick up)
