feat(kiloclaw): add multi-instance identity plumbing#1578
Conversation
Instance identity types/helpers (generateInstanceId, sandboxIdFromInstanceId),
DB schema columns (instance_id, organization_id) with unique index, optional
instanceId threading through all idFromName sites, /i/{instanceId} proxy route
with ownership checks, and orgId threading into DO state + env.
Migration SQL intentionally omitted — will be generated after merge with main.
Resolved conflicts: - kiloclaw DO index.ts: kept provision opts + main's performance timing - kiloclaw routes (api.ts, kiloclaw.ts, platform.ts): kept instanceId routing + ownership checks + main's instrumented() wrappers + analytics - packages/db/schema.ts: kept both instance_id/organization_id and name columns - instance-registry.ts: kept both instanceId/organizationId and name fields - kiloclaw-internal-client.ts: kept instanceId param + main's skipCooldown option - admin-kiloclaw-instances-router.ts: adapted start() call for new signature - Generated migration 0060 with backfill for instance_id
Add isValidInstanceId() checks to kiloclaw.ts and api.ts routes before using the query param as a DO key, consistent with the /i/:instanceId/* proxy route validation.
…olumn instanceId is now kiloclaw_instances.id (the existing UUID primary key), not a separate column. Removes instance_id column, generateInstanceId(), INSTANCE_ID_LENGTH, getInstanceByInstanceId(). isValidInstanceId() now validates UUID-with-dashes format. sandboxIdFromInstanceId() strips dashes internally. Migration regenerated to only add organization_id.
Threaded instanceId through new stream-chat-credentials platform route and internal client method, consistent with other routes.
Add Multi-Instance Migration section documenting the identity model transition, what to know when making changes, and upcoming PRs. Update user scoping invariant, change checklist, and architecture map.
| userId: string, | ||
| opts?: EnsureActiveInstanceOpts | ||
| ): Promise<ActiveKiloClawInstance> { | ||
| const sandboxId = sandboxIdFromUserId(userId); |
There was a problem hiding this comment.
WARNING: Multi-instance rows still get the legacy sandbox ID
instanceId is supposed to become the routing identity, but this helper still persists sandbox_id as sandboxIdFromUserId(userId). For any org/instance-keyed flow, that means the row either collides with the user's personal row or keeps the old sandbox while the DO switches to sandboxIdFromInstanceId(instanceId), which breaks later restore/destroy/lookups that rely on kiloclaw_instances.sandbox_id.
There was a problem hiding this comment.
yep, in the plan for pr2
| const { started } = await withDORetry( | ||
| instanceStubFactory(c.env, result.data.userId), | ||
| instanceStubFactory(c.env, result.data.userId, instanceId || undefined), | ||
| stub => stub.start(result.data.userId, options), |
There was a problem hiding this comment.
WARNING: instanceId is dropped before the restore path
This stub is keyed by instanceId, but start() still only receives userId. If an instance-keyed DO loses its SQLite state, _startInner() falls back to restoreFromPostgres(userId) and rehydrates the legacy user-scoped row instead of the requested instance. The requested instanceId needs to be forwarded into the DO start path as well.
There was a problem hiding this comment.
start() passes only userId, the restore path inside the DO uses getActiveInstance(db, userId). The DO is already keyed correctly via instanceStubFactory, so it talks to the right DO. The internal restore path is a legacy recovery mechanism that will be updated in PR 2 when the Registry DO is added.
Code Review SummaryStatus: 6 Issues Found | Recommendation: Address before merge Overview
Issue Details (click to expand)WARNING
Other Observations (not in diff)Issues found in unchanged code that cannot receive inline comments:
Files Reviewed (1 files)
Reviewed by gpt-5.4-20260305 · 227,080 tokens |
Replace regex-based isValidInstanceId() checks at route level with zod InstanceIdParam (z.string().uuid()) validation. Add shared InstanceIdParam schema to instance-config.ts. Platform routes now validate ?instanceId= query params via parseInstanceIdQuery helper (previously unvalidated).
parseInstanceIdQuery now returns a discriminated union with an error Response instead of silently falling back to userId-keyed path. This is consistent with kiloclaw.ts, api.ts, and index.ts validation.
…dQuery Use new Response() matching the api.ts pattern instead of double-casting c.json() through unknown. Avoids violating the project's 'avoid as' rule.
Route was added by the stream-chat feature on main and missed during merge. Now accepts optional ?instanceId= with zod validation and ownership check, consistent with /config and /status routes.
|
|
||
| // ─── Instance-scoped identity ─────────────────────────────────────── | ||
|
|
||
| const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/; | ||
|
|
||
| /** | ||
| * Validate that a string is a lowercase UUID with dashes. | ||
| * Must stay in sync with kiloclaw/src/auth/sandbox-id.ts. | ||
| */ | ||
| export function isValidInstanceId(id: string): boolean { | ||
| return UUID_RE.test(id); | ||
| } | ||
|
|
||
| /** | ||
| * Derive a sandboxId from an instanceId (the DB row UUID). | ||
| * Must stay in sync with kiloclaw/src/auth/sandbox-id.ts. | ||
| */ | ||
| export function sandboxIdFromInstanceId(instanceId: string): string { | ||
| if (!isValidInstanceId(instanceId)) { | ||
| throw new Error('Invalid instanceId: must be a UUID'); | ||
| } | ||
| const hex = instanceId.replace(/-/g, ''); | ||
| const prefixed = `ki_${hex}`; | ||
| if (prefixed.length > MAX_SANDBOX_ID_LENGTH) { | ||
| throw new Error( | ||
| `instanceId too long: prefixed sandboxId would be ${prefixed.length} chars (max ${MAX_SANDBOX_ID_LENGTH})` | ||
| ); | ||
| } | ||
| return prefixed; | ||
| } |
There was a problem hiding this comment.
We have this code duplicated in kiloclaw/src/auth/sandbox-id.ts?
…tance type Resolved migration numbering conflict (main added 0060+0061, ours becomes 0062). Fixed kiloclaw-router.ts query missing organizationId field required by updated ActiveKiloClawInstance type.
Keep both orgId (ours) and customSecretMeta (main) in UserConfig type and buildUserEnvVars call.
…-utils Move isValidInstanceId() and sandboxIdFromInstanceId() to @kilocode/worker-utils/instance-id.ts as the canonical source. Both sandbox-id.ts files (worker + Next.js) now re-export from worker-utils, eliminating the duplicated implementation.
The barrel import (@kilocode/worker-utils) pulls in all modules including ones with .js extensions that Next.js can't resolve. Add a subpath export for instance-id and import from it directly.
The CI moduleResolution setting can't resolve worker-utils subpath exports with .ts source files. Inline the two small functions in the Next.js copy and keep worker-utils/instance-id as the canonical source for the worker side only.
Summary
Add multi-instance identity plumbing to KiloClaw — the foundational layer for supporting multiple instances per user and org-owned instances. This is PR 1 of the multi-instance rollout; it adds all the wiring without changing any existing behavior (all new parameters are optional, all existing flows continue to use userId-keyed DOs).
Key design decisions:
instanceId=kiloclaw_instances.id— the existing UUID primary key. No new column needed.sandboxIdFromInstanceId()derives sandboxId aski_{uuid-no-dashes}(35 chars). Theki_prefix distinguishes new instance-keyed sandboxIds from legacy userId-derived ones (which are raw base64url). This lets any code path tell which derivation produced a given sandboxId without a DB lookup — important because gateway token derivation, Fly metadata recovery, and volume naming all key off sandboxId.What changed:
isValidInstanceId()validates UUID format,sandboxIdFromInstanceId()deriveski_-prefixed sandboxId from DB row UUID. Mirrored in both the worker and Next.js packages.organization_id(uuid, FK to organizations, nullable) column tokiloclaw_instances. Noinstance_idcolumn — the existingidPK serves this purpose.orgIdthreaded throughInstanceMutableState,PersistedStateSchema,loadState/resetMutableState/createMutableState,provision(),getStatus(),getDebugState(), andbuildUserEnvVars(injected asKILOCODE_ORGANIZATION_ID).idFromName()sites (platform routes, user routes, admin routes) accept optionalinstanceId(UUID) to use as the DO key instead of userId. New/i/:instanceId/*proxy route with ownership check, WebSocket support, and path prefix stripping.status.userId === authenticated userId) on all user-facing routes that acceptinstanceId. UUID format validation (isValidInstanceId()) prevents waking DOs with arbitrary keys.provision(),start(),stop(),destroy(),getStatus(),getDebugStatus()all accept optionalinstanceId.getInstanceById()(lookup by PK),getInstanceBySandboxId()for multi-instance restore paths.Verification
pnpm typecheck— passes (full monorepo)pnpm lint— passes (full monorepo, oxlint + eslint)pnpm format:check— passes (verified by pre-push hook)pnpm typecheck(kiloclaw worker only) — passespnpm lint(kiloclaw worker only) — passesorganization_idcolumn + FK constraintVisual Changes
N/A
Reviewer Notes
organization_idcolumn with FK toorganizations. No backfill, no new indexes.start()signature change: The internal client'sstart(userId, instanceId?, options?)changed caller sites — the admin router passesundefinedfor instanceId to preserveskipCooldownbehavior.