Skip to content

feat(kiloclaw): add multi-instance identity plumbing#1578

Merged
pandemicsyn merged 18 commits intomainfrom
florian/chore/org-support
Mar 27, 2026
Merged

feat(kiloclaw): add multi-instance identity plumbing#1578
pandemicsyn merged 18 commits intomainfrom
florian/chore/org-support

Conversation

@pandemicsyn
Copy link
Copy Markdown
Contributor

@pandemicsyn pandemicsyn commented Mar 26, 2026

Summary

Add multi-instance identity plumbing to KiloClaw — the foundational layer for supporting multiple instances per user and org-owned instances. This is PR 1 of the multi-instance rollout; it adds all the wiring without changing any existing behavior (all new parameters are optional, all existing flows continue to use userId-keyed DOs).

Key design decisions:

  • instanceId = kiloclaw_instances.id — the existing UUID primary key. No new column needed.
  • sandboxIdFromInstanceId() derives sandboxId as ki_{uuid-no-dashes} (35 chars). The ki_ prefix distinguishes new instance-keyed sandboxIds from legacy userId-derived ones (which are raw base64url). This lets any code path tell which derivation produced a given sandboxId without a DB lookup — important because gateway token derivation, Fly metadata recovery, and volume naming all key off sandboxId.

What changed:

  • Instance identity primitives: isValidInstanceId() validates UUID format, sandboxIdFromInstanceId() derives ki_-prefixed sandboxId from DB row UUID. Mirrored in both the worker and Next.js packages.
  • DB schema: Added organization_id (uuid, FK to organizations, nullable) column to kiloclaw_instances. No instance_id column — the existing id PK serves this purpose.
  • DO state: orgId threaded through InstanceMutableState, PersistedStateSchema, loadState/resetMutableState/createMutableState, provision(), getStatus(), getDebugState(), and buildUserEnvVars (injected as KILOCODE_ORGANIZATION_ID).
  • Routing: All 6 idFromName() sites (platform routes, user routes, admin routes) accept optional instanceId (UUID) to use as the DO key instead of userId. New /i/:instanceId/* proxy route with ownership check, WebSocket support, and path prefix stripping.
  • Authorization: Ownership checks (status.userId === authenticated userId) on all user-facing routes that accept instanceId. UUID format validation (isValidInstanceId()) prevents waking DOs with arbitrary keys.
  • Internal client: provision(), start(), stop(), destroy(), getStatus(), getDebugStatus() all accept optional instanceId.
  • DB queries: getInstanceById() (lookup by PK), getInstanceBySandboxId() for multi-instance restore paths.

Verification

  • pnpm typecheck — passes (full monorepo)
  • pnpm lint — passes (full monorepo, oxlint + eslint)
  • pnpm format:check — passes (verified by pre-push hook)
  • pnpm typecheck (kiloclaw worker only) — passes
  • pnpm lint (kiloclaw worker only) — passes
  • Migration SQL reviewed: only adds organization_id column + FK constraint
  • Code review: authorization checks verified on all user-facing routes
  • Additional verification by reviewer

Visual Changes

N/A

Reviewer Notes

  • Zero behavioral change for existing users. All new parameters are optional; without them, every code path falls back to the existing userId-keyed behavior.
  • Migration is minimal — only adds nullable organization_id column with FK to organizations. No backfill, no new indexes.
  • start() signature change: The internal client's start(userId, instanceId?, options?) changed caller sites — the admin router passes undefined for instanceId to preserve skipCooldown behavior.

Instance identity types/helpers (generateInstanceId, sandboxIdFromInstanceId),
DB schema columns (instance_id, organization_id) with unique index, optional
instanceId threading through all idFromName sites, /i/{instanceId} proxy route
with ownership checks, and orgId threading into DO state + env.

Migration SQL intentionally omitted — will be generated after merge with main.
Resolved conflicts:
- kiloclaw DO index.ts: kept provision opts + main's performance timing
- kiloclaw routes (api.ts, kiloclaw.ts, platform.ts): kept instanceId
  routing + ownership checks + main's instrumented() wrappers + analytics
- packages/db/schema.ts: kept both instance_id/organization_id and name columns
- instance-registry.ts: kept both instanceId/organizationId and name fields
- kiloclaw-internal-client.ts: kept instanceId param + main's skipCooldown option
- admin-kiloclaw-instances-router.ts: adapted start() call for new signature
- Generated migration 0060 with backfill for instance_id
Add isValidInstanceId() checks to kiloclaw.ts and api.ts routes
before using the query param as a DO key, consistent with the
/i/:instanceId/* proxy route validation.
…olumn

instanceId is now kiloclaw_instances.id (the existing UUID primary key),
not a separate column. Removes instance_id column, generateInstanceId(),
INSTANCE_ID_LENGTH, getInstanceByInstanceId(). isValidInstanceId() now
validates UUID-with-dashes format. sandboxIdFromInstanceId() strips
dashes internally. Migration regenerated to only add organization_id.
Threaded instanceId through new stream-chat-credentials platform
route and internal client method, consistent with other routes.
Add Multi-Instance Migration section documenting the identity model
transition, what to know when making changes, and upcoming PRs.
Update user scoping invariant, change checklist, and architecture map.
@pandemicsyn pandemicsyn marked this pull request as ready for review March 27, 2026 02:30
userId: string,
opts?: EnsureActiveInstanceOpts
): Promise<ActiveKiloClawInstance> {
const sandboxId = sandboxIdFromUserId(userId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Multi-instance rows still get the legacy sandbox ID

instanceId is supposed to become the routing identity, but this helper still persists sandbox_id as sandboxIdFromUserId(userId). For any org/instance-keyed flow, that means the row either collides with the user's personal row or keeps the old sandbox while the DO switches to sandboxIdFromInstanceId(instanceId), which breaks later restore/destroy/lookups that rely on kiloclaw_instances.sandbox_id.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, in the plan for pr2

const { started } = await withDORetry(
instanceStubFactory(c.env, result.data.userId),
instanceStubFactory(c.env, result.data.userId, instanceId || undefined),
stub => stub.start(result.data.userId, options),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: instanceId is dropped before the restore path

This stub is keyed by instanceId, but start() still only receives userId. If an instance-keyed DO loses its SQLite state, _startInner() falls back to restoreFromPostgres(userId) and rehydrates the legacy user-scoped row instead of the requested instance. The requested instanceId needs to be forwarded into the DO start path as well.

Copy link
Copy Markdown
Contributor Author

@pandemicsyn pandemicsyn Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start() passes only userId, the restore path inside the DO uses getActiveInstance(db, userId). The DO is already keyed correctly via instanceStubFactory, so it talks to the right DO. The internal restore path is a legacy recovery mechanism that will be updated in PR 2 when the Registry DO is added.

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot bot commented Mar 27, 2026

Code Review Summary

Status: 6 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 6
SUGGESTION 0
Issue Details (click to expand)

WARNING

File Line Issue
src/lib/kiloclaw/instance-registry.ts 34 Registry rows still persist the legacy user-derived sandbox_id, so org/instance-keyed flows can diverge from the DO sandbox identity or collide with the personal row.
kiloclaw/src/routes/platform.ts 985 POST /api/platform/start still calls start() with only userId, so the SQLite restore path rehydrates the legacy user-scoped row instead of the requested instance.

Fix these issues in Kilo Cloud

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File Line Issue
src/routers/admin-kiloclaw-instances-router.ts 167 The admin list/get/stats queries still join subscriptions on user_id instead of instance_id, so users with multiple instance-scoped subscriptions can produce duplicated rows, wrong suspended state, and inflated counts.
src/routers/admin-kiloclaw-instances-router.ts 636 destroyFlyMachine still accepts only userId, so its verify/destroy flow cannot target the selected instance and will keep talking to the legacy user-scoped DO once a user has multiple instances.
src/lib/kiloclaw/instance-lifecycle.ts 51 autoResumeIfSuspended() resolves a specific instanceId, but still calls start() with only userId, so it can wake the wrong DO while clearing suspension on the targeted subscription row.
src/app/(app)/claw/components/ClawDashboard.tsx 94 Switching back to the default tab rewrites the URL to /claw, which drops instanceId from instance-scoped links and loses the selected instance on refresh/share.
Files Reviewed (1 files)
  • src/lib/kiloclaw/sandbox-id.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 227,080 tokens

Replace regex-based isValidInstanceId() checks at route level with
zod InstanceIdParam (z.string().uuid()) validation. Add shared
InstanceIdParam schema to instance-config.ts. Platform routes now
validate ?instanceId= query params via parseInstanceIdQuery helper
(previously unvalidated).
parseInstanceIdQuery now returns a discriminated union with an error
Response instead of silently falling back to userId-keyed path. This
is consistent with kiloclaw.ts, api.ts, and index.ts validation.
…dQuery

Use new Response() matching the api.ts pattern instead of double-casting
c.json() through unknown. Avoids violating the project's 'avoid as' rule.
Route was added by the stream-chat feature on main and missed during
merge. Now accepts optional ?instanceId= with zod validation and
ownership check, consistent with /config and /status routes.
Comment on lines +28 to +57

// ─── Instance-scoped identity ───────────────────────────────────────

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/;

/**
* Validate that a string is a lowercase UUID with dashes.
* Must stay in sync with kiloclaw/src/auth/sandbox-id.ts.
*/
export function isValidInstanceId(id: string): boolean {
return UUID_RE.test(id);
}

/**
* Derive a sandboxId from an instanceId (the DB row UUID).
* Must stay in sync with kiloclaw/src/auth/sandbox-id.ts.
*/
export function sandboxIdFromInstanceId(instanceId: string): string {
if (!isValidInstanceId(instanceId)) {
throw new Error('Invalid instanceId: must be a UUID');
}
const hex = instanceId.replace(/-/g, '');
const prefixed = `ki_${hex}`;
if (prefixed.length > MAX_SANDBOX_ID_LENGTH) {
throw new Error(
`instanceId too long: prefixed sandboxId would be ${prefixed.length} chars (max ${MAX_SANDBOX_ID_LENGTH})`
);
}
return prefixed;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this code duplicated in kiloclaw/src/auth/sandbox-id.ts?

…tance type

Resolved migration numbering conflict (main added 0060+0061, ours
becomes 0062). Fixed kiloclaw-router.ts query missing organizationId
field required by updated ActiveKiloClawInstance type.
Keep both orgId (ours) and customSecretMeta (main) in UserConfig
type and buildUserEnvVars call.
…-utils

Move isValidInstanceId() and sandboxIdFromInstanceId() to
@kilocode/worker-utils/instance-id.ts as the canonical source.
Both sandbox-id.ts files (worker + Next.js) now re-export from
worker-utils, eliminating the duplicated implementation.
The barrel import (@kilocode/worker-utils) pulls in all modules
including ones with .js extensions that Next.js can't resolve.
Add a subpath export for instance-id and import from it directly.
The CI moduleResolution setting can't resolve worker-utils subpath
exports with .ts source files. Inline the two small functions in the
Next.js copy and keep worker-utils/instance-id as the canonical
source for the worker side only.
@pandemicsyn pandemicsyn enabled auto-merge March 27, 2026 20:27
@pandemicsyn pandemicsyn merged commit ed596a1 into main Mar 27, 2026
26 checks passed
@pandemicsyn pandemicsyn deleted the florian/chore/org-support branch March 27, 2026 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants