Description
Context
When startAgent is called for an agent that's already in starting status (between sessionCount++ and session.create()), the current code calls stopAgent() to kill the old session before proceeding. However, stopAgent() cannot actually cancel a startup that hasn't created a sessionId yet. The original startAgent() call keeps going, subscribes to events, and can leave an extra live session that is no longer tracked in the agents map.
This was identified during PR #1336 review (comment by kilo-code-bot).
Current behavior
In container/src/process-manager.ts:startAgent():
if (existing && (existing.status === 'running' || existing.status === 'starting')) {
  await stopAgent(request.agentId).catch(...);
}
If existing.status === 'starting', stopAgent tries to kill it but has no sessionId to target. The original startup continues in the background, creating an orphaned session.
Expected behavior
When a new startAgent request arrives for an agent in starting status, the system should either:
- Wait for the existing startup to complete (with a timeout), then stop the resulting session
- Use an AbortController to cancel the in-flight startup
- Track the pending startup promise and await it before stopping
Option 2, the AbortController approach, is cleanest: thread an AbortController through the startup sequence so stopAgent can signal cancellation before session.create() completes.
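A minimal sketch of the AbortController approach, not the actual process-manager.ts code. Only startAgent, stopAgent, sessionId, the agents map, and the status values come from this issue; AgentEntry, createSession, destroySession, liveSessions, and the delayMs parameter are hypothetical stand-ins used to simulate session.create() and make the race reproducible.

```typescript
// Hypothetical sketch: everything except startAgent/stopAgent/agents/sessionId is assumed.
type AgentStatus = 'starting' | 'running' | 'stopped';

interface AgentEntry {
  status: AgentStatus;
  sessionId?: string;
  abort: AbortController; // lets stopAgent cancel a startup that has no sessionId yet
}

const agents = new Map<string, AgentEntry>();
const liveSessions = new Set<string>(); // stand-in for sessions alive on the backend
let nextSession = 0;

// Stand-in for session.create(): resolves with a new session id after a delay.
async function createSession(delayMs: number): Promise<string> {
  await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  const id = `session-${++nextSession}`;
  liveSessions.add(id);
  return id;
}

// Stand-in for tearing down a session.
async function destroySession(id: string): Promise<void> {
  liveSessions.delete(id);
}

async function stopAgent(agentId: string): Promise<void> {
  const entry = agents.get(agentId);
  if (!entry) return;
  entry.abort.abort(); // signal any in-flight startup to give up
  if (entry.sessionId) await destroySession(entry.sessionId);
  entry.status = 'stopped';
  agents.delete(agentId);
}

async function startAgent(agentId: string, delayMs = 50): Promise<string | undefined> {
  const existing = agents.get(agentId);
  if (existing && (existing.status === 'running' || existing.status === 'starting')) {
    await stopAgent(agentId);
  }

  const entry: AgentEntry = { status: 'starting', abort: new AbortController() };
  agents.set(agentId, entry);
  const { signal } = entry.abort;

  const sessionId = await createSession(delayMs);
  if (signal.aborted) {
    // Superseded while starting: destroy the session instead of subscribing to events.
    await destroySession(sessionId);
    return undefined;
  }

  entry.sessionId = sessionId;
  entry.status = 'running';
  return sessionId;
}
```

In this sketch the superseded call checks signal.aborted right after the simulated session.create() resolves and tears down its own session, so no untracked session is left behind. A real implementation would probably also need to check the signal before subscribing to events, and could pass it into session.create() if that call accepts one; both are assumptions about code not shown here.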
Impact
Low — this race requires two startAgent calls for the same agent within the ~1-2 second window between sessionCount++ and session.create(). In practice this is rare because the reconciler runs every 5s and the DO serializes RPC calls. But it could happen during rapid container eviction/restart cycles.
Parent: #204