
Add AgentGroup CRD and a controller to run a fleet of sandboxes#1

Draft
Sanchit2662 wants to merge 2 commits into main from poc/multi-agentcube


Sanchit2662 commented May 18, 2026

Right now agentcube starts one sandbox per task. That single sandbox has to do everything at once, running the code and handling the agent runtime side. For bigger tasks that doesn't really hold up. Usually you want a few agents doing different jobs and working together on the same task.

So I started on the multi-agent side of things. This PR is the first piece of it.

what I added

A new CRD called AgentGroup. You list a bunch of agents in it, and each agent just points at an existing CodeInterpreter or AgentRuntime for its sandbox template. Then there's an AgentGroupController that watches these and actually brings the fleet up.

The controller itself is pretty simple. When you create an AgentGroup it goes to Pending, then Initializing while it creates one Sandbox per agent, and then Running once all of them report Ready. If the spec is invalid or a runtime reference doesn't exist it goes to Failed instead. Every Sandbox it creates gets an owner reference back to the AgentGroup, so deleting the group lets Kubernetes garbage collection remove all its sandboxes, and I didn't need a finalizer for that.
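To make the lifecycle above easier to picture, here is a minimal sketch of the phase decision as a pure function. It is not the controller code from this PR: the `Phase` type, the constant names and the `nextPhase` signature are all made up for illustration, and dropping back to Initializing when a Ready sandbox disappears is an assumption of the sketch, not something the description states.

```go
package main

import "fmt"

// Phase mirrors the lifecycle described above. The type and constant
// names are hypothetical, not the PR's actual API.
type Phase string

const (
	PhasePending      Phase = "Pending"
	PhaseInitializing Phase = "Initializing"
	PhaseRunning      Phase = "Running"
	PhaseFailed       Phase = "Failed"
)

// nextPhase derives the group phase from what a reconcile pass observed:
// whether the spec validated, how many sandboxes the spec asks for, and
// how many of them currently report Ready.
func nextPhase(current Phase, specValid bool, desired, ready int) Phase {
	if !specValid {
		// Bad spec or missing runtime reference.
		return PhaseFailed
	}
	if current == "" {
		// Freshly created group with no status yet.
		return PhasePending
	}
	if desired > 0 && ready == desired {
		return PhaseRunning
	}
	// Still creating sandboxes or waiting for readiness. In this sketch
	// a Running group that loses a sandbox also lands here (assumption).
	return PhaseInitializing
}

func main() {
	// Simulate four reconcile passes for a two-agent group.
	p := Phase("")
	for _, ready := range []int{0, 0, 1, 2} {
		p = nextPhase(p, true, 2, ready)
		fmt.Println(p)
	}
}
```

Keeping the decision in a side-effect-free function like this is one way to make the phase transitions easy to unit test without a cluster.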

I followed the same controller patterns the CodeInterpreterReconciler already uses, so it should feel consistent with the rest of the codebase. It uses GenerationChangedPredicate and only writes status when something actually changed, which avoids the controller waking itself up in a loop.
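The "only write status when something changed" part can be sketched roughly like this. It is a stdlib-only illustration, not the PR's code: `GroupStatus`, `statusWriter` and `updateStatusIfChanged` are hypothetical names, and the `write` callback stands in for whatever status update call the real controller makes.

```go
package main

import (
	"fmt"
	"reflect"
)

// GroupStatus is a hypothetical, trimmed-down status shape.
type GroupStatus struct {
	Phase          string
	ReadySandboxes int
}

// statusWriter stands in for the API client's status update call.
type statusWriter func(GroupStatus) error

// updateStatusIfChanged calls write only when desired differs from the
// observed status, and reports whether a write happened.
func updateStatusIfChanged(observed, desired GroupStatus, write statusWriter) (bool, error) {
	if reflect.DeepEqual(observed, desired) {
		return false, nil // nothing changed, skip the API call
	}
	if err := write(desired); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	writes := 0
	write := func(GroupStatus) error { writes++; return nil }

	old := GroupStatus{Phase: "Initializing", ReadySandboxes: 1}
	same := old
	next := GroupStatus{Phase: "Running", ReadySandboxes: 2}

	updateStatusIfChanged(old, same, write) // skipped
	updateStatusIfChanged(old, next, write) // written
	fmt.Println("status writes:", writes)
}
```

The two guards cover different things. GenerationChangedPredicate filters out update events where metadata.generation did not change, which is the case for status-only updates on resources with the status subresource enabled. The equality check keeps reconciles that are triggered for other reasons from issuing redundant writes.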

what I left out on purpose

I kept this small so it's easy to review. It only does the cold path, meaning it creates plain Sandbox objects directly. No warm pool or SandboxClaim batching yet. A few things I want to do in follow-up PRs:

  • gang scheduling so the whole fleet schedules together or not at all
  • a shared context store so agents can pass partial results around
  • a message bus so agents can actually talk to each other
  • the retry and degraded failure policies (right now everything just behaves like FailFast)

how to try it

kubectl apply -f example/agent-group/agent-group.yaml
kubectl get agentgroup research-task -w

You should see it move from Pending to Initializing to Running, and kubectl get sandboxes -l runtime.agentcube.io/agent-group=research-task shows the sandboxes it created.

tests

I added unit tests for the controller with the controller-runtime fake client. They cover the phase transitions, the all-ready and partially-ready cases, bad specs, a missing runtime reference, losing a sandbox while the group is Running, and checking the sandboxes come out owned by the group. go build, go vet and the tests all pass, and the deepcopy plus CRD manifest are regenerated so codegen is in sync.

Keeping this as a draft since it's the starting point and there's more coming.

Adds an AgentGroup custom resource and a controller that brings up a
fleet of agent sandboxes for one task, instead of a single sandbox per
task. The controller creates one agent-sandbox Sandbox per agent,
watches their Ready condition, and moves the group through
Pending -> Initializing -> Running.

This is a first slice toward multi-agentcube support (issue volcano-sh#301).
Gang scheduling, the shared context store and the inter-agent message
bus are intentionally left out of this change.

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>

Sanchit2662 reopened this May 18, 2026

Adds a Dependencies field of directed AgentDependency edges to AgentGroupSpec
so the Hierarchical agent graph is part of the API contract from the start.
The controller validates that every edge references a known agent and rejects
self-edges; ordered startup by these edges is left as later work.

Peer topology is accepted by the CRD enum but not implemented, so the
controller now rejects it explicitly with an UnsupportedTopology failure
instead of treating it as a silent no-op.

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
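
The validation rules from the second commit (known agents only, no self-edges, Peer rejected explicitly) could look roughly like the sketch below. The struct shapes and field names (`From`, `To`, `Topology`, `Agents`) are assumptions for illustration and may not match the real AgentGroupSpec. Only the topology values Hierarchical and Peer and the UnsupportedTopology reason come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// AgentDependency is a directed edge between two agents.
// Field names are hypothetical.
type AgentDependency struct {
	From string
	To   string
}

// GroupSpec is a trimmed-down, hypothetical stand-in for AgentGroupSpec.
type GroupSpec struct {
	Topology     string
	Agents       []string
	Dependencies []AgentDependency
}

// ErrUnsupportedTopology marks a topology the schema accepts but the
// controller does not implement.
var ErrUnsupportedTopology = errors.New("UnsupportedTopology")

// validate rejects the Peer topology, self-edges, and edges that
// reference agents not listed in the spec.
func validate(spec GroupSpec) error {
	if spec.Topology == "Peer" {
		return fmt.Errorf("%w: Peer is accepted by the CRD enum but not implemented", ErrUnsupportedTopology)
	}
	known := make(map[string]struct{}, len(spec.Agents))
	for _, name := range spec.Agents {
		known[name] = struct{}{}
	}
	for _, d := range spec.Dependencies {
		if d.From == d.To {
			return fmt.Errorf("self-edge on agent %q", d.From)
		}
		for _, end := range []string{d.From, d.To} {
			if _, ok := known[end]; !ok {
				return fmt.Errorf("dependency references unknown agent %q", end)
			}
		}
	}
	return nil
}

func main() {
	spec := GroupSpec{
		Topology:     "Hierarchical",
		Agents:       []string{"planner", "coder"},
		Dependencies: []AgentDependency{{From: "planner", To: "coder"}},
	}
	fmt.Println("valid spec error:", validate(spec))

	spec.Topology = "Peer"
	fmt.Println("peer spec error:", validate(spec))
}
```

Failing loudly on Peer matches the intent stated above: a value the schema allows but the controller ignores would otherwise look like a working configuration.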