feat(batch): implement async planner for batch orchestration #2197
Conversation
Signed-off-by: Ning Wang <n.wang.chn@hotmail.com>
Signed-off-by: Ning Wang <n.wang.chn@hotmail.com>
Code Review
This pull request introduces an asynchronous Scheduler implementation for the planner, replacing the previous passthrough logic. It utilizes a background worker pool to handle resource provisioning and batch submission, and includes corresponding UI updates to support new 'pending' and 'provisioning' job statuses. The review identifies critical race conditions during job cancellation and submission, potential memory leaks in the in-memory job tracking maps, and opportunities to improve maintainability through named types and stable terminal timestamps.
klog.Infof("[planner.scheduler] submit job_id=%q provision_id=%q model_template=%v",
	req.JobID, provResult.ProvisionID, req.ModelTemplate)

batch, err := q.bc.CreateBatch(q.baseCtx, req.BatchParams, aibrix)
There is a race condition where a job can be submitted to MDS even if it was canceled while Provision was running. The process worker should re-check the job state under a lock before calling CreateBatch. If the state is no longer Provisioning (e.g., it became Canceled), the worker should release the provisioned resource and abort the submission.
The race is handled at the post-CreateBatch checkpoint: if the job was cancelled in the meantime, the provisioned resource is released and the submission is aborted.
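Concretely, the checkpoint looks roughly like this (a condensed sketch; names such as postSubmitCheckpoint, mu, jobs, and prov.Release are illustrative rather than the exact identifiers in the PR):

func (q *AsyncPlanner) postSubmitCheckpoint(jobID, provisionID, batchID string) {
	// Re-read the job state under the lock; a Cancel may have landed while
	// Provision or CreateBatch was in flight.
	q.mu.Lock()
	job, ok := q.jobs[jobID]
	canceled := ok && job.state == jobStateCanceled
	q.mu.Unlock()
	if !canceled {
		return
	}
	// Undo the submission: forward the cancel to MDS and return the resource.
	if _, err := q.bc.CancelBatch(q.baseCtx, batchID); err != nil {
		klog.Errorf("[planner.scheduler] cancel-after-submit failed job_id=%q: %v", jobID, err)
	}
	if err := q.prov.Release(q.baseCtx, provisionID); err != nil {
		klog.Errorf("[planner.scheduler] release-after-cancel failed job_id=%q: %v", jobID, err)
	}
}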
jobs       map[string]*queuedJob // JobID -> state
jobByBatch map[string]string     // batch.ID -> JobID (for ListJobs tagging)
The jobs and jobByBatch maps grow indefinitely as new jobs are enqueued. Since this is an in-memory implementation, it will eventually lead to a memory leak in long-running processes. Consider implementing a cleanup mechanism (e.g., a TTL-based cache or a background janitor) to remove terminal jobs (Failed, Canceled, or old Completed jobs) after a certain period.
The concern is valid. I'm thinking of addressing this in a separate PR in the near future.
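For the record, the kind of janitor the review suggests could look like this (a sketch; the retention/interval knobs and the use of enqueuedAt as the age proxy are assumptions):

func (q *AsyncPlanner) startJanitor(retention, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-q.baseCtx.Done():
				return
			case <-ticker.C:
				cutoff := time.Now().Add(-retention)
				q.mu.Lock()
				for id, job := range q.jobs {
					// Prune only jobs the planner no longer drives: failed,
					// canceled, or long-submitted (MDS owns those). Deleting
					// while ranging over a Go map is safe.
					done := job.state == jobStateFailed ||
						job.state == jobStateCanceled ||
						job.state == jobStateSubmitted
					if done && job.enqueuedAt.Before(cutoff) {
						if job.batchID != "" {
							delete(q.jobByBatch, job.batchID)
						}
						delete(q.jobs, id)
					}
				}
				q.mu.Unlock()
			}
		}
	}()
}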
type queuedJob struct {
	req        *plannerapi.EnqueueRequest
	state      jobState
	batchID    string // populated when state == jobStateSubmitted
	err        error  // populated when state == jobStateFailed
	enqueuedAt time.Time
}
To ensure that terminal timestamps (FailedAt, CancelledAt) remain stable and are not recalculated on every read, consider adding failedAt and canceledAt fields to the queuedJob struct. These should be populated in markFailed and Cancel respectively.
Suggested change:
type queuedJob struct {
	req        *plannerapi.EnqueueRequest
	state      jobState
	batchID    string // populated when state == jobStateSubmitted
	err        error  // populated when state == jobStateFailed
	enqueuedAt time.Time
	failedAt   time.Time
	canceledAt time.Time
}
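With those fields in place, the terminal writers would stamp the timestamp once, e.g. (a minimal sketch; the receiver and lock names are assumed, and Cancel would set canceledAt analogously):

func (q *AsyncPlanner) markFailed(jobID string, err error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if job, ok := q.jobs[jobID]; ok {
		job.state = jobStateFailed
		job.err = err
		job.failedAt = time.Now() // stamped once here, never recomputed on reads
	}
}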
/gemini review |
Code Review
This pull request introduces an asynchronous 'Scheduler' planner to replace the synchronous 'Passthrough' implementation. The new scheduler manages job lifecycles by introducing 'pending' and 'provisioning' states, utilizing a worker pool to handle resource provisioning and batch submission. Corresponding updates were made to the API proto definitions, frontend components to support the new statuses, and the server initialization logic. My feedback suggests extracting the complex anonymous struct used for the planner decision into a named type to improve code readability and maintainability.
Force-pushed from f44f9dd to 3d0c69e
Signed-off-by: Ning Wang <n.wang.chn@hotmail.com>
Force-pushed from 3d0c69e to b5f75c1
	}
-	planner := plannerimpl.NewPassthrough(batchClient, rm.Provisioner)
+	planner := plannerimpl.NewScheduler(batchClient, rm.Provisioner, plannerimpl.DefaultWorkerCount)
	s.planner = planner
Lines 122 and 123 could be a single line?
// success/error/latency; the struct also records every call for later
// assertion and tracks peak concurrent in-flight Provisions so worker-pool
// parallelism can be measured directly.
type fakeProvisioner struct {
This should be a common provisioner interface. Can it be reused or not?
Why not put these under async_planner_test.go?
Right now, the test file is only used by the planner. Maybe we can keep it here until it is reused by another component later.
	provPollInterval time.Duration
}

// jobState is the planner-side lifecycle. Before submission, statusFor maps
Should this not be part of this file? Are there existing planner types that could be used?
No existing planner type fits. jobState describes the AsyncPlanner-internal pre-submission lifecycle (Pending → Provisioning, plus terminal Failed/Canceled before MDS takes over) and isn't referenced outside this file.
Can the job state not be used by other planner implementations? I think this should be common. I didn't get this part: if you add a new planner implementation, will you then have another jobState definition?
We only have one planner now. If we add more planners later, I agree that we should move the job state into a common package.
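For reference, the lifecycle I mean is roughly the following (a sketch; the exact constant names in the file may differ):

// Pre-submission lifecycle owned by the async planner; once a job reaches
// Submitted, MDS/batch status takes over.
type jobState int

const (
	jobStatePending      jobState = iota // enqueued, waiting for a worker
	jobStateProvisioning                 // a worker is running Provision
	jobStateSubmitted                    // CreateBatch succeeded; MDS owns the job
	jobStateFailed                       // terminal: provision or submission failed
	jobStateCanceled                     // terminal: canceled before MDS took over
)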
	}
}
if c, ok := s.planner.(io.Closer); ok {
	if err := c.Close(); err != nil {
What happens to workers that are mid-provisioning?
There are two cases. If the planner has already received the ProvisionID, it will release the resource accordingly. If RM has allocated internally but hasn't returned the ProvisionID yet, there is a caveat, bounded by the RPC duration. In practice, this window should be very small.
// ListJobs merges MDS batches with local not-yet-submitted jobs. Local jobs
// are shown only on the first page so the MDS cursor remains valid.
func (q *AsyncPlanner) ListJobs(ctx context.Context, req *plannerapi.ListJobsRequest) (*plannerapi.ListJobsResponse, error) {
Is paging still working? It seems not.
Actually, this is something I want to discuss and get feedback on: how to merge job status for jobs still in the queue with jobs already in MDS.
Paging works, but not perfectly. Right now, the idea is to display all local jobs (jobs not yet submitted to MDS) first, followed by the jobs in MDS. The reason we can't paginate strictly across both: the MDS After cursor only encodes positions in MDS's batch ID space, while local jobs have no stable IDs (they're in-memory and lost on restart), so we can't build a unified cursor across both sources.
The fix for this can be deferred to a future PR.
Regarding "The jobs in MDS will be displayed after that": I didn't get this part. There is one logical job list shown on the UI; since you manage the job state, can you do the merge within the planner?
The planner does the merge; currently it displays the sorted local jobs first, followed by the sorted MDS jobs. Ideally, we wouldn't need to distinguish them and would sort them uniformly. I can address this in a future PR.
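A condensed sketch of the current merge (the BatchClient method and response field names here, ListBatches / Jobs / After, are placeholders rather than the real API):

// Condensed sketch of the current merge policy; identifiers below are
// placeholders, not necessarily the real BatchClient/response API.
func (q *AsyncPlanner) listJobsSketch(ctx context.Context, req *plannerapi.ListJobsRequest) (*plannerapi.ListJobsResponse, error) {
	mdsPage, err := q.bc.ListBatches(ctx, req) // paged by the MDS After cursor
	if err != nil {
		return nil, err
	}
	out := &plannerapi.ListJobsResponse{After: mdsPage.After}
	if req.After == "" {
		// First page only: prepend local, not-yet-submitted jobs so the MDS
		// cursor stays meaningful on later pages.
		out.Jobs = append(out.Jobs, q.localPlaceholderJobs()...)
	}
	out.Jobs = append(out.Jobs, mdsPage.Jobs...)
	return out, nil
}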
// provReadyTimeout caps how long a single worker will wait for a
// provision to reach Running. Beyond this, the job is marked Failed and
// the resource is released.
const provReadyTimeout = 2 * time.Minute
I do not quite understand why it's 2 minutes. Is there any analysis behind it?
It depends on the provisioning speed of the resource provider. Two minutes is more than enough for the current k8s provisioning process, but we may need to adjust this value for other resource providers.
How do you measure it, and what if we switch to other cloud providers? I'd just like to know whether 2 minutes comes from your tests or is a magic number you defined.
The 2 minutes come from my local provisioning tests. I think we need to test and configure different values for different cloud providers.
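If it helps, the constant could later become a per-provider option rather than a hard-coded value, e.g. (a sketch; the option names are hypothetical):

// Hypothetical option replacing the package-level constant.
const defaultProvReadyTimeout = 2 * time.Minute

type SchedulerOptions struct {
	// ProvReadyTimeout caps how long a worker waits for a provision to reach
	// Running; zero means "use the default". Different resource providers
	// (k8s, cloud VMs, ...) can be configured with different values.
	ProvReadyTimeout time.Duration
}

func (o SchedulerOptions) provReadyTimeout() time.Duration {
	if o.ProvReadyTimeout > 0 {
		return o.ProvReadyTimeout
	}
	return defaultProvReadyTimeout
}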
// MDS for a submitted job. A cancel that lands mid-Provision or
// mid-CreateBatch is honored at the worker's post-CreateBatch checkpoint
// (which forwards CancelBatch and releases the resource).
func (q *AsyncPlanner) Cancel(ctx context.Context, jobID string) (*plannerapi.Job, error) {
Does the checkpoint approach work reliably?
Yes, we re-read the state at the end of process(), after CreateBatch returns, to check whether a cancel landed; this is covered by TestCancelDuringProvisioning* and TestCancelDuringCreateBatch*. Let me know if you see any potential issues.
aibrix := plannerclient.AIBrixExtraBody{
	JobID: req.JobID,
	PlannerDecision: &struct {
why do you replicate a struct here?
Created a named type for reuse.
I mean, this is not defined in an earlier PR?
	return out
}

func (q *AsyncPlanner) deleteJob(jobID string) {
How do you manage jobs in the queue? It seems delete is not invoked if the queue is full or a job has expired?
Jobs are dequeued and removed from the Go channel implicitly by the worker's <-q.submit receive — no explicit delete is needed for the queue itself. The previous deleteJob has been renamed to rollbackEnqueue for clarity and is only used to undo the bookkeeping insert in two edge cases where Enqueue itself fails. Added some comments in the code as well.
Could you point it out? I didn't find the logic.
Is this the part of the code you are looking for?
https://github.com/nwangfw/aibrix/blob/201ef7fc838728db85160e63ca2947e85c8920b0/apps/console/api/planner/impl/planner.go#L155-L159
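Condensed, the flow in those lines is roughly the following (identifiers simplified; the placeholder-batch construction is elided):

func (q *AsyncPlanner) Enqueue(ctx context.Context, req *plannerapi.EnqueueRequest) (*plannerapi.Job, error) {
	// Bookkeeping insert first, so Cancel/ListJobs can see the job immediately.
	q.mu.Lock()
	q.jobs[req.JobID] = &queuedJob{req: req, state: jobStatePending, enqueuedAt: time.Now()}
	q.mu.Unlock()

	select {
	case q.submit <- req.JobID:
		// A worker dequeues via <-q.submit; the channel receive is the
		// implicit removal from the queue, so no explicit delete is needed.
	default:
		// Queue full: undo the bookkeeping insert (one rollbackEnqueue edge case).
		q.rollbackEnqueue(req.JobID)
		return nil, fmt.Errorf("planner queue is full")
	}
	// ... return the synthetic "validating" placeholder batch to the caller ...
	return &plannerapi.Job{JobID: req.JobID}, nil
}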
Signed-off-by: Ning Wang <n.wang.chn@hotmail.com>
Signed-off-by: Ning Wang <n.wang.chn@hotmail.com>
// details (currently always nil — RM doesn't return them yet).
type PlannerDecision struct {
	ProvisionID               string `json:"provision_id,omitempty"`
	ProvisionResourceDeadline int64  `json:"provision_resource_deadline,omitempty"`
Need to put a comment here to explain whether the deadline is a Unix timestamp in seconds, milliseconds, etc.
type PlannerDecision struct {
	ProvisionID               string `json:"provision_id,omitempty"`
	ProvisionResourceDeadline int64  `json:"provision_resource_deadline,omitempty"`
	ResourceDetails           []struct {
- let's have enough comments to explain these fields
- need a comment to explain that the actual GPU number of each deployment is defined in the template, otherwise other contributors will get confused
Sure. Comments have been added for every field here.
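For example, something along these lines (the seconds unit below is an assumption for illustration and should match whatever RM actually returns; ResourceDetail is a hypothetical named element type):

// The seconds unit below is an assumption for illustration; it should match
// whatever RM actually returns. ResourceDetail is a hypothetical named type
// standing in for the anonymous struct element.
type PlannerDecision struct {
	// ProvisionID identifies the resource allocation returned by RM for this job.
	ProvisionID string `json:"provision_id,omitempty"`
	// ProvisionResourceDeadline is a Unix timestamp (seconds, assumed) after
	// which the provisioned resource may be reclaimed.
	ProvisionResourceDeadline int64 `json:"provision_resource_deadline,omitempty"`
	// ResourceDetails lists per-deployment placement details. Note that the
	// actual GPU count of each deployment is defined in the model template,
	// not here. Currently always nil: RM doesn't return details yet.
	ResourceDetails []ResourceDetail `json:"resource_details,omitempty"`
}

// ResourceDetail stands in for the anonymous element type; its fields are
// not shown in this thread.
type ResourceDetail struct{}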
bc   plannerclient.BatchClient
prov provisioner.Provisioner

submit chan string // buffered FIFO of pending JobIDs
It seems like the planner's and the queue's implementations are mixed together; let's split them for better flexibility.
Great point; I have split the queue implementation into a separate file.
	}
}

func (q *Planner) markFailed(jobID string, err error) {
Once a job is marked as failed, we need to release its provision (if valid) as well.
Yes, the current code releases the provision if a job has been marked as failed within process().
	return
}
if job.state != jobStatePending {
	// Cancel raced ahead; drop without provisioning.
I didn't get what the comment means; let's put a simple example to explain it.
Got this part after reviewing the Cancel function. Let's explicitly check whether the state is cancelled and write an error log if we encounter other invalid states.
Added an example here explaining what this statement means. The state info is logged so that we can figure out the exact reason.
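For example, the check could be made explicit like this (a sketch, not the exact code):

if job.state != jobStatePending {
	// Example: Enqueue put the JobID on the channel, then Cancel flipped the
	// state to Canceled before a worker picked it up; skip provisioning.
	if job.state != jobStateCanceled {
		// Anything other than Canceled here is unexpected; log it loudly.
		klog.Errorf("[planner] unexpected state %v for job_id=%q before provisioning", job.state, jobID)
	}
	return
}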
return &plannerapi.Job{JobID: jobID, Batch: placeholderBatch(req, openai.BatchStatusCancelled, enqueuedAt, terminalAt)}, nil
case jobStateSubmitted:
	klog.Infof("[planner] cancel submitted job_id=%q batch_id=%q", jobID, batchID)
	batch, err := q.bc.CancelBatch(ctx, batchID)
need to release provision
Great catch. Release provision added.
Force-pushed from 357ddbe to 9beeae4
const queueCapacity = 256

// jobQueue owns the buffered FIFO of pending JobIDs.
type jobQueue struct {
Let's have a queue interface first, with this implementation as a FIFO queue. Don't use the name jobQueue; a name describing its functionality is preferred, since we will have multiple queue implementations in the future.
Makes sense. How about this version?
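Roughly along these lines (a sketch of the split; the names are suggestions, not necessarily what the PR ended up with):

// Queue abstracts how pending JobIDs are handed to workers, so other
// policies (priority, fair-share, ...) can be swapped in later.
type Queue interface {
	// Push enqueues a JobID; it returns false if the queue is full.
	Push(jobID string) bool
	// Chan exposes the receive side that workers block on.
	Chan() <-chan string
	// Close stops accepting new work and lets workers drain.
	Close()
}

// channelFIFOQueue is the current implementation: a bounded, in-memory FIFO
// backed by a buffered channel.
type channelFIFOQueue struct {
	ch chan string
}

func newChannelFIFOQueue(capacity int) *channelFIFOQueue {
	return &channelFIFOQueue{ch: make(chan string, capacity)}
}

func (q *channelFIFOQueue) Push(jobID string) bool {
	select {
	case q.ch <- jobID:
		return true
	default:
		return false // full; the caller rolls back its bookkeeping
	}
}

func (q *channelFIFOQueue) Chan() <-chan string { return q.ch }

func (q *channelFIFOQueue) Close() { close(q.ch) }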
Signed-off-by: Ning Wang <n.wang.chn@hotmail.com>
Force-pushed from 9beeae4 to 201ef7f
Pull Request Description
Replaces the synchronous Passthrough planner with an asynchronous, queue-backed implementation to support the case where provisioning takes time. The gRPC caller now returns immediately with a synthetic "validating" batch while a worker pool runs Provision + CreateBatch in the background. The plannerapi.Planner interface is unchanged; the swap is one line in server.go.