
Batch refactoring to support dynamic worker. Deployment can be used as job worker now.#2185

Merged
Jeffwan merged 14 commits into vllm-project:main from zhangjyr:lego_batch on May 16, 2026

Conversation

@zhangjyr
Collaborator

@zhangjyr zhangjyr commented May 6, 2026

Pull Request Description

Batch was refactored to support dynamic workers. The major changes include:

  1. Moved the job driver to the job_driver folder and added deployment_driver to support Kubernetes Deployments as job workers.
  2. Added MongoDB and Redis as job entity managers.
  3. Added support for a planner_decision field in the OpenAI Batch API's Aibrix extension, which can specify job drivers on a per-job basis. (The Aibrix extension is OpenAI SDK compatible and sits in the extra_body field.)
  4. New MDS options:
  • --disable-k8s-support: Disable Kubernetes support. If disabled, jobs that depend on Kubernetes resources may fail. This is mainly for debugging purposes.
  • --disable-inference-endpoint: Disable the inference endpoint so that the Batch API cannot invoke the inference engine directly. This is useful when aibrix.planner_decision is required, and avoids setting INFERENCE_ENGINE_ENDPOINT.
  • --enable-mongo-job: Enable MongoDB as the persistent job entity manager.
  • --enable-redis-job: Enable Redis as the persistent job entity manager.
  • --registry-provider: Registry provider for model templates and profiles; defaults to configmap (currently the only option).
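As a sketch of item 3 above: the Aibrix extension rides inside the OpenAI-compatible request body, so a client can attach a per-job planner decision under extra_body. The nesting of aibrix.planner_decision follows this PR's description, but the helper name and the "deployment" driver value below are illustrative assumptions, not the PR's actual API.

```python
import json


def build_batch_request(input_file_id: str, driver: str) -> dict:
    """Build an OpenAI-compatible batch creation payload carrying the Aibrix
    extension. Only the aibrix.planner_decision nesting comes from this PR's
    description; the driver value is hypothetical."""
    return {
        "input_file_id": input_file_id,
        "endpoint": "/v1/chat/completions",
        "completion_window": "24h",
        # OpenAI SDK clients would pass this part via extra_body; on the wire
        # it is merged into the top-level JSON body.
        "aibrix": {"planner_decision": {"resource_type": driver}},
    }


body = build_batch_request("file-abc123", "deployment")
print(json.dumps(body["aibrix"]["planner_decision"]))
```

With the OpenAI SDK, the same dictionary under the "aibrix" key would be supplied as `extra_body={"aibrix": ...}` to `client.batches.create(...)`.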

TODOs:

  • The scheduler currently schedules jobs serially; a more production-ready scheduler is needed.
  • The deployment driver currently drives tasks serially; concurrency support is needed.
  • aibrix.planner_decision.provision_resource_deadline is not honored yet.
  • The job update API for resource updates is not implemented yet.

Related Issues

Resolves: #2149

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Jingyuan Zhang added 3 commits May 5, 2026 15:49
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>

# Conflicts:
#	python/aibrix/aibrix/batch/driver.py
#	python/aibrix/aibrix/batch/job_driver/local_driver.py
#	python/aibrix/aibrix/batch/job_entity/batch_job.py
#	python/aibrix/aibrix/batch/job_entity/k8s_transformer.py
#	python/aibrix/aibrix/batch/manifest/__init__.py
#	python/aibrix/aibrix/batch/manifest/renderer.py
#	python/aibrix/aibrix/batch/manifest/storage_env.py
#	python/aibrix/aibrix/batch/template/schema.py
#	python/aibrix/aibrix/metadata/api/v1/batch.py
#	python/aibrix/aibrix/metadata/app.py
#	python/aibrix/aibrix/metadata/cache/job.py
#	python/aibrix/aibrix/storage/factory.py
#	python/aibrix/poetry.lock
#	python/aibrix/tests/batch/conftest.py
#	python/aibrix/tests/batch/test_batch_usage.py
#	python/aibrix/tests/batch/test_job_cache_store.py
#	python/aibrix/tests/batch/test_manifest_renderer.py
#	python/aibrix/tests/batch/testdata/template_configmaps_unittest.yaml
#	python/aibrix/tests/metadata/conftest.py
#	python/aibrix/tests/metadata/test_app_integration.py
#	samples/batch/batch_v1alpha1_model_deployment_templates.yaml
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new DeploymentDriver for batch jobs, enabling the execution of models as long-running Kubernetes deployments. Key changes include refactoring the BatchDriver to use an InfrastructureContext and a new create_job_driver factory for selecting job execution strategies. The DeploymentDriver manages Kubernetes Deployments and Services, including a robust KubernetesServiceInferenceClient with fallback mechanisms. The BatchJobSpec has been updated to use a consolidated AibrixMetadata field for better structure. Additionally, MongoJobCache and RedisJobCache have been added for persistent job management, and storage configurations now use more explicit environment variables.

Review feedback highlights critical issues:

  • a resource leak in DeploymentDriver, where deployments and services are not consistently torn down
  • a problematic singleton pattern in DeploymentDriver
  • overly broad exception handling in KubernetesServiceInferenceClient
  • a race condition in local port reservation
  • a thread-safety bug in _snapshot_usage_to_status
  • a hardcoded host path in the deployment manifest, affecting portability

A naming convention violation for deploymentJobDriver was also noted.

Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py Outdated
Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py Outdated
Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py
Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py Outdated
Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py Outdated
Comment on lines +128 to 129
# BUG: This function is not thread-safe. Usage from multiple workers can overwrite each other.
async def _snapshot_usage_to_status(self, job_id: str) -> None:
Contributor


medium

As noted in the comment, _snapshot_usage_to_status is not thread-safe. Since multiple workers may attempt to update the job status concurrently, an asyncio.Lock should be used to protect the shared state and ensure atomic updates to the BatchJob status.
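The suggested fix can be sketched as follows: a minimal tracker (illustrative names, not the PR's actual classes) that serializes counter updates and snapshots through a single asyncio.Lock so concurrent workers cannot interleave read-modify-write sequences.

```python
import asyncio


class JobStatusTracker:
    """Sketch of the suggested fix: guard shared usage counters with an
    asyncio.Lock. Class and method names here are illustrative."""

    def __init__(self) -> None:
        self._usage = {"completed": 0, "failed": 0}
        self._snapshots = []
        self._lock = asyncio.Lock()

    async def record(self, outcome: str) -> None:
        # Increment under the lock so no update is lost.
        async with self._lock:
            self._usage[outcome] += 1

    async def snapshot_usage_to_status(self) -> dict:
        # Copy under the same lock so the snapshot is internally consistent.
        async with self._lock:
            snap = dict(self._usage)
            self._snapshots.append(snap)
            return snap


async def main() -> None:
    tracker = JobStatusTracker()
    await asyncio.gather(*(tracker.record("completed") for _ in range(100)))
    snap = await tracker.snapshot_usage_to_status()
    print(snap["completed"])  # 100


asyncio.run(main())
```

Note that an asyncio.Lock protects against interleaving between coroutines on one event loop; if workers run in separate threads or processes, a different primitive (or an atomic update in the job store) would be needed.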

Comment thread python/aibrix/aibrix/batch/manifest/deployment_renderer.py
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
help="Enable native kubernetes jobs as the job executor.",
)
parser.add_argument(
"--enable-mongo-job",
Collaborator


the name is confusing here

Collaborator


is this mutually exclusive with --enable-k8s-job? seem they are not at the same level

Collaborator Author


Mutually exclusive. dryrun, enable-k8s-job, enable-mongo-job, and enable-redis-job are now mutually exclusive; I added a check on app start. The logic behind this is that the previous k8s job, the current mongo job, and the redis job all specify how jobs are persisted, while dryrun specifically uses local storage to prevent online storage pollution.
It is possible for storage-backed jobs (mongo, redis) to leverage the renderer to start a k8s job; we can do that later.
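The start-up exclusivity check mentioned above might look like the following sketch. The flag names mirror the discussion, but the validation helper and error wording are assumptions, not the PR's actual code.

```python
import argparse


def validate_job_store_flags(args: argparse.Namespace) -> None:
    """Reject start-up configurations where more than one job persistence
    mode is enabled (sketch; error text is illustrative)."""
    enabled = [
        name
        for name, on in [
            ("--dryrun", args.dryrun),
            ("--enable-k8s-job", args.enable_k8s_job),
            ("--enable-mongo-job", args.enable_mongo_job),
            ("--enable-redis-job", args.enable_redis_job),
        ]
        if on
    ]
    if len(enabled) > 1:
        raise SystemExit(f"mutually exclusive flags set: {', '.join(enabled)}")


parser = argparse.ArgumentParser()
for flag in ("--dryrun", "--enable-k8s-job", "--enable-mongo-job", "--enable-redis-job"):
    parser.add_argument(flag, action="store_true")

args = parser.parse_args(["--enable-mongo-job"])
validate_job_store_flags(args)  # passes: only one mode enabled
```

argparse's built-in `add_mutually_exclusive_group` would achieve the same effect declaratively; an explicit check like the one above just gives a friendlier combined error message.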

Collaborator


They are not at the same level; I feel we should not define args like mongo-job etc.

Collaborator Author


It essentially defines how a logical job is stored, while --enable-k8s-job specifies both job persistence and job deployment. A name like --job-store-provider may be better here.

resource_details: Optional[List[ResourceDetail]] = None


class ModelTemplateRef(_Strict):
Collaborator


class ModelDeploymentTemplateSpec(_Strict):

check here. this is duplicated

Collaborator Author

@zhangjyr zhangjyr May 8, 2026


I am aware of the equivalent schema defined in metadata/api/v1/batch.py; currently, it is simply an alias to provide documentation:

class TemplateRef(ModelTemplateRef):
    """Reference to a ModelDeploymentTemplate registered via ConfigMap.

    Wire shape (under ``extra_body.aibrix.model_template``)::

        {
            "name": "llama3-70b-prod",
            "version": "v1.3.0",  # optional; "" / null = latest active
            "overrides": {  # optional, allowlisted
                "engine_args": {"max_num_seqs": "512"}
            },
        }
    """

However, ModelTemplateRef and ModelDeploymentTemplateSpec are not duplicated. The former provides reference and user overrides, while the latter provides a template spec.

Collaborator


ok. we can do some clean up later if helpful

Comment thread python/aibrix/aibrix/batch/job_driver/__init__.py
@@ -0,0 +1,508 @@
# Copyright 2026 The Aibrix Team.
Collaborator


how to decide which driver to go? based on resource_type?

Collaborator Author


Yes, it is decided by the resource_type. This might not be intuitive for in-house APIs, but it makes sense for describing resource accessibility in the public cloud (e.g., AWS SageMaker). The name "resource" reflects that computing resources are bound to the APIs used to access them.

On the relationship with the previous "template.spec.provider_config.type" (if restored): provider_config provides user-specified settings, while resource_type specifies the final planner decision. We do need to discuss the interaction between these settings.
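Dispatch by resource_type can be sketched as a simple registry-backed factory. The class names and registry keys below are hypothetical; the PR's actual factory is create_job_driver in job_driver/__init__.py.

```python
from typing import Callable, Dict


# Placeholder driver classes standing in for the PR's real drivers.
class LocalDriver: ...
class DeploymentDriver: ...
class K8sJobDriver: ...


# Hypothetical registry mapping a planner-decided resource_type to a driver
# constructor; the real mapping and keys live in job_driver/__init__.py.
_DRIVERS: Dict[str, Callable[[], object]] = {
    "local": LocalDriver,
    "deployment": DeploymentDriver,
    "k8s_job": K8sJobDriver,
}


def create_job_driver(resource_type: str) -> object:
    """Instantiate the driver registered for resource_type."""
    try:
        return _DRIVERS[resource_type]()
    except KeyError:
        raise ValueError(f"unknown resource_type: {resource_type!r}") from None


driver = create_job_driver("deployment")
print(type(driver).__name__)  # DeploymentDriver
```

A registry like this keeps the factory open for extension: a future SageMaker-style provider would only add one entry rather than another branch in the selection logic.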

Collaborator


I think the resource_type itself is not a good name.. please drive the discussion internally to rename it..

Collaborator Author


ok

Comment thread python/aibrix/aibrix/batch/job_driver/driver.py
Comment thread python/aibrix/aibrix/batch/job_driver/openai_driver.py
Comment thread python/aibrix/aibrix/metadata/cache/job.py


class KubernetesServiceInferenceClient:
_gateway_base_url = "http://127.0.0.1:8888"
Collaborator


is gateway required if we use kubernetes deployment?

Collaborator Author


In the long term, I plan to route batch tasks through the AIBrix gateway for the following reasons:

  1. Reusing existing deployments (deployment keep-alive offers hot job executors). It would also be beneficial to maintain a client pool by reusing gateway connections.
  2. Possible online/offline multiplexing by reusing online deployments.

Again, KubernetesServiceInferenceClient is just a useful private all-purpose implementation that tries whatever endpoint is accessible.
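The "tries anything accessible" behavior can be sketched as a candidate-list fallback. The URLs and the injectable probe hook below are illustrative; the real KubernetesServiceInferenceClient performs HTTP health checks against the gateway and the job's Service internally.

```python
from typing import Callable, Optional, Sequence


class FallbackInferenceClient:
    """Sketch of the fallback idea: try each candidate base URL in order
    (e.g. gateway first, then the job's Service DNS name) and cache the
    first one that responds. Names and URLs here are assumptions."""

    def __init__(self, candidates: Sequence[str], probe: Callable[[str], bool]) -> None:
        self._candidates = list(candidates)
        self._probe = probe  # in production: an HTTP health check
        self._resolved: Optional[str] = None

    def base_url(self) -> str:
        if self._resolved is None:
            for url in self._candidates:
                if self._probe(url):
                    self._resolved = url  # cache the working endpoint
                    break
            else:
                raise RuntimeError("no reachable inference endpoint")
        return self._resolved


client = FallbackInferenceClient(
    ["http://127.0.0.1:8888", "http://my-job-svc.default.svc:8000"],
    probe=lambda url: url.endswith(":8000"),  # stand-in for a real health check
)
print(client.base_url())  # http://my-job-svc.default.svc:8000
```

Injecting the probe keeps the resolution order testable without a cluster, which is one way to contain the broad exception handling the bot review flagged: each probe failure is local to one candidate rather than swallowed globally.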

Collaborator


let's postpone the future work to later PRs. this makes the PR huge and hard to review


_AIBRIX_MODEL_NAME_KEY = "model.aibrix.ai/name"
_RUNTIME_CONTAINER_NAME = "aibrix-runtime"
_RUNTIME_IMAGE = "aibrix/runtime:nightly"
Collaborator


this is the client worker, right? what's deployment image?

should it read from modelDeploymentTemplate?

Collaborator Author


modelDeploymentTemplate defines the llm_engine, and _RUNTIME_CONTAINER_NAME defines the sidecar runtime, which is not a client worker but supports AIBrix features if needed. In this version, I didn't try to determine which deployment spec is optimal for batch jobs. We can add a runtime-related spec to the model template to customize this part.

Collaborator


"which is not a client worker but supports AIBrix features if needed" — I didn't get it. What's the feature? In the ToB environment, we use the runtime to download models, but for batch I don't think we need it?

},
"spec": {
"containers": (
[self._runtime_container(template, port)]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get this part

Collaborator Author


It simply means that I will add the runtime sidecar to each deployment.

Collaborator


Same above, I feel this is not needed?

Collaborator Author


_needs_runtime_sidecar now always returns False.

Jingyuan Zhang added 5 commits May 8, 2026 09:20
Add metastore definition to profile to support metastore customization of worker pods.

Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Jingyuan Zhang added 4 commits May 11, 2026 00:00
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
@zhangjyr zhangjyr changed the title [WIP] Batch refactoring to support dynamic worker. Deployment can be used as job worker now. Batch refactoring to support dynamic worker. Deployment can be used as job worker now. May 16, 2026
Collaborator

@Jeffwan Jeffwan left a comment


overall lgtm.


@Jeffwan Jeffwan merged commit 78851fb into vllm-project:main May 16, 2026
13 checks passed


Development

Successfully merging this pull request may close these issues.

AIbrix Batch E2E problem in main branch
