
Batch refactoring to support dynamic worker. Deployment can be used as job worker now.#2185

Merged
Jeffwan merged 14 commits into vllm-project:main from zhangjyr:lego_batch on May 16, 2026

Conversation

@zhangjyr
Collaborator

@zhangjyr zhangjyr commented May 6, 2026

Pull Request Description

Batch was refactored to support dynamic workers. The major changes include:

  1. Moved the job driver to the job_driver folder and added deployment_driver to support Kubernetes Deployments as job workers.
  2. Added MongoDB and Redis as job entity managers.
  3. Added support for a planner_decision field in the OpenAI Batch API's Aibrix extension, which can specify job drivers on a per-job basis. (The Aibrix extension is OpenAI SDK compatible and sits in the extra_body field.)
  4. New MDS options:
  • --disable-k8s-support: Disable Kubernetes support. If disabled, jobs that depend on Kubernetes resources may fail. This is mainly for debugging purposes.
  • --disable-inference-endpoint: Disable the inference endpoint so that the Batch API cannot invoke the inference engine directly. This is useful when aibrix.planner_decision is required, and avoids setting INFERENCE_ENGINE_ENDPOINT.
  • --enable-mongo-job: Enable MongoDB as the persistent job entity manager.
  • --enable-redis-job: Enable Redis as the persistent job entity manager.
  • --registry-provider: Registry provider for model templates and profiles; defaults to configmap (currently the only option).
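As a sketch of item 3 above: the Aibrix extension rides inside the OpenAI-compatible request body, so a client can attach a per-job planner decision under extra_body. The nesting of aibrix.planner_decision follows this PR's description, but the helper name and the "deployment" driver value below are illustrative assumptions, not the PR's actual API.

```python
import json


def build_batch_request(input_file_id: str, driver: str) -> dict:
    """Build an OpenAI-compatible batch creation payload carrying the Aibrix
    extension. Only the aibrix.planner_decision nesting comes from this PR's
    description; the driver value is hypothetical."""
    return {
        "input_file_id": input_file_id,
        "endpoint": "/v1/chat/completions",
        "completion_window": "24h",
        # OpenAI SDK clients would pass this part via extra_body; on the wire
        # it is merged into the top-level JSON body.
        "aibrix": {"planner_decision": {"resource_type": driver}},
    }


body = build_batch_request("file-abc123", "deployment")
print(json.dumps(body["aibrix"]["planner_decision"]))
```

With the OpenAI SDK, the same dictionary under the "aibrix" key would be supplied as `extra_body={"aibrix": ...}` to `client.batches.create(...)`.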

TODOs:

  • The scheduler currently schedules jobs serially; a more production-ready scheduler is needed.
  • The deployment driver currently drives tasks serially; concurrency support is needed.
  • aibrix.planner_decision.provision_resource_deadline is not honored yet.
  • The job update API for resource updates is not implemented yet.

Related Issues

Resolves: #2149

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Jingyuan Zhang added 3 commits May 5, 2026 15:49
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>

# Conflicts:
#	python/aibrix/aibrix/batch/driver.py
#	python/aibrix/aibrix/batch/job_driver/local_driver.py
#	python/aibrix/aibrix/batch/job_entity/batch_job.py
#	python/aibrix/aibrix/batch/job_entity/k8s_transformer.py
#	python/aibrix/aibrix/batch/manifest/__init__.py
#	python/aibrix/aibrix/batch/manifest/renderer.py
#	python/aibrix/aibrix/batch/manifest/storage_env.py
#	python/aibrix/aibrix/batch/template/schema.py
#	python/aibrix/aibrix/metadata/api/v1/batch.py
#	python/aibrix/aibrix/metadata/app.py
#	python/aibrix/aibrix/metadata/cache/job.py
#	python/aibrix/aibrix/storage/factory.py
#	python/aibrix/poetry.lock
#	python/aibrix/tests/batch/conftest.py
#	python/aibrix/tests/batch/test_batch_usage.py
#	python/aibrix/tests/batch/test_job_cache_store.py
#	python/aibrix/tests/batch/test_manifest_renderer.py
#	python/aibrix/tests/batch/testdata/template_configmaps_unittest.yaml
#	python/aibrix/tests/metadata/conftest.py
#	python/aibrix/tests/metadata/test_app_integration.py
#	samples/batch/batch_v1alpha1_model_deployment_templates.yaml
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new DeploymentDriver for batch jobs, enabling the execution of models as long-running Kubernetes deployments. Key changes include refactoring the BatchDriver to use an InfrastructureContext and a new create_job_driver factory for selecting job execution strategies. The DeploymentDriver manages Kubernetes Deployments and Services, including a robust KubernetesServiceInferenceClient with fallback mechanisms. The BatchJobSpec has been updated to use a consolidated AibrixMetadata field for better structure. Additionally, MongoJobCache and RedisJobCache have been added for persistent job management, and storage configurations now use more explicit environment variables.

Review feedback highlights critical issues:

  • a resource leak in DeploymentDriver, where deployments and services are not consistently torn down
  • a problematic singleton pattern in DeploymentDriver
  • overly broad exception handling in KubernetesServiceInferenceClient
  • a race condition in local port reservation
  • a thread-safety bug in _snapshot_usage_to_status
  • a hardcoded host path in the deployment manifest, affecting portability

A naming convention violation for deploymentJobDriver was also noted.

Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py Outdated
Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py Outdated
Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py
Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py Outdated
Comment thread python/aibrix/aibrix/batch/job_driver/deployment_driver.py Outdated
Comment on lines +128 to 129
# BUG: This function is not thread-safe. Usage from multiple workers can overwrite each other.
async def _snapshot_usage_to_status(self, job_id: str) -> None:
Contributor


medium

As noted in the comment, _snapshot_usage_to_status is not thread-safe. Since multiple workers may attempt to update the job status concurrently, an asyncio.Lock should be used to protect the shared state and ensure atomic updates to the BatchJob status.
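The suggested fix can be sketched as follows: a minimal tracker (illustrative names, not the PR's actual classes) that serializes counter updates and snapshots through a single asyncio.Lock so concurrent workers cannot interleave read-modify-write sequences.

```python
import asyncio


class JobStatusTracker:
    """Sketch of the suggested fix: guard shared usage counters with an
    asyncio.Lock. Class and method names here are illustrative."""

    def __init__(self) -> None:
        self._usage = {"completed": 0, "failed": 0}
        self._snapshots = []
        self._lock = asyncio.Lock()

    async def record(self, outcome: str) -> None:
        # Increment under the lock so no update is lost.
        async with self._lock:
            self._usage[outcome] += 1

    async def snapshot_usage_to_status(self) -> dict:
        # Copy under the same lock so the snapshot is internally consistent.
        async with self._lock:
            snap = dict(self._usage)
            self._snapshots.append(snap)
            return snap


async def main() -> None:
    tracker = JobStatusTracker()
    await asyncio.gather(*(tracker.record("completed") for _ in range(100)))
    snap = await tracker.snapshot_usage_to_status()
    print(snap["completed"])  # 100


asyncio.run(main())
```

Note that an asyncio.Lock protects against interleaving between coroutines on one event loop; if workers run in separate threads or processes, a different primitive (or an atomic update in the job store) would be needed.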

Comment thread python/aibrix/aibrix/batch/manifest/deployment_renderer.py
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
help="Enable native kubernetes jobs as the job executor.",
)
parser.add_argument(
"--enable-mongo-job",
Collaborator


the name is confusing here

Collaborator


is this mutually exclusive with --enable-k8s-job? seem they are not at the same level

Collaborator Author


Mutually exclusive. dryrun, enable-k8s-job, enable-mongo-job, and enable-redis-job are now mutually exclusive; I added a check on app start. The logic behind this is that the previous k8s job, the current mongo job, and the redis job all specify how jobs are persisted, while dryrun specifically uses local storage to prevent online storage pollution.
It is possible for storage-backed jobs (mongo, redis) to leverage the renderer to start a k8s job; we can do that later.
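The start-up exclusivity check mentioned above might look like the following sketch. The flag names mirror the discussion, but the validation helper and error wording are assumptions, not the PR's actual code.

```python
import argparse


def validate_job_store_flags(args: argparse.Namespace) -> None:
    """Reject start-up configurations where more than one job persistence
    mode is enabled (sketch; error text is illustrative)."""
    enabled = [
        name
        for name, on in [
            ("--dryrun", args.dryrun),
            ("--enable-k8s-job", args.enable_k8s_job),
            ("--enable-mongo-job", args.enable_mongo_job),
            ("--enable-redis-job", args.enable_redis_job),
        ]
        if on
    ]
    if len(enabled) > 1:
        raise SystemExit(f"mutually exclusive flags set: {', '.join(enabled)}")


parser = argparse.ArgumentParser()
for flag in ("--dryrun", "--enable-k8s-job", "--enable-mongo-job", "--enable-redis-job"):
    parser.add_argument(flag, action="store_true")

args = parser.parse_args(["--enable-mongo-job"])
validate_job_store_flags(args)  # passes: only one mode enabled
```

argparse's built-in `add_mutually_exclusive_group` would achieve the same effect declaratively; an explicit check like the one above just gives a friendlier combined error message.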

Collaborator


They are not at the same level; I feel we should not define args like mongo-job etc.

Collaborator Author


It essentially defines how a logical job is stored, while --enable-k8s-job specifies both job persistence and job deployment. A name like --job-store-provider may be better here.

resource_details: Optional[List[ResourceDetail]] = None


class ModelTemplateRef(_Strict):
Collaborator


class ModelDeploymentTemplateSpec(_Strict):

check here. this is duplicated

Collaborator Author

@zhangjyr zhangjyr May 8, 2026


I am aware of the equivalent schema defined in metadata/api/v1/batch.py; currently, it is simply an alias to provide documentation:

class TemplateRef(ModelTemplateRef):
    """Reference to a ModelDeploymentTemplate registered via ConfigMap.

    Wire shape (under ``extra_body.aibrix.model_template``)::

        {
            "name": "llama3-70b-prod",
            "version": "v1.3.0",  # optional; "" / null = latest active
            "overrides": {  # optional, allowlisted
                "engine_args": {"max_num_seqs": "512"}
            },
        }
    """

However, ModelTemplateRef and ModelDeploymentTemplateSpec are not duplicated. The former provides reference and user overrides, while the latter provides a template spec.

Collaborator


ok. we can do some clean up later if helpful

Comment thread python/aibrix/aibrix/batch/job_driver/__init__.py
@@ -0,0 +1,508 @@
# Copyright 2026 The Aibrix Team.
Collaborator


how to decide which driver to go? based on resource_type?

Collaborator Author


Yes, it is decided by the resource_type. This might not be intuitive for in-house APIs, but it makes sense for describing resource accessibility in the public cloud (e.g., AWS SageMaker). The name "resource" reflects that computing resources are bound to the APIs used to access them.

On the relationship with the previous "template.spec.provider_config.type" (if restored): provider_config provides user-specified settings, while resource_type specifies the final planner decision. We do need to discuss the interaction between these settings.
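Dispatch by resource_type can be sketched as a simple registry-backed factory. The class names and registry keys below are hypothetical; the PR's actual factory is create_job_driver in job_driver/__init__.py.

```python
from typing import Callable, Dict


# Placeholder driver classes standing in for the PR's real drivers.
class LocalDriver: ...
class DeploymentDriver: ...
class K8sJobDriver: ...


# Hypothetical registry mapping a planner-decided resource_type to a driver
# constructor; the real mapping and keys live in job_driver/__init__.py.
_DRIVERS: Dict[str, Callable[[], object]] = {
    "local": LocalDriver,
    "deployment": DeploymentDriver,
    "k8s_job": K8sJobDriver,
}


def create_job_driver(resource_type: str) -> object:
    """Instantiate the driver registered for resource_type."""
    try:
        return _DRIVERS[resource_type]()
    except KeyError:
        raise ValueError(f"unknown resource_type: {resource_type!r}") from None


driver = create_job_driver("deployment")
print(type(driver).__name__)  # DeploymentDriver
```

A registry like this keeps the factory open for extension: a future SageMaker-style provider would only add one entry rather than another branch in the selection logic.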

Collaborator


I think the resource_type itself is not a good name.. please drive the discussion internally to rename it..

Collaborator Author


ok

Comment thread python/aibrix/aibrix/batch/job_driver/driver.py
Comment thread python/aibrix/aibrix/batch/job_driver/openai_driver.py
Comment thread python/aibrix/aibrix/metadata/cache/job.py


class KubernetesServiceInferenceClient:
_gateway_base_url = "http://127.0.0.1:8888"
Collaborator


is gateway required if we use kubernetes deployment?

Collaborator Author


In the long term, I plan to route batch tasks through the AIBrix gateway for the following reasons:

  1. Reusing existing deployments (deployment keep-alive offers hot job executors). It would also be beneficial to maintain a client pool by reusing gateway connections.
  2. Possible online/offline multiplexing by reusing online deployments.

Again, KubernetesServiceInferenceClient is just a useful private all-purpose implementation that tries whatever endpoint is accessible.
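The "tries anything accessible" behavior can be sketched as a candidate-list fallback. The URLs and the injectable probe hook below are illustrative; the real KubernetesServiceInferenceClient performs HTTP health checks against the gateway and the job's Service internally.

```python
from typing import Callable, Optional, Sequence


class FallbackInferenceClient:
    """Sketch of the fallback idea: try each candidate base URL in order
    (e.g. gateway first, then the job's Service DNS name) and cache the
    first one that responds. Names and URLs here are assumptions."""

    def __init__(self, candidates: Sequence[str], probe: Callable[[str], bool]) -> None:
        self._candidates = list(candidates)
        self._probe = probe  # in production: an HTTP health check
        self._resolved: Optional[str] = None

    def base_url(self) -> str:
        if self._resolved is None:
            for url in self._candidates:
                if self._probe(url):
                    self._resolved = url  # cache the working endpoint
                    break
            else:
                raise RuntimeError("no reachable inference endpoint")
        return self._resolved


client = FallbackInferenceClient(
    ["http://127.0.0.1:8888", "http://my-job-svc.default.svc:8000"],
    probe=lambda url: url.endswith(":8000"),  # stand-in for a real health check
)
print(client.base_url())  # http://my-job-svc.default.svc:8000
```

Injecting the probe keeps the resolution order testable without a cluster, which is one way to contain the broad exception handling the bot review flagged: each probe failure is local to one candidate rather than swallowed globally.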

Collaborator


let's postpone the future work to later PRs. this makes the PR huge and hard to review


_AIBRIX_MODEL_NAME_KEY = "model.aibrix.ai/name"
_RUNTIME_CONTAINER_NAME = "aibrix-runtime"
_RUNTIME_IMAGE = "aibrix/runtime:nightly"
Collaborator


this is the client worker, right? what's deployment image?

should it read from modelDeploymentTemplate?

Collaborator Author


modelDeploymentTemplate defines the llm_engine, and _RUNTIME_CONTAINER_NAME defines the sidecar runtime, which is not a client worker but supports AIBrix features if needed. In this version, I didn't try to determine which deployment spec is optimal for batch jobs. We can add a runtime-related spec to the model template to customize this part.

Collaborator


"which is not a client worker but supports AIBrix features if needed" — I didn't get it. What's the feature? In the ToB environment, we use the runtime to download models, but for batch I don't think we need it?

},
"spec": {
"containers": (
[self._runtime_container(template, port)]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get this part

Collaborator Author


It simply means that I will add the runtime sidecar to each deployment.

Collaborator


Same above, I feel this is not needed?

Collaborator Author


_needs_runtime_sidecar now always returns False.

Jingyuan Zhang added 5 commits May 8, 2026 09:20
Add metastore definition to profile to support metastore customization of worker pods.

Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Jingyuan Zhang added 4 commits May 11, 2026 00:00
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
@zhangjyr zhangjyr changed the title [WIP] Batch refactoring to support dynamic worker. Deployment can be used as job worker now. Batch refactoring to support dynamic worker. Deployment can be used as job worker now. May 16, 2026
Collaborator

@Jeffwan Jeffwan left a comment


overall lgtm.


@Jeffwan Jeffwan merged commit 78851fb into vllm-project:main May 16, 2026
13 checks passed


Development

Successfully merging this pull request may close these issues.

AIbrix Batch E2E problem in main branch
