Batch refactoring to support dynamic workers. Deployment can be used as a job worker now. #2185
Conversation
Code Review
This pull request introduces a new DeploymentDriver for batch jobs, enabling the execution of models as long-running Kubernetes deployments. Key changes include refactoring the BatchDriver to use an InfrastructureContext and a new create_job_driver factory for selecting job execution strategies. The DeploymentDriver manages Kubernetes Deployments and Services, including a robust KubernetesServiceInferenceClient with fallback mechanisms. The BatchJobSpec has been updated to use a consolidated AibrixMetadata field for better structure. Additionally, MongoJobCache and RedisJobCache have been added for persistent job management, and storage configurations now use more explicit environment variables.

Review feedback highlights several critical issues:

- a resource leak in DeploymentDriver, where deployments and services are not consistently torn down
- a problematic singleton pattern in DeploymentDriver
- overly broad exception handling in KubernetesServiceInferenceClient
- a race condition in local port reservation
- a thread-safety bug in _snapshot_usage_to_status
- a hardcoded host path in the deployment manifest, affecting portability

A naming convention violation for deploymentJobDriver was also noted.
```python
# BUG: This function is not thread-safe. Usage from multiple workers can overwrite each other.
async def _snapshot_usage_to_status(self, job_id: str) -> None:
```
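A hedged sketch of one possible fix, assuming per-job usage state guarded by an asyncio.Lock; the class shape and helper names below are illustrative, not the PR's actual code:

```python
import asyncio
from collections import defaultdict


class BatchDriver:
    def __init__(self) -> None:
        # One asyncio.Lock per job so concurrent workers serialize their
        # read-modify-write of the shared usage snapshot.
        self._usage_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def _snapshot_usage_to_status(self, job_id: str) -> None:
        async with self._usage_locks[job_id]:
            usage = self._collect_usage(job_id)          # hypothetical helper
            await self._patch_job_status(job_id, usage)  # hypothetical helper
```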
| help="Enable native kubernetes jobs as the job executor.", | ||
| ) | ||
| parser.add_argument( | ||
| "--enable-mongo-job", |
the name is confusing here
is this mutually exclusive with --enable-k8s-job? they don't seem to be at the same level
Mutually exclusive. Now dryrun, enable-k8s-job, enable-mongo-job, and enable-redis-job are exclusive; I added a check on app start. The logic behind this is that the previous k8s job, the current mongo job, and the redis job all specify how jobs are persisted, while dryrun specifically uses local storage to prevent online storage pollution.
It is possible for storage-backed jobs (mongo, redis) to leverage the renderer to start a k8s job; we can do that later.
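For illustration, a minimal sketch of such a startup check, assuming the argparse attribute names match the flag names above (the helper itself is not the PR's code):

```python
def validate_job_flags(args) -> None:
    # dryrun / k8s / mongo / redis all decide how jobs are persisted,
    # so at most one of them may be enabled at a time.
    enabled = [
        name
        for name, on in [
            ("--dry-run", args.dry_run),  # flag spellings assumed
            ("--enable-k8s-job", args.enable_k8s_job),
            ("--enable-mongo-job", args.enable_mongo_job),
            ("--enable-redis-job", args.enable_redis_job),
        ]
        if on
    ]
    if len(enabled) > 1:
        raise SystemExit(f"mutually exclusive flags set together: {', '.join(enabled)}")
```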
they are not at the same level; I feel we should not define args like mongo-job, etc.
It essentially defines how a logical job is stored, while --enable-k8s-job specifies both job persistence and job deployment. A name like --job-store-provider may be better here.
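Under that proposal, the four exclusive flags could collapse into one enum-valued argument; a sketch of the suggested (not merged) interface:

```python
parser.add_argument(
    "--job-store-provider",  # proposed name, not merged code
    choices=["dryrun", "k8s", "mongo", "redis"],
    default="dryrun",
    help="How logical jobs are persisted; would replace the mutually "
    "exclusive --enable-*-job flags.",
)
```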
```python
resource_details: Optional[List[ResourceDetail]] = None


class ModelTemplateRef(_Strict):
```
check here. this is duplicated
I am aware of the equivalent schema defined in metadata/api/v1/batch.py; currently, it is simply an alias to provide documentation:
```python
class TemplateRef(ModelTemplateRef):
    """Reference to a ModelDeploymentTemplate registered via ConfigMap.

    Wire shape (under ``extra_body.aibrix.model_template``)::

        {
            "name": "llama3-70b-prod",
            "version": "v1.3.0",    # optional; "" / null = latest active
            "overrides": {          # optional, allowlisted
                "engine_args": {"max_num_seqs": "512"}
            },
        }
    """
```

However, ModelTemplateRef and ModelDeploymentTemplateSpec are not duplicated. The former provides reference and user overrides, while the latter provides a template spec.
ok. we can do some clean up later if helpful
```diff
@@ -0,0 +1,508 @@
+# Copyright 2026 The Aibrix Team.
```
how to decide which driver to use? based on resource_type?
Yes, it is decided by resource_type. This might not be intuitive for in-house APIs, but it makes sense for describing resource accessibility in the public cloud (e.g., AWS SageMaker); the name "resource" assumes computing resources are bound to accessing APIs.
On the relationship with the previous "template.spec.provider_config.type" (if restored): provider_config provides user-specified settings, while resource_type specifies the final planner decision. We do need to discuss the interaction between these settings.
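For illustration, a minimal sketch of how the create_job_driver factory could branch on resource_type; the enum value and any driver name other than DeploymentDriver are assumptions:

```python
def create_job_driver(spec, ctx):
    """Pick a job execution strategy from the job's resource_type."""
    if spec.resource_type == "deployment":  # value assumed for illustration
        # Long-running worker: model served behind a Deployment + Service.
        return DeploymentDriver(ctx)
    # Default: one-shot / local execution (LocalDriver name assumed).
    return LocalDriver(ctx)
```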
I think resource_type itself is not a good name; please drive the discussion internally to rename it.
```python
class KubernetesServiceInferenceClient:
    _gateway_base_url = "http://127.0.0.1:8888"
```
is the gateway required if we use a kubernetes deployment?
In the long term, I plan to route batch tasks using the AIBrix gateway for the following reasons:

- Reuse existing deployments (deployment keep-alive to offer hot job executors). Implicitly, it would be beneficial to maintain a client pool by reusing the gateway connections.
- Possible online/offline multiplexing by reusing online deployments.

Again, KubernetesServiceInferenceClient is just a useful private all-purpose implementation that tries every accessible endpoint.
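As a sketch of that try-anything-accessible behavior, assuming a gateway-first probe order and a /healthz endpoint (both assumptions, not the actual client):

```python
import httpx


async def resolve_base_url(service: str, namespace: str, port: int) -> str:
    """Prefer the AIBrix gateway; fall back to in-cluster service DNS."""
    candidates = [
        "http://127.0.0.1:8888",  # gateway route
        f"http://{service}.{namespace}.svc.cluster.local:{port}",  # direct service
    ]
    for base in candidates:
        try:
            async with httpx.AsyncClient(timeout=2.0) as client:
                resp = await client.get(f"{base}/healthz")  # probe path assumed
            if resp.status_code < 500:
                return base
        except httpx.HTTPError:
            continue  # unreachable candidate; try the next one
    raise RuntimeError("no reachable inference endpoint")
```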
let's postpone the future work to later PRs. this makes the PR huge and hard to review
```python
_AIBRIX_MODEL_NAME_KEY = "model.aibrix.ai/name"
_RUNTIME_CONTAINER_NAME = "aibrix-runtime"
_RUNTIME_IMAGE = "aibrix/runtime:nightly"
```
this is the client worker, right? what's the deployment image?
should it read from modelDeploymentTemplate?
modelDeploymentTemplate defines the llm_engine, and _RUNTIME_CONTAINER_NAME defines the sidecar runtime, which is not a client worker but supports AIBrix features if needed. With this version, I didn't try to justify which deployment spec is optimal for the batch jobs. We can add a runtime-related spec to the model template to customize this part.
" which is not a client worker but supports AIBrix features if needed." I didn't get it. what's the feature? In ToB env, we use runtime to download models but for batch, I do not think we needed it?
```python
},
"spec": {
    "containers": (
        [self._runtime_container(template, port)]
```
It simply means that I will add the runtime sidecar to each deployment.
Same as above; I feel this is not needed?
_needs_runtime_sidecar now always returns False.
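With the sidecar disabled, the conditional container list above degenerates to the engine container only; roughly:

```python
def _needs_runtime_sidecar(self, template) -> bool:
    # Runtime sidecar disabled for batch workers for now: the engine
    # container from the model template is the only container deployed.
    return False
```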
Add metastore definition to profile to support metastore customization of worker pods.
Pull Request Description
Batch refactored to support dynamic workers. Major refactoring includes:
TODOs:
Related Issues
Resolves: #2149
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
- [Bug]: Corrections to existing functionality
- [CI]: Changes to build process or CI pipeline
- [Docs]: Updates or additions to documentation
- [API]: Modifications to aibrix's API or interface
- [CLI]: Changes or additions to the Command Line Interface
- [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.