feat: inject PET_* envs into init containers via envInjection config#3516
feat: inject PET_* envs into init containers via envInjection config#3516panpan0000 wants to merge 3 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR adds Torch runtime configuration for injecting PET distributed-topology environment variables into selected non-trainer containers, enabling init containers or sidecars to perform distributed preflight checks before trainer startup.
Changes:
- Adds
envInjectionAPI types and generated CRD/apply/deepcopy/openapi artifacts for Torch ML policy configuration. - Adds runtime helpers and Torch plugin logic to inject PET env vars into configured containers.
- Extends JobSet build logic and Torch tests to cover init-container and sidecar injection scenarios.
Reviewed changes
Copilot reviewed 12 out of 19 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds Torch envInjection API structs. |
| pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go | Generated deepcopy support. |
| pkg/apis/trainer/v1alpha1/zz_generated.openapi.go | Generated OpenAPI schema updates. |
| pkg/client/applyconfiguration/utils.go | Registers new apply configuration types. |
| pkg/client/applyconfiguration/trainer/v1alpha1/*.go | Adds/updates generated apply configuration builders. |
| pkg/runtime/runtime.go | Adds container lookup by PodSet name. |
| pkg/runtime/framework/plugins/torch/torch.go | Injects PET env vars into configured containers. |
| pkg/runtime/framework/plugins/torch/torch_test.go | Adds Torch envInjection test cases. |
| pkg/runtime/framework/plugins/jobset/jobset.go | Propagates init-container mutations into JobSet output. |
| manifests/base/crds/*.yaml | Updates base CRDs with envInjection schema. |
| charts/kubeflow-trainer/crds/*.yaml | Updates Helm CRDs with envInjection schema. |
| api/openapi-spec/swagger.json | Updates published OpenAPI spec. |
| pkg/util/testing/wrapper.go | Adds test helper for Torch envInjection policy. |
Files not reviewed (7)
- pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go: Language not supported
- pkg/client/applyconfiguration/trainer/v1alpha1/envinjection.go: Language not supported
- pkg/client/applyconfiguration/trainer/v1alpha1/envinjectiontarget.go: Language not supported
- pkg/client/applyconfiguration/trainer/v1alpha1/mlpolicy.go: Language not supported
- pkg/client/applyconfiguration/trainer/v1alpha1/mlpolicysource.go: Language not supported
- pkg/client/applyconfiguration/trainer/v1alpha1/torchmlpolicysource.go: Language not supported
- pkg/client/applyconfiguration/utils.go: Language not supported
7f649f9 to
b874a62
Compare
- Add EnvInjection and EnvInjectionTarget types to TorchMLPolicySource - Add FindContainerByPodSetName helper in runtime package - Update torch plugin to inject PET envs into additional containers - Update jobset plugin to sync init containers - Add 5 comprehensive unit tests for envInjection scenarios - Add Python API models for envInjection types Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
b874a62 to
dbecc48
Compare
Change JobName from required to optional with omitempty tag to satisfy kubeapilinter validation. Add default marker to CRD schema for listMapKey requirement. Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
What this PR does / why we need it:
follow up of KEP #3417
Today, PET_* environment variables (PET_NNODES, PET_NPROC_PER_NODE, PET_NODE_RANK, PET_MASTER_ADDR, PET_MASTER_PORT) are injected only into the main trainer container. Init containers cannot read these distributed topology envs, which blocks preflight distributed checks before expensive training startup.
This PR adds an
envInjectionfield toTorchMLPolicySourcethat allows users to opt-in PET_* env injection into selected containers (init containers or sidecars) in the trainer replicated job.Which issue(s) this PR fixes:
Fixes #3416
Changes:
EnvInjectionandEnvInjectionTargettypes toTorchMLPolicySourceFindContainerByPodSetNamehelper to find con2. Runtime - AddFindContainerByPodSetNamehelper to find con2. Runtime - AddFindContainerByPodSetNamehelper to find con2. Runtime - AddFindContainerByPodSetNamehelper to find con2. Runtime - AddFindContainerByPodSetNamehelper to find con2. **Runtime(ru2. Runtime - AddFindContainerByPodSetNamehelper to find con2. Runtime - AddFindContainerByPodSetNamehelper to find con2. Runtime - AddFindContainerByPodSetNamehelper to find con2. Runtime - AddFindContainerByPodSetNamehelper to find con2. Runtime - AddFindke t2. **Runtime** - AddFindContainerByPodSetNamehelper to find con2. **Runtime** - AddFindContainerByPodSetNamehelper to find con2. **Runtime** - AddFindContainerByPodSetNamehelper tonot2. **Runtime** - AddFindContachM2. **Runtirce2. *Runtiting PET_ environment variables into init containers and sidecars.