feat: inject PET_* envs into init containers via envInjection config#3516

Draft
panpan0000 wants to merge 3 commits into kubeflow:master from panpan0000:feature/pet-env-init-containers

Contributor

@panpan0000 panpan0000 commented May 17, 2026

What this PR does / why we need it:

Follow-up to KEP #3417.

Today, PET_* environment variables (PET_NNODES, PET_NPROC_PER_NODE, PET_NODE_RANK, PET_MASTER_ADDR, PET_MASTER_PORT) are injected only into the main trainer container. Init containers cannot read these distributed topology envs, which blocks preflight distributed checks before expensive training startup.

This PR adds an envInjection field to TorchMLPolicySource that lets users opt in to PET_* env injection for selected containers (init containers or sidecars) in the trainer replicated job.
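For readers who want a feel for the opt-in shape, here is a rough sketch in Go. Only the type names (TorchMLPolicySource, EnvInjection, EnvInjectionTarget) and the optional JobName field come from this PR; the `Targets` and `ContainerName` fields, the JSON tags, and the `renderPolicy` helper are assumptions made up for illustration, and the real types carry more fields and markers than shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EnvInjectionTarget names one container that should receive the PET_* envs.
// JobName is mentioned in the PR; ContainerName is a hypothetical field.
type EnvInjectionTarget struct {
	// JobName is optional (omitempty), matching the PR's later commit.
	JobName string `json:"jobName,omitempty"`
	// ContainerName is an assumed field identifying the init container or sidecar.
	ContainerName string `json:"containerName"`
}

// EnvInjection groups the opt-in targets. The Targets field name is assumed.
type EnvInjection struct {
	Targets []EnvInjectionTarget `json:"targets,omitempty"`
}

// TorchMLPolicySource is trimmed to the new field only; the real type has more.
type TorchMLPolicySource struct {
	EnvInjection *EnvInjection `json:"envInjection,omitempty"`
}

// renderPolicy is an illustrative helper that serializes the policy as JSON.
func renderPolicy(p TorchMLPolicySource) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	policy := TorchMLPolicySource{
		EnvInjection: &EnvInjection{
			Targets: []EnvInjectionTarget{
				{JobName: "node", ContainerName: "preflight"},
			},
		},
	}
	out, err := renderPolicy(policy)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out)
}
```

Leaving envInjection unset serializes to an empty object in this sketch, which mirrors the opt-in intent: nothing changes for existing runtimes unless a target is listed.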

Which issue(s) this PR fixes:

Fixes #3416

Changes:

  1. API - Add EnvInjection and EnvInjectionTarget types to TorchMLPolicySource
  2. Runtime - Add FindContainerByPodSetName helper to find containers by PodSet name
  [...] PET_* environment variables into init containers and sidecars.



Co-authored-by: AI Assistant 

Copilot AI review requested due to automatic review settings May 17, 2026 06:32
@panpan0000 panpan0000 marked this pull request as draft May 17, 2026 06:32
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

This PR adds Torch runtime configuration for injecting PET distributed-topology environment variables into selected non-trainer containers, enabling init containers or sidecars to perform distributed preflight checks before trainer startup.

Changes:

  • Adds envInjection API types and generated CRD/apply/deepcopy/openapi artifacts for Torch ML policy configuration.
  • Adds runtime helpers and Torch plugin logic to inject PET env vars into configured containers.
  • Extends JobSet build logic and Torch tests to cover init-container and sidecar injection scenarios.
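
As a rough illustration of the lookup-then-inject flow summarized above, the sketch below uses trimmed stand-in types instead of the real corev1 and runtime types. The `findContainer` and `upsertEnv` names, their signatures, the choice to search init containers first, and the env values are all assumptions; the PR's actual FindContainerByPodSetName signature is not shown on this page.

```go
package main

import "fmt"

// EnvVar and Container are trimmed stand-ins for the Kubernetes core types.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

// PodSet is a simplified stand-in for the runtime's per-job pod description.
type PodSet struct {
	Name           string
	InitContainers []Container
	Containers     []Container
}

// findContainer returns a pointer to the named container in the named PodSet,
// checking init containers before regular ones, or nil when nothing matches.
func findContainer(podSets []PodSet, podSetName, containerName string) *Container {
	for i := range podSets {
		if podSets[i].Name != podSetName {
			continue
		}
		for j := range podSets[i].InitContainers {
			if podSets[i].InitContainers[j].Name == containerName {
				return &podSets[i].InitContainers[j]
			}
		}
		for j := range podSets[i].Containers {
			if podSets[i].Containers[j].Name == containerName {
				return &podSets[i].Containers[j]
			}
		}
	}
	return nil
}

// upsertEnv sets each variable on the container, overwriting an existing
// entry with the same name so that repeated injection stays idempotent.
func upsertEnv(c *Container, envs []EnvVar) {
	for _, e := range envs {
		replaced := false
		for i := range c.Env {
			if c.Env[i].Name == e.Name {
				c.Env[i] = e
				replaced = true
				break
			}
		}
		if !replaced {
			c.Env = append(c.Env, e)
		}
	}
}

func main() {
	podSets := []PodSet{{
		Name:           "node",
		InitContainers: []Container{{Name: "preflight"}},
		Containers:     []Container{{Name: "node"}},
	}}
	// Illustrative values only; the plugin derives the real ones from the TrainJob.
	petEnvs := []EnvVar{
		{Name: "PET_NNODES", Value: "2"},
		{Name: "PET_NPROC_PER_NODE", Value: "4"},
		{Name: "PET_MASTER_PORT", Value: "29400"},
	}
	if c := findContainer(podSets, "node", "preflight"); c != nil {
		upsertEnv(c, petEnvs)
	}
	fmt.Println(podSets[0].InitContainers[0].Env)
}
```

Returning a pointer into the slice lets the caller mutate the container in place, which is one way the injected envs could end up visible to a later build step.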

Reviewed changes

Copilot reviewed 12 out of 19 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `pkg/apis/trainer/v1alpha1/trainingruntime_types.go` | Adds Torch envInjection API structs. |
| `pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go` | Generated deepcopy support. |
| `pkg/apis/trainer/v1alpha1/zz_generated.openapi.go` | Generated OpenAPI schema updates. |
| `pkg/client/applyconfiguration/utils.go` | Registers new apply configuration types. |
| `pkg/client/applyconfiguration/trainer/v1alpha1/*.go` | Adds/updates generated apply configuration builders. |
| `pkg/runtime/runtime.go` | Adds container lookup by PodSet name. |
| `pkg/runtime/framework/plugins/torch/torch.go` | Injects PET env vars into configured containers. |
| `pkg/runtime/framework/plugins/torch/torch_test.go` | Adds Torch envInjection test cases. |
| `pkg/runtime/framework/plugins/jobset/jobset.go` | Propagates init-container mutations into JobSet output. |
| `manifests/base/crds/*.yaml` | Updates base CRDs with envInjection schema. |
| `charts/kubeflow-trainer/crds/*.yaml` | Updates Helm CRDs with envInjection schema. |
| `api/openapi-spec/swagger.json` | Updates published OpenAPI spec. |
| `pkg/util/testing/wrapper.go` | Adds test helper for Torch envInjection policy. |
Files not reviewed (7)
  • pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go: Language not supported
  • pkg/client/applyconfiguration/trainer/v1alpha1/envinjection.go: Language not supported
  • pkg/client/applyconfiguration/trainer/v1alpha1/envinjectiontarget.go: Language not supported
  • pkg/client/applyconfiguration/trainer/v1alpha1/mlpolicy.go: Language not supported
  • pkg/client/applyconfiguration/trainer/v1alpha1/mlpolicysource.go: Language not supported
  • pkg/client/applyconfiguration/trainer/v1alpha1/torchmlpolicysource.go: Language not supported
  • pkg/client/applyconfiguration/utils.go: Language not supported

Comment thread pkg/runtime/framework/plugins/torch/torch.go Outdated
@panpan0000 panpan0000 force-pushed the feature/pet-env-init-containers branch 4 times, most recently from 7f649f9 to b874a62 Compare May 17, 2026 13:07
- Add EnvInjection and EnvInjectionTarget types to TorchMLPolicySource
- Add FindContainerByPodSetName helper in runtime package
- Update torch plugin to inject PET envs into additional containers
- Update jobset plugin to sync init containers
- Add 5 comprehensive unit tests for envInjection scenarios
- Add Python API models for envInjection types

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
@panpan0000 panpan0000 force-pushed the feature/pet-env-init-containers branch from b874a62 to dbecc48 Compare May 18, 2026 01:33
Change JobName from required to optional with omitempty tag to satisfy
kubeapilinter validation. Add default marker to CRD schema for listMapKey
requirement.

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Development

Successfully merging this pull request may close these issues.

Support injecting Torch PET_* envs into trainer init containers