Suggestion Description
Feature request: integrate GPU partitioning plans into AMD GPU Operator
Summary
- Introduce cluster-scoped `PartitioningPlan` and `NodePartitioning` custom resources to orchestrate AMD GPU partitioning using DCM profiles.
- Add controllers that safely drain, taint, label, verify, and restore nodes while coordinating rollouts and surfacing detailed status/conditions.
- Leverage existing AMD GPU Operator primitives (the DCM ConfigMap, the `amd-dcm` taint, the `dcm.amd.com/gpu-config-profile` label, `amd.com/*` allocatable resources) to provide an end-to-end partition management workflow; a sketch of a node managed through these primitives follows below.
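For context, after DCM applies a profile a worker node exposes the profile label and the partitioned allocatable resources roughly as below (the `amd-dcm` taint would only be present while a change is in flight). The node name, profile value, and resource count are illustrative assumptions, not output from a real cluster:
```yaml
# Hypothetical, trimmed Node after a "cpx" profile has been applied and verified.
# Node name, label value, and allocatable count are illustrative assumptions.
apiVersion: v1
kind: Node
metadata:
  name: mi300x-worker-0
  labels:
    amd.com/gpu.product-name: AMD_Instinct_MI300X_OAM
    dcm.amd.com/gpu-config-profile: cpx   # profile label maintained by Device Config Manager
status:
  allocatable:
    amd.com/cpx_nps4: "63"                # partitioned GPU resources advertised by the device plugin
```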
Background & Motivation
Today, Device Config Manager supports applying static partition profiles, but fleet operators still need to script the manual parts: selecting nodes, draining workloads, enforcing serial rollouts, waiting for device plugin health, and validating allocatable GPUs. Introducing PartitioningPlan/NodePartitioning controllers would automate that lifecycle so platform teams can request multi-node GPU partitioning with a single CR.
Proposed Solution
API additions
- `PartitioningPlan` (cluster scope): High-level rollout CR that links one or more DCM profiles to a set of nodes via label selectors. Status tracks phase, per-node summaries, and typed conditions for UIs/alerting.
- `NodePartitioning` (cluster scope): Per-node work item owned by a plan, responsible for orchestrating the partitioning of a single node (an illustrative instance is sketched below).
- `PartitioningProfileSpec`: Inline DCM profile metadata with optional expected allocatable resources to assert after reconciliation, e.g. expect `amd.com/cpx_nps4: 63`.
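To make the shape concrete, a child resource projected by the plan controller could look something like the following sketch. The field names (`nodeName`, `profile`, `profileHash`) are assumptions about the proposed API, not an implemented schema:
```yaml
# Illustrative sketch only: field names are assumptions about the proposed API.
apiVersion: amd.com/v1alpha1
kind: NodePartitioning
metadata:
  name: mi300-partitioning-mi300x-worker-0
  ownerReferences:
  - apiVersion: amd.com/v1alpha1
    kind: PartitioningPlan
    name: mi300-partitioning
spec:
  nodeName: mi300x-worker-0
  profile:
    dcmProfileName: cpx
    expectedResources:
      amd.com/cpx_nps4: "63"
  profileHash: sha256-0a1b2c3d   # deterministic hash of the requested profile, used for drift detection
```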
Controller behavior
- PartitioningPlan controller
  - Discovers matching nodes, filters out control plane hosts by default, and rejects ambiguous selector overlaps
  - Blocks conflicting ownership when another plan targets the same node
  - Projects desired `NodePartitioning` specs, computing deterministic hashes of the requested profile for drift detection
  - Aggregates per-node phase counts, emits user-facing conditions (PlanReady, RolloutProgressing/Completed/Degraded, Paused), and maintains a lightweight node status cache for dashboards
  - Cleans up stale child CRs when nodes fall out of scope, ensuring serialized ownership
- NodePartitioning controller
  - State machine drives taint/cordon, drain, DCM profile application, operator wait, verification, and cleanup (an example of the resulting status is sketched below)
  - Dry-run mode to preview the `NodePartitioning` resources that would be created, without impacting the driver
  - Each phase updates strongly-typed conditions (NodeCordoned, NodeTainted, DrainCompleted, ProfileApplied, OperatorReady, Verified) and only advances once its prerequisites succeed
  - Verification asserts the DCM label and `amd.com/*` allocatable availability before untainting/uncordoning and marking success
  - Watches node events and allocatable resource deltas to retry automatically when the operator surfaces readiness
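To illustrate the intent, a node partway through the state machine might report status along these lines. The phase value and condition layout are assumptions about the proposed API, not an implemented schema:
```yaml
# Hypothetical NodePartitioning status snapshot; phase and condition layout are assumptions.
status:
  phase: ApplyingProfile
  conditions:
  - type: NodeCordoned
    status: "True"
    reason: CordonSucceeded
  - type: NodeTainted
    status: "True"
    reason: TaintApplied
  - type: DrainCompleted
    status: "True"
    reason: PodsEvicted
  - type: ProfileApplied
    status: "False"
    reason: WaitingForDeviceConfigManager
  - type: OperatorReady
    status: "False"
    reason: AwaitingDevicePlugin
  - type: Verified
    status: "False"
    reason: PendingVerification
```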
Custom Resource Examples
PartitioningPlan (apiVersion: amd.com/v1alpha1)
```yaml
apiVersion: amd.com/v1alpha1
kind: PartitioningPlan
metadata:
  name: mi300-partitioning
spec:
  dryRun: false
  rollout:
    maxParallel: 1             # run at most one partitioning process at a time
    maxUnavailable: 1          # of the targeted nodes, at most one can be unavailable at any time
    excludeControlPlane: true  # if false, control plane nodes may be included; defaults to true
  rules:
  - description: "Partitioned MI300X nodes"
    selector:
      matchLabels:
        amd.com/gpu.product-name: AMD_Instinct_MI300X_OAM
    profile:
      expectedResources:
        amd.com/cpx_nps4: "63"
      dcmProfileName: "cpx"
```
Acceptance Criteria / Work Items
- Import the CRD definitions and generated deepcopy code into the AMD GPU Operator repo.
- Add manager registration, field indexers, and RBAC for both controllers.
- Package reconcilers with operator images and expose a feature flag for partitioning.
- Document the workflow, including dry-run semantics and example plans, in the operator docs.
- Provide automated unit tests covering selector conflicts, hash drift, and the state machine transitions.
- Chainsaw tests for end-to-end validation on real clusters (a rough sketch follows below).
- Any larger scale (>3 worker node) clusters to test on?
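For the Chainsaw item, an end-to-end test could apply a plan and then assert the expected node state once the rollout completes. The following is a rough sketch; the file name and asserted values are placeholders and would need to match the final API:
```yaml
# Rough Chainsaw sketch: apply a PartitioningPlan, then assert that a node carries the
# DCM profile label and the expected allocatable resources. Values are placeholders;
# a real test would also need generous timeouts for drain and profile application.
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: partitioning-plan-e2e
spec:
  steps:
  - try:
    - apply:
        file: partitioningplan.yaml       # the PartitioningPlan under test
    - assert:
        resource:
          apiVersion: v1
          kind: Node
          metadata:
            labels:
              dcm.amd.com/gpu-config-profile: cpx
          status:
            allocatable:
              amd.com/cpx_nps4: "63"
```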
Operating System
No response
GPU
No response
ROCm Component
No response