[Feature]: Orchestration of cluster partitioning workflow #362

@salexo

Suggestion Description

Feature request: integrate GPU partitioning plans into AMD GPU Operator

Summary

  • Introduce cluster-scoped PartitioningPlan and NodePartitioning custom resources to orchestrate AMD GPU partitioning using DCM profiles.
  • Add controllers that safely drain, taint, label, verify, and restore nodes while coordinating rollouts and surfacing detailed status/conditions.
  • Leverage existing AMD GPU Operator primitives (DCM ConfigMap, amd-dcm taint, dcm.amd.com/gpu-config-profile label, amd.com/* allocatable resources) to provide an end-to-end partition management workflow.

Background & Motivation

Today, Device Config Manager supports applying static partition profiles, but fleet operators still need to script the manual parts: selecting nodes, draining workloads, enforcing serial rollouts, waiting for device plugin health, and validating allocatable GPUs. Introducing PartitioningPlan/NodePartitioning controllers would automate that lifecycle so platform teams can request multi-node GPU partitioning with a single CR.

Proposed Solution

API additions

  • PartitioningPlan (cluster scope): High-level rollout CR that links one or more DCM profiles to a set of nodes via label selectors. Status tracks phase, per-node summaries, and typed conditions for UIs/alerting.
  • NodePartitioning (cluster scope): Per-node work item owned by a plan, responsible for orchestrating partitioning on a single node (a rough sketch follows this list).
  • PartitioningProfileSpec: Inline DCM profile metadata with optional expected allocatable resources to assert after reconciliation, e.g. expect amd.com/cpx_nps4: 63.
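
To make the per-node object concrete, a projected NodePartitioning might look roughly like the sketch below. Field names such as nodeName and profileHash are illustrative assumptions, not a settled API; the profile block simply mirrors the plan's PartitioningProfileSpec.

apiVersion: amd.com/v1alpha1
kind: NodePartitioning
metadata:
  name: mi300-partitioning-node-a # hypothetical name derived from plan + node; owned via ownerReference by the plan
spec:
  nodeName: node-a # assumed field: the single node this work item targets
  profile:
    dcmProfileName: "cpx"
    expectedResources:
      amd.com/cpx_nps4: "63"
  profileHash: "sha256:<hash-of-profile>" # assumed field: deterministic hash used for drift detection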

Controller behavior

  • PartitioningPlan controller
    • Discovers matching nodes, excludes control plane hosts by default, and rejects ambiguous selector overlaps
    • Blocks conflicting ownership when another plan targets the same node
    • Projects desired NodePartitioning specs, computing deterministic hashes of the requested profile for drift detection
    • Aggregates per-node phase counts, emits user-facing conditions (PlanReady, RolloutProgressing/Completed/Degraded, Paused), and maintains a lightweight node status cache for dashboards
    • Cleans up stale child CRs when nodes fall out of scope, ensuring serialized ownership
    • Dry-run mode to preview the NodePartitioning resources that would be created, without impacting the driver
  • NodePartitioning controller
    • State machine drives taint/cordon, drain, DCM profile application, operator wait, verification, and cleanup
    • Each phase updates strongly typed conditions (NodeCordoned, NodeTainted, DrainCompleted, ProfileApplied, OperatorReady, Verified) and only advances once prerequisites succeed (see the status sketch after this list)
    • Verification asserts the DCM label and amd.com/* allocatable availability before untainting/uncordoning and marking success
    • Watches node events and allocatable resource deltas to retry automatically when the operator surfaces readiness
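
To illustrate the phase gating described above, a NodePartitioning status midway through a rollout could look something like the following fragment. The phase value, field layout, and reason strings are placeholders for this sketch; only the condition types come from the list above.

status:
  phase: Draining # assumed phase value
  conditions:
    - type: NodeCordoned
      status: "True"
      reason: CordonSucceeded # placeholder reason
    - type: NodeTainted
      status: "True"
      reason: TaintApplied # amd-dcm taint in place
    - type: DrainCompleted
      status: "False"
      reason: PodsEvicting
    - type: ProfileApplied
      status: "False"
      reason: WaitingForDrain
    - type: OperatorReady
      status: "False"
      reason: Pending
    - type: Verified
      status: "False"
      reason: Pending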

Custom Resource Examples

PartitioningPlan (apiVersion: amd.com/v1alpha1)

apiVersion: amd.com/v1alpha1
kind: PartitioningPlan
metadata:
  name: mi300-partitioning
spec:
  dryRun: false
  rollout:
    maxParallel: 1 # run at most one partitioning process at a time
    maxUnavailable: 1 # out of the targeted nodes, at most one can be unavailable at any time
    excludeControlPlane: true # defaults to true; if false, control plane nodes may also be targeted
  rules:
    - description: "Partitioned MI300X nodes"
      selector:
        matchLabels:
          amd.com/gpu.product-name: AMD_Instinct_MI300X_OAM
      profile:
        expectedResources:
          amd.com/cpx_nps4: "63"
        dcmProfileName: "cpx"
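
For completeness, the plan-level status (per-node summaries plus the PlanReady/Rollout*/Paused conditions mentioned earlier) might surface roughly as in the fragment below; the exact field names and counts are assumptions for illustration only.

status:
  phase: Progressing # assumed aggregate phase
  nodeSummary:
    total: 4
    pending: 2
    inProgress: 1 # bounded by rollout.maxParallel
    completed: 1
    failed: 0
  nodes:
    - name: node-a
      phase: Succeeded
    - name: node-b
      phase: Draining
  conditions:
    - type: PlanReady
      status: "True"
    - type: RolloutProgressing
      status: "True"
    - type: RolloutDegraded
      status: "False"
    - type: Paused
      status: "False"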

Acceptance Criteria / Work Items

  • Import the CRD definitions and generated deepcopy code into the AMD GPU Operator repo.
  • Add manager registration, field indexers, and RBAC for both controllers.
  • Package reconcilers with operator images and expose a feature flag for partitioning.
  • Document the workflow, including dry-run semantics and example plans, in the operator docs.
  • Provide automated tests (unit) covering selector conflicts, hash drift, and the state machine transitions.
  • Chainsaw tests for end-to-end coverage on real clusters (a minimal test sketch follows this list)
    • Any larger scale (>3 worker node) clusters to test on?
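
As a starting point for the end-to-end work item, a minimal Chainsaw test could be sketched along the following lines. The step and file names are placeholders; the referenced files would contain a PartitioningPlan targeting a labeled MI300X node and the expected NodePartitioning/node state to assert against.

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: partitioning-plan-e2e # placeholder name
spec:
  steps:
    - name: apply-plan
      try:
        - apply:
            file: partitioningplan.yaml # placeholder: plan with a single rule and dcmProfileName "cpx"
        - assert:
            file: nodepartitioning-created.yaml # placeholder: expected child CR projected by the plan
    - name: verify-node
      try:
        - assert:
            file: node-allocatable.yaml # placeholder: node reporting the expected amd.com/cpx_nps4 allocatable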

Operating System

No response

GPU

No response

ROCm Component

No response
