[Feature]: Orchestration of cluster partitioning workflow #362

@salexo

Suggestion Description

Feature request: integrate GPU partitioning plans into AMD GPU Operator

Summary

  • Introduce cluster-scoped PartitioningPlan and NodePartitioning custom resources to orchestrate AMD GPU partitioning using DCM profiles.
  • Add controllers that safely drain, taint, label, verify, and restore nodes while coordinating rollouts and surfacing detailed status/conditions.
  • Leverage existing AMD GPU Operator primitives (DCM ConfigMap, amd-dcm taint, dcm.amd.com/gpu-config-profile label, amd.com/* allocatable resources) to provide an end-to-end partition management workflow.

Background & Motivation

Today, Device Config Manager supports applying static partition profiles, but fleet operators still need to script the manual parts: selecting nodes, draining workloads, enforcing serial rollouts, waiting for device plugin health, and validating allocatable GPUs. Introducing PartitioningPlan/NodePartitioning controllers would automate that lifecycle so platform teams can request multi-node GPU partitioning with a single CR.

Proposed Solution

API additions

  • PartitioningPlan (cluster scope): High-level rollout CR that links one or more DCM profiles to a set of nodes via label selectors. Status tracks phase, per-node summaries, and typed conditions for UIs/alerting.
  • NodePartitioning (cluster scope): Per-node work item owned by a plan, responsible for orchestrating partitioning on a single node (a rough sketch follows this list).
  • PartitioningProfileSpec: Inline DCM profile metadata with optional expected allocatable resources to assert after reconciliation, e.g. expect amd.com/cpx_nps4: 63.
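
To make the per-node object concrete, a projected NodePartitioning might look roughly like the sketch below. Field names such as nodeName and profileHash are illustrative assumptions, not a settled API; the profile block simply mirrors the plan's PartitioningProfileSpec.

apiVersion: amd.com/v1alpha1
kind: NodePartitioning
metadata:
  name: mi300-partitioning-node-a # hypothetical name derived from plan + node; owned via ownerReference by the plan
spec:
  nodeName: node-a # assumed field: the single node this work item targets
  profile:
    dcmProfileName: "cpx"
    expectedResources:
      amd.com/cpx_nps4: "63"
  profileHash: "sha256:<hash-of-profile>" # assumed field: deterministic hash used for drift detection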

Controller behavior

  • PartitioningPlan controller
    • Discovers matching nodes, excludes control plane hosts by default, and rejects ambiguous selector overlaps
    • Blocks conflicting ownership when another plan targets the same node
    • Projects desired NodePartitioning specs, computing deterministic hashes of the requested profile for drift detection
    • Aggregates per-node phase counts, emits user-facing conditions (PlanReady, RolloutProgressing/Completed/Degraded, Paused), and maintains a lightweight node status cache for dashboards
    • Cleans up stale child CRs when nodes fall out of scope, ensuring serialized ownership
    • Dry-run mode to preview the NodePartitioning resources that would be created, without impacting the driver
  • NodePartitioning controller
    • State machine drives taint/cordon, drain, DCM profile application, operator wait, verification, and cleanup
    • Each phase updates strongly typed conditions (NodeCordoned, NodeTainted, DrainCompleted, ProfileApplied, OperatorReady, Verified) and only advances once prerequisites succeed (see the status sketch after this list)
    • Verification asserts the DCM label and amd.com/* allocatable availability before untainting/uncordoning and marking success
    • Watches node events and allocatable resource deltas to retry automatically when the operator surfaces readiness
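
To illustrate the phase gating described above, a NodePartitioning status midway through a rollout could look something like the following fragment. The phase value, field layout, and reason strings are placeholders for this sketch; only the condition types come from the list above.

status:
  phase: Draining # assumed phase value
  conditions:
    - type: NodeCordoned
      status: "True"
      reason: CordonSucceeded # placeholder reason
    - type: NodeTainted
      status: "True"
      reason: TaintApplied # amd-dcm taint in place
    - type: DrainCompleted
      status: "False"
      reason: PodsEvicting
    - type: ProfileApplied
      status: "False"
      reason: WaitingForDrain
    - type: OperatorReady
      status: "False"
      reason: Pending
    - type: Verified
      status: "False"
      reason: Pending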

Custom Resource Examples

PartitioningPlan (apiVersion: amd.com/v1alpha1)

apiVersion: amd.com/v1alpha1
kind: PartitioningPlan
metadata:
  name: mi300-partitioning
spec:
  dryRun: false
  rollout:
    maxParallel: 1 # run at most one partitioning process at a time
    maxUnavailable: 1 # out of the targeted nodes, at most one can be unavailable at any time
    excludeControlPlane: true # defaults to true; if false, control plane nodes may also be targeted
  rules:
    - description: "Partitioned MI300X nodes"
      selector:
        matchLabels:
          amd.com/gpu.product-name: AMD_Instinct_MI300X_OAM
      profile:
        expectedResources:
          amd.com/cpx_nps4: "63"
        dcmProfileName: "cpx"
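
For completeness, the plan-level status (per-node summaries plus the PlanReady/Rollout*/Paused conditions mentioned earlier) might surface roughly as in the fragment below; the exact field names and counts are assumptions for illustration only.

status:
  phase: Progressing # assumed aggregate phase
  nodeSummary:
    total: 4
    pending: 2
    inProgress: 1 # bounded by rollout.maxParallel
    completed: 1
    failed: 0
  nodes:
    - name: node-a
      phase: Succeeded
    - name: node-b
      phase: Draining
  conditions:
    - type: PlanReady
      status: "True"
    - type: RolloutProgressing
      status: "True"
    - type: RolloutDegraded
      status: "False"
    - type: Paused
      status: "False"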

Acceptance Criteria / Work Items

  • Import the CRD definitions and generated deepcopy code into the AMD GPU Operator repo.
  • Add manager registration, field indexers, and RBAC for both controllers.
  • Package reconcilers with operator images and expose a feature flag for partitioning.
  • Document the workflow, including dry-run semantics and example plans, in the operator docs.
  • Provide automated tests (unit) covering selector conflicts, hash drift, and the state machine transitions.
  • Chainsaw tests for end-to-end coverage on real clusters (a minimal test sketch follows this list)
    • Any larger scale (>3 worker node) clusters to test on?
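
As a starting point for the end-to-end work item, a minimal Chainsaw test could be sketched along the following lines. The step and file names are placeholders; the referenced files would contain a PartitioningPlan targeting a labeled MI300X node and the expected NodePartitioning/node state to assert against.

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: partitioning-plan-e2e # placeholder name
spec:
  steps:
    - name: apply-plan
      try:
        - apply:
            file: partitioningplan.yaml # placeholder: plan with a single rule and dcmProfileName "cpx"
        - assert:
            file: nodepartitioning-created.yaml # placeholder: expected child CR projected by the plan
    - name: verify-node
      try:
        - assert:
            file: node-allocatable.yaml # placeholder: node reporting the expected amd.com/cpx_nps4 allocatable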

Operating System

No response

GPU

No response

ROCm Component

No response
