Problem Description
I have an AMD GPU cluster with a single node carrying the label feature.node.kubernetes.io/amd-vgpu=true. I have a DeviceConfig that selects this node for node labelling and metrics export. Here is the DeviceConfig:
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-deviceconfig
  namespace: egs-gpu-operator
spec:
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    enableNodeLabeller: true
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  driver:
    enable: false
  metricsExporter:
    enable: true
    image: docker.io/rocm/device-metrics-exporter:v1.3.0
    nodePort: 32500
    port: 5000
    serviceType: NodePort
    config:
      name: amd-device-metrics-config
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"
After I apply the DeviceConfig, the metrics-exporter and device-plugin pods come up but get stuck in CrashLoopBackOff.
The logs of the metrics-exporter pod are:
k logs --previous amd-deviceconfig-metrics-exporter-g8vd7
Defaulted container "metrics-exporter-container" out of: metrics-exporter-container, driver-init (init)
exporter 2025/08/06 06:23:41 main.go:67: Version : v1.3.0-24
exporter 2025/08/06 06:23:41 main.go:68: BuildDate: 2025-06-08T18:40:18+0000
exporter 2025/08/06 06:23:41 main.go:69: GitCommit: 0aa271cb
exporter 2025/08/06 06:23:41 main.go:79: Debug APIs enabled
exporter 2025/08/06 06:23:41 config_handler.go:40: Running Config :/etc/metrics/config.json, gpuagent port 50061
exporter 2025/08/06 06:23:41 exporter.go:226: metrics service starting
exporter 2025/08/06 06:23:41 gpuagent.go:88: Profiler metrics client enabled
exporter 2025/08/06 06:23:41 rocpclient.go:44: NewRocProfilerClient rocpclient
exporter 2025/08/06 06:23:41 svc_handler.go:59: starting listening on socket : /var/lib/amd-metrics-exporter/amdgpu_device_metrics_exporter_grpc.socket
exporter 2025/08/06 06:23:41 svc_handler.go:68: Listening on socket /var/lib/amd-metrics-exporter/amdgpu_device_metrics_exporter_grpc.socket
exporter 2025/08/06 06:23:41 gpuagent.go:72: Agent connecting to 0.0.0.0:50061
exporter 2025/08/06 06:23:41 k8s.go:57: created k8s scheduler client
exporter 2025/08/06 06:23:41 slurm.go:54: Starting Listen on port 6601
exporter 2025/08/06 06:23:41 slurm.go:132: created slurm scheduler client
exporter 2025/08/06 06:23:41 gpuagent_metrics.go:2040: hostame master-1
exporter 2025/08/06 06:23:41 exporter.go:124: config directory for watch : /etc/metrics
exporter 2025/08/06 06:23:41 gpuagent.go:155: GPUAgent monitor started
exporter 2025/08/06 06:23:41 slurm.go:74: too many open files
This output keeps repeating on every restart.
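The "too many open files" at slurm.go:74 suggests the exporter is hitting a file-descriptor or inotify limit on the node rather than an application-level bug. A quick check, assuming root shell access on the affected node (master-1):

# inotify limits are a common cause of "too many open files" from watch-based code
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
# count inotify instances currently in use across all processes
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l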
Similarly, the logs of the device-plugin pod are:
k logs --previous amd-deviceconfig-device-plugin-7m7cq
Defaulted container "device-plugin" out of: device-plugin, driver-init (init)
I0806 06:22:17.046347 1 main.go:120] AMD GPU device plugin for Kubernetes
I0806 06:22:17.046383 1 main.go:120] ./k8s-device-plugin version v1.25.2.7-98-g763445e1
I0806 06:22:17.046386 1 main.go:120] hwloc: _VERSION: 2.11.2, _API_VERSION: 0x00020b00, _COMPONENT_ABI: 7, Runtime: 0x00020b00
I0806 06:22:17.046392 1 manager.go:42] Starting device plugin manager
I0806 06:22:17.046404 1 manager.go:46] Registering for system signal notifications
I0806 06:22:17.046443 1 main.go:131] Heart beating every 30 seconds
I0806 06:22:17.046537 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
panic: runtime error: invalid memory address or nil pointer dereference
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x8b418a]
goroutine 1 [running]:
github.com/fsnotify/fsnotify.(*Watcher).Close(0x0)
/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:305 +0x2a
panic({0x931240?, 0xdf4140?})
/usr/local/go/src/runtime/panic.go:785 +0x132
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:296
github.com/fsnotify/fsnotify.(*Watcher).AddWith(0x0, {0x9dda56, 0x20}, {0x0, 0x0, 0x4951ca?})
/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:372 +0x3a
github.com/fsnotify/fsnotify.(*Watcher).Add(...)
/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:362
github.com/kubevirt/device-plugin-manager/pkg/dpm.(*Manager).Run(0xc000034ea0)
/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/kubevirt/device-plugin-manager/pkg/dpm/manager.go:55 +0x209
main.main()
/go/src/github.com/ROCm/k8s-device-plugin/cmd/k8s-device-plugin/main.go:153 +0x712
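This trace looks consistent with the same underlying limit: fsnotify.NewWatcher() wraps inotify_init1, and if that call fails (for example with "too many open files") while the returned error goes unchecked, the manager is left with a nil watcher, which would match the nil receiver (0x0) in the AddWith and Close frames above. If inotify exhaustion is confirmed by the check earlier, raising the node limits is a possible workaround; the values below are illustrative, not recommendations:

# run as root on the node; values are illustrative
sysctl -w fs.inotify.max_user_instances=1024
sysctl -w fs.inotify.max_user_watches=1048576
# persist across reboots
printf 'fs.inotify.max_user_instances=1024\nfs.inotify.max_user_watches=1048576\n' > /etc/sysctl.d/99-inotify.conf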
This could be a new issue: about 5 days ago I created an issue (#266) with the exact same setup, but I had to bring the cluster down over the weekend. Since then, I have been unable to get back to the same working cluster setup.
Operating System
Ubuntu 24.04
CPU
INTEL(R) XEON(R) PLATINUM 8568Y+
GPU
AMD Instinct MI300X VF
ROCm Version
1.3.0
ROCm Component
No response
Steps to Reproduce
On a k8s cluster with an AMD GPU node carrying the feature label feature.node.kubernetes.io/amd-vgpu=true.
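To confirm the DeviceConfig selector will match, the label can be verified with:

kubectl get nodes -l feature.node.kubernetes.io/amd-vgpu=true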
Install ROCm/gpu-operator according to the docs:
helm install amd-gpu-operator rocm/gpu-operator-charts \
--namespace egs-gpu-operator \
--create-namespace \
--version=v1.3.0
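Before proceeding, it is worth confirming the operator controller came up cleanly in the namespace:

kubectl get pods -n egs-gpu-operator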
Create a configmap for the metrics exporter from the example given in the docs (https://instinct.docs.amd.com/projects/gpu-operator/en/latest/metrics/exporter.html#customize-metrics-fields-labels):
kubectl create configmap amd-device-metrics-config --from-file=examples/metricsExporter/config.json
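Note that, as written, this creates the configmap in the current context's default namespace; since the DeviceConfig below references it from egs-gpu-operator, it presumably needs to be created there explicitly:

kubectl create configmap amd-device-metrics-config \
  --from-file=examples/metricsExporter/config.json \
  -n egs-gpu-operator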
DeviceConfig:
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-deviceconfig
  namespace: egs-gpu-operator
spec:
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    enableNodeLabeller: true
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  driver:
    enable: false
  metricsExporter:
    enable: true
    image: docker.io/rocm/device-metrics-exporter:v1.3.0
    nodePort: 32500
    port: 5000
    serviceType: NodePort
    config:
      name: amd-device-metrics-config
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"
Apply the DeviceConfig:
k apply -f amd-deviceconfig.yaml -n egs-gpu-operator
Notice that the device-plugin and metrics-exporter pods in the operator namespace (in my case egs-gpu-operator) are stuck in CrashLoopBackOff:
k get pods
NAME READY STATUS RESTARTS AGE
amd-deviceconfig-device-plugin-7m7cq 0/1 CrashLoopBackOff 8 (2m52s ago) 19m
amd-deviceconfig-metrics-exporter-g8vd7 0/1 CrashLoopBackOff 8 (113s ago) 19m
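For reference, once the pods are healthy the exporter should be reachable on the configured NodePort; the node IP below is a placeholder:

curl http://<node-ip>:32500/metrics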
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
This is a new issue; I did not observe it earlier in the same setup. It could be linked to changes made in the latest images of the metrics exporter and device plugin.
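To check whether the :latest tags actually changed underneath the cluster, the resolved image digests can be read from the pod status (pod name taken from the output above):

kubectl -n egs-gpu-operator get pod amd-deviceconfig-device-plugin-7m7cq \
  -o jsonpath='{.status.containerStatuses[*].imageID}'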