
[Issue]: metrics-exporter pod and device-plugin pod crash for amd-vgpu K8s cluster #269

@gourishkb

Description


Problem Description

I have an AMD GPU cluster with a single node carrying the label feature.node.kubernetes.io/amd-vgpu=true. I have a DeviceConfig that selects this node for node labelling and metrics export; here is the DeviceConfig:

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-deviceconfig
  namespace: egs-gpu-operator
spec:
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    enableNodeLabeller: true
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  driver:
    enable: false
  metricsExporter:
    enable: true
    image: docker.io/rocm/device-metrics-exporter:v1.3.0
    nodePort: 32500
    port: 5000
    serviceType: NodePort
    config:
      name: amd-device-metrics-config
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"

After I apply the DeviceConfig, the metrics-exporter and device-plugin pods come up but get stuck in CrashLoopBackOff.

The logs of the metrics-exporter pod are:

k logs --previous amd-deviceconfig-metrics-exporter-g8vd7 
Defaulted container "metrics-exporter-container" out of: metrics-exporter-container, driver-init (init)
exporter 2025/08/06 06:23:41 main.go:67: Version : v1.3.0-24
exporter 2025/08/06 06:23:41 main.go:68: BuildDate: 2025-06-08T18:40:18+0000
exporter 2025/08/06 06:23:41 main.go:69: GitCommit: 0aa271cb
exporter 2025/08/06 06:23:41 main.go:79: Debug APIs enabled
exporter 2025/08/06 06:23:41 config_handler.go:40: Running Config :/etc/metrics/config.json, gpuagent port 50061
exporter 2025/08/06 06:23:41 exporter.go:226: metrics service starting
exporter 2025/08/06 06:23:41 gpuagent.go:88: Profiler metrics client enabled
exporter 2025/08/06 06:23:41 rocpclient.go:44: NewRocProfilerClient rocpclient
exporter 2025/08/06 06:23:41 svc_handler.go:59: starting listening on socket : /var/lib/amd-metrics-exporter/amdgpu_device_metrics_exporter_grpc.socket
exporter 2025/08/06 06:23:41 svc_handler.go:68: Listening on socket /var/lib/amd-metrics-exporter/amdgpu_device_metrics_exporter_grpc.socket
exporter 2025/08/06 06:23:41 gpuagent.go:72: Agent connecting to 0.0.0.0:50061
exporter 2025/08/06 06:23:41 k8s.go:57: created k8s scheduler client
exporter 2025/08/06 06:23:41 slurm.go:54: Starting Listen on port 6601
exporter 2025/08/06 06:23:41 slurm.go:132: created slurm scheduler client
exporter 2025/08/06 06:23:41 gpuagent_metrics.go:2040: hostame master-1
exporter 2025/08/06 06:23:41 exporter.go:124: config directory for watch : /etc/metrics
exporter 2025/08/06 06:23:41 gpuagent.go:155: GPUAgent monitor started
exporter 2025/08/06 06:23:41 slurm.go:74: too many open files

and this keeps repeating.
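
The "too many open files" failure is EMFILE from setting up the Slurm listener, which points at a file-descriptor or inotify limit on the node rather than at the exporter config itself. A quick diagnostic sketch, assuming shell access to the node (the pod name is from my cluster, and exec only works while the container is up between restarts):

# Per-process open-file limit inside the exporter container:
kubectl exec -n egs-gpu-operator amd-deviceconfig-metrics-exporter-g8vd7 -- sh -c 'ulimit -n'
# System-wide file handle usage on the node (allocated / free / max):
cat /proc/sys/fs/file-nr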

Similarly, the logs of the device-plugin pod are:

k logs --previous amd-deviceconfig-device-plugin-7m7cq
Defaulted container "device-plugin" out of: device-plugin, driver-init (init)
I0806 06:22:17.046347       1 main.go:120] AMD GPU device plugin for Kubernetes
I0806 06:22:17.046383       1 main.go:120] ./k8s-device-plugin version v1.25.2.7-98-g763445e1
I0806 06:22:17.046386       1 main.go:120] hwloc: _VERSION: 2.11.2, _API_VERSION: 0x00020b00, _COMPONENT_ABI: 7, Runtime: 0x00020b00
I0806 06:22:17.046392       1 manager.go:42] Starting device plugin manager
I0806 06:22:17.046404       1 manager.go:46] Registering for system signal notifications
I0806 06:22:17.046443       1 main.go:131] Heart beating every 30 seconds
I0806 06:22:17.046537       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x8b418a]

goroutine 1 [running]:
github.com/fsnotify/fsnotify.(*Watcher).Close(0x0)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:305 +0x2a
panic({0x931240?, 0xdf4140?})
	/usr/local/go/src/runtime/panic.go:785 +0x132
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:296
github.com/fsnotify/fsnotify.(*Watcher).AddWith(0x0, {0x9dda56, 0x20}, {0x0, 0x0, 0x4951ca?})
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:372 +0x3a
github.com/fsnotify/fsnotify.(*Watcher).Add(...)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:362
github.com/kubevirt/device-plugin-manager/pkg/dpm.(*Manager).Run(0xc000034ea0)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/kubevirt/device-plugin-manager/pkg/dpm/manager.go:55 +0x209
main.main()
	/go/src/github.com/ROCm/k8s-device-plugin/cmd/k8s-device-plugin/main.go:153 +0x712
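
The receiver in the (*Watcher).AddWith(0x0, ...) frame and the deferred (*Watcher).Close(0x0) is a nil pointer, which is consistent with fsnotify.NewWatcher() failing and its error going unchecked in dpm.(*Manager).Run. NewWatcher() creates an inotify instance, and inotify_init fails with EMFILE ("too many open files") when fs.inotify.max_user_instances is exhausted, which would tie this panic to the same limit the exporter is hitting. A diagnostic sketch, assuming root shell access on the node (the sysctl value at the end is illustrative, not a recommendation):

# Current inotify limits on the node:
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
# Count inotify instances held per PID; exhausting the per-user limit
# makes inotify_init fail with EMFILE ("too many open files"):
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -rn | head
# Possible mitigation, value illustrative:
sudo sysctl fs.inotify.max_user_instances=512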

This could be a new issue: about 5 days ago I created an issue (#266) with the exact same setup, but I had to bring the cluster down over the weekend, and since then I have been unable to return to the same cluster state.

Operating System

Ubuntu 24.04

CPU

INTEL(R) XEON(R) PLATINUM 8568Y+

GPU

AMD Instinct MI300X VF

ROCm Version

1.3.0

ROCm Component

No response

Steps to Reproduce

On a k8s cluster with an AMD GPU node carrying the feature label feature.node.kubernetes.io/amd-vgpu=true, install ROCm/gpu-operator according to the docs:

helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace egs-gpu-operator \
  --create-namespace \
  --version=v1.3.0
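
(Sanity check, not part of the documented steps: confirm the operator controller came up before applying the DeviceConfig.)

helm list -n egs-gpu-operator
kubectl get pods -n egs-gpu-operator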

Create a ConfigMap for the metrics exporter from the example given in the docs (https://instinct.docs.amd.com/projects/gpu-operator/en/latest/metrics/exporter.html#customize-metrics-fields-labels):

kubectl create configmap amd-device-metrics-config --from-file=examples/metricsExporter/config.json
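
Note that, as written, this creates the ConfigMap in the current context's default namespace. Since the DeviceConfig references it from egs-gpu-operator, it is worth verifying where it landed (a sanity check on my part, not a confirmed cause):

kubectl get configmap amd-device-metrics-config -n egs-gpu-operator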

Create the DeviceConfig:

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-deviceconfig
  namespace: egs-gpu-operator
spec:
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    enableNodeLabeller: true
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  driver:
    enable: false
  metricsExporter:
    enable: true
    image: docker.io/rocm/device-metrics-exporter:v1.3.0
    nodePort: 32500
    port: 5000
    serviceType: NodePort
    config:
      name: amd-device-metrics-config
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"

Apply the DeviceConfig:

k apply -f amd-deviceconfig.yaml -n egs-gpu-operator

Notice that in the operator namespace (in my case egs-gpu-operator) the device-plugin and metrics-exporter pods are stuck in CrashLoopBackOff:

k get pods
NAME                                                              READY   STATUS             RESTARTS        AGE
amd-deviceconfig-device-plugin-7m7cq                              0/1     CrashLoopBackOff   8 (2m52s ago)   19m
amd-deviceconfig-metrics-exporter-g8vd7                           0/1     CrashLoopBackOff   8 (113s ago)    19m
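
The restart reason and last exit state can be pulled from the pod status (standard kubectl; pod names are from my cluster):

kubectl describe pod amd-deviceconfig-device-plugin-7m7cq -n egs-gpu-operator | grep -A5 'Last State'
kubectl get pod amd-deviceconfig-metrics-exporter-g8vd7 -n egs-gpu-operator \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'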

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

This is a new issue; I did not observe it earlier in the same setup. It could be linked to changes in the latest images of the metrics exporter and device plugin.
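
Since the DeviceConfig pulls rocm/k8s-device-plugin:latest, a regression in a newly pushed image would show up exactly this way. A possible mitigation while this is investigated is to pin the images to fixed tags (the tags below are placeholders, not verified-good versions):

devicePlugin:
  devicePluginImage: rocm/k8s-device-plugin:<known-good-tag>            # placeholder
  nodeLabellerImage: rocm/k8s-device-plugin:<known-good-labeller-tag>   # placeholder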

Metadata

Labels

RCA done (Root Cause Analysis done), bug (Something isn't working)
