
[Issue]: metrics-exporter pod and device-plugin pod crash for amd-vgpu K8s cluster #269

@gourishkb

Description


Problem Description

I have an AMD GPU cluster with a single node carrying the label feature.node.kubernetes.io/amd-vgpu=true. I have a DeviceConfig that selects this node for node labelling and metrics export; here is the DeviceConfig:

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-deviceconfig
  namespace: egs-gpu-operator
spec:
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    enableNodeLabeller: true
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  driver:
    enable: false
  metricsExporter:
    enable: true
    image: docker.io/rocm/device-metrics-exporter:v1.3.0
    nodePort: 32500
    port: 5000
    serviceType: NodePort
    config:
      name: amd-device-metrics-config
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"

After I apply the DeviceConfig, the metrics-exporter and device-plugin pods come up but get stuck in CrashLoopBackOff.

The logs of the metrics-exporter pod are:

k logs --previous amd-deviceconfig-metrics-exporter-g8vd7 
Defaulted container "metrics-exporter-container" out of: metrics-exporter-container, driver-init (init)
exporter 2025/08/06 06:23:41 main.go:67: Version : v1.3.0-24
exporter 2025/08/06 06:23:41 main.go:68: BuildDate: 2025-06-08T18:40:18+0000
exporter 2025/08/06 06:23:41 main.go:69: GitCommit: 0aa271cb
exporter 2025/08/06 06:23:41 main.go:79: Debug APIs enabled
exporter 2025/08/06 06:23:41 config_handler.go:40: Running Config :/etc/metrics/config.json, gpuagent port 50061
exporter 2025/08/06 06:23:41 exporter.go:226: metrics service starting
exporter 2025/08/06 06:23:41 gpuagent.go:88: Profiler metrics client enabled
exporter 2025/08/06 06:23:41 rocpclient.go:44: NewRocProfilerClient rocpclient
exporter 2025/08/06 06:23:41 svc_handler.go:59: starting listening on socket : /var/lib/amd-metrics-exporter/amdgpu_device_metrics_exporter_grpc.socket
exporter 2025/08/06 06:23:41 svc_handler.go:68: Listening on socket /var/lib/amd-metrics-exporter/amdgpu_device_metrics_exporter_grpc.socket
exporter 2025/08/06 06:23:41 gpuagent.go:72: Agent connecting to 0.0.0.0:50061
exporter 2025/08/06 06:23:41 k8s.go:57: created k8s scheduler client
exporter 2025/08/06 06:23:41 slurm.go:54: Starting Listen on port 6601
exporter 2025/08/06 06:23:41 slurm.go:132: created slurm scheduler client
exporter 2025/08/06 06:23:41 gpuagent_metrics.go:2040: hostame master-1
exporter 2025/08/06 06:23:41 exporter.go:124: config directory for watch : /etc/metrics
exporter 2025/08/06 06:23:41 gpuagent.go:155: GPUAgent monitor started
exporter 2025/08/06 06:23:41 slurm.go:74: too many open files

and this keeps repeating.
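
The "too many open files" failure is EMFILE from setting up the Slurm listener, which points at a file-descriptor or inotify limit on the node rather than at the exporter config itself. A quick diagnostic sketch, assuming shell access to the node (the pod name is from my cluster, and exec only works while the container is up between restarts):

# Per-process open-file limit inside the exporter container:
kubectl exec -n egs-gpu-operator amd-deviceconfig-metrics-exporter-g8vd7 -- sh -c 'ulimit -n'
# System-wide file handle usage on the node (allocated / free / max):
cat /proc/sys/fs/file-nr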

Similarly, the logs of the device-plugin pod are:

k logs --previous amd-deviceconfig-device-plugin-7m7cq
Defaulted container "device-plugin" out of: device-plugin, driver-init (init)
I0806 06:22:17.046347       1 main.go:120] AMD GPU device plugin for Kubernetes
I0806 06:22:17.046383       1 main.go:120] ./k8s-device-plugin version v1.25.2.7-98-g763445e1
I0806 06:22:17.046386       1 main.go:120] hwloc: _VERSION: 2.11.2, _API_VERSION: 0x00020b00, _COMPONENT_ABI: 7, Runtime: 0x00020b00
I0806 06:22:17.046392       1 manager.go:42] Starting device plugin manager
I0806 06:22:17.046404       1 manager.go:46] Registering for system signal notifications
I0806 06:22:17.046443       1 main.go:131] Heart beating every 30 seconds
I0806 06:22:17.046537       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x8b418a]

goroutine 1 [running]:
github.com/fsnotify/fsnotify.(*Watcher).Close(0x0)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:305 +0x2a
panic({0x931240?, 0xdf4140?})
	/usr/local/go/src/runtime/panic.go:785 +0x132
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:296
github.com/fsnotify/fsnotify.(*Watcher).AddWith(0x0, {0x9dda56, 0x20}, {0x0, 0x0, 0x4951ca?})
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:372 +0x3a
github.com/fsnotify/fsnotify.(*Watcher).Add(...)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/backend_inotify.go:362
github.com/kubevirt/device-plugin-manager/pkg/dpm.(*Manager).Run(0xc000034ea0)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/kubevirt/device-plugin-manager/pkg/dpm/manager.go:55 +0x209
main.main()
	/go/src/github.com/ROCm/k8s-device-plugin/cmd/k8s-device-plugin/main.go:153 +0x712
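
The receiver in the (*Watcher).AddWith(0x0, ...) frame and the deferred (*Watcher).Close(0x0) is a nil pointer, which is consistent with fsnotify.NewWatcher() failing and its error going unchecked in dpm.(*Manager).Run. NewWatcher() creates an inotify instance, and inotify_init fails with EMFILE ("too many open files") when fs.inotify.max_user_instances is exhausted, which would tie this panic to the same limit the exporter is hitting. A diagnostic sketch, assuming root shell access on the node (the sysctl value at the end is illustrative, not a recommendation):

# Current inotify limits on the node:
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
# Count inotify instances held per PID; exhausting the per-user limit
# makes inotify_init fail with EMFILE ("too many open files"):
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -rn | head
# Possible mitigation, value illustrative:
sudo sysctl fs.inotify.max_user_instances=512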

This could be a new issue: about 5 days ago I created an issue (#266) with the exact same setup, but I had to bring the cluster down over the weekend, and since then I have been unable to return to the same cluster state.

Operating System

Ubuntu 24.04

CPU

INTEL(R) XEON(R) PLATINUM 8568Y+

GPU

AMD Instinct MI300X VF

ROCm Version

1.3.0

ROCm Component

No response

Steps to Reproduce

On a k8s cluster with an AMD GPU node carrying the feature label feature.node.kubernetes.io/amd-vgpu=true, install ROCm/gpu-operator according to the docs:

helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace egs-gpu-operator \
  --create-namespace \
  --version=v1.3.0
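
(Sanity check, not part of the documented steps: confirm the operator controller came up before applying the DeviceConfig.)

helm list -n egs-gpu-operator
kubectl get pods -n egs-gpu-operator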

Create a ConfigMap for the metrics exporter from the example given in the docs (https://instinct.docs.amd.com/projects/gpu-operator/en/latest/metrics/exporter.html#customize-metrics-fields-labels):

kubectl create configmap amd-device-metrics-config --from-file=examples/metricsExporter/config.json
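
Note that, as written, this creates the ConfigMap in the current context's default namespace. Since the DeviceConfig references it from egs-gpu-operator, it is worth verifying where it landed (a sanity check on my part, not a confirmed cause):

kubectl get configmap amd-device-metrics-config -n egs-gpu-operator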

Create the DeviceConfig:

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-deviceconfig
  namespace: egs-gpu-operator
spec:
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    enableNodeLabeller: true
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
  driver:
    enable: false
  metricsExporter:
    enable: true
    image: docker.io/rocm/device-metrics-exporter:v1.3.0
    nodePort: 32500
    port: 5000
    serviceType: NodePort
    config:
      name: amd-device-metrics-config
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"

Apply the DeviceConfig:

k apply -f amd-deviceconfig.yaml -n egs-gpu-operator

Notice that in the operator namespace (in my case egs-gpu-operator) the device-plugin and metrics-exporter pods are stuck in CrashLoopBackOff:

k get pods
NAME                                                              READY   STATUS             RESTARTS        AGE
amd-deviceconfig-device-plugin-7m7cq                              0/1     CrashLoopBackOff   8 (2m52s ago)   19m
amd-deviceconfig-metrics-exporter-g8vd7                           0/1     CrashLoopBackOff   8 (113s ago)    19m
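
The restart reason and last exit state can be pulled from the pod status (standard kubectl; pod names are from my cluster):

kubectl describe pod amd-deviceconfig-device-plugin-7m7cq -n egs-gpu-operator | grep -A5 'Last State'
kubectl get pod amd-deviceconfig-metrics-exporter-g8vd7 -n egs-gpu-operator \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'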

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

This is a new issue; I did not observe it earlier in the same setup. It could be linked to changes in the latest images of the metrics exporter and device plugin.
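
Since the DeviceConfig pulls rocm/k8s-device-plugin:latest, a regression in a newly pushed image would show up exactly this way. A possible mitigation while this is investigated is to pin the images to fixed tags (the tags below are placeholders, not verified-good versions):

devicePlugin:
  devicePluginImage: rocm/k8s-device-plugin:<known-good-tag>            # placeholder
  nodeLabellerImage: rocm/k8s-device-plugin:<known-good-labeller-tag>   # placeholder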

Metadata

Labels

RCA done (Root Cause Analysis done), bug (Something isn't working)
