[Issue]: Helm-installed operator searches for the NodeModulesConfig CRD with kmm disabled #256

@cirilloblu

Problem Description

Hi! The operator starts but then fails with this error:

E0709 16:15:22.567458       1 cmdutils.go:43] "problem running manager" err="failed to wait for DriverAndPluginReconciler caches to sync kind source: *v1beta1.NodeModulesConfig: timed out waiting for cache to be synced for Kind *v1beta1.NodeModulesConfig" logger="amd-gpu.setup"

The Helm chart is installed with these values:

            node-feature-discovery:
              enabled: true
            kmm:
              enabled: false
            installdefaultNFDRule: true
            controllerManager:
              manager:
                image:
                  repository: registry.${var.secrets.general.localdomainname}/images/rocm-gpu-operator
                imagePullSecrets: "regcred"

Note that kmm is disabled.

The operator is v1.3.0 and the Helm chart is also v1.3.0.

Even though kmm is disabled, the operator still looks for CRDs that belong to the kmm subchart, such as NodeModulesConfig:
https://github.com/ROCm/gpu-operator/blob/main/helm-charts-k8s/charts/kmm/crds/nodemodulesconfig-crd.yaml
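
To make the expected behavior concrete, here is a minimal sketch in Go of gating the watched kinds on whether kmm is enabled. It is not the operator's actual code: the `source` type, the `RequiresKMM` field, and the `sourcesToWatch` function are hypothetical names made up for illustration, and the list of kinds is simply copied from the "Starting EventSource" lines in the log of this report.

```go
package main

import "fmt"

// source describes one kind the controller would watch.
// This type is hypothetical and exists only for this illustration.
type source struct {
	Kind        string
	RequiresKMM bool // true when the kind's CRD is only present if kmm is installed
}

// allSources lists the kinds from the "Starting EventSource" log lines
// in this report. The RequiresKMM flags are an assumption based on the
// "no matches for kind" errors for the kmm.sigs.x-k8s.io group.
var allSources = []source{
	{Kind: "DeviceConfig"},
	{Kind: "Pod"},
	{Kind: "Service"},
	{Kind: "Node"},
	{Kind: "Secret"},
	{Kind: "NodeModulesConfig", RequiresKMM: true},
	{Kind: "DaemonSet"},
	{Kind: "Module", RequiresKMM: true},
}

// sourcesToWatch returns the kinds to watch, skipping the kmm-backed
// kinds when kmm is disabled so that no cache waits on a missing CRD.
func sourcesToWatch(kmmEnabled bool) []string {
	var kinds []string
	for _, s := range allSources {
		if s.RequiresKMM && !kmmEnabled {
			continue
		}
		kinds = append(kinds, s.Kind)
	}
	return kinds
}

func main() {
	fmt.Println("kmm disabled:", sourcesToWatch(false))
	fmt.Println("kmm enabled: ", sourcesToWatch(true))
}
```

With kmm disabled, the sketch leaves out Module and NodeModulesConfig, which are the two kinds the manager times out on in the logs.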

Extended logs:

I0709 16:20:23.975313       1 main.go:97] "Creating manager" logger="amd-gpu.setup" version="v1.3.0" git commit="b6ec613e" build tag="latest"
I0709 16:20:23.975392       1 main.go:99] "Parsing configuration file" logger="amd-gpu.setup" path="controller_manager_config.yaml"
I0709 16:20:23.982549       1 utils.go:198] "IsOpenShift: false" logger="amd-gpu.setup"
I0709 16:20:23.982821       1 main.go:145] "starting manager" logger="amd-gpu.setup"
I0709 16:20:23.982898       1 server.go:208] "Starting metrics server" logger="amd-gpu.controller-runtime.metrics"
I0709 16:20:23.982948       1 server.go:83] "starting server" logger="amd-gpu" name="health probe" addr="[::]:8081"
I0709 16:20:23.982960       1 server.go:247] "Serving metrics server" logger="amd-gpu.controller-runtime.metrics" bindAddress="127.0.0.1:8080" secure=false
I0709 16:20:23.983058       1 leaderelection.go:257] attempting to acquire leader lease infrastructure/gpu.amd.com...
I0709 16:20:39.195249       1 leaderelection.go:271] successfully acquired lease infrastructure/gpu.amd.com
I0709 16:20:39.195631       1 controller.go:198] "Starting EventSource" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1alpha1.DeviceConfig"
I0709 16:20:39.195656       1 controller.go:198] "Starting EventSource" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1.Pod"
I0709 16:20:39.195681       1 controller.go:198] "Starting EventSource" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1.Service"
I0709 16:20:39.195731       1 controller.go:198] "Starting EventSource" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1.Node"
I0709 16:20:39.195727       1 controller.go:198] "Starting EventSource" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1.Secret"
I0709 16:20:39.195791       1 controller.go:198] "Starting EventSource" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1beta1.NodeModulesConfig"
I0709 16:20:39.195828       1 controller.go:198] "Starting EventSource" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1.DaemonSet"
I0709 16:20:39.195825       1 controller.go:198] "Starting EventSource" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1beta1.Module"
E0709 16:20:39.212154       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:20:39.215704       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:20:49.197898       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:20:49.199870       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:20:59.198931       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:20:59.200973       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:21:09.198204       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:21:09.200110       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:21:19.198266       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:21:19.200857       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:21:29.201430       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:21:29.203700       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:21:39.199018       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:21:39.201032       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:21:49.198098       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:21:49.200081       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:21:59.198257       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:21:59.200101       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:22:09.198809       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:22:09.202906       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:22:19.198572       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:22:19.200475       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:22:29.198003       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
E0709 16:22:29.199931       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:22:39.196885       1 controller.go:210] "Could not wait for Cache to sync" err="failed to wait for DriverAndPluginReconciler caches to sync kind source: *v1beta1.Module: timed out waiting for cache to be synced for Kind *v1beta1.Module" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1beta1.Module"
E0709 16:22:39.196986       1 controller.go:210] "Could not wait for Cache to sync" err="failed to wait for DriverAndPluginReconciler caches to sync kind source: *v1beta1.NodeModulesConfig: timed out waiting for cache to be synced for Kind *v1beta1.NodeModulesConfig" logger="amd-gpu" controller="DriverAndPluginReconciler" controllerGroup="amd.com" controllerKind="DeviceConfig" source="kind source: *v1beta1.NodeModulesConfig"
I0709 16:22:39.197067       1 internal.go:538] "Stopping and waiting for non leader election runnables" logger="amd-gpu"
I0709 16:22:39.197097       1 internal.go:542] "Stopping and waiting for leader election runnables" logger="amd-gpu"
I0709 16:22:39.197119       1 internal.go:550] "Stopping and waiting for caches" logger="amd-gpu"
E0709 16:22:39.199161       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"
I0709 16:22:39.199214       1 internal.go:554] "Stopping and waiting for webhooks" logger="amd-gpu"
I0709 16:22:39.199231       1 internal.go:557] "Stopping and waiting for HTTP servers" logger="amd-gpu"
I0709 16:22:39.199256       1 server.go:254] "Shutting down metrics server with timeout of 1 minute" logger="amd-gpu.controller-runtime.metrics"
I0709 16:22:39.199263       1 server.go:68] "shutting down server" logger="amd-gpu" name="health probe" addr="[::]:8081"
I0709 16:22:39.199344       1 internal.go:561] "Wait completed, proceeding to shutdown the manager" logger="amd-gpu"
E0709 16:22:39.199392       1 cmdutils.go:43] "problem running manager" err="failed to wait for DriverAndPluginReconciler caches to sync kind source: *v1beta1.Module: timed out waiting for cache to be synced for Kind *v1beta1.Module" logger="amd-gpu.setup"
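
The repeated errors above boil down to two missing kinds. As a rough aid for anyone triaging similar output, the following Go sketch extracts the distinct missing kinds from such log text. The function name `missingKinds` and the `sampleLog` constant are made up for this example, and the regular expression assumes the escaped `no matches for kind \"...\" in version \"...\"` wording seen in this log.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// missingKindRe matches the escaped error text in the log, for example:
//   no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"
var missingKindRe = regexp.MustCompile(
	`no matches for kind \\"([^"\\]+)\\" in version \\"([^"\\]+)\\"`)

// sampleLog holds two lines copied from the log in this report.
const sampleLog = `E0709 16:20:39.212154       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"Module\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="Module.kmm.sigs.x-k8s.io"
E0709 16:20:39.215704       1 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"NodeModulesConfig\" in version \"kmm.sigs.x-k8s.io/v1beta1\"" logger="amd-gpu.controller-runtime.source.EventHandler" kind="NodeModulesConfig.kmm.sigs.x-k8s.io"`

// missingKinds returns the sorted, de-duplicated "Kind (group/version)"
// entries that the log text reports as having no matching CRD.
func missingKinds(logText string) []string {
	seen := map[string]bool{}
	for _, m := range missingKindRe.FindAllStringSubmatch(logText, -1) {
		seen[m[1]+" ("+m[2]+")"] = true
	}
	out := make([]string, 0, len(seen))
	for k := range seen {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	for _, k := range missingKinds(sampleLog) {
		fmt.Println(k)
	}
}
```

Run against the sample lines, this prints `Module (kmm.sigs.x-k8s.io/v1beta1)` and `NodeModulesConfig (kmm.sigs.x-k8s.io/v1beta1)`, both of which belong to the kmm API group.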

Operating System

Gentoo

CPU

Intel(R) Core(TM) i7-6700HQ

GPU

not on master node

ROCm Version

not on master node

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Metadata

Labels

RCA done (Root Cause Analysis done), bug (Something isn't working), enhancement (New feature or request)
