19 changes: 13 additions & 6 deletions cmd/nvidia-ctk-installer/container/container.go
@@ -49,12 +49,15 @@ type Options struct {
// mount.
ExecutablePath string
// EnabledCDI indicates whether CDI should be enabled.
EnableCDI bool
RuntimeName string
RuntimeDir string
SetAsDefault bool
RestartMode string
HostRootMount string
EnableCDI bool
EnableNRI bool
RuntimeName string
RuntimeDir string
SetAsDefault bool
RestartMode string
HostRootMount string
NRIPluginIndex string
NRISocket string

ConfigSources []string
}
@@ -128,6 +131,10 @@ func (o Options) UpdateConfig(cfg engine.Interface) error {
cfg.EnableCDI()
}

if o.EnableNRI {
cfg.EnableNRI()
}

return nil
}

146 changes: 146 additions & 0 deletions cmd/nvidia-ctk-installer/container/runtime/nri/plugin.go
@@ -0,0 +1,146 @@
/**
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
**/

package nri

import (
"context"
"fmt"
"os"
"strings"

"github.com/containerd/nri/pkg/api"
nriplugin "github.com/containerd/nri/pkg/stub"

"github.com/NVIDIA/nvidia-container-toolkit/internal/logger"
)

// Compile-time interface checks
var (
_ nriplugin.Plugin = (*Plugin)(nil)
)

const (
// nriCDIDeviceKey is the prefix of the key used for CDI device annotations.
nriCDIDeviceKey = "nvidia.cdi.k8s.io"
// defaultNRISocket represents the default path of the NRI socket
defaultNRISocket = "/var/run/nri/nri.sock"
)

type Plugin struct {
logger logger.Interface

stub nriplugin.Stub
}

// NewPlugin creates a new NRI plugin for injecting CDI devices
func NewPlugin(logger logger.Interface) *Plugin {
return &Plugin{
logger: logger,
}
}

// CreateContainer handles container creation requests.
func (p *Plugin) CreateContainer(_ context.Context, pod *api.PodSandbox, ctr *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
adjust := &api.ContainerAdjustment{}

if err := p.injectCDIDevices(pod, ctr, adjust); err != nil {
return nil, nil, err
}

return adjust, nil, nil
}

func (p *Plugin) injectCDIDevices(pod *api.PodSandbox, ctr *api.Container, a *api.ContainerAdjustment) error {
devices, err := parseCDIDevices(ctr.Name, pod.Annotations)
if err != nil {
return err
}

if len(devices) == 0 {
p.logger.Debugf("%s: no CDI devices annotated...", containerName(pod, ctr))
return nil
}

for _, name := range devices {
a.AddCDIDevice(
&api.CDIDevice{
@elezar (Member), Dec 5, 2025:
As far as I am aware, this introduces restrictions on compatible containerd / cri-o versions.

Contributor Author:
Given that we've moved to native CDI, the minimum supported containerd is now v1.7. Will that be affected by this change?

Member:
We would need to check when CDI Devices were added to the NRI APIs. The versions of containerd / cri-o that then started responding to these fields would be the minimum versions.

Member:
Here was the NRI commit: containerd/nri@c9b4798
Tagged as v0.7.0.

We would also need to check which version of containerd includes the related adjustments. (see https://github.com/containerd/containerd/blob/6dc89d8b392dbea8651ab0b141765f196836205c/internal/cri/nri/nri_api_linux.go#L321-L326).

(I wasn't able to find similar code in v1.7 or v2.0).

@klihub, Dec 9, 2025:
@elezar CDI injection from an NRI plugin does not work in 1.7.x. I have a fix for it which I never filed a PR from. But if it's important enough for you guys, I can file it and then see if we could get it merged.

@elezar Here is the filed PR against the 1.7 maintenance branch: containerd/containerd#12650

@elezar And here is one filed against the 2.0 maintenance branch: containerd/containerd#12651

Contributor Author:
Thanks @klihub !

@elezar @tariq1890 Not sure how realistic it is to get containerd/containerd#12651 in. 2.0 is EOL'd. 12651 would require a new tagged release, and if that is an absolute no-go (it's usually not that inflexible), then it's a tough call. Anyway, you might want to chime in there and comment.

The v1.7 version of the fix has now been flagged as being considered for (late) inclusion in the upcoming v1.30 release. You might want to chime in there as well.

Name: name,
},
)
p.logger.Infof("%s: injected CDI device %q...", containerName(pod, ctr), name)
}

return nil
}

func parseCDIDevices(ctr string, annotations map[string]string) ([]string, error) {
annotation := getAnnotation(annotations, nriCDIDeviceKey, ctr)
if len(annotation) == 0 {
return nil, nil
}

cdiDevices := strings.Split(annotation, ",")
return cdiDevices, nil
}

func getAnnotation(annotations map[string]string, key, ctr string) string {
nriPluginAnnotationKey := fmt.Sprintf("%s/container.%s", key, ctr)
Member:
I know that the original example allowed all containers in a pod to automatically have access to devices using a different key. Here, I understand that we're only allowing a SINGLE form, correct?

Contributor Author:
Yes, correct: we are only allowing a single form. Please let me know if you have any concerns.

if value, ok := annotations[nriPluginAnnotationKey]; ok {
return value
}

return ""
}
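
To make the single per-container annotation form discussed in the review thread concrete, here is a minimal standalone sketch of the lookup. It mirrors the key format built by `getAnnotation` and the comma split in `parseCDIDevices`; the helper name `devicesForContainer`, the container names, and the device names are illustrative assumptions, not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// cdiDeviceKey mirrors the nriCDIDeviceKey constant from the diff.
const cdiDeviceKey = "nvidia.cdi.k8s.io"

// devicesForContainer is a hypothetical helper that combines the key
// construction and the comma split shown in the diff. It returns nil when
// the pod carries no annotation for the given container.
func devicesForContainer(annotations map[string]string, ctr string) []string {
	key := fmt.Sprintf("%s/container.%s", cdiDeviceKey, ctr)
	value, ok := annotations[key]
	if !ok || value == "" {
		return nil
	}
	return strings.Split(value, ",")
}

func main() {
	// Illustrative pod annotations; the device names are assumptions.
	annotations := map[string]string{
		"nvidia.cdi.k8s.io/container.cuda": "nvidia.com/gpu=0,nvidia.com/gpu=1",
	}
	fmt.Println(devicesForContainer(annotations, "cuda"))         // [nvidia.com/gpu=0 nvidia.com/gpu=1]
	fmt.Println(len(devicesForContainer(annotations, "sidecar"))) // 0
}
```

Because the key embeds the container name, a container without its own annotation receives no devices, which matches the single-form behaviour confirmed in the thread.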

// Construct a container name for log messages.
func containerName(pod *api.PodSandbox, container *api.Container) string {
if pod != nil {
return pod.Name + "/" + container.Name
}
return container.Name
}

// Start starts the NRI plugin
func (p *Plugin) Start(ctx context.Context, nriSocketPath, nriPluginIdx string) error {
if len(nriSocketPath) == 0 {
nriSocketPath = defaultNRISocket
}
_, err := os.Stat(nriSocketPath)
if err != nil {
return fmt.Errorf("failed to find valid nri socket in %s: %w", nriSocketPath, err)
}

pluginOpts := []nriplugin.Option{
nriplugin.WithPluginIdx(nriPluginIdx),
nriplugin.WithSocketPath(nriSocketPath),
}
if p.stub, err = nriplugin.New(p, pluginOpts...); err != nil {
return fmt.Errorf("failed to initialise plugin at %s: %w", nriSocketPath, err)
}
err = p.stub.Start(ctx)
if err != nil {
return fmt.Errorf("plugin exited with error: %w", err)
}
return nil
}

// Stop stops the NRI plugin
func (p *Plugin) Stop() {
if p != nil && p.stub != nil {
p.stub.Stop()
}
}
23 changes: 23 additions & 0 deletions cmd/nvidia-ctk-installer/container/runtime/runtime.go
@@ -34,6 +34,8 @@ const (
// defaultRuntimeName specifies the NVIDIA runtime to be use as the default runtime if setting the default runtime is enabled
defaultRuntimeName = "nvidia"
defaultHostRootMount = "/host"
defaultNRIPluginIdx = "10"
defaultNRISocket = "/var/run/nri/nri.sock"

runtimeSpecificDefault = "RUNTIME_SPECIFIC_DEFAULT"
)
@@ -94,6 +96,27 @@ func Flags(opts *Options) []cli.Flag {
Destination: &opts.EnableCDI,
Sources: cli.EnvVars("RUNTIME_ENABLE_CDI"),
},
&cli.BoolFlag{
Name: "enable-nri-in-runtime",
Usage: "Enable NRI in the configured runtime",
Destination: &opts.EnableNRI,
Value: true,
Sources: cli.EnvVars("RUNTIME_ENABLE_NRI"),
},
&cli.StringFlag{
Name: "nri-plugin-index",
Usage: "Specify the plugin index to register to NRI",
Value: defaultNRIPluginIdx,
Destination: &opts.NRIPluginIndex,
Sources: cli.EnvVars("RUNTIME_NRI_PLUGIN_INDEX"),
},
&cli.StringFlag{
Name: "nri-socket",
Usage: "Specify the path to the NRI socket file to register the NRI plugin server",
Value: defaultNRISocket,
Destination: &opts.NRISocket,
Sources: cli.EnvVars("RUNTIME_NRI_SOCKET"),
},
&cli.StringFlag{
Name: "host-root",
Usage: "Specify the path to the host root to be used when restarting the runtime using systemd",
46 changes: 42 additions & 4 deletions cmd/nvidia-ctk-installer/main.go
@@ -7,11 +7,13 @@ import (
"os/signal"
"path/filepath"
"syscall"
"time"

"github.com/urfave/cli/v3"
"golang.org/x/sys/unix"

"github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-ctk-installer/container/runtime"
"github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-ctk-installer/container/runtime/nri"
"github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-ctk-installer/toolkit"
"github.com/NVIDIA/nvidia-container-toolkit/internal/info"
"github.com/NVIDIA/nvidia-container-toolkit/internal/logger"
@@ -26,6 +28,9 @@ const (
toolkitSubDir = "toolkit"

defaultRuntime = "docker"

retryBackoff = 2 * time.Second
maxRetryAttempts = 5
)

var availableRuntimes = map[string]struct{}{"docker": {}, "crio": {}, "containerd": {}}
@@ -73,7 +78,7 @@ type app struct {
toolkit *toolkit.Installer
}

// NewApp creates the CLI app fro the specified options.
// NewApp creates the CLI app from the specified options.
func NewApp(logger logger.Interface) *cli.Command {
a := app{
logger: logger,
@@ -93,8 +98,8 @@ func (a app) build() *cli.Command {
Before: func(ctx context.Context, cmd *cli.Command) (context.Context, error) {
return ctx, a.Before(cmd, &options)
},
Action: func(_ context.Context, cmd *cli.Command) error {
return a.Run(cmd, &options)
Action: func(ctx context.Context, cmd *cli.Command) error {
return a.Run(ctx, cmd, &options)
},
Flags: []cli.Flag{
&cli.BoolFlag{
@@ -194,7 +199,7 @@ func (a *app) validateFlags(c *cli.Command, o *options) error {
// Run installs the NVIDIA Container Toolkit and updates the requested runtime.
// If the application is run as a daemon, the application waits and unconfigures
// the runtime on termination.
func (a *app) Run(c *cli.Command, o *options) error {
func (a *app) Run(ctx context.Context, c *cli.Command, o *options) error {
err := a.initialize(o.pidFile)
if err != nil {
return fmt.Errorf("unable to initialize: %v", err)
@@ -222,6 +227,14 @@ func (a *app) Run(c *cli.Command, o *options) error {
}

if !o.noDaemon {
if o.runtimeOptions.EnableNRI {
nriPlugin, err := a.startNRIPluginServer(ctx, o.runtimeOptions)
if err != nil {
a.logger.Errorf("unable to start NRI plugin server: %v", err)
}
defer nriPlugin.Stop()
}

err = a.waitForSignal()
if err != nil {
return fmt.Errorf("unable to wait for signal: %v", err)
@@ -287,6 +300,31 @@ func (a *app) waitForSignal() error {
return nil
}

func (a *app) startNRIPluginServer(ctx context.Context, opts runtime.Options) (*nri.Plugin, error) {
a.logger.Infof("Starting the NRI Plugin server....")

plugin := nri.NewPlugin(a.logger)
retriable := func() error {
return plugin.Start(ctx, opts.NRISocket, opts.NRIPluginIndex)
}
var err error
for i := 0; i < maxRetryAttempts; i++ {
err = retriable()
if err == nil {
break
}
if i == maxRetryAttempts-1 {
break
}
time.Sleep(retryBackoff)
}
if err != nil {
a.logger.Errorf("Max retries reached %d/%d, aborting", maxRetryAttempts, maxRetryAttempts)
return nil, err
}
return plugin, nil
}

func (a *app) shutdown(pidFile string) {
a.logger.Infof("Shutting Down")

1 change: 1 addition & 0 deletions cmd/nvidia-ctk-installer/main_test.go
@@ -444,6 +444,7 @@ version = 2
"--pid-file=" + filepath.Join(testRoot, "toolkit.pid"),
"--restart-mode=none",
"--toolkit-source-root=" + filepath.Join(artifactRoot, "deb"),
"--enable-nri-in-runtime=false",
}

err := app.Run(context.Background(), append(testArgs, tc.args...))
12 changes: 10 additions & 2 deletions go.mod
@@ -5,6 +5,7 @@ go 1.25.0
require (
github.com/NVIDIA/go-nvlib v0.9.1-0.20251202135446-d0f42ba016dd
github.com/NVIDIA/go-nvml v0.13.0-1
github.com/containerd/nri v0.11.0
github.com/google/uuid v1.6.0
github.com/moby/sys/mountinfo v0.7.2
github.com/moby/sys/reexec v0.1.0
@@ -25,18 +26,25 @@ require (

require (
cyphar.com/go-pathrs v0.2.1 // indirect
github.com/containerd/log v0.1.0 // indirect
github.com/containerd/ttrpc v1.2.7 // indirect
github.com/cyphar/filepath-securejoin v0.6.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/fsnotify/fsnotify v1.7.0 // indirect
github.com/golang/protobuf v1.5.3 // indirect
github.com/hashicorp/errwrap v1.1.0 // indirect
github.com/kr/pretty v0.3.1 // indirect
github.com/knqyf263/go-plugin v0.9.0 // indirect
github.com/kr/text v0.2.0 // indirect
github.com/moby/sys/capability v0.4.0 // indirect
github.com/opencontainers/cgroups v0.0.4 // indirect
github.com/opencontainers/runtime-tools v0.9.1-0.20251114084447-edf4cb3d2116 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/rogpeppe/go-internal v1.11.0 // indirect
github.com/tetratelabs/wazero v1.10.1 // indirect
github.com/xeipuuv/gojsonpointer v0.0.0-20190905194746-02993c407bfb // indirect
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20230731190214-cbb8c96f2d6d // indirect
google.golang.org/grpc v1.57.1 // indirect
google.golang.org/protobuf v1.36.5 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
sigs.k8s.io/yaml v1.4.0 // indirect
)