fix(gpu): add systemd ordering to prevent MIG device detection race #8247

Open
surajssd wants to merge 1 commit into main from suraj/fix-mig-timing-issue

surajssd commented Apr 7, 2026

What this PR does / why we need it:

Fixes a race condition on MIG-enabled GPU nodes where nvidia-device-plugin.service starts before mig-partition.service has finished creating MIG instances, causing the device plugin to report "No devices found. Waiting indefinitely."

On MIG nodes, the A100 GPU requires a VM reboot to activate MIG mode. After the reboot, both mig-partition.service and nvidia-device-plugin.service start concurrently via systemd's multi-user.target since there is no ordering dependency between them. The nvidia-device-plugin scans for MIG devices almost instantly (~2s), but mig-partition needs ~3-4s to partition the GPU. By the time partitioning completes, the device plugin has already entered an unrecoverable "waiting indefinitely" state.

This PR adds three directives to mig-partition.service:

  • Before=nvidia-device-plugin.service — tells systemd to complete mig-partition before starting nvidia-device-plugin. This is purely an ordering directive; if nvidia-device-plugin.service doesn't exist or isn't being started, it is silently ignored.
  • Type=oneshot — semantically correct for a service that runs a script and exits. Critically, with the previous default Type=simple, systemd considers the service "started" as soon as the process forks, so Before= alone wouldn't actually wait for partitioning to complete. With Type=oneshot, systemd waits for the script to exit before marking the service as started.
  • RemainAfterExit=yes — keeps the service in "active (exited)" state after completion, required for Before=/After= ordering to work correctly with oneshot services.
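
For illustration, the three directives above might sit in the unit roughly as follows. This is a sketch under stated assumptions, not the actual diff: the `Description=` text and the `ExecStart=` script path are hypothetical placeholders, and the `WantedBy=` line is inferred from the note above that the unit starts via multi-user.target.

```ini
# Sketch of mig-partition.service with the ordering directives applied.
# Description= and ExecStart= are hypothetical placeholders, not taken from the PR diff.
[Unit]
Description=Partition the GPU into MIG instances
# Ordering only: when both units are started, this one must finish starting first.
Before=nvidia-device-plugin.service

[Service]
# oneshot: the unit counts as started only after ExecStart= has exited.
Type=oneshot
# Stay in "active (exited)" once the script has finished.
RemainAfterExit=yes
ExecStart=/usr/local/bin/mig-partition.sh

[Install]
WantedBy=multi-user.target
```

Because `Before=` expresses ordering without pulling any unit in, nvidia-device-plugin.service itself would need no change under this layout.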

- Add `Before=nvidia-device-plugin.service` to ensure `mig-partition`
  completes before `nvidia-device-plugin` starts on reboot
- Set `Type=oneshot` so systemd waits for the partitioning script to
  finish before considering the service "started"
- Add `RemainAfterExit=yes` to keep the service in "active (exited)"
  state, required for `Before=`/`After=` ordering with oneshot services

On MIG-enabled GPU nodes, both services start concurrently on the
second boot (required for MIG mode activation). Without ordering,
`nvidia-device-plugin` scans for MIG devices before `mig-partition`
has finished creating them and enters "No devices found. Waiting
indefinitely."

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the mig-partition.service systemd unit to enforce startup ordering so MIG partitioning completes before the NVIDIA device plugin scans for devices on MIG-enabled nodes.

Changes:

  • Add systemd ordering (Before=) to start MIG partitioning ahead of nvidia-device-plugin.service.
  • Switch the service to Type=oneshot so systemd waits for the script to exit before considering the unit started.
  • Keep the unit active after exit (RemainAfterExit=yes) to retain “active (exited)” state.
