test: add E2E test for fully managed GPU + network isolation #8253

Open
ganeshkumarashok wants to merge 1 commit into main from
aganeshkumar/fully-managed-gpu-network-isolated-e2e

@ganeshkumarashok
Contributor

Summary

  • Adds Test_Ubuntu2404_FullyManagedGPU_NetworkIsolated E2E test to verify the fully managed GPU experience works on network-isolated Ubuntu 24.04 nodes
  • Uses Standard_NC6s_v3 (CUDA): the CUDA driver container is pre-cached on the VHD, so it works without MCR access. GRID driver containers (used by converged GPU sizes such as Standard_NV6ads_A10_v5) are not VHD-cached and fail under network isolation.
  • Validates: network isolation marker, nvidia-device-plugin, GPU resource advertisement, GPU workload scheduling, DCGM packages/services/scraping, NPD health checks
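
The VM size choice described in the second bullet can be pictured as a small lookup. This is only an illustrative sketch, not code from this PR: the `driverKind` type, the two maps, and `usableWhenIsolated` are hypothetical names, and the table covers only the two sizes mentioned here.

```go
package main

import "fmt"

// driverKind is a hypothetical label for the NVIDIA driver flavor a VM size needs.
type driverKind string

const (
	driverCUDA driverKind = "cuda"
	driverGRID driverKind = "grid"
)

// driverForSize is an illustrative lookup limited to the two sizes named in this PR.
var driverForSize = map[string]driverKind{
	"Standard_NC6s_v3":       driverCUDA,
	"Standard_NV6ads_A10_v5": driverGRID,
}

// vhdCached records which driver container images the PR describes as
// pre-cached on the Ubuntu VHD (assumption: no other driver kinds exist).
var vhdCached = map[driverKind]bool{
	driverCUDA: true,
	driverGRID: false,
}

// usableWhenIsolated reports whether a size can obtain its driver container
// without pulling from MCR, i.e. whether the image is already on the VHD.
func usableWhenIsolated(vmSize string) (bool, error) {
	kind, ok := driverForSize[vmSize]
	if !ok {
		return false, fmt.Errorf("unknown VM size %q", vmSize)
	}
	return vhdCached[kind], nil
}

func main() {
	for _, size := range []string{"Standard_NC6s_v3", "Standard_NV6ads_A10_v5"} {
		ok, err := usableWhenIsolated(size)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s usable in network isolation: %t\n", size, ok)
	}
}
```

Keeping the driver kind separate from the cache status means that a future change in which images are VHD-cached would touch only one map in this sketch.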

Test plan

  • go build ./... and go vet ./... pass
  • Test_Ubuntu2404_FullyManagedGPU_NetworkIsolated passes locally (994s, cluster abe2e-azure-networkisolated-v1-9eb2f in westus3)
  • GPU E2E pipeline passes in CI

Add Test_Ubuntu2404_FullyManagedGPU_NetworkIsolated to verify the fully
managed GPU experience works on network-isolated nodes (NSG-blocked
egress, private ACR only).

Uses Standard_NC6s_v3 (CUDA) instead of Standard_NV6ads_A10_v5 (GRID)
because the CUDA driver container image is pre-cached on the Ubuntu VHD,
while the GRID driver container requires pulling from MCR at runtime,
which fails in network isolation.

Validates: network isolation marker, nvidia-device-plugin install and
service, GPU resource advertisement, GPU workload scheduling, DCGM
packages and services, DCGM exporter scraping, and NPD health checks.
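
As a rough illustration of the "GPU resource advertisement" check listed above, the sketch below parses a GPU count from a node's allocatable resources. It is not the validator used by this test: `advertisedGPUs` is a hypothetical helper, the plain `map[string]string` stands in for the Kubernetes node status, and the sample values are made up. `nvidia.com/gpu` is the extended resource name registered by the NVIDIA device plugin.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuResourceName is the extended resource the NVIDIA device plugin registers.
const gpuResourceName = "nvidia.com/gpu"

// advertisedGPUs returns the GPU count found in a node's allocatable
// resources. It fails if the resource is missing, unparsable, or below one.
func advertisedGPUs(allocatable map[string]string) (int, error) {
	raw, ok := allocatable[gpuResourceName]
	if !ok {
		return 0, fmt.Errorf("resource %s not advertised", gpuResourceName)
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("parse %s=%q: %w", gpuResourceName, raw, err)
	}
	if n < 1 {
		return 0, fmt.Errorf("resource %s advertised with count %d", gpuResourceName, n)
	}
	return n, nil
}

func main() {
	// Illustrative allocatable map; the values are not taken from a real node.
	node := map[string]string{"cpu": "6", gpuResourceName: "1"}
	n, err := advertisedGPUs(node)
	if err != nil {
		fmt.Println("GPU check failed:", err)
		return
	}
	fmt.Println("GPUs advertised:", n)
}
```

Treating a zero count as a failure matters here, since a device plugin that is installed but cannot reach the driver could plausibly leave the resource present with no usable devices.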

GRID license NPD checks are intentionally omitted as they only apply to
converged/GRID GPU sizes.

Copilot AI left a comment

Pull request overview

Adds an E2E coverage scenario to validate that the “fully managed GPU experience” works correctly for network-isolated Ubuntu 24.04 nodes (including device plugin, GPU scheduling, DCGM exporter, and NPD checks), using a CUDA-based VM size that can operate without public registry access.

Changes:

  • Adds Test_Ubuntu2404_FullyManagedGPU_NetworkIsolated E2E scenario using Standard_NC6s_v3 on ClusterAzureNetworkIsolated.
  • Configures private egress + ACR credential provider settings needed for network-isolated image pulls.
  • Validates network isolation marker, GPU packages/services/resources, DCGM exporter health/scraping, and NPD health checks.
