test: add E2E test for fully managed GPU + network isolation #8253
Open
ganeshkumarashok wants to merge 1 commit into main from
Add Test_Ubuntu2404_FullyManagedGPU_NetworkIsolated to verify the fully managed GPU experience works on network-isolated nodes (NSG-blocked egress, private ACR only).

Uses Standard_NC6s_v3 (CUDA) instead of Standard_NV6ads_A10_v5 (GRID) because the CUDA driver container image is pre-cached on the Ubuntu VHD, while the GRID driver container requires pulling from MCR at runtime, which fails in network isolation.

Validates: network isolation marker, nvidia-device-plugin install and service, GPU resource advertisement, GPU workload scheduling, DCGM packages and services, DCGM exporter scraping, and NPD health checks. GRID license NPD checks are intentionally omitted as they only apply to converged/GRID GPU sizes.
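The VM-size reasoning above can be sketched as a small lookup. This is an illustration only, not code from the PR: the `DriverKind` and `gpuSize` types, the `vhdCachedDrivers` table, and the `usableWhenIsolated` helper are hypothetical names, and the cached/not-cached values simply restate what the description says about the Ubuntu VHD.

```go
package main

import "fmt"

// DriverKind distinguishes the two NVIDIA driver flavours named in the PR description.
type DriverKind string

const (
	DriverCUDA DriverKind = "cuda"
	DriverGRID DriverKind = "grid"
)

// gpuSize is a hypothetical record pairing a VM size with the driver it needs.
type gpuSize struct {
	Name   string
	Driver DriverKind
}

// vhdCachedDrivers restates the PR description: the CUDA driver container image
// is pre-cached on the Ubuntu VHD, the GRID one must be pulled from MCR at runtime.
var vhdCachedDrivers = map[DriverKind]bool{
	DriverCUDA: true,
	DriverGRID: false,
}

// usableWhenIsolated reports whether a size can bring up its driver without
// public registry access, i.e. whether its driver image is already on the VHD.
func usableWhenIsolated(s gpuSize) bool {
	return vhdCachedDrivers[s.Driver]
}

func main() {
	sizes := []gpuSize{
		{Name: "Standard_NC6s_v3", Driver: DriverCUDA},
		{Name: "Standard_NV6ads_A10_v5", Driver: DriverGRID},
	}
	for _, s := range sizes {
		fmt.Printf("%s usable in network isolation: %t\n", s.Name, usableWhenIsolated(s))
	}
}
```

Under these assumptions only the CUDA size passes, which matches the choice of Standard_NC6s_v3 for the network-isolated scenario.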
Pull request overview
Adds an E2E coverage scenario to validate that the “fully managed GPU experience” works correctly for network-isolated Ubuntu 24.04 nodes (including device plugin, GPU scheduling, DCGM exporter, and NPD checks), using a CUDA-based VM size that can operate without public registry access.
Changes:
- Adds `Test_Ubuntu2404_FullyManagedGPU_NetworkIsolated` E2E scenario using `Standard_NC6s_v3` on `ClusterAzureNetworkIsolated`.
- Configures private egress + ACR credential provider settings needed for network-isolated image pulls.
- Validates network isolation marker, GPU packages/services/resources, DCGM exporter health/scraping, and NPD health checks.
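As a rough illustration of the "GPU resource advertisement" check, the sketch below inspects a node's allocatable resources for the `nvidia.com/gpu` key that the NVIDIA device plugin registers. It is not the validator used in this PR: the `advertisedGPUs` helper and the plain `map[string]string` input are assumptions made to keep the example self-contained, whereas a real test would read the node object from the Kubernetes API.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuResourceName is the extended resource exposed by the NVIDIA device plugin.
const gpuResourceName = "nvidia.com/gpu"

// advertisedGPUs is a hypothetical helper: given a node's allocatable resources
// as name -> quantity strings, it returns the GPU count or an error explaining
// why the node does not look GPU-ready.
func advertisedGPUs(allocatable map[string]string) (int, error) {
	raw, ok := allocatable[gpuResourceName]
	if !ok {
		return 0, fmt.Errorf("node does not advertise %s", gpuResourceName)
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("parsing %s quantity %q: %w", gpuResourceName, raw, err)
	}
	if n < 1 {
		return 0, fmt.Errorf("node advertises %d %s, want at least 1", n, gpuResourceName)
	}
	return n, nil
}

func main() {
	// Example allocatable map; the values are made up for illustration.
	node := map[string]string{"cpu": "6", "nvidia.com/gpu": "1"}
	n, err := advertisedGPUs(node)
	if err != nil {
		fmt.Println("GPU check failed:", err)
		return
	}
	fmt.Println("GPUs advertised:", n)
}
```

Returning an error instead of a bare boolean keeps the failure reason (missing key, unparsable quantity, zero count) visible in test logs.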
Summary
- Add `Test_Ubuntu2404_FullyManagedGPU_NetworkIsolated` E2E test to verify the fully managed GPU experience works on network-isolated Ubuntu 24.04 nodes.
- Use `Standard_NC6s_v3` (CUDA): the CUDA driver container is pre-cached on the VHD, so it works without MCR access. GRID driver containers (used by converged GPUs like NV6ads_A10_v5) are NOT VHD-cached and fail in network isolation.

Test plan
- `go build ./...` and `go vet ./...` pass
- `Test_Ubuntu2404_FullyManagedGPU_NetworkIsolated` passes locally (994s, cluster `abe2e-azure-networkisolated-v1-9eb2f` in westus3)