Skip to content

fix(e2e): add DAG dependencies to prevent route table race condition#8268

Merged
djsly merged 3 commits intomainfrom
djsly/fix-dag-route-table-race
Apr 9, 2026
Merged

fix(e2e): add DAG dependencies to prevent route table race condition#8268
djsly merged 3 commits intomainfrom
djsly/fix-dag-route-table-race

Conversation

@djsly
Copy link
Copy Markdown
Collaborator

@djsly djsly commented Apr 9, 2026

Problem

PR #8149 refactored prepareCluster to use concurrent DAG execution. This introduced a race condition between three operations that mutate the VNet/subnet:

  1. getOrCreateBastion — creates AzureBastionSubnet
  2. addFirewallRules — creates AzureFirewallSubnet + associates abe2e-fw-rt with aks-subnet
  3. collectGarbageVMSS — deletes old VMSS → triggers AKS cloud-controller route reconciliation

Running them concurrently causes the AKS pod route table (aks-agentpool-*-routetable) to be detached from the subnet. The firewall route table (abe2e-fw-rt) wins the association, but has no pod routes. Result:

  • NetworkUnavailable: Waiting for cloud routes — never cleared
  • cni plugin not initialized — CNI config never written
  • 65/154 E2E tests fail (100% of kubenet/overlay, 0% of AzureCNI)

Evidence from Azure

Route Table Routes Subnet Association
aks-agentpool-13663576-routetable 63 pod routes 0 subnets
abe2e-fw-rt 2 (vnet-local + firewall) 1 (aks-subnet)

Fix

Add DAG dependency edges to restore the sequential ordering for VNet-mutating operations:

  • addFirewallRules / addNetworkIsolatedSettings → depends on bastion (serialize subnet writes)
  • collectGarbageVMSS → depends on network setup (prevent VMSS deletion from racing with route table association)

All other tasks remain fully concurrent (kube client, identity, maintenance, subnet read, extract params).

DAG Before vs After

BEFORE (#8149):  bastion ──┐
                            ├──► g.Wait()
                 firewall ─┤
                            │
                 gcVMSS ───┘   (all concurrent, race on VNet)

AFTER (this PR): bastion ──► firewall ──► gcVMSS
                                          (serialized VNet mutations)
                 kube ─────┐
                 identity ─┤
                 subnet ───┤  (still concurrent — no VNet writes)
                 maint ────┘

Fixes the E2E infrastructure failures observed on PR #8230 and the scriptless E2E gate.

addFirewallRules and collectGarbageVMSS both trigger Azure subnet/route
table mutations. Running them concurrently (introduced in #8149) can
cause the AKS-managed pod route table to be detached from the subnet:

1. addFirewallRules associates abe2e-fw-rt with aks-subnet (displacing
   the AKS route table)
2. collectGarbageVMSS deletes stale VMSS, triggering cloud-controller
   route reconciliation that races with the subnet association
3. getOrCreateBastion creates AzureBastionSubnet (VNet-level mutation)

The race leaves aks-agentpool-*-routetable with 0 subnet associations,
so CCM-programmed pod routes are unreachable. Nodes stay stuck on
NetworkUnavailable: 'Waiting for cloud routes' forever.

Fix: add DAG edges so that:
- addFirewallRules / addNetworkIsolatedSettings wait for bastion
  (serialize VNet subnet mutations)
- collectGarbageVMSS waits for network setup
  (prevent VMSS deletion from racing with route table association)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses an E2E infrastructure regression introduced by the concurrent DAG-based execution in prepareCluster by adding dependency edges that serialize Azure VNet/subnet-mutating operations, preventing route table association races that can strand AKS pod route tables.

Changes:

  • Make addFirewallRules and addNetworkIsolatedSettings depend on the bastion task to avoid concurrent subnet/VNet mutations.
  • Make collectGarbageVMSS depend on completion of network setup to prevent VMSS deletion (and ensuing route reconciliation) from racing with subnet route table association.

Replace broad 'e2e/' path trigger with 'e2e/scenario_win_test.go' only.
Previously, any change to shared E2E infrastructure (validators.go,
cluster.go, dag/, etc.) would trigger a full Windows VHD build (~1hr)
even though standalone e2e-windows.yaml already covers E2E changes.

This matches the Linux VHD build pipeline pattern which does not
include e2e/ in its trigger paths at all.

Also removes the now-unnecessary 'e2e/scenario_test.go' exclude entry
since e2e/ is no longer broadly included.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@timmy-wright timmy-wright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is sensible - see comment.

Copilot AI review requested due to automatic review settings April 9, 2026 23:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +93 to +97
// networkSetup associates a route table with aks-subnet (firewall or
// network-isolated NSG). It must run after bastion (both mutate the
// VNet) and before collectGarbageVMSS (VMSS deletion triggers AKS
// cloud-controller route reconciliation that can race with the subnet
// association and leave the AKS pod route table detached).
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new comment says networkSetup associates a route table with aks-subnet (firewall or network-isolated NSG), but addNetworkIsolatedSettings actually attaches an NSG (no route table). Consider rewording to avoid implying a route table association in the network-isolated path (e.g., describe this as subnet/VNet mutations: firewall route-table association OR network-isolated NSG update).

Suggested change
// networkSetup associates a route table with aks-subnet (firewall or
// network-isolated NSG). It must run after bastion (both mutate the
// VNet) and before collectGarbageVMSS (VMSS deletion triggers AKS
// cloud-controller route reconciliation that can race with the subnet
// association and leave the AKS pod route table detached).
// networkSetup performs subnet/VNet mutations for the cluster network:
// firewall setup associates the AKS subnet route table, while
// network-isolated setup updates the subnet/NSG settings. It must run
// after bastion (both mutate the VNet) and before collectGarbageVMSS
// (VMSS deletion triggers AKS cloud-controller route reconciliation
// that can race with the firewall subnet association and leave the AKS
// pod route table detached).

Copilot uses AI. Check for mistakes.
@djsly djsly merged commit 85acc06 into main Apr 9, 2026
21 of 28 checks passed
@djsly djsly deleted the djsly/fix-dag-route-table-race branch April 9, 2026 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants