e2e/cluster.go
 	if errors.As(err, &azErr) && azErr.StatusCode == 404 {
-		return createNewMaintenanceConfiguration(ctx, cluster)
+		_, err = createNewMaintenanceConfiguration(ctx, cluster)
+		return err
In the 404 path, ensureMaintenanceConfiguration returns the raw error from createNewMaintenanceConfiguration without adding cluster context, while the non-404 path wraps errors with the cluster name. Wrap the 404/create error similarly so failures are attributable to the cluster being prepared.
Suggested change:
-	return err
+	if err != nil {
+		return fmt.Errorf("failed to create maintenance configuration 'default' for cluster %q: %w", *cluster.Name, err)
+	}
+	return nil
func (g *Group) recordError(err error) {
	g.mu.Lock()
	g.errs = append(g.errs, err)
	g.mu.Unlock()
	g.cancel()
}
Group.recordError records every non-nil error, including context.Canceled/context.DeadlineExceeded from tasks that are only exiting because another task failed and canceled the group. In failure cases this can flood DAGError with cancellation noise and obscure the root cause. Consider suppressing context cancellation errors once the group has already observed a “real” failure (or only recording the first error).
// launch runs fn in a new goroutine after all deps complete.
// If any dep failed or ctx is cancelled, onSkip is called instead of fn.
func (g *Group) launch(deps []Dep, fn func(), onSkip func()) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()

		for _, d := range deps {
			d.wait()
		}

		for _, d := range deps {
			if d.failed() {
				onSkip()
				return
			}
		}

		if g.ctx.Err() != nil {
			onSkip()
			return
		}

		fn()
	}()
}
The executor blocks on Dep.wait() with no cycle detection. If a cycle is accidentally introduced (e.g., via wiring mistakes), Wait() will hang indefinitely. Consider adding a lightweight cycle check (even just in debug/tests), or documenting this sharp edge prominently so future uses don’t create non-obvious deadlocks.
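A lightweight cycle check of the kind suggested could look like the following standalone sketch; `hasCycle` and the adjacency-list representation are illustrative, not part of the PR's dag API:

```go
package main

import "fmt"

// hasCycle reports whether the dependency graph contains a cycle, using
// three-color DFS (0 = unvisited, 1 = on the current path, 2 = done).
// Input maps each task name to the names of its dependencies.
func hasCycle(deps map[string][]string) bool {
	color := map[string]int{}
	var visit func(n string) bool
	visit = func(n string) bool {
		switch color[n] {
		case 1:
			return true // back edge to a node on the current path: cycle
		case 2:
			return false
		}
		color[n] = 1
		for _, d := range deps[n] {
			if visit(d) {
				return true
			}
		}
		color[n] = 2
		return false
	}
	for n := range deps {
		if visit(n) {
			return true
		}
	}
	return false
}

func main() {
	ok := map[string][]string{"kube": {"cluster"}, "cluster": {"rg"}, "rg": nil}
	bad := map[string][]string{"a": {"b"}, "b": {"a"}}
	fmt.Println(hasCycle(ok), hasCycle(bad)) // false true
}
```

Run only in tests or a debug build before `Wait()`, this turns a silent deadlock into an immediate, diagnosable failure.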
)

-func getClusterKubeClient(ctx context.Context, resourceGroupName, clusterName string) (*Kubeclient, error) {
+func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {
getClusterKubeClient now dereferences cluster.Location and cluster.Name without validating the input. Since the signature changed from (resourceGroupName, clusterName string) to (*ManagedCluster), it would be safer to add nil checks for cluster, cluster.Location, and cluster.Name and return a descriptive error instead of panicking.
Suggested change:
 	func getClusterKubeClient(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (*Kubeclient, error) {
+		if cluster == nil {
+			return nil, fmt.Errorf("managed cluster is nil")
+		}
+		if cluster.Location == nil || *cluster.Location == "" {
+			return nil, fmt.Errorf("managed cluster location is nil or empty")
+		}
+		if cluster.Name == nil || *cluster.Name == "" {
+			return nil, fmt.Errorf("managed cluster name is nil or empty")
+		}
}

-func getClusterSubnetID(ctx context.Context, mcResourceGroupName string) (string, error) {
+func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
getClusterSubnetID now relies on cluster.Properties.NodeResourceGroup and dereferences it directly. With the new *ManagedCluster signature, add guards for cluster == nil, cluster.Properties == nil, and cluster.Properties.NodeResourceGroup == nil to avoid panics and return a clearer error when cluster metadata is incomplete.
Suggested change:
 	func getClusterSubnetID(ctx context.Context, cluster *armcontainerservice.ManagedCluster) (string, error) {
+		if cluster == nil {
+			return "", fmt.Errorf("managed cluster is nil")
+		}
+		if cluster.Properties == nil {
+			return "", fmt.Errorf("managed cluster properties are nil")
+		}
+		if cluster.Properties.NodeResourceGroup == nil || *cluster.Properties.NodeResourceGroup == "" {
+			return "", fmt.Errorf("managed cluster NodeResourceGroup is nil or empty")
+		}
-	if isNetworkIsolated {
-		if err := addNetworkIsolatedSettings(ctx, cluster, *cluster.Location); err != nil {
-			return nil, fmt.Errorf("add network isolated settings: %w", err)
-		}
-	}
-	if !isNetworkIsolated { // network isolated cluster blocks all egress via NSG
-		if err := addFirewallRules(ctx, cluster, *cluster.Location); err != nil {
-			return nil, fmt.Errorf("add firewall rules: %w", err)
-		}
-	}
+	if !isNetworkIsolated {
+		dag.Run(g, func(ctx context.Context) error { return addFirewallRules(ctx, cluster) })
+	}
+	if isNetworkIsolated {
+		dag.Run(g, func(ctx context.Context) error { return addNetworkIsolatedSettings(ctx, cluster) })
+	}
+	needACR := isNetworkIsolated || attachPrivateAcr
+	acrNonAnon := dag.Run2(g, kube, identity, addACR(cluster, needACR, true))
+	acrAnon := dag.Run2(g, kube, identity, addACR(cluster, needACR, false))
+	dag.Run1(g, kube, ensureDebugDaemonsets(cluster, isNetworkIsolated), acrNonAnon, acrAnon)
+	extract := dag.Go1(g, kube, extractClusterParams(cluster))
prepareCluster no longer enforces the previous ordering where network changes (firewall rules / network-isolated subnet+NSG updates) completed before EnsureDebugDaemonsets runs. Because the daemonset creation triggers pod scheduling/image pulls, running it concurrently with route table/NSG updates can introduce e2e flakiness. Consider capturing the firewall/NSG task(s) as Effect values and adding them as explicit dependencies to the Run1(... ensureDebugDaemonsets ...) task (and possibly any other k8s API tasks that assume steady node egress).
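The suggested ordering fix amounts to giving the daemonset task an explicit dependency on the network tasks. A minimal standalone sketch, using a closed channel as a stand-in for the dag package's Effect barrier (`runPipeline` and the task names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// runPipeline wires the daemonset task to wait on the network task and
// returns the observed execution order.
func runPipeline() []string {
	var (
		mu    sync.Mutex
		order []string
		wg    sync.WaitGroup
	)
	step := func(name string) {
		mu.Lock()
		order = append(order, name)
		mu.Unlock()
	}

	networkReady := make(chan struct{}) // stands in for the firewall/NSG Effect

	wg.Add(2)
	go func() { // firewall / network-isolated settings task
		defer wg.Done()
		step("addFirewallRules")
		close(networkReady) // signal steady node egress
	}()
	go func() { // daemonset creation must not race route table/NSG updates
		defer wg.Done()
		<-networkReady
		step("ensureDebugDaemonsets")
	}()
	wg.Wait()
	return order
}

func main() {
	fmt.Println(runPipeline()) // [addFirewallRules ensureDebugDaemonsets]
}
```

The same shape applies to any other k8s API task that assumes nodes already have working egress: make it wait on the network barrier rather than merely on cluster creation.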
Implements Execute() which discovers the DAG, validates for cycles, and runs tasks concurrently respecting dependency order. Supports CancelDependents (skip downstream on failure) and CancelAll (cancel context) error strategies, plus MaxConcurrency semaphore limiting.
Adds integration tests mirroring the complete spec example (CreateRG → CreateVNet → CreateSubnet → CreateCluster → RunTests → Teardown), plus tests for transitive dependency access, mid-pipeline failure with CancelDependents, shared tasks across independent subgraphs, and empty graph. Removes unused dead code that triggered go vet copylock warning.
- Replace 9-param notifyDependents with runState struct methods
- Eliminate redundant `failed` map — derive from results directly
- Collapse 3 near-identical early-exit blocks into abort() helper
- Launch dependents outside the mutex (collect-then-launch)
- Remove spurious mu.Lock after wg.Wait (all goroutines done)
- Sort DAGError.Error() output for deterministic error messages
- Add ErrorStrategy.String() for consistency with TaskStatus
- Remove dead test code (unused shared var, dead twoDepTask type)
- Extract buildSpecDAG() helper to reduce copy-paste in tests
- Reduce pre-canceled context test timeout from 2s to 100ms
Add e2e/dag package — a lightweight, type-safe DAG executor using Go generics. Two verbs: Go (returns value) and Run (side-effect), with numbered variants (Go1-Go3, Run1-Run3) for typed dependency injection. Replace the sequential prepareCluster implementation with a DAG version that runs independent tasks (bastion, subnet, kube, identity, firewall, garbage collection, etc.) concurrently after cluster creation completes. Also fix pre-existing fmt.Sprintf %%w usage in config/config.go. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the clusterSetup struct and its 12 one-liner methods with inline closures. Each closure is 1-2 lines binding the cluster local to the real function call. Everything reads top-to-bottom in one place. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move cluster creation back into the DAG so all tasks use typed dependency injection. Where function signatures match exactly (getOrCreateBastion, getClusterKubeletIdentity, collectGarbageVMSS), pass them directly without closures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract named helpers for conditional tasks (configureFirewall, configureNetworkIsolation, setupACR, ensureDebugDaemonsets) and update function signatures (getClusterSubnetID, getClusterKubeClient, extractClusterParameters) to accept *ManagedCluster directly so they can be passed to the DAG without closures. prepareCluster now reads as a pure declarative DAG with no inline anonymous functions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- addFirewallRules/addNetworkIsolatedSettings: derive location from cluster instead of taking it as a param. Use conditional DAG registration instead of runtime checks.
- addPrivateAzureContainerRegistry: derive resourceGroupName from cluster.Location internally.
- ensureMaintenanceConfiguration: replaces getOrCreate wrapper, returns error only (value was never used).
- getClusterSubnetID/getClusterKubeClient: take *ManagedCluster instead of extracted strings.
- GoN/RunN: accept optional extra ...Dep barrier deps so typed dep tasks can also wait on untyped barriers (e.g. ACR effects).

Remaining helpers (newClusterTask, addACRTask, addDebugDaemonsets) exist because they genuinely need parameter binding that can't be absorbed into the underlying functions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull cluster creation out of the DAG. Use generic bind/bindRun helpers to pass functions directly for tasks that only need cluster. Remaining factory helpers (addACR, ensureDebugDaemonsets, extractClusterParams) bind cluster to functions that also receive DAG-provided values (kube, identity) — these can't be eliminated without putting cluster back in the DAG. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix package doc to reference correct API names (Go/Run not Spawn/Do)
- Fix section comments in tests to match current API names
- Wait() now surfaces ctx.Err() when parent context is cancelled
- TestParentContextCancelled asserts error is context.Canceled
- ensureMaintenanceConfiguration 404 path wraps error with cluster context
- Firewall/NSG tasks are now dependencies of ensureDebugDaemonsets
Add comprehensive test coverage including:

- Go1-Go3/Run1-Run3 success and skip-on-dep-failure paths
- Diamond topology, transitive skip, cancellation noise
- Parent context cancellation and deadline exceeded
- DAGError formatting (single and multiple errors)
- Effect as dependency, empty group, Result.Get/MustGet
- Cycle behavior documentation (typed API compile-time safety, untyped API deadlock, self-dependency deadlock)

Clean up test comments to keep tests concise.
- Add panic recovery in launch() so panics in task functions are captured as errors instead of crashing the process.
- Unify onSkip/onPanic into a single onFail(error) callback.
- Update MustGet docs to reference actual API names.
- Document cycle-deadlock limitation in package comment.
- Add tests for panic in Go, Run, and downstream skip after panic.
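The panic-recovery change can be sketched in isolation like this (`runRecovered` is an illustrative name; the PR applies the same pattern inside launch()):

```go
package main

import "fmt"

// runRecovered invokes fn, converting any panic into a returned error so a
// single misbehaving task cannot take down the whole process.
func runRecovered(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("task panicked: %v", r)
		}
	}()
	return fn()
}

func main() {
	err := runRecovered(func() error { panic("nil pointer in task") })
	fmt.Println(err) // task panicked: nil pointer in task

	fmt.Println(runRecovered(func() error { return nil })) // <nil>
}
```

Because the recovered panic becomes an ordinary error, it flows through the same recordError/DAGError path as any other failure, and downstream tasks see the failing dependency and skip normally.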
	identity := dag.Go(g, func(ctx context.Context) (*armcontainerservice.UserAssignedIdentity, error) {
		return getClusterKubeletIdentity(ctx, cluster)
getClusterKubeletIdentity now takes a context.Context but doesn't use it. Since this is just to fit the dag.Go signature, consider keeping the original signature and wrapping it in an inline closure in prepareCluster (avoids unnecessary API churn and keeps the function's contract minimal).
Suggested change:
-	identity := dag.Go(g, func(ctx context.Context) (*armcontainerservice.UserAssignedIdentity, error) {
-		return getClusterKubeletIdentity(ctx, cluster)
+	identity := dag.Go(g, func(context.Context) (*armcontainerservice.UserAssignedIdentity, error) {
+		return getClusterKubeletIdentity(cluster)
timmy-wright left a comment:
LGTM. Worthwhile looking at the copilot comments tho. I'm surprised there's not already a DAG library we can use.
addFirewallRules and collectGarbageVMSS both trigger Azure subnet/route table mutations. Running them concurrently (introduced in #8149) can cause the AKS-managed pod route table to be detached from the subnet:

1. addFirewallRules associates abe2e-fw-rt with aks-subnet (displacing the AKS route table)
2. collectGarbageVMSS deletes stale VMSS, triggering cloud-controller route reconciliation that races with the subnet association
3. getOrCreateBastion creates AzureBastionSubnet (VNet-level mutation)

The race leaves aks-agentpool-*-routetable with 0 subnet associations, so CCM-programmed pod routes are unreachable. Nodes stay stuck on NetworkUnavailable: 'Waiting for cloud routes' forever.

Fix: add DAG edges so that:

- addFirewallRules / addNetworkIsolatedSettings wait for bastion (serialize VNet subnet mutations)
- collectGarbageVMSS waits for network setup (prevent VMSS deletion from racing with route table association)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What this PR does / why we need it:
Adds a new e2e/dag package that provides a lightweight, type-safe DAG execution library using closures and generics, then refactors prepareCluster to use it for concurrent cluster setup.