fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard by djsly · Pull Request #8230 · Azure/AgentBaker

djsly · 2026-04-02T20:16:22Z

Linux CSE runs can exceed the VM provisioner's ~16-minute client window when downloads retry indefinitely or a long-running operation starts near the deadline. This PR adds a global timeout guard and per-download wall-clock budget to bound execution time.

`cse_helpers.sh`

_retrycmd_internal: Add pre-attempt check_cse_timeout guard (gated on CSE_STARTTIME_SECONDS being set — no-op and no noise during VHD build). Cap per-attempt effectiveTimeout to min(timeoutVal, remainingCseTime) so a single attempt cannot outlive the provisioning window.
_retry_file_curl_internal: Add optional maxBudget (4th arg, default 0). When > 0, caps effectiveTimeout to min(timeout, maxBudget - elapsed) per attempt, bounding total download wall-clock to the budget regardless of retry count. Loop restructured so curl fires on every iteration (previously the last attempt was skipped).
retrycmd_get_tarball: Backward-compatible signature detection — if 3rd arg is non-numeric (old callers: <retries> <wait_sleep> <tarball> <url>), defaults timeout to 60s. New callers pass explicit <timeout> as 3rd arg.
retrycmd_curl_file: Exposes new optional max_budget_s 6th arg (default 0).

Call sites (`cse_install.sh`, `cse_install_ubuntu.sh`, `cse_install_mariner.sh`)

Pass 300s per-operation budget to all provisioning-time download sites (credential-provider, oras, TLS bootstrap, CNI ×2, crictl, k8s binaries, nvidia GPG key, containerd, runc, nvidia repo). Without budget: 120 retries × 60s = ~2h10m theoretical max per download; with budget: capped at 5 min total.

`cse_config.sh`

Reduce nvidia-smi per-try timeout from 300s → 30s at 3 call sites (nvidia-smi is a status check, not a long-running init).

Backward compatibility

max_budget_s defaults to 0 (no cap); VHD build callers unaffected.
check_cse_timeout guards are skipped when CSE_STARTTIME_SECONDS is unset, preserving existing VHD build behavior exactly.

…and guard Remove hardcoded 60s timeout from retrycmd_get_tarball; caller now passes timeout as 3rd positional arg (matches retrycmd_curl_file style). Add optional max_budget_s (6th arg, default 0) to retrycmd_get_tarball and retrycmd_curl_file. When > 0, _retry_file_curl_internal tracks wall-clock time and exits early (return 2) once the budget is consumed, bounding any single download to 5 min at provisioning time regardless of retry count. Add pre-command check_cse_timeout guard at the TOP of _retrycmd_internal loop. Previously the guard only fired after a failed command, meaning a new 300-600s operation could START at minute 12:59 and run to ~18 min. The guard now prevents starting any new attempt when >780s have elapsed since CSE_STARTTIME_SECONDS. Pass 300s budget to all 11 provisioning-time download call sites: cse_install.sh (credential-provider, oras, secure-TLS, CNI x2, crictl, k8s) ubuntu/cse_install_ubuntu.sh (nvidia GPG key, containerd, runc) mariner/cse_install_mariner.sh (nvidia repo file) Reduce nvidia-smi per-try timeout 300s to 30s in cse_config.sh (3 sites). nvidia-smi is a CLI status check, not a long-running driver operation; the 300s value was a copy-paste from the GPU driver install command. Update spec tests: fix parameter tables, add budget-exceeded test and pre-command CSE timeout test in cse_retry_helpers_spec.sh; update expected call signature in cse_install_spec.sh.

Copilot

Pull request overview

This PR aims to prevent Linux CSE runs from exceeding the VM provisioner’s ~16-minute client window by adding (1) a pre-attempt global timeout guard and (2) a per-download wall-clock budget, then wiring those controls into provisioning-time download call sites and updating ShellSpec coverage accordingly.

Changes:

Add a pre-command check_cse_timeout guard in _retrycmd_internal and a per-operation maxBudget in _retry_file_curl_internal (propagated via retrycmd_curl_file / retrycmd_get_tarball).
Update provisioning-time download call sites to pass a 300s per-operation budget and adjust retrycmd_get_tarball call signature to include an explicit timeout.
Reduce nvidia-smi per-try timeout from 300s to 30s and update/add ShellSpec expectations for the new retry helper behaviors.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh	Updates parameter tables and adds tests for pre-attempt global timeout and per-operation budget exit behavior.
spec/parts/linux/cloud-init/artifacts/cse_install_spec.sh	Updates expected `retrycmd_get_tarball` invocation signature to include timeout + budget.
parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh	Adds 300s per-operation budget to provisioning-time curl downloads (NVIDIA key, containerd, runc).
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh	Adds 300s per-operation budget to NVIDIA repo file download.
parts/linux/cloud-init/artifacts/cse_install.sh	Adds explicit timeout + 300s per-operation budgets to provisioning-time tarball/curl downloads (credential provider, oras, CNI, crictl, k8s).
parts/linux/cloud-init/artifacts/cse_helpers.sh	Implements pre-attempt global timeout guard and per-operation budget; changes `retrycmd_get_tarball` signature.
parts/linux/cloud-init/artifacts/cse_config.sh	Lowers `nvidia-smi` retry timeout from 300s to 30s at 3 call sites.

parts/linux/cloud-init/artifacts/cse_helpers.sh

…allers Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8f356829-1593-46b5-94dc-6fa4a85aa931 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

parts/linux/cloud-init/artifacts/cse_helpers.sh

The check_cse_timeout warning message was written to stdout, which pollutes pipes. In the 2004fips VHD build, the pipeline: retrycmd_if_failure 5 10 1200 yes | ua enable fips-updates sends stdout of the left side as stdin to ua enable fips-updates. The warning 'CSE_STARTTIME_SECONDS is not set' was read by ua as the answer to the 'Are you sure? (y/N)' prompt instead of 'y' from yes, causing 'Operation cancelled by user' and exit 184. Fix: redirect the warning to stderr (>&2), consistent with the error message that was already on stderr. Update the ShellSpec test to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

parts/linux/cloud-init/artifacts/cse_helpers.sh:347

The per-operation maxBudget check happens before starting timeout $timeout curl ..., but the curl attempt itself can still run longer than the remaining budget (and then exceed the budget by up to timeout + sleep). If maxBudget is intended to be a hard upper bound (“total wall-clock seconds across all retries”), cap the per-attempt timeout to min(timeout, maxBudget - elapsed) and/or re-check the budget immediately after each attempt as well.

        # Check per-operation budget if set -- prevents a single download from consuming the entire CSE window
        if [ "${maxBudget}" -gt 0 ]; then
            local opElapsed
            opElapsed=$(( $(date +%s) - opStartTime ))
            if [ "$opElapsed" -ge "$maxBudget" ]; then
                echo "Operation budget of ${maxBudget}s exceeded after ${opElapsed}s, exiting early." >&2
                return 2
            fi
        fi
        # check if global cse timeout is approaching
        if ! check_cse_timeout; then
            echo "CSE timeout approaching, exiting early." >&2
            return 2
        else
            if [ "$i" -gt 1 ]; then
                sleep $waitSleep
            fi
            timeout $timeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
            if [ "$?" -ne 0 ]; then

parts/linux/cloud-init/artifacts/cse_helpers.sh

spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh

…args - _retrycmd_internal: cap per-attempt timeout to remaining CSE budget (780s - elapsed) so a long timeout cannot overrun the provisioning window even when per-attempt timeouts are large - _retry_file_curl_internal: restructure loop so curl runs BEFORE the check+exhaustion logic, fixing a bug where retries=N only executed N-1 curl attempts (and retries=1 executed zero) - retrycmd_get_tarball spec: remove extra positional arg that shifted tarball path into URL position Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Replace [[ $3 =~ ^[0-9]+$ ]] with POSIX-compatible case statement in retrycmd_get_tarball to fix SC3010 shellcheck error - Fix _retry_file_curl_internal: keep pre-check at top of loop for efficiency (skip curl if file already valid), add final check after last curl attempt so retries=N always executes N curl attempts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

parts/linux/cloud-init/artifacts/cse_helpers.sh

- Replace broken date mock (export -f date doesn't work in POSIX sh) with real-time approach: timeout mock sleeps 2s, budget=1s - Set CSE_STARTTIME_SECONDS in BeforeEach to prevent check_cse_timeout warnings in all retry tests - Explicitly unset CSE_STARTTIME_SECONDS in the check_cse_timeout test that verifies the unset behavior Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…internal Same pattern as _retrycmd_internal: when maxBudget > 0, effectiveTimeout is capped to min(timeout, maxBudget - elapsed) so a single curl attempt cannot overrun the documented per-operation budget. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Copilot · 2026-04-08T02:25:38Z

parts/linux/cloud-init/artifacts/cse_helpers.sh

+        # Check CSE timeout BEFORE starting each attempt. This prevents launching a new long-running
+        # operation (e.g. a 300-600s GPU install) when we are already near the global provisioning
+        # timeout, which would push total CSE execution past the 16-minute client window.
+        if ! check_cse_timeout "$shouldLog"; then
+            echo "CSE timeout approaching, exiting early." >&2


In _retrycmd_internal, the new pre-attempt check_cse_timeout call will emit the "CSE_STARTTIME_SECONDS ... is not set" warning (now on stderr) whenever CSE_STARTTIME_SECONDS is unset. That’s expected during VHD build and also for any callers that don’t run under cse_start.sh, so this change can spam build/provision logs even when the command succeeds on the first attempt. Consider skipping the pre-attempt check_cse_timeout/early-exit path unless CSE_STARTTIME_SECONDS is set (or otherwise gating the warning) so the guard only applies in real CSE runs where the variable is exported.

@copilot apply changes based on this feedback

Fixed in 57fa71a. Both check_cse_timeout calls in _retrycmd_internal (pre-attempt and post-failure) are now gated behind [ -n "${CSE_STARTTIME_SECONDS:-}" ], so the guard is a no-op when the variable is unset — no warnings during VHD builds or non-CSE callers.

…ARTTIME_SECONDS guard Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/0bcd6f17-76f8-4ef1-ba7a-cf3a6f43d658 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 2, 2026 20:16

djsly requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, ganeshkumarashok, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners April 2, 2026 20:16

djsly temporarily deployed to test April 2, 2026 20:16 — with GitHub Actions Inactive

Copilot started reviewing on behalf of djsly April 2, 2026 20:17 View session

Copilot AI reviewed Apr 2, 2026

View reviewed changes

parts/linux/cloud-init/artifacts/cse_helpers.sh Show resolved Hide resolved

parts/linux/cloud-init/artifacts/cse_helpers.sh Show resolved Hide resolved

Copilot started work on behalf of djsly April 2, 2026 20:26 View session

fix(cse): add backward compat to retrycmd_get_tarball for old 4-arg c…

970b78a

…allers Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8f356829-1593-46b5-94dc-6fa4a85aa931 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

Copilot finished work on behalf of djsly April 2, 2026 20:33

Devinwong reviewed Apr 2, 2026

View reviewed changes

parts/linux/cloud-init/artifacts/cse_helpers.sh Show resolved Hide resolved

pdamianov-dev approved these changes Apr 6, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings April 7, 2026 21:52

djsly temporarily deployed to test April 7, 2026 21:52 — with GitHub Actions Inactive

Copilot started reviewing on behalf of djsly April 7, 2026 21:53 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated Show resolved Hide resolved

parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated Show resolved Hide resolved

spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh Outdated Show resolved Hide resolved

djsly temporarily deployed to test April 8, 2026 01:38 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings April 8, 2026 01:57

djsly temporarily deployed to test April 8, 2026 01:57 — with GitHub Actions Inactive

djsly force-pushed the djsly/37384340 branch from 6901282 to e9d3178 Compare April 8, 2026 01:57

djsly temporarily deployed to test April 8, 2026 01:57 — with GitHub Actions Inactive

Copilot started reviewing on behalf of djsly April 8, 2026 01:57 View session

Copilot AI reviewed Apr 8, 2026

View reviewed changes

parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated Show resolved Hide resolved

djsly temporarily deployed to test April 8, 2026 02:07 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings April 8, 2026 02:20

djsly temporarily deployed to test April 8, 2026 02:20 — with GitHub Actions Inactive

Copilot started reviewing on behalf of djsly April 8, 2026 02:21 View session

Copilot AI reviewed Apr 8, 2026

View reviewed changes

Copilot started work on behalf of djsly April 8, 2026 02:40 View session

fix: gate check_cse_timeout calls in _retrycmd_internal behind CSE_ST…

57fa71a

…ARTTIME_SECONDS guard Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/0bcd6f17-76f8-4ef1-ba7a-cf3a6f43d658 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

Copilot finished work on behalf of djsly April 8, 2026 02:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230

fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230
djsly wants to merge 8 commits intomainfrom
djsly/37384340

djsly commented Apr 2, 2026 •

edited by Copilot AI

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

djsly Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

djsly commented Apr 2, 2026 • edited by Copilot AI Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

cse_helpers.sh

Call sites (cse_install.sh, cse_install_ubuntu.sh, cse_install_mariner.sh)

cse_config.sh

Backward compatibility

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

djsly Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

djsly commented Apr 2, 2026 •

edited by Copilot AI

Loading

`cse_helpers.sh`

Call sites (`cse_install.sh`, `cse_install_ubuntu.sh`, `cse_install_mariner.sh`)

`cse_config.sh`