Skip to content

fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230

Open
djsly wants to merge 8 commits intomainfrom
djsly/37384340
Open

fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230
djsly wants to merge 8 commits intomainfrom
djsly/37384340

Conversation

@djsly
Copy link
Copy Markdown
Collaborator

@djsly djsly commented Apr 2, 2026

Linux CSE runs can exceed the VM provisioner's ~16-minute client window when downloads retry indefinitely or a long-running operation starts near the deadline. This PR adds a global timeout guard and per-download wall-clock budget to bound execution time.

cse_helpers.sh

  • _retrycmd_internal: Add pre-attempt check_cse_timeout guard (gated on CSE_STARTTIME_SECONDS being set — no-op and no noise during VHD build). Cap per-attempt effectiveTimeout to min(timeoutVal, remainingCseTime) so a single attempt cannot outlive the provisioning window.
  • _retry_file_curl_internal: Add optional maxBudget (4th arg, default 0). When > 0, caps effectiveTimeout to min(timeout, maxBudget - elapsed) per attempt, bounding total download wall-clock to the budget regardless of retry count. Loop restructured so curl fires on every iteration (previously the last attempt was skipped).
  • retrycmd_get_tarball: Backward-compatible signature detection — if 3rd arg is non-numeric (old callers: <retries> <wait_sleep> <tarball> <url>), defaults timeout to 60s. New callers pass explicit <timeout> as 3rd arg.
  • retrycmd_curl_file: Exposes new optional max_budget_s 6th arg (default 0).

Call sites (cse_install.sh, cse_install_ubuntu.sh, cse_install_mariner.sh)

Pass 300s per-operation budget to all provisioning-time download sites (credential-provider, oras, TLS bootstrap, CNI ×2, crictl, k8s binaries, nvidia GPG key, containerd, runc, nvidia repo). Without budget: 120 retries × 60s = ~2h10m theoretical max per download; with budget: capped at 5 min total.

cse_config.sh

Reduce nvidia-smi per-try timeout from 300s → 30s at 3 call sites (nvidia-smi is a status check, not a long-running init).

Backward compatibility

  • max_budget_s defaults to 0 (no cap); VHD build callers unaffected.
  • check_cse_timeout guards are skipped when CSE_STARTTIME_SECONDS is unset, preserving existing VHD build behavior exactly.

…and guard

Remove hardcoded 60s timeout from retrycmd_get_tarball; caller now
passes timeout as 3rd positional arg (matches retrycmd_curl_file style).

Add optional max_budget_s (6th arg, default 0) to retrycmd_get_tarball
and retrycmd_curl_file. When > 0, _retry_file_curl_internal tracks
wall-clock time and exits early (return 2) once the budget is consumed,
bounding any single download to 5 min at provisioning time regardless
of retry count.

Add pre-command check_cse_timeout guard at the TOP of _retrycmd_internal
loop. Previously the guard only fired after a failed command, meaning a
new 300-600s operation could START at minute 12:59 and run to ~18 min.
The guard now prevents starting any new attempt when >780s have elapsed
since CSE_STARTTIME_SECONDS.

Pass 300s budget to all 11 provisioning-time download call sites:
  cse_install.sh (credential-provider, oras, secure-TLS, CNI x2, crictl, k8s)
  ubuntu/cse_install_ubuntu.sh (nvidia GPG key, containerd, runc)
  mariner/cse_install_mariner.sh (nvidia repo file)

Reduce nvidia-smi per-try timeout 300s to 30s in cse_config.sh (3 sites).
nvidia-smi is a CLI status check, not a long-running driver operation;
the 300s value was a copy-paste from the GPU driver install command.

Update spec tests: fix parameter tables, add budget-exceeded test and
pre-command CSE timeout test in cse_retry_helpers_spec.sh; update
expected call signature in cse_install_spec.sh.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent Linux CSE runs from exceeding the VM provisioner’s ~16-minute client window by adding (1) a pre-attempt global timeout guard and (2) a per-download wall-clock budget, then wiring those controls into provisioning-time download call sites and updating ShellSpec coverage accordingly.

Changes:

  • Add a pre-command check_cse_timeout guard in _retrycmd_internal and a per-operation maxBudget in _retry_file_curl_internal (propagated via retrycmd_curl_file / retrycmd_get_tarball).
  • Update provisioning-time download call sites to pass a 300s per-operation budget and adjust retrycmd_get_tarball call signature to include an explicit timeout.
  • Reduce nvidia-smi per-try timeout from 300s to 30s and update/add ShellSpec expectations for the new retry helper behaviors.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh Updates parameter tables and adds tests for pre-attempt global timeout and per-operation budget exit behavior.
spec/parts/linux/cloud-init/artifacts/cse_install_spec.sh Updates expected retrycmd_get_tarball invocation signature to include timeout + budget.
parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh Adds 300s per-operation budget to provisioning-time curl downloads (NVIDIA key, containerd, runc).
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Adds 300s per-operation budget to NVIDIA repo file download.
parts/linux/cloud-init/artifacts/cse_install.sh Adds explicit timeout + 300s per-operation budgets to provisioning-time tarball/curl downloads (credential provider, oras, CNI, crictl, k8s).
parts/linux/cloud-init/artifacts/cse_helpers.sh Implements pre-attempt global timeout guard and per-operation budget; changes retrycmd_get_tarball signature.
parts/linux/cloud-init/artifacts/cse_config.sh Lowers nvidia-smi retry timeout from 300s to 30s at 3 call sites.

Copilot finished work on behalf of djsly April 2, 2026 20:33
The check_cse_timeout warning message was written to stdout, which
pollutes pipes. In the 2004fips VHD build, the pipeline:

  retrycmd_if_failure 5 10 1200 yes | ua enable fips-updates

sends stdout of the left side as stdin to ua enable fips-updates.
The warning 'CSE_STARTTIME_SECONDS is not set' was read by ua as the
answer to the 'Are you sure? (y/N)' prompt instead of 'y' from yes,
causing 'Operation cancelled by user' and exit 184.

Fix: redirect the warning to stderr (>&2), consistent with the error
message that was already on stderr. Update the ShellSpec test to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 7, 2026 21:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

parts/linux/cloud-init/artifacts/cse_helpers.sh:347

  • The per-operation maxBudget check happens before starting timeout $timeout curl ..., but the curl attempt itself can still run longer than the remaining budget (and then exceed the budget by up to timeout + sleep). If maxBudget is intended to be a hard upper bound (“total wall-clock seconds across all retries”), cap the per-attempt timeout to min(timeout, maxBudget - elapsed) and/or re-check the budget immediately after each attempt as well.
        # Check per-operation budget if set -- prevents a single download from consuming the entire CSE window
        if [ "${maxBudget}" -gt 0 ]; then
            local opElapsed
            opElapsed=$(( $(date +%s) - opStartTime ))
            if [ "$opElapsed" -ge "$maxBudget" ]; then
                echo "Operation budget of ${maxBudget}s exceeded after ${opElapsed}s, exiting early." >&2
                return 2
            fi
        fi
        # check if global cse timeout is approaching
        if ! check_cse_timeout; then
            echo "CSE timeout approaching, exiting early." >&2
            return 2
        else
            if [ "$i" -gt 1 ]; then
                sleep $waitSleep
            fi
            timeout $timeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
            if [ "$?" -ne 0 ]; then

…args

- _retrycmd_internal: cap per-attempt timeout to remaining CSE budget
  (780s - elapsed) so a long timeout cannot overrun the provisioning
  window even when per-attempt timeouts are large
- _retry_file_curl_internal: restructure loop so curl runs BEFORE the
  check+exhaustion logic, fixing a bug where retries=N only executed
  N-1 curl attempts (and retries=1 executed zero)
- retrycmd_get_tarball spec: remove extra positional arg that shifted
  tarball path into URL position

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 01:57
- Replace [[ $3 =~ ^[0-9]+$ ]] with POSIX-compatible case statement
  in retrycmd_get_tarball to fix SC3010 shellcheck error
- Fix _retry_file_curl_internal: keep pre-check at top of loop for
  efficiency (skip curl if file already valid), add final check after
  last curl attempt so retries=N always executes N curl attempts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

- Replace broken date mock (export -f date doesn't work in POSIX sh)
  with real-time approach: timeout mock sleeps 2s, budget=1s
- Set CSE_STARTTIME_SECONDS in BeforeEach to prevent check_cse_timeout
  warnings in all retry tests
- Explicitly unset CSE_STARTTIME_SECONDS in the check_cse_timeout test
  that verifies the unset behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…internal

Same pattern as _retrycmd_internal: when maxBudget > 0, effectiveTimeout
is capped to min(timeout, maxBudget - elapsed) so a single curl attempt
cannot overrun the documented per-operation budget.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 02:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment on lines +249 to +253
# Check CSE timeout BEFORE starting each attempt. This prevents launching a new long-running
# operation (e.g. a 300-600s GPU install) when we are already near the global provisioning
# timeout, which would push total CSE execution past the 16-minute client window.
if ! check_cse_timeout "$shouldLog"; then
echo "CSE timeout approaching, exiting early." >&2
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _retrycmd_internal, the new pre-attempt check_cse_timeout call will emit the "CSE_STARTTIME_SECONDS ... is not set" warning (now on stderr) whenever CSE_STARTTIME_SECONDS is unset. That’s expected during VHD build and also for any callers that don’t run under cse_start.sh, so this change can spam build/provision logs even when the command succeeds on the first attempt. Consider skipping the pre-attempt check_cse_timeout/early-exit path unless CSE_STARTTIME_SECONDS is set (or otherwise gating the warning) so the guard only applies in real CSE runs where the variable is exported.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot apply changes based on this feedback

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 57fa71a. Both check_cse_timeout calls in _retrycmd_internal (pre-attempt and post-failure) are now gated behind [ -n "${CSE_STARTTIME_SECONDS:-}" ], so the guard is a no-op when the variable is unset — no warnings during VHD builds or non-CSE callers.

…ARTTIME_SECONDS guard

Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/0bcd6f17-76f8-4ef1-ba7a-cf3a6f43d658

Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
Copilot finished work on behalf of djsly April 8, 2026 02:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants