fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230
fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230
Conversation
…and guard Remove hardcoded 60s timeout from retrycmd_get_tarball; caller now passes timeout as 3rd positional arg (matches retrycmd_curl_file style). Add optional max_budget_s (6th arg, default 0) to retrycmd_get_tarball and retrycmd_curl_file. When > 0, _retry_file_curl_internal tracks wall-clock time and exits early (return 2) once the budget is consumed, bounding any single download to 5 min at provisioning time regardless of retry count. Add pre-command check_cse_timeout guard at the TOP of _retrycmd_internal loop. Previously the guard only fired after a failed command, meaning a new 300-600s operation could START at minute 12:59 and run to ~18 min. The guard now prevents starting any new attempt when >780s have elapsed since CSE_STARTTIME_SECONDS. Pass 300s budget to all 11 provisioning-time download call sites: cse_install.sh (credential-provider, oras, secure-TLS, CNI x2, crictl, k8s) ubuntu/cse_install_ubuntu.sh (nvidia GPG key, containerd, runc) mariner/cse_install_mariner.sh (nvidia repo file) Reduce nvidia-smi per-try timeout 300s to 30s in cse_config.sh (3 sites). nvidia-smi is a CLI status check, not a long-running driver operation; the 300s value was a copy-paste from the GPU driver install command. Update spec tests: fix parameter tables, add budget-exceeded test and pre-command CSE timeout test in cse_retry_helpers_spec.sh; update expected call signature in cse_install_spec.sh.
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent Linux CSE runs from exceeding the VM provisioner’s ~16-minute client window by adding (1) a pre-attempt global timeout guard and (2) a per-download wall-clock budget, then wiring those controls into provisioning-time download call sites and updating ShellSpec coverage accordingly.
Changes:
- Add a pre-command
check_cse_timeoutguard in_retrycmd_internaland a per-operationmaxBudgetin_retry_file_curl_internal(propagated viaretrycmd_curl_file/retrycmd_get_tarball). - Update provisioning-time download call sites to pass a 300s per-operation budget and adjust
retrycmd_get_tarballcall signature to include an explicit timeout. - Reduce
nvidia-smiper-try timeout from 300s to 30s and update/add ShellSpec expectations for the new retry helper behaviors.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh | Updates parameter tables and adds tests for pre-attempt global timeout and per-operation budget exit behavior. |
| spec/parts/linux/cloud-init/artifacts/cse_install_spec.sh | Updates expected retrycmd_get_tarball invocation signature to include timeout + budget. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds 300s per-operation budget to provisioning-time curl downloads (NVIDIA key, containerd, runc). |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Adds 300s per-operation budget to NVIDIA repo file download. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Adds explicit timeout + 300s per-operation budgets to provisioning-time tarball/curl downloads (credential provider, oras, CNI, crictl, k8s). |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Implements pre-attempt global timeout guard and per-operation budget; changes retrycmd_get_tarball signature. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Lowers nvidia-smi retry timeout from 300s to 30s at 3 call sites. |
…allers Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8f356829-1593-46b5-94dc-6fa4a85aa931 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
The check_cse_timeout warning message was written to stdout, which pollutes pipes. In the 2004fips VHD build, the pipeline: retrycmd_if_failure 5 10 1200 yes | ua enable fips-updates sends stdout of the left side as stdin to ua enable fips-updates. The warning 'CSE_STARTTIME_SECONDS is not set' was read by ua as the answer to the 'Are you sure? (y/N)' prompt instead of 'y' from yes, causing 'Operation cancelled by user' and exit 184. Fix: redirect the warning to stderr (>&2), consistent with the error message that was already on stderr. Update the ShellSpec test to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
parts/linux/cloud-init/artifacts/cse_helpers.sh:347
- The per-operation
maxBudgetcheck happens before startingtimeout $timeout curl ..., but the curl attempt itself can still run longer than the remaining budget (and then exceed the budget by up totimeout+ sleep). IfmaxBudgetis intended to be a hard upper bound (“total wall-clock seconds across all retries”), cap the per-attempt timeout tomin(timeout, maxBudget - elapsed)and/or re-check the budget immediately after each attempt as well.
# Check per-operation budget if set -- prevents a single download from consuming the entire CSE window
if [ "${maxBudget}" -gt 0 ]; then
local opElapsed
opElapsed=$(( $(date +%s) - opStartTime ))
if [ "$opElapsed" -ge "$maxBudget" ]; then
echo "Operation budget of ${maxBudget}s exceeded after ${opElapsed}s, exiting early." >&2
return 2
fi
fi
# check if global cse timeout is approaching
if ! check_cse_timeout; then
echo "CSE timeout approaching, exiting early." >&2
return 2
else
if [ "$i" -gt 1 ]; then
sleep $waitSleep
fi
timeout $timeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
if [ "$?" -ne 0 ]; then
spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh
Outdated
Show resolved
Hide resolved
…args - _retrycmd_internal: cap per-attempt timeout to remaining CSE budget (780s - elapsed) so a long timeout cannot overrun the provisioning window even when per-attempt timeouts are large - _retry_file_curl_internal: restructure loop so curl runs BEFORE the check+exhaustion logic, fixing a bug where retries=N only executed N-1 curl attempts (and retries=1 executed zero) - retrycmd_get_tarball spec: remove extra positional arg that shifted tarball path into URL position Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace [[ $3 =~ ^[0-9]+$ ]] with POSIX-compatible case statement in retrycmd_get_tarball to fix SC3010 shellcheck error - Fix _retry_file_curl_internal: keep pre-check at top of loop for efficiency (skip curl if file already valid), add final check after last curl attempt so retries=N always executes N curl attempts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace broken date mock (export -f date doesn't work in POSIX sh) with real-time approach: timeout mock sleeps 2s, budget=1s - Set CSE_STARTTIME_SECONDS in BeforeEach to prevent check_cse_timeout warnings in all retry tests - Explicitly unset CSE_STARTTIME_SECONDS in the check_cse_timeout test that verifies the unset behavior Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…internal Same pattern as _retrycmd_internal: when maxBudget > 0, effectiveTimeout is capped to min(timeout, maxBudget - elapsed) so a single curl attempt cannot overrun the documented per-operation budget. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| # Check CSE timeout BEFORE starting each attempt. This prevents launching a new long-running | ||
| # operation (e.g. a 300-600s GPU install) when we are already near the global provisioning | ||
| # timeout, which would push total CSE execution past the 16-minute client window. | ||
| if ! check_cse_timeout "$shouldLog"; then | ||
| echo "CSE timeout approaching, exiting early." >&2 |
There was a problem hiding this comment.
In _retrycmd_internal, the new pre-attempt check_cse_timeout call will emit the "CSE_STARTTIME_SECONDS ... is not set" warning (now on stderr) whenever CSE_STARTTIME_SECONDS is unset. That’s expected during VHD build and also for any callers that don’t run under cse_start.sh, so this change can spam build/provision logs even when the command succeeds on the first attempt. Consider skipping the pre-attempt check_cse_timeout/early-exit path unless CSE_STARTTIME_SECONDS is set (or otherwise gating the warning) so the guard only applies in real CSE runs where the variable is exported.
There was a problem hiding this comment.
@copilot apply changes based on this feedback
There was a problem hiding this comment.
Fixed in 57fa71a. Both check_cse_timeout calls in _retrycmd_internal (pre-attempt and post-failure) are now gated behind [ -n "${CSE_STARTTIME_SECONDS:-}" ], so the guard is a no-op when the variable is unset — no warnings during VHD builds or non-CSE callers.
…ARTTIME_SECONDS guard Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/0bcd6f17-76f8-4ef1-ba7a-cf3a6f43d658 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
Linux CSE runs can exceed the VM provisioner's ~16-minute client window when downloads retry indefinitely or a long-running operation starts near the deadline. This PR adds a global timeout guard and per-download wall-clock budget to bound execution time.
cse_helpers.sh_retrycmd_internal: Add pre-attemptcheck_cse_timeoutguard (gated onCSE_STARTTIME_SECONDSbeing set — no-op and no noise during VHD build). Cap per-attempteffectiveTimeouttomin(timeoutVal, remainingCseTime)so a single attempt cannot outlive the provisioning window._retry_file_curl_internal: Add optionalmaxBudget(4th arg, default 0). When > 0, capseffectiveTimeouttomin(timeout, maxBudget - elapsed)per attempt, bounding total download wall-clock to the budget regardless of retry count. Loop restructured so curl fires on every iteration (previously the last attempt was skipped).retrycmd_get_tarball: Backward-compatible signature detection — if 3rd arg is non-numeric (old callers:<retries> <wait_sleep> <tarball> <url>), defaults timeout to 60s. New callers pass explicit<timeout>as 3rd arg.retrycmd_curl_file: Exposes new optionalmax_budget_s6th arg (default 0).Call sites (
cse_install.sh,cse_install_ubuntu.sh,cse_install_mariner.sh)Pass 300s per-operation budget to all provisioning-time download sites (credential-provider, oras, TLS bootstrap, CNI ×2, crictl, k8s binaries, nvidia GPG key, containerd, runc, nvidia repo). Without budget: 120 retries × 60s = ~2h10m theoretical max per download; with budget: capped at 5 min total.
cse_config.shReduce
nvidia-smiper-try timeout from 300s → 30s at 3 call sites (nvidia-smiis a status check, not a long-running init).Backward compatibility
max_budget_sdefaults to 0 (no cap); VHD build callers unaffected.check_cse_timeoutguards are skipped whenCSE_STARTTIME_SECONDSis unset, preserving existing VHD build behavior exactly.