Run benchmarks in parallel with tests, not sequentially #1130

Conversation
Move benchmark jobs from bench.yml into test.yml so tests and benchmarks run in parallel instead of sequentially, reducing merge time from ~10hrs to ~5hrs. Add cross-cancellation jobs so a failure in either group cancels the entire workflow run. Docs-only PRs continue to skip both. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
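The cross-cancellation jobs described above can be sketched in GitHub Actions YAML. This is an illustrative sketch based on the behavior described in this PR (the later review comments mention `gh run cancel` with `GH_TOKEN`); the exact job ids and step names in test.yml may differ:

```yaml
# Hypothetical sketch: cancel the whole workflow run when tests fail,
# so the (now parallel) benchmark jobs don't keep burning node-hours.
cancel-on-test-failure:
  needs: [tests]          # assumed job id for the test matrix
  if: failure()
  runs-on: ubuntu-latest
  steps:
    - name: Cancel Workflow Run
      run: gh run cancel ${{ github.run_id }} --repo ${{ github.repository }}
      env:
        GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

A symmetric `cancel-on-bench-failure` job with `needs: [bench]` gives the failure-in-either-group-cancels-everything behavior the description mentions.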
CodeAnt AI is reviewing your PR.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your review settings.

📝 Walkthrough

Removed the standalone benchmark workflow; its jobs now run from the main test workflow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant GH as GitHub Actions (test.yml)
    participant Repo as Repository (PR & Master refs)
    participant Login as Login Node (build)
    participant SLURM as SLURM Scheduler
    participant Node as Compute Node (per-config)
    participant Diff as bench_diff / Poster
    rect rgba(93,156,236,0.5)
        GH->>Repo: clone PR and master refs
    end
    rect rgba(96,169,79,0.5)
        GH->>Login: run per-config builds (parallel)
        Login-->>GH: build artifacts/logs
    end
    rect rgba(236,151,31,0.5)
        GH->>SLURM: submit multi-node sbatch (run_frontier_all_*.sh)
        SLURM->>Node: allocate nodes and SSH per-config tasks
        Node->>Node: run frontier_test_config / frontier_bench_config
        Node-->>SLURM: write per-config YAML outputs
        SLURM-->>GH: job completion and collected outputs
    end
    rect rgba(178,120,255,0.5)
        GH->>Diff: run bench_diff & generate comment
        Diff-->>GH: diff results / comment payload
        GH->>Repo: post comment & upload logs/artifacts
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Note: Currently processing new changes in this PR. This may take a few minutes, please wait...
CodeAnt AI finished reviewing your PR.
📝 Walkthrough

The benchmark workflow has been consolidated from a standalone file into the main test workflow.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 285-287: The bench job is missing the NODE_OPTIONS env used in the
self job and can cause OOMs on Phoenix self-hosted runners; update the bench
job's env block (the same place where ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION
and ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION are set) to include NODE_OPTIONS:
'--max-old-space-size=2048' so Phoenix bench runners get the same Node memory
limit as the self job.
🧹 Nitpick comments (2)
.github/workflows/test.yml (2)
301-306: Parallel build: first failure orphans the second background job.

GitHub Actions' default shell uses `set -eo pipefail`. If `wait %1` returns non-zero, the shell exits immediately and `wait %2` is never called — the second build process keeps running in the background on the self-hosted runner. While the step correctly fails, the orphaned process wastes resources. A more robust pattern waits for both jobs unconditionally and then checks both exit codes:

Proposed fix

```diff
 - name: Setup & Build
   if: matrix.build_script != ''
   run: |
     (cd pr && ${{ matrix.build_script }}) &
     (cd master && ${{ matrix.build_script }}) &
-    wait %1 && wait %2
+    wait %1 || s1=$?
+    wait %2 || s2=$?
+    if [ "${s1:-0}" -ne 0 ] || [ "${s2:-0}" -ne 0 ]; then exit 1; fi
```

A simpler alternative, suppressing `set -e` for the step so the plain waits can't abort early:

```diff
 - name: Setup & Build
   if: matrix.build_script != ''
   run: |
+    set +e
     (cd pr && ${{ matrix.build_script }}) &
     (cd master && ${{ matrix.build_script }}) &
-    wait %1 && wait %2
+    wait %1; s1=$?
+    wait %2; s2=$?
+    [ "$s1" -eq 0 ] && [ "$s2" -eq 0 ]
```
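The wait-for-both-then-check pattern discussed above can be exercised outside CI. In this self-contained sketch, `true`/`false` stand in for the two build commands:

```shell
#!/bin/bash
# Sketch: wait for BOTH background jobs before deciding pass/fail,
# so a first failure never orphans the second job.
set -e

run_pair() {
  set +e                       # don't abort between the two waits
  "$1" & p1=$!
  "$2" & p2=$!
  wait "$p1"; s1=$?
  wait "$p2"; s2=$?
  set -e
  [ "$s1" -eq 0 ] && [ "$s2" -eq 0 ]
}

ok=pass;    run_pair true true  || ok=fail
mixed=pass; run_pair true false || mixed=fail
echo "both-succeed=$ok one-fails=$mixed"
```

Because each `wait` runs to completion, neither child is left running when the other fails.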
311-314: `bench_diff` step missing `if: always()` or `if: success()` guard.

If the "Bench (Master v. PR)" step fails, the "Generate & Post Comment" step is skipped by default (GitHub Actions skips subsequent steps on failure). If you want the diff comment posted even on partial benchmark failure (e.g., to report which benchmarks regressed), consider adding `if: always()`. If the current behavior (skip on failure) is intended, this is fine as-is.
Caution: Review failed — failed to post review comments.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Benchmark failures (e.g. transient network issues) should not cancel correctness tests. Keep only cancel-on-test-failure so that test failures still cancel benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

    @@ Coverage Diff @@
    ##           master    #1130   +/-   ##
    =======================================
      Coverage   44.03%   44.03%
    =======================================
      Files          70       70
      Lines       20649    20649
      Branches     2053     2053
    =======================================
      Hits         9093     9093
      Misses      10368    10368
      Partials     1188     1188

☔ View full report in Codecov by Sentry.
Reduces Frontier queue depth from 11 SLURM jobs to 2 by consolidating test configs into a single 5-node allocation and bench configs into a single 6-node allocation. Builds remain on login nodes; tests/benchmarks run on compute nodes via ssh to allocated nodes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeAnt AI is running incremental review.

CodeAnt AI incremental review completed.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.github/scripts/frontier_bench_post.sh:
- Around line 34-35: The issue is that the module environment sourced by ".
./mfc.sh load -c \"$flag\" -m g" is lost because it runs in a separate subshell
from the subsequent "./mfc.sh bench_diff" call; fix it by running both commands
in the same subshell or sourcing the loader in the parent shell so the
environment persists — e.g., change the two separate "(cd pr && ...)"
invocations so they are combined into a single "(cd pr && . ./mfc.sh load -c
\"$flag\" -m g && ./mfc.sh bench_diff \"../$master_yaml\" \"../$pr_yaml\")" (or
alternatively run "cd pr" then source "mfc.sh load" in the current shell before
invoking "./mfc.sh bench_diff"), ensuring the loader (./mfc.sh load) and
bench_diff run in the same shell/session.
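The subshell pitfall this fix describes is easy to reproduce with plain shell. In this demo, `DEMO_VAR` stands in for whatever environment `. ./mfc.sh load` would export (illustrative names only):

```shell
#!/bin/bash
# Environment changed in one subshell is invisible to a sibling
# subshell or the parent shell.
unset DEMO_VAR
(export DEMO_VAR=loaded)               # like: (cd pr && . ./mfc.sh load ...)
sibling="${DEMO_VAR:-unset}"           # the export is already gone

# Fix: run the loader and the consumer in the SAME subshell.
combined=$(export DEMO_VAR=loaded; printf '%s' "$DEMO_VAR")

echo "sibling=$sibling combined=$combined"
```

This is why the review asks for `mfc.sh load` and `bench_diff` to be combined into a single `( ... && ... )` group.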
In @.github/scripts/frontier_test_config.sh:
- Around line 28-34: The GPU probe using rocm-smi (the gpus and ngpus
assignments) is executed unconditionally and will cause CPU-only runs to fail;
move the rocm-smi call and the ngpus calculation inside the branch that runs
when device equals "gpu". Specifically, relocate the lines that set
gpus=$(rocm-smi ...) and ngpus=$(...) into the if [ "$device" = "gpu" ] branch
before invoking ./mfc.sh, and add a safe fallback for ngpus (e.g., default to 1
or exit with a clear error) if rocm-smi yields no results; ensure you reference
the same variable names (gpus, ngpus, device) and preserve the existing ./mfc.sh
invocation when fixing.
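The fix described above — probing GPUs only inside the gpu branch, with a fallback — might look like this sketch. `probe_gpus` is a hypothetical stub standing in for `rocm-smi` so the snippet runs anywhere:

```shell
#!/bin/bash
set -e
device="${1:-cpu}"

# Stand-in for `rocm-smi --showid`; a real node would query ROCm here.
probe_gpus() { echo "GPU[0] GPU[1]"; }

if [ "$device" = "gpu" ]; then
  # GPU discovery happens ONLY on the gpu path, so CPU-only runs on
  # nodes without ROCm never execute it (and set -e can't kill them).
  gpus=$(probe_gpus | grep -Eo '[0-9]+' || true)
  ngpus=$(printf '%s\n' "$gpus" | grep -c . || true)
  [ "$ngpus" -ge 1 ] || { echo "no GPUs detected" >&2; exit 1; }
else
  ngpus=0
fi
echo "device=$device ngpus=$ngpus"
```

Run with no argument it takes the cpu path and reports zero GPUs without ever touching the probe.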
In @.github/workflows/test.yml:
- Around line 270-275: The "Setup & Build" workflow step is dead because
matrix.build_script is set to the empty string for every matrix entry, so the
conditional if: matrix.build_script != '' is never true; either remove the
entire "Setup & Build" step from the workflow or populate the matrix entries
with non-empty build_script values for the matrix entries that need building so
the condition can be true, and update any references to that step accordingly
(look for the job name/step "Setup & Build" and the matrix variable
matrix.build_script).
- Around line 310-319: The workflow only implements one-directional cancellation
via the job named cancel-on-test-failure; to make cancellation bidirectional add
a symmetric job (e.g., cancel-on-bench-failure) that uses needs: [bench], if:
failure(), runs-on: ubuntu-latest and the same steps (Cancel Workflow Run using
gh run cancel ${{ github.run_id }} --repo ${{ github.repository }} with GH_TOKEN
env) so benchmarks failing will cancel the tests; ensure the new job mirrors
cancel-on-test-failure but references needs: [bench] and an appropriate name and
position in the YAML.
🧹 Nitpick comments (4)
.github/scripts/frontier_bench_config.sh (1)
29-33: Quote `$(nproc)` to prevent word splitting (ShellCheck SC2046).

While `nproc` returns a single integer in practice, quoting the command substitution is the safer shell idiom and silences the linter.

Proposed fix

```diff
- ./mfc.sh bench --mem 1 -j $(nproc) -o "$job_slug.yaml" -- -c "$cluster" $device_opts -n $n_ranks
+ ./mfc.sh bench --mem 1 -j "$(nproc)" -o "$job_slug.yaml" -- -c "$cluster" $device_opts -n $n_ranks
```

.github/scripts/run_frontier_all_benchmarks.sh (2)

78-86: Duplicated config table inside the heredoc is a maintenance hazard.

The config array on lines 79–86 must be kept in exact sync with lines 15–22. If someone adds, removes, or reorders a config in one place but not the other, the node-to-config mapping silently breaks (wrong benchmark runs on wrong node, or index out of bounds). The same pattern appears in `run_frontier_all_tests.sh` lines 81–88. Consider generating the inner config table dynamically (e.g., write it to a file in Phase 1 and source it inside the heredoc), or at minimum add a CI-time assertion that both arrays match.

111-118: `$?` after a failed `if wait` may not carry the original exit code in all shells.

In bash, `$?` in the `else` branch of `if wait "$pid"` correctly reflects the child's exit status. However, this is a bash-specific behavior — the shebang (`#!/bin/bash`) ensures it, but worth a brief comment for future maintainers, since it's a subtle pattern.

.github/scripts/run_frontier_all_tests.sh (1)

56-56: Hardcoded "5-node" in the echo; the benchmarks script uses `$num_nodes`.

Minor inconsistency — `run_frontier_all_benchmarks.sh` line 53 uses `"Submitting ${num_nodes}-node SLURM job..."` while this script hardcodes the number. Using `$num_nodes` keeps it self-documenting if configs change.

Proposed fix

```diff
-echo "All builds complete. Submitting 5-node SLURM job..."
+echo "All builds complete. Submitting ${num_nodes}-node SLURM job..."
```
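The single-source-of-truth approach suggested for the duplicated config table — write the table to a file in Phase 1, then read the same file inside the sbatch heredoc — can be sketched as follows (file names are illustrative):

```shell
#!/bin/bash
# Sketch: keep ONE config table by writing it to a file first,
# then reading that file back where the SLURM heredoc would.
set -e
work=$(mktemp -d)

configs=(
  "pr frontier gpu acc"
  "master frontier gpu acc"
)
printf '%s\n' "${configs[@]}" > "$work/configs.txt"

# Inside the real sbatch script this would be the same readarray call,
# so both sides always see an identical table.
readarray -t inner_configs < "$work/configs.txt"

echo "outer=${#configs[@]} inner=${#inner_configs[@]}"
```

Since both arrays come from one file, a count or content mismatch becomes impossible rather than merely asserted.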
```yaml
- name: Setup & Build
  if: matrix.build_script != '' && matrix.cluster != 'frontier_all'
  run: |
    (cd pr && ${{ matrix.build_script }}) &
    (cd master && ${{ matrix.build_script }}) &
    wait %1 && wait %2
```
Setup & Build step is dead code — build_script is empty for every matrix entry.
All matrix entries set build_script: "" (lines 226, 234, 242, 250), so the condition matrix.build_script != '' is never true and this step never executes. Either remove it or populate build_script for the entries that need it.
🤖 Prompt for AI Agents
In @.github/workflows/test.yml around lines 270 - 275, The "Setup & Build"
workflow step is dead because matrix.build_script is set to the empty string for
every matrix entry, so the conditional if: matrix.build_script != '' is never
true; either remove the entire "Setup & Build" step from the workflow or
populate the matrix entries with non-empty build_script values for the matrix
entries that need building so the condition can be true, and update any
references to that step accordingly (look for the job name/step "Setup & Build"
and the matrix variable matrix.build_script).
4 issues found across 6 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/scripts/frontier_test_config.sh">
<violation number="1" location=".github/scripts/frontier_test_config.sh:28">
P1: The `rocm-smi` command is invoked unconditionally before the device type check. With `set -e` enabled, this will cause CPU-only tests to fail immediately on nodes without ROCm installed, as the script will terminate before reaching the `else` branch. Move the GPU discovery logic inside the `if [ "$device" = "gpu" ]` block.</violation>
</file>
<file name=".github/scripts/run_frontier_all_benchmarks.sh">
<violation number="1" location=".github/scripts/run_frontier_all_benchmarks.sh:140">
P1: The job ID extraction using `grep -oE '[0-9]+'` is fragile and can incorrectly match numbers from error messages (e.g., account names like 'ENG160' or error codes). This could cause the script to proceed with an invalid job ID when `sbatch` actually failed. Parse only the specific 'Submitted batch job' line to ensure a real job ID is captured.</violation>
</file>
<file name=".github/scripts/frontier_bench_config.sh">
<violation number="1" location=".github/scripts/frontier_bench_config.sh:13">
P2: `frontier_bench_config.sh` always loads GPU mode (`-m g`) even when `device=cpu`, unlike the test config. CPU benchmark runs will get the wrong module set; mirror the device-based mode selection.</violation>
</file>
<file name=".github/scripts/frontier_bench_post.sh">
<violation number="1" location=".github/scripts/frontier_bench_post.sh:35">
P1: The environment setup and `bench_diff` invocation are in separate subshells, so the environment changes from `. ./mfc.sh load` are not visible to `bench_diff`. Both commands should run in the same subshell to ensure the loaded environment is applied.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
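The stricter job-ID parse requested above — accepting only sbatch's `Submitted batch job N` line, never stray digits from error text — can be checked against canned output (the sample strings below are made up):

```shell
#!/bin/bash
# Extract the job ID only from sbatch's success line, so numbers in
# error messages (account names, error codes) can never match.
parse_job_id() {
  sed -n 's/^Submitted batch job \([0-9]\{1,\}\)$/\1/p'
}

good="Submitted batch job 2345678"
bad="sbatch: error: Invalid account ENG160, error code 1"

id_ok=$(printf '%s\n' "$good" | parse_job_id)
id_bad=$(printf '%s\n' "$bad" | parse_job_id)

echo "ok=$id_ok bad=${id_bad:-none}"
```

A bare `grep -oE '[0-9]+'` over the same inputs would happily return `160` from the error line; anchoring on the full success line avoids that.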
Use rsync with --link-dest and --exclude to avoid copying the target directories back into themselves when creating per-config source copies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move rocm-smi GPU discovery inside the gpu branch so CPU configs don't fail with set -e (frontier_test_config.sh) - Combine mfc.sh load and bench_diff into single subshell so loaded module environment persists (frontier_bench_post.sh) - Use device-based mode selection in bench config for consistency (frontier_bench_config.sh) - Parse 'Submitted batch job' line instead of bare grep for digits (run_frontier_all_tests.sh, run_frontier_all_benchmarks.sh) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents rapid pushes from cancelling unrelated jobs. Each job type (github, self, bench) now has its own concurrency group keyed by ref and matrix params, so a new push only cancels stale instances of the same job — not all jobs across the workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/workflows/test.yml">
<violation number="1" location=".github/workflows/test.yml:68">
P2: The concurrency group for the github matrix job omits matrix.precision, so the default-precision job and the `single` precision job share the same group and can cancel each other when `cancel-in-progress: true` is enabled. Include precision in the concurrency key to avoid unintended cancellations and preserve parallelism.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
/improve
Run all config builds concurrently instead of sequentially. Each build runs in its own subshell/directory with output captured to a per-config log file. On failure, only the failed build logs are printed. Tests: 5 builds x -j 8 = 40 cores; Bench: 6 builds x -j 8 = 48 cores. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
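The per-config log-capture scheme this commit describes might look like the reduced sketch below, with a hypothetical `fake_build` standing in for the real `./mfc.sh` builds:

```shell
#!/bin/bash
# Sketch: run every config build in parallel, each in its own subshell
# with output captured to a per-config log; report only the failures.
set -e
logdir=$(mktemp -d)
fake_build() { echo "building $1"; [ "$1" != "beta" ]; }   # beta fails

pids=""
for cfg in alpha beta gamma; do
  ( fake_build "$cfg" ) > "$logdir/$cfg.log" 2>&1 &
  pids="$pids $!:$cfg"
done

failed=""
for entry in $pids; do
  pid=${entry%%:*}; cfg=${entry#*:}
  wait "$pid" || failed="$failed $cfg"
done
echo "failed:${failed:-none}"
```

On failure, a real script would cat only the logs named in `$failed`, which is the "only the failed build logs are printed" behavior.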
New pushes should cancel all stale jobs to free limited runners. Per-job concurrency was overly granular — old Phoenix/Github jobs kept running even when a new push made them irrelevant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.github/scripts/frontier_bench_config.sh:
- Around line 30-34: The shell invocation of ./mfc.sh bench uses unquoted
numeric substitutions: wrap the $n_ranks and the $(nproc) expansions in double
quotes to avoid word-splitting or empty-value pitfalls (leave intentional
word-splitting for $device_opts as-is); update the two command lines that call
mfc.sh bench (the gpu branch using -j $n_ranks and the else branch using -j
$(nproc)) to use "-j \"$n_ranks\"" and "-j \"$(nproc)\"" respectively, keeping
other arguments like "$job_slug.yaml" and -c "$cluster" unchanged.
In @.github/scripts/run_frontier_all_benchmarks.sh:
- Around line 64-68: The append to build_exits is using an unquoted variable
which can cause word-splitting; change the array append from
build_exits+=($code) to append the quoted value (build_exits+=("$code")) so the
exit code is treated as a single element; update the occurrence around the else
branch where variables code, build_exits and build_failed are handled in the
run_frontier_all_benchmarks.sh script.
- Around line 29-35: The current loop using cp -al (in the for cfg ... read -r
version cluster device interface ... dir=... block) will include previously
created config dirs into subsequent copies; replace the cp-based source-copy
strategy with an rsync --archive --link-dest="$version" plus an --exclude list
(or dynamically build --exclude entries from already-created dir names) to avoid
copying sibling config directories into each other; specifically update the code
that creates "$dir" (the cp -al/cp -r calls) to invoke rsync with --link-dest
pointing at the original "$version" directory and --exclude for any
already-created config dirs so each new config is a hard-linked copy without
nesting.
In @.github/scripts/run_frontier_all_tests.sh:
- Around line 69-73: The variable expansion of code is unquoted which can cause
word-splitting; update the build failure block to quote $code everywhere it's
expanded — e.g., change echo " Build FAILED: $cluster $device $interface (exit
$code)" to ensure the exit code expansion is quoted where appropriate and push
the code into the array as build_exits+=("$code") (and similarly quote any other
$code uses) so SC2206 is resolved; locate this in the failure handling around
build_exits and build_failed in the script.
🧹 Nitpick comments (4)
.github/scripts/run_frontier_all_benchmarks.sh (1)

15-22: Duplicated config table inside the heredoc is a maintenance hazard.

The config array is defined twice — once on lines 15–22 and again inside the SLURM heredoc on lines 118–125 — with a comment "must match the outer script". Any future edit to one that misses the other will silently cause node–config mismatches and wrong benchmark results. Consider generating the inner config table dynamically (e.g., writing it to a file before sbatch and sourcing it inside, or using an `envsubst`/template approach) to keep a single source of truth. The same concern applies to `run_frontier_all_tests.sh`.

Also applies to: 117-125

.github/scripts/frontier_bench_config.sh (2)

6-9: Missing `set -u` — undefined positional args silently become empty strings.

The orchestrator scripts use `set -euo pipefail`, but this script only uses `set -e`. If called with fewer than 3 arguments, `$1`/`$2`/`$3` will silently expand to empty strings, leading to confusing downstream failures (e.g., empty `$cluster` in the `mfc.sh load` call).

Proposed fix

```diff
 set -e
 set -x

+if [ $# -ne 3 ]; then
+  echo "Usage: $0 <cluster> <device> <interface>"
+  exit 1
+fi
+
 cluster=$1; device=$2; interface=$3
```

20-28: GPU detection pipeline will abort the script if no GPUs are found (via `set -e` + `grep`).

If `rocm-smi` returns no numeric GPU IDs, `grep -Eo '[0-9]+'` exits non-zero and `set -e` kills the script without a clear error message. This is acceptable as a fast-fail on a compute node that should have GPUs, but a guard with an explicit error message would improve debuggability.

.github/scripts/run_frontier_all_tests.sh (1)

27-40: `$excludes` string should be an array for robustness.

Line 39 uses `$excludes` as an unquoted string to achieve word splitting. While the current directory names are safe, this pattern is fragile. Using an array would be more robust and consistent with bash best practices.

Proposed fix

```diff
-excludes=""
+excludes=()
 for cfg in "${configs[@]}"; do
   read -r cluster device interface <<< "$cfg"
-  excludes+=" --exclude=test-${cluster}-${device}-${interface}"
+  excludes+=(--exclude="test-${cluster}-${device}-${interface}")
 done

 for cfg in "${configs[@]}"; do
   read -r cluster device interface <<< "$cfg"
   dir="test-${cluster}-${device}-${interface}"
   echo "Creating source copy: $dir"
-  rsync -a --link-dest="$(pwd)" $excludes ./ "$dir/"
+  rsync -a --link-dest="$(pwd)" "${excludes[@]}" ./ "$dir/"
 done
```
Prints a periodic heartbeat (every 2 min) during parallel Frontier builds showing which configs are still running with log sizes and last output line. Build logs are printed in collapsible ::group:: sections for passed builds and in full for failed builds so failures are immediately visible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2 issues found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/scripts/run_frontier_all_benchmarks.sh">
<violation number="1" location=".github/scripts/run_frontier_all_benchmarks.sh:94">
P2: `kill` can fail if the heartbeat already exited, and with `set -e` that aborts the script. Guard the kill with `|| true` to avoid false failures when the heartbeat finishes first.</violation>
</file>
<file name=".github/scripts/run_frontier_all_tests.sh">
<violation number="1" location=".github/scripts/run_frontier_all_tests.sh:99">
P2: `set -e` causes the unconditional `kill` to abort the script when the heartbeat has already exited. Guard the `kill` with `|| true` (or check the PID) so a normal heartbeat exit doesn’t fail the workflow.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
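The `|| true` guard both findings request is easy to sanity-check: killing a heartbeat that has already exited returns non-zero, and under `set -e` only the guard keeps that from failing the whole script.

```shell
#!/bin/bash
# Demo: under `set -e`, an unguarded `kill` of an already-exited
# heartbeat would abort the script; `|| true` absorbs the failure.
set -e
( sleep 0.1 ) &            # stand-in for the heartbeat subshell
hb_pid=$!
sleep 0.5                  # heartbeat has long since exited

kill "$hb_pid" 2>/dev/null || true
status=survived
echo "status=$status"
```

The same line works whether the heartbeat is still running (kill succeeds) or already gone (kill fails, `|| true` swallows it).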
Actionable comments posted: 1
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/test.yml (1)
9-11: ⚠️ Potential issue | 🟠 Major — Add `scancel` to cleanup handler in `monitor_slurm_job.sh`.

The script does set a trap on EXIT (line 13) but only kills the tail process. When GitHub Actions cancels the workflow via `cancel-in-progress: true`, the monitoring script exits while the submitted SLURM job continues running on Frontier until it completes or times out. The cleanup function should call `scancel $job_id` to terminate the job:

Suggested cleanup handler

```shell
cleanup() {
  if [ -n "${tail_pid:-}" ]; then
    kill "${tail_pid}" 2>/dev/null || true
  fi
  if [ -n "${job_id:-}" ]; then
    scancel "$job_id" 2>/dev/null || true
  fi
}
```
🤖 Fix all issues with AI agents
In @.github/scripts/run_frontier_all_tests.sh:
- Around line 127-129: The echo message currently hardcodes "5-node" which will
become incorrect if the node count changes; update the final status echo to
interpolate the variable num_nodes instead of the literal "5-node" (replace the
string in the echo that mentions "5-node SLURM job..." to use "$num_nodes" so it
reflects the actual configured node count), ensuring any surrounding text
remains the same and quoting is preserved.
🧹 Nitpick comments (5)
.github/scripts/run_frontier_all_benchmarks.sh (2)

15-22: Duplicated config table between outer script and SLURM heredoc is fragile.

The configs array is defined twice — lines 15-22 and lines 149-156 — with only a comment ("must match the outer script") linking them. If a config is added, removed, or reordered in one place but not the other, the SLURM job will silently map the wrong config to the wrong node. Consider generating the inner config table dynamically (e.g., write it to a file in Phase 1 and source it inside the heredoc) or at minimum add a count assertion inside the SLURM job.

Minimal guard: assert count inside the SLURM job

```diff
 # Config table (must match the outer script)
 configs=(
   "pr frontier gpu acc"
   "pr frontier gpu omp"
   "pr frontier_amd gpu omp"
   "master frontier gpu acc"
   "master frontier gpu omp"
   "master frontier_amd gpu omp"
 )
+
+if [ "${#configs[@]}" -ne "$SLURM_JOB_NUM_NODES" ]; then
+  echo "ERROR: config count (${#configs[@]}) != allocated nodes ($SLURM_JOB_NUM_NODES)"
+  exit 1
+fi
```

Also applies to: 149-156

168-171: SSH commands use unquoted `$SLURM_SUBMIT_DIR/$dir` — safe only if paths have no spaces.

Inside the SLURM heredoc, the `cd` and paths passed to SSH are unquoted. SLURM submit directories on Frontier are unlikely to contain spaces, but quoting them is low-cost insurance.

Proposed fix

```diff
 ssh -q -o StrictHostKeyChecking=no "$node" \
-  "cd $SLURM_SUBMIT_DIR/$dir && bash .github/scripts/frontier_bench_config.sh $cluster $device $interface" \
+  "cd \"$SLURM_SUBMIT_DIR/$dir\" && bash .github/scripts/frontier_bench_config.sh \"$cluster\" \"$device\" \"$interface\"" \
   > "$outfile" 2>&1 &
```

.github/scripts/run_frontier_all_tests.sh (2)

14-20: Same fragile config duplication as in `run_frontier_all_benchmarks.sh`.

The configs array is maintained in two places (lines 14-20 and 154-160) with a "must match" comment. Same risk of silent mismatch. Consider the same mitigation — at minimum, assert `${#configs[@]} == $SLURM_JOB_NUM_NODES` inside the SLURM job.

Also applies to: 154-160

172-176: Same unquoted path concern in SSH commands as benchmarks script.

`$SLURM_SUBMIT_DIR/$dir` is unquoted inside the SSH command string. Unlikely to cause issues on Frontier but quoting is low-cost.

Proposed fix

```diff
 ssh -q -o StrictHostKeyChecking=no "$node" \
-  "cd $SLURM_SUBMIT_DIR/$dir && bash .github/scripts/frontier_test_config.sh $cluster $device $interface" \
+  "cd \"$SLURM_SUBMIT_DIR/$dir\" && bash .github/scripts/frontier_test_config.sh \"$cluster\" \"$device\" \"$interface\"" \
   > "$outfile" 2>&1 &
```

.github/workflows/test.yml (1)

185-195: Build step (line 186) is effectively dead code for current matrix entries.

The condition `matrix.cluster != 'phoenix' && matrix.cluster != 'frontier_all'` excludes every current matrix entry (all are either `phoenix` or `frontier_all`). This is harmless since both clusters handle builds internally (phoenix via `submit.sh`, frontier_all via `run_frontier_all_tests.sh`), but it's worth knowing this step is a no-op today. No action needed unless you want to clean it up.
Add || true to kill command so set -e doesn't abort the script when the heartbeat subshell has already exited. Use $num_nodes instead of hardcoded "5-node" in the test script log message. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Self-hosted runners don't clean workspaces between runs, so config directories from cancelled prior runs can persist. Add rm -rf before cp -al/rsync to prevent "same file" hardlink errors when copying into an existing directory. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Eliminate config table duplication by writing configs to a file and reading it inside the sbatch heredoc (now unquoted for variable expansion) - Add EXIT trap in sbatch scripts to kill orphaned SSH processes on job cancellation - Add per-config timeout (90 min tests, 120 min benchmarks) to prevent a single hanging config from consuming the full walltime - Extract SLURM account, partition, walltime, and node count into variables at the top of each orchestration script - Validate GPU count from rocm-smi (1-16 range) with diagnostic output on failure in both test and benchmark compute-node scripts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeAnt AI is running incremental review.

CodeAnt AI incremental review completed.
User description
Summary
Test plan
Summary by CodeRabbit
Chores
New Features
CodeAnt-AI Description
Run benchmarks in parallel with tests and consolidate Frontier SLURM jobs
What Changed
Impact
✅ Shorter CI merge time for PRs with tests and benchmarks
✅ Fewer correctness test runs killed by benchmark flakes
✅ Lower Frontier SLURM queue depth and simpler node usage