
Harden SLURM job monitor against transient squeue failures#1140

Closed
sbryngelson wants to merge 3 commits into MFlowCode:master from sbryngelson:robust-slurm-monitor

Conversation


@sbryngelson sbryngelson commented Feb 13, 2026

User description

Summary

  • The monitor script used squeue -j $id &>/dev/null which only checks the exit code. When squeue itself fails transiently (SLURM daemon overloaded, network hiccup), this is indistinguishable from "job doesn't exist," causing the monitor to give up on jobs that are still PENDING — leaving orphaned SLURM jobs. A minimal before/after sketch follows this list.
  • Adds get_job_state() that parses squeue output for the actual state string, with sacct fallback for completed/historical jobs
  • Never gives up on UNKNOWN state (lets CI timeout be the backstop)
  • Cancels orphaned SLURM jobs on abnormal monitor exit via cleanup trap
  • Fixes fractional read timeouts that caused bash segfaults on some systems
  • Includes job state in heartbeat messages for better diagnostics
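
For reference, here is a before/after sketch of the check (variable names are illustrative; the actual get_job_state and is_terminal_state definitions appear in the review excerpts further down):

# Before: only the exit status is checked, so a transient squeue failure
# looks the same as the job no longer existing.
if ! squeue -j "$job_id" &>/dev/null; then
  echo "Job $job_id appears gone"   # wrong conclusion if squeue itself failed
fi

# After (sketch): act on the parsed state string instead.
state=$(get_job_state "$job_id")    # squeue %T, sacct fallback, else UNKNOWN
if is_terminal_state "$state"; then
  echo "Job $job_id reached terminal state: $state"
fi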

Test plan

  • Verify Phoenix benchmark jobs no longer falsely abort when squeue is transiently unavailable
  • Verify Frontier test/benchmark jobs detect terminal states (COMPLETED, FAILED, CANCELLED) correctly
  • Verify orphaned SLURM jobs are cancelled when CI runner is killed

🤖 Generated with Claude Code


CodeAnt-AI Description

Harden SLURM job monitor: robust state checks, cancel orphaned jobs, improved streaming reliability

What Changed

  • Monitor now determines job state by parsing squeue output with a sacct fallback so active, completed, and historical job states are detected reliably instead of relying on squeue exit codes alone.
  • The monitor no longer gives up when SLURM queries temporarily fail (reports UNKNOWN and keeps waiting); it only treats explicit terminal states as job completion and prints the state in periodic heartbeats.
  • If the monitor exits abnormally (for example the CI runner is killed), it cancels the associated SLURM job to avoid leaving orphaned jobs; it does not cancel on normal successful completion.
  • Increased read timeouts and stabilized tail/drain logic to prevent watcher crashes and to ensure complete streaming of job output before exiting.

Impact

✅ Fewer orphaned SLURM jobs left when monitors are killed
✅ Clearer job state shown in heartbeat logs during runs
✅ Fewer monitor crashes and more reliable streaming of job output

💡 Usage Guide

Checking Your Pull Request

Every time you open a pull request, our system automatically reviews it for security issues, infrastructure misconfigurations, and common code problems, so your changes are solid and won't cause trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

  • Bug Fixes

    • Improved job monitoring with robust state detection, terminal-state handling, and better handling of transient system unavailability
    • More reliable cleanup that cancels jobs on abnormal exit and stabilizes output handling
    • More accurate exit-code determination with a two-step retrieval and explicit failure reporting
  • Chores

    • Refined polling, streaming, heartbeat messaging, and drain limits for more efficient and stable monitoring
  • Documentation

    • Minor comment clarification for global parameters

The monitor script used `squeue -j $id &>/dev/null` which only checks
the exit code. When squeue itself fails transiently (SLURM daemon
overloaded, network hiccup), this is indistinguishable from "job
doesn't exist," causing the monitor to give up on jobs that are still
PENDING in the queue — leaving orphaned SLURM jobs.

Changes:
- Add get_job_state() that parses squeue output for the actual state
  string, with sacct fallback for completed/historical jobs
- Never give up on UNKNOWN state (let CI timeout be the backstop)
- Cancel orphaned SLURM jobs on abnormal monitor exit
- Fix fractional read timeouts that caused bash segfaults
- Include job state in heartbeat messages for better diagnostics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 13, 2026 02:10

codeant-ai bot commented Feb 13, 2026

CodeAnt AI is reviewing your PR.



@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Cleanup Logic

The EXIT trap cancels the SLURM job whenever monitor_success is not set to 1, which includes normal “job failed” paths and early exits after detecting a terminal state. This can result in calling scancel for already-terminal jobs (or for jobs that ended just before the script exits), which may cause confusing logs or unintended cancellation if there is any race between terminal detection and cleanup. Consider tightening the conditions (e.g., only cancel on signals/aborts) or recording an explicit “job is done” flag when a terminal state is observed.

# Cleanup handler to prevent orphaned tail processes and cancel orphaned jobs
cleanup() {
  if [ -n "${tail_pid:-}" ]; then
    kill "${tail_pid}" 2>/dev/null || true
  fi
  # Cancel the SLURM job if the monitor is exiting due to an error
  # (e.g., the CI runner is being killed). Don't cancel on success.
  if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
    scancel "$job_id" 2>/dev/null || true
  fi
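
A sketch of that tightening, assuming a hypothetical job_done flag that the monitor would set as soon as a terminal state is observed (not code from this PR):

cleanup() {
  if [ -n "${tail_pid:-}" ]; then
    kill "${tail_pid}" 2>/dev/null || true
  fi
  # Only cancel when the monitor failed AND the job was never seen to finish
  if [ "${monitor_success:-0}" -ne 1 ] && [ "${job_done:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    echo "Monitor exiting abnormally; cancelling SLURM job $job_id"
    scancel "$job_id" 2>/dev/null || true
  fi
}
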
State Detection

The sacct fallback uses the first line returned by sacct -j, but sacct commonly returns multiple records (job, batch step, extern, etc.) and the first record may not represent the overall job state reliably. Validate that the chosen line is the correct one across clusters, or constrain the query to the job allocation/primary record to avoid misclassifying terminal vs non-terminal states.

# Robustly check SLURM job state using squeue with sacct fallback.
# Returns the state string (PENDING, RUNNING, COMPLETED, FAILED, etc.)
# or "UNKNOWN" if both commands fail.
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
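
One way to pin the fallback query to the primary record, as suggested above: sacct's -X/--allocations flag restricts output to the job allocation row. A sketch (not the PR's code), with || true guarding against strict-mode pipeline failures:

  state=$(sacct -j "$jid" -n -X -o State 2>/dev/null | head -n1 | awk '{print $1}' || true)
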
Timing/Blocking

Switching read -t from a fractional timeout to -t 1 avoids platform issues, but it also changes monitoring responsiveness: the inner read loop can now block up to 1s per attempt even when there is no output, then the loop also sleeps 1s, potentially slowing state checks/heartbeats. Confirm this doesn’t cause unacceptable latency in detecting terminal states or in emitting heartbeat messages.

while true; do
  # Try to read from tail output (non-blocking via timeout)
  # Read multiple lines if available to avoid falling behind
  lines_read=0
  while IFS= read -r -t 1 line <&3 2>/dev/null; do
    echo "$line"
    lines_read=$((lines_read + 1))
    last_heartbeat=$(date +%s)
    # Limit burst reads to avoid starving the status check
    if [ $lines_read -ge 100 ]; then
      break
    fi
  done

  # Check job status
  current_time=$(date +%s)
  state=$(get_job_state "$job_id")

  if is_terminal_state "$state"; then
    echo "[$(date +%H:%M:%S)] Job $job_id reached terminal state: $state"
    break
  else
    # Print heartbeat if no output for 60 seconds
    if [ $((current_time - last_heartbeat)) -ge 60 ]; then
      echo "[$(date +%H:%M:%S)] Job $job_id state=$state (no new output for 60s)..."
      last_heartbeat=$current_time
    fi
  fi

  # Sleep briefly between status checks
  sleep 1
done

# Drain any remaining output from tail after job completes
echo "Draining remaining output..."
drain_count=0
while IFS= read -r -t 1 line <&3 2>/dev/null; do
  echo "$line"
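
If that latency matters in practice, one option (illustrative only, not part of this PR) is to skip the fixed sleep whenever the burst read already consumed output, so heartbeats and terminal-state detection are not delayed by an extra second on busy logs:

  # Only back off when the log was idle during this iteration
  if [ "${lines_read:-0}" -eq 0 ]; then
    sleep 1
  fi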


coderabbitai bot commented Feb 13, 2026

📝 Walkthrough


Reworks SLURM job monitoring script with improved state tracking via new utility functions (get_job_state, is_terminal_state), introduces abnormal-exit cleanup handler, replaces ad-hoc polling with a state-machine approach, enhances streaming/tailing logic and heartbeat reporting, and adds a two-step exit-code retrieval (scontrol then sacct). A minor comment change in a Fortran file is included.
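
A rough sketch of the two-step ExitCode retrieval described here, assuming conventional scontrol/sacct output formats (function and variable names are illustrative, not the script's actual code):

get_exit_code() {
  local jid="$1" code
  # Step 1: scontrol knows about recently finished jobs
  code=$(scontrol show job "$jid" 2>/dev/null | grep -oE '\bExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2 || true)
  # Step 2: fall back to accounting for older jobs
  if [ -z "$code" ] && command -v sacct >/dev/null 2>&1; then
    code=$(sacct -j "$jid" -n -X -o ExitCode 2>/dev/null | head -n1 | tr -d ' ' || true)
  fi
  echo "${code:-UNKNOWN}"
}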

Changes

  • SLURM Job Monitoring (.github/scripts/monitor_slurm_job.sh): Adds get_job_state and is_terminal_state; replaces squeue retry polling with state-machine loops; adds abnormal-exit cleanup (conditional scancel); improves tailing (timeout, burst cap, hard drain limit) and heartbeat logging; implements two-step ExitCode retrieval (scontrol → sacct) and treats non-zero exit codes as failures.
  • Fortran comment tweak (src/pre_process/m_check_patches.fpp): Non-functional change: updated public use statement comment from "Global parameters for the code" to "Global parameters".

Sequence Diagram(s)

sequenceDiagram
    actor Script as monitor_slurm_job.sh
    participant SLURM as SLURM System
    participant File as Output File

    Script->>SLURM: get_job_state(job_id)
    SLURM-->>Script: state (PENDING/RUNNING/COMPLETED/FAILED/UNKNOWN)
    Script->>Script: is_terminal_state(state)?
    alt Terminal
        Script->>File: check for output file
        alt output missing
            Script->>SLURM: scancel job_id (cleanup)
            Script-->>Script: exit (error)
        else output present
            Script->>File: tail/stream output (with burst cap)
        end
    else Non-terminal
        Script->>Script: sleep & retry (state-machine)
    end

    loop Streaming
        File->>Script: data (with read timeout)
        Script->>SLURM: periodic get_job_state()
        Script-->>Script: heartbeat log (state)
        alt terminal detected during stream
            Script->>Script: note transition, continue draining until stable or hard limit
        end
    end

    Script->>SLURM: scontrol show job job_id (get ExitCode)
    alt scontrol returns ExitCode
        SLURM-->>Script: ExitCode
    else scontrol fails
        Script->>SLURM: sacct fallback query
        SLURM-->>Script: Exit status
    end
    Script->>Script: evaluate ExitCode, set monitor_success, cleanup

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I watched the queue and learned each state,
I sniffed the logs and chased the fate,
If jobs go rogue I tidy the lawn,
I tail, I time, I hush the dawn,
A hop, a fix — the pipeline's neat! 🥕

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Title check (✅ Passed): The title 'Harden SLURM job monitor against transient squeue failures' accurately and specifically describes the main change—improving the monitor script's robustness when squeue fails transiently.
  • Description check (✅ Passed): The pull request description comprehensively covers the changes, motivation, and testing plan, providing clear context about the problem and solution despite deviating from the template structure.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Merge Conflict Detection (✅ Passed): No merge conflicts detected when merging into master.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


No actionable comments were generated in the recent review. 🎉



@codeant-ai codeant-ai bot added the size:M label (This PR changes 30-99 lines, ignoring generated files) on Feb 13, 2026
Comment on lines +58 to +66
# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

Suggestion: Modify the is_terminal_state function to handle all terminal job states, including those with a + suffix like COMPLETED+, by stripping the suffix before evaluation. This prevents the script from hanging on job completion. [possible issue, importance: 8]

Suggested change
# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
  local state_base=${1%+} # remove trailing '+'
  case "$state_base" in
    COMPLETED|FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

Comment on lines 162 to +165
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true
tail_pid=""

Suggestion: To prevent a race condition that could orphan the tail process, move the tail_pid="" line from the end of the script into the cleanup function. This ensures the process ID is cleared only after the process is killed. [possible issue, importance: 5]

Suggested change
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true
tail_pid=""
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true

    scancel "$job_id" 2>/dev/null || true
  fi
}
trap cleanup EXIT

Suggestion: Modify the trap command to explicitly catch INT and TERM signals in addition to EXIT, ensuring the cleanup function always runs, even when the script is interrupted. [general, importance: 6]

Suggested change
trap cleanup EXIT
trap cleanup EXIT INT TERM

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 1 file

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.


codeant-ai bot commented Feb 13, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Potential Over-cancellation
    The cleanup handler always calls scancel when monitor_success is not set. If the monitor exits between detecting a terminal state and finishing final exit-code checks (or if the trap runs for a perceived abnormal exit while the job is actually done), the script may attempt to cancel a job unnecessarily. Consider limiting cancellation to only when the job is actually still queued/running or when SLURM queries fail repeatedly.

  • State-parsing edge cases
    get_job_state() strips whitespace with tr -d ' ' and picks the first sacct line. Some SLURM state strings or suffixes (e.g., multi-word states, states with trailing characters) might be mis-parsed, and choosing the first sacct line may pick a step-level state instead of the batch step. Validate parsing across squeue/sacct outputs and normalize common variants (e.g., trailing '+', parenthetical annotations).

  • Tail PID reliability
    The script relies on $! after a process-substitution assignment (exec 3< <(stdbuf ... tail -f ...)) to get tail_pid. In some shells/process-substitution implementations $! may not reliably be the tail PID, making the subsequent kill "$tail_pid" a no-op and leaving background processes alive. Consider explicitly launching tail in background and connecting it via a FIFO or capturing PID robustly. A FIFO-based sketch is shown below.
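
A FIFO-based sketch of that suggestion (paths and variable names are illustrative, not the PR's code):

    fifo=$(mktemp -u)                                 # temporary path for a named pipe
    mkfifo "$fifo"
    stdbuf -oL tail -F "$output_file" > "$fifo" 2>&1 &
    tail_pid=$!                                       # $! now refers to tail itself
    exec 3< "$fifo"                                   # reader end on fd 3, as in the script
    rm -f "$fifo"                                     # safe to unlink once both ends are open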

  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')

Suggestion: With set -euo pipefail enabled, any non-zero exit from the squeue pipeline inside the command substitution will cause the entire script to exit immediately instead of falling back to sacct or returning UNKNOWN, so transient squeue failures will abort the monitor instead of being treated as non-fatal. [logic error]

Severity Level: Critical 🚨
- ❌ Monitor aborts on transient squeue failures, cancelling SLURM jobs.
- ⚠️ CI SLURM-based benchmark jobs become flaky under controller load.
- ⚠️ Sacct/UNKNOWN fallback never reached when squeue briefly fails.
Suggested change
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ' || true)
Steps of Reproduction ✅
1. Run the monitor script `.github/scripts/monitor_slurm_job.sh` (entry point used by CI,
see file header at `.github/scripts/monitor_slurm_job.sh:1`) with a valid SLURM job id:
`./monitor_slurm_job.sh <job_id> out.log`.

2. The script enables strict mode with `set -euo pipefail` at
`.github/scripts/monitor_slurm_job.sh:5` and later calls `get_job_state "$job_id"` in both
the "wait for job to start" loop (`state=$(get_job_state "$job_id")` around
`.github/scripts/monitor_slurm_job.sh:74`) and the main monitor loop
(`state=$(get_job_state "$job_id")` around `.github/scripts/monitor_slurm_job.sh:132`).

3. During one of these calls, cause a realistic transient SLURM failure so that `squeue -j
"$jid"` returns a non‑zero exit code (e.g., SLURM controller temporarily unreachable or
overloaded as described in the PR summary). This executes the line `state=$(squeue -j
"$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')` at
`.github/scripts/monitor_slurm_job.sh:40` inside `get_job_state`.

4. Because `set -euo pipefail` is active (`pipefail` makes the pipeline's exit status
non‑zero when `squeue` fails, and the variable assignment is a simple command), the
non‑zero status from the `state=$(...)` command at line 40 causes the entire script to
exit immediately instead of continuing to the sacct fallback (`if command -v sacct` at
line 47) or printing `UNKNOWN` at line 55. The `cleanup` trap at
`.github/scripts/monitor_slurm_job.sh:7-19` then runs, sees `monitor_success` still
unset/0 and a non‑empty `$job_id`, and executes `scancel "$job_id"`, cancelling the
still‑valid job and printing "Monitor exiting abnormally — cancelling SLURM job <job_id>".


  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')

Suggestion: Similarly to the squeue call, the sacct pipeline in get_job_state will, under set -euo pipefail, cause the script to exit on any non-zero sacct failure (e.g., slurmdbd hiccups) instead of returning UNKNOWN, defeating the intended robustness against transient SLURM issues. [logic error]

Severity Level: Critical 🚨
- ❌ Monitor exits on transient sacct failures, cancelling jobs.
- ⚠️ Historical/finished job state resolution becomes fragile and flaky.
- ⚠️ UNKNOWN state path never reached when sacct briefly unavailable.
Suggested change
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}' || true)
Steps of Reproduction ✅
1. Run `.github/scripts/monitor_slurm_job.sh <job_id> out.log` (script entry point at
`.github/scripts/monitor_slurm_job.sh:1`) under a SLURM setup where accounting (`sacct` /
slurmdbd) can experience transient failures.

2. Allow the monitored job to reach a state where `squeue -j "$job_id"` no longer returns
a state line (e.g., job completed and aged out of the active queue), so `get_job_state` at
`.github/scripts/monitor_slurm_job.sh:35-56` falls through its initial squeue query (lines
39–42) and enters the sacct fallback guarded by `if command -v sacct` at line 47.

3. During a realistic slurmdbd/sacct hiccup, have the sacct pipeline `sacct -j "$jid"
--format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}'` at
`.github/scripts/monitor_slurm_job.sh:48` return a non‑zero exit status (e.g., sacct
cannot contact the accounting daemon and exits with error).

4. With `set -euo pipefail` enabled at line 5, the non‑zero status from this sacct
pipeline (propagated by `pipefail`) causes the `state=$(...)` simple command at line 48 to
fail, triggering `set -e` and exiting the entire script immediately, before it can echo
`UNKNOWN` at line 55. The EXIT trap (`cleanup` at lines 7–17) runs, sees `monitor_success`
still 0 and a non‑empty `job_id`, and calls `scancel "$job_id"`, cancelling the job or at
least misreporting an internal monitor failure instead of returning UNKNOWN state as
designed.


codeant-ai bot commented Feb 13, 2026

CodeAnt AI finished reviewing your PR.

Copilot AI left a comment

Pull request overview

This PR hardens the SLURM job monitor script against transient squeue failures that previously caused false job abandonment. The key improvement is replacing simple exit code checks with robust state parsing that distinguishes between "job doesn't exist" and "squeue temporarily unavailable."

Changes:

  • Introduces get_job_state() function that parses actual SLURM state strings with sacct fallback for completed jobs
  • Adds is_terminal_state() helper to correctly identify when jobs are truly finished
  • Enhances cleanup handler to cancel orphaned SLURM jobs when the monitor exits abnormally
  • Fixes fractional read timeouts (0.1s → 1s) that caused bash segfaults on some systems
  • Improves diagnostics by including job state in heartbeat messages

Comment on lines +94 to +102
*)
# Terminal state — job finished without creating output
if is_terminal_state "$state"; then
echo "ERROR: Job $job_id reached terminal state ($state) without creating output file"
exit 1
fi
break
fi
# Exponential backoff
sleep_time=$((2 ** squeue_retries))
echo "Warning: squeue check failed, retrying in ${sleep_time}s..."
sleep $sleep_time
fi
# Unrecognized state, keep waiting
sleep 5
;;
Copilot AI Feb 13, 2026

In the case where the state is neither PENDING, CONFIGURING, RUNNING, COMPLETING, UNKNOWN, nor a terminal state recognized by is_terminal_state(), the script will sleep and continue waiting indefinitely. This handles new SLURM states that might be introduced in the future. However, consider logging these unrecognized non-terminal states at least once to help diagnose unexpected SLURM behavior (e.g., "Warning: Job in unrecognized state: REQUEUED, continuing to wait...").
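
A minimal sketch of that logging suggestion (assumes bash 4+ associative arrays; names are illustrative):

declare -A warned_states   # remember which non-terminal states we have already warned about

# Inside the wait loop, before sleeping on an unrecognized state:
if ! is_terminal_state "$state" && [ -z "${warned_states[$state]:-}" ]; then
  echo "Warning: Job $job_id in unrecognized state: $state, continuing to wait..."
  warned_states[$state]=1
fi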

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/scripts/monitor_slurm_job.sh (1)

194-198: ⚠️ Potential issue | 🔴 Critical

Bug: grep matches both ExitCode and DerivedExitCode, causing successful jobs to be reported as failures.

When scontrol show job output contains both ExitCode=X:Y and DerivedExitCode=X:Y fields (standard output), the pattern ExitCode=[0-9]+:[0-9]+ matches both, since "ExitCode=" is a substring of "DerivedExitCode=". This produces two lines in exit_code (e.g., "0:0\n0:0"), and the comparison at line 217 ([ "$exit_code" != "0:0" ]) evaluates to true because the multi-line string doesn't equal "0:0", causing the script to incorrectly exit with error code 1 for successful jobs.

Proposed fix: anchor the pattern with a word boundary
-  exit_code=$(echo "$scontrol_output" | grep -oE 'ExitCode=[0-9]+:[0-9]+' | cut -d= -f2 || echo "")
+  exit_code=$(echo "$scontrol_output" | grep -oE '\bExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2 || echo "")

The \b word boundary prevents DerivedExitCode from matching. head -n1 adds a safety net against multiple matches.

🧹 Nitpick comments (1)
.github/scripts/monitor_slurm_job.sh (1)

35-56: sacct may return sub-step states — consider filtering to the batch step.

sacct -j "$jid" returns one row per job step (job, batch, extern, etc.). head -n1 grabs the first row, which is typically the overall "job" record — but on some SLURM configurations the order isn't guaranteed or an extra blank/sub-step line may appear first. You could pin to the .batch step or filter with --parsable2 for more deterministic output.

That said, the UNKNOWN fallback makes this safe in practice.

Possible hardening
-    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
+    state=$(sacct -j "$jid.batch" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')

@@ -57,14 +111,13 @@ exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
tail_pid=$!

Suggestion: Replace the use of process substitution (<(...)) and $! with coproc to reliably capture the tail process PID. [possible issue, importance: 7]

New proposed code:
 # Start tail and redirect its output to file descriptor 3 for multiplexing
 # This allows us to stream tail output while also printing heartbeat messages
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAIL_PROC { stdbuf -oL -eL tail -f "$output_file" 2>&1; }
+exec 3<&"${TAIL_PROC[0]}"
+tail_pid="$TAIL_PROC_PID"

Comment on lines +35 to +56
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

Suggestion: Improve the sacct command in get_job_state to query the .batch step explicitly and use parsable output flags (-n -X -P) for more reliable job state detection. [possible issue, importance: 8]

Suggested change
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "${jid}.batch" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1)
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

@@ -57,14 +111,13 @@ exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
tail_pid=$!

Suggestion: Use coproc to reliably capture the tail process PID and use tail -F instead of tail -f to better handle log file rotation. [possible issue, importance: 7]

New proposed code:
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAILPROC { stdbuf -oL -eL tail -F "$output_file" 2>&1; }
+exec 3<&"${TAILPROC[0]}"
+tail_pid="${TAILPROC_PID}"

Comment on lines +35 to +56
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

Suggestion: Wrap squeue and sacct calls with a timeout command to prevent the script from hanging if the SLURM controller is unresponsive. [possible issue, importance: 8]

Suggested change
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

get_job_state() {
  local jid="$1"
  local state
  local tcmd=""
  if command -v timeout >/dev/null 2>&1; then
    tcmd="timeout 5"
  fi

  # Try squeue first (fast, works for active jobs)
  state=$($tcmd squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$($tcmd sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

Comment on lines +14 to +17
  if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
    scancel "$job_id" 2>/dev/null || true
  fi

Suggestion: Before calling scancel in the cleanup handler, check if the job is in a non-terminal state to avoid accidentally cancelling a different job that has reused the same ID. [security, importance: 9]

Suggested change
  if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
    scancel "$job_id" 2>/dev/null || true
  fi

  if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    state="$(get_job_state "$job_id")"
    if ! is_terminal_state "$state"; then
      echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
      scancel "$job_id" 2>/dev/null || true
    else
      echo "Monitor exiting abnormally — job $job_id already terminal ($state), not cancelling"
    fi
  fi

sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Feb 13, 2026
Replace squeue exit-code polling with get_job_state() that parses
the actual state string (squeue + sacct fallback). Never give up on
UNKNOWN state — CI timeout is the backstop. Cancel orphaned SLURM
jobs on abnormal monitor exit. Include job state in heartbeats.

Incorporates changes from PR MFlowCode#1140.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov bot commented Feb 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.03%. Comparing base (0ba6c02) to head (71d531a).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1140   +/-   ##
=======================================
  Coverage   44.03%   44.03%           
=======================================
  Files          70       70           
  Lines       20649    20649           
  Branches     2053     2054    +1     
=======================================
  Hits         9093     9093           
  Misses      10368    10368           
  Partials     1188     1188           

☔ View full report in Codecov by Sentry.


Labels

Review effort 2/5 · size:M (This PR changes 30-99 lines, ignoring generated files)
