
Harden SLURM job monitor against transient squeue failures#1140

Closed
sbryngelson wants to merge 3 commits into MFlowCode:master from sbryngelson:robust-slurm-monitor

Conversation


@sbryngelson sbryngelson commented Feb 13, 2026

User description

Summary

  • The monitor script used squeue -j $id &>/dev/null which only checks the exit code. When squeue itself fails transiently (SLURM daemon overloaded, network hiccup), this is indistinguishable from "job doesn't exist," causing the monitor to give up on jobs that are still PENDING — leaving orphaned SLURM jobs. A minimal before/after sketch follows this list.
  • Adds get_job_state() that parses squeue output for the actual state string, with sacct fallback for completed/historical jobs
  • Never gives up on UNKNOWN state (lets CI timeout be the backstop)
  • Cancels orphaned SLURM jobs on abnormal monitor exit via cleanup trap
  • Fixes fractional read timeouts that caused bash segfaults on some systems
  • Includes job state in heartbeat messages for better diagnostics
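
For reference, here is a before/after sketch of the check (variable names are illustrative; the actual get_job_state and is_terminal_state definitions appear in the review excerpts further down):

# Before: only the exit status is checked, so a transient squeue failure
# looks the same as the job no longer existing.
if ! squeue -j "$job_id" &>/dev/null; then
  echo "Job $job_id appears gone"   # wrong conclusion if squeue itself failed
fi

# After (sketch): act on the parsed state string instead.
state=$(get_job_state "$job_id")    # squeue %T, sacct fallback, else UNKNOWN
if is_terminal_state "$state"; then
  echo "Job $job_id reached terminal state: $state"
fi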

Test plan

  • Verify Phoenix benchmark jobs no longer falsely abort when squeue is transiently unavailable
  • Verify Frontier test/benchmark jobs detect terminal states (COMPLETED, FAILED, CANCELLED) correctly
  • Verify orphaned SLURM jobs are cancelled when CI runner is killed

🤖 Generated with Claude Code


CodeAnt-AI Description

Harden SLURM job monitor: robust state checks, cancel orphaned jobs, improved streaming reliability

What Changed

  • Monitor now determines job state by parsing squeue output with a sacct fallback so active, completed, and historical job states are detected reliably instead of relying on squeue exit codes alone.
  • The monitor no longer gives up when SLURM queries temporarily fail (reports UNKNOWN and keeps waiting); it only treats explicit terminal states as job completion and prints the state in periodic heartbeats.
  • If the monitor exits abnormally (for example the CI runner is killed), it cancels the associated SLURM job to avoid leaving orphaned jobs; it does not cancel on normal successful completion.
  • Increased read timeouts and stabilized tail/drain logic to prevent watcher crashes and to ensure complete streaming of job output before exiting.

Impact

✅ Fewer orphaned SLURM jobs left when monitors are killed
✅ Clearer job state shown in heartbeat logs during runs
✅ Fewer monitor crashes and more reliable streaming of job output

💡 Usage Guide

Checking Your Pull Request

Every time you open a pull request, our system automatically reviews it for security issues, infrastructure misconfigurations, and common code problems, so your changes are solid and won't cause trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

  • Bug Fixes

    • Improved job monitoring with robust state detection, terminal-state handling, and better handling of transient system unavailability
    • More reliable cleanup that cancels jobs on abnormal exit and stabilizes output handling
    • More accurate exit-code determination with a two-step retrieval and explicit failure reporting
  • Chores

    • Refined polling, streaming, heartbeat messaging, and drain limits for more efficient and stable monitoring
  • Documentation

    • Minor comment clarification for global parameters

The monitor script used `squeue -j $id &>/dev/null` which only checks
the exit code. When squeue itself fails transiently (SLURM daemon
overloaded, network hiccup), this is indistinguishable from "job
doesn't exist," causing the monitor to give up on jobs that are still
PENDING in the queue — leaving orphaned SLURM jobs.

Changes:
- Add get_job_state() that parses squeue output for the actual state
  string, with sacct fallback for completed/historical jobs
- Never give up on UNKNOWN state (let CI timeout be the backstop)
- Cancel orphaned SLURM jobs on abnormal monitor exit
- Fix fractional read timeouts that caused bash segfaults
- Include job state in heartbeat messages for better diagnostics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 13, 2026 02:10

codeant-ai bot commented Feb 13, 2026

CodeAnt AI is reviewing your PR.



@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Cleanup Logic

The EXIT trap cancels the SLURM job whenever monitor_success is not set to 1, which includes normal “job failed” paths and early exits after detecting a terminal state. This can result in calling scancel for already-terminal jobs (or for jobs that ended just before the script exits), which may cause confusing logs or unintended cancellation if there is any race between terminal detection and cleanup. Consider tightening the conditions (e.g., only cancel on signals/aborts) or recording an explicit “job is done” flag when a terminal state is observed.

# Cleanup handler to prevent orphaned tail processes and cancel orphaned jobs
cleanup() {
  if [ -n "${tail_pid:-}" ]; then
    kill "${tail_pid}" 2>/dev/null || true
  fi
  # Cancel the SLURM job if the monitor is exiting due to an error
  # (e.g., the CI runner is being killed). Don't cancel on success.
  if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
    scancel "$job_id" 2>/dev/null || true
  fi
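
A sketch of that tightening, assuming a hypothetical job_done flag that the monitor would set as soon as a terminal state is observed (not code from this PR):

cleanup() {
  if [ -n "${tail_pid:-}" ]; then
    kill "${tail_pid}" 2>/dev/null || true
  fi
  # Only cancel when the monitor failed AND the job was never seen to finish
  if [ "${monitor_success:-0}" -ne 1 ] && [ "${job_done:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    echo "Monitor exiting abnormally; cancelling SLURM job $job_id"
    scancel "$job_id" 2>/dev/null || true
  fi
}
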
State Detection

The sacct fallback uses the first line returned by sacct -j, but sacct commonly returns multiple records (job, batch step, extern, etc.) and the first record may not represent the overall job state reliably. Validate that the chosen line is the correct one across clusters, or constrain the query to the job allocation/primary record to avoid misclassifying terminal vs non-terminal states.

# Robustly check SLURM job state using squeue with sacct fallback.
# Returns the state string (PENDING, RUNNING, COMPLETED, FAILED, etc.)
# or "UNKNOWN" if both commands fail.
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
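
One way to pin the fallback query to the primary record, as suggested above: sacct's -X/--allocations flag restricts output to the job allocation row. A sketch (not the PR's code), with || true guarding against strict-mode pipeline failures:

  state=$(sacct -j "$jid" -n -X -o State 2>/dev/null | head -n1 | awk '{print $1}' || true)
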
Timing/Blocking

Switching read -t from a fractional timeout to -t 1 avoids platform issues, but it also changes monitoring responsiveness: the inner read loop can now block up to 1s per attempt even when there is no output, then the loop also sleeps 1s, potentially slowing state checks/heartbeats. Confirm this doesn’t cause unacceptable latency in detecting terminal states or in emitting heartbeat messages.

while true; do
  # Try to read from tail output (non-blocking via timeout)
  # Read multiple lines if available to avoid falling behind
  lines_read=0
  while IFS= read -r -t 1 line <&3 2>/dev/null; do
    echo "$line"
    lines_read=$((lines_read + 1))
    last_heartbeat=$(date +%s)
    # Limit burst reads to avoid starving the status check
    if [ $lines_read -ge 100 ]; then
      break
    fi
  done

  # Check job status
  current_time=$(date +%s)
  state=$(get_job_state "$job_id")

  if is_terminal_state "$state"; then
    echo "[$(date +%H:%M:%S)] Job $job_id reached terminal state: $state"
    break
  else
    # Print heartbeat if no output for 60 seconds
    if [ $((current_time - last_heartbeat)) -ge 60 ]; then
      echo "[$(date +%H:%M:%S)] Job $job_id state=$state (no new output for 60s)..."
      last_heartbeat=$current_time
    fi
  fi

  # Sleep briefly between status checks
  sleep 1
done

# Drain any remaining output from tail after job completes
echo "Draining remaining output..."
drain_count=0
while IFS= read -r -t 1 line <&3 2>/dev/null; do
  echo "$line"
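
If that latency matters in practice, one option (illustrative only, not part of this PR) is to skip the fixed sleep whenever the burst read already consumed output, so heartbeats and terminal-state detection are not delayed by an extra second on busy logs:

  # Only back off when the log was idle during this iteration
  if [ "${lines_read:-0}" -eq 0 ]; then
    sleep 1
  fi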


coderabbitai bot commented Feb 13, 2026

📝 Walkthrough


Reworks SLURM job monitoring script with improved state tracking via new utility functions (get_job_state, is_terminal_state), introduces abnormal-exit cleanup handler, replaces ad-hoc polling with a state-machine approach, enhances streaming/tailing logic and heartbeat reporting, and adds a two-step exit-code retrieval (scontrol then sacct). A minor comment change in a Fortran file is included.
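
A rough sketch of the two-step ExitCode retrieval described here, assuming conventional scontrol/sacct output formats (function and variable names are illustrative, not the script's actual code):

get_exit_code() {
  local jid="$1" code
  # Step 1: scontrol knows about recently finished jobs
  code=$(scontrol show job "$jid" 2>/dev/null | grep -oE '\bExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2 || true)
  # Step 2: fall back to accounting for older jobs
  if [ -z "$code" ] && command -v sacct >/dev/null 2>&1; then
    code=$(sacct -j "$jid" -n -X -o ExitCode 2>/dev/null | head -n1 | tr -d ' ' || true)
  fi
  echo "${code:-UNKNOWN}"
}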

Changes

  • SLURM Job Monitoring (.github/scripts/monitor_slurm_job.sh): Adds get_job_state and is_terminal_state; replaces squeue retry polling with state-machine loops; adds abnormal-exit cleanup (conditional scancel); improves tailing (timeout, burst cap, hard drain limit) and heartbeat logging; implements two-step ExitCode retrieval (scontrol → sacct) and treats non-zero exit codes as failures.
  • Fortran comment tweak (src/pre_process/m_check_patches.fpp): Non-functional change: updated public use statement comment from "Global parameters for the code" to "Global parameters".

Sequence Diagram(s)

sequenceDiagram
    actor Script as monitor_slurm_job.sh
    participant SLURM as SLURM System
    participant File as Output File

    Script->>SLURM: get_job_state(job_id)
    SLURM-->>Script: state (PENDING/RUNNING/COMPLETED/FAILED/UNKNOWN)
    Script->>Script: is_terminal_state(state)?
    alt Terminal
        Script->>File: check for output file
        alt output missing
            Script->>SLURM: scancel job_id (cleanup)
            Script-->>Script: exit (error)
        else output present
            Script->>File: tail/stream output (with burst cap)
        end
    else Non-terminal
        Script->>Script: sleep & retry (state-machine)
    end

    loop Streaming
        File->>Script: data (with read timeout)
        Script->>SLURM: periodic get_job_state()
        Script-->>Script: heartbeat log (state)
        alt terminal detected during stream
            Script->>Script: note transition, continue draining until stable or hard limit
        end
    end

    Script->>SLURM: scontrol show job job_id (get ExitCode)
    alt scontrol returns ExitCode
        SLURM-->>Script: ExitCode
    else scontrol fails
        Script->>SLURM: sacct fallback query
        SLURM-->>Script: Exit status
    end
    Script->>Script: evaluate ExitCode, set monitor_success, cleanup

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I watched the queue and learned each state,
I sniffed the logs and chased the fate,
If jobs go rogue I tidy the lawn,
I tail, I time, I hush the dawn,
A hop, a fix — the pipeline's neat! 🥕

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Title check (✅ Passed): The title 'Harden SLURM job monitor against transient squeue failures' accurately and specifically describes the main change—improving the monitor script's robustness when squeue fails transiently.
  • Description check (✅ Passed): The pull request description comprehensively covers the changes, motivation, and testing plan, providing clear context about the problem and solution despite deviating from the template structure.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Merge Conflict Detection (✅ Passed): No merge conflicts detected when merging into master.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


No actionable comments were generated in the recent review. 🎉



@codeant-ai codeant-ai bot added the size:M label (This PR changes 30-99 lines, ignoring generated files) on Feb 13, 2026
Comment on lines +58 to +66
# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

Suggestion: Modify the is_terminal_state function to handle all terminal job states, including those with a + suffix like COMPLETED+, by stripping the suffix before evaluation. This prevents the script from hanging on job completion. [possible issue, importance: 8]

Suggested change
# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
  local state_base=${1%+} # remove trailing '+'
  case "$state_base" in
    COMPLETED|FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

Comment on lines 162 to +165
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true
tail_pid=""

Suggestion: To prevent a race condition that could orphan the tail process, move the tail_pid="" line from the end of the script into the cleanup function. This ensures the process ID is cleared only after the process is killed. [possible issue, importance: 5]

Suggested change
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true
tail_pid=""
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true

    scancel "$job_id" 2>/dev/null || true
  fi
}
trap cleanup EXIT

Suggestion: Modify the trap command to explicitly catch INT and TERM signals in addition to EXIT, ensuring the cleanup function always runs, even when the script is interrupted. [general, importance: 6]

Suggested change
trap cleanup EXIT
trap cleanup EXIT INT TERM

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 1 file

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.


codeant-ai bot commented Feb 13, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Potential Over-cancellation
    The cleanup handler always calls scancel when monitor_success is not set. If the monitor exits between detecting a terminal state and finishing final exit-code checks (or if the trap runs for a perceived abnormal exit while the job is actually done), the script may attempt to cancel a job unnecessarily. Consider limiting cancellation to only when the job is actually still queued/running or when SLURM queries fail repeatedly.

  • State-parsing edge cases
    get_job_state() strips whitespace with tr -d ' ' and picks the first sacct line. Some SLURM state strings or suffixes (e.g., multi-word states, states with trailing characters) might be mis-parsed, and choosing the first sacct line may pick a step-level state instead of the batch step. Validate parsing across squeue/sacct outputs and normalize common variants (e.g., trailing '+', parenthetical annotations).

  • Tail PID reliability
    The script relies on $! after a process-substitution assignment (exec 3< <(stdbuf ... tail -f ...)) to get tail_pid. In some shells/process-substitution implementations $! may not reliably be the tail PID, making the subsequent kill "$tail_pid" a no-op and leaving background processes alive. Consider explicitly launching tail in background and connecting it via a FIFO or capturing PID robustly. A FIFO-based sketch is shown below.
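
A FIFO-based sketch of that suggestion (paths and variable names are illustrative, not the PR's code):

    fifo=$(mktemp -u)                                 # temporary path for a named pipe
    mkfifo "$fifo"
    stdbuf -oL tail -F "$output_file" > "$fifo" 2>&1 &
    tail_pid=$!                                       # $! now refers to tail itself
    exec 3< "$fifo"                                   # reader end on fd 3, as in the script
    rm -f "$fifo"                                     # safe to unlink once both ends are open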

  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')

Suggestion: With set -euo pipefail enabled, any non-zero exit from the squeue pipeline inside the command substitution will cause the entire script to exit immediately instead of falling back to sacct or returning UNKNOWN, so transient squeue failures will abort the monitor instead of being treated as non-fatal. [logic error]

Severity Level: Critical 🚨
- ❌ Monitor aborts on transient squeue failures, cancelling SLURM jobs.
- ⚠️ CI SLURM-based benchmark jobs become flaky under controller load.
- ⚠️ Sacct/UNKNOWN fallback never reached when squeue briefly fails.
Suggested change
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ' || true)
Steps of Reproduction ✅
1. Run the monitor script `.github/scripts/monitor_slurm_job.sh` (entry point used by CI,
see file header at `.github/scripts/monitor_slurm_job.sh:1`) with a valid SLURM job id:
`./monitor_slurm_job.sh <job_id> out.log`.

2. The script enables strict mode with `set -euo pipefail` at
`.github/scripts/monitor_slurm_job.sh:5` and later calls `get_job_state "$job_id"` in both
the "wait for job to start" loop (`state=$(get_job_state "$job_id")` around
`.github/scripts/monitor_slurm_job.sh:74`) and the main monitor loop
(`state=$(get_job_state "$job_id")` around `.github/scripts/monitor_slurm_job.sh:132`).

3. During one of these calls, cause a realistic transient SLURM failure so that `squeue -j
"$jid"` returns a non‑zero exit code (e.g., SLURM controller temporarily unreachable or
overloaded as described in the PR summary). This executes the line `state=$(squeue -j
"$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')` at
`.github/scripts/monitor_slurm_job.sh:40` inside `get_job_state`.

4. Because `set -euo pipefail` is active (`pipefail` makes the pipeline's exit status
non‑zero when `squeue` fails, and the variable assignment is a simple command), the
non‑zero status from the `state=$(...)` command at line 40 causes the entire script to
exit immediately instead of continuing to the sacct fallback (`if command -v sacct` at
line 47) or printing `UNKNOWN` at line 55. The `cleanup` trap at
`.github/scripts/monitor_slurm_job.sh:7-19` then runs, sees `monitor_success` still
unset/0 and a non‑empty `$job_id`, and executes `scancel "$job_id"`, cancelling the
still‑valid job and printing "Monitor exiting abnormally — cancelling SLURM job <job_id>".


  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')

Suggestion: Similarly to the squeue call, the sacct pipeline in get_job_state will, under set -euo pipefail, cause the script to exit on any non-zero sacct failure (e.g., slurmdbd hiccups) instead of returning UNKNOWN, defeating the intended robustness against transient SLURM issues. [logic error]

Severity Level: Critical 🚨
- ❌ Monitor exits on transient sacct failures, cancelling jobs.
- ⚠️ Historical/finished job state resolution becomes fragile and flaky.
- ⚠️ UNKNOWN state path never reached when sacct briefly unavailable.
Suggested change
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}' || true)
Steps of Reproduction ✅
1. Run `.github/scripts/monitor_slurm_job.sh <job_id> out.log` (script entry point at
`.github/scripts/monitor_slurm_job.sh:1`) under a SLURM setup where accounting (`sacct` /
slurmdbd) can experience transient failures.

2. Allow the monitored job to reach a state where `squeue -j "$job_id"` no longer returns
a state line (e.g., job completed and aged out of the active queue), so `get_job_state` at
`.github/scripts/monitor_slurm_job.sh:35-56` falls through its initial squeue query (lines
39–42) and enters the sacct fallback guarded by `if command -v sacct` at line 47.

3. During a realistic slurmdbd/sacct hiccup, have the sacct pipeline `sacct -j "$jid"
--format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}'` at
`.github/scripts/monitor_slurm_job.sh:48` return a non‑zero exit status (e.g., sacct
cannot contact the accounting daemon and exits with error).

4. With `set -euo pipefail` enabled at line 5, the non‑zero status from this sacct
pipeline (propagated by `pipefail`) causes the `state=$(...)` simple command at line 48 to
fail, triggering `set -e` and exiting the entire script immediately, before it can echo
`UNKNOWN` at line 55. The EXIT trap (`cleanup` at lines 7–17) runs, sees `monitor_success`
still 0 and a non‑empty `job_id`, and calls `scancel "$job_id"`, cancelling the job or at
least misreporting an internal monitor failure instead of returning UNKNOWN state as
designed.


codeant-ai bot commented Feb 13, 2026

CodeAnt AI finished reviewing your PR.

Copilot AI left a comment

Pull request overview

This PR hardens the SLURM job monitor script against transient squeue failures that previously caused false job abandonment. The key improvement is replacing simple exit code checks with robust state parsing that distinguishes between "job doesn't exist" and "squeue temporarily unavailable."

Changes:

  • Introduces get_job_state() function that parses actual SLURM state strings with sacct fallback for completed jobs
  • Adds is_terminal_state() helper to correctly identify when jobs are truly finished
  • Enhances cleanup handler to cancel orphaned SLURM jobs when the monitor exits abnormally
  • Fixes fractional read timeouts (0.1s → 1s) that caused bash segfaults on some systems
  • Improves diagnostics by including job state in heartbeat messages

Comment on lines +94 to +102
*)
# Terminal state — job finished without creating output
if is_terminal_state "$state"; then
echo "ERROR: Job $job_id reached terminal state ($state) without creating output file"
exit 1
fi
break
fi
# Exponential backoff
sleep_time=$((2 ** squeue_retries))
echo "Warning: squeue check failed, retrying in ${sleep_time}s..."
sleep $sleep_time
fi
# Unrecognized state, keep waiting
sleep 5
;;
Copilot AI Feb 13, 2026

In the case where the state is neither PENDING, CONFIGURING, RUNNING, COMPLETING, UNKNOWN, nor a terminal state recognized by is_terminal_state(), the script will sleep and continue waiting indefinitely. This handles new SLURM states that might be introduced in the future. However, consider logging these unrecognized non-terminal states at least once to help diagnose unexpected SLURM behavior (e.g., "Warning: Job in unrecognized state: REQUEUED, continuing to wait...").
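
A minimal sketch of that logging suggestion (assumes bash 4+ associative arrays; names are illustrative):

declare -A warned_states   # remember which non-terminal states we have already warned about

# Inside the wait loop, before sleeping on an unrecognized state:
if ! is_terminal_state "$state" && [ -z "${warned_states[$state]:-}" ]; then
  echo "Warning: Job $job_id in unrecognized state: $state, continuing to wait..."
  warned_states[$state]=1
fi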

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/scripts/monitor_slurm_job.sh (1)

194-198: ⚠️ Potential issue | 🔴 Critical

Bug: grep matches both ExitCode and DerivedExitCode, causing successful jobs to be reported as failures.

When scontrol show job output contains both ExitCode=X:Y and DerivedExitCode=X:Y fields (standard output), the pattern ExitCode=[0-9]+:[0-9]+ matches both, since "ExitCode=" is a substring of "DerivedExitCode=". This produces two lines in exit_code (e.g., "0:0\n0:0"), and the comparison at line 217 ([ "$exit_code" != "0:0" ]) evaluates to true because the multi-line string doesn't equal "0:0", causing the script to incorrectly exit with error code 1 for successful jobs.

Proposed fix: anchor the pattern with a word boundary
-  exit_code=$(echo "$scontrol_output" | grep -oE 'ExitCode=[0-9]+:[0-9]+' | cut -d= -f2 || echo "")
+  exit_code=$(echo "$scontrol_output" | grep -oE '\bExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2 || echo "")

The \b word boundary prevents DerivedExitCode from matching. head -n1 adds a safety net against multiple matches.

🧹 Nitpick comments (1)
.github/scripts/monitor_slurm_job.sh (1)

35-56: sacct may return sub-step states — consider filtering to the batch step.

sacct -j "$jid" returns one row per job step (job, batch, extern, etc.). head -n1 grabs the first row, which is typically the overall "job" record — but on some SLURM configurations the order isn't guaranteed or an extra blank/sub-step line may appear first. You could pin to the .batch step or filter with --parsable2 for more deterministic output.

That said, the UNKNOWN fallback makes this safe in practice.

Possible hardening
-    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
+    state=$(sacct -j "$jid.batch" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')

@@ -57,14 +111,13 @@ exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
tail_pid=$!

Suggestion: Replace the use of process substitution (<(...)) and $! with coproc to reliably capture the tail process PID. [possible issue, importance: 7]

New proposed code:
 # Start tail and redirect its output to file descriptor 3 for multiplexing
 # This allows us to stream tail output while also printing heartbeat messages
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAIL_PROC { stdbuf -oL -eL tail -f "$output_file" 2>&1; }
+exec 3<&"${TAIL_PROC[0]}"
+tail_pid="$TAIL_PROC_PID"

Comment on lines +35 to +56
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

Suggestion: Improve the sacct command in get_job_state to query the .batch step explicitly and use parsable output flags (-n -X -P) for more reliable job state detection. [possible issue, importance: 8]

Suggested change
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "${jid}.batch" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1)
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

@@ -57,14 +111,13 @@ exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
tail_pid=$!

Suggestion: Use coproc to reliably capture the tail process PID and use tail -F instead of tail -f to better handle log file rotation. [possible issue, importance: 7]

New proposed code:
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAILPROC { stdbuf -oL -eL tail -F "$output_file" 2>&1; }
+exec 3<&"${TAILPROC[0]}"
+tail_pid="${TAILPROC_PID}"

Comment on lines +35 to +56
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

Suggestion: Wrap squeue and sacct calls with a timeout command to prevent the script from hanging if the SLURM controller is unresponsive. [possible issue, importance: 8]

Suggested change
get_job_state() {
  local jid="$1"
  local state

  # Try squeue first (fast, works for active jobs)
  state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

get_job_state() {
  local jid="$1"
  local state
  local tcmd=""
  if command -v timeout >/dev/null 2>&1; then
    tcmd="timeout 5"
  fi

  # Try squeue first (fast, works for active jobs)
  state=$($tcmd squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
  if [ -n "$state" ]; then
    echo "$state"
    return
  fi

  # Fallback to sacct (works for completed/historical jobs)
  if command -v sacct >/dev/null 2>&1; then
    state=$($tcmd sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
    if [ -n "$state" ]; then
      echo "$state"
      return
    fi
  fi

  echo "UNKNOWN"
}

Comment on lines +14 to +17
  if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
    scancel "$job_id" 2>/dev/null || true
  fi

Suggestion: Before calling scancel in the cleanup handler, check if the job is in a non-terminal state to avoid accidentally cancelling a different job that has reused the same ID. [security, importance: 9]

Suggested change
  if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
    scancel "$job_id" 2>/dev/null || true
  fi

  if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
    state="$(get_job_state "$job_id")"
    if ! is_terminal_state "$state"; then
      echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
      scancel "$job_id" 2>/dev/null || true
    else
      echo "Monitor exiting abnormally — job $job_id already terminal ($state), not cancelling"
    fi
  fi

sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Feb 13, 2026
Replace squeue exit-code polling with get_job_state() that parses
the actual state string (squeue + sacct fallback). Never give up on
UNKNOWN state — CI timeout is the backstop. Cancel orphaned SLURM
jobs on abnormal monitor exit. Include job state in heartbeats.

Incorporates changes from PR MFlowCode#1140.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov bot commented Feb 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.03%. Comparing base (0ba6c02) to head (71d531a).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1140   +/-   ##
=======================================
  Coverage   44.03%   44.03%           
=======================================
  Files          70       70           
  Lines       20649    20649           
  Branches     2053     2054    +1     
=======================================
  Hits         9093     9093           
  Misses      10368    10368           
  Partials     1188     1188           

☔ View full report in Codecov by Sentry.


Labels

Review effort 2/5 · size:M (This PR changes 30-99 lines, ignoring generated files)
