142 changes: 87 additions & 55 deletions .github/scripts/monitor_slurm_job.sh
@@ -4,11 +4,17 @@

set -euo pipefail

# Cleanup handler to prevent orphaned tail processes
# Cleanup handler to prevent orphaned tail processes and cancel orphaned jobs
cleanup() {
if [ -n "${tail_pid:-}" ]; then
kill "${tail_pid}" 2>/dev/null || true
fi
# Cancel the SLURM job if the monitor is exiting due to an error
# (e.g., the CI runner is being killed). Don't cancel on success.
if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
scancel "$job_id" 2>/dev/null || true
fi
Comment on lines +14 to +17
Suggestion: Before calling scancel in the cleanup handler, check if the job is in a non-terminal state to avoid accidentally cancelling a different job that has reused the same ID. [security, importance: 9]

Suggested change
if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
scancel "$job_id" 2>/dev/null || true
fi
if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then
state="$(get_job_state "$job_id")"
if ! is_terminal_state "$state"; then
echo "Monitor exiting abnormally — cancelling SLURM job $job_id"
scancel "$job_id" 2>/dev/null || true
else
echo "Monitor exiting abnormally — job $job_id already terminal ($state), not cancelling"
fi
fi

}
trap cleanup EXIT
Suggestion: Modify the trap command to explicitly catch INT and TERM signals in addition to EXIT, ensuring the cleanup function always runs, even when the script is interrupted. [general, importance: 6]

Suggested change
trap cleanup EXIT
trap cleanup EXIT INT TERM


@@ -23,30 +29,78 @@ output_file="$2"
echo "Submitted batch job $job_id"
echo "Monitoring output file: $output_file"

# Wait for file to appear with retry logic for transient squeue failures
# Robustly check SLURM job state using squeue with sacct fallback.
# Returns the state string (PENDING, RUNNING, COMPLETED, FAILED, etc.)
# or "UNKNOWN" if both commands fail.
get_job_state() {
local jid="$1"
local state

# Try squeue first (fast, works for active jobs)
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
Suggestion: With set -euo pipefail enabled, any non-zero exit from the squeue pipeline inside the command substitution will cause the entire script to exit immediately instead of falling back to sacct or returning UNKNOWN, so transient squeue failures will abort the monitor instead of being treated as non-fatal. [logic error]

Severity Level: Critical 🚨
- ❌ Monitor aborts on transient squeue failures, cancelling SLURM jobs.
- ⚠️ CI SLURM-based benchmark jobs become flaky under controller load.
- ⚠️ Sacct/UNKNOWN fallback never reached when squeue briefly fails.
Suggested change
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ' || true)
Steps of Reproduction ✅
1. Run the monitor script `.github/scripts/monitor_slurm_job.sh` (entry point used by CI,
see file header at `.github/scripts/monitor_slurm_job.sh:1`) with a valid SLURM job id:
`./monitor_slurm_job.sh <job_id> out.log`.

2. The script enables strict mode with `set -euo pipefail` at
`.github/scripts/monitor_slurm_job.sh:5` and later calls `get_job_state "$job_id"` in both
the "wait for job to start" loop (`state=$(get_job_state "$job_id")` around
`.github/scripts/monitor_slurm_job.sh:74`) and the main monitor loop
(`state=$(get_job_state "$job_id")` around `.github/scripts/monitor_slurm_job.sh:132`).

3. During one of these calls, cause a realistic transient SLURM failure so that `squeue -j
"$jid"` returns a non‑zero exit code (e.g., SLURM controller temporarily unreachable or
overloaded as described in the PR summary). This executes the line `state=$(squeue -j
"$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')` at
`.github/scripts/monitor_slurm_job.sh:40` inside `get_job_state`.

4. Because `set -euo pipefail` is active (`pipefail` makes the pipeline's exit status
non‑zero when `squeue` fails, and the variable assignment is a simple command), the
non‑zero status from the `state=$(...)` command at line 40 causes the entire script to
exit immediately instead of continuing to the sacct fallback (`if command -v sacct` at
line 47) or printing `UNKNOWN` at line 55. The `cleanup` trap at
`.github/scripts/monitor_slurm_job.sh:7-19` then runs, sees `monitor_success` still
unset/0 and a non‑empty `$job_id`, and executes `scancel "$job_id"`, cancelling the
still‑valid job and printing "Monitor exiting abnormally — cancelling SLURM job <job_id>".
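A minimal standalone sketch (not part of this diff) that reproduces the failure mode and shows why the `|| true` guard helps; here `false` stands in for a transiently failing squeue/sacct call:

#!/usr/bin/env bash
set -euo pipefail

# Without the guard, the failing pipeline makes the command substitution
# return non-zero, and set -e aborts the script on this assignment before
# any fallback code can run:
#   state=$(false | head -n1 | tr -d ' ')

# With the guard, the substitution always succeeds and an empty result can
# be handled explicitly (fallback query, UNKNOWN, etc.):
state=$(false | head -n1 | tr -d ' ' || true)
if [ -z "$state" ]; then
  echo "state query failed, falling back"
fi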

if [ -n "$state" ]; then
echo "$state"
return
fi

# Fallback to sacct (works for completed/historical jobs)
if command -v sacct >/dev/null 2>&1; then
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
Suggestion: Similarly to the squeue call, the sacct pipeline in get_job_state will, under set -euo pipefail, cause the script to exit on any non-zero sacct failure (e.g., slurmdbd hiccups) instead of returning UNKNOWN, defeating the intended robustness against transient SLURM issues. [logic error]

Severity Level: Critical 🚨
- ❌ Monitor exits on transient sacct failures, cancelling jobs.
- ⚠️ Historical/finished job state resolution becomes fragile and flaky.
- ⚠️ UNKNOWN state path never reached when sacct briefly unavailable.
Suggested change
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}' || true)
Steps of Reproduction ✅
1. Run `.github/scripts/monitor_slurm_job.sh <job_id> out.log` (script entry point at
`.github/scripts/monitor_slurm_job.sh:1`) under a SLURM setup where accounting (`sacct` /
slurmdbd) can experience transient failures.

2. Allow the monitored job to reach a state where `squeue -j "$job_id"` no longer returns
a state line (e.g., job completed and aged out of the active queue), so `get_job_state` at
`.github/scripts/monitor_slurm_job.sh:35-56` falls through its initial squeue query (lines
39–42) and enters the sacct fallback guarded by `if command -v sacct` at line 47.

3. During a realistic slurmdbd/sacct hiccup, have the sacct pipeline `sacct -j "$jid"
--format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}'` at
`.github/scripts/monitor_slurm_job.sh:48` return a non‑zero exit status (e.g., sacct
cannot contact the accounting daemon and exits with error).

4. With `set -euo pipefail` enabled at line 5, the non‑zero status from this sacct
pipeline (propagated by `pipefail`) causes the `state=$(...)` simple command at line 48 to
fail, triggering `set -e` and exiting the entire script immediately, before it can echo
`UNKNOWN` at line 55. The EXIT trap (`cleanup` at lines 7–17) runs, sees `monitor_success`
still 0 and a non‑empty `job_id`, and calls `scancel "$job_id"`, cancelling the job or at
least misreporting an internal monitor failure instead of returning UNKNOWN state as
designed.

if [ -n "$state" ]; then
echo "$state"
return
fi
fi

echo "UNKNOWN"
}
Comment on lines +35 to +56
Suggestion: Improve the sacct command in get_job_state to query the .batch step explicitly and use parsable output flags (-n -X -P) for more reliable job state detection. [possible issue, importance: 8]

Suggested change
get_job_state() {
local jid="$1"
local state
# Try squeue first (fast, works for active jobs)
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
if [ -n "$state" ]; then
echo "$state"
return
fi
# Fallback to sacct (works for completed/historical jobs)
if command -v sacct >/dev/null 2>&1; then
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
if [ -n "$state" ]; then
echo "$state"
return
fi
fi
echo "UNKNOWN"
}
get_job_state() {
local jid="$1"
local state
# Try squeue first (fast, works for active jobs)
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
if [ -n "$state" ]; then
echo "$state"
return
fi
# Fallback to sacct (works for completed/historical jobs)
if command -v sacct >/dev/null 2>&1; then
state=$(sacct -j "${jid}.batch" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1)
if [ -n "$state" ]; then
echo "$state"
return
fi
fi
echo "UNKNOWN"
}

Comment on lines +35 to +56
Suggestion: Wrap squeue and sacct calls with a timeout command to prevent the script from hanging if the SLURM controller is unresponsive. [possible issue, importance: 8]

Suggested change
get_job_state() {
local jid="$1"
local state
# Try squeue first (fast, works for active jobs)
state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
if [ -n "$state" ]; then
echo "$state"
return
fi
# Fallback to sacct (works for completed/historical jobs)
if command -v sacct >/dev/null 2>&1; then
state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
if [ -n "$state" ]; then
echo "$state"
return
fi
fi
echo "UNKNOWN"
}
get_job_state() {
local jid="$1"
local state
local tcmd=""
if command -v timeout >/dev/null 2>&1; then
tcmd="timeout 5"
fi
# Try squeue first (fast, works for active jobs)
state=$($tcmd squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
if [ -n "$state" ]; then
echo "$state"
return
fi
# Fallback to sacct (works for completed/historical jobs)
if command -v sacct >/dev/null 2>&1; then
state=$($tcmd sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
if [ -n "$state" ]; then
echo "$state"
return
fi
fi
echo "UNKNOWN"
}


# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
case "$1" in
COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
return 0 ;;
*)
return 1 ;;
esac
}
Comment on lines +58 to +66
Suggestion: Modify the is_terminal_state function to handle all terminal job states, including those with a + suffix like COMPLETED+, by stripping the suffix before evaluation. This prevents the script from hanging on job completion. [possible issue, importance: 8]

Suggested change
# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
case "$1" in
COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
return 0 ;;
*)
return 1 ;;
esac
}
# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
local state_base=${1%+} # remove trailing '+'
case "$state_base" in
COMPLETED|FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
return 0 ;;
*)
return 1 ;;
esac
}


# Wait for file to appear, using robust state checking.
# Never give up due to transient squeue/sacct failures — the CI job timeout
# is the ultimate backstop.
echo "Waiting for job to start..."
squeue_retries=0
max_squeue_retries=5
unknown_count=0
while [ ! -f "$output_file" ]; do
# Check if job is still queued/running
if squeue -j "$job_id" &>/dev/null; then
squeue_retries=0 # Reset on success
sleep 5
else
squeue_retries=$((squeue_retries + 1))
if [ $squeue_retries -ge $max_squeue_retries ]; then
# Job not in queue and output file doesn't exist
if [ ! -f "$output_file" ]; then
echo "ERROR: Job $job_id not in queue and output file not created"
state=$(get_job_state "$job_id")

case "$state" in
PENDING|CONFIGURING)
unknown_count=0
sleep 5
;;
RUNNING|COMPLETING)
unknown_count=0
# Job is running but output file not yet visible (NFS delay)
sleep 2
;;
UNKNOWN)
unknown_count=$((unknown_count + 1))
# Only print warning periodically to avoid log spam
if [ $((unknown_count % 12)) -eq 1 ]; then
echo "Warning: Could not query job $job_id state (SLURM may be temporarily unavailable)..."
fi
sleep 5
;;
*)
# Terminal state — job finished without creating output
if is_terminal_state "$state"; then
echo "ERROR: Job $job_id reached terminal state ($state) without creating output file"
exit 1
fi
break
fi
# Exponential backoff
sleep_time=$((2 ** squeue_retries))
echo "Warning: squeue check failed, retrying in ${sleep_time}s..."
sleep $sleep_time
fi
# Unrecognized state, keep waiting
sleep 5
;;
Comment on lines +94 to +102
Copilot AI Feb 13, 2026

In the case where the state is neither PENDING, CONFIGURING, RUNNING, COMPLETING, UNKNOWN, nor a terminal state recognized by is_terminal_state(), the script will sleep and continue waiting indefinitely. This handles new SLURM states that might be introduced in the future. However, consider logging these unrecognized non-terminal states at least once to help diagnose unexpected SLURM behavior (e.g., "Warning: Job in unrecognized state: REQUEUED, continuing to wait...").
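A hedged sketch of how the existing `*)` branch could log such states once; the `unrecognized_logged` flag name is illustrative, not part of the PR:

*)
    if is_terminal_state "$state"; then
        echo "ERROR: Job $job_id reached terminal state ($state) without creating output file"
        exit 1
    fi
    # Log unrecognized non-terminal states (e.g. REQUEUED) once to aid
    # debugging, then keep waiting as before.
    if [ "${unrecognized_logged:-0}" -ne 1 ]; then
        echo "Warning: Job $job_id in unrecognized state: $state, continuing to wait..."
        unrecognized_logged=1
    fi
    sleep 5
    ;;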

esac
done

echo "=== Streaming output for job $job_id ==="
@@ -57,14 +111,13 @@ exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
tail_pid=$!
Suggestion: Replace the use of process substitution (<(...)) and $! with coproc to reliably capture the tail process PID. [possible issue, importance: 7]

New proposed code:
 # Start tail and redirect its output to file descriptor 3 for multiplexing
 # This allows us to stream tail output while also printing heartbeat messages
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAIL_PROC { stdbuf -oL -eL tail -f "$output_file" 2>&1; }
+exec 3<&"${TAIL_PROC[0]}"
+tail_pid="$TAIL_PROC_PID"

Suggestion: Use coproc to reliably capture the tail process PID and use tail -F instead of tail -f to better handle log file rotation. [possible issue, importance: 7]

New proposed code:
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAILPROC { stdbuf -oL -eL tail -F "$output_file" 2>&1; }
+exec 3<&"${TAILPROC[0]}"
+tail_pid="${TAILPROC_PID}"


# Monitor job status and stream output simultaneously
squeue_failures=0
last_heartbeat=$(date +%s)

while true; do
# Try to read from tail output (non-blocking via timeout)
# Read multiple lines if available to avoid falling behind
lines_read=0
while IFS= read -r -t 0.1 line <&3 2>/dev/null; do
while IFS= read -r -t 1 line <&3 2>/dev/null; do
echo "$line"
lines_read=$((lines_read + 1))
last_heartbeat=$(date +%s)
@@ -73,49 +126,30 @@ while true; do
break
fi
done

# Check job status
current_time=$(date +%s)
if ! squeue -j "$job_id" &>/dev/null; then
squeue_failures=$((squeue_failures + 1))
# Check if job actually completed using sacct (if available)
if [ $squeue_failures -ge 3 ]; then
if command -v sacct >/dev/null 2>&1; then
state=$(sacct -j "$job_id" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
# Consider job done only if it reached a terminal state
case "$state" in
COMPLETED|FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY)
echo "[$(date +%H:%M:%S)] Job $job_id reached terminal state: $state"
break
;;
*)
# treat as transient failure, reset failures and continue polling
squeue_failures=0
;;
esac
else
# No sacct: assume job completed after 3 failures
echo "[$(date +%H:%M:%S)] Job $job_id no longer in queue"
break
fi
fi
state=$(get_job_state "$job_id")

if is_terminal_state "$state"; then
echo "[$(date +%H:%M:%S)] Job $job_id reached terminal state: $state"
break
else
squeue_failures=0
# Print heartbeat if no output for 60 seconds
if [ $((current_time - last_heartbeat)) -ge 60 ]; then
echo "[$(date +%H:%M:%S)] Job $job_id still running (no new output for 60s)..."
echo "[$(date +%H:%M:%S)] Job $job_id state=$state (no new output for 60s)..."
last_heartbeat=$current_time
fi
fi

# Sleep briefly between status checks
sleep 1
done

# Drain any remaining output from tail after job completes
echo "Draining remaining output..."
drain_count=0
while IFS= read -r -t 0.5 line <&3 2>/dev/null; do
while IFS= read -r -t 1 line <&3 2>/dev/null; do
echo "$line"
drain_count=$((drain_count + 1))
# Safety limit to avoid infinite loop
@@ -128,6 +162,7 @@ done
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true
tail_pid=""
Comment on lines 162 to +165
Suggestion: To prevent a race condition that could orphan the tail process, move the tail_pid="" line from the end of the script into the cleanup function. This ensures the process ID is cleared only after the process is killed. [possible issue, importance: 5]

Suggested change
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true
tail_pid=""
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true


# Wait for output file to finish growing (stabilize) before stopping tail
if [ -f "$output_file" ]; then
@@ -149,9 +184,6 @@ if [ -f "$output_file" ]; then
done
fi

# Stop tailing (trap will also handle this on exit)
kill "${tail_pid}" 2>/dev/null || true

echo ""
echo "=== Final output ==="
cat "$output_file"
@@ -187,6 +219,6 @@ if [ "$exit_code" != "0:0" ]; then
exit 1
fi

monitor_success=1
echo "Job $job_id completed successfully"
exit 0

2 changes: 1 addition & 1 deletion src/pre_process/m_check_patches.fpp
@@ -10,7 +10,7 @@ module m_check_patches
! Dependencies
use m_derived_types !< Definitions of the derived types

use m_global_parameters !< Global parameters for the code
use m_global_parameters !< Global parameters

use m_mpi_proxy !< Message passing interface (MPI) module proxy
