Nested slurm tasks can occasionally cause outer job to hang, even when inner job completes #110

@reesekneeland

Hello! I am running into an issue when running multiple nested slurm jobs. I have an outer slurm job (in this case a model training job) that submits another slurm job internally (in this case a job to preprocess and prepare data). When I run many of these outer jobs at the same time via a grid-search style submission loop with different training parameters, the inner jobs (which are identical) conflict and one of them fails or is cancelled. The outer training job then hangs forever, waiting for an inner job that has already been cancelled.

I am not sure of the exact mechanism causing the issue; it could be this logic? The symptom I observe is that when I submit a multi-job grid search script that launches 80 jobs, ~20 of them hang forever with this issue, and my system logs show that the inner preprocessing jobs for these are cancelled only 5 seconds after submission. As a debugging step I tried adding a sleep(1) statement in my job submission loop to give the jobs a buffer so they would not overlap, but that did not help.
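To make the hang concrete, here is a minimal sketch of how the outer job could wait on the inner job with a deadline instead of blocking indefinitely. It assumes only that the inner job object exposes `done()` and `state` the way a submitit `Job` does. The helper name `wait_for_inner_job`, the failure-state set, and the default timings are hypothetical and not part of submitit.

```python
import time

# Slurm states after which the inner job will never produce a result.
# Assumed, non-exhaustive list chosen for illustration.
FAILURE_STATES = {"CANCELLED", "FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}


def wait_for_inner_job(job, timeout_s=600.0, poll_s=5.0):
    """Poll `job` until it finishes, fails, or `timeout_s` elapses.

    `job` is any object with a `done()` method and a `state` attribute
    (for example a submitit Job). Returns the final state string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Slurm may report e.g. "CANCELLED by 1000"; keep only the first word.
        state = (str(job.state).split() or ["UNKNOWN"])[0]
        if state in FAILURE_STATES:
            raise RuntimeError(f"inner job ended in state {state}")
        if job.done():
            return state
        # If the state stays unknown (e.g. the status query keeps failing),
        # give up at the deadline rather than waiting forever.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"inner job still {state} after {timeout_s}s")
        time.sleep(poll_s)
```

With something like this, an outer job whose inner job was cancelled would fail fast (or time out) and free its allocation, rather than sitting in the queue as a hung job. It does not address why the inner jobs are cancelled in the first place.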

Another thing I have noticed that may be related: whenever I have nested slurm jobs, I get lots of these warning messages in my log. Any idea what might be causing them? Is there some configuration in my slurm cluster that I need to change?

submitit WARNING (2025-09-11 09:31:03,202) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:03.202337 - WARNING - submitit:135 [slurm_51557] - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:06,454) - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:06.454158 - WARNING - submitit:135 [slurm_51557] - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:12,497) - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:12.497061 - WARNING - submitit:135 [slurm_51557] - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:24,544) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.

In my slurm .err file these look like:

submitit WARNING (2025-09-11 09:31:01,161) - Call #2 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:03,202) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:06,454) - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:12,497) - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
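The errors above suggest that `sacct` on the compute node cannot reach the accounting database (6819 is the default slurmdbd port, and the connection to `localhost` is refused). As a rough diagnostic, the sketch below re-runs the same `sacct` command shown in the warnings from wherever it is executed, so the result on a login node can be compared with the result inside a job. The helper names `build_sacct_cmd` and `sacct_reachable` are hypothetical; only the `sacct` arguments are taken from the log.

```python
import shutil
import subprocess


def build_sacct_cmd(job_ids):
    """Build the sacct command line that appears in the warnings above."""
    cmd = ["sacct", "-o", "JobID,State,NodeList", "--parsable2"]
    for job_id in job_ids:
        cmd += ["-j", str(job_id)]
    return cmd


def sacct_reachable(job_ids, timeout_s=10.0):
    """Return (ok, message): whether sacct exits with status 0 on this host."""
    if shutil.which("sacct") is None:
        return False, "sacct not found on PATH"
    try:
        proc = subprocess.run(
            build_sacct_cmd(job_ids),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "sacct timed out"
    # On failure, stderr carries lines like "Problem talking to the database".
    return proc.returncode == 0, proc.stderr.strip()


if __name__ == "__main__":
    ok, message = sacct_reachable([51627, 51631])
    print("sacct ok:", ok)
    if message:
        print(message)
```

If this succeeds on the login node but fails inside a running job, that would point at the accounting storage settings seen by the compute nodes rather than at submitit itself, though that is an assumption to check with the cluster administrators.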
