
[Diagnostics] Add diagnostics suite with first tool to diagnose SLURM accounting setup#7336

Merged
gmarciani merged 4 commits into aws:develop from gmarciani:wip/mgiacomo/diagnostics-0414-1
Apr 15, 2026


@gmarciani gmarciani commented Apr 14, 2026

Description of changes

Add diagnostics suite with first tool to diagnose SLURM accounting setup.

Notes

  1. We can skip the bad-url-suffix-checker because it only flags a comment whose example contains the domain amazonaws.com, which is harmless.
  2. We can skip the security-exclusion checker because the use of the subprocess module is intentional (nosec B404).

User Experience

The user uploads the diagnostics suite to the head node with a one-click script.
The deployment script prints the SSH command that logs into the head node directly in the diagnostics folder, ready to run the diagnosis.

➜  bash util/diagnostics/deploy.sh --cluster-name accnt-3150-11-2 --region us-east-1 --ssh-key ~/.ssh/pem_keys/mgiacomo/mgiacomo.pem
[INFO] Retrieving head node connection info for cluster 'accnt-3150-11-2' in region 'us-east-1'...
[INFO] Head node IP: 44.195.87.177
[INFO] Default user: ec2-user
[INFO] Uploading /Volumes/workplace/aws-parallelcluster-dev/aws-parallelcluster/util/diagnostics to ec2-user@44.195.87.177:~/
... OMITTED OUTPUT ...
[INFO] Done. Files uploaded to /home/ec2-user/diagnostics/
[INFO] Installing requirements on head node...
... OMITTED OUTPUT ...
[INFO] Requirements installed successfully.
[INFO] Next steps: log into the head node and run the diagnostics scripts from ~/diagnostics/
[INFO]   ssh -i /Users/mgiacomo/.ssh/pem_keys/mgiacomo/mgiacomo.pem ec2-user@44.195.87.177 -t 'cd ~/diagnostics && bash -l'

The user logs into the head node, landing directly in the diagnostics folder:

➜  ssh -i /Users/mgiacomo/.ssh/pem_keys/mgiacomo/mgiacomo.pem ec2-user@44.195.87.177 -t 'cd ~/diagnostics && bash -l'

This is the help output of the first diagnostic tool, which covers SLURM accounting:

[ec2-user@ip-27-6-37-106 diagnostics]$ ./diagnose-slurm-accounting.py --help
Usage: diagnose-slurm-accounting.py [OPTIONS]

  Diagnose SLURM accounting setup.

Options:
  --db-endpoint TEXT  Database endpoint. If not specified, determined from the
                      cluster configuration in S3.
  --db-port INTEGER   Database port. If not specified, determined from the
                      cluster configuration in S3.
  --db-user TEXT      Database user. If not specified, determined from the
                      cluster configuration in S3.
  --secret-arn TEXT   Secret ARN for the database password. If not specified,
                      determined from the cluster configuration in S3.
  --region TEXT       AWS region. If not specified, determined from the local
                      /etc/chef/dna.json file.
  -h, --help          Show this message and exit.
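
The fallback described in the help text (explicit options win, otherwise values come from the cluster configuration) can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `resolve_db_settings`, the dictionary keys, and the default port are hypothetical, and the sketch assumes the database settings sit under `Scheduling/SlurmSettings/Database` as `Uri` (in `host[:port]` form), `UserName`, and `PasswordSecretArn`.

```python
DEFAULT_MYSQL_PORT = 3306  # assumption: standard MySQL port, as seen in the example run


def resolve_db_settings(cli: dict, cluster_config: dict) -> dict:
    """Return DB settings, preferring explicit CLI values over the cluster configuration."""
    # Assumed configuration layout; adjust if the real configuration differs.
    db = cluster_config.get("Scheduling", {}).get("SlurmSettings", {}).get("Database", {})
    # Split "host:port" into its parts; port is "" when the Uri has no port.
    host, _, port = db.get("Uri", "").partition(":")

    def pick(cli_value, config_value):
        # An option counts as "not specified" only when it is None.
        return cli_value if cli_value is not None else config_value

    return {
        "db_endpoint": pick(cli.get("db_endpoint"), host or None),
        "db_port": pick(cli.get("db_port"), int(port) if port else DEFAULT_MYSQL_PORT),
        "db_user": pick(cli.get("db_user"), db.get("UserName")),
        "secret_arn": pick(cli.get("secret_arn"), db.get("PasswordSecretArn")),
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    config = {
        "Scheduling": {
            "SlurmSettings": {
                "Database": {
                    "Uri": "db.example.internal:3306",
                    "UserName": "clusteradmin",
                    "PasswordSecretArn": "arn:example",
                }
            }
        }
    }
    print(resolve_db_settings({"db_user": "someone-else"}, config))
```

Keeping the resolution in one pure function makes it easy to unit test without touching S3 or the database.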

This is an example diagnosis run for SLURM accounting:

[ec2-user@ip-27-6-37-106 diagnostics]$ ./diagnose-slurm-accounting.py
2026-04-14 21:53:25,249 INFO: Some arguments are missing. Attempting to determine values automatically...
/home/ec2-user/.local/lib/python3.9/site-packages/boto3/compat.py:89: PythonDeprecationWarning: Boto3 will no longer support Python 3.9 starting April 29, 2026. To continue receiving service updates, bug fixes, and security updates please upgrade to Python 3.10 or later. More information can be found here: https://aws.amazon.com/blogs/developer/python-support-policy-updates-for-aws-sdks-and-tools/
  warnings.warn(warning, PythonDeprecationWarning)
2026-04-14 21:53:25,270 INFO: Found credentials from IAM Role: accnt-3150-11-2-RoleHeadNode-3nZd4yHCV7QF
[✓] Downloaded cluster configuration from S3
2026-04-14 21:53:25,458 INFO: Database Endpoint: slurm-accounting-cluster-11.cluster-c1yheob1ikdf.us-east-1.rds.amazonaws.com
2026-04-14 21:53:25,458 INFO: Database Port: 3306
2026-04-14 21:53:25,458 INFO: Database User: clusteradmin
2026-04-14 21:53:25,458 INFO: Secret ARN: arn:aws:secretsmanager:us-east-1:319414405305:secret:AccountingClusterAdminSecre-mo0xsZT8XRA3-zIQDCe
2026-04-14 21:53:25,459 INFO: Region: us-east-1
[✓] Database endpoint reachability check
[✓] Database endpoint matches configuration
2026-04-14 21:53:25,537 INFO: Found credentials from IAM Role: accnt-3150-11-2-RoleHeadNode-3nZd4yHCV7QF
[✓] Secret is plain text password
[✓] Database user matches configuration
[✓] Database password matches secret
[✓] MySQL connection test
[✓] User clusteradmin has correct MySQL permissions
2026-04-14 21:53:25,889 INFO: Grants for user 'clusteradmin':
    GRANT USAGE ON *.* TO `clusteradmin`@`%`
    GRANT `rds_superuser_role`@`%` TO `clusteradmin`@`%`
[✓] No errors related to MySQL in slurmdbd logs
2026-04-14 21:53:25,954 INFO: All checks completed!
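
The checklist-style output above suggests a sequence of independent checks, each reported on its own line. A minimal sketch of that pattern, together with one possible reachability check, is shown below. The names `is_tcp_reachable` and `run_check` are hypothetical, the `[✗]` failure marker is an assumption (the output above only shows `[✓]`), and the actual tool may implement its checks differently.

```python
import socket
from typing import Callable


def is_tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def run_check(description: str, check: Callable[[], bool]) -> bool:
    """Run one check and print a status line; an exception counts as a failure."""
    try:
        passed = bool(check())
    except Exception:
        passed = False
    # "[✓]" matches the output above; "[✗]" is an assumed failure marker.
    print(f"[{'✓' if passed else '✗'}] {description}")
    return passed


if __name__ == "__main__":
    # Placeholder endpoint, for illustration only.
    run_check(
        "Database endpoint reachability check",
        lambda: is_tcp_reachable("db.example.internal", 3306),
    )
```

Treating an exception as a failed check lets the remaining checks still run, so one run reports every problem it can find.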

Tests

  • See user experience above, which has been tested on a cluster with slurm accounting enabled.

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani changed the title from "Wip/mgiacomo/diagnostics 0414 1" to "[Diagnostics] Add diagnostics suite with first tool to diagnose SLURM accounting setup" Apr 14, 2026
@gmarciani gmarciani added the skip-changelog-update (Disables the check that enforces changelog updates in PRs) and 3.x labels Apr 14, 2026
Comment thread util/diagnostics/diagnose-slurm-accounting.py Dismissed
Comment thread util/diagnostics/common.py Fixed
@gmarciani gmarciani force-pushed the wip/mgiacomo/diagnostics-0414-1 branch from a757e50 to 36caf02 Compare April 14, 2026 22:10
@gmarciani gmarciani added the skip-bad-url-suffix-check (Skip the checks regarding the bad URL suffix) label Apr 14, 2026
@gmarciani gmarciani marked this pull request as ready for review April 14, 2026 22:12
@gmarciani gmarciani requested review from a team as code owners April 14, 2026 22:12
@gmarciani gmarciani force-pushed the wip/mgiacomo/diagnostics-0414-1 branch 2 times, most recently from d7af7de to 3ac826a Compare April 14, 2026 22:19
@himani2411

The user uploads the diagnostics suite to the head node with a one-click script.

Any specific reason for not keeping this script already as part of /examples or maybe another folder like /examples/diagnostic in the cookbook, so that this script is already in the AMI?


echo "[INFO] Installing requirements on head node..."

ssh "${SSH_ARGS[@]}" "${DEFAULT_USER}@${HEAD_NODE_IP}" "pip install -r ~/${REMOTE_DIR}/requirements.txt"

[Non-blocking] We should create a virtual environment, so that we do not install packages that could be subject to CVEs and get picked up during a scan, especially since we are not baking them into the AMI.

@gmarciani gmarciani Apr 15, 2026

Agree, will do in follow up PR
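
The virtual environment suggested in this thread could look roughly like the sketch below, which builds the remote command that a deployment script might pass to `ssh` in place of a bare `pip install`. It is an illustration only, not the follow-up change itself: the function name `build_venv_install_command`, the `.venv` directory name, and the assumption that `python3` with the `venv` module is available on the head node are all hypothetical.

```python
import shlex


def build_venv_install_command(remote_dir: str = "diagnostics", venv_name: str = ".venv") -> str:
    """Build a remote shell command that installs requirements into a virtual environment.

    Paths are relative to the remote home directory, which is the default
    working directory of a non-interactive SSH command.
    """
    venv_path = f"{remote_dir}/{venv_name}"
    requirements = f"{remote_dir}/requirements.txt"
    steps = [
        # Create the environment (assumes python3 and the venv module exist remotely).
        f"python3 -m venv {shlex.quote(venv_path)}",
        # Install with the environment's own pip, leaving system and user site-packages untouched.
        f"{shlex.quote(venv_path + '/bin/pip')} install -r {shlex.quote(requirements)}",
    ]
    return " && ".join(steps)


if __name__ == "__main__":
    print(build_venv_install_command())
```

With this approach the diagnostics scripts would need to run with the environment's interpreter (for example `diagnostics/.venv/bin/python`), so their shebang or invocation would have to change accordingly.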

Comment thread util/diagnostics/diagnose-slurm-accounting.py Outdated
Comment thread util/diagnostics/diagnose-slurm-accounting.py Outdated
Comment thread util/diagnostics/diagnose-slurm-accounting.py Outdated
Comment thread util/diagnostics/common.py Outdated
Comment thread util/diagnostics/deploy.sh
Comment thread util/diagnostics/deploy.sh Outdated
gmarciani commented Apr 15, 2026

The user uploads the diagnostics suite to the head node with a one-click script.

Any specific reason for not keeping this script already as part of /examples or maybe another folder like /examples/diagnostic in the cookbook, so that this script is already in the AMI?

We will not ship the diagnostics suite until it is stable and relevant enough. For now the goal is to use the suite internally (us and the support engineers), and I wanted the most immediate place to store it, which is the main pcluster package.
Once it is ready to be shared with customers, we can consider baking it into the AMI.

@gmarciani gmarciani force-pushed the wip/mgiacomo/diagnostics-0414-1 branch from 3ac826a to f9b35e2 Compare April 15, 2026 17:57
@gmarciani gmarciani enabled auto-merge (rebase) April 15, 2026 17:58
Comment thread util/diagnostics/common.py Dismissed
Comment thread util/diagnostics/common.py Dismissed
@gmarciani gmarciani force-pushed the wip/mgiacomo/diagnostics-0414-1 branch from f9b35e2 to 6d72b65 Compare April 15, 2026 18:17
@gmarciani gmarciani added the skip-security-exclusions-check (Skip the checks regarding the security exclusions) label Apr 15, 2026
@gmarciani gmarciani merged commit 70a1d06 into aws:develop Apr 15, 2026
24 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/diagnostics-0414-1 branch April 15, 2026 19:44

Labels

  • 3.x
  • skip-bad-url-suffix-check: Skip the checks regarding the bad URL suffix
  • skip-changelog-update: Disables the check that enforces changelog updates in PRs
  • skip-security-exclusions-check: Skip the checks regarding the security exclusions

Projects

None yet

3 participants