Slurm
We use SLURM (https://slurm.schedmd.com/overview.html ) as a workload manager to schedule jobs onto compute resources. Via SLURM we can ensure that each user gets a fair share of the limited compute resources and that multiple users do not interfere with each other when e.g. running benchmarks.
Important: You can only access a node via SSH when you have a SLURM allocation of that node.
Other resources: - Slurm Tutorial
Basics
IMGW special commands
There are currently a few extra commands that can be used on the Jet Cluster to facilitate usage of the nodes.
Tools:
- jobinfo
- jobinfo_remaining
- nodeinfo
- queueinfo
- watchjob
Bash | |
---|---|
1 2 3 4 5 6 |
|
jobs
MPI
status and reason codes
The squeue
command details a variety of information on an active job’s status with state and reason codes. Job state codes describe a job’s current state in queue (e.g. pending, completed). Job reason codes describe the reason why the job is in its current state.
The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.
Job State Codes
Status | Code | Explaination |
---|---|---|
COMPLETED | CD |
The job has completed successfully. |
COMPLETING | CG |
The job is finishing but some processes are still active. |
FAILED | F |
The job terminated with a non-zero exit code and failed to execute. |
PENDING | PD |
The job is waiting for resource allocation. It will eventually run. |
PREEMPTED | PR |
The job was terminated because of preemption by another job. |
RUNNING | R |
The job currently is allocated to a node and is running. |
SUSPENDED | S |
A running job has been stopped with its cores released to other jobs. |
STOPPED | ST |
A running job has been stopped with its cores retained. |
A full list of these Job State codes can be found in Slurm’s documentation.
Job Reason Codes
Reason Code | Explaination |
---|---|
Priority |
One or more higher priority jobs is in queue for running. Your job will eventually run. |
Dependency |
This job is waiting for a dependent job to complete and will run afterwards. |
Resources |
The job is waiting for resources to become available and will eventually run. |
InvalidAccount |
The job’s account is invalid. Cancel the job and rerun with correct account. |
InvaldQoS |
The job’s QoS is invalid. Cancel the job and rerun with correct account. |
QOSGrpCpuLimit |
All CPUs assigned to your job’s specified QoS are in use; job will run eventually. |
QOSGrpMaxJobsLimit |
Maximum number of jobs for your job’s QoS have been met; job will run eventually. |
QOSGrpNodeLimit |
All nodes assigned to your job’s specified QoS are in use; job will run eventually. |
PartitionCpuLimit |
All CPUs assigned to your job’s specified partition are in use; job will run eventually. |
PartitionMaxJobsLimit |
Maximum number of jobs for your job’s partition have been met; job will run eventually. |
PartitionNodeLimit |
All nodes assigned to your job’s specified partition are in use; job will run eventually. |
AssociationCpuLimit |
All CPUs assigned to your job’s specified association are in use; job will run eventually. |
AssociationMaxJobsLimit |
Maximum number of jobs for your job’s association have been met; job will run eventually. |
AssociationNodeLimit |
All nodes assigned to your job’s specified association are in use; job will run eventually. |
A full list of these Job Reason Codes can be found in Slurm’s documentation.
Get information on your jobs
Job details | |
---|---|
1 2 3 4 |
|
Job efficiency | |
---|---|
1 2 3 4 |
|
Created: December 12, 2022