Slurm
We use SLURM (https://slurm.schedmd.com/overview.html ) as a workload manager to schedule jobs onto compute resources. Via SLURM we can ensure that each user gets a fair share of the limited compute resources and that multiple users do not interfere with each other when e.g. running benchmarks.
Important: You can only access a node via SSH when you have a SLURM allocation of that node.
Other resources: - Slurm Tutorial
Basics
IMGW special commands
There are currently a few extra commands that can be used on the Jet Cluster to facilitate usage of the nodes.
Tools:
- jobinfo
- jobinfo_remaining
- nodeinfo
- queueinfo
- watchjob
Bash | |
---|---|
1 2 3 4 5 6 |
|
jobs
MPI
status and reason codes
The squeue
command details a variety of information on an active job’s status with state and reason codes. Job state codes describe a job’s current state in queue (e.g. pending, completed). Job reason codes describe the reason why the job is in its current state.
The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.
Job State Codes
Status | Code | Explaination |
---|---|---|
COMPLETED | CD |
The job has completed successfully. |
COMPLETING | CG |
The job is finishing but some processes are still active. |
FAILED | F |
The job terminated with a non-zero exit code and failed to execute. |
PENDING | PD |
The job is waiting for resource allocation. It will eventually run. |
PREEMPTED | PR |
The job was terminated because of preemption by another job. |
RUNNING | R |
The job currently is allocated to a node and is running. |
SUSPENDED | S |
A running job has been stopped with its cores released to other jobs. |
STOPPED | ST |
A running job has been stopped with its cores retained. |
A full list of these Job State codes can be found in Slurm’s documentation.
Job Reason Codes
Reason Code | Explaination |
---|---|
Priority |
One or more higher priority jobs is in queue for running. Your job will eventually run. |
Dependency |
This job is waiting for a dependent job to complete and will run afterwards. |
Resources |
The job is waiting for resources to become available and will eventually run. |
InvalidAccount |
The job’s account is invalid. Cancel the job and rerun with correct account. |
InvaldQoS |
The job’s QoS is invalid. Cancel the job and rerun with correct account. |
QOSGrpCpuLimit |
All CPUs assigned to your job’s specified QoS are in use; job will run eventually. |
QOSGrpMaxJobsLimit |
Maximum number of jobs for your job’s QoS have been met; job will run eventually. |
QOSGrpNodeLimit |
All nodes assigned to your job’s specified QoS are in use; job will run eventually. |
PartitionCpuLimit |
All CPUs assigned to your job’s specified partition are in use; job will run eventually. |
PartitionMaxJobsLimit |
Maximum number of jobs for your job’s partition have been met; job will run eventually. |
PartitionNodeLimit |
All nodes assigned to your job’s specified partition are in use; job will run eventually. |
AssociationCpuLimit |
All CPUs assigned to your job’s specified association are in use; job will run eventually. |
AssociationMaxJobsLimit |
Maximum number of jobs for your job’s association have been met; job will run eventually. |
AssociationNodeLimit |
All nodes assigned to your job’s specified association are in use; job will run eventually. |
A full list of these Job Reason Codes can be found in Slurm’s documentation.
Get information on your jobs
Job details | |
---|---|
1 2 3 4 |
|
Job efficiency | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
There is a helpful script that can report job efficiency for job arrays too.
seff-array.py
seff-array.py | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 |
|
One can use that to get more detailed information on a job array:
Running job efficiency report array | |
---|---|
1 |
|
Created: January 26, 2023