
Slurm

We use SLURM (https://slurm.schedmd.com/overview.html) as a workload manager to schedule jobs onto compute resources. SLURM ensures that each user gets a fair share of the limited compute resources and that users do not interfere with each other, for example when running benchmarks.

Important: You can only access a node via SSH when you have a SLURM allocation of that node.
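
A minimal sketch of what this rule means in practice: request an allocation first, then connect to one of its nodes. `salloc`, `scontrol show hostnames` and the `SLURM_JOB_NODELIST` variable are standard Slurm; the helper name `first_allocated_node` and the resource values in the usage comment are illustrative assumptions, not site defaults.

```shell
#!/usr/bin/env bash
# Print the first hostname of the current allocation.
# SLURM_JOB_NODELIST is set by Slurm inside an allocation (e.g. in a salloc shell).
first_allocated_node() {
    if [ -z "${SLURM_JOB_NODELIST:-}" ]; then
        echo "no Slurm allocation found in this shell" >&2
        return 1
    fi
    # scontrol expands a compact node list such as "node[01-03]" to one name per line.
    scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1
}

# Typical use (not executed here, values are examples only):
#   salloc --nodes=1 --time=00:30:00    # wait until the allocation is granted
#   ssh "$(first_allocated_node)"       # SSH works because the node is allocated to you
```

Without an allocation the helper prints a message and returns a non-zero status instead of attempting a connection.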

Other resources:
- Slurm Tutorial

Basics

IMGW special commands

There are currently a few extra commands that can be used on the Jet Cluster to facilitate usage of the nodes.

Tools:
- jobinfo
- jobinfo_remaining
- nodeinfo
- queueinfo
- watchjob

Bash
# Get information on your job
jobinfo
# or use a JOBID
jobinfo 123456

jobinfo_remaining

Jobs

MPI

Status and reason codes

The squeue command reports the status of an active job through state and reason codes. Job state codes describe a job's current state in the queue (e.g. pending, completed). Job reason codes describe why the job is in that state.

The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.

Job State Codes

| Status | Code | Explanation |
|---|---|---|
| COMPLETED | CD | The job has completed successfully. |
| COMPLETING | CG | The job is finishing but some processes are still active. |
| FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
| PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
| PREEMPTED | PR | The job was terminated because of preemption by another job. |
| RUNNING | R | The job is currently allocated to a node and is running. |
| SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
| STOPPED | ST | A running job has been stopped with its cores retained. |

A full list of these Job State codes can be found in Slurm’s documentation.
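
To see these state codes for your own jobs, a small sketch along the following lines may help. It relies on the standard `squeue` format specifiers `%i` (job ID), `%t` (compact state code), `%T` (long state name) and `%j` (job name); the function names are hypothetical helpers, not commands installed on the cluster.

```shell
#!/usr/bin/env bash
# List your own jobs: job ID, compact state code, long state name, job name.
# The job name comes last because it may contain spaces.
my_job_states() {
    squeue --user="${USER:-$(id -un)}" --noheader --format="%i %t %T %j"
}

# Count your jobs per compact state code, one "<count> <code>" line per state.
my_job_state_counts() {
    my_job_states | awk '{ n[$2]++ } END { for (s in n) print n[s], s }' | sort -k2
}
```

Calling `my_job_state_counts` gives a quick overview of how many of your jobs are pending, running, and so on.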

Job Reason Codes

| Reason Code | Explanation |
|---|---|
| Priority | One or more higher-priority jobs are in the queue. Your job will eventually run. |
| Dependency | This job is waiting for a job it depends on to complete and will run afterwards. |
| Resources | The job is waiting for resources to become available and will eventually run. |
| InvalidAccount | The job's account is invalid. Cancel the job and rerun with the correct account. |
| InvalidQOS | The job's QoS is invalid. Cancel the job and rerun with the correct QoS. |
| QOSGrpCpuLimit | All CPUs assigned to your job's specified QoS are in use; job will run eventually. |
| QOSGrpMaxJobsLimit | The maximum number of jobs for your job's QoS has been reached; job will run eventually. |
| QOSGrpNodeLimit | All nodes assigned to your job's specified QoS are in use; job will run eventually. |
| PartitionCpuLimit | All CPUs assigned to your job's specified partition are in use; job will run eventually. |
| PartitionMaxJobsLimit | The maximum number of jobs for your job's partition has been reached; job will run eventually. |
| PartitionNodeLimit | All nodes assigned to your job's specified partition are in use; job will run eventually. |
| AssociationCpuLimit | All CPUs assigned to your job's specified association are in use; job will run eventually. |
| AssociationMaxJobsLimit | The maximum number of jobs for your job's association has been reached; job will run eventually. |
| AssociationNodeLimit | All nodes assigned to your job's specified association are in use; job will run eventually. |

A full list of these Job Reason Codes can be found in Slurm’s documentation.
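
As a rough illustration of how the reason codes can be used, the sketch below lists your pending jobs with their reason and picks out those that will not start on their own. It assumes the standard `squeue` specifiers `%i` (job ID) and `%r` (reason) and the reason strings `InvalidAccount` and `InvalidQOS` as reported by Slurm; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print "<jobid> <reason>" for each of your pending jobs.
pending_reasons() {
    squeue --user="${USER:-$(id -un)}" --states=PENDING --noheader --format="%i %r"
}

# Print the IDs of pending jobs that need to be cancelled and resubmitted,
# i.e. those whose reason is an invalid account or an invalid QoS.
jobs_needing_action() {
    pending_reasons | awk '$2 == "InvalidAccount" || $2 == "InvalidQOS" { print $1 }'
}
```

Jobs reported by `jobs_needing_action` could then be cancelled with `scancel <jobid>` and resubmitted with corrected settings; all other pending jobs simply wait.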

Get information on your jobs

Job details
# get all your jobs since a given start date
sacct --starttime=YYYY-MM-DD -u "$USER" -o start,jobid,jobidraw,jobname,partition,maxvmsize,elapsed,state,exitcode
# get more information on one job
sacct -j [jobid]
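
Building on the `sacct` calls above, the following sketch narrows the listing to failed jobs and produces output that is easy to process in scripts. The options `--state`, `-X` (allocations only, no job steps), `-n` (no header) and `-P` (fields separated by `|`) are standard `sacct` options; the function names and the chosen fields are illustrative assumptions.

```shell
#!/usr/bin/env bash
# List your FAILED jobs since a given date (YYYY-MM-DD), one line per job:
# jobid|jobname|elapsed|exitcode
failed_jobs_since() {
    local since="$1"
    sacct --starttime="$since" --user="${USER:-$(id -un)}" --state=FAILED \
          -X -n -P --format=jobid,jobname,elapsed,exitcode
}

# Print only the job IDs of those failed jobs.
failed_job_ids_since() {
    failed_jobs_since "$1" | cut -d'|' -f1
}
```

For example, `failed_job_ids_since 2024-01-01` would print one job ID per line, which can be passed on to `sacct -j` for a closer look.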
Job efficiency
# get a job's efficiency report
seff [jobid]
# example, with 123456 as a placeholder job ID
seff 123456
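
When several jobs need checking, the `seff` call can be wrapped in a small loop. This is only a sketch: `seff_all` is a hypothetical helper name, and it assumes nothing about `seff` beyond the single job ID argument shown above.

```shell
#!/usr/bin/env bash
# Print an efficiency report for every job ID given as an argument,
# each preceded by a separator line naming the job.
seff_all() {
    local id
    for id in "$@"; do
        echo "=== job ${id} ==="
        seff "$id"
    done
}
```

It would be called as `seff_all <jobid> <jobid> ...` with your own job IDs.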

Last update: February 1, 2024
Created: December 12, 2022