Job Scheduling
What is Slurm?
Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. It facilitates the efficient execution of parallel jobs on the cluster. For more information, users are requested to visit the Slurm Workload Manager - Documentation.
What are the frequently used commands for Slurm?
Here is the list of frequently used commands. For more information, users are requested to refer to the Slurm Workload Manager - Documentation.
- salloc - To allocate resources to a Slurm job with a possible set of constraints.
- sbatch - Submits a Slurm job script.
- scancel - Cancels a Slurm job.
- scontrol - To query information and manage jobs.
- sinfo - To retrieve information about partitions and nodes.
- squeue - To query the list of pending and running jobs.
- srun - To run a parallel job on the cluster managed by Slurm.
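For users who have not yet written a job script, a minimal batch script might look like the following. This is only an illustrative sketch: the partition name (compute), resource sizes, walltime, and executable (./my_program) are assumptions that should be adapted to your cluster and application.
#!/bin/bash
#SBATCH --job-name=my_job          # Name shown in squeue
#SBATCH --partition=compute        # Assumed partition name; adjust as needed
#SBATCH --nodes=1                  # Number of nodes requested
#SBATCH --ntasks-per-node=4        # Tasks (processes) per node
#SBATCH --cpus-per-task=1          # Cores per task
#SBATCH --time=01:00:00            # Walltime limit (HH:MM:SS)
#SBATCH --output=slurm.%j.out      # Write output to slurm.<jobid>.out

srun ./my_program                  # Replace with your executable
Save the script (for example, as job_script.sh) and submit it with sbatch.
$ sbatch job_script.sh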
How to view the status of partitions?
Users can use the sinfo command to view the status of partitions and nodes.
$ sinfo
Alternatively, users can run scontrol show partition to know about partitions and their limits.
$ scontrol show partition
How to view my submitted jobs?
To view all the current jobs of a user, please type the following command.
$ squeue -u <username>
To view all the running jobs of a user, please type
$ squeue -u <username> -t RUNNING
How to cancel my job?
To cancel a particular job by its jobid, use the following command.
$ scancel <jobid>
To cancel a job by its name, please type
$ scancel --name <jobname>
To cancel all the jobs of a user
$ scancel -u <username>
To cancel all the PENDING jobs of a user
$ scancel -t PENDING -u <username>
How to control my job?
To hold a job from being scheduled
$ scontrol hold <jobid>
To release a job to be scheduled
$ scontrol release <jobid>
To requeue (cancel and rerun) a job
$ scontrol requeue <jobid>
For more information, users can use man scontrol, man squeue, man scancel, etc.
Can I run my code for a few minutes on the login node?
Users are not authorized to run their codes on the login node. Codes running on login nodes will be terminated automatically. Jobs have to be submitted through the scheduler.
How can I monitor the output of my jobs that are running?
Users can monitor the output and error logs generated by Slurm while their codes are running using the tail command. Users are strongly recommended to use the example scripts given above as a base for their job scripts. Job logs are written to the directory from which the Slurm job was submitted with sbatch.
Please note that, by default, sbatch writes its output to a file named slurm-<jobid>.out in the submission directory; the example provided above instructs Slurm instead to write the output as slurm.<jobid>.out, where <jobid> is the job ID of the Slurm job. Users can use the tail command with the -f flag to follow the output file continuously as Slurm writes to it.
For example, if a user has a job with the job id 121, and would like to view the job output, then use the following syntax.
$ tail -f slurm.121.out
Alternatively, users can use vi or nano to view the output file after the job is completed.
$ vi slurm.121.out
$ nano slurm.121.out
Can I run interactive jobs?
Yes, the facility provides users access to interactive job scheduling. Users can use the srun command to set up an interactive session on the nodes. Consider an example where a user wants an interactive session spanning 2 nodes with a total of 128 cores.
$ srun -N 2 -n 1 -c 128 -p compute --pty bash -i
Here, -N is the number of nodes to be used, -n is the number of tasks (instances) to be executed, -c is the number of cores required per task, and -p is the partition to be used; in the present example, we chose the compute partition. --pty bash instructs Slurm to set up a pseudo-terminal running bash, and the trailing -i starts bash as an interactive shell.
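As an alternative sketch (assuming the same compute partition; the resource sizes are illustrative only), users can first create an allocation with salloc and then launch commands inside it with srun, releasing the allocation with exit when finished.
$ salloc -N 1 -n 4 -p compute
$ srun hostname
$ exit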
I have job scripts written for another job scheduler. Can I use them for Slurm?
No, job scripts written for other job schedulers cannot be executed directly by Slurm. However, the developers of Slurm provide documentation listing the correspondences between the options of several job schedulers. Users can access this documentation at the Rosetta Stone of Workload Managers.
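As an illustrative sketch of such a translation (the directive mappings follow the Rosetta Stone; the job name and resource values are assumptions), a PBS/Torque header and its Slurm equivalent might look like this.
# PBS/Torque directives (original script)
#PBS -N my_job
#PBS -l nodes=2:ppn=16
#PBS -l walltime=01:00:00

# Equivalent Slurm directives (translated script)
#SBATCH --job-name=my_job
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00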
When is my job going to start to run?
To get an estimate of when your job is going to start, users can use the squeue command.
$ squeue --start -j <jobid>
Please note that your job might run before the scheduled start time as jobs that have finished earlier than their requested walltime might free up the queue and resources needed for your job.
Why is my job not running?
There are several reasons why your job is not running. Users can run the squeue command to get the status and reason.
$ squeue -j <jobid> -l
The NODELIST(REASON) column in the output of the above command shows the reason why Slurm is unable to run the job. Here are the most common reasons.
| Reason | Description |
|---|---|
| BadConstraints | The job's constraints cannot be satisfied. |
| Cleaning | The job is being requeued and is still cleaning up from its previous execution. |
| Dependency | This job is waiting for a dependent job to complete. |
| JobHeldAdmin | The job is held by a system administrator. Please contact the system administrator for more information. |
| JobHeldUser | The job is held by the user. |
| NonZeroExitCode | The job terminated with a non-zero exit code. |
| PartitionDown | The partition required by this job is in a DOWN state. |
| Priority | One or more higher-priority jobs exist for this partition or advanced reservation. |
| QOSResourceLimit | The job's Quality of Service (QOS) has reached some resource limit. |
| ReqNodeNotAvail | Some node specifically required by the job is not currently available. If the message also lists UnavailableNodes:, there is likely an upcoming reservation or maintenance window on that node. |
| Resources | The job is waiting for resources to become available. |
| TimeLimit | The job exhausted its time limit. |
| QOSMinCpuNotSatisfied | The job's CPU request does not meet the minimum limit of some Quality of Service (QOS). |
| QOSMaxJobsPerUserLimit | The job is unable to run because the user has submitted more jobs of a certain type than are allowed to run at a time. |
| PartitionTimeLimit | The job's time limit exceeds the partition's current time limit. |
| QOSMaxGRESPerJob | The job's GRES request exceeds the maximum each job is allowed to use for the requested Quality of Service (QOS). |
How Do I Optimize My Jobs for Faster Execution?
The following are some tips for optimizing your jobs for faster execution:
- Use the --ntasks-per-node option together with --nodes to control how your tasks are packed onto nodes. Keeping a job on as few nodes as possible reduces the overhead of inter-node communication (see the sketch after this list).
- Minimize I/O operations. For example, if your job requires reading data from a file, read the data into memory at the beginning of the job, perform all computations in memory, and write the results to a file at the end of the job.
- Keep your code up to date. If you are using a compiled language such as C or Fortran, use the latest compiler version available on the cluster. If you are using a scripting language such as Python or R, use the latest version of the interpreter available on the cluster.
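As a sketch of the node-packing idea mentioned in the first tip (the partition name and the figure of 64 tasks are assumptions, not site defaults), the following directives ask Slurm to place all tasks on a single node rather than spreading them across nodes.
#SBATCH --partition=compute        # Assumed partition name
#SBATCH --nodes=1                  # Keep the whole job on one node
#SBATCH --ntasks-per-node=64       # Pack all 64 tasks onto that node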
How Can I Estimate the Resources (CPU, Memory) Needed for My Job?
Start with a small job using estimated resources. Monitor its usage using commands like sstat <jobid> or seff <jobid>. Adjust the resources based on this initial run for future submissions.
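For example, assuming a job ID of 121 as in the earlier tail example, the following commands can be used; the --format fields shown are just one possible selection. sstat reports statistics for a running job, while seff summarizes the efficiency of a completed job.
$ sstat -j 121 --format=JobID,AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite
$ seff 121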
Can I Resume a Job After It Fails?
If your application supports checkpointing, you can resume from the last checkpoint after a job failure. Otherwise, the job will need to restart from the beginning.
We strongly recommend that you use checkpointing to avoid losing work in the event of a job failure.
How do I implement checkpointing in my application?
Checkpointing is a feature of the application itself. Please consult the documentation for your application to learn how to implement checkpointing.
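As an illustrative sketch only (my_program, its --restart-from flag, and the file checkpoint.dat are hypothetical names, since the actual mechanism depends entirely on your application), a job script can test for an existing checkpoint file and resume from it when the job is resubmitted.
#!/bin/bash
#SBATCH --job-name=restartable_job
#SBATCH --output=slurm.%j.out

if [ -f checkpoint.dat ]; then
    # A checkpoint from a previous run exists; resume from it (hypothetical flag)
    srun ./my_program --restart-from checkpoint.dat
else
    # No checkpoint found; start from the beginning
    srun ./my_program
fi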
How Do I Allocate Resources for Hybrid MPI/OpenMP Jobs in Slurm?
To allocate resources for hybrid MPI/OpenMP jobs, use the --ntasks-per-node and --cpus-per-task options. For example, to allocate 4 MPI tasks per node with 2 OpenMP threads per task, use the following directives in your Slurm script:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
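A fuller sketch of such a script might look like the following; the partition name, node count, walltime, and executable (./hybrid_app) are assumptions. OMP_NUM_THREADS is set from Slurm's SLURM_CPUS_PER_TASK variable so that the OpenMP thread count matches the allocation.
#!/bin/bash
#SBATCH --job-name=hybrid_job
#SBATCH --partition=compute        # Assumed partition name
#SBATCH --nodes=2                  # Assumed node count
#SBATCH --ntasks-per-node=4        # 4 MPI tasks per node
#SBATCH --cpus-per-task=2          # 2 OpenMP threads per task
#SBATCH --time=01:00:00            # Assumed walltime
#SBATCH --output=slurm.%j.out

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # Match OpenMP threads to the allocation
srun ./hybrid_app                              # Replace with your MPI/OpenMP executable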
What Should I Do If a Compute Node Appears to Be Malfunctioning?
If a compute node appears to be malfunctioning, please contact the HPC team. We will investigate the issue and take appropriate action. If possible, please provide the following information:
- The name of the compute node
- The job ID of the job that was running on the compute node
- The job script that was used to submit the job
- Any relevant slurm.out or slurm.err files
What Are Strategies for Handling Jobs with Unpredictable Runtime Behavior?
For jobs with variable runtimes, consider implementing checkpointing and resubmitting the job if it does not complete in the expected time. Use Slurm's job monitoring tools (sstat and seff) to keep track of resource usage and adjust resource requests.
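One possible resubmission pattern is a sketch using Slurm's --dependency option with the afternotok condition, which runs a follow-up job only if the first job fails; retry_job.sh is a hypothetical script name, and <jobid> is the ID reported when the first job is submitted.
$ sbatch job_script.sh
$ sbatch --dependency=afternotok:<jobid> retry_job.sh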