Jobs are submitted to the cluster with the qsub command. The qsub
command takes a number of options (some of which can be omitted or defaulted). These
options define various requirements of the job, which are used by the
scheduler to figure out what is needed to run your job and to schedule
it to run as soon as possible, subject to the constraints of the system,
usage policies, and the needs of the other users of the cluster.
The options to qsub
can be given on the command line,
or in many cases inside the job script. When given inside the job script,
the option is placed alone on a line starting with #PBS
(you must include a space after the PBS). These #PBS
lines
must come before any executable line in the script. See the
examples page for examples.
The #
at the start of these lines means they will be ignored
by the shell.
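For instance, a minimal sketch of a job script with its options embedded as #PBS lines (the resource values here are only placeholders) might look like:
#PBS -l walltime=00:05:00
#PBS -l nodes=1

hostname
The two #PBS lines come before the first executable line (hostname), so the batch system reads them as options while the shell ignores them as comments.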
The most basic parameter given to the qsub
command is
the script to run. This obviously must be given on the command line, not
inside the script file. This script gets invoked by the batch system on
the first node assigned to your job; if your job involves multiple nodes
(or even multiple cores on the same node), it is this script's responsibility
to launch all the tasks for the job. When the script exits, the job is
considered to have finished. More information can be found in the section
on Running parallel codes
and in the examples section.
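As a quick sketch (myjob.sh is just a placeholder name for your own script), submitting from the command line looks like:
qsub myjob.sh
The batch system responds with the identifier assigned to the new job.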
In addition to queue priorities, users of the cluster with paid
allocations (users that contribute money or resources to the cluster)
get priority over non-paying users. These paid allocations get
replenished monthly, as documented here.
Other users can receive one-time grants of a certain number of
service units (SUs) as determined by the
HPCC Allocations and
Advisory Committee. In addition, "free" usage of the cluster is
provided to users with paid or non-paid allocations, assuming cycles
are available. "Free" jobs run at low priority and will be preempted
(evicted) if a higher-priority job comes along. To specify a queue,
use the -q option to qsub. Note:
paid users will also need to specify their high-priority allocation in
order to take advantage of their elevated priority. If no allocation is
specified, the default priorities will be used. See the section on Allocations for more information, and
the section on specifying which allocation a job is
charged against.
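For example, using the serial queue mentioned below (substitute whichever queue you actually need), the queue can be requested either on the command line:
qsub -q serial myjob.sh
or inside the job script:
#PBS -q serial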
Queues on the Deepthought Cluster
To specify your estimated runtime, use the walltime
parameter. This value should be specified in the form
HHH:MM:SS. Note that if your job is expected to run over
multiple days, simply convert the number of days into hours; for
example, a 3-day job would have a walltime value of 72:00:00.
You may leave off the leading fields if you like, so a walltime of
15:00 represents 15 minutes. Note also that while the
scheduler may show walltimes in the form DD:HH:MM:SS when you
view the queue status, this format will not be accepted when you
submit a job.
If you do not specify a walltime, the default (maximum) permitted walltime for the queue will be used. See the section entitled Queues for more information on queues and their assigned limits.
The following example specifies a walltime of 60 seconds, which should be more than enough for the job to complete.
#PBS -l nodes=1
#PBS -l walltime=00:00:60

hostname
For example, if you submit two jobs that each request nodes=2, both of these jobs can be scheduled
simultaneously onto the same 4-core machine.
There are two different arguments available to specify how many cores and
separate physical nodes you are allocated. The nodes
argument,
when specified by itself, specifies the number of cores you will
be allocated. If you specify nodes
by itself, your job may fit entirely onto a single physical node, or may be split across multiple nodes
depending on what's available.
If you know you'll need 12 cores, but don't care how they're distributed, try the following:
#PBS -l nodes=12

myjob
This might give you three 4-core nodes, or an 8-core node and a 4-core node, or even two 8-core nodes.
If you require that your cores are all allocated on the same physical node,
you can add the ppn
argument. Specifying
nodes=x:ppn=y
says that you want x physical nodes with at least
y cores per node.
If you know you'll need at least four 4-core nodes (16 cores total), try the following:
#PBS -l nodes=4:ppn=4

myjob
This might give you four 4-core nodes, or even four 8-core nodes, depending
on what's available. Note that if you're using MPI, and you get larger nodes
than you've requested, the mpirun
command will pack more of your
job's tasks onto the larger nodes, leaving some nodes empty. For this reason, using
the ppn
argument is not recommended when using MPI.
The nodes
and ppn
arguments can be confusing, so
you should check to make sure you're getting what you want, and not allocating
more nodes than you need. Your best bet is to specify only the
nodes
argument, and let the scheduler pick the appropriate
resources for you.
If you want to request a specific amount of memory for your job as a whole, use the following:
#PBS -l nodes=1:ppn=4
#PBS -l mem=1024mb

myjob
This example requests a single node with at least 4 processors and at least
1GB (1024MB) of memory total. Note that the mem
parameter
specifies the total amount of memory required across all of your
allocation. So if you end up with multiple nodes allocated, this memory
figure will be divided across all of them.
If you want to request a specific amount of memory on a per-core basis, use the following:
#PBS -l nodes=4
#PBS -l pmem=1024mb

myjob
This example requests at least 1GB per core, on four cores, with a total of 4GB memory requested for the entire job.
You should also note that node selection does not count memory used by the operating system, etc., so a node which nominally has 8 GB of RAM might only show 7995 MB available. Even though most nodes on the cluster are nominally 8 GB, requesting 8 GB will exclude most of them, since they only have something like 7995 MB available after the OS takes its share. Use a bit of care in choosing the memory requirement; going a little under multiples of GBs is advisable.
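As a sketch of that advice (the exact figure depends on how much memory the nodes actually report), requesting slightly under 8 GB keeps the nominally 8 GB nodes eligible:
#PBS -l nodes=1
#PBS -l mem=7900mb

myjob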
To request nodes connected with InfiniBand, add the ib or qib (QDR InfiniBand)
tag to the node specification in your job script.
Note: To take advantage of the InfiniBand hardware, your code must
have been compiled against an InfiniBand-aware library. For example, the provided OpenMPI libraries
(tap openmpi
, tap openmpi-intel
, tap openmpi-pgi
)
will automatically use InfiniBand if it is available.
When using OpenMPI, if you want to force your code to only use
InfiniBand, add the argument --mca btl openib,self
to your
mpirun
command. If you want to force your code to only use
the TCP transport, instead add --mca btl tcp,self. (If you
really want just TCP, though, please run your job in a queue other than the
ib queue.)
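For example, assuming an MPI program named my_mpi_prog run with 8 processes (both placeholders), the transport can be pinned down like this:
# InfiniBand only:
mpirun --mca btl openib,self -np 8 ./my_mpi_prog
# TCP only (please run such jobs outside the ib queue):
mpirun --mca btl tcp,self -np 8 ./my_mpi_prog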
The following example will run a job on a QDR InfiniBand node:
#PBS -l nodes=1:qib
#PBS -l walltime=00:00:60

hostname
Most of the nodes currently have at least 30GB of scratch space, some have as much as 250GB available, and a few have as little as 1GB available. Scratch space is currently mounted as /tmp. Scratch space will be cleared once your job completes.
The following example specifies a scratch space requirement of 5GB. Note however that if you do this, the scheduler will set a filesize limit of 5GB. If you then try to create a file larger than that, your job will automatically be killed, so be sure to specify a size large enough for your needs.
#PBS -l nodes=1
#PBS -l file=5gb

myjob
NOTE: The size specified after file is
per task (core), so adjust accordingly. I.e., if you
are running an 8 core job that needs 40 GB, make sure you specify only
file=5gb
, otherwise it might expect 320 GB of space and will
sit forever in the queue (as no such nodes are available).
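To make the arithmetic concrete, a sketch of that 8-core, 40 GB case (myjob is a placeholder) would be:
#PBS -l nodes=8
#PBS -l file=5gb

myjob
The 5 GB limit is applied per core, so the eight cores together get the full 40 GB of scratch space.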
Jobs charged to the high-priority allocation take precedence over non-paid jobs. No job will preempt another job regardless of priority, with the exception of jobs in the serial queue, which will be preempted by any job with a higher priority.
To submit jobs to an allocation other than your default
(standard-priority) allocation, use the -A
option to qsub.
f20-l1:~: qsub -A test-hi test.sh
4194.deepthought.umd.edu
For more information on allocations, including monitoring usage of your allocation, see the section Allocations.
If you want to be notified via email when your job completes, you can
add the -mXX
option to your description file. If you want
to receive mail when the job starts, replace the Xs with the
letter b
. If you want to receive mail when your job
completes, replace the Xs with the letter e
. You
may add both letters if you like, and you'll get two email messages. By
default, you will always be sent email if your job is aborted by the
scheduler for any reason. The completion email will tell you the exit
status of your job as well as the amount of resources the job
consumed. Note that the CPU time and memory usage numbers provided in
this email are unreliable at best. The email messages by default will
be sent to your Glue account. If you'd like them to go elsewhere, you
can add the -M option followed by a comma-separated list
of email addresses.
#PBS -l walltime=00:00:60
#PBS -m be -M bob@myhost.com,jane@yourhost.com

date
If you want your job script to run under a different shell, add the -S
option to your description file. Also note that when
using the bash shell, you must explicitly run your
.profile
script, as it is not run for you automatically.
If you have tap
commands in your submit script, this is
especially important because tap
is defined in
.profile
. If you're using tcsh
you don't
need to worry about this.
The following example changes to using /bin/bash
as the
execution shell.
#PBS -l walltime=00:01:00
#PBS -S /bin/bash

. ~/.profile  # only needed for bash shell
date
hostname
Note that jobs do not start in the directory from which they were submitted: even if you are in
/data/dt-raid5/bob/my_program when you submit your
job, when the job runs, it will look in your home directory for any
files that don't have a full pathname specified. The easiest way to change
this behavior is to add the appropriate cd
command before
any other commands in your job script.
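For a simple (non-MPI) job, a sketch of that approach might be (the program name is a placeholder; $PBS_O_WORKDIR, which the batch system sets to the directory you submitted from, can be used instead of a hard-coded path):
#PBS -l walltime=00:01:00

cd /data/dt-raid5/bob/my_program
# or: cd $PBS_O_WORKDIR
./my_program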
Also note that if you are using MPI, you may also need to add either
the -wd
option for LAM (mpirun
) or the
-wdir
option for MPICH (mpiexec
) to specify the
working directory.
The following example (using LAM) switches the working directory to
/data/dt-raid5/bob/my_program.
#PBS -l walltime=00:01:00

cd /data/dt-raid5/bob/my_program
lamboot $PBS_NODEFILE
mpirun -wd /data/dt-raid5/bob/my_program C alltoall
lamhalt
There is also a -d
option that you can give to qsub
(or add a #PBS -d ...
line in your job description file), but
use of that method is not recommended. It should work for the Lustre file
system, but does not work well with the /data/...
volumes
(qsub appears to expand all symlinks, and breaks the automount mechanism).