- Basic Job Submission
- Choosing a Queue
- Specifying how long the job will run
- Specifying node and core requirements
- Specifying memory requirements
- Using infiniband
- Specifying the amount of scratch space needed
- Specifying the allocation to be charged
- Specifying email options
- Specifying which shell to run the job in
- Specifying the directory to run the job in
Basic Job Submission
The Deepthought HPC cluster uses a batch scheduling system to handle the queuing, scheduling, and execution of jobs. Users submit jobs using the qsub command, which takes a number of options (some of which can be omitted or defaulted). These options define various requirements of the job, which are used by the scheduler to figure out what is needed to run your job, and to schedule it to run as soon as possible, subject to the constraints on the system, usage policies, and the needs of the other users of the cluster.
The options to qsub can be given on the command line,
or in many cases inside the job script. When given inside the job script,
the option is placed alone on a line starting with
#PBS (you must include a space after the PBS). These
lines must come before any executable line in the script. See the
examples page for examples.
The # at the start of these lines means they will be ignored
by the shell.
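To illustrate, here is a minimal sketch (the script name and option values are hypothetical) showing the same options given inside the script; the command-line equivalent is shown in a comment:

```shell
# Create a hypothetical job script with embedded PBS directives.
# To the shell, the "#PBS" lines are just comments.
cat > myjob.sh <<'EOF'
#!/bin/sh
#PBS -l walltime=00:05:00
#PBS -l nodes=1
echo "running on $(hostname)"
EOF

# Equivalent submission with the options on the command line instead:
#   qsub -l walltime=00:05:00 -l nodes=1 myjob.sh
# Run outside the batch system, the #PBS lines are simply ignored:
sh myjob.sh
```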
The most basic parameter given to the
qsub command is
the script to run. This obviously must be given on the command line, not
inside the script file. This script gets invoked by the batch system on
the first node assigned to your job; if your job involves multiple nodes
(or even multiple cores on the same node), it is this script's responsibility
to launch all the tasks for the job. When the script exits, the job is
considered to have finished. More information can be found in the section
on Running parallel codes
and in the examples section.
Choosing a Queue
The queues on Deepthought are laid out in a manner that allows short running jobs to take priority over longer jobs. This means that if two jobs are waiting in the queue, the higher priority job will run first. Note however that if a job is already running in a queue, it will be allowed to run to completion before the next job is started. The only exception to this rule is for jobs in the serial queue, which will be preempted (evicted) if a higher priority job comes along.
In addition to queue priorities, users of the cluster with paid
allocations (users that contribute money or resources to the cluster)
get priority over non-paying users. These paid allocations get
replenished monthly, as documented here.
Other users can receive one-time grants of a certain number of
service units (SUs) as determined by the
HPCC Allocations and
Advisory Committee. In addition, "free" usage of the cluster is
provided to users with paid or non-paid allocations, assuming cycles
are available. "Free" jobs run at low priority and will be preempted
(evicted) if a higher-priority job comes along. To specify a queue, use the
-q option to qsub. In addition,
paid users will also need to specify their high-priority allocation in
order to take advantage of their elevated priority. If no allocation is
specified, the default priorities will be used. See the section Allocations for information on allocations, and
the section on specifying which allocation a job is charged to.
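For example, queue and allocation selection at submission time might look like the following; the allocation name test-hi and script name are illustrative (serial is the queue described above):

```shell
# Submit to the serial (low-priority, preemptible) queue:
#   qsub -q serial myjob.sh
# Submit charged against a paid, high-priority allocation:
#   qsub -A test-hi myjob.sh
```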
Specifying the Amount of Time Your Job Will Run
When submitting a job, it is very important to specify the amount of time you expect your job to take. If you specify a time that is too short, your job will be terminated by the scheduler before it completes. However, if you specify a time that is too long, you may run the risk of having your job sit in the queue for longer than it should, as the scheduler attempts to find available resources on which to run your job.
To specify your estimated runtime, use the
-l walltime parameter. This value should be specified in the form
HHH:MM:SS. Note that if your job is expected to run over
multiple days, simply convert the number of days into hours- for
example a 3 day job would have a walltime value of 72:00:00.
You may leave off the leading digits if you like- so a walltime of
15:00 will represent 15 minutes. Note also that while the
scheduler may show walltimes in the form DD:HH:MM:SS when you
view the queue status, this format will not be accepted when you
submit a job.
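Since multi-day runtimes must be expressed in hours, the conversion is simple arithmetic; a quick sketch:

```shell
# Convert a runtime in days to the HHH:MM:SS walltime form.
days=3
hours=$((days * 24))                    # 3 days -> 72 hours
walltime=$(printf '%d:00:00' "$hours")
echo "$walltime"                        # prints 72:00:00
```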
If you do not specify a walltime, the default (maximum) permitted walltime for the queue will be used. See the section entitled Queues for more information on queues and their assigned limits.
The following example specifies a walltime of 60 seconds, which should be more than enough for the job to complete.
#PBS -l nodes=1
#PBS -l walltime=00:00:60

hostname
Specifying Node and Core Requirements
Depending on the requirements of your job, you may need to give the scheduler more specific information about those requirements so that it can better assign you the resources that you need. By default, unless told otherwise, the scheduler will pack as many of your jobs as it can onto a given node. (You'll never share a node with someone else's jobs, but your own jobs are fair game for packing.) So, for instance, if you have two jobs in the queue where you've specified
nodes=2, both of these jobs can be scheduled simultaneously onto the same 4-core machine.
There are two different arguments available to specify how many cores and
separate physical nodes you are allocated. The
nodes argument, when specified by itself, specifies the number of cores you will
be allocated. If you specify
nodes by itself, your job may all
fit onto a single physical node, or may be split across multiple nodes,
depending on what's available.
If you know you'll need 12 cores, but don't care how they're distributed, try the following:
#PBS -l nodes=12

myjob
This might give you three 4-core nodes, or an 8-core node and a 4-core node, or even two 8-core nodes.
If you require that your cores are all allocated on the same physical node,
you can add the
ppn argument. Specifying
nodes=x:ppn=y says that you want x physical nodes with at least
y cores per node.
If you know you'll need at least four 4-core nodes (16 cores total), try the following:
#PBS -l nodes=4:ppn=4

myjob
This might give you four 4-core nodes, or even four 8-core nodes, depending
on what's available. Note that if you're using MPI, and you get larger nodes
than you've requested, the
mpirun command will pack more of your
tasks onto the larger nodes, leaving some nodes empty. For this reason, using
the ppn argument is not recommended when using MPI.
The nodes and ppn arguments can be confusing, so
you should check to make sure you're getting what you want, and not allocating
more nodes than you need. Your best bet is to specify only the
nodes argument, and let the scheduler pick the appropriate
resources for you.
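One way to check what you actually received is to inspect the node file the batch system hands your job: $PBS_NODEFILE lists one hostname per allocated core. A sketch using a mocked-up node file (outside a running job the variable is unset, so hostnames here are illustrative):

```shell
# Mock node file: one line per allocated core, as PBS would provide.
cat > nodefile.mock <<'EOF'
compute-0-1
compute-0-1
compute-0-1
compute-0-1
compute-0-2
compute-0-2
EOF

wc -l < nodefile.mock          # total cores allocated (6 here)
sort nodefile.mock | uniq -c   # cores on each physical node
```

Inside a real job script, replace nodefile.mock with "$PBS_NODEFILE".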
Specifying Memory Requirements
If you want to request a specific amount of memory for your job, try something like the following:
#PBS -l nodes=1
#PBS -l mem=1024mb

myjob
This example requests a single node with at least
1GB (1024MB) of memory total. Note that the
mem parameter
specifies the total amount of memory required across all of your
allocation. So if you end up with multiple nodes allocated, this memory
figure will be divided across all of them.
If you want to request a specific amount of memory on a per-core basis, use the following:
#PBS -l nodes=4
#PBS -l pmem=1024mb

myjob
This example requests at least 1GB per core, on four cores, with a total of 4GB memory requested for the entire job.
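The relationship between the two forms is just multiplication (total = per-core amount × core count), which you can sanity-check before submitting:

```shell
# pmem is per core; the equivalent total (mem) is pmem times the core count.
cores=4
pmem_mb=1024
total_mb=$((cores * pmem_mb))
echo "${total_mb}mb total"     # 4096mb, i.e. the 4GB figure above
```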
You should also note that node selection does not count memory used by the operating system, etc. So a node which nominally has 8 GB of RAM might only show 7995 MB available. Thus, even though most nodes on the cluster nominally have 8 GB, if you request 8 GB you will exclude most nodes (as they only have something like 7995 MB available after the OS takes its chunk). Some care should therefore be used in choosing your memory requirements; requesting a little bit under a multiple of a GB may be advisable.
Using InfiniBand Nodes
A subset of the nodes on the cluster are interconnected with Dual Data Rate (DDR) InfiniBand networking, and another subset of nodes use Quad Data Rate (QDR) InfiniBand networking. To select one of these subsets, add the appropriate node property (e.g. the
qib tag for QDR nodes) to your job script.
Note: To take advantage of the InfiniBand hardware, your code must
have been compiled against an InfiniBand aware library. For example, the provided OpenMPI libraries
will automatically use InfiniBand if it is available.
When using OpenMPI, if you want to force your code to only use
InfiniBand, add the argument
--mca btl openib,self to your
mpirun command. If you want to force your code to only use
the TCP transport, instead add
--mca btl tcp,self. (If you
really want just TCP, though, please run your job in a queue whose nodes do not have InfiniBand.)
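Assuming the OpenMPI mpirun is the one on your path, the transport-selection flags look like this (the program name is hypothetical):

```shell
# Force InfiniBand (openib) as the only transport, plus self (loopback):
#   mpirun --mca btl openib,self ./my_mpi_program
# Force TCP as the only transport instead:
#   mpirun --mca btl tcp,self ./my_mpi_program
```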
The following example will run a job on a QDR InfiniBand node:
#PBS -l nodes=1:qib
#PBS -l walltime=00:00:60

hostname
Specifying the Amount of Scratch Space Needed
If your job requires more than a small amount (1GB) of local scratch space, it would be a good idea to specify how much you need when you submit the job so that the scheduler can assign appropriate nodes to you.
Most of the nodes currently have at least 30GB of scratch space, some have as much as 250GB available, and a few have as little as 1GB available. Scratch space is currently mounted as /tmp. Scratch space will be cleared once your job completes.
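Because scratch is node-local and cleared when the job ends, a common pattern is to stage input into /tmp, compute there, and copy results back before the script exits. A sketch, with the file names and the stand-in "computation" purely illustrative:

```shell
# Stage into node-local scratch, work there, copy results home.
scratch=$(mktemp -d /tmp/myjob.XXXXXX)    # private scratch directory
echo "input data" > "$scratch/input.dat"  # stand-in for staging real input
# Stand-in computation: upper-case the input.
tr 'a-z' 'A-Z' < "$scratch/input.dat" > "$scratch/output.dat"
cp "$scratch/output.dat" ./output.dat     # copy results off scratch
rm -rf "$scratch"                         # /tmp is cleared anyway; be tidy
```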
The following example specifies a scratch space requirement of 5GB. Note however that if you do this, the scheduler will set a filesize limit of 5GB. If you then try to create a file larger than that, your job will automatically be killed, so be sure to specify a size large enough for your needs.
#PBS -l nodes=1
#PBS -l file=5gb

myjob
NOTE: The size specified after file is
per task (core), so adjust accordingly. I.e., if you
are running an 8-core job that needs 40 GB, make sure you specify only
file=5gb; otherwise it might expect 320 GB of space and will
sit forever in the queue (as no such nodes are available).
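Since the limit is per core, divide the total scratch you need by the core count before choosing a file value; a quick sketch of the arithmetic:

```shell
# 40 GB total scratch spread across an 8-core job.
total_gb=40
cores=8
per_core_gb=$((total_gb / cores))
echo "request: -l file=${per_core_gb}gb"   # prints: request: -l file=5gb
```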
Specifying the allocation to be charged
All users of the cluster are provided with at least one allocation. Paid users also receive a second, high-priority allocation, and some users might have multiple allocations of either normal or high priority.
Jobs charged to the high-priority allocation take precedence over non-paid jobs. No job will preempt another job regardless of priority, with the exception of jobs in the serial queue, which will be preempted by any job with a higher priority.
To submit jobs to an allocation other than your default
(standard-priority) allocation, use the
-A option to qsub:

f20-l1:~: qsub -A test-hi test.sh
4194.deepthought.umd.edu
For more information on allocations, including monitoring usage of your allocation, see the section Allocations.
Specifying email options
If you want to be notified via email when your job completes, you can
add the -mXX option to your description file. If you want
to receive mail when the job starts, replace the Xs with the letter
b. If you want to receive mail when your job
completes, replace the Xs with the letter e. You
may add both letters if you like, and you'll get two email messages. By
default, you will always be sent email if your job is aborted by the
scheduler for any reason. The completion email will tell you the exit
status of your job as well as the amount of resources the job
consumed. Note that the CPU time and memory usage numbers provided in
this email are unreliable at best. The email messages by default will
be sent to your Glue account. If you'd like them to go elsewhere, you
can add the
-M option followed by a comma-separated list of email addresses.
#PBS -l walltime=00:00:60
#PBS -mbe -Mbob@myhost.com,firstname.lastname@example.org

date
Running Your Job in a Different Shell
By default, your job script will be run through whatever shell is your default shell. To change this, you'll need to add the
-S option to your description file. Also note that when using the bash shell, you must explicitly run your
.profile script, as it is not run for you automatically. If you have
tap commands in your submit script, this is especially important, because
tap is defined in
.profile. If you're using
tcsh, you don't need to worry about this.
The following example changes to using
/bin/bash as the shell for the job:

#PBS -l walltime=00:01:00
#PBS -S /bin/bash

. ~/.profile # only needed for bash shell
date
hostname
Running Your Job in a Different Directory
The working directory in which your job runs will be your home directory, unless you specify otherwise. So, even if you're sitting in
/data/dt-raid5/bob/my_program when you submit your job, when the job runs, it will look in your home directory for any files that don't have a full pathname specified. The easiest way to change this behavior is to add the appropriate
cd command before any other commands in your job script.
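A minimal sketch of the cd-first pattern (the directory here is a temporary stand-in for a real data directory):

```shell
# The job starts in $HOME; cd first so relative paths land where you expect.
workdir=$(mktemp -d)      # stand-in for e.g. a directory under /data
cd "$workdir"
touch results.txt         # created in $workdir, not in $HOME
ls results.txt
```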
Also note that if you are using MPI, you may also need to add either
the -wd option for LAM (
mpirun) or the
-wdir option for MPICH (
mpiexec) to specify the working directory for your MPI processes.
The following example (using LAM) switches the working directory to
/data/dt-raid5/bob/my_program:

#PBS -l walltime=00:01:00

cd /data/dt-raid5/bob/my_program
lamboot $PBS_NODEFILE
mpirun -wd /data/dt-raid5/bob/my_program C alltoall
lamhalt
There is also a
-d option that you can give to qsub
(or add a
#PBS -d ... line in your job description file), but
use of that method is not recommended. It should work for the Lustre file
system, but does not work well with automounted file systems
(qsub appears to expand all symlinks, and breaks the automount mechanism).