Jobs are submitted to the cluster with the qsub command. The qsub
command takes a number of options (some of which can be omitted or defaulted). These
options define various requirements of the job, which are used by the
scheduler to figure out what is needed to run your job and to schedule
it to run as soon as possible, subject to the constraints of the system,
usage policies, and the needs of the other users of the cluster.
The options to qsub
can be given on the command line,
or in many cases inside the job script. When given inside the job script,
the option is placed alone on a line starting with #PBS
(you must include a space after the PBS). These #PBS
lines
must come before any executable line in the script. See the
examples page for examples.
The #
at the start of these lines means they will be ignored
by the shell.
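For instance, a minimal sketch of a job script with its options embedded as #PBS lines (the resource values here are only placeholders) might look like:
#PBS -l walltime=00:05:00
#PBS -l nodes=1

hostname
The two #PBS lines come before the first executable line (hostname), so the batch system reads them as options while the shell ignores them as comments.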
The most basic parameter given to the qsub
command is
the script to run. This obviously must be given on the command line, not
inside the script file. This script gets invoked by the batch system on
the first node assigned to your job; if your job involves multiple nodes
(or even multiple cores on the same node), it is this script's responsibility
to launch all the tasks for the job. When the script exits, the job is
considered to have finished. More information can be found in the section
on Running parallel codes
and in the examples section.
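As a quick sketch (myjob.sh is just a placeholder name for your own script), submitting from the command line looks like:
qsub myjob.sh
The batch system responds with the identifier assigned to the new job.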
In addition to queue priorities, users of the cluster with paid
allocations (users that contribute money or resources to the cluster)
get priority over non-paying users. These paid allocations get
replenished monthly, as documented here.
Other users can receive one-time grants of a certain number of
service units (SUs) as determined by the
HPCC Allocations and
Advisory Committee. In addition, "free" usage of the cluster is
provided to users with paid or non-paid allocations, assuming cycles
are available. "Free" jobs run at low priority and will be preempted
(evicted) if a higher-priority job comes along. To specify a queue,
use the -q option to qsub. Note:
paid users will also need to specify their high-priority allocation in
order to take advantage of their elevated priority. If no allocation is
specified, the default priorities will be used. See the section on Allocations for more information, and
the section on specifying which allocation a job is
charged against.
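For example, using the serial queue mentioned below (substitute whichever queue you actually need), the queue can be requested either on the command line:
qsub -q serial myjob.sh
or inside the job script:
#PBS -q serial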
Queues on the Deepthought Cluster
To specify your estimated runtime, use the walltime
parameter. This value should be specified in the form
HHH:MM:SS. Note that if your job is expected to run over
multiple days, simply convert the number of days into hours; for
example, a 3-day job would have a walltime value of 72:00:00.
You may leave off the leading fields if you like, so a walltime of
15:00 represents 15 minutes. Note also that while the
scheduler may show walltimes in the form DD:HH:MM:SS when you
view the queue status, this format will not be accepted when you
submit a job.
If you do not specify a walltime, the default (maximum) permitted walltime for the queue will be used. See the section entitled Queues for more information on queues and their assigned limits.
The following example specifies a walltime of 60 seconds, which should be more than enough for the job to complete.
#PBS -l nodes=1
#PBS -l walltime=00:00:60

hostname
For example, if you submit two jobs that each request nodes=2, both of these jobs can be scheduled
simultaneously onto the same 4-core machine.
There are two different arguments available to specify how many cores and
separate physical nodes you are allocated. The nodes
argument,
when specified by itself, specifies the number of cores you will
be allocated. If you specify nodes
by itself, your job may fit entirely onto a single physical node, or may be split across multiple nodes
depending on what's available.
If you know you'll need 12 cores, but don't care how they're distributed, try the following:
#PBS -l nodes=12

myjob
This might give you three 4-core nodes, or an 8-core node and a 4-core node, or even two 8-core nodes.
If you require that your cores are all allocated on the same physical node,
you can add the ppn
argument. Specifying
nodes=x:ppn=y
says that you want x physical nodes with at least
y cores per node.
If you know you'll need at least four 4-core nodes (16 cores total), try the following:
#PBS -l nodes=4:ppn=4

myjob
This might give you four 4-core nodes, or even four 8-core nodes, depending
on what's available. Note that if you're using MPI, and you get larger nodes
than you've requested, the mpirun
command will pack more of your
job's tasks onto the larger nodes, leaving some nodes empty. For this reason, using
the ppn
argument is not recommended when using MPI.
The nodes
and ppn
arguments can be confusing, so
you should check to make sure you're getting what you want, and not allocating
more nodes than you need. Your best bet is to specify only the
nodes
argument, and let the scheduler pick the appropriate
resources for you.
If you want to request a specific amount of memory for your job as a whole, use the following:
#PBS -l nodes=1:ppn=4
#PBS -l mem=1024mb

myjob
This example requests a single node with at least 4 processors and at least
1GB (1024MB) of memory total. Note that the mem
parameter
specifies the total amount of memory required across all of your
allocation. So if you end up with multiple nodes allocated, this memory
figure will be divided across all of them.
If you want to request a specific amount of memory on a per-core basis, use the following:
#PBS -l nodes=4
#PBS -l pmem=1024mb

myjob
This example requests at least 1GB per core, on four cores, with a total of 4GB memory requested for the entire job.
You should also note that node selection does not count memory used by the operating system, etc., so a node which nominally has 8 GB of RAM might only show 7995 MB available. Even though most nodes on the cluster are nominally 8 GB, requesting 8 GB will exclude most of them, since they only have something like 7995 MB available after the OS takes its share. Use a bit of care in choosing the memory requirement; going a little under multiples of GBs is advisable.
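As a sketch of that advice (the exact figure depends on how much memory the nodes actually report), requesting slightly under 8 GB keeps the nominally 8 GB nodes eligible:
#PBS -l nodes=1
#PBS -l mem=7900mb

myjob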
To request nodes connected with InfiniBand, add the ib or qib (QDR InfiniBand)
tag to the node specification in your job script.
Note: To take advantage of the InfiniBand hardware, your code must
have been compiled against an InfiniBand-aware library. For example, the provided OpenMPI libraries
(tap openmpi
, tap openmpi-intel
, tap openmpi-pgi
)
will automatically use InfiniBand if it is available.
When using OpenMPI, if you want to force your code to only use
InfiniBand, add the argument --mca btl openib,self
to your
mpirun
command. If you want to force your code to only use
the TCP transport, instead add --mca btl tcp,self. (If you
really want just TCP, though, please run your job in a queue other than the
ib queue.)
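For example, assuming an MPI program named my_mpi_prog run with 8 processes (both placeholders), the transport can be pinned down like this:
# InfiniBand only:
mpirun --mca btl openib,self -np 8 ./my_mpi_prog
# TCP only (please run such jobs outside the ib queue):
mpirun --mca btl tcp,self -np 8 ./my_mpi_prog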
The following example will run a job on a QDR InfiniBand node:
#PBS -l nodes=1:qib
#PBS -l walltime=00:00:60

hostname
Most of the nodes currently have at least 30GB of scratch space, some have as much as 250GB available, and a few have as little as 1GB available. Scratch space is currently mounted as /tmp. Scratch space will be cleared once your job completes.
The following example specifies a scratch space requirement of 5GB. Note however that if you do this, the scheduler will set a filesize limit of 5GB. If you then try to create a file larger than that, your job will automatically be killed, so be sure to specify a size large enough for your needs.
#PBS -l nodes=1
#PBS -l file=5gb

myjob
NOTE: The size specified after file is
per task (core), so adjust accordingly. I.e., if you
are running an 8 core job that needs 40 GB, make sure you specify only
file=5gb
, otherwise it might expect 320 GB of space and will
sit forever in the queue (as no such nodes are available).
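To make the arithmetic concrete, a sketch of that 8-core, 40 GB case (myjob is a placeholder) would be:
#PBS -l nodes=8
#PBS -l file=5gb

myjob
The 5 GB limit is applied per core, so the eight cores together get the full 40 GB of scratch space.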
Jobs charged to the high-priority allocation take precedence over non-paid jobs. No job will preempt another job regardless of priority, with the exception of jobs in the serial queue, which will be preempted by any job with a higher priority.
To submit jobs to an allocation other than your default
(standard-priority) allocation, use the -A
option to qsub.
f20-l1:~: qsub -A test-hi test.sh
4194.deepthought.umd.edu
For more information on allocations, including monitoring usage of your allocation, see the section Allocations.
If you want to be notified via email when your job completes, you can
add the -mXX
option to your description file. If you want
to receive mail when the job starts, replace the Xs with the
letter b
. If you want to receive mail when your job
completes, replace the Xs with the letter e
. You
may add both letters if you like, and you'll get two email messages. By
default, you will always be sent email if your job is aborted by the
scheduler for any reason. The completion email will tell you the exit
status of your job as well as the amount of resources the job
consumed. Note that the CPU time and memory usage numbers provided in
this email are unreliable at best. The email messages by default will
be sent to your Glue account. If you'd like them to go elsewhere, you
can add the -M option followed by a comma-separated list
of email addresses.
#PBS -l walltime=00:00:60
#PBS -m be -M bob@myhost.com,jane@yourhost.com

date
If you want your job script to run under a different shell, add the -S
option to your description file. Also note that when
using the bash shell, you must explicitly run your
.profile
script, as it is not run for you automatically.
If you have tap
commands in your submit script, this is
especially important because tap
is defined in
.profile
. If you're using tcsh
you don't
need to worry about this.
The following example changes to using /bin/bash
as the
execution shell.
#PBS -l walltime=00:01:00
#PBS -S /bin/bash

. ~/.profile  # only needed for bash shell
date
hostname
Note that jobs do not start in the directory from which they were submitted: even if you are in
/data/dt-raid5/bob/my_program when you submit your
job, when the job runs, it will look in your home directory for any
files that don't have a full pathname specified. The easiest way to change
this behavior is to add the appropriate cd
command before
any other commands in your job script.
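For a simple (non-MPI) job, a sketch of that approach might be (the program name is a placeholder; $PBS_O_WORKDIR, which the batch system sets to the directory you submitted from, can be used instead of a hard-coded path):
#PBS -l walltime=00:01:00

cd /data/dt-raid5/bob/my_program
# or: cd $PBS_O_WORKDIR
./my_program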
Also note that if you are using MPI, you may also need to add either
the -wd
option for LAM (mpirun
) or the
-wdir
option for MPICH (mpiexec
) to specify the
working directory.
The following example (using LAM) switches the working directory to
/data/dt-raid5/bob/my_program.
#PBS -l walltime=00:01:00

cd /data/dt-raid5/bob/my_program
lamboot $PBS_NODEFILE
mpirun -wd /data/dt-raid5/bob/my_program C alltoall
lamhalt
There is also a -d
option that you can give to qsub
(or add a #PBS -d ...
line in your job description file), but
use of that method is not recommended. It should work for the Lustre file
system, but does not work well with the /data/...
volumes
(qsub appears to expand all symlinks, and breaks the automount mechanism).