This page shows an example of submitting a job that uses a hybrid of both distributed-memory and shared-memory parallelism. In this case, the job uses MPI for the distributed-memory parallelism and multithreading for the shared-memory portion; i.e. it launches a number of MPI tasks, with each task using a number of threads. This hybrid technique is only supported by some codes, and is typically useful when the problem can be decomposed for parallelization in two orthogonal ways.
This page provides an example of submitting such a hybrid code. It is based on the HelloUMD-HybridMPI job template in the OnDemand portal.
This job makes use of a simple Hello World! program called hello-umd, available in the UMD HPC cluster software library, which supports sequential, multithreaded, and MPI modes of operation. The code simply prints an identifying message from each thread of each task.
This example basically consists of a single file, the job script submit.sh (see below for a listing and explanation of the script), which gets submitted to the cluster via the sbatch command.
The script is designed to show many good practices, including: explicitly requesting the tasks, cores, memory, and time the job needs; loading specific versions of the required modules; not exporting the submission environment to the job; running the code in a job-specific work directory; redirecting the code's output to a file; and propagating the application's exit code back to Slurm.
Many of the practices above are rather overkill for such a simple job --- indeed, the vast majority of lines are for these "good practices" rather than the running of the intended code, but are included for educational purposes.
This code runs hello-umd in multiple MPI tasks, with each task consisting of multiple threads, saving the output to a file in the temporary work directory and then copying it back to the submission directory. We could have forgone all that and simply had the output of hello-umd go to standard output, which would be available in the slurm-JOBNUMBER.out file (or whatever file you instructed Slurm to use instead). Doing so is acceptable as long as the code does not produce an excessive amount (many MBs) of output --- if the code produces a lot of output, having it all sent to the Slurm output file can cause problems, and it is better to redirect it to a file.
The submission script submit.sh can be downloaded as plain text. Each portion of the script is discussed below, with short excerpts of the relevant lines.
Like most of our examples, this shebang uses the /bin/bash interpreter, which is the bash (Bourne-again) shell. This is a compatible replacement for, and enhancement of, the original Unix Bourne shell.
You can opt to specify another shell or interpreter if you so desire; common choices are the Bourne shell (/bin/sh) (note that on Linux this basically just uses bash in a restricted mode) or the C shell (/bin/csh or /bin/tcsh). However, we recommend the use of the bash shell, as it has the best support for scripting; this might not matter for most job submission scripts because of their simplicity, but might if you start to need more advanced features. The examples generally use the bash shell for this reason.
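For this example, the first line of the script is simply the bash shebang:

```bash
#!/bin/bash
```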
Comment lines starting with #SBATCH are used to control the Slurm scheduler and will be discussed elsewhere. But other than those cases, feel free to use comment lines to remind yourself (and maybe others reading your script) of what the script is doing.
Lines beginning with #SBATCH can be used to control how the Slurm sbatch command submits the job. Basically, any flag that can be given to sbatch on the command line can instead be provided with a #SBATCH line in the script, and you can mix and match command line options and options in #SBATCH lines. NOTE: any #SBATCH lines must precede any "executable lines" in the script. It is recommended that you have nothing but the shebang line, comments, and blank lines before any #SBATCH lines.
These lines request 3 MPI tasks (--ntasks=3 or -n 3), with each task having 15 CPU cores (--cpus-per-task=15 or -c 15).
Note that we do not specify a number of nodes, and we recommend that you do not specify a node count for most jobs --- by default Slurm will allocate enough nodes to satisfy this job's needs, and if you specify a value which is incorrect it will only cause problems.
We choose 3 MPI tasks of 15 cores as this will usually require multiple nodes on both Deepthought2 and Juggernaut (although some of the larger Juggernaut nodes can support this request on a single node), and so makes a better demonstration, but will still fit in the debug partition.
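The request lines for this example would look like the following (the long and short forms are equivalent; only one is needed):

```bash
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=15
```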
The #SBATCH -t TIME line sets the time limit for the job. The requested TIME value can take any of a number of formats, including: minutes; minutes:seconds; hours:minutes:seconds; days-hours; days-hours:minutes; and days-hours:minutes:seconds.
It is important to set the time limit appropriately. It must be set longer than you expect the job to run, preferably with a modest cushion for error --- when the time limit is up, the job will be canceled.
You do not want to make the requested time excessive, either. Although you are only charged for the actual time used (i.e. if you requested 12 hours and the job finished in 11 hours, your job is only charged for 11, not 12, hours), there are other downsides to requesting too much wall time. Among them, the job may spend more time in the queue, and might not run at all if your account is low on funds (the scheduler uses the requested wall time to estimate the number of SUs the job will consume, and will not start a job unless it and all currently running jobs are projected to have sufficient SUs to complete). And if it does start, an excessive walltime might block other jobs from running for a similar reason.
In general, you should estimate the maximum run time, and pad it by 10% or so.
In this case, hello-umd will run very quickly, taking much less than 5 minutes.
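A short limit is therefore ample for this job; a sketch of the line (the exact value used in the template may differ):

```bash
#SBATCH -t 5       # 5 minutes of walltime
```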
There are several parameters you can give to Slurm/sbatch to specify the
memory to be allocated for the job. It is recommended that you always include
a memory request for your job --- if omitted it will default to 6GB per CPU
core. The recommended way to request memory is with the
--mem-per-cpu=N
flag. Here N is in MB.
This will request N MB of RAM for each CPU core allocated to the job.
Since you often wish to ensure each process in the job has sufficient memory,
this is usually the best way to do so.
An alternative is the --mem=N flag. This sets the maximum memory used per node. Again, N is in MB. This could be useful for single node jobs, especially multithreaded jobs, as there is only a single node and threads generally share significant amounts of memory.
But for MPI jobs the --mem-per-cpu
flag is usually more
appropriate and convenient.
For MPI codes, we recommend using --mem-per-cpu
instead of
--mem
since you generally wish to ensure each MPI task has
sufficient memory.
hello-umd does not use much memory, so 1 GB per core is plenty.
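A sketch of the memory request for this job (1 GB, i.e. 1024 MB, per core):

```bash
#SBATCH --mem-per-cpu=1024
```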
The lines #SBATCH --share, #SBATCH --oversubscribe, or #SBATCH --exclusive decide whether or not other jobs are able to run on the same node(s) as your job.
NOTE: The Slurm scheduler changed the name of the
flag for "shared" mode. The proper flag is now
#SBATCH --oversubscribe
. You must use the "oversubscribe"
flag on Juggernaut. You can currently use either form on Deepthought2, but
the "#SBATCH --share
form is deprecated and at some point will
no longer be supported. Both forms effectively do the same thing.
In exclusive mode, no other jobs are able to run on a node allocated to your job while your job is running. This greatly reduces the possibility of another job interfering with the running of your job. But if you are not using all of the resources of the node(s) your job is running on, it is also wasteful of resources. In exclusive mode, we charge your job for all of the cores on the nodes allocated to your job, regardless of whether you are using them or not.
In share/oversubscribe mode, other jobs (including those of other users) may run on the same node as your job as long as there are sufficient resources for both. We make efforts to try to prevent jobs from interfering with each other, but such methods are not perfect, so while the risk of interference is small, it is much greater risk in share mode than in exclusive mode. However, in share mode you are only charged for the requested number of cores (not all cores on the node unless you requested such), and your job might spend less time in the queue (since it can avail itself of nodes which are in use by other jobs but have some unallocated resources).
Our recommendation is that large (many-core/many-node) and/or long running jobs use exclusive mode, as the potential cost of adverse interference is greatest there. Plus, large jobs tend to use most if not all cores of most of the nodes they run on, so the cost of exclusive mode tends to be less. Smaller jobs, and single core jobs in particular, generally benefit from share/oversubscribe mode, as they tend to use the nodes they run on less fully (indeed, on a standard Deepthought2 node, a single core job will only use 5% of the CPU cores).
Unless you specify otherwise, the cluster defaults single core jobs to share mode and multicore/multinode jobs to exclusive mode. This is not an ideal choice, and might change in the future. We recommend that you always explicitly request either share/oversubscribe or exclusive as appropriate.
Again, since this is a multi-core job, #SBATCH --exclusive is the default, but we recommend explicitly stating this.
This example requests the debug partition. For real production work, the debug queue is probably not adequate, in which case it is recommended that you just omit this line and let the scheduler select an appropriate partition for you.
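A sketch of these two lines as they might appear in the script:

```bash
#SBATCH --exclusive
#SBATCH --partition=debug
```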
The #SBATCH --export=NONE line instructs sbatch not to let the job process inherit the environment of the process which invoked the sbatch command. This requires the job script to explicitly set up its required environment, as it can no longer depend on the environmental settings you had when you ran the sbatch command. While this may require a few more lines in your script, it is a good practice and improves the reproducibility of the job script --- without this, it is possible the job would only run correctly if you had a certain module loaded or variable set when you submitted the job.
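The line in question:

```bash
#SBATCH --export=NONE
```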
These lines ensure that the module command is available in your script. They are generally only required if the shell specified in the shebang line does not match your default login shell, in which case the proper startup files likely did not get invoked. The unalias line is to ensure that there is no vestigial tap command. It is sometimes needed on RHEL6 systems, and should not be needed on the newer platforms, but is harmless when not needed. The remaining lines read in the appropriate dot files for the bash shell --- the if, then, elif construct enables this script to work on both the Deepthought2 and Juggernaut clusters, which have slightly different names for the bash startup file.
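A rough sketch of such a block, assuming typical startup file locations (the exact file names sourced on each cluster may differ):

```bash
# Make sure the module command is defined, even if the shebang shell
# differs from your login shell.
unalias tap >& /dev/null            # remove any vestigial tap alias; harmless if absent
if [ -f ~/.bash_profile ]; then     # file names here are assumptions for illustration
    . ~/.bash_profile
elif [ -f ~/.profile ]; then
    . ~/.profile
fi
```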
The next line sets the environment variable SLURM_EXPORT_ENV to the value ALL, which causes the environment to be shared with other processes spawned by Slurm commands (which also includes mpirun and similar).
At first this might seem to contradict our recommendation to
use #SBATCH --export=NONE
, but it really does not.
The #SBATCH --export=NONE
setting will cause the
job script not to inherit the environment of
the shell in which you ran the sbatch
command.
But we are now in the job script, which, because of the --export=NONE flag, has its own environment which was set up in the script. We want this environment to
be shared with other MPI tasks and processes spawned by this
job. These MPI tasks and processes will inherit the environment
set up in this script, not the environment from which the
sbatch
command ran.
This is important for MPI jobs like this, because otherwise the code spawned by mpirun might not start properly.
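The corresponding line in the script:

```bash
export SLURM_EXPORT_ENV=ALL
```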
To begin with, we do a module purge
to clear out any previously loaded
modules. This prevents them from interfering with subsequent module loads. Then we load
the default module for the cluster with module load hpcc/deepthought2
; this line
should be adjusted for the cluster being used (e.g. module load hpcc/juggernaut
for the Juggernaut cluster).
We then load the desired compiler and MPI library, and finally the hello-umd module. We recommend that you always load the compiler module first, then (if needed) the MPI library, and then any higher level applications. Many packages have different builds for different compilers, MPI libraries, etc., and the module command is smart enough to load the correct versions of these. Note that we specify the versions; if you omit the version, the module command will usually try to load the most recent version installed.
We recommend that you always specify the specific version you want in your job scripts --- this makes your job more reproducible. Systems staff may add newer versions of existing packages without notification, and if you do not specify a version, the default version may change without your expecting it. In particular, a job that runs fine today using today's default version might crash unexpectedly when you try running it again in six months because the packages it uses were updated and your inputs are not compatible with the new version of the code.
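A sketch of the module loads for this job on Deepthought2 (the versions shown match the sample output later on this page; adjust the hpcc module for Juggernaut):

```bash
module purge
module load hpcc/deepthought2    # use hpcc/juggernaut on the Juggernaut cluster
module load gcc/8.4.0            # compiler first
module load openmpi/3.1.5        # then the MPI library
module load hello-umd/1.5        # then the application
```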
Next we create a work directory for the job. The /tmp filesystem is specific to a single node, so it is usually not suitable for MPI jobs. The lustre filesystem is accessible by all of the compute nodes of the cluster, so it is a good choice for MPI jobs.
The TMPWORKDIR="/lustre/$USER/ood-job.${SLURM_JOBID}" or similar line defines an environmental variable containing the name of our chosen work directory. The ${SLURM_JOBID} references another environmental variable which is automatically set by Slurm (when the job starts to run) to the job number for this job --- using this in our work directory names helps ensure it will not conflict with any other job. The mkdir command creates this work directory, and the cd changes our working directory to that directory --- note in those last commands the use of the environmental variable we just created to hold the directory name.
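Put together, these lines might look like:

```bash
TMPWORKDIR="/lustre/${USER}/ood-job.${SLURM_JOBID}"   # job-specific work directory
mkdir "${TMPWORKDIR}"
cd "${TMPWORKDIR}"
```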
Next, the script prints the values of the environment variables SLURM_JOBID, SLURM_NTASKS, SLURM_JOB_NUM_NODES, and SLURM_JOB_NODELIST, which are set by Slurm at the start of the job to list the job number, the number of MPI tasks, the number of nodes,
and the names of the nodes allocated to the job. It also prints the time and date that
the job started (the date
command), the working directory (the
pwd
command), and the list of loaded modules (the module list
command). Although you are probably not interested in any of that information if the
job runs as expected, they can often be helpful in diagnosing why things did not work
as expected.
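A sketch of this diagnostic section (the exact wording of the echo messages is illustrative):

```bash
echo "Slurm job ${SLURM_JOBID} running ${SLURM_NTASKS} tasks on ${SLURM_JOB_NUM_NODES} nodes"
echo "Nodes allocated: ${SLURM_JOB_NODELIST}"
date
pwd
module list
```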
The script then finds the full path to the hello-umd command, stores it in an environmental variable named MYEXE, and then outputs the path for added diagnostics. We find that MPI jobs run better when
you provide the absolute path to the executable to the mpirun
or
similar command.
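For example, something like:

```bash
MYEXE=`which hello-umd`            # full path to the executable
echo "Using executable: ${MYEXE}"
```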
We then run the mpirun command with the absolute path to our hello-umd executable as the first argument.
Since we want each MPI task to run in multithreaded mode, we add the
-t $SLURM_CPUS_PER_TASK
arguments to
hello-umd
, so that it will
use as many threads as we requested.
Note that we do not specify the number of tasks to mpirun, and instead rely on mpirun defaulting to the number of tasks requested from Slurm (-n). We also do not specify a fixed number of threads, but instead use the Slurm defined environment variable SLURM_CPUS_PER_TASK, which holds the number of cores requested per task (-c), when giving hello-umd the number of threads to run. (We
could have given -t 0
, but then we might have been surprised to
see more threads than we expected --- because we are running in
exclusive mode, all of the
CPU cores
on the
node
will be assigned to the
job, and if there are more cores on the node than required for the tasks on the node, using -t 0 will use those cores as well. So we
recommend giving -t $SLURM_CPUS_PER_TASK
for this case.)
In general, we recommend you use such shortcuts and/or Slurm-set environmental variables (like SLURM_NTASKS and SLURM_CPUS_PER_TASK) to reduce the chance of inconsistencies. I.e., if you were to explicitly give a number of tasks to mpirun (and not use SLURM_NTASKS), or explicitly give a number of threads to hello-umd (not using SLURM_CPUS_PER_TASK), and you experiment with different task/thread counts, then it is easy to forget to update the changes in both places, resulting in discrepancies between what the scheduler allocated and what the job expects.
This can cause the job to immediately fail, or even worse, to waste resources
(if you end up requesting more resources than are used) or run very slowly
(if e.g. more threads are requested than the available
CPU cores
).
We run the code so as to save the output in the file
hello.out
in the current working directory.
The > operator does output redirection, meaning that all of the standard output goes to the specified file (hello.out in this case). The 2>&1 operator causes the standard error output to be sent to the standard output stream (1 is the number for the standard output stream), and since standard output was redirected to the file, so will the standard error be.
For this simple case, we could have omitted the redirection of standard output and standard error, in which case any such output would end up in the Slurm output file (usually named slurm-JOBNUMBER.out).
However, if your job produces a lot (many MBs) of output to standard
output/standard error, this can be problematic. It is good practice
to redirect output if you know your code will produce more than 1000 or so
lines of output.
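Putting it together, the line that actually runs the code might look something like:

```bash
mpirun ${MYEXE} -t ${SLURM_CPUS_PER_TASK} > hello.out 2>&1
```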
The special shell variable $?
stores the exit code from the last command.
Normally it will be 0 if the command was successful, and non-zero otherwise. But it only
works for the last command, so we save it in the variable ECODE
.
The next lines echo the exit code stored in ECODE and print the date/time of completion using the date command.
Finally, the script exits with the exit code stored in ECODE. This means that the script will have the same exit code as the application, which will allow Slurm to better determine if the job was successful or not. (If we did not do this, the exit code of the script would be the exit code of the last command that ran, in this case the date command, which should never fail. So even if your application aborted, the script would return a successful (0) exit code, and Slurm would think the job succeeded.)
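A sketch of this final section of the script (the echo wording is illustrative):

```bash
ECODE=$?                  # must immediately follow the mpirun command
echo "Job finished with exit code $ECODE"
date
exit $ECODE               # return the application's exit code to Slurm
```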
We recommend that your job scripts end with one or more blank lines. The reason for this is that if the last line does not have the proper line termination character, it will be ignored by the shell. Over the years, we have had many users confused as to why their job ended as soon as it started without error, etc. --- it turns out the last line of their script was the line which actually ran their code, and it was missing the correct line termination character. Therefore, the job ran, did some initialization and module loads, and exited without running the command they were most interested in, because of a missing line termination character (which can be easily overlooked).
This problem most frequently occurs when transferring files between Unix/Linux and Windows operating systems. While there are utilities that can add the correct line termination characters, the easy solution in our opinion is to just add one or more blank lines at the end of your script --- if the shell ignores the blank lines, you do not care.
The easiest way to run this example is with the
Job Composer of
the OnDemand portal, using
the HelloUMD-HybridMPI
template.
To submit from the command line, just run sbatch submit.sh. This will submit the job
to the scheduler, and should return a message like
Submitted batch job 23767
--- the number will vary (and is the
job number for this job). The job number can be used to reference
the job in Slurm, etc. (Please always give the job number(s) when requesting
help about a job you submitted).
Whichever method you used for submission, the job will be queued for the debug partition and should run within 15 minutes or so. When it finishes running, the slurm-JOBNUMBER.out file should contain
the output from our diagnostic commands (time the job started, finished,
module list, etc). The output of the hello-umd
will be in
the file hello.out
in the directory from which you submitted
the job. If you used OnDemand, these files will appear listed in the Folder contents section on the right.
The hello.out file should look something like:
hello-umd: Version 1.5
Built for compiler: gcc/8.4.0
with MPI support (using MPI library openmpi/3.1.5)
Hello UMD from thread 0 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 2 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 12 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 9 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 4 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 10 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 1 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 3 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 11 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 8 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 13 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 6 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 14 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 7 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 5 of 15, task 0 of 3 (pid=227028 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 10 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 7 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 9 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 2 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 5 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 1 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 8 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 11 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 12 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 13 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 6 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 14 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 3 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 4 of 15, task 1 of 3 (pid=227031 on host compute-10-0.juggernaut.umd.edu
Hello UMD from thread 0 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 6 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 12 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 2 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 11 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 14 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 3 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 8 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 1 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 5 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 7 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 9 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 10 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 13 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
Hello UMD from thread 4 of 15, task 2 of 3 (pid=118719 on host compute-10-1.juggernaut.umd.edu
There should be two lines (one with version, one with compiler) identifying the
hello-umd
command, followed by 45 messages, one from each pair
of thread number 0 to 14 and task number 0 to 2. These lines can appear in
any order due to the parallel nature of the code. For any particular
task number, the pid and the hostname should be the same, but the pids should
be different for different task ids (and will likely not match the pids shown
above). The hostnames should be different between tasks if run on the Deepthought2 cluster, but might not be on Juggernaut (since most Juggernaut nodes have at least 30 cores, and some have over 45, and so can fit 2 or even 3 of our 15-core tasks on a single node).