Many codes support multithreading as a means of parallelism: standards like OpenMP generally make it easier to program than other parallel paradigms, and even standard workstations and laptops can see performance improvements from multithreading.
Multithreading is a form of shared-memory parallelism, so all of the threads must run on the same node in order to communicate through the same shared memory.
This page provides an example of submitting such a multithreaded job. It is based on the HelloUMD-Multithreaded job template in the OnDemand portal.
This job makes use of a simple Hello World! program called hello-umd, available in the UMD HPC cluster software library, which supports sequential, multithreaded, and MPI modes of operation. The code simply prints an identifying message from each thread of each task --- for this simple multithreaded case only a single task will be used, but the task will have 10 threads. The scheduler will always allocate all of the CPU cores for a specific task on the same node, which will satisfy the shared-memory requirement.
This example basically consists of a single file, the job script submit.sh (discussed in detail below), which gets submitted to the cluster via the sbatch command.
The script is designed to show many good practices. Many of these practices are rather overkill for such a simple job --- indeed, the vast majority of lines are for these "good practices" rather than for running the intended code, but they are included for educational purposes.
This code runs hello-umd in multithreaded mode, saving the output to a file in the temporary work directory and then copying it back to the submission directory. We could have forgone all that and simply had the output of hello-umd go to standard output, which would be available in the slurm-JOBNUMBER.out file (or whatever file you instructed Slurm to use instead). Doing so is acceptable as long as the code is not producing an excessive amount (many MBs) of output --- if the code produces a lot of output, having it all sent to the Slurm output file can cause problems, and it is better to redirect to a file.
The submission script submit.sh can be downloaded as plain text. We discuss the script, section by section, below.
Like most of our examples, this shebang uses the /bin/bash interpreter, which is the bash (Bourne-again) shell. This is a compatible replacement for, and enhancement of, the original Unix Bourne shell.
You can opt to specify another shell or interpreter if you so desire; common choices are the original Bourne shell (/bin/sh) in your shebang (note that this basically just uses bash in a restricted mode), or one of the C shells (/bin/csh or /bin/tcsh).
However, we recommend the use of the bash shell, as it has better support for scripting; this might not matter for most job submission scripts because of their simplicity, but it might if you start to need more advanced features. The examples generally use the bash shell for this reason.
Lines beginning with #SBATCH are special comments used to control the Slurm scheduler and are discussed below.
But other than the cases above (the shebang line and #SBATCH lines), feel free to use comment lines to remind yourself (and maybe others reading your script) of what the script is doing.
Lines beginning with #SBATCH can be used to control how the Slurm sbatch command submits the job. Basically, any sbatch command line flags can be provided with a #SBATCH line in the script, and you can mix and match command line options and options in #SBATCH lines. NOTE: any #SBATCH lines must precede any "executable lines" in the script. It is recommended that you have nothing but the shebang line, comments and blank lines before any #SBATCH lines.
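For illustration, here is a minimal sketch of how the top of such a script might be laid out (the particular directives and values shown here are placeholders, not necessarily exactly those used by the template):

```bash
#!/bin/bash
# Comments and #SBATCH directives may appear here, but no executable lines yet.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH -t 5:00

# First executable line; sbatch ignores any #SBATCH directives after this point.
echo "Starting job ${SLURM_JOBID}"
```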
These lines request a single task (--ntasks=1 or -n 1) with 10 threads (--cpus-per-task=10 or -c 10).
The scheduler will place all of the cores for a single task on the same node, which is what we need for shared-memory parallelism techniques like multithreading.
Although multithreaded processes can in theory run on fewer cores than they have threads, in such cases you do not get the full parallelism benefit (as some threads will be waiting until a CPU core becomes available after another thread finishes). In general, for high-performance computing, you want a separate CPU core for each thread to ensure maximal performance.
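For example, either of the following equivalent forms requests a single task with 10 CPU cores (a sketch; the lines beginning with ## are ignored by sbatch and are shown only to illustrate the short forms):

```bash
# Long form
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10

# Equivalent short form (disabled here with a double hash)
##SBATCH -n 1
##SBATCH -c 10
```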
The #SBATCH -t TIME line sets the time limit for the job. The requested TIME value can take any of a number of formats, including: M (minutes), M:S (minutes:seconds), H:M:S (hours:minutes:seconds), D-H (days-hours), D-H:M, and D-H:M:S.
It is important to set the time limit appropriately. It must be set longer than you expect the job to run, preferably with a modest cushion for error --- when the time limit is up, the job will be canceled.
You do not want to make the requested time excessive, either. Although you are only charged for the actual time used (i.e. if you requested 12 hours and the job finished in 11 hours, your job is only charged for 11, not 12, hours), there are other downsides of requesting too much wall time. Among them, the job may spend more time in the queue, and might not run at all if your account is low on funds (the scheduler uses the requested wall time to estimate the number of SUs the job will consume, and will not start a job unless it and all currently running jobs are projected to have sufficient SUs to complete). And if it does start, an excessive walltime might block other jobs from running for a similar reason.
In general, you should estimate the maximum run time, and pad it by 10% or so.
In this case, hello-umd will run very quickly; much less than 5 minutes.
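For example, each of the following lines requests a five minute time limit in a different format (only one would actually be used; the extra hash marks disable the duplicates):

```bash
#SBATCH -t 5           # 5 minutes
##SBATCH -t 5:00       # 5 minutes (minutes:seconds)
##SBATCH -t 0:05:00    # 5 minutes (hours:minutes:seconds)
##SBATCH -t 0-0:05     # 5 minutes (days-hours:minutes)
```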
There are several parameters you can give to Slurm/sbatch to specify the
memory to be allocated for the job. It is recommended that you always include
a memory request for your job --- if omitted it will default to 6GB per CPU
core. The recommended way to request memory is with the
--mem-per-cpu=N
flag. Here N is in MB.
This will request N MB of RAM for each CPU core allocated to the job.
Since you often wish to ensure each process in the job has sufficient memory,
this is usually the best way to do so.
An alternative is the --mem=N flag. This sets the maximum memory use per node. Again, N is in MB. This can be useful for single node jobs, especially multithreaded jobs, as there is only a single node and threads generally share significant amounts of memory.
But for MPI jobs the --mem-per-cpu
flag is usually more
appropriate and convenient.
We request 10 GB of memory for the job, which is well more than this simple hello world code needs. We could have instead used something like #SBATCH --mem-per-cpu=1024 to request 1 GB per CPU core. Since this is a multithreaded job using 10 threads, that would also have resulted in requesting 10 GB of RAM.
However, for multithreaded jobs, the memory use generally tends to be independent of the number of threads, so specifying the total memory needed is usually more convenient.
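As a sketch, with 10 CPU cores allocated either of the following amounts to roughly 10 GB for the job (the exact value in the template may differ):

```bash
# Total memory for the job, in MB; convenient for multithreaded jobs
#SBATCH --mem=10240

# Equivalent here: 1 GB per core times 10 cores (disabled with a double hash)
##SBATCH --mem-per-cpu=1024
```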
The lines #SBATCH --share, #SBATCH --oversubscribe, or #SBATCH --exclusive decide whether or not other jobs are able to run on the same node(s) as your job.
NOTE: The Slurm scheduler changed the name of the flag for "shared" mode. The proper flag is now #SBATCH --oversubscribe. You must use the "oversubscribe" flag on Juggernaut. You can currently use either form on Deepthought2, but the #SBATCH --share form is deprecated and at some point will no longer be supported. Both forms effectively do the same thing.
In exclusive mode, no other jobs are able to run on a node allocated to your job while your job is running. This greatly reduces the possibility of another job interfering with the running of your job. But if you are not using all of the resources of the node(s) your job is running on, it is also wasteful of resources. In exclusive mode, we charge your job for all of the cores on the nodes allocated to your job, regardless of whether you are using them or not.
In share/oversubscribe mode, other jobs (including those of other users) may run on the same node as your job as long as there are sufficient resources for both. We make efforts to try to prevent jobs from interfering with each other, but such methods are not perfect, so while the risk of interference is small, it is much greater in share mode than in exclusive mode. However, in share mode you are only charged for the requested number of cores (not all cores on the node, unless you requested such), and your job might spend less time in the queue (since it can avail itself of nodes which are in use by other jobs but have some unallocated resources).
Our recommendation is that large (many-core/many-node) and/or long running jobs use exclusive mode, as the potential cost of adverse interference is greatest there. Plus, large jobs tend to use most if not all cores of most of the nodes they run on, so the cost of exclusive mode tends to be less. Smaller jobs, and single core jobs in particular, generally benefit from share/oversubscribe mode, as they tend to use the nodes they run on less fully (indeed, on a standard Deepthought2 node, a single core job will only use 5% of the CPU cores).
Unless you specify otherwise, the cluster currently defaults single core jobs to share mode, and multicore/multinode jobs to exclusive mode. This is not an ideal choice, and might change in the future. We recommend that you always explicitly request either share/oversubscribe or exclusive as appropriate.
Since this small multithreaded job uses only a fraction of a node, we use #SBATCH --oversubscribe. Note that as a multicore job the default would be exclusive mode, so this needs to be (and, per our recommendation above, should always be) stated explicitly.
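For illustration:

```bash
# Allow other jobs to share the node(s) allocated to this job
#SBATCH --oversubscribe

# Or, to get the node(s) to yourself (and be charged for all of their cores):
##SBATCH --exclusive
```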
This line requests the debug partition, which is intended for short test jobs like this one. For real production work, the debug queue is probably not adequate, in which case it is recommended that you just omit this line and let the scheduler select an appropriate partition for you.
This line tells sbatch not to let the job process inherit the environment of the process which invoked the sbatch command. This requires the job script to explicitly set up its required environment, as it can no longer depend on environmental settings you had when you ran the sbatch command. While this may require a few more lines in your script, it is a good practice and improves the reproducibility of the job script --- without this, it is possible the job would only run correctly if you had a certain module loaded or variable set when you submitted the job.
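The directive itself is simply the following; the same flag could instead be given on the command line (e.g. sbatch --export=NONE submit.sh):

```bash
#SBATCH --export=NONE
```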
These lines ensure that the module command is available in your script. They are generally only required if the shell specified in the shebang line does not match your default login shell, in which case the proper startup files likely did not get invoked.
The unalias line is to ensure that there is no vestigial tap command. It is sometimes needed on RHEL6 systems and should not be needed on the newer platforms, but it is harmless when not needed. The remaining lines read in the appropriate dot files for the bash shell --- the if, then, elif construct enables this script to work on both the Deepthought2 and Juggernaut clusters, which have slightly different names for the bash startup file.
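A sketch of what this block might look like; the exact startup file names tested for are assumptions here, as they differ between the clusters:

```bash
# Remove any vestigial 'tap' alias; harmless if it is not defined
unalias tap >& /dev/null

# Source a bash startup file so that the 'module' command is defined
# (file names below are illustrative; they vary between the clusters)
if [ -f ~/.bash_profile ]; then
    . ~/.bash_profile
elif [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
```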
This line sets the environmental variable SLURM_EXPORT_ENV to the value ALL, which causes the environment to be shared with other processes spawned by Slurm commands (which also includes mpirun and similar).
At first this might seem to contradict our recommendation to
use #SBATCH --export=NONE
, but it really does not.
The #SBATCH --export=NONE
setting will cause the
job script not to inherit the environment of
the shell in which you ran the sbatch
command.
But we are now in the job script which, because of the --export=NONE flag, has its own environment that was set up in the script. We want this environment to be shared with other MPI tasks and processes spawned by this job. These MPI tasks and processes will inherit the environment set up in this script, not the environment from which the sbatch command was run.
This really is not needed for a simple single-task job like this, since there are no additional MPI tasks, etc. being spawned. But it is a good habit.
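The line in question is simply:

```bash
# Share this script's environment with processes started by Slurm commands
export SLURM_EXPORT_ENV=ALL
```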
To begin with, we do a module purge
to clear out any
previously loaded modules. This prevents them from interfering with
subsequent module loads. Then we load the default module for the cluster
with module load hpcc/deepthought2
; this line
should be adjusted for the cluster being used (e.g.
module load hpcc/juggernaut
for the Juggernaut cluster).
Finally, the line module load hello-umd/1.5
loads the correct
version of the hello-umd
application. Note that we specify
the version; if that is omitted the module command will usually try to
load the most recent version installed. We recommend that you always
specify the specific version you want in your job scripts --- this makes
your job more reproducible. Systems staff may add newer versions of
existing packages without notification, and if you do not specify a version,
the default version may change without your expecting it. In particular,
a job that runs fine today using today's default version might crash
unexpectedly when you try running it again in six months because the
packages it uses were updated and your inputs are not compatible with the
new version of the code.
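Putting the module commands together, the sequence looks something like this sketch (adjust the hpcc module for your cluster):

```bash
module purge                    # start from a clean module environment
module load hpcc/deepthought2   # cluster default module (hpcc/juggernaut on Juggernaut)
module load hello-umd/1.5       # load a specific version for reproducibility
```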
These lines set up a temporary work directory for the job; /tmp is a good choice. /tmp is a directory on Unix systems where all users can write temporary files. On the compute nodes, /tmp will be cleaned after every job runs, so it is a temporary file system and we need to remember to copy any files we wish to retain someplace where they will not be automatically deleted.
The TMPWORKDIR="/tmp/ood-job.${SLURM_JOBID}" line defines
an environmental variable containing the name of our chosen work directory.
The ${SLURM_JOBID}
references another environmental variable
which is automatically set by Slurm (when the job starts to run) to the
job number for this job --- using this in our work directory names
helps ensure it will not conflict with any other job. The
mkdir
command creates this work directory, and the
cd
changes our working directory to that directory---
note in those last commands the use of the environmental variable we
just created to hold the directory name.
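A sketch of these lines:

```bash
# Per-job work directory under /tmp; ${SLURM_JOBID} keeps the name unique
TMPWORKDIR="/tmp/ood-job.${SLURM_JOBID}"
mkdir "${TMPWORKDIR}"
cd "${TMPWORKDIR}"
```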
These lines print the values of the environmental variables SLURM_JOBID, SLURM_NTASKS, SLURM_JOB_NUM_NODES, and SLURM_JOB_NODELIST, which are set by Slurm at the start of the job to list the job number, the number of MPI tasks, the number of nodes, and the names of the nodes allocated to the job. The script also prints the time and date that the job started (the date command), the working directory (the pwd command), and the list of loaded modules (the module list command). Although you are probably not interested in any of that information if the job runs as expected, it can often be helpful in diagnosing why things did not work as expected.
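The diagnostic output could be produced with lines along the following lines (the exact wording of the echo statements is illustrative):

```bash
echo "Slurm job ${SLURM_JOBID} running ${SLURM_NTASKS} task(s)"
echo "on ${SLURM_JOB_NUM_NODES} node(s): ${SLURM_JOB_NODELIST}"
echo "Job started at: `date`"
echo "Working directory: `pwd`"
module list
```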
This line runs the hello-umd command with the argument -t 0. As per the man page for hello-umd, this causes it to use as many threads as CPUs are available, which when run in a job like this will result in 10 threads being used. We could have also done this using the argument -t $SLURM_CPUS_PER_TASK, where the environmental variable $SLURM_CPUS_PER_TASK is set by Slurm at the start of the job to be equal to the value we gave for --cpus-per-task.
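A sketch of the line that runs the code, redirecting its output to hello.out in the work directory as described above (either form of the -t argument works; whether the template also captures standard error is an assumption here):

```bash
hello-umd -t 0 > hello.out 2>&1
# or, equivalently for this job:
# hello-umd -t ${SLURM_CPUS_PER_TASK} > hello.out 2>&1
```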
In general, we recommend that you use such shortcuts (either setting the value for -t to 0 or to the value of the variable Slurm sets) to avoid inconsistencies in your scripts. E.g., if you explicitly gave -t 10 in this script, and later experiment with a different number of threads, it is easy to forget to change the value in some place, resulting in a discrepancy between the number of cores requested from Slurm and the number of threads being run. If you request more cores than the number of threads being used, you waste CPU resources. If you request fewer cores than threads being used, the code will likely still run but performance will be significantly degraded. So for best efficiency, we recommend avoiding having to specify any setting more than once wherever possible, to avoid potential discrepancies.
The special shell variable $?
stores the exit code from the last command.
Normally it will be 0 if the command was successful, and non-zero otherwise. But it only
works for the last command, so we save it in the variable ECODE
.
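For example, immediately after the hello-umd command:

```bash
# Save hello-umd's exit code before any other command overwrites $?
ECODE=$?
```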
Remember that the /tmp directory on the compute nodes will be erased after your job completes. So we need to copy any important files somewhere safe before the job ends. In this case, the only important
file is hello.out
, which we copy back to the directory
from which the sbatch
command was run (which is stored in
the environmental variable SLURM_SUBMIT_DIR
by Slurm when
the job starts).
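For example:

```bash
# Copy results back to the directory from which sbatch was run
cp hello.out "${SLURM_SUBMIT_DIR}"
```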
The script then prints the exit code of the command, which we saved in ECODE, and then prints the date/time of completion using the date command.
Finally, the script exits with the exit code stored in ECODE. This means that the script will have the same exit code as the application, which will allow Slurm to better determine if the job was successful or not. (If we did not do this, the exit code of the script would be the exit code of the last command that ran, in this case the date command, which should never fail. So even if your application aborted, the script would return a successful (0) exit code, and Slurm would think the job succeeded.)
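The closing lines of the script then look something like the following sketch (the wording of the echo is illustrative):

```bash
echo "Job finished with exit code ${ECODE} at `date`"
exit ${ECODE}
```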
We also recommend ending your script with one or more blank lines. The reason for this is that if the last line does not have the proper line termination character, it will be ignored by the shell. Over the years, we have had many users confused as to why their job ended as soon as it started without error, etc. --- it turns out the last line of their script was the line which actually ran their code, and it was missing the correct line termination character. Therefore, the job ran, did some initialization and module loads, and exited without running the command they were most interested in because of a missing line termination character (which can be easily overlooked).
This problem most frequently occurs when transferring files between Unix/Linux and Windows operating systems. While there are utilities that can add the correct line termination characters, the easy solution in our opinion is to just add one or more blank lines at the end of your script --- if the shell ignores the blank lines, you do not care.
The easiest way to run this example is with the
Job Composer of
the OnDemand portal, using
the HelloUMD-Multithreaded
template.
To submit from the command line, just run sbatch submit.sh. This will submit the job
to the scheduler, and should return a message like
Submitted batch job 23767
--- the number will vary (and is the
job number for this job). The job number can be used to reference
the job in Slurm, etc. (Please always give the job number(s) when requesting
help about a job you submitted).
Whichever method you used for submission, the job will be queued for the
debug partition and should run within 15 minutes or so. When it finishes
running, the slurm-JOBNUMBER.out file should contain the output from our diagnostic commands (time the job started and finished, module list, etc.). The output of the hello-umd command will be in the file hello.out in the directory from which you submitted the job. If you used OnDemand, these files will appear listed in the Folder contents section on the right.
The hello.out file should look something like:
hello-umd: Version 1.5
Built for compiler: gcc/8.4.0
Hello UMD from thread 7 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 0 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 9 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 2 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 3 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 4 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 8 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 6 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 1 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
Hello UMD from thread 5 of 0, task 0 of 1 (pid=55936 on host compute-0-0.juggernaut.umd.edu
There should be two lines (one with version, one with compiler) identifying the
hello-umd
command, followed by 10 messages, one from each thread 0 to 9
of task 0. They should all have the same pid and hostname, although the pid and hostname
for your job will likely differ from above.