While the vast majority of distributed memory jobs use MPI to handle any required communication among the tasks, there are occasional exceptions. We present two possible methods of launching such jobs on this page.
We actually consider hybrid distributed memory and shared memory parallelism as the most general case (if you wish to adapt this to a distributed memory only case, the examples are largely the same except that the number of threads would be set to 1). We use the non-MPI, multi-threaded version of hello-umd as our code. This is cheating a bit, as it really does not do any distributed memory communication, so we are basically running a bunch of distinct multi-threaded codes. However, even the MPI-enabled version of hello-umd really does not do any inter-task communication, so it is only a bit of a stretch. A real code would need to implement some means for the tasks to communicate with each other.
We present two submission scripts: one which launches the tasks with the Slurm srun command, and one which launches them via ssh.
In both cases, we basically just invoke 3 instances of hello-umd
each with 15 threads, using the same sbatch
parameters that we
would use if this were a
hybrid
MPI
and
multi-threaded
code. We
chose this combination of instances and threads as this will require multiple
nodes on the Deepthought2 cluster but still fit in the debug queue, which
makes for a better example. It might or might not require multiple nodes on
Juggernaut, as some nodes have quite a few cores. As mentioned above, these
are actually three independent processes, but we pretend that it is
a single job communicating over something other than
MPI
. We change the default
message from hello-umd
for each "task" to make it clear which
task the message is coming from.
The hello-umd
example code is just a variant of a
Hello World! program.
Both examples basically consist of a job submission script submit.sh which gets submitted to the cluster via the sbatch command. The srun example also includes a small wrapper script around the hello-umd command to change the default message for each task. Listings and explanations of these scripts are given below.
The script is designed to show many good practices.
Many of the practices above are rather overkill for such a simple job --- indeed, the vast majority of lines are for these "good practices" rather than the running of the intended code, but are included for educational purposes.
srun
command
For this case, we use the Slurm srun
command to
spawn the different instances of hello-umd
. Being a
Slurm command, srun already knows which nodes were allocated to the job (and how many tasks are allocated to each node). In order to change the message printed by the hello-umd
command for each instantiation of the code, we use a small helper script.
The submission script submit.sh can be downloaded as plain text. A sketch of the script is shown below, followed by a discussion of its main parts:
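Below is a minimal sketch of what such a submit.sh looks like, assembled from the discussion that follows (download the plain-text version above for the actual script). The openmpi module version, the exact startup file names, and the location of hello-wrapper.sh are assumptions; the gcc and hello-umd versions match the sample output shown later on this page.

```bash
#!/bin/bash
# Minimal sketch of the srun-based submit.sh (see discussion below).
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=15
#SBATCH -t 5
#SBATCH --mem-per-cpu=1024
#SBATCH --exclusive
#SBATCH --partition=debug
#SBATCH --export=NONE

# Make sure the module command is available even if our login shell is not bash
unalias tap 2> /dev/null
if [ -f ~/.bash_profile ]; then
        . ~/.bash_profile
elif [ -f ~/.profile ]; then
        . ~/.profile
fi

# Share this script's environment with processes spawned by Slurm commands
export SLURM_EXPORT_ENV=ALL

# Set up the software environment (use hpcc/juggernaut on Juggernaut;
# the openmpi version below is a placeholder)
module purge
module load hpcc/deepthought2
module load gcc/8.4.0
module load openmpi/3.1.5
module load hello-umd/1.5

# Create a job-specific work directory on the lustre filesystem
TMPWORKDIR="/lustre/$USER/ood-job.${SLURM_JOBID}"
mkdir "$TMPWORKDIR"
cd "$TMPWORKDIR"

# Print some diagnostics
echo "Slurm job ${SLURM_JOBID}: ${SLURM_NTASKS} tasks on ${SLURM_JOB_NUM_NODES} nodes"
echo "Nodes: ${SLURM_JOB_NODELIST}"
date
pwd
module list

# Find the executable and export it for the wrapper script
MYEXE=`which hello-umd`
export MYEXE
echo "Using executable $MYEXE"

# Launch one instance of the wrapper per task allocated to the job.
# The wrapper is assumed to live in the directory the job was submitted from.
srun /bin/bash ${SLURM_SUBMIT_DIR}/hello-wrapper.sh > hello.out 2>&1
ECODE=$?

echo "Job finished with exit code $ECODE"
date

exit $ECODE
```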
Like most of our examples, this shebang uses the /bin/bash interpreter, which is the bash (Bourne-again) shell. This is a compatible replacement for, and enhancement of, the original Unix Bourne shell. You can opt to specify another shell or interpreter if you so desire; common choices are:

- the Bourne shell (/bin/sh) in your shebang (note that this basically just uses bash in a restricted mode)
- the C shell (/bin/csh or /bin/tcsh)

However, we recommend the use of the bash shell, as it has good support for scripting; this might not matter for most job submission scripts because of their simplicity, but might if you start to need more advanced features. The examples generally use the bash shell for this reason.
Comment lines beginning with #SBATCH are used to control the Slurm scheduler and are discussed below. Other than these special cases (the shebang line and #SBATCH lines), feel free to use comment lines to remind yourself (and maybe others reading your script) of what the script is doing.
Lines beginning with #SBATCH can be used to control how the Slurm sbatch command submits the job. Basically, any command line flags to sbatch can instead be provided on a #SBATCH line in the script, and you can mix and match command line options and options on #SBATCH lines. NOTE: any #SBATCH lines must precede any "executable lines" in the script. It is recommended that you have nothing but the shebang line, comments, and blank lines before any #SBATCH lines.
We request 3 tasks (--ntasks=3 or -n 3), with each task having 15 CPU cores (--cpus-per-task=15 or -c 15).
Note that we do not specify a number of nodes, and we recommend that you do not specify a node count for most jobs --- by default Slurm will allocate enough nodes to satisfy this job's needs, and if you specify a value which is incorrect it will only cause problems.
We choose 3 tasks of 15 cores as this will usually require multiple nodes on both Deepthought2 and Juggernaut (although some of the larger Juggernaut nodes can support this request on a single node), and so makes a better demonstration, but will still fit in the debug partition.
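In the submission script these requests take the form (either the long or short form of each flag may be used):

```bash
#SBATCH --ntasks=3          # 3 tasks (equivalently: -n 3)
#SBATCH --cpus-per-task=15  # 15 CPU cores per task (equivalently: -c 15)
```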
The #SBATCH -t TIME line sets the time limit for the job. The requested TIME value can take any of a number of formats, including: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds.
It is important to set the time limit appropriately. It must be set longer than you expect the job to run, preferably with a modest cushion for error --- when the time limit is up, the job will be canceled.
You do not want to make the requested time excessive, either. Although you are only charged for the actual time used (i.e. if you requested 12 hours and the job finished in 11 hours, your job is only charged for 11 not 12 hours), there are other downsides to requesting too much wall time. Among them, the job may spend more time in the queue, and might not run at all if your account is low on funds (the scheduler uses the requested wall time to estimate the number of SUs the job will consume, and will not start a job unless it and all currently running jobs are projected to have sufficient SUs to complete). And if it does start, an excessive walltime might block other jobs from running for a similar reason.
In general, you should estimate the maximum run time, and pad it by 10% or so.
In this case, hello-umd will run very quickly, in much less than 5 minutes.
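A five minute limit, for example, can be requested in any of these equivalent formats (only one -t line should appear in a real script):

```bash
#SBATCH -t 5            # 5 minutes
#SBATCH -t 5:00         # 5 minutes, 0 seconds
#SBATCH -t 0-00:05:00   # 0 days, 0 hours, 5 minutes, 0 seconds
```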
There are several parameters you can give to Slurm/sbatch to specify the
memory to be allocated for the job. It is recommended that you always include
a memory request for your job --- if omitted it will default to 6GB per CPU
core. The recommended way to request memory is with the
--mem-per-cpu=N
flag. Here N is in MB.
This will request N MB of RAM for each CPU core allocated to the job.
Since you often wish to ensure each process in the job has sufficient memory,
this is usually the best way to do so.
An alternative is the --mem=N flag. This sets the maximum memory use per node. Again, N is in MB. This could be useful for single node jobs, especially multithreaded jobs, as there is only a single node and threads generally share significant amounts of memory. But for MPI jobs the --mem-per-cpu flag is usually more appropriate and convenient.
For MPI codes, we recommend using --mem-per-cpu
instead of
--mem
since you generally wish to ensure each MPI task has
sufficient memory.
The hello-umd
does not use much memory, so 1 GB per core
is plenty.
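For example, to request 1 GB (1024 MB) per CPU core, as used here:

```bash
#SBATCH --mem-per-cpu=1024   # 1 GB of RAM per CPU core allocated
```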
The lines #SBATCH --share, #SBATCH --oversubscribe, or #SBATCH --exclusive decide whether or not other jobs are able to run on the same node(s) as your job.
NOTE: The Slurm scheduler changed the name of the flag for "shared" mode. The proper flag is now #SBATCH --oversubscribe. You must use the "oversubscribe" flag on Juggernaut. You can currently use either form on Deepthought2, but the #SBATCH --share form is deprecated and at some point will no longer be supported. Both forms effectively do the same thing.
In exclusive mode, no other jobs are able to run on a node allocated to your job while your job is running. This greatly reduces the possibility of another job interfering with the running of your job. But if you are not using all of the resources of the node(s) your job is running on, it is also wasteful of resources. In exclusive mode, we charge your job for all of the cores on the nodes allocated to your job, regardless of whether you are using them or not.
In share/oversubscribe mode, other jobs (including those of other users) may run on the same node as your job as long as there are sufficient resources for both. We make efforts to try to prevent jobs from interfering with each other, but such methods are not perfect, so while the risk of interference is small, it is much greater in share mode than in exclusive mode. However, in share mode you are only charged for the requested number of cores (not all cores on the node unless you requested such), and your job might spend less time in the queue (since it can avail itself of nodes which are in use by other jobs but have some unallocated resources).
Our recommendation is that large (many-core/many-node) and/or long running jobs use exclusive mode, as the potential cost of adverse interference is greatest here. Plus large jobs tend to use most if not all cores of most of the nodes they run on, so the cost of exclusive mode tends to be less. Smaller jobs, and single core jobs in particular, generally benefit from share/oversubscribe mode, as they tend to less fully utilize the nodes they run on (indeed, on a standard Deepthought2 node, a single core job will only use 5% of the CPU cores).
Unless you specify otherwise, the cluster defaults single core jobs to share mode, and multicore/multinode jobs to exclusive mode. This is not an ideal choice, and might change in the future. We recommend that you always explicitly request either share/oversubscribe or exclusive as appropriate.
Again, as a multi-core job, #SBATCH --exclusive
is the default, but we recommend explicitly stating this.
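For this example the job runs in exclusive mode:

```bash
#SBATCH --exclusive     # do not share our nodes with other jobs
# (a job which can share nodes would instead use: #SBATCH --oversubscribe)
```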
This line requests the debug partition. For real production work, the debug queue is probably not adequate, in which case it is recommended that you just omit this line and let the scheduler select an appropriate partition for you.
The --export=NONE flag tells sbatch not to let the job process inherit the environment of the process which invoked the sbatch command. This requires the job script to explicitly set up its required environment, as it can no longer depend on environmental settings you had when you ran the sbatch command. While this may require a few more lines in your script, it is a good practice and improves the reproducibility of the job script --- without this it is possible the job would only run correctly if you had a certain module loaded or variable set when you submitted the job.
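In the script this is simply:

```bash
#SBATCH --export=NONE   # do not inherit the submitting shell's environment
```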
These lines ensure that the module command is available in your script. They are generally only required if the shell specified in the shebang line does not match your default login shell, in which case the proper startup files likely did not get invoked.
The unalias line is to ensure that there is no vestigial tap command. It is sometimes needed on RHEL6 systems, and should not be needed on the newer platforms, but is harmless when not needed. The remaining lines will read in the appropriate dot files for the bash shell --- the if, then, elif construct enables this script to work on both the Deepthought2 and Juggernaut clusters, which have slightly different names for the bash startup file.
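A sketch of such a block is shown below; the exact startup file names checked here are assumptions:

```bash
# Make sure the module command is available even if the shebang shell
# does not match the login shell.  The unalias is only needed on older
# (RHEL6) systems but is harmless elsewhere.
unalias tap 2> /dev/null
if [ -f ~/.bash_profile ]; then
        . ~/.bash_profile
elif [ -f ~/.profile ]; then
        . ~/.profile
fi
```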
This line sets the environment variable SLURM_EXPORT_ENV to the value ALL, which causes the environment to be shared with other processes spawned by Slurm commands (which also includes mpirun and similar).
At first this might seem to contradict our recommendation to use #SBATCH --export=NONE, but it really does not. The #SBATCH --export=NONE setting will cause the job script not to inherit the environment of the shell in which you ran the sbatch command. But we are now in the job script, which because of the --export=NONE flag has its own environment which was set up in the script. We want this environment to be shared with the other tasks and processes spawned by this job. These tasks and processes will inherit the environment set up in this script, not the environment from which the sbatch command ran.
This is important for jobs like this, because otherwise the tasks spawned by srun (or mpirun in MPI jobs) might not start with the proper environment.
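In the script this is just:

```bash
# Share the environment set up by this script with processes spawned
# by srun (and mpirun, etc.)
export SLURM_EXPORT_ENV=ALL
```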
To begin with, we do a module purge
to clear out any previously loaded
modules. This prevents them from interfering with subsequent module loads. Then we load
the default module for the cluster with module load hpcc/deepthought2
; this line
should be adjusted for the cluster being used (e.g. module load hpcc/juggernaut
for the Juggernaut cluster).
We then load the desired compiler and MPI library,
finally the hello-umd
. We recommend that you always
load the compiler module first, and then if needed the MPI library,
and then any higher
level applications. Many packages have different builds for different
compilers, MPI libraries, etc., and the module command is smart enough to
load the correct versions of these. Note that we specify the versions; if you omit the version, the module command will usually try to load the most recent version installed.
We recommend that you always specify the specific version you want in your job scripts --- this makes your job more reproducible. Systems staff may add newer versions of existing packages without notification, and if you do not specify a version, the default version may change without your expecting it. In particular, a job that runs fine today using today's default version might crash unexpectedly when you try running it again in six months because the packages it uses were updated and your inputs are not compatible with the new version of the code.
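A sketch of this block is shown below; the gcc and hello-umd versions match the sample output shown later on this page, while the openmpi version is a placeholder:

```bash
module purge
module load hpcc/deepthought2   # use hpcc/juggernaut on Juggernaut
module load gcc/8.4.0           # compiler first
module load openmpi/3.1.5       # then the MPI library (version is a placeholder)
module load hello-umd/1.5       # then the application
```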
These lines create a work directory for the job. The /tmp filesystem is specific to a single node, so it is usually not suitable for multi-node jobs such as this one (or MPI jobs). The lustre filesystem is accessible by all of the compute nodes of the cluster, so it is a good choice for such jobs.
The TMPWORKDIR="/lustre/$USER/ood-job.${SLURM_JOBID}" or similar line
defines an environmental variable containing the name of our chosen work
directory. The ${SLURM_JOBID}
references another environmental
variable which is automatically set by Slurm (when the job starts to run) to
the job number for this job --- using this in our work directory names
helps ensure it will not conflict with any other job. The
mkdir
command creates this work directory, and the
cd
changes our working directory to that directory---
note in those last commands the use of the environmental variable we just
created to hold the directory name.
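A sketch of these lines:

```bash
# Job-specific work directory on the lustre filesystem
TMPWORKDIR="/lustre/$USER/ood-job.${SLURM_JOBID}"
mkdir "$TMPWORKDIR"
cd "$TMPWORKDIR"
```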
These lines print the values of the environment variables SLURM_JOBID, SLURM_NTASKS, SLURM_JOB_NUM_NODES, and SLURM_JOB_NODELIST, which are set by Slurm at the start of the job to the job number, the number of tasks, the number of nodes, and the names of the nodes allocated to the job. It also prints the time and date that
and the names of the nodes allocated to the job. It also prints the time and date that
the job started (the date
command), the working directory (the
pwd
command), and the list of loaded modules (the module list
command). Although you are probably not interested in any of that information if the
job runs as expected, they can often be helpful in diagnosing why things did not work
as expected.
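A sketch of these diagnostic lines (the exact wording of the echo messages is an assumption):

```bash
echo "Slurm job ${SLURM_JOBID}: ${SLURM_NTASKS} tasks on ${SLURM_JOB_NUM_NODES} nodes"
echo "Nodes: ${SLURM_JOB_NODELIST}"
date
pwd
module list
```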
These lines find the full path to the hello-umd command, store it in an environmental variable named MYEXE, and then output the path for added diagnostics. We find that jobs run better when you provide the absolute path to the executable to the srun, mpirun, or similar command.
In this case, we include the bash export
qualifier so that
the value of MYEXE
will be passed to the srun
command.
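A sketch of these lines:

```bash
# Record the full path to the executable and export it for the wrapper script
MYEXE=`which hello-umd`
export MYEXE
echo "Using executable $MYEXE"
```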
We use the Slurm srun command for this. It launches the specified command on every node allocated to the job, as many times as the node has tasks allocated to it.
As this is a non-MPI job, we are using a version of hello-umd
which does not support MPI. Therefore, from the standpoint of the
hello-umd
, it is always MPI task number 0, and if we simply
ran hello-umd
directly, it would be difficult to distinguish
between the different tasks. So we use a wrapper script
hello-wrapper.sh
which uses the environment variable
SLURM_PROCID
, automatically set by the srun
command
for each instance of the code to the number of the task (starting at zero).
The wrapper script invokes hello-umd
with a different message
to identify the task.
We actually directly pass the /bin/bash
binary to
srun
, and give the wrapper script as an argument to
bash
to avoid any issues with execute permissions on the script.
We run the code so as to save the output in the file
hello.out
in the current working directory.
The > operator does output redirection, meaning that all of the standard output goes to the specified file (hello.out in this case). The 2>&1 operator causes the standard error output to be sent to the standard output stream (1 is the number for the standard output stream, 2 for standard error), and since standard output was redirected to the file, so will the standard error be.
For this simple case, we could have omitted the redirection of standard output and standard error, in which case any such output would end up in the Slurm output file (usually named slurm-JOBNUMBER.out).
However, if your job produces a lot (many MBs) of output to standard
output/standard error, this can be problematic. It is good practice
to redirect output if you know your code will produce more than 1000 or so
lines of output.
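A sketch of this line; the assumption that the wrapper script sits in the directory the job was submitted from (SLURM_SUBMIT_DIR) is ours:

```bash
# One instance of the wrapper is started per task allocated to the job;
# all output is collected in hello.out
srun /bin/bash ${SLURM_SUBMIT_DIR}/hello-wrapper.sh > hello.out 2>&1
```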
The special shell variable $?
stores the exit code from the last command.
Normally it will be 0 if the command was successful, and non-zero otherwise. But it only
works for the last command, so we save it in the variable ECODE
.
These lines print the value of ECODE, and then print the date/time of completion using the date command.
The final line exits the script with the exit code stored in ECODE. This means that the script will have the same exit code as the application, which will allow Slurm to better determine if the job was successful or not. (If we did not do this, the exit code of the script would be the exit code of the last command that ran, in this case the date command, which should never fail. So even if your application aborted, the script would return a successful (0) exit code, and Slurm would think the job succeeded.)
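A sketch of these closing lines:

```bash
ECODE=$?      # save the exit code of the srun command
echo "Job finished with exit code $ECODE"
date
exit $ECODE   # return the application's exit code to Slurm
```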
We also recommend ending the script with one or more blank lines. The reason for this is that if the last line does not have the proper line termination character, it will be ignored by the shell. Over the years, we have had many users confused as to why their job ended as soon as it started without error, etc. --- it turns out the last line of their script was the line which actually ran their code, and it was missing the correct line termination character. Therefore, the job ran, did some initialization and module loads, and exited without running the command they were most interested in because of a missing line termination character (which can be easily overlooked).
This problem most frequently occurs when transferring files between Unix/Linux and Windows operating systems. While there are utilities that can add the correct line termination characters, the easy solution in our opinion is to just add one or more blank lines at the end of your script --- if the shell ignores the blank lines, you do not care.
The submission script uses a small helper script, hello-wrapper.sh, which can also be downloaded as plain text. A sketch of it is shown below:
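Below is a minimal sketch of what the wrapper looks like, based on the discussion that follows (download the plain-text version for the actual script); the "Starting task" message matches the sample output shown later on this page.

```bash
#!/bin/bash
# hello-wrapper.sh: run one instance of hello-umd, labeling it with the
# Slurm-assigned task number so the output of each task can be identified.
echo "Starting task ${SLURM_PROCID}, using executable ${MYEXE}"
${MYEXE} -t ${SLURM_CPUS_PER_TASK} -m "Hello from task ${SLURM_PROCID}"
```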
Like most of our examples, this shebang uses the /bin/bash interpreter, which is the bash (Bourne-again) shell. This is a compatible replacement for, and enhancement of, the original Unix Bourne shell. You can opt to specify another shell or interpreter if you so desire; common choices are:

- the Bourne shell (/bin/sh) in your shebang (note that this basically just uses bash in a restricted mode)
- the C shell (/bin/csh or /bin/tcsh)

However, we recommend the use of the bash shell, as it has good support for scripting; this might not matter for most job submission scripts because of their simplicity, but might if you start to need more advanced features. The examples generally use the bash shell for this reason.
This line runs hello-umd. We use the argument -t ${SLURM_CPUS_PER_TASK} to have it use the number of threads we declared in the submit.sh script. We also use the -m flag to change the default message of hello-umd to identify the pseudo-task.
The environmental variable SLURM_CPUS_PER_TASK
was set by Slurm before the submit.sh
script
ran. Because we set the variable SLURM_EXPORT_ENV
to the value ALL
in the script, it gets exported
to our wrapper script by srun
.
The environmental variable SLURM_PROCID
is set
by srun
to an integer (starting at 0) to identify
which process within the task is being launched. We use this
to identify our "task".
ssh
command
For this case, we use the ssh
command to
spawn the different instances of hello-umd
on all the
nodes
allocated to the
job, once for each task allocated to the node. This information is
contained in the environment variables SLURM_JOB_NODELIST
and SLURM_TASKS_PER_NODE
which are set by Slurm automatically
when your job script starts. However, Slurm uses a condensed format for
these variables, which is difficult to use. So we make use of an external
script /software/acigs-utilities/bin/slurm_hostnames_by_tasks
to convert these to a more useful format, listing every node allocated
to the job as many times as it has tasks allocated to it.
NOTE: In order for this example using ssh to work, you must have previously enabled password-less ssh between the nodes of the cluster; instructions for setting up password-less ssh between nodes of the cluster are available. This only needs to be done once, but the job will not successfully run without this.
The submission script submit.sh can be downloaded as plain text. A sketch of the script is shown below, followed by a discussion of its main parts:
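Below is a minimal sketch of this script, assembled from the discussion that follows (download the plain-text version above for the actual script). The #SBATCH header, startup file sourcing, module loads (minus the MPI library), work directory setup, and diagnostics mirror the srun example above and are abbreviated here; the variable names NODELIST and PIDS are assumptions.

```bash
#!/bin/bash
# (#SBATCH header, startup file sourcing, module loads, work directory
#  creation, and diagnostics are the same as in the srun example above
#  and are omitted here for brevity.)

launch_ssh_tasks()
{
    # Full path to the executable
    MYEXE=`which hello-umd`
    echo "Using executable $MYEXE"

    # List each allocated node once per task allocated to it
    NODELIST=$(/software/acigs-utilities/bin/slurm_hostnames_by_tasks \
        -r ' ' -n hosts_by_tasks)

    tasknum=0
    PIDS=
    for node in $NODELIST; do
        taskname=$tasknum
        # Launch one instance per task, in the background so they run in parallel
        ssh -q $node $MYEXE -t ${SLURM_CPUS_PER_TASK} \
            -m "'Hello from task $taskname'" &
        PIDS="$PIDS $!"
        tasknum=$((tasknum + 1))
    done

    # Wait for every background ssh to finish and collect its exit code
    ECODE=0
    for pid in $PIDS; do
        wait $pid
        tmpecode=$?
        if [ $tmpecode -ne 0 ]; then
            echo "WARNING: task with pid $pid exited with code $tmpecode"
            ECODE=$tmpecode
        fi
    done
    return $ECODE
}

# ... environment and work directory setup as in the srun example ...

launch_ssh_tasks > hello.out 2>&1
ECODE=$?

echo "Job finished with exit code $ECODE"
date
exit $ECODE
```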
Like most of our examples, this shebang uses the /bin/bash interpreter, which is the bash (Bourne-again) shell. This is a compatible replacement for, and enhancement of, the original Unix Bourne shell. You can opt to specify another shell or interpreter if you so desire; common choices are:

- the Bourne shell (/bin/sh) in your shebang (note that this basically just uses bash in a restricted mode)
- the C shell (/bin/csh or /bin/tcsh)

However, we recommend the use of the bash shell, as it has good support for scripting; this might not matter for most job submission scripts because of their simplicity, but might if you start to need more advanced features. The examples generally use the bash shell for this reason.
Comment lines beginning with #SBATCH are used to control the Slurm scheduler and are discussed below. Other than these special cases (the shebang line and #SBATCH lines), feel free to use comment lines to remind yourself (and maybe others reading your script) of what the script is doing.
Lines beginning with #SBATCH can be used to control how the Slurm sbatch command submits the job. Basically, any command line flags to sbatch can instead be provided on a #SBATCH line in the script, and you can mix and match command line options and options on #SBATCH lines. NOTE: any #SBATCH lines must precede any "executable lines" in the script. It is recommended that you have nothing but the shebang line, comments, and blank lines before any #SBATCH lines.
We request 3 tasks (--ntasks=3 or -n 3), with each task having 15 CPU cores (--cpus-per-task=15 or -c 15). Because we are not using MPI, we use the term "task" in a more general sense, as an instance of the (multithreaded) program being run.
Note that we do not specify a number of nodes, and we recommend that you do not specify a node count for most jobs --- by default Slurm will allocate enough nodes to satisfy this job's needs, and if you specify a value which is incorrect it will only cause problems.
We choose 3 tasks of 15 cores as this will usually require multiple nodes on both Deepthought2 and Juggernaut (although some of the larger Juggernaut nodes can support this request on a single node), and so makes a better demonstration, but will still fit in the debug partition.
The #SBATCH -t TIME line sets the time limit for the job. The requested TIME value can take any of a number of formats, including: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds.
It is important to set the time limit appropriately. It must be set longer than you expect the job to run, preferably with a modest cushion for error --- when the time limit is up, the job will be canceled.
You do not want to make the requested time excessive, either. Although you are only charged for the actual time used (i.e. if you requested 12 hours and the job finished in 11 hours, your job is only charged for 11 not 12 hours), there are other downsides to requesting too much wall time. Among them, the job may spend more time in the queue, and might not run at all if your account is low on funds (the scheduler uses the requested wall time to estimate the number of SUs the job will consume, and will not start a job unless it and all currently running jobs are projected to have sufficient SUs to complete). And if it does start, an excessive walltime might block other jobs from running for a similar reason.
In general, you should estimate the maximum run time, and pad it by 10% or so.
In this case, hello-umd will run very quickly, in much less than 5 minutes.
There are several parameters you can give to Slurm/sbatch to specify the
memory to be allocated for the job. It is recommended that you always include
a memory request for your job --- if omitted it will default to 6GB per CPU
core. The recommended way to request memory is with the
--mem-per-cpu=N
flag. Here N is in MB.
This will request N MB of RAM for each CPU core allocated to the job.
Since you often wish to ensure each process in the job has sufficient memory,
this is usually the best way to do so.
An alternative is the --mem=N flag. This sets the maximum memory use per node. Again, N is in MB. This could be useful for single node jobs, especially multithreaded jobs, as there is only a single node and threads generally share significant amounts of memory. But for MPI jobs the --mem-per-cpu flag is usually more appropriate and convenient.
For MPI codes, we recommend using --mem-per-cpu
instead of
--mem
since you generally wish to ensure each MPI task has
sufficient memory.
The hello-umd
does not use much memory, so 1 GB per core
is plenty.
The lines #SBATCH --share, #SBATCH --oversubscribe, or #SBATCH --exclusive decide whether or not other jobs are able to run on the same node(s) as your job.
NOTE: The Slurm scheduler changed the name of the flag for "shared" mode. The proper flag is now #SBATCH --oversubscribe. You must use the "oversubscribe" flag on Juggernaut. You can currently use either form on Deepthought2, but the #SBATCH --share form is deprecated and at some point will no longer be supported. Both forms effectively do the same thing.
In exclusive mode, no other jobs are able to run on a node allocated to your job while your job is running. This greatly reduces the possibility of another job interfering with the running of your job. But if you are not using all of the resources of the node(s) your job is running on, it is also wasteful of resources. In exclusive mode, we charge your job for all of the cores on the nodes allocated to your job, regardless of whether you are using them or not.
In share/oversubscribe mode, other jobs (including those of other users) may run on the same node as your job as long as there are sufficient resources for both. We make efforts to try to prevent jobs from interfering with each other, but such methods are not perfect, so while the risk of interference is small, it is much greater in share mode than in exclusive mode. However, in share mode you are only charged for the requested number of cores (not all cores on the node unless you requested such), and your job might spend less time in the queue (since it can avail itself of nodes which are in use by other jobs but have some unallocated resources).
Our recommendation is that large (many-core/many-node) and/or long running jobs use exclusive mode, as the potential cost of adverse interference is greatest here. Plus large jobs tend to use most if not all cores of most of the nodes they run on, so the cost of exclusive mode tends to be less. Smaller jobs, and single core jobs in particular, generally benefit from share/oversubscribe mode, as they tend to less fully utilize the nodes they run on (indeed, on a standard Deepthought2 node, a single core job will only use 5% of the CPU cores).
Unless you specify otherwise, the cluster defaults single core jobs to share mode, and multicore/multinode jobs to exclusive mode. This is not an ideal choice, and might change in the future. We recommend that you always explicitly request either share/oversubscribe or exclusive as appropriate.
Again, as a multi-core job, #SBATCH --exclusive
is the default, but we recommend explicitly stating this.
This line requests the debug partition. For real production work, the debug queue is probably not adequate, in which case it is recommended that you just omit this line and let the scheduler select an appropriate partition for you.
The --export=NONE flag tells sbatch not to let the job process inherit the environment of the process which invoked the sbatch command. This requires the job script to explicitly set up its required environment, as it can no longer depend on environmental settings you had when you ran the sbatch command. While this may require a few more lines in your script, it is a good practice and improves the reproducibility of the job script --- without this it is possible the job would only run correctly if you had a certain module loaded or variable set when you submitted the job.
These lines define a bash function launch_ssh_tasks to handle launching, via ssh, the instances of hello-umd on the nodes allocated to our job. We need to launch as many instances per node as there were tasks allocated to the node.
This section only defines the function; the code here does not actually run until the function is invoked later in the script.
The function first finds the full path to the hello-umd command, stores it in an environmental variable named MYEXE, and then outputs the path for added diagnostics. We find that most jobs run better when you provide the absolute path to the executable to the ssh command.
We need to launch a hello-umd process for every task we requested. The information we want is contained in the environment variables set by the Slurm scheduler, SLURM_JOB_NODELIST and SLURM_TASKS_PER_NODE, but Slurm uses an abbreviated form in these variables. So we use a helper script installed at /software/acigs-utilities/bin/slurm_hostnames_by_tasks to do the work of converting the contents of these variables to the more useful form described above. The argument -r ' ' causes the node names to be separated by a space (which makes it easy to use in a bash for loop), and the argument -n hosts_by_tasks causes it to print a list of hostnames with each hostname occurring once for each task allocated to that node. The "\" character at the end of the first line is a continuation character --- it allows us to split the command across two lines but still have the shell interpret it as a single command.
We also construct a name for each task, which will be passed to hello-umd to show which task is producing the output.
We then use ssh to spawn an instance of hello-umd on the node. We give the -q argument to ssh to suppress the standard ssh output (e.g. the unauthorized use warnings, etc.). The "\" characters at the end of the line allow us to split the line to make it more readable, but still have the shell interpret it as a single line.
The ssh
will launch the command on the named $node
,
and will run our hello-umd
code (whose path is stored in the
variable $MYEXE
) with the argument
-t ${SLURM_CPUS_PER_TASK}
to have it run with the
requested number of threads. The argument
-m "'Hello from task $taskname'"
is also given --- this changes
the default hello-umd
message so as to identify which task
generated the message (we are using a non-MPI build of hello-umd
,
so the hello-umd
code assumes it is
task
0 of 1 in all cases,
therefore the only way to distinguish is by changing the message text.).
The "&" symbol at the end of the ssh
command causes the
ssh
command to run in the background --- this means that the
script continues immediately with the command after the ssh
command while the ssh
command continues to run. This is needed
in order to get the required parallelism. Without it, only one instance
of hello-umd
would be running at a time --- we would use
ssh
to spawn one instance of hello-umd
on the
first node, wait for it to complete, and once it is done continue the loop
and spawn an instance on the second node, etc.
Immediately after the ssh
command, we store the process id (pid)
of the ssh
command we just ran in the background. This is done
using the special shell variable $!
. We need this value so that
we can collect the exit code for that process. Finally, we increment
the variable tasknum
.
These lines initialize a variable ECODE in which we will store the overall exit code for the function. We initialize it to 0, indicating a successful run, but if any of the spawned processes have a non-zero exit code, we change it to match. So it will only remain 0 if all of the spawned processes have zero exit codes (indicating success).
We now need to wait for the spawned background processes to complete, and to collect their exit codes. Even in our quick hello-umd example, the background processes are likely to run for longer than the time needed to finish the rest of this script, so it is especially important to wait for the background tasks to complete before exiting the submission script (and terminating the job, which would cause the background tasks to be prematurely terminated).
The first reason could be satisfied with a simple wait command without the process ID number --- that would wait until all child processes have completed. But that only returns the exit code of the last process to complete, and if errors occur in some of the processes, those processes would tend to terminate earlier and so would be missed.
In this loop, we loop over each of the process IDs of the background
ssh
processes. The first line in the loop, wait $pid
,
will cause the processing of the script to halt until the process corresponding
to $pid
has terminated. The exit code of the wait
command is then stored in the variable tmpecode
. The
wait
normally returns the exit code of the specified
background process. This exit code is the exit code of the
ssh
process, but ssh
will return the exit code of the hello-umd
command it was
running on the allocated node. So this will tell us whether the command
being run on the node was successful or not.
We then check whether the exit code of the background process was zero (which would indicate success) or not. If not, we set the exit code of the function to non-zero also, and print a warning message.
These lines ensure that the module command is available in your script. They are generally only required if the shell specified in the shebang line does not match your default login shell, in which case the proper startup files likely did not get invoked.
The unalias line is to ensure that there is no vestigial tap command. It is sometimes needed on RHEL6 systems, and should not be needed on the newer platforms, but is harmless when not needed. The remaining lines will read in the appropriate dot files for the bash shell --- the if, then, elif construct enables this script to work on both the Deepthought2 and Juggernaut clusters, which have slightly different names for the bash startup file.
To begin with, we do a module purge
to clear out any previously loaded
modules. This prevents them from interfering with subsequent module loads. Then we load
the default module for the cluster with module load hpcc/deepthought2
; this line
should be adjusted for the cluster being used (e.g. module load hpcc/juggernaut
for the Juggernaut cluster).
We then load the desired compiler, and
finally the hello-umd
. We recommend that you always
load the compiler module first and then any higher
level applications. Many packages have different builds for different
compilers, etc., and the module command is smart enough to
load the correct versions of these. Note that we specify the versions; if you omit the version, the module command will usually try to load the most recent version installed.
We recommend that you always specify the specific version you want in your job scripts --- this makes your job more reproducible. Systems staff may add newer versions of existing packages without notification, and if you do not specify a version, the default version may change without your expecting it. In particular, a job that runs fine today using today's default version might crash unexpectedly when you try running it again in six months because the packages it uses were updated and your inputs are not compatible with the new version of the code.
These lines create a work directory for the job. The /tmp filesystem is specific to a single node, so it is usually not suitable for multi-node jobs such as this one (or MPI jobs). The lustre filesystem is accessible by all of the compute nodes of the cluster, so it is a good choice for such jobs.
The TMPWORKDIR="/lustre/$USER/ood-job.${SLURM_JOBID}" or similar line
defines an environmental variable containing the name of our chosen work
directory. The ${SLURM_JOBID}
references another environmental
variable which is automatically set by Slurm (when the job starts to run) to
the job number for this job --- using this in our work directory names
helps ensure it will not conflict with any other job. The
mkdir
command creates this work directory, and the
cd
changes our working directory to that directory---
note in those last commands the use of the environmental variable we just
created to hold the directory name.
These lines print the values of the environment variables SLURM_JOBID, SLURM_NTASKS, SLURM_JOB_NUM_NODES, and SLURM_JOB_NODELIST, which are set by Slurm at the start of the job to the job number, the number of tasks, the number of nodes, and the names of the nodes allocated to the job. It also prints the time and date that
and the names of the nodes allocated to the job. It also prints the time and date that
the job started (the date
command), the working directory (the
pwd
command), and the list of loaded modules (the module list
command). Although you are probably not interested in any of that information if the
job runs as expected, they can often be helpful in diagnosing why things did not work
as expected.
This line invokes the launch_ssh_tasks function that we defined back at the start of this script. We run it so as to save the output in the file hello.out in our work directory.
The > operator does output redirection, meaning that all of the standard output goes to the specified file (hello.out in this case). The 2>&1 operator causes the standard error output to be sent to the standard output stream (1 is the number for the standard output stream, 2 for standard error), and since standard output was redirected to the file, so will the standard error be.
For this simple case, we could have omitted the redirection of standard output and standard error, in which case any such output would end up in the Slurm output file (usually named slurm-JOBNUMBER.out).
However, if your job produces a lot (many MBs) of output to standard
output/standard error, this can be problematic. It is good practice
to redirect output if you know your code will produce more than 1000 or so
lines of output.
The special shell variable $?
stores the exit code from the last command.
Normally it will be 0 if the command was successful, and non-zero otherwise. But it only
works for the last command, so we save it in the variable ECODE
.
These lines print the value of ECODE, and then print the date/time of completion using the date command.
The final line exits the script with the exit code stored in ECODE. This means that the script will have the same exit code as the application, which will allow Slurm to better determine if the job was successful or not. (If we did not do this, the exit code of the script would be the exit code of the last command that ran, in this case the date command, which should never fail. So even if your application aborted, the script would return a successful (0) exit code, and Slurm would think the job succeeded.)
We also recommend ending the script with one or more blank lines. The reason for this is that if the last line does not have the proper line termination character, it will be ignored by the shell. Over the years, we have had many users confused as to why their job ended as soon as it started without error, etc. --- it turns out the last line of their script was the line which actually ran their code, and it was missing the correct line termination character. Therefore, the job ran, did some initialization and module loads, and exited without running the command they were most interested in because of a missing line termination character (which can be easily overlooked).
This problem most frequently occurs when transferring files between Unix/Linux and Windows operating systems. While there are utilities that can add the correct line termination characters, the easy solution in our opinion is to just add one or more blank lines at the end of your script --- if the shell ignores the blank lines, you do not care.
The easiest way to run this example is with the
Job Composer of
the OnDemand portal, using
the HelloUMD-HybridSrun
and
HelloUMD-HybridSsh
templates.
NOTE: In order to successfully run the ssh example, you must have previously enabled password-less ssh between the nodes of the cluster; instructions for setting up password-less ssh between nodes of the cluster are available. This only needs to be done once, but the job will not successfully run without this.
To submit the examples from the command line, just run sbatch submit.sh (for the ssh example, you must enable, or have previously enabled, password-less ssh between the nodes of the cluster). This will submit the job
to the scheduler, and should return a message like
Submitted batch job 23767
--- the number will vary (and is the
job number for this job). The job number can be used to reference
the job in Slurm, etc. (Please always give the job number(s) when requesting
help about a job you submitted).
Whichever method you used for submission, the job will be queued for the
debug partition and should run within 15 minutes or so. When it finishes
running, the file slurm-JOBNUMBER.out should contain
should contain
the output from our diagnostic commands (time the job started, finished,
module list, etc). The output of the hello-umd
will be in
the file hello.out
in the directory from which you submitted
the job. If you used OnDemand, these files will appear listed in the Folder contents section on the right.
The hello.out file should look something like:
Starting task 0, using executable /software/spack-software/2020.05.14/linux-rhel7-broadwell/gcc-8.4.0/hello-umd-1.5-iimg3e3xjkofyozl4yda7b6mm7tshgvd/bin/hello-umd
Starting task 1, using executable /software/spack-software/2020.05.14/linux-rhel7-broadwell/gcc-8.4.0/hello-umd-1.5-iimg3e3xjkofyozl4yda7b6mm7tshgvd/bin/hello-umd
Starting task 2, using executable /software/spack-software/2020.05.14/linux-rhel7-broadwell/gcc-8.4.0/hello-umd-1.5-iimg3e3xjkofyozl4yda7b6mm7tshgvd/bin/hello-umd
hello-umd: Version 1.5
Built for compiler: gcc/8.4.0
hello-umd: Version 1.5
Built for compiler: gcc/8.4.0
'Hello from task 2' from thread 0 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 8 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 7 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 14 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 13 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 11 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 1 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 5 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 10 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 12 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 4 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 9 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 3 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 6 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
'Hello from task 2' from thread 2 of 15, task 0 of 1 (pid=46349 on host compute-10-1.juggernaut.umd.edu
hello-umd: Version 1.5
Built for compiler: gcc/8.4.0
'Hello from task 0' from thread 4 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 0 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 11 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 12 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 7 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 6 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 5 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 9 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 8 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 1 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 10 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 2 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 3 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 13 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 0' from thread 14 of 15, task 0 of 1 (pid=162280 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 3 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 12 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 9 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 0 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 1 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 13 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 7 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 2 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 14 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 8 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 10 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 5 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 6 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 4 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
'Hello from task 1' from thread 11 of 15, task 0 of 1 (pid=162279 on host compute-10-0.juggernaut.umd.edu
The lines beginning Starting task ...
are from the
hello-wrapper.sh
script, and so will only be present in the
srun
case. The rest of the file should be basically the same
in both cases: a pair of lines for each of the 3 tasks identifying the
hello-umd
version and the compiler it was built with. Then there
should be 45 lines (15 lines for each of the 3 tasks) identifying the task
and thread.
Note that the identifying lines from hello-umd
do
not state an MPI library --- we are running the non-MPI
version of hello-umd
. Note also that for each of the lines
identifying the thread, after listing the thread number it lists
task 0 of 1
--- this is due to the fact that we are running a non-MPI version of hello-umd, so each instance does not know about the other tasks and reports itself as task 0 of 1. For that reason, we modified the default message to have it identify the task.
For any particular task number (according to the modified hello message), the pid and the hostname should be the same, but the pids should be different for different task ids (and will likely not match the pids shown above). The hostnames should be different between tasks if run on the Deepthought2 cluster, but might not be on Juggernaut (since most Juggernaut nodes have at least 30 cores, and some have over 45, and so can fit 2 or even 3 of our 15-core tasks on a single node).