When your job starts to execute, the batch system will execute the script file you submitted on the first node assigned to your job. If your job is to run on multiple cores and/or multiple nodes, it is your script's responsibility to deliver the various tasks to the different cores and/or nodes. How to do this varies with the application, but some common techniques are discussed here.
The scheduler sets the environment variable
$PBS_NODEFILE
which contains the name of a file that lists all of the nodes assigned to your job. If you are assigned multiple cores on the same node, the name of that node appears multiple times (once per core assigned) in that file.
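For example, a job script can inspect this file to see how many cores and distinct nodes it was given. A minimal Bourne-shell sketch (the counting idiom is standard shell, not anything specific to this system):
NCORES=$(wc -l < $PBS_NODEFILE)           # one line per assigned core
NNODES=$(sort -u $PBS_NODEFILE | wc -l)   # number of distinct nodes
echo "This job has $NCORES core(s) on $NNODES node(s):"
cat $PBS_NODEFILE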
Serial or single-core jobs are the simplest case. The batch system starts processing your job script on a core of the first node assigned to your job, and in the single-core case this is the only core/node assigned to your job. So there is really nothing special that you need to do; just enter the command for your job and it should run.
NOTE: Do not confuse single-core jobs and the poorly-named serial queue. Single-core jobs can run on any of the narrow-* queues or the debug queue (the wide-* and ib queues require multiple nodes), and most users submitting single-core jobs probably want to use a narrow-* queue. The serial queue is a low-priority, preemptible queue, as discussed in its own section.
The next simplest case is a job that runs on a single node but is multithreaded; OpenMP codes that are not also using MPI typically fall into this category. Again, usually there is nothing special that you need to do.
One exception is if you are not using all the cores on the node for this job. In this case, you might need to tell the code to limit the number of cores being used. This is true for OpenMP codes, as OpenMP will by default try to use all the cores it can find.
Normally, once a node is assigned to you for one of your jobs, no one else can run jobs on that node (while it is still assigned to you). However, if room exists on the node, other jobs of yours might be placed on it. For example, if you have several jobs requesting 4 cores, two of these jobs might get the same 8-core node. If both of these are OpenMP jobs and you do not limit the number of cores they try to use, each job will try to run 8 threads, one per core on the machine, resulting in 16 threads on that machine. This will cause contention between your two jobs and actually reduce performance.
For OpenMP, you can set the environment variable
OMP_NUM_THREADS
in your job script to match the number of cores per node requested by the job. For our 4-core example, that is either
setenv OMP_NUM_THREADS 4
(csh/tcsh) or
OMP_NUM_THREADS=4
export OMP_NUM_THREADS
(Bourne/bash shells).
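Putting the pieces together, a single-node OpenMP job script might look like the following sketch (Bourne-shell syntax; the resource-request line and the program name ./my_openmp_prog are placeholders to adapt to your own job):
#!/bin/sh
#PBS -l nodes=1:ppn=4          # example request: 4 cores on one node (adjust as needed)
cd $PBS_O_WORKDIR              # start in the directory the job was submitted from
OMP_NUM_THREADS=4              # match the number of cores requested above
export OMP_NUM_THREADS
./my_openmp_prog               # hypothetical OpenMP-enabled executable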
OpenMPI is the preferred MPI unless your application specifically requires one of the alternate MPI variants. OpenMPI automatically "knows" about the contents of
$PBS_NODEFILE
so you do not need to include it on the command line. OpenMPI is also compiled to support all of the various interconnect hardware, so for nodes with fast transport (InfiniBand/Myrinet), the fastest interface will be selected automatically.
NOTE: All of the nodes in your job must be configured to use the same MPI library, version, and language bindings. This is best done by editing your ~/.cshrc.mine (or ~/.bashrc.mine if you are using a Bourne-style shell) to include the appropriate tap -q or module load command. You should ONLY tap/module load a single MPI library in your dot file; if you have multiple tap/module load lines, at best only the last one is effective (and likely none will work properly).
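For example, your ~/.bashrc.mine might contain a single line along these lines (the exact module name is site-specific, so check module avail to see what is installed):
module load openmpi    # load exactly one MPI library; the module name here is illustrative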
With that in place, you can simply invoke your MPI-enabled application with the mpirun command, e.g.
mpirun -np NUMCORES MY_APPLICATION
where
NUMCORES
is the number of cores/tasks to use, and
MY_APPLICATION
is your MPI-enabled executable.
If you are doing hybrid OpenMP/OpenMPI parallelization, NUMCORES should be the number of MPI tasks you wish to start, each using OMP_NUM_THREADS cores via OpenMP. If you wish to disable OpenMP parallelization, just set OMP_NUM_THREADS to 1.
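As a concrete illustration, a complete OpenMPI job script might look like the following sketch (Bourne-shell syntax; the resource-request line and the program name ./my_mpi_prog are placeholders for your own job):
#!/bin/sh
#PBS -l nodes=2:ppn=8              # example request: 2 nodes, 8 cores each (adjust as needed)
cd $PBS_O_WORKDIR
NP=$(wc -l < $PBS_NODEFILE)        # total number of cores assigned to the job
mpirun -np $NP ./my_mpi_prog       # OpenMPI reads the node list from $PBS_NODEFILE itself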
NOTE: Your code must be MPI aware for the above to work. Running a non-MPI code with mpirun might succeed, but you will have NUMCORES processes running the exact same calculations, duplicating each other's work, and wasting resources.
For more information, see the examples.
NOTE: Please consider using OpenMPI if your application supports it. Use of LAM is deprecated.
NOTE: As with OpenMPI, all of the nodes in your job must be configured to use the same MPI library, version, and language bindings; see the note above about editing your ~/.cshrc.mine or ~/.bashrc.mine.
The LAM MPI library requires you to explicitly set up the MPI daemons on all the nodes before you start using MPI, and tear them down after your code exits. So to run an MPI code you would typically have the following three lines:
lamboot $PBS_NODEFILE
mpirun C YOUR_APPLICATION
lamhalt
NOTE: As above, your code must be MPI aware for this to work; see the examples for more information.
NOTE: Please consider using OpenMPI if your application supports it. Use of MPICH is deprecated.
NOTE: As with the other MPI libraries, all of the nodes in your job must be configured to use the same MPI library, version, and language bindings; see the note in the OpenMPI section about editing your ~/.cshrc.mine or ~/.bashrc.mine.
Note also that if you've never run MPICH before, you'll need to create the file .mpd.conf in your home directory. This file should contain at least a line of the form MPD_SECRETWORD=we23jfn82933. (DO NOT use the example provided, make up your own secret word.)
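For example, you could create the file once from the command line along the lines of the sketch below (pick your own secret word; mpd also generally insists that the file be readable only by you, hence the chmod):
echo "MPD_SECRETWORD=YOUR_OWN_SECRET_WORD" > ~/.mpd.conf   # choose your own secret word
chmod 600 ~/.mpd.conf                                      # mpd typically refuses a world-readable file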
The MPICH implementation of MPI also requires the MPI pool to be explicitly set up and torn down. The setup step involves starting mpd daemon processes on each of the nodes assigned to your job.
A typical MPICH job will use lines like the following:
mpdboot -n NUM_NODES -f NODE_FILE
mpiexec -n NUM_CORES YOUR_PROGRAM
mpdallexit
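Here NUM_NODES, NODE_FILE, and NUM_CORES are placeholders. In a job script they can be derived from $PBS_NODEFILE, for example along the lines of this Bourne-shell sketch (the executable name ./my_mpi_prog is hypothetical):
NODE_FILE=$(mktemp)
sort -u $PBS_NODEFILE > $NODE_FILE             # one entry per distinct node for mpdboot
NUM_NODES=$(wc -l < $NODE_FILE)
NUM_CORES=$(wc -l < $PBS_NODEFILE)             # one line per assigned core
mpdboot -n $NUM_NODES -f $NODE_FILE
mpiexec -n $NUM_CORES ./my_mpi_prog
mpdallexit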
NOTE: As above, your code must be MPI aware for this to work; see the examples for more information.
The above will work as long as you do not run more than one MPI job on the same node at the same time; since most MPI jobs use all the cores on a node anyway, this is fine for most people. If you do run into the situation where multiple MPI jobs are sharing nodes, when the first job calls mpdallexit, all the mpds for all jobs will be killed, which will make the second and later jobs unhappy. In these cases, you will want to set the environment variable
MPD_CON_EXT
to something unique (e.g. the job id) before calling
mpdboot
and add the --remcons option to mpdboot, e.g.
mpdboot -n NUM_NODES -f NODE_FILE --remcons
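Putting this together with the variables derived in the earlier sketch, the shared-node variant might look like the following ($PBS_JOBID is the job id set by the scheduler; ./my_mpi_prog is again hypothetical):
MPD_CON_EXT=$PBS_JOBID                           # something unique to this job
export MPD_CON_EXT
mpdboot -n $NUM_NODES -f $NODE_FILE --remcons    # NUM_NODES/NODE_FILE derived as above
mpiexec -n $NUM_CORES ./my_mpi_prog
mpdallexit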
MPI is currently the most standard way of launching, controlling, synchronizing, and communicating across multi-node jobs, but it is not the only way. Some applications have their own process for running across multiple nodes, and in such cases you should follow their instructions.
The examples page shows an example of using the basic ssh command to start a process on each of the nodes assigned to your job. Something like this could be used to break a problem into N chunks that can be processed independently, and send each chunk to a different core/node.
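A bare-bones sketch of that idea, assuming a hypothetical worker script ./process_chunk.sh that takes a chunk number:
CHUNK=0
for NODE in $(sort -u $PBS_NODEFILE); do          # one ssh per distinct node
    ssh $NODE "cd $PBS_O_WORKDIR && ./process_chunk.sh $CHUNK" &
    CHUNK=$((CHUNK + 1))
done
wait                                              # keep the job script alive until all chunks finish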
However, most real parallel jobs require much more than just launching the code: the passing of data back and forth, synchronization, etc. And for a simple job like the one described, it is often better to submit separate jobs in the batch system for each chunk.