Glossary of High Performance Computing Terminology
High Performance and related computing paradigms has its own set of terminology which might be confusing to those new to these fields. This page tries to define the terminology as used throughout this web site in simple language.
- (hardware) accelerator
- While traditionally in computing the computation takes place on a
general purpose CPU, many workloads can obtain great
improvements in performance when run on specialized processors like
field-programmable gate arrays
and other technologies. The general term for these specialized processors
These accelerators typically must be placed in a node containing a generalized CPU, and programs must be specifically coded to take advantage of these technologies. These accelerators can often be expensive, with nodes containing the accelerators generally costing several times the cost of a comparable node without the accelerators, but the performance gains for suitable workloads can be very significant. Because of this, typically only a small subset of the nodes have accelerators, and you will need to instruct the scheduler to use those nodes.
- The term "account" can have two disparate usages. The first usage,
sometimes distinguished with the terms "user" or "login", refers to the
user credentials that you use to log into the cluster, and other attributes,
etc. connected with your user identity on the cluster. This sense of the
word "account" is similar to the login/user accounts on other computer
The second usage refers to allocation accounts, sometimes simply referred to as allocaitons.
- Every user of the cluster is associated with one or more projects, and each project has associated with it one or more "allocation accounts". These allocation accounts track the compute usage by members of the project over some period of time, typically monthly, quarterly or yearly/lifetime of the project. The allocation accounts also have a limit preventing you from using more compute time than was allocated to the project/allocation account.
- Backfilling of jobs
- Backfilling is a technique used by the
scheduler to increase the utilization of the cluster. When enabled, it
allows the scheduler to run smaller, shorter jobs from
further down the queue when the job at the top of the
queue is waiting for resources, provided that the smaller, shorter jobs are
expected to complete before the additional resources needed by the larger
job are expected to become available.
For example, consider a small cluster with ten nodes (node #1 to node #10), and five of the nodes (nodes #1 to #5) are currently in use by jobs which have another 12 hours of walltime remaining, and the remaining five nodes (#6 to #10) are idle. At the top of the queue is a job (job #1) that needs 8 nodes --- since only five nodes are idle, it is currently pending and the scheduler estimates that it will remain pending for another 12 hours (when some of the jobs currently running finish and at least three of nodes #1 to #5 become idle). Next in the queue are a bunch of small jobs (jobs #2-6) that need only node for 2 hours.
Without backfilling, the scheduler would hold nodes #6 through #10 for job #1, and wait until at least three of the nodes #1 through #5 become idle (which will happen when the jobs running on them finish), and then allocate nodes to and start job #1. Since it is expected take about 12 hours for the jobs on nodes #1 through #5 to finish, nodes #6 through #10 will remain idle for about 12 hours.
With backfill, the scheduler sees that jobs #2 through #6 are small, and that if it started them on the nodes #6 through #10, they could all run and finish within 2 hours (since the scheduler will kill them if they run past 2 hours), and so running those jobs should not delay the start of job #1. This allows for more efficient use of resources.
Of course, in reality things are not quite so black and white. The scheduler knows the walltimes requested by jobs, but a job can finish before its walltime is up, and indeed they usually will (since the scheduler will terminate jobs if they exceed their requested walltimes, users are encouraged to provide a bit of padding in their walltime estimates to ensure the job will finish before then). However, if the walltimes provided by the users are somewhat reasonable estimates of the actual runtime, even if padded by 20 or 30% for safety, this backfill process can help keep the queue moving without introducing much delay into the larger jobs.
- Checkpointing is the process by which a program or job periodically saves its state to disk so that if the code is terminated for some reason, the state information can be used to resume the calculation from when the checkpoint was made. This is commonly used to provide some resilience to jobs, especially long jobs, from hardware and other failures --- when the job is resubmitted, it can (with appropriate configuration) resume from the last checkpoint rather than from the very beginning. See also:
- A computer core is the basic unit for processing on a
computer processor or GPU. Each core
is capable of performing an independent calculation at the same time, so
in the high performance computing paradigm we typically want
code with each parallel thread of execution running on a
separate core to gain maximum performance.
From a software perspective, each core looks much like a CPU , and the cores are sometimes referred to as CPUs, but for this website we will try to always use CPUs to refer to the chip/socket, and cores (or CPU cores) for the basic processing units within the CPU.
- The acronym CPU stands for the "central processing unit" of the system,
traditionally the "brain" behind the computer. It is where most of the
processing traditionally occurs.
The CPU consists of one or more cores (modern CPUs almost always consist of multiple cores, the CPUs on the Deepthought2 cluster have 10 cores per CPU, some recent chips have upwards of 60 cores per CPU). The CPU is the physical "chip"; many nodes have multiple CPUs. The CPU is sometimes referred to as a "socket" to more clearly distinguish it from the cores --- from a software perspective the cores look like CPUs, and some places use the term CPU to refer to cores, so the socket terminology is useful. On this web site, we will try to always use the term CPU to refer to the chip/socket, and refer to cores as cores (or sometimes CPU cores). Note that the Slurm scheduler (e.g.
sbatchand other commands), tends to use the term CPU to refer to a core, and uses the term
socketwhen refering to the physical processor.
- CUDA is a platform for
parallel computing, primarily used for doing
general purpose computing on GPUs. CUDA can be accessed
via specific libraries, use of compiler directives like
OpenACC, and the specialized
- CUDA compute capability
- Different GPUs support different features and command sets. Generally
this is a cumulative hierarchy, so later GPUs support all of the features
of earlier GPUs, and have some additional capabilities not supported by the
earlier ones. This hierarchy is encapsulated as the CUDA compute
capability, which is a simple version number which specifies the features
supported by a given GPU and CUDA driver library. E.g., double precision
floating point operations were introduced in CUDA compute capability 1.3; GPUs
supporting only 1.0, 1.1, or 1.2 do not support double precision, but GPUs
supporting 1.3 or higher do.
The following list gives the CUDA compute capability for GPUs on the various UMD clusters:
- K20 GPUs: 3.5 (Deepthought2 cluster)
- K80 GPUs: 3.7 (Bluecrab cluster)
- P100 GPUs: 6.0 (Bluecrab and Juggernaut clusters)
- V100 GPUs: 7.0 (Juggernaut cluster)
- A100 GPUs: 8.0 (successor to Deepthought2 ???)
- List of CUDA compute capabilities of various GPUs (from Wikipedia)
- List of computational features supported by various CUDA compute capabilities (from Wikipedia)
- Distributed Memory Parallelism
- Distributed memory parallelism is a paradigm for
in which the parallel processes do not all share a global memory address space.
Because of this, communication among the different processes cannot occur only
over shared memory. This is in contrast with
shared memory parallelism
which can use the shared memory space for inter-process communication.
Distributed memory parallelism is typically implemented by using the Message Passing Interface (MPI) for communication between the tasks. Although programming using MPI is generally harder than using shared memory paradigms like OpenMP, it has the advantage that it is not restricted to a single node. This allows the job to parallelize over more CPU cores than are available on a given node, allowing for parallel jobs spanning hundreds or even thousands of cores.
- embarassingly parallel
- An embarassingly parallel algorithm or problem is one that can be
easily split into multiple tasks that can run in
parallel with little if
any dependency or need for communication between the different tasks. These
are sometimes referred to as
In many cases, because of the lack of need for communication between the tasks, this type of job can be implemented as a series of sequential jobs, with each task in its own job.
An example of an embarassingly parallel problem might be a parameter sweep. I.e., assume you have a complex calculation that depends on three parameters, each with a valid range, and you are trying to maximize some value depending on the parameters. You can form a discrete grid by stepping over the valid ranges of the parameters, and then submit jobs for each of those triplets. After the jobs run, examine the results and find the maximum, perhaps iterating with a finer grid around the maximum.
- FLOPS, short for FLoating-point
OPerations per Second is a measurement
of computer performance, and as the name suggests, basically measures the
maximum number of calculations using floating point arithmetic that a
computer can perform per second. This is the measure commonly used to
compare the performance of HPC clusters, except that we usually talk in terms
of teraFLOPS (1 TFLOPS = 10^12 FLOPS) or petaFLOPS (1 PFLOPS = 10^15 FLOPS)
or even exaFLOPS (1 EFLOPS = 10^18 FLOPS).
- Graphics Processing Unit (GPU)
- Graphics Processing Units (GPUs) are hardware
accelerators which were initially designed to
facilitate the creation of graphics for output to a display device. However,
their highly parallel
structure makes them more efficient then general
purpose CPUs for certain types of algorithms, and so they
are quite useful for processing certain "number crunching" workloads.
High-end GPUs can be expensive, and are rapidly evolving, so only a small subset of nodes typically have GPUs, and if you wish to use them you will need to instruct the scheduler to use those nodes. Although a GPU contains a large number of cores, they are not compatible with the standard Intel x86 architecture, and codes need to written (and compiled) especially for these devices, typically using the CUDA or OpenCL platforms. Some of the applications in the standard software library, but even in those cases you need to use the versions which specifically support CUDA.
- hybrid parallelism
Hybrid parallelism is a paradigm for
parallel computing in which some of the
parallelism is done using
distributed memory parallelism
for some of the parallelism, and
shared-memory parallelism for the
rest. This is commonly done by using MPI for the
distributed-memory type, with each MPI task being
This technique is not common, but typically occurs when the problem can be decomposed for parallelization in two distinct manners, with the code using one paradigm for one decomposition and the other for the other.
Even for codes which support hybrid parallelization, then benefits and the performance gains might be highly dependent on the problem being solved. So it is advisable to benchmark to find the optimal configuration.
- High Performance Computing (HPC)
- High Performance Computing is a general term for computing with a high
level of performance. Generally high performance computing specifically
refers to running jobs which are very parallel, often running on hundreds or
even thousands of cores.
However, then term is often used more generally to encompass not only traditional high performance computing but also high-throughput computing and various machine learning paradigms. The Deepthought clusters are generally designed for High Performance Computing, and are referred to as such, but they also support High Throughput Computing as well.
- High Throughput Computing (HTC)
- High Throughput Computing refers to a computer system designed for
doing large amounts of computation as fast as possible. Unlike
High Performance Computing, High Throughput Computing
tends to run large numbers of sequential or
E.g., a typical High Performance Compute job is something that would not run on a typical workstation and requires an HPC cluster. A High Throughput Compute user typcially has jobs that likely could run on a typical workstation, but there is such a large number of these jobs that running one at a time on a single workstation would take a long time. A HTC user would want to submit a large number of these jobs, often hundreds, to a cluster which could run a significant number of the jobs at the same time, reducing the overall time to get all of the jobs run.
The Deepthought Clusters are primarily designed and intended for High Performance Computing, but they also support High Throughput Computing.
- This is Intel's implementation of an MPI library.
- When one wishes to run a calculation on the HPC cluster,
one submits a job. This job consists not only of the actual
executable program to run, but its command line arguments, and input files
needed, the environmental variables needed to have the job run properly,
output directives, and instructions to the scheduler
about what compute resources (how many cores/nodes and how much memory is
needed, how long the job can run, whether GPUs are needed, etc). All of those
specifications collectively comprise the job.
Once submitted to the scheduler, the job is assigned an unique identifer (called the job number), and is placed in the queue waiting for the scheduler to find the necessary resources for it and start it. See also:
- Overview of submitting jobs to the cluster
- Job Priority
- Every job submitted to the scheduler has an associated priority
which determines the order in which the scheduler will try to run them.
While broadly speaking jobs are run in a first-come, first-served order,
there are exceptions. Jobs submitted to the debug partition get the highest
priority (i.e. the scheduler tries to run those first) --- these jobs have
highly restrictive time limits and a small dedicated pool of nodes, so this
generally allows users to get these jobs started quickly, to promote rapid
debugging of issues.
Next in order come jobs submitted from high priority allocation accounts. Such allocation accounts are provided to projects which have contributed financially to the construction/expansion of the cluster (often indirectly via their college or deparment). Because the cluster would not exist without such contributions, such projects get some of their SUs as high priority SUs.
Most remaining jobs are run at standard priority. Projects receiving allocations from the AAC get allocation accounts with standard priority, and projects with high priority allocations running more than their monthly allotment of SUs also run at standard priorty.
- Glossary entry on the service units
- Please see service unit (SU).
- Message Passing Interface (MPI)
- The Message Passing Interface (MPI) is a standardized and portable
standard for communication between tasks in a
distributed memory parallel
job. It has become a de-facto standard for programs which can run across
There are a number of implementations of this interface. UMD HPC clusters typically provide:
- For GNU compilers, OpenMPI: an open-source MPI implementation
- For Intel compilers, Intel MPI: Intel's implementation of MPI
Both implementations include wrappers for C, C++, and Fortran compilers, as well as bindings for many scripting languages including Python and R.
- MPI Communicator
MPI jobs group the various processes
which comprise the job into objects called communicators. The default
communicator is MPI_COMM_WORLD which consists of all of the processes that
were in the job when it first started. For many jobs, all of the processes
are closely coupled, and that is the only communicator you need, but MPI
provides other mechanisms which are needed in some more complicated cases.
Each communicator has a
size, which is the number of processes in the communicator. Every process in the communicator has an unique
rank, which is an integer between 0 and the
size - 1of the communicator. See also:
- In HPC terminology, we use the term
nodeto refer to a complete computer system, comprising of at least one CPU, memory, power supply, various buses, networking devices and generally some local disk storage. This is the direct analog to your laptop or workstation.
On the clusters, we have differentiated the nodes into various specializations:
- compute nodes
- Most of the nodes fall into this category. These are the "work horses" of the cluster, where all the real computation gets done. These nodes can be further specialized into compute nodes with specific features, like nodes with large amounts of RAM (large memory nodes) or nodes which have GPUs or other specialized accelerators (e.g. GPU nodes).
- login nodes
- These are the nodes you can actually log into to submit jobs, edit files, monitor your jobs, view the results of jobs, etc. They are *not* intended to do number crunching or real computation. If the cluster has distinct DTNs, then most file transfers (certainly all transfer of large amounts of data) should be done on the DTNs. You should not run any computationally intensive processes on the login nodes.
- data transfer nodes (DTNs)
- These are nodes optimized for the transferring of the large amounts of data, both data sets needed for jobs running on the cluster, and moving results of finished jobs off the cluster. You should not run any computationally intensive processes on the login nodes.
- service nodes
- In addition to the aforementioned nodes, these are nodes on which you generally will never run processes, but which provide needed services for the cluster. E.g. the scheduler, file servers, web servers, etc.
OpenACC is a programming standard designed for
heterogeneous CPU/GPU systems as
well as other accelerators.
OpenCL is a programming framework for writing code to execute across
heterogeneous platforms of CPUs, GPUs,
and other hardware accelerators. It is an
OpenMP is implemented in the compiler, adding some pragmas to the language with which you can instruct the compiler which sections of the code are parallelizable and which must be run sequentially.
- Parallel Computing
- Parallel computing is a type of computation wherein many calculations
are carried out simultaneously. This contrasts with
sequential computing wherein only a single calculation occurs at any
Obviously, a parallel computation should be faster than a sequential computation. The speedup of an algorithm when parallelized is the ratio of the runtime of the sequential code to that of the parallelized code. Although one might naively expect the speedup (compared to the sequential calculation) of a computation parallelized over N CPU cores would be N, this is rarely the case. Typically, there are parts of an algorithm which cannot take advantage of parallelization, causing the speedup to be less than N. Generally, the addition of CPU cores will start having diminishing returns at some point, depending on the algorithm.
- In HPC terminology, a partition is a set of
compute nodes. Clusters typically divide their nodes into partitions to
reflect different hardware availabilities, or to associate priority and
quality of service settings to jobs submitted to the partition.
- Preemption of Jobs
- Most jobs on the cluster are submitted to the scheduler, wait in the
queue until resources are available, and then once the job is started, it
runs until it either completes, crashes, or is cancelled due to exceeding its
requested walltime. Even if a higher priority job is
submitted afterwards which "wants" the resources being used by the lower
priority job which already started, the running job is allowed to run to
its successful or unsuccessful completion, and the higher priority job is
forced to wait until it finishes (or other resources become available).
An exception to this is jobs submitted to the scavenger partition. These jobs, in addition to be the lowest priority, are also preemptible. This means that even after the scavenger job starts, if another, higher priority job (and every non-scavenger job is higher priority) "wants" the resources being used by the scavenger job, the higher priority job can kick the scavenger job off of the node(s) it is using, and take those node(s) for its own use. This is called preemption. Typically, the preempted scavenger job is put back in the queue, and will eventually start again (it is advisable that such scavenger jobs do checkpointing so that they do not start over from the very beginning but can make some progress towards completion in these intermittent runs).
Again, normal jobs will not get preempted, only scavenger jobs. Scavenger jobs are subject to preemption, but your allocation account does not get charged for scavenger jobs.
- Glossary entry on the scavenger partition
- Priority of Jobs
- Please see job priority.
Every user of the cluster is associated with one or more
projects. A project has one or more
allocation accounts which track the
Projects usually belong to research groups, although sometimes they belong to a department or a subset of a research group. They provide a means of organizing users and associating them with allocation accounts. Members of the same project also share membership in an Unix group, which can be used to facilitate sharing files (although by default your home and data directories are restricted to your login account only, they are generally group-owned by the project group, and you can use
chmodto allow group members to read).
- In Britain, a queue is a line of people, etc. waiting their turn
for something. This definition is borrowed in HPC to
refer to the list of jobs that are waiting to be allocated
resources and started by the scheduler.
When you submit a job to the cluster and then check its status, unless you are lucky enough to have submitted it when the cluster is very idle, you will see that the job is in a "pending" state --- i.e. the job is waiting in the queue.
The amount of time a job spends in the queue depends on how busy the cluster is, how many other jobs are in the queue, how many resources your job is requesting and for how long, the priority of the job, and many other factors. A sizable production job requesting a day or more of runtime can easily spend a couple of hours in the queue. Very short jobs (under 15 minutes) should look into the debug partition which generally has much lesser wait times.
In some sense, the different partitions each have their own queue, and so sometimes the term queue is also used to refer to a partition. Some schedulers tend to use the term queue in that fashion more than others.
- Scavenger partition
- The scavenger partition is a special
partition that allows users to submit very low priority
partition is a special
partition that allows users to submit very low priority
job that is subject to preemption,
but in return does not incur any charges against the
Because scavenger jobs do not incur charges against the allocation account, these are an useful mechanism to run jobs in excess of your SU allotment. But because of the preemption, you are strongly encouraged to make use of checkpointing in order that your job can make progress during its piecemeal run time.
- The scheduler is a critical component of an HPC
environment. It is a complex piece of code responsible for allocating
resources to the various jobs in the
queue in a timely fashion.
The scheduler is in some ways analagous to the host or hostess at a restaurant, with the restaurant customers being like jobs, the tables being like the compute resources, and the waiting list being like the queue. If the restaurant is not busy, a new customer can just come in and the host/hostess will seat them almost immediately, just like when the cluster is idle the scheduler will allocate resources to a new job almost immediately. When things are busy, the host/hostess will place customers on a waiting list, and customers will be seated in roughly the order they came in. But there are exceptions --- some parties are too big for some tables or might have special requests (e.g. indoor vs outdoor seating, etc), so sometimes smaller or less particular parties can jump ahead in the queue.
In the same way, the scheduler tries to allocate resources to jobs roughly in the order they came in, but modified because the jobs have different sizes (from single cores to thousands of cores) and can requestaGPUs or large amounts of memory. Also, unlike most restaurants, the scheduler will allow in some cases for multiple jobs to be placed on the same table. As an added twist, jobs are required to specify the maximum amount of time for which they will run, and the scheduler can use this to backfill jobs.
There are many different schedulers available for HPC clusters. At UMD, we have been using the Slurm scheduler since 2014. Slurm is an open source product used at many clusters in the world, including most of the TOP500 systems.
- Sequential Job or Process
- A sequential job or process is a job or process which does not do any form of parallelism, i.e. the code processes one instruction, then the next, and so on. These are also referred to as single core jobs or processes, as they can efficiently run on a single CPU core.
Similarly, the above formula will often be tweaked to account for other resources. E.g., memory is a restricted resource, and so a job which is using twice the average memory per CPU core might be charged as if twice as many cores were in use. The use of GPUs might also incur an increase in the number of SUs charged.
Note also that the charges are based on the resources allocated to the job, not what is actually used. So submitting a job in exclusive mode (which disallows any other jobs from sharing the node with your job) will typically result in your being charged for all the cores on the node whether you are using them or not.
Compared to the amount of computation done on the cluster, an SU is a fairly small unit, so one often talks in terms of kSU or even MSU, where 1 kSU = 1000 SU and 1 MSU = 1,000,000 SU.
I.e., if the sequential code takes T_seq seconds, and the parallel code takes T_par seconds, the speedup is given by S = T_seq/T_par.
Note: in order to use this paradigm, all of the processes using shared memory parallelism must be running on the same node, otherwise they would not be able to share memory.
I.e., if the sequential code takes T_seq seconds, and the parallel code takes T_par seconds, the speedup is given by S = T_seq/T_par.
- MPI jobs consist of a number of tasks, which is the
smallest unit of execution within MPI. All processes, etc. for a given
task run on the same node (other tasks can also be on the same node, but
a task will not be split across multiple nodes), and each task
uses the MPI for communication with the other tasks, and therefore the
tasks can all be running on different nodes (although
they can also run on the same node).
Hybrid jobs can also exist, in which each MPI task uses a shared memory parallelism technique (OpenMP or other threading paradigms) for additional parallelism. In this case, each task needs multiple CPU cores on the same node. Although all cores for a task will always be assigned on the same node, the default is one CPU core per task, so you will need to specify the number of CPU cores per task.
- A thread, often called a "lightweight process", is a thread of execution,
which is the smallest sequence of instructions which the operating system
From an HPC perspective, a multi-threaded process can have multiple threads of execution running simultaneously, assuming that the node the job is running on has available cores for each thread. Technically, a multi-threaded process can run on fewer cores than threads, but in that case you will not get full advantage of the parallelism. Generally for HPC purposes, you want a separate CPU core for each thread, to get the maximum speedup.
- Threading Building Blocks
- Threading Building Blocks (aka TBB or oneTBB) is a C++ template library
developed by Intel for shared memory