Pytorch
Summary and Version Information
Package | Pytorch |
---|---|
Description | Neural network and tensor computation library for python |
Categories | Computerscience, Research |
Version | Module tag | Availability* | GPU Ready | Notes |
---|---|---|---|---|
0.4.1 | pytorch/0.4.1 | Deepthought2 HPCC RedHat6 | Y | |
1.4.0 | pytorch/1.4.0 | Deepthought2 HPCC RedHat6 | Y | |
Notes:
*: A package labelled as "available" on an HPC cluster can be used on the compute nodes of that cluster. Even software not listed as available on an HPC cluster is generally available on the login nodes of the cluster (assuming it is available for the appropriate OS version; e.g. RedHat Linux 6 for the two Deepthought clusters). This is because the compute nodes do not use AFS and instead have their own copies of the AFS software tree, so we only install packages there as requested. Contact us if you need a version listed as not available on one of the clusters.
In general, you need to prepare your Unix environment to be able to use this software. To do this, use either:
tap TAPFOO
or
module load MODFOO
where TAPFOO and MODFOO are one of the tags in the tap and module columns above, respectively. The tap command will print a short usage text (use -q to suppress this; this is needed in startup dot files); you can get a similar text with module help MODFOO. For more information, see the documentation on the tap and module commands.
For packages which are libraries which other codes get built against, see the section on compiling codes for more help.
Tap/module commands listed with a version of current will set up for what we considered the most current stable and tested version of the package installed on the system. The exact version is subject to change with little if any notice, and might be platform dependent. Versions labelled new would represent a newer version of the package which is still being tested by users; if stability is not a primary concern you are encouraged to use it. Those with versions listed as old set up for an older version of the package; you should only use this if the newer versions are causing issues. Old versions may be dropped after a while. Again, the exact versions are subject to change with little if any notice.
In general, you can abbreviate the module tags. If no version is given, the default current version is used. For packages with compiler/MPI/etc dependencies, if a compiler module or MPI library was previously loaded, the module command will try to load the build of the package that matches them. If you specify the compiler/MPI dependency, it will attempt to load the compiler/MPI library for you if needed.
The PyTorch package is NOT natively installed on the Deepthought2 cluster for various technical reasons. What is provided instead are Singularity containers which have versions of both python2 and python3 installed with support for PyTorch and related python packages.
To use the PyTorch python package, you must load the appropriate environmental module (e.g. module load pytorch) and then launch the python interpreter inside the Singularity container. Note: you cannot access the torch/pytorch python packages within the native python installations (e.g. module load python); you must use the python installation in the container for PyTorch.
To assist with this, the following wrapper scripts have been provided:
- pytorch: Launches the python2 interpreter within the container, with support for the torch/pytorch package as well as various other packages. Any arguments given will be passed to the python interpreter, so you can do something like pytorch myscript.py.
- pytorch-python2: The same as pytorch, provided for completeness and symmetry.
- pytorch-python3: Like pytorch, except that a python3 interpreter with support for the torch/pytorch package is invoked.
Please note that in all cases, the name of the module to import is torch, not pytorch.
In all cases, any arguments given to the wrapper scripts are passed directly to the python interpreter running within the container. E.g., you can provide the name of a python script, and that script will run in the python interpreter inside the container. Your home and lustre directories are accessible from within the container, so you can read and write to files in those directories as usual.
Note that if you load the pytorch environmental module (e.g. module load pytorch) and then issue the python command, you will start up a natively installed python interpreter which does NOT have the pytorch/torch python package installed. You need to start one of the python interpreters inside the container to get these packages; you can either do that using the correct singularity command, or use the friendlier wrapper scripts described above.
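A quick way to confirm which interpreter you ended up in is a short script along the lines below. This is only a sketch: the file name check_torch.py and the function describe_environment are made up for illustration. Run as pytorch-python3 check_torch.py it should report a torch version; run with the native python command it should report that torch cannot be imported.

```python
import sys


def describe_environment():
    """Return the Python version and the torch version (None if torch is not importable)."""
    info = {"python": sys.version.split()[0], "torch": None}
    try:
        import torch  # expected to succeed only inside the container's interpreters
    except ImportError:
        pass
    else:
        info["torch"] = torch.__version__
    return info


if __name__ == "__main__":
    info = describe_environment()
    print("Python: {}".format(info["python"]))
    if info["torch"] is None:
        print("torch is not importable here; use one of the pytorch* wrapper scripts")
    else:
        print("torch: {}".format(info["torch"]))
```

The import is wrapped in try/except so the same script gives a readable answer in both the native and the containerized interpreters instead of a traceback.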
It is hoped that for most users, the "containerization" of this package should not cause any real issues, and hopefully not even really be noticed. However, there are some limitations to the use of containers:
- In general, you will not have access to natively installed software, just the software included in the container. So even if some package foo is installed natively on Deepthought2, it is likely not accessible from within the container (unless there is a version of it also installed inside the container).
- You will likely not be able to use the python virtualenv scripts to install new python packages for use within the container, as the virtualenv command will be installing packages natively, which would not then be available inside the container.
However, you are permitted to create your own Singularity containers and to use them on the Deepthought2 cluster. You will need to have root access on some system (e.g. your workstation or desktop) with Singularity installed to build your own containers (we cannot provide you root access on the Deepthought2 login or compute nodes). You can also copy system provided containers and edit them. More details can be found under the software page for Singularity.
GPU support
The PyTorch package can make use of GPUs on nodes with GPUs. There is nothing special that needs to be done in the module load or the various pytorch* commands, but you will need to instruct the package to use the GPUs within your python code. This is typically done by replacing a line like
device = torch.device("cpu")
with
device = torch.device("cuda:0")
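If the same script has to run on both GPU and CPU-only nodes, one common pattern is to choose the device at run time rather than hard-coding it. The sketch below assumes that approach; the helper names pick_device_name and select_device are illustrative and not part of PyTorch, while torch.cuda.is_available() and torch.device() are standard PyTorch calls.

```python
def pick_device_name(cuda_available, index=0):
    """Return "cuda:<index>" when a GPU is usable, otherwise "cpu"."""
    return "cuda:{}".format(index) if cuda_available else "cpu"


def select_device():
    """Build a torch.device, falling back to the CPU on nodes without GPUs."""
    import torch  # imported here so pick_device_name works without torch installed
    return torch.device(pick_device_name(torch.cuda.is_available()))


# Typical use inside a training script (model and batch are your own objects):
#     device = select_device()
#     model = model.to(device)
#     batch = batch.to(device)
```

Note that choosing the device does not move anything by itself; tensors and models still have to be sent to it explicitly with .to(device).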
Distributed pytorch
Although the Singularity containers with pytorch do not have MPI support, pytorch has its own distributed package (torch.distributed) which can handle parallelizing your computations across multiple nodes. More information on using torch.distributed in your Python codes can be found at the PyTorch Distributed Tutorial and the Distributed Communication Documentation.
We recommend using the TCP based initialization, using something like the example script below:
#!/bin/bash
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive
#Define module command, etc
. ~/.profile
#Load the pytorch module
module load pytorch/0.4.1
#Number of processes per node to launch (20 for CPU, 2 for GPU)
NPROC_PER_NODE=20
#This command to run your pytorch script
#You will want to replace this
COMMAND="YOUR_TRAINING_SCRIPT.py --arg1 --arg2 ..."
#We want names of master and slave nodes
MASTER=`/bin/hostname -s`
SLAVES=`scontrol show hostnames $SLURM_JOB_NODELIST | grep -v $MASTER`
#Make sure this node (MASTER) comes first
HOSTLIST="$MASTER $SLAVES"
#Get a random unused port on this host (MASTER) between 2000 and 9999
#First line lists the candidate ports 2000-9999
#2nd line lists the local ports currently in use, as reported by ss
#comm -23 keeps the candidates not in use; shuf -n 1 picks one at random
MPORT=$(comm -23 <(seq 2000 9999) \
    <(ss -tan | awk 'NR>1 {print $4}' | sed 's/.*://' | sort -u) | \
    shuf -n 1)
#Launch the pytorch processes, first on master (first in $HOSTLIST) then
#on the slaves
RANK=0
for node in $HOSTLIST; do
    ssh -q $node \
        pytorch -m torch.distributed.launch \
        --nproc_per_node=$NPROC_PER_NODE \
        --nnodes=$SLURM_JOB_NUM_NODES \
        --node_rank=$RANK \
        --master_addr="$MASTER" --master_port="$MPORT" \
        $COMMAND &
    RANK=$((RANK+1))
done
wait
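To make the bookkeeping in the launch loop concrete, the sketch below works out the numbers for the example job. It assumes the usual torch.distributed.launch convention that the world size is nnodes times nproc_per_node and that each process gets the global rank node_rank times nproc_per_node plus its local rank. It also assumes the 40 tasks land on two 20-core nodes. The function names are illustrative only.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of processes in the job."""
    return nnodes * nproc_per_node


def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of the process with the given local rank on the given node."""
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    nnodes, nproc_per_node = 2, 20  # assumed: 40 tasks at 20 processes per node
    print("world size: {}".format(world_size(nnodes, nproc_per_node)))
    for node_rank in range(nnodes):
        first = global_rank(node_rank, 0, nproc_per_node)
        last = global_rank(node_rank, nproc_per_node - 1, nproc_per_node)
        print("node {} runs global ranks {}..{}".format(node_rank, first, last))
```

Under these assumptions the master node (node rank 0) hosts global ranks 0 to 19 and the second node hosts ranks 20 to 39, which is why the master must come first in HOSTLIST.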
The python code should have a structure looking something like:
import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--distributed', action='store_true', help='enables distributed processes')
parser.add_argument('--local_rank', default=0, type=int, help='rank of this process on its node (passed by torch.distributed.launch)')
parser.add_argument('--dist_backend', default='gloo', type=str, help='distributed backend')

def main():
    opt = parser.parse_args()
    if opt.distributed:
        # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
        dist.init_process_group(backend=opt.dist_backend, init_method='env://')
        print("Initialized Rank:", dist.get_rank())

if __name__ == '__main__':
    main()
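When initialization with init_method='env://' hangs or fails, it can help to check that the environment variables it relies on (MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE) were actually set for the process. The helper below is a debugging sketch under that assumption; read_dist_env and REQUIRED_VARS are hypothetical names, not part of torch.distributed.

```python
import os

# Variables that the env:// initialization method expects to find.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def read_dist_env(environ=None):
    """Collect the env:// settings, raising KeyError that names any missing ones."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if name not in environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "rank": int(environ["RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
    }


if __name__ == "__main__":
    try:
        print(read_dist_env())
    except KeyError as err:
        print("Not started by the launcher? {}".format(err))
```

Calling read_dist_env() just before dist.init_process_group() and printing the result on each process gives a quick view of whether every rank sees the same master address and port.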