PyTorch

Summary and Version Information

Package: PyTorch
Description: Neural network and tensor computation library for Python
Categories: Computer science, Research

Version   Module tag       Availability*                  GPU Ready   Notes
0.4.1     pytorch/0.4.1    Deepthought2 HPCC (RedHat6)    Y
1.4.0     pytorch/1.4.0    Deepthought2 HPCC (RedHat6)    Y

Notes:
*: A package labelled as "available" on an HPC cluster can be used on the compute nodes of that cluster. Even software not listed as available on an HPC cluster is generally available on the login nodes of the cluster (assuming it is available for the appropriate OS version, e.g. RedHat Linux 6 for the two Deepthought clusters). This is because the compute nodes do not use AFS and instead have copies of the AFS software tree, and we only install packages on them as requested. Contact us if you need a version listed as not available on one of the clusters.

In general, you need to prepare your Unix environment to be able to use this software. To do this, either:

  • tap TAPFOO
OR
  • module load MODFOO

where TAPFOO and MODFOO are one of the tags in the tap and module columns above, respectively. The tap command will print a short usage text (use -q to suppress this; this is needed in startup dot files); you can get a similar text with module help MODFOO. See the documentation on the tap and module commands for more information.

For packages which are libraries that other codes get built against, see the section on compiling codes for more help.

Tap/module commands listed with a version of current will set up for what we consider the most current stable and tested version of the package installed on the system. The exact version is subject to change with little if any notice, and might be platform dependent. Versions labelled new represent a newer version of the package which is still being tested by users; if stability is not a primary concern, you are encouraged to use it. Those with versions listed as old set up for an older version of the package; you should only use this if the newer versions are causing issues. Old versions may be dropped after a while. Again, the exact versions are subject to change with little if any notice.

In general, you can abbreviate the module tags. If no version is given, the default current version is used. For packages with compiler/MPI/etc. dependencies, if a compiler module or MPI library was previously loaded, the module command will try to load the build of the package matching that compiler/MPI combination. If you specify the compiler/MPI dependency, it will attempt to load the corresponding compiler/MPI library for you if needed.
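
For example, to set up the environment for one of the versions listed in the table above (the exact tags available may change over time), you could use something like:

# Load the default (current) version of PyTorch
module load pytorch

# Or request a specific version explicitly
module load pytorch/1.4.0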

The PyTorch package is NOT natively installed on the Deepthought2 cluster for various technical reasons. What is provided instead are Singularity containers which have versions of both python2 and python3 installed with support for PyTorch and related python packages.

To use the PyTorch python package, you must load the appropriate environmental module (e.g. module load pytorch) and then launch the python interpreter inside the Singularity container. Note: you cannot access the torch/pytorch python packages within the native python installations (e.g. module load python); you must use the python installation in the container for PyTorch.

To assist with this, the following wrapper scripts have been provided:

  • pytorch: Will launch the python2 interpreter within the container, with support for the torch/pytorch package as well as various other packages. Any arguments given will be passed to the python interpreter, so you can do something like pytorch myscript.py.
  • pytorch-python2: This is the same as pytorch; it is provided for completeness and symmetry.
  • pytorch-python3: This is like pytorch, except that a python3 interpreter with support for the torch/pytorch package will be invoked.

Please note in all cases, the name of the module to import is torch, not pytorch.
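
For example, a minimal test script (the file name myscript.py here is just illustrative) run through one of the wrappers might look like:

# myscript.py -- run with:  pytorch-python3 myscript.py
import torch                      # the package is imported as "torch", not "pytorch"

print(torch.__version__)          # version of PyTorch inside the container
x = torch.rand(3, 3)              # create a small random tensor
print(x.sum())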

In all cases, any arguments given to the wrapper scripts are passed directly to the python interpreter running within the container. E.g., you can provide the name of a python script, and that script will be run by the python interpreter inside the container. Your home and lustre directories are accessible from within the container, so you can read and write files in those directories as usual.

Note that if you load the pytorch environmental module (e.g. module load pytorch) and then issue the python command, you will start up a natively installed python interpreter which does NOT have the pytorch/torch python package installed. You need to start one of the python interpreters inside the container to get these packages --- you can either do that using the correct singularity command, or use the friendlier wrapper scripts described above.

For most users, the "containerization" of this package should not cause any real issues, and hopefully will not even be noticed. However, there are some limitations to the use of containers:

  1. In general, you will not have access to natively installed software, just the software included in the container. So even if some package foo is installed natively on Deepthought2, it is likely not accessible from within the container (unless there is a version of it also installed inside the container).
  2. You will likely not be able to use the python virtualenv scripts to install new python packages for use within the container, as the virtualenv command installs packages natively, and they would not then be available inside the container.

However, you are permitted to create your own Singularity containers and to use them on the Deepthought2 cluster. You will need to have root access on some system (e.g. your workstation or desktop) with Singularity installed to build your own containers (we cannot provide you root access on the Deepthought2 login or compute nodes). You can also copy system provided containers and edit them. More details can be found under the software page for Singularity.

GPU support

The PyTorch package can make use of GPUs on nodes that have them. There is nothing special that needs to be done in the module load or the various pytorch* commands, but you will need to instruct the package to use the GPUs within your python code. This is typically done by replacing a line like

device = torch.device("cpu")
with something like
device = torch.device("cuda:0")

Distributed PyTorch

Although the Singularity containers with pytorch do not have MPI support, pytorch has its own distributed package (torch.distributed) which can handle parallelizing your computations across multiple nodes. More information on using torch.distributed in your Python codes can be found at the PyTorch Distributed Tutorial and the Distributed Communication Documentation.

We recommend using the TCP based initialization, using something like the example script below:

#!/bin/bash
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

#Define module command, etc
. ~/.profile
#Load the pytorch module
module load pytorch/0.4.1

#Number of processes per node to launch (20 for CPU, 2 for GPU)
NPROC_PER_NODE=20

#This command to run your pytorch script
#You will want to replace this
COMMAND="YOUR_TRAINING_SCRIPT.py --arg1 --arg2 ..."

#We want names of master and slave nodes
MASTER=`/bin/hostname -s`
SLAVES=`scontrol show hostnames $SLURM_JOB_NODELIST | grep -v $MASTER`
#Make sure this node (MASTER) comes first
HOSTLIST="$MASTER $SLAVES"

#Get a random unused port on this host (MASTER) between 2000 and 9999
#The first pipeline lists the local ports currently in use
#comm then removes those from the 2000-9999 range
#shuf -n 1 picks a single random port from what remains
MPORT=$(comm -23 <(seq 2000 9999 | sort) \
	<(ss -tan | awk '{print $4}' | cut -d':' -f2 | sort -u) \
	| shuf -n 1)



#Launch the pytorch processes, first on master (first in $HOSTLIST) then
#on the slaves
RANK=0
for node in $HOSTLIST; do
	ssh -q $node \
		pytorch -m torch.distributed.launch \
		--nproc_per_node=$NPROC_PER_NODE \
		--nnodes=$SLURM_JOB_NUM_NODES \
		--node_rank=$RANK \
		--master_addr="$MASTER" --master_port="$MPORT" \
		$COMMAND &
	RANK=$((RANK+1))
done
wait

The python code should have a structure looking something like:

import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--distributed', action='store_true', help='enables distributed processing')
parser.add_argument('--local_rank', default=0, type=int, help='local rank of this process (set by torch.distributed.launch)')
parser.add_argument('--dist_backend', default='gloo', type=str, help='distributed backend')

def main():
    opt = parser.parse_args()
    if opt.distributed:
        dist.init_process_group(backend=opt.dist_backend, init_method='env://')
        print("Initialized Rank:", dist.get_rank())

if __name__ == '__main__':
    main()
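
Once init_process_group() has returned, each process typically wraps its model in DistributedDataParallel and uses a DistributedSampler so that every rank trains on a different shard of the data. The following is only a rough sketch under those assumptions: the toy model, dataset, and hyperparameters are made up for illustration, and it assumes a reasonably recent PyTorch (such as 1.4.0) so that DistributedDataParallel works with the CPU/gloo backend.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train():
    # Toy model and dataset, only to make the sketch self-contained
    model = torch.nn.Linear(10, 1)
    dataset = TensorDataset(torch.rand(1000, 10), torch.rand(1000, 1))

    # Replicate the model on every rank; gradients are averaged across ranks
    ddp_model = DistributedDataParallel(model)

    # Each rank gets its own shard of the dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)        # reshuffle differently every epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()
            optimizer.step()
        if dist.get_rank() == 0:
            print("epoch", epoch, "loss", loss.item())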