R
Summary and Version Information
| Package | R |
|---|---|
| Description | R statistical analysis package |
| Categories | Numerical Analysis, Research |
| Version | Module tag | Availability* | GPU Ready | Notes |
|---|---|---|---|---|
| 3.0.3 | R/3.0.3 | Non-HPC Glue systems, Deepthought2 HPCC (RedHat6) | N | |
| 3.1.2 | R/3.1.2 | Non-HPC Glue systems, Deepthought2 HPCC (RedHat6) | N | DEPRECATED: built with gcc/4.6.1 and openmpi/1.6.5 (Rmpi) |
| 3.2.2 | R/3.2.2 | Non-HPC Glue systems, Deepthought2 HPCC (RedHat6) | N | DEPRECATED: built with gcc/4.9.3 and openmpi/1.8.6 (Rmpi) |
| 3.3.2 | R/3.3.2 | Non-HPC Glue systems, Deepthought2 HPCC (RedHat6) | N | built with gcc/4.9.3 and openmpi/1.8.6 (Rmpi) |
| 3.5.1 | R/3.5.1 | Non-HPC Glue systems, Deepthought2 HPCC (64bit-Linux) | N | built with gcc/4.9.3 and openmpi/1.8.6 (Rmpi) |
| 4.0.1 | R/4.0.1 | Non-HPC Glue systems (all OSes) | N | |
Notes:
*: A package labelled as "available" on an HPC cluster can be used
on the compute nodes of that cluster. Software not listed as available
on an HPC cluster is generally still available on the login nodes of the
cluster (assuming it is available for the appropriate OS version, e.g.
RedHat Linux 6 for the two Deepthought clusters). This is because the
compute nodes do not use AFS and instead have local copies of the AFS
software tree, onto which we only install packages as requested.
Contact us if you need a version listed as not available on one of the
clusters.
In general, you need to prepare your Unix environment to be able to use this software. To do this, run either:
tap TAPFOO
or
module load MODFOO
where TAPFOO and MODFOO are one of the tags in the tap
and module columns above, respectively. The tap command will
print a short usage text (use -q to suppress this; that is needed
in startup dot files); you can get a similar text with
module help MODFOO
. For more information, see the section on the tap and module commands.
For packages that are libraries against which other codes are built, see the section on compiling codes for more help.
Tap/module commands listed with a version of "current" will set up what we consider the most current stable and tested version of the package installed on the system. The exact version is subject to change with little if any notice, and might be platform dependent. Versions labelled "new" represent a newer version of the package which is still being tested by users; if stability is not a primary concern, you are encouraged to use it. Those with versions listed as "old" set up an older version of the package; you should only use this if the newer versions are causing issues. Old versions may be dropped after a while. Again, the exact versions are subject to change with little if any notice.
In general, you can abbreviate the module tags. If no version is given, the default ("current") version is used. For packages with compiler/MPI/etc. dependencies, if a compiler module or MPI library was previously loaded, module load will try to load the build of the package matching those dependencies. Conversely, if you specify a build with a particular compiler/MPI dependency, it will attempt to load the needed compiler/MPI modules for you.
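For example (illustrative invocations, using version numbers from the table above):
module load R          # loads the default ("current") version of R
module load R/3.5.1    # loads a specific version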
Installing Modules
R's capabilities can be significantly enhanced through the addition of
modules (packages). Code can then load an installed package with the library
command.
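For example, a script loads an installed package with library; shown here with the parallel package that is bundled with R:
library(parallel)   # load an installed package into the session
detectCores()       # one of the functions the package provides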
The supported R interpreters on the system have a
selection of modules
preinstalled. If a module you are interested in is not in that
list, you can either install a personal copy of the module for yourself,
or request that it be installed system wide. We will make reasonable efforts
to accommodate such requests as staffing resources allow.
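To check from within R whether a package is already installed (the package name foo here is a placeholder):
"foo" %in% rownames(installed.packages())   # TRUE if the package is installed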
Installing modules yourself
The method for installing R packages is usually fairly straightforward, although not all packages install in the same manner. Most, however, follow the procedure below:
- Run module load R/X.Y.Z to select the version of R you wish to use.
- Create the directory to hold your R modules, if you have not already done so. The default is the directory R underneath your home directory, but you might wish to put it elsewhere; subdirectories for the R version and platform will be added beneath it.
- Unless you opted for the default directory ~/R, you need to tell R what directory you are using. To do this, you must set the environmental variable R_LIBS_USER. Multiple directories can be listed; separate the paths with the colon (:) character. This needs to be set whenever you wish to use the modules in R, so you will generally want to set it in your .cshrc.mine or .Renviron files.
- There are two standard methods for installing a package, one from the command line, and one from inside R itself. Assuming you are putting stuff in ~/myRpkgs and installing the package foo, the commands would be:
  - From the command line, you will first need to download a tarball with the source code for the package. Many packages can be found at the Comprehensive R Archive Network (CRAN). Assuming you downloaded foo.tar.gz to the current directory, you could then install it with:
    R CMD INSTALL -l ~/myRpkgs foo.tar.gz
  - From within R, the install.packages function will connect to CRAN and download and install the package all in one step, with:
    install.packages("foo", lib="~/myRpkgs", repos="http://cran.r-project.org")
If all goes well, the package is now installed in the directory you specified and should be available for use by your R scripts.
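As a quick sanity check (using the hypothetical ~/myRpkgs and foo from above), make sure R_LIBS_USER is set, e.g. by a line like this in ~/.Renviron:
R_LIBS_USER=~/myRpkgs
and then confirm within R:
.libPaths()    # ~/myRpkgs should appear in the library search path
library(foo)   # the newly installed package should now load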
Of course, not all packages install quite that easily. If you are comfortable building modules, hopefully the error messages will provide reasonable guidance as to how to proceed. Otherwise, you can request that Division of Information Technology staff install it for you, but that might take time depending on staff availability.
Running R in batch mode
Although R's interactive mode is nice for certain things, when you are doing production runs with tried and true scripts, it is usually easier to use R's batch interface. This is especially useful when submitting jobs to an HPC cluster.
If you have some R code in a file test.R
and you wish to
run it from the command line (or equivalently, from a shell script), you
can simply use the Rscript
command. E.g.
Rscript --no-save --no-restore test.R
The --no-save
and --no-restore
prevent the
saving of the workspace at the end of the session and the restoring of saved
objects at startup. These are typically what you want when running in
batch mode. Older versions of R used the R CMD BATCH command
instead of Rscript; the main difference is that the former
optionally takes the name of an output file. Both should work
with currently installed versions of R.
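For example, the R CMD BATCH equivalent of the Rscript invocation above, with all output sent to a file (the test.Rout name is illustrative):
R CMD BATCH --no-save --no-restore test.R test.Rout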
For use on one of the HPC clusters, you will generally need to include the above in a job script, like:
#!/bin/bash
#Request 5 hours
#SBATCH -t 5:00:00
#Request 4 GiB per CPU-core
#SBATCH --mem-per-cpu=4096
#Request 1 core
#SBATCH -n 1
#Get our profile (and define module command)
. ~/.profile
#Load required modules
module load R/3.3.2
cd MY_WORK_DIRECTORY
#Make sure OpenMP is not "on"
OMP_NUM_THREADS=1
export OMP_NUM_THREADS
Rscript --no-save --no-restore my_R_code.R
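Submit the script with sbatch, assuming it is saved as, say, my_R_job.sh (a hypothetical name):
sbatch my_R_job.sh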
Using R and MPI
Users of one of the high-performance computing (HPC) clusters will likely be interested in running R codes that span multiple processors, often over multiple nodes. This is generally done using MPI. There are a number of R packages that deal with MPI, including
- Rmpi
- snow
- doSNOW: provides %dopar% (foreach) functionality via snow
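As a quick illustration of the doSNOW route (a sketch, assuming the foreach and doSNOW packages are installed; the squaring task is purely illustrative):
library(doSNOW)
library(foreach)
# A small single-node socket cluster, handy for testing without MPI;
# on the HPC clusters you would use type="MPI" as described below.
cl <- makeCluster(4, type="SOCK")
registerDoSNOW(cl)
results <- foreach(i = 1:4, .combine = c) %dopar% i^2
print(results)
stopCluster(cl)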
Most users seem to prefer the snow package, which is
higher level and therefore generally easier to use than Rmpi. There are
assorted guides to using R with the snow package on the web,
including:
- Glenn Lockwood's page on R and HPC clusters
- University of Chicago's R page
- Harvard's Rmpi page
- Bioinfomagician's page on Rmpi
- Simon Fraser University's snow page
Below are a few tips, gleaned from these pages and elsewhere, that users at UMD might find helpful.
- For best results, use the same version of compiler and MPI as used for building R and its MPI packages. The MPI libraries and compiler used for the different versions of R are listed in the version table at the top of this page. It is best to module load the compiler first (not needed for gcc/4.6.1) and then the OpenMPI library.
- We have also had reports of weird errors occurring when using Rmpi (and the packages depending on it) with Infiniband; segfaults and other seemingly random errors when setting up connections. This appears to be related to complications with the use of pinned memory and forking within the R interpreter (see e.g. the CRMDA blog and the OpenMPI developers mailing list archives regarding this issue). As such, we strongly recommend that R users who wish to use MPI disable Infiniband in their mpirun command by adding the arguments --mca btl tcp,self, as shown in the example below.
as shown in the example below. - When using snow or one of its derivatives (e.g. doSNOW), you should launch
your code with something like
#!/bin/bash #Request 5 hours #SBATCH -t 5:00 #Request 4 GiB per CPU-core #SBATCH --mem-per-cpu=4096 #Request 40 cores #SBATCH -n 40 #Get our profile (and define module command) . ~/.profile #Load required modules module load gcc/4.9.3 module load openmpi/1.8.6 module load R/3.3.2 cd MY_WORK_DIRECTORY #Make sure OpenMP is not "on" OMP_NUM_THREADS=1 export OMP_NUM_THREADS #NOTE THE -np 1 below!!!! #The --mca btl tcp,self arguments restricts communications to #tcp instead of infiniband. We have seen issues with Rmpi and infiniband mpirun -np 1 --mca btl tcp,self R CMD BATCH --no-save --no-restore my_R_code.R
NOTE the use of -np 1 in the above. Although that looks suspicious (telling mpirun to start only one MPI task when we asked for 40 cores), it is actually correct for most uses of the snow (and derivative) libraries. This is because when using snow, snow will typically spawn its own workers. If you request more than 1 MPI task to be launched via openmpi, or omit the -np 1 altogether (which effectively asks mpirun to launch the number of tasks given in the #SBATCH -n line, 40 in this case), you will end up running e.g. 40 copies of the same code, each of which will try to spawn about 40 workers via snow, resulting in a mess (at best very sluggish performance, and more likely weird errors).
- Most snow based R code will at some point invoke the makeCluster function. This takes a parameter specifying the size of the "cluster" to create. Typically, you want this size to be one less than the number of cores requested from Slurm, because the process running the R code which spawns the workers is already consuming one CPU core; if you try to spawn a number of workers equal to the number of cores requested of Slurm, one core will be oversubscribed, which causes issues. Typically one sees an error about there being an insufficient number of "slots" available, and the R script just hangs (doing nothing, but not dying until the job is killed for exceeding its walltime, thereby wasting a lot of SUs). It is usually better to do something like:
cl <- makeCluster(mpi.universe.size() - 1, type="MPI")
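Putting these pieces together, a minimal snow script might look like the following sketch (the squaring task is purely illustrative; this would play the role of my_R_code.R in the job script above):
library(Rmpi)   # provides mpi.universe.size() and the MPI runtime
library(snow)
# One fewer worker than the cores requested from Slurm; the master
# process running this script already occupies one core.
cl <- makeCluster(mpi.universe.size() - 1, type="MPI")
# Distribute a trivial task (squaring numbers) across the workers.
results <- clusterApply(cl, 1:8, function(x) x^2)
print(unlist(results))
# Shut the workers and MPI down cleanly.
stopCluster(cl)
mpi.quit()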