Skip to end of banner
Go to start of banner

Interactive distributed computing in R

Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 21 Next »

Using the parallel package

Starting a cluster with Bioconductor and Cloud Formation

The BioConductor group has put together a Cloud Formation stack for doing interactive parallel computing in R on Amazon AWS. Follow those instructions, selecting the number of workers and size of the EC2 instances. Once the stack comes up, which took about 10 minutes for me, you log into RStudio on the head node. You'll start R processes on the worker nodes and send commands to the workers.

stack name: StartBioCParallelClusterWithSSH
template url: https://s3.amazonaws.com/bioc-cloudformation-templates/parallel_cluster_ssh.json

After starting the cloud formation script, you'll get a URL for a head-node running R-Studio. Click on that and continue...

Connecting to workers

The IP addresses of the workers (and the head node) get stored on the head node in a file. We'll read that file and create an R process for each core on each worker.

# grab host IPs
lines <- readLines("/usr/local/Rmpi/hostfile.plain")

# we'll want to start a worker for each core on each
# machine in the cluster
hosts <- do.call(c, lapply(strsplit(lines, " "), function(host) { rep(host[1], as.integer(host[2])) }))

library(parallel)
help(package=parallel)

cl <- makePSOCKcluster(hosts)

Note that the parallel package is perfectly happy starting up several copies of R on a single machine, which can be helpful for testing.

Simple tests

Try a few simple tests to make sure we're able to evaluate code on the workers and that it buys us some speed.

# try something simple
ans <- unlist(clusterEvalQ(cl, { mean(rnorm(1000)) }), use.names=F)

# test a time-consuming job
system.time(ans <- clusterEvalQ(cl, { sapply(1:1000, function(i) {mean(rnorm(10000))}) }))

# do the same thing locally
system.time(ans2 <- sapply(1:(1000*length(hosts)), function(i) {mean(rnorm(10000))}))

Head node vs. workers

Be aware of when you're running commands on the head node and when commands are running on the workers. Many commands will be better off running on the head node. When it's time to do something in parallel, you'll need to ship data objects to the workers, which is done with clusterExport, something like the following pattern:

myBigData <- computeBigDataMatrix(fizz, buzz)
moreData  <- constructPhatDataFrame(x, y, z)

clusterExport(cl, c('myBigData', 'moreData'))

results <- clusterEvalQ(cl, { computeOn(myBigData, moreData) })

Loading packages

The BioC virtual machine image comes with tons of good stuff already installed, but inevitably, you'll need to install something else.

It might be necessary to modify the library path. If you try to install packages on the workers and get an error to the effect that the workers "cannot install packages", you need to do this.

# set lib path to install packages

clusterEvalQ(cl, { .libPaths( c('/home/ubuntu/R/library', .libPaths()) ) })
clusterEvalQ(cl, {
    install.packages("someUsefulPackage")
    require(someUsefulPackage)
})

Attaching a shared EBS volume

Asking many worker nodes to load packages and request Synapse entities isn't a recommended or scalable approach.

Instead, see Configuration of Cluster for Scientific Computing for an example of connecting a shared EBS volume to the nodes. How to do this in the context of a cloud formation stack is something yet to be figured out.

Accessing source code repos on worker nodes

Getting code onto the worker nodes can be done like so:

clusterEvalQ(cl, {
    system('svn export  --no-auth-cache --non-interactive --username joe.user --password supeRsecRet77 https://sagebionetworks.jira.com/svn/COMPBIO/trunk/users/juser/fantasticAnalysis.R')
})

<<github example>>

Return values

Return values from distributed computations have to come across a socket connection, so be careful what you return. Status values such as dim(result) can confirm that a computation succeeded and are often better than returning a whole result

results <- clusterEvalQ(cl, {
    result <- produceGiantResultMatrix(foo, bar, bat)
    dim(result)
})

Also, consider putting intermediate values in synapse, which might serve as a means of checkpointing lengthy computations.

<<synapse example>>

Stopping a cluster

stopCluster(cl)

Don't forget to delete the stack in the AWS administration console to avoid continuing charges.

Predictive Modeling

Chris Bare (Unlicensed) parallelized some of the Predictive Modeling and Pathway Analysis Pipelines demos.

To do

These items will be added to this document, if and when I figure them out.

  • Spot instances? Is this worthwhile for interactive use?
  • Create our own Cloud Formation template
  • Attach a shared EBS volume
  • Run a user-specified script on start-up

 

  • No labels