...
The BioConductor group has put together a Cloud Formation stack for doing interactive parallel computing in R on Amazon AWS. Follow those instructions, selecting the number of workers and size of the EC2 instances. Once the stack comes up, which took about 10 minutes for me, you log into RStudio on the head node. You'll start R processes on the worker nodes and send commands to the workers.
...
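The snippets below assume a cluster object cl connected to the workers, typically built from the worker hostnames. If you need to set that up yourself, a minimal sketch using the parallel package looks like this (the hostnames here are hypothetical placeholders):
Code Block
# a minimal sketch, assuming the parallel package; 'hosts' should hold the
# actual worker hostnames or IP addresses reported by the stack
library(parallel)
hosts <- c("10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14")  # placeholders
cl <- makePSOCKcluster(hosts)
# ...do parallel work...
# stopCluster(cl) when finished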
Code Block
# try something simple
ans <- unlist(clusterEvalQ(cl, { mean(rnorm(1000)) }), use.names=FALSE)
# test a time-consuming job
system.time(ans <- clusterEvalQ(cl, { sapply(1:1000, function(i) {mean(rnorm(10000))}) }))
# do the same thing locally
system.time(ans2 <- sapply(1:(1000*length(hosts)), function(i) {mean(rnorm(10000))}))
# use load balancing parallel lapply
n <- length(cl)*1000
system.time(ans <- parLapplyLB(cl, 1:n, function(x) { mean(rnorm(10000)) }))
Head node vs. workers
Be aware of whether a command is running on the head node or on the workers. Many commands are better off running on the head node. When it's time to do something in parallel, you'll need to ship data objects to the workers, which is done with clusterExport, something like the following pattern:
...
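To make that pattern concrete, here is a minimal sketch (the file path and object name are hypothetical):
Code Block
# read or compute an object on the head node (hypothetical file)
pheno <- read.table('/home/ubuntu/data/phenotypes.txt', header=TRUE)
# copy it, by name, into the global environment of each worker
clusterExport(cl, 'pheno')
# the workers can now refer to it directly
clusterEvalQ(cl, { dim(pheno) })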
If you try to install packages on the workers and get an error to the effect that the workers "cannot install packages", you will need to modify the library path, as shown below.
Code Block
# set CRAN mirror
clusterEvalQ(cl, { options(repos=structure(c(CRAN="http://cran.fhcrc.org/"))) })
# set lib path so packages can be installed
clusterEvalQ(cl, { .libPaths( c('/home/ubuntu/R/library', .libPaths()) ) })
clusterEvalQ(cl, {
  install.packages("someUsefulPackage")
  require(someUsefulPackage)
})
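To confirm the settings took effect on every worker, it can help to echo them back; this quick check is not part of the original recipe:
Code Block
# each worker should now list /home/ubuntu/R/library first
clusterEvalQ(cl, { .libPaths() })
# and report the package as loaded
unlist(clusterEvalQ(cl, { "someUsefulPackage" %in% loadedNamespaces() }))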
Sage packages
Code Block
clusterEvalQ(cl, {
  options(repos=structure(c(CRAN="http://cran.fhcrc.org/")))
  source('http://depot.sagebase.org/CRAN.R')
  pkgInstall("synapseClient")
  pkgInstall("predictiveModeling")
...
  library(synapseClient)
  library(predictiveModeling)
})
Logging workers into Synapse:
Code Block
clusterEvalQ(cl, { synapseLogin('joe.user@mydomain.com','secret') })
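A variation (a sketch, not from the original notes): define the credentials once on the head node and ship them to the workers with clusterExport, rather than embedding them in the expression sent to every worker:
Code Block
# placeholder credentials, entered once on the head node
synapseUser <- 'joe.user@mydomain.com'
synapsePass <- 'secret'
clusterExport(cl, c('synapseUser', 'synapsePass'))
clusterEvalQ(cl, { synapseLogin(synapseUser, synapsePass) })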
Asking many worker nodes to load packages and request Synapse entities all at once is neither recommended nor scalable; in effect, it mounts a distributed denial of service attack on the repository service. The service deals with this by timing out requests, which means some workers will succeed while others fail. A couple of tricks will help smooth over these problems:
- check if our target data already exists. That way, we can re-try in the event of partial failure without re-doing work and unnecessarily thrashing Synapse.
- throw in a few random seconds of rest for our workers. This spreads out the load on Synapse.
Code Block
clusterEvalQ(cl, {
  # only fetch if we don't already have the data, so the call is safe to re-run
  if (!exists('expr')) {
    # a few random seconds of rest spreads the load on Synapse across workers
    Sys.sleep(runif(1,0,5))
    expr_entity <- loadEntity('syn269056')
    expr <- expr_entity$objects$eSet_expr
  }
})
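Because of the exists() guard, re-running the same clusterEvalQ call retries the download only on workers where it previously failed. A quick way to see which workers have the data (a check sketched here, not from the original notes):
Code Block
# TRUE for workers that hold the expression set, FALSE for those still to retry
unlist(clusterEvalQ(cl, exists('expr')))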
Attaching a shared EBS volume
It might be worth looking into attaching a shared EBS volume and adding it to R's .libPaths(). See Configuration of Cluster for Scientific Computing for an example of connecting a shared EBS volume to the nodes in StarCluster. How to do this in the context of a Cloud Formation stack has yet to be figured out.
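If such a shared volume were mounted at, say, /vol on every worker, pointing the workers' .libPaths() at it might look like the following sketch (the mount point and directory are assumptions):
Code Block
clusterEvalQ(cl, {
  # create a library directory on the shared volume and search it first,
  # so package installs and lookups go to the shared location
  dir.create('/vol/R/library', recursive=TRUE, showWarnings=FALSE)
  .libPaths( c('/vol/R/library', .libPaths()) )
})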
In general, attaching and using an EBS volume can be done like so (from the Stack Overflow question "Add EBS to Ubuntu EC2 instance"):
- Create EBS volume in the EC2 section of the AWS console.
- Attach EBS volume to `/dev/sdf` (EC2's external name for this particular device number).
- Format the file system on `/dev/xvdf` (Ubuntu's internal name for this particular device number):
Code Block
sudo mkfs.ext4 /dev/xvdf
- Mount the file system (with an update to /etc/fstab so it stays mounted on reboot):
Code Block
sudo mkdir -m 777 /vol
echo "/dev/xvdf /vol auto noatime 0 0" | sudo tee -a /etc/fstab
sudo mount /vol
To mount an existing EBS volume, attach the volume to your instance in the AWS Console, then mount it:
Code Block
sudo mkdir -m 777 /vol
sudo mount /dev/xvdf /vol
Like a physical hard drive, an EBS volume can only be attached to a single instance at a time, but it can be shared over NFS. <<How to do this?>>
Accessing source code repos on worker nodes
...