...

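# Put the user library first on each worker's search path
clusterEvalQ(cl, { .libPaths( c('/home/ubuntu/R/library', .libPaths()) ) })
clusterEvalQ(cl, {
    # Non-interactive workers cannot prompt for a mirror, so name one explicitly
    install.packages("someUsefulPackage", repos="http://cran.fhcrc.org/")
    # library() stops with an error if loading fails; require() only returns FALSE
    library(someUsefulPackage)
})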
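
Re-running the install on every worker each time is slow, so it can help to install only when the package is missing. The following is a minimal sketch, not the only way to do it: `ensure_package` is a made-up helper name, and the default mirror is just the one used elsewhere on this page.

```r
# Install a package only if it is missing, then load it.
# `ensure_package` and its arguments are illustrative names.
ensure_package <- function(pkg, repos = "http://cran.fhcrc.org/") {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg, repos = repos)
    }
    # character.only = TRUE lets us pass the package name as a string
    library(pkg, character.only = TRUE)
    invisible(TRUE)
}

# "stats" ships with R, so this call skips the install step.
ok <- ensure_package("stats")
```

To use it on the cluster, export the helper first with clusterExport(cl, "ensure_package") and then call clusterEvalQ(cl, ensure_package("someUsefulPackage")).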

Loading Sage packages

clusterEvalQ(cl, {
    options(repos=structure(c(CRAN="http://cran.fhcrc.org/")))
    source('http://depot.sagebase.org/CRAN.R')
    pkgInstall("synapseClient")
    pkgInstall("predictiveModeling")
    
    library(synapseClient)
    library(predictiveModeling)
})

...

Loading Synapse entities

...

Logging in.

clusterEvalQ(cl, { synapseLogin('joe.user@mydomain.com','secret') })

Asking many worker nodes to request Synapse entities at once is a fun and easy way to mount a distributed denial of service attack on the repository service. The service deals with this by timing out requests, which means some workers will succeed while others fail. A couple of tricks will help smooth over these problems. First, we'll

  1. check if our target data already exists. That way, we can retry in the event of partial failure without redoing work and unnecessarily thrashing Synapse.

...

  2. throw in a few random seconds of rest for our workers. This spreads out the load on Synapse.

clusterEvalQ(cl, {
    if (!exists('expr')) {
        Sys.sleep(runif(1,0,5))
        expr_entity <- loadEntity('syn269056')
        expr <- expr_entity$objects$eSet_expr
    }
})
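
The block above still needs to be re-run by hand when some workers time out. One possible extension is to have each worker retry on its own, resting a random interval before every attempt. This is only a sketch: `with_retry`, `max_tries` and `max_wait` are invented names, and `flaky_fetch` is a stand-in for the real loadEntity call so the example runs without Synapse.

```r
# Retry a flaky call, pausing a random interval before each attempt.
# `fetch` is any zero-argument function; the parameter names are illustrative.
with_retry <- function(fetch, max_tries = 5, max_wait = 5) {
    for (attempt in seq_len(max_tries)) {
        Sys.sleep(runif(1, 0, max_wait))   # random rest spreads out the load
        result <- tryCatch(fetch(), error = function(e) e)
        if (!inherits(result, "error")) return(result)
        message("attempt ", attempt, " failed: ", conditionMessage(result))
    }
    stop("giving up after ", max_tries, " attempts")
}

# Stand-in for loadEntity(): times out twice, then succeeds.
calls <- 0
flaky_fetch <- function() {
    calls <<- calls + 1
    if (calls < 3) stop("request timed out")
    "entity"
}

expr_entity <- with_retry(flaky_fetch, max_wait = 0.1)
```

On the workers, the same helper could wrap the real request, for example with_retry(function() loadEntity('syn269056')), after exporting it with clusterExport(cl, "with_retry").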

Accessing source code repos on worker nodes
-------------------------------------------

Getting code onto the worker nodes can be done like so:

clusterEvalQ(cl, {
    system('svn export  --no-auth-cache --non-interactive --username joe.user --password supeRsecRet77 https://sagebionetworks.jira.com/svn/COMPBIO/trunk/users/juser/fantasticAnalysis.R')
})

<<github example>>
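
For a git-hosted repository, something along these lines might work. It is a sketch under assumptions: the repository URL is hypothetical, `run_or_stop` is a made-up helper, and git is assumed to be installed on the workers. The helper checks the exit status that system() returns, so a failed checkout is reported instead of passing silently.

```r
# Run a shell command and stop if it exits with a non-zero status.
# The command itself is left out of the error message in case it holds credentials.
run_or_stop <- function(cmd) {
    status <- system(cmd)
    if (status != 0) stop("command failed with exit status ", status)
    invisible(status)
}

# Hypothetical repository URL, for illustration only.
repo_url <- "https://github.com/joe-user/fantasticAnalysis.git"
clone_cmd <- paste("git clone --depth 1", shQuote(repo_url))

# On the cluster (not run here):
# clusterExport(cl, c("run_or_stop", "clone_cmd"))
# clusterEvalQ(cl, run_or_stop(clone_cmd))
```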


Return values
-------------
Return values from distributed computations have to come across a socket connection, so be careful what you return. Status values such as dim(result) can confirm that a computation succeeded and are often better than returning a whole result.

clusterEvalQ(cl, {
  result <- produceGiantResultMatrix(foo, bar, bat)
  dim(result)
})
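
To see the idea in isolation, here is a small sketch that runs on a single machine. It assumes the parallel package and a local two-worker cluster as a stand-in for the real one, and a random matrix stands in for the giant result. Only a short summary crosses the socket; the matrix itself stays in each worker's workspace, where a later call can still use it.

```r
library(parallel)

cl <- makeCluster(2)   # local stand-in for the real cluster

# Build a large object on each worker but send back only a summary.
status <- clusterEvalQ(cl, {
    result <- matrix(rnorm(1e5), nrow = 1000)
    list(dim = dim(result), bytes = as.numeric(object.size(result)))
})

# `result` is still defined on the workers, so later calls can reuse it.
still_there <- clusterEvalQ(cl, exists("result"))

stopCluster(cl)
```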

Also, consider putting intermediate values in Synapse, which might serve as a means of checkpointing lengthy computations.

<<synapse example>>
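
The checkpointing pattern itself can be sketched without Synapse. The example below uses a local file written with saveRDS as a stand-in for the stored entity; `checkpointed` and `step1` are made-up names, and the upload to Synapse is deliberately left out because the client calls for storing data are not shown on this page.

```r
# Return a saved value if the checkpoint exists; otherwise compute and save it.
# A local .rds file stands in for an entity stored in Synapse.
checkpointed <- function(path, compute) {
    if (file.exists(path)) return(readRDS(path))
    value <- compute()
    saveRDS(value, path)
    value
}

ckpt <- tempfile(fileext = ".rds")

# Stand-in for a lengthy computation; counts how often it actually runs.
runs <- 0
step1 <- function() {
    runs <<- runs + 1
    sum(1:10)
}

first  <- checkpointed(ckpt, step1)   # computes and saves
second <- checkpointed(ckpt, step1)   # reads the saved value back
```

After a partial failure, re-running the same call picks up the saved value instead of repeating the work.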


Stopping a cluster
------------------

stopCluster(cl)

Don't forget to delete the stack in the AWS administration console to avoid continuing charges.
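
When the cluster is created inside a script rather than interactively, it is easy to leave workers running after an error. One option, sketched here with the parallel package and a local two-worker cluster as assumptions, is to register stopCluster with on.exit so it runs however the function ends. `run_on_cluster` is a made-up helper name. This only stops the R workers; the AWS stack still has to be deleted separately.

```r
library(parallel)

# Create a cluster, run `work(cl)`, and always stop the cluster afterwards.
run_on_cluster <- function(n_workers, work) {
    cl <- makeCluster(n_workers)
    on.exit(stopCluster(cl), add = TRUE)   # runs on normal exit and on error
    work(cl)
}

answers <- run_on_cluster(2, function(cl) clusterEvalQ(cl, 1 + 1))
```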



To do
-----

* Spot instances? Is this worthwhile for interactive use?
* Create our own CloudFormation template
* Run a user-specified script on start-up