
Operating the StarCluster SGE/AWS Cluster for Roche Collaboration

The Roche Cluster is a Sun Grid Engine cluster hosted on the Amazon Elastic Compute Cloud (EC2) under a dedicated account.  The cluster is configured to run distributed jobs requiring Matlab and R.


Setup

All interactions with the cluster are through a secure channel.  You will need the associated key file, available on the Sage unix system in /work/platform/rocheCollab/rochecollab-aws-keypair.pem.  Copy this file to your computer.
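For example, assuming you have scp access to the Sage unix system (the hostname "sageunix" below is a placeholder, not the actual host), you can fetch the key and then restrict its permissions, which ssh requires:

scp sageunix:/work/platform/rocheCollab/rochecollab-aws-keypair.pem /key/folder/

chmod 600 /key/folder/rochecollab-aws-keypair.pem

where "/key/folder" is the folder on your computer in which you keep the key.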

 

How to upload code to the cluster

The cluster has a 50 GB shared file system, mounted as "/shared". The compiled Bayesian Network code is found under "/shared/code/roche/distrib/BNpack".  To update this code, execute the following from your local machine:

scp -i /key/folder/rochecollab-aws-keypair.pem /matlab/output/BNpack rochecollab.sagebase.org:/shared/code/roche/distrib

where "/key/folder" is the path to the folder where you copied the rochecollab-aws-keypair.pem file and "/matlab/output" is the path to the folder where the Matlab compiler created the new "BNpack" file.  The path "/shared/code/roche/distrib" is to be taken literally. 

To update the file "skeleton.R", copy the new file to /shared/code/roche/.  At the time of this writing, this code must be located on each node's local file system.  The modified R code is automatically distributed to each worker node when the cluster is restarted.  To restart the cluster, see "Restarting the cluster" under "Cluster Administration", below.
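For example, following the same pattern as the BNpack upload above, where "/local/path/to" is a placeholder for the folder containing your updated copy:

scp -i /key/folder/rochecollab-aws-keypair.pem /local/path/to/skeleton.R rochecollab.sagebase.org:/shared/code/roche/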

How to run jobs

To run jobs, connect to the head node and then execute the batch script:

ssh -i /key/folder/rochecollab-aws-keypair.pem sgeadmin@rochecollab.sagebase.org

/shared/code/roche/runscript1_qsub.sh

Running the script "runscript1_qsub.sh" launches a number of Sun Grid Engine jobs, automatically distributed across the cluster.  To change the parameters or the number of jobs, edit this file before running it.  For more guidance on the 'qsub' command, see: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html

Note:  One "gotcha" in specifying jobs is that parameters that are quoted (e.g. because they contain spaces) require an extra level of quoting, since qsub consumes the outer level.  So instead of "[110000 40000]", use "\"[110000 40000]\"".
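To make the quoting rule concrete, a single submission line inside a script like "runscript1_qsub.sh" might look like the following sketch.  The job script name "run_bn_job.sh" and its parameter are illustrative assumptions, not the actual contents of the file; the -o and -e options and the $JOB_ID pseudo-variable are standard qsub features, shown here matching the log layout described under "How to get output", below:

qsub -o '/shared/data/log/$JOB_ID.out' -e '/shared/data/log/$JOB_ID.err' /shared/code/roche/run_bn_job.sh "\"[110000 40000]\""

Note the escaped inner quotes around the bracketed parameter: qsub consumes the outer pair, so the value survives as a single argument when the job runs.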

Note:  To add more nodes to the cluster, see "Adding nodes" under Cluster Administration, below.

How to check the status of your jobs

Typing the Sun Grid Engine 'qstat' command displays the status of outstanding jobs.  More Sun Grid Engine commands are described here: http://star.mit.edu/cluster/docs/latest/guides/sge.html
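Two other commonly useful standard Sun Grid Engine commands:

qstat -j <jobid>

qdel <jobid>

The first shows detailed information for a single job; the second cancels a job.  Replace <jobid> with the job number reported by qsub or shown in the qstat listing.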

How to get output

At the time of this writing, the job creation script is configured to put standard output and standard error into /shared/data/log/<jobid>.out and /shared/data/log/<jobid>.err, respectively.  To get these or other output files, use the secure copy command from your local machine:

scp -i /key/folder/rochecollab-aws-keypair.pem rochecollab.sagebase.org:/shared/path/to/file /local/target/folder 

where "/shared/path/to/file" is the file to be retrieved and "/local/target/folder" is the folder on your local machine in which you wish to save the retrieved file.
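If you are unsure which log files exist, you can list the log directory first by running a remote command over the same secure channel (this reuses the head node login from "How to run jobs", above):

ssh -i /key/folder/rochecollab-aws-keypair.pem sgeadmin@rochecollab.sagebase.org ls /shared/data/log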

Note:  We encourage you to write your jobs such that they push results to Synapse.

 

Cluster Administration

To execute administrative commands, log in to the dedicated administrative node:

ssh -i /key/folder/rochecollab-aws-keypair.pem ubuntu@rochecollabadmin.sagebase.org

Restarting the cluster

starcluster restart rochecluster

see also http://star.mit.edu/cluster/docs/latest/manual/launch.html#rebooting-a-cluster

Adding nodes

starcluster addnode rochecluster -b 0.10

The "-b" parameter requests inexpensive spot instances; its value ("0.10" in this example) is the maximum price to pay for the node, in dollars per hour.
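StarCluster's addnode command can also request several nodes at once via its -n option; for example, the following asks for three additional spot nodes at the same bid (verify the option against your installed StarCluster version):

starcluster addnode -n 3 rochecluster -b 0.10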

see also http://star.mit.edu/cluster/docs/latest/manual/addremovenode.html

Removing nodes

starcluster removenode rochecluster node001

(replacing 'node001' with the name of the node you wish to stop).  See also http://star.mit.edu/cluster/docs/latest/manual/addremovenode.html
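If you do not know the node names, list the cluster first; the standard StarCluster listclusters command prints each running cluster along with its nodes:

starcluster listclusters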

Automatic load balancing

To allow StarCluster to automatically add and remove nodes in response to the job queue, run the "starcluster loadbalance" command.  To keep the load balancer running after logging out, use the unix 'nohup' command, e.g.

nohup starcluster loadbalance rochecluster &

If the cluster was created using AWS "spot instances", the load balancer will request spot instances as well.  From the StarCluster email list: "The loadbalancer uses ... the same price that you set for bids when the cluster was created."
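When launched with nohup as above, the load balancer's output accumulates in nohup.out in the directory where it was started (standard nohup behavior), and the process keeps running until stopped explicitly, e.g.:

tail -f nohup.out

pkill -f "starcluster loadbalance"

The first command watches the load balancer's activity; the second stops it (pkill -f matches against the full command line, so check with ps first if other starcluster processes may be running).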

Resizing volumes

Please see the instructions here: http://star.mit.edu/cluster/docs/latest/manual/volumes.html#managing-ebs-volumes
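The linked page documents StarCluster's resizevolume command.  As a rough sketch only (the volume id below is a placeholder and the argument order is an assumption; confirm both against the linked manual for your version), growing a volume to 100 GB might look like:

starcluster resizevolume vol-99999999 100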