Configuration of Cluster for Scientific Computing

Overview

Our approach to setting up a cluster is illustrated in the diagram below.  From our desktop we create an Amazon Web Services (AWS) account.  We then create a small 'admin' machine from which we create and manage the cluster.  We also create a shared volume which all the machines in the cluster can use as common storage.

Set up a Project under Amazon Web Services

This will allow charges for your cluster usage to be itemized.

Create an Email Alias

Go to
    https://groups.google.com/a/sagebase.org/
and create a Google group under sagebase.org for the project, e.g. myproject@sagebase.org.  (You could use your own email address, but this approach lets you add others to your project and lets you be involved in multiple projects.)  If you don't have permission to create a group yourself, contact the Synapse Team to have this done for you.

Make sure to allow external users to send email to the group.  The series of menu choices to do this may change, but as of the time of this writing, using the 'new google groups' web interface, the way to do this is:

  1. Go to https://groups.google.com/a/sagebase.org/
  2. Click on "My Groups"
  3. Click on your group
  4. Click on "Manage" (upper right)
  5. Click on "Permissions" (lower left) to see its sub-menu.
  6. Click on "Posting Permissions" in the sub-menu.  A number of options will appear.
  7. In the pull-down menu to the right of "Post", make sure "Anyone" is checked.  This will allow Amazon Web Services to send messages to the alias.


Create an AWS Account

Go to console.aws.amazon.com and create a new account using the Google group (created above) as the email address.  You will have to give a credit card number for billing, but after setting up 'consolidated billing', below, no charges will go to your card.
Send an email to mike.kellen@sagebase.org, requesting that your new AWS account be added to the Sage consolidated bill.

Create a Service Account under AWS

The AWS account supports multiple individual users.  When executing administrative commands on the cluster (creating, configuring, etc.) we must act as a user within the AWS account.  Therefore we create a "service account" for this purpose:

In the web console go to Services > IAM (look for an icon of a green key)

Click "Create individual IAM users"

Click "Manage Users" and then "Create New Users".  This will open a dialog.

Enter a user name, e.g. "ClusterAdminSvc" and click "Create".  Next click "Download Credentials".  For security reasons this can only be done once, so take care to ensure the file is downloaded and placed somewhere safe and secure.  Inside the file are two random strings, the user's "Access Key Id" and the "Secret Access Key" which will be used below.

Go back to Services > IAM (icon of a green key).  Click "Policies" in the left pane.  Search for AmazonEC2FullAccess, check the box to the left of the AmazonEC2FullAccess policy, then click "Policy Actions" and attach the policy to the ClusterAdminSvc account.

The user now has the permissions necessary to create and manage clusters. 

Create the 'admin' machine

Create a micro EC2 machine under your account from which you can run StarCluster commands. This will also give you a place to run the StarCluster load balancer, which adds/removes nodes in response to the job queue.

Go to Services > EC2

Click "Launch Instance".  This will start a wizard dialog.

Click "Quick Launch Wizard" (Avoid "Classic Wizard" which, at the time of this writing, has a bug.)

Now click on the top row labeled "More Amazon Machine Images".

Click continue.

In the search box at the top of the dialog, enter the AMI ID, ami-999d49f0, and click "Search".  (If changed, the latest stable AMI(s) will be listed near the bottom of this page: http://star.mit.edu/cluster/.)  For simplicity we are selecting the same node type that we will use later on in the cluster.

Instance type:  Choose micro (t1.micro).  This is the smallest, cheapest type and is sufficient for running StarCluster.

You can generally accept the defaults as you go through the wizard.

Under "instance details" there is a screen of Key/Value pairs with the first key filled in as "Name".  Fill in the value with something descriptive, like "StarCluster Administration".

"Create key pair":  Select "Create a new key pair", give it a name, e.g. my-key-pair, create it, and download the file to a safe and secure place.

Complete the wizard.  Your new machine will start up.  

Click on "Instances" (left hand side) and find the new machine.  Click on the check box to the left and scroll through the pane that appears below to find the public DNS name, something like

Public DNS: ec2-111-222-333-444.compute-1.amazonaws.com

You can use this to SSH into the machine.  Note:  If you stop and restart the machine, this name will change.

Check that SSH is enabled for this machine.  In the "EC2 Management" panel check the box to the left of your server.  In the pane that appears below you will see an entry like "Security Groups:  default. view rules", where "view rules" is a hyperlink.  Click "view rules" and check that port 22 is open for TCP.  If not:

  1. Click "Security Groups" under "Network & Security" on the left hand side.
  2. Select the group used by the machine and click the "Inbound" tab in the pane that appears below.
  3. From the top-most pull-down in the pane, select "SSH".
  4. For additional security, lock down the source subnet (e.g. the Hutch is 140.107.0.0/16).
  5. Click "Add Rule", then "Apply Rule Changes".


Install StarCluster

Download the files "scicompConfig" and "unixRsetup.py" attached to this wiki page.  Edit the "scicompConfig", adding the Amazon security credentials (Access Key ID and Secret Access Key), captured above:

AWS_ACCESS_KEY_ID = <<<fill in here>>>
AWS_SECRET_ACCESS_KEY = <<<fill in here>>>

Also change all instances of "aws-keypair" to "my-key-pair" or whatever name you used when creating the key pair in AWS.
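This substitution can also be made with a one-line sed command.  A sketch, using a tiny stand-in file (the real scicompConfig attached to this page has many more settings, and its exact parameter names may differ slightly):

```shell
# Stand-in for the real scicompConfig, just to illustrate the substitution.
printf 'KEYNAME = aws-keypair\nKEY_LOCATION = ~/my-key-pair.pem\n' > scicompConfig.sample

# Replace every occurrence of the placeholder key-pair name in place
# (sed -i.bak keeps a backup copy with a .bak suffix).
sed -i.bak 's/aws-keypair/my-key-pair/g' scicompConfig.sample
cat scicompConfig.sample   # KEYNAME = my-key-pair ...
```

Run the same sed command against the real scicompConfig file, substituting your own key-pair name.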

Ensure the key pair file (downloaded while creating the admin machine earlier) is private.

chmod 500 my-key-pair.pem

Move the files to your admin machine along with your key pair, e.g.:

scp -i my-key-pair.pem scicompConfig ubuntu@ec2-111-222-333-444.compute-1.amazonaws.com:~
scp -i my-key-pair.pem unixRsetup.py ubuntu@ec2-111-222-333-444.compute-1.amazonaws.com:~
scp -i my-key-pair.pem my-key-pair.pem ubuntu@ec2-111-222-333-444.compute-1.amazonaws.com:~

Note: The destination path used in the third line (~/my-key-pair.pem) must match the value of the KEY_LOCATION parameter in the scicompConfig file.

Now log into your admin machine to install and configure StarCluster:

ssh -i my-key-pair.pem ubuntu@ec2-111-222-333-444.compute-1.amazonaws.com
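Optionally, an entry in ~/.ssh/config saves retyping the key and hostname each time.  This is a convenience sketch; "cluster-admin" is an arbitrary alias, and the HostName shown is the example DNS name from above:

```shell
mkdir -p "$HOME/.ssh"
# NOTE: the public DNS name changes whenever the instance is stopped
# and restarted, so this entry must be updated after a restart.
cat >> "$HOME/.ssh/config" <<'EOF'
Host cluster-admin
    HostName ec2-111-222-333-444.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/my-key-pair.pem
EOF
```

With this in place, "ssh cluster-admin" is equivalent to the full command above.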

Install StarCluster:

sudo easy_install StarCluster

After installation completes, move the files you uploaded into place:

mkdir .starcluster
cp scicompConfig .starcluster/config
mkdir .starcluster/plugins
cp unixRsetup.py .starcluster/plugins/

More installation guidance can be found here: http://star.mit.edu/cluster/docs/latest/installation.html


Create shared volume

(Optional, but recommended) If the nodes need to share data or code (such as R packages), or have a common place to create output, it is useful to create a shared EBS volume. This createvolume command creates a 50GB volume in the us-east-1a availability zone.  Replace <name> with a descriptive name of your choice.

starcluster createvolume --name <name> --detach-volume 50 us-east-1a

You will see a line like "created volume vol-12345678".  This is the ID of the created volume.  You may also need to shut down the temporary EC2 instance which StarCluster starts in order to configure the volume:

starcluster terminate volumecreator

Now edit the file ~/.starcluster/config and change "vol-xxxxxxxx" to the ID of the volume.
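This too is a one-line substitution.  A sketch on a stand-in file (the real config attached to this page has more settings; vol-12345678 stands in for the ID reported by createvolume):

```shell
# Stand-in for the volume section of ~/.starcluster/config:
printf 'VOLUME_ID = vol-xxxxxxxx\n' > config.sample

# Substitute the real volume ID reported by "starcluster createvolume":
sed -i.bak 's/vol-xxxxxxxx/vol-12345678/' config.sample
cat config.sample   # VOLUME_ID = vol-12345678
```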

Start up the cluster, using spot instances

StarCluster can run on spot instances, which are much cheaper than on-demand instances. To check recent spot prices, run this command on your admin machine:

starcluster spothistory -p m1.small

To start a cluster with a bid of $0.10 per instance-hour:

starcluster start -b 0.10 -c scicompcluster myCluster


Set up billing alerts

Cluster computing on the cloud gives great flexibility, allowing us to pay for just the compute we use.  It is important that we manage the cluster size so that we don't leave large numbers of unused machines running for many days or weeks.   To help guard against accidentally leaving a large cluster on for a long time, we can create billing alerts in the Amazon account.  A description of billing alerts is here:
http://aws.amazon.com/about-aws/whats-new/2012/10/19/announcing-aws-billing-alerts-for-linked-accounts/

Here are step-by-step instructions for creating an alert trigger:

  • Log into aws.amazon.com.
  • Go to the Account Activity page.
  • Where it says "Monitor your estimated charges. Enable Now to begin ..." click Enable Now.  (It takes a few minutes for alerts to be enabled.)
  • Click "Set your first billing alert".
  • Click "Create alarm" in the dialog that appears.
    •     "These recipients:"  Put the Google email alias used for the account.
    •     "Exceed":  Enter a reasonable threshold for the monthly bill, e.g. $500.
  • Create the alert.
  • Go to https://groups.google.com/a/sagebase.org/ and "Invite" mike.kellen@sagebase.org along with anyone else who ought to be notified about high usage levels.

Configure the shared volume

Log in to the head node:

 starcluster sshmaster myCluster

Now create the directory for shared R libraries.  (This must match the path in unixRsetup.py)

mkdir -p /shared/code/R/lib
chmod -R 777 /shared 

Any R libraries installed here will be available to all nodes.
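For R to find packages installed there, /shared/code/R/lib must be on R's library search path.  The unixRsetup.py plugin is assumed to configure this on each node; as a sketch, the same effect can be had per user via the R_LIBS variable in ~/.Renviron:

```shell
# Make R search the shared library first for this user (a sketch; the
# unixRsetup.py plugin may already configure this for the whole cluster).
echo 'R_LIBS=/shared/code/R/lib' >> "$HOME/.Renviron"
tail -n 1 "$HOME/.Renviron"   # R_LIBS=/shared/code/R/lib
```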

This is a good time to do any other software installation (e.g. shared R libraries) on the shared volume.  The head node, like all the other nodes in the cluster, has internet access and can download software libraries as needed.  If you want to act as the user under which the cluster jobs run ('sgeadmin'), then specify this user when connecting:

starcluster sshmaster -u sgeadmin myCluster

Run jobs on the cluster

Distributed computing can be done at a low level, using Sun Grid Engine commands, or via higher level commands in an application programming language.  Below we give examples of both, the latter via the "R Sun Grid Engine" package.

Sun Grid Engine interface

Start jobs

The Sun Grid Engine 'qsub' command will add a job to the queue.  An example is

qsub -V -wd /shared/code -N J1 -b y -o /shared/data/log/1.out -e /shared/data/log/1.err /shared/code/myApp 'param1' 1 'param2' 2

Here a binary executable 'myApp' is run with param1=1, param2=2.  The standard output goes to the file 1.out in the specified directory and the standard error text goes to the file 1.err in the same directory.  The job is named "J1", a tag that is used when showing the status of the job queue.  The working directory for 'myApp' is set to /shared/code.

Spot instances can be terminated at any time, without warning. Jobs marked as rerunnable (using the "-r" option with qsub) will be re-queued if the machine they are running on is terminated.
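As a hypothetical sketch, a shell loop can queue one rerunnable job per parameter value, reusing the flags shown above (myApp and the parameter values are placeholders).  The commands are written to a file as a dry run; pipe the file through sh to actually submit:

```shell
# Generate one rerunnable qsub command per parameter value.  The -r y flag
# marks each job rerunnable, so it is re-queued if its spot instance is
# terminated.  Review submit_jobs.sh, then run "sh submit_jobs.sh" to submit.
for p in 1 2 3; do
  echo qsub -V -r y -wd /shared/code -N "J$p" -b y \
       -o "/shared/data/log/$p.out" -e "/shared/data/log/$p.err" \
       /shared/code/myApp 'param1' "$p"
done > submit_jobs.sh
cat submit_jobs.sh
```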

For more guidance on the 'qsub' command, please read: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html

Check status

Typing the Sun Grid Engine 'qstat' command will display the status of the outstanding jobs.  More related Sun Grid Engine commands here: http://star.mit.edu/cluster/docs/latest/guides/sge.html

Programming in R

There are many ways to do distributed programming from R.  This simple example uses the R Sun Grid Engine package (http://cran.r-project.org/web/packages/Rsge/index.html).

Log into the head node

starcluster sshmaster -u sgeadmin myCluster

Start an R session and install the Rsge package:

sgeadmin@master:~$ R

Copyright (C) 2012 The R Foundation for Statistical Computing
...
> install.packages("Rsge")
...
> 

Create a simple function to echo the host name, and try it out.

> sayHello<- function(x){paste("Hello World, from ", system("hostname", intern=TRUE))}
> 
> lapply(1, sayHello)
[[1]]
[1] "Hello World, from  master"


Now run the same function via Sun Grid Engine:

> library(Rsge)
Loading required package: snow

Welcome to Rsge
    Version: 0.6.3 

> sge.parLapply(1:4, sayHello, njobs=4)
Completed storing environment to disk
Submitting  4 jobs...
All jobs completed
[[1]]
[1] "Hello World, from  node001"

[[2]]
[1] "Hello World, from  master"

[[3]]
[1] "Hello World, from  master"

[[4]]
[1] "Hello World, from  node001"


Running the Load Balancer

The starcluster load balancer can observe the job queue and start new nodes or remove nodes from the cluster based on demand.

If the cluster was created using AWS spot instances, then the load balancer will use spot instances too.  From the StarCluster email list: "The loadbalancer uses ... the same price that you set for bids when the cluster was created."

starcluster loadbalance myCluster

As with the other commands, run this on the admin machine.  To keep the load balancer running after you log out, use the Unix nohup command, e.g.

nohup starcluster loadbalance myCluster &