Note that we went with pre-signed URLs, and we plan to switch to IAM tokens (instead of IAM users) when that feature is launched.

Dataset Hosting Design

Proposal

Sage Employee Use Case

Download Use Case

EC2 Cloud Compute Use Case

Elastic MapReduce Use Case

Assumptions

Where

Assume that the initial cloud we target is AWS, but we plan to support additional clouds in the future.

For the near term we are using AWS as our external hosting partner. They have agreed to support our efforts for CTCAP. Over time we anticipate adding additional external hosting partners such as Google and Microsoft. We imagine that different scientists and/or institutions will want to take advantage of different clouds.

We can also imagine the platform holding locations of files in internal hosting systems, even though not all platform users would have access to files in those locations.

Metadata references to hosted data files should be modelled as a collection of Locations, where a Location can be one of several types:

class Location {
    String provider; // AWS, Google, Azure, Sage cluster; people will want to set a preferred cloud to work in
    String type;     // filepath, download URL, S3 URL, EBS snapshot name
    String location; // the actual URI or path
}
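
For example (hypothetical values), a single dataset might carry both a Location of type "download url" pointing at S3 for external users and a Location of type "EBS snapshot name" for users computing on EC2.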

What

For now we assume that we are dealing with files. Later on we can also envision providing access to data stored in a database and/or a data warehouse.

We also assume that this data can be stored in S3 in plain text (no encryption).

We also assume that only "unrestricted" and "embargoed" data are hosted by Sage. We assume that "restricted" data is hosted elsewhere such as dbGaP.

Who

Assume tens of thousands of users will eventually use the platform.

How

Assume that users who only want to download do not need to have an AWS account.

Assume that anyone wanting to interact with data on EC2 or Elastic MapReduce will have an AWS account and will be willing to give us their account number (a safe piece of information to give out; it is not an AWS credential).

Design Considerations

High Level Use Cases

Users want to:

There is a more exhaustive list of considerations above, but what follows are some firm and some loose requirements:

Options to Consider

AWS Public Data Sets

Scenario:

Future Considerations:

Tech Details:

Pros:

Cons:

Custom Proxy

All files are kept completely private on S3, and we write a custom proxy that allows users with permission to download files, whether to locations outside AWS or to EC2 hosts.
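
A minimal sketch of such a proxy, assuming Flask and boto3 (the bucket name, the session-token header, and the permission check are all hypothetical placeholders, not a settled design):

    import boto3
    from flask import Flask, Response, abort, request

    app = Flask(__name__)
    s3 = boto3.client("s3")  # the proxy holds the only AWS credentials

    def user_may_read(session_token, key):
        # Hypothetical check against the platform's own permission store.
        return session_token is not None

    @app.route("/download/<path:key>")
    def download(key):
        if not user_may_read(request.headers.get("X-Session-Token"), key):
            abort(403)
        # Stream the private S3 object through the proxy to the caller.
        obj = s3.get_object(Bucket="sagebio-datasets", Key=key)
        return Response(obj["Body"].iter_chunks(), content_type=obj["ContentType"])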

Scenario:

Tech Details:

Pros:

Cons:

S3

Skipping a description of public data on S3 because the scenario is very straightforward: if you have the URL, you can download the resource. For example: http://s3.amazonaws.com/nicole.deflaux/ElasticMapReduceFun/mapper.R

Unrestricted and Embargoed Data Scenario:

Resources:

Open Questions:

Options to Restrict Access to S3

S3 Pre-Signed URLs for Private Content

"Query String Request Authentication Alternative: You can authenticate certain types of requests by passing the required information as query-string parameters instead of using the Authorization HTTP header. This is useful for enabling direct third-party browser access to your private Amazon S3 data, without proxying the request. The idea is to construct a "pre-signed" request and encode it as a URL that an end-user's browser can retrieve. Additionally, you can limit a pre-signed request by specifying an expiration time."

Scenario:

Tech Details:

Pros:

Cons:

Open Questions:

Resources:

S3 Bucket Policies

This is the newer mechanism from AWS for access control. We can add AWS accounts and/or IP address masks and other conditions to the policy.
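
A sketch of what such a policy might look like, applied with boto3 (the account number, bucket name, and IP range are hypothetical):

    import json

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical policy: allow one collaborator's AWS account to read
    # objects, but only from a given IP range.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::sagebio-datasets/*",
            "Condition": {"IpAddress": {"aws:SourceIp": "198.51.100.0/24"}},
        }],
    }

    s3.put_bucket_policy(Bucket="sagebio-datasets", Policy=json.dumps(policy))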


Scenario:

Tech Details:

Pros:

Cons:

Open Questions:

Resources:

S3 and IAM

This is the newest mechanism for access control, and it is under heavy development in 2011 to expand its features. With IAM, a group of users (as opposed to AWS accounts) can be granted access to S3 resources. We decide who those users are. We store credentials on their behalf and use them to pre-sign S3 URLs. All activity is rolled up to a single bill. (Another approach would have been to let users hold their own IAM credentials and sign their own requests. Unfortunately, that will not scale: we would need a group per dataset to get the permissions right, and the mapping of users to groups grows linearly with the number of datasets.)

"AWS Identity and Access Management is a web service that enables Amazon Web Services (AWS) customers to manage Users and User permissions in AWS. The service is targeted at organizations with multiple Users or systems that use AWS products such as Amazon EC2, Amazon SimpleDB, and the AWS Management Console. With IAM, you can centrally manage Users, security credentials such as access keys, and permissions that control which AWS resources Users can access.

Without IAM, organizations with multiple Users and systems must either create multiple AWS Accounts, each with its own billing and subscriptions to AWS products, or employees must all share the security credentials of a single AWS Account. Also, without IAM, you have no control over the tasks a particular User or system can do and what AWS resources they might use.

IAM addresses this issue by enabling organizations to create multiple Users (each User being a person, system, or application) who can use AWS products, each with individual security credentials, all controlled by and billed to a single AWS Account. With IAM, each User is allowed to do only what they need to do as part of the User's job."
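
A sketch of the flow described above, assuming boto3 (the user name, bucket, and key are hypothetical):

    import boto3

    iam = boto3.client("iam")

    # Create an IAM user for a platform user and keep the credentials on
    # their behalf (hypothetical naming scheme).
    iam.create_user(UserName="platform-user-42")
    key = iam.create_access_key(UserName="platform-user-42")["AccessKey"]

    # Pre-sign S3 URLs as that IAM user; the URL only works if the user's
    # group has been granted s3:GetObject on the dataset's objects.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=key["AccessKeyId"],
        aws_secret_access_key=key["SecretAccessKey"],
    )
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "sagebio-datasets", "Key": "dataset-123/data.tar.gz"},
        ExpiresIn=3600,
    )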

Scenario:

Tech Details:

Pros:

Cons:

Resources:

S3 ACL

This is the older mechanism from AWS for access control. It controls access at the object level, not the bucket level.

This is ruled out for protected data because an ACL can hold at most 100 grants, and it appears that grants cannot be made to groups such as groups of arbitrary AWS users.

Open Question:

Resources:

CloudFront Private Content

CloudFront supports the notion of private content. CloudFront URLs can be created with access policies such as an expiration time and an IP mask. This is ruled out because S3 provides a similar feature, and we do not need a CDN for any of the usual reasons one would want a CDN.

Since we do not need a CDN for the normal reason (to move frequently requested content closer to the user and reduce download times), this does not buy us much over S3 pre-signed URLs. The only added benefit appears to be the IP mask on top of the expiration time.

Pros:

Cons:

Resources:

Options to have customers bear some of the costs

S3 "Requester Pays" Buckets

In this scenario, the requester's AWS account is charged for any download bandwidth incurred.
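
A sketch of both halves of this, assuming boto3 (bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # One-time setup: flag the bucket as Requester Pays.
    s3.put_bucket_request_payment(
        Bucket="sagebio-datasets",
        RequestPaymentConfiguration={"Payer": "Requester"},
    )

    # Downloaders must acknowledge the charge explicitly; requests that
    # omit the RequestPayer flag are rejected.
    obj = s3.get_object(
        Bucket="sagebio-datasets",
        Key="dataset-123/data.tar.gz",
        RequestPayer="requester",
    )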

Scenario:

Tech details:

Pros:

Cons:

Resources:

Dev Pay

Dev Pay only requires Amazon.com accounts, not AWS accounts.

Dev Pay is not appropriate for our use case of providing read-only access to shared data on S3. It requires that we make a copy of all S3 data for each customer accessing it via Dev Pay (each customer has to access their own bucket).

Dev Pay may be appropriate later on when we are facilitating read-write access to user-specific data or data shared by a small group.

Resources

Flexible Payments Service, PayPal, etc.

We can use general services for billing customers. It would be up to us to determine what is a transaction, how much to charge, etc.

We may need to keep our own transaction ledger and issue bills. We would definitely let another company handle credit card operations.

EBS

Data is available as hard drive snapshots.

Pros:

Cons:

Recommendation: focus on S3 for now, since it can meet all our use cases. Work on EBS later if customers ask for it, given its convenience in a cloud computing environment.

EBS snapshot ACL
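
EBS snapshots can be shared with specific AWS account IDs via the snapshot's createVolumePermission attribute, which is how the account numbers we collect from users (see the assumptions above) would be used. A sketch assuming boto3 (the snapshot ID and account number are hypothetical):

    import boto3

    ec2 = boto3.client("ec2")

    # Grant one customer's AWS account permission to create volumes from
    # our dataset snapshot (hypothetical IDs).
    ec2.modify_snapshot_attribute(
        SnapshotId="snap-0123456789abcdef0",
        Attribute="createVolumePermission",
        OperationType="add",
        UserIds=["111122223333"],
    )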

Open questions:

Resources

File Organization and Format

S3 bucket URLs are best formed from hostnames under our DNS control, which assures uniqueness.

Logging should be done to its own bucket.
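
A sketch of routing S3 access logs to a dedicated bucket, assuming boto3 (bucket names are hypothetical; the target bucket must grant the S3 log-delivery group write permission):

    import boto3

    s3 = boto3.client("s3")

    # Write access logs for the data bucket into a separate logging bucket
    # rather than mixing them in with the datasets themselves.
    s3.put_bucket_logging(
        Bucket="sagebio-datasets",
        BucketLoggingStatus={
            "LoggingEnabled": {
                "TargetBucket": "sagebio-logs",
                "TargetPrefix": "datasets/",
            }
        },
    )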

Resources:

Additional Details

Network Bandwidth

Hypothetically, the Hutch (and therefore Sage) is on a high-throughput link to AWS.

The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.

PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.

The Corporation for Education Network Initiatives in California (CENIC) and Pacific NorthWest GigaPoP (PNWGP) announced two 10 Gigabit per second (Gbps) connections to Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2) for the use of CENIC's members in California, as well as PNWGP's multistate K-20 research and education community.

http://findarticles.com/p/news-articles/wireless-news/mi_hb5558/is_20100720/cenic-pacific-northwest-partner-develop/ai_n54489237/
http://www.internet2.edu/maps/network/connectors_participants
http://www.pnw-gigapop.net/partners/index.html