Dataset Hosting Design

Assumptions

Where

Assume that the initial cloud we target is AWS, but that we plan to support additional clouds in the future.

For the near term we are using AWS as our external hosting partner; they have agreed to support our efforts for CTCAP. Over time we anticipate adding other external hosting partners, such as Google and Microsoft, since we imagine that different scientists and/or institutions will want to take advantage of different clouds.

We can also imagine that the platform should hold locations of files in internal hosting systems, even though not all users of the platform would have access to files in those locations.

Metadata references to hosted data files should be modelled as a collection of Locations, where a Location can be one of several types:

    from dataclasses import dataclass

    @dataclass
    class Location:
        provider: str  # AWS, Google, Azure, Sage cluster – people will want to set a preferred cloud to work in
        type: str      # filepath, download URL, S3 URL, EBS snapshot name
        location: str  # the actual URI or path
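
For example (hypothetical names and paths), a single dataset file mirrored in two places would carry one Location per copy:

    # Hypothetical example: one file mirrored in two locations.
    locations = [
        Location(provider="AWS", type="S3 url",
                 location="s3://sagebio-data/study1/expression.tar.gz"),
        Location(provider="Sage cluster", type="filepath",
                 location="/shared/data/study1/expression.tar.gz"),
    ]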

What

For now we are assuming that we are dealing with files. Later on we envision also providing access to data stored in a database and/or a data warehouse.

Who

Assume tens of thousands of users will eventually use the platform.

How

Assume that users who only want to download do not need to have an AWS account.

Assume that anyone wanting to interact with data on EC2 or Elastic MapReduce will have an AWS account and will be willing to give us their account number (which is a safe piece of info to give out; it is not an AWS credential).

Design Considerations

High Level Use Cases

Users want to:

There is a more exhaustive list of considerations above, but what follows are some firm and some loose requirements:

Options to Consider

AWS Public Data Sets

Current Scenario:

Future Scenario:

Tech Details:

Pros:

Cons:

S3

Skipping a description of public data on S3 because the scenario is very straightforward - if you have the URL you can download the resource. For example: http://s3.amazonaws.com/nicole.deflaux/ElasticMapReduceFun/mapper.R

Protected Data Scenario:

Open Questions:

Resources:

Options to Restrict Access to S3

S3 Pre-Signed URLs for Private Content
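
A minimal sketch of how this could work, using the Python boto3 SDK and hypothetical bucket/key names: our service authenticates the user, mints a time-limited URL for the private object, and hands it back.

    import boto3

    # Mint a URL for a private object that expires in one hour.
    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "sagebio-protected-data",
                "Key": "ElasticMapReduceFun/mapper.R"},
        ExpiresIn=3600,  # seconds
    )
    print(url)  # anyone holding this URL can GET the object until it expires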

Pros:

Cons:

Open Questions:

Resources:

S3 Bucket Policies

This is the newer mechanism from AWS for access control.
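
As a sketch (boto3, with a hypothetical bucket name and account ID), a bucket policy could grant read access on protected objects to a specific AWS account:

    import json
    import boto3

    # Grant another AWS account read access to objects in the bucket.
    policy = {
        "Version": "2008-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::sagebio-protected-data/*",
        }],
    }
    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket="sagebio-protected-data",
                         Policy=json.dumps(policy))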

Open Questions:

Resources:

S3 ACL

This is the older mechanism from AWS for access control.

This is ruled out for protected data because ACLs can have a maximum of 100 grants, and it appears that these grants cannot be made to groups, such as a group of arbitrary AWS users.

Open Question:

Resources:

S3 and IAM

With IAM, a group of users can be granted access to S3 resources. This will be helpful for managing access for Sage system administrators and Sage employees.

This is ruled out for protected data because IAM is used for managing groups of users all under a particular AWS bill (e.g., all employees of a company).
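
For the internal use case, a sketch (boto3, hypothetical group and bucket names) of granting an IAM group of Sage employees read access to a bucket:

    import json
    import boto3

    # Attach an inline policy to an IAM group so its members can read the bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::sagebio-internal-data",
                         "arn:aws:s3:::sagebio-internal-data/*"],
        }],
    }
    iam = boto3.client("iam")
    iam.put_group_policy(GroupName="SageEmployees",
                         PolicyName="ReadInternalData",
                         PolicyDocument=json.dumps(policy))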

Open Questions:

Resources:

CloudFront Private Content

CloudFront supports the notion of private content. CloudFront URLs can be created with access policies such as an expiry time and an IP mask.
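
A sketch of minting a signed CloudFront URL with an expiry time, using botocore's CloudFrontSigner and the third-party rsa package; the key pair ID, private key file, and distribution domain are hypothetical. (An IP mask requires a custom policy rather than the canned one shown here.)

    from datetime import datetime, timedelta
    import rsa  # third-party package, used to sign with the CloudFront key pair
    from botocore.signers import CloudFrontSigner

    def rsa_signer(message):
        with open("cloudfront-private-key.pem", "rb") as f:
            key = rsa.PrivateKey.load_pkcs1(f.read())
        return rsa.sign(message, key, "SHA-1")  # CloudFront signatures use SHA-1

    signer = CloudFrontSigner("APKAEXAMPLEKEYID", rsa_signer)
    url = signer.generate_presigned_url(
        "https://d111111abcdef8.cloudfront.net/ElasticMapReduceFun/mapper.R",
        date_less_than=datetime.utcnow() + timedelta(hours=1),
    )
    print(url)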

Pros:

Open Questions:

Resources:

Options to Have Customers Bear Some of the Costs

S3 "Requester Pays" Buckets

http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html
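
With Requester Pays, the bucket owner pays for storage while the downloader pays the request and data-transfer costs; requests must be signed with the requester's own AWS credentials and flagged as requester-pays. A sketch (boto3, hypothetical bucket name):

    import boto3

    # The requester's own credentials are used; they are billed for the transfer.
    s3 = boto3.client("s3")
    s3.download_file(
        Bucket="sagebio-requester-pays-data",
        Key="ElasticMapReduceFun/mapper.R",
        Filename="mapper.R",
        ExtraArgs={"RequestPayer": "requester"},
    )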

DevPay

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?UsingDevPay.html

Flexible Payments Service

EBS

Data is available as hard drive snapshots.

Pros:

Cons:

EBS snapshot ACL
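
A sketch (boto3, hypothetical snapshot and account IDs) of sharing a snapshot with the AWS account number a user has given us:

    import boto3

    # Grant one AWS account permission to create volumes from our snapshot.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.modify_snapshot_attribute(
        SnapshotId="snap-1234567890abcdef0",
        Attribute="createVolumePermission",
        OperationType="add",
        UserIds=["111122223333"],
    )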

Open questions:

Resources:

Custom Proxy

All files are kept completely private on S3, and we write a custom proxy that allows users with permission to download files, whether to locations outside AWS or to EC2 hosts.
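
One way to sketch such a proxy (Flask and boto3, with a hypothetical bucket, route, and auth check): authenticate the request, authorize it against the platform's metadata, then redirect to a short-lived pre-signed URL so S3 serves the actual bytes.

    import boto3
    from flask import Flask, abort, redirect, request

    app = Flask(__name__)
    s3 = boto3.client("s3")
    BUCKET = "sagebio-protected-data"  # hypothetical bucket name

    def user_may_read(user_token, key):
        """Hypothetical permission check against the platform's metadata store."""
        return user_token == "demo-token"  # placeholder logic

    @app.route("/download/<path:key>")
    def download(key):
        if not user_may_read(request.headers.get("X-Auth-Token"), key):
            abort(403)
        # Redirect to a short-lived pre-signed URL instead of streaming the
        # bytes through the proxy, which keeps the proxy fleet small.
        url = s3.generate_presigned_url(
            "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=300
        )
        return redirect(url)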

Pros:

Cons:

If all else fails we can do this, but it will be more work operationally to manage a fleet of custom proxy servers.

File Organization and Format

TODO Brian will add stuff here

Resources:

Additional Details

Network Bandwidth

Hypothetically, the Hutch (and therefore Sage) is on a high-throughput link to AWS.

The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.

The Corporation for Education Network Initiatives in California (CENIC) and Pacific NorthWest GigaPoP (PNWGP) announced two 10 Gigabit per second (Gbps) connections to Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2) for the use of CENIC's members in California, as well as PNWGP's multistate K-20 research and education community.

http://findarticles.com/p/news-articles/wireless-news/mi_hb5558/is_20100720/cenic-pacific-northwest-partner-develop/ai_n54489237/
http://www.internet2.edu/maps/network/connectors_participants
http://www.pnw-gigapop.net/partners/index.html