Dataset Hosting Design

Assumptions

Where

Assume that the initial cloud we target is AWS but we plan to support additional clouds in the future.

For the near term we are using AWS as our external hosting partner. They have agreed to support our efforts for CTCAP. Over time we anticipate adding additional external hosting partners such as Google and Microsoft. We imagine that different scientists and/or institutions will want to take advantage of different clouds.

We can also imagine that the platform should hold locations of files in internal hosting systems, even though not all users of the platform would have access to files in those locations.

Metadata references to hosted data files should be modelled as a collection of Locations, where a Location could be of many types:

an S3 URL
a Google Storage URL
an Azure Blobstore URL
an EBS snapshot id
a filepath on a Sage internal server
....

Class Location
    String provider // AWS, Google, Azure, Sage cluster – people will want to set a preferred cloud to work in
    String type     // filepath, download url, S3 url, EBS snapshot name
    String location // the actual uri or path
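
As a rough illustration of the Location model above, here is a minimal Python sketch. It assumes a dataset's metadata carries a list of locations and that a user has stated a preferred provider. The `pick_location` helper and the example URIs are hypothetical and are not part of the platform.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    provider: str  # e.g. "AWS", "Google", "Azure", "Sage cluster"
    type: str      # e.g. "filepath", "download url", "S3 url", "EBS snapshot name"
    location: str  # the actual URI or path


def pick_location(locations: List[Location], preferred_provider: str) -> Optional[Location]:
    """Return the first location hosted by the preferred provider, else None."""
    for loc in locations:
        if loc.provider == preferred_provider:
            return loc
    return None


# Hypothetical locations for one dataset (placeholder URIs, not real resources).
dataset_locations = [
    Location("Sage cluster", "filepath", "/data/example/dataset.tar"),
    Location("AWS", "S3 url", "s3://example-bucket/dataset.tar"),
]

chosen = pick_location(dataset_locations, "AWS")
print(chosen.location if chosen else "no location for that provider")
```

Keeping the provider as a separate field is what would let a user filter on a preferred cloud without parsing the URI.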

What

For now we are assuming that we are dealing with files. Later on we can also envision providing access to data stored in a database and/or a data warehouse.

Who

Assume tens of thousands of users will eventually use the platform.

How

Assume that users who only want to download do not need to have an AWS account.

Assume that anyone wanting to interact with data on EC2 or Elastic MapReduce will have an AWS account and will be willing to give us their account number (which is safe to give out because it is not an AWS credential).

Design Considerations

metadata
- how to ensure we have metadata for all stuff in the cloud
file formats
- tar archives or individual files on S3?
- EBS block devices per dataset?
file layout
- how to organize what we have
- how can we enforce a clean layout for files and EBS volumes?
- how to keep track of what we have
access patterns
- we want to make the right thing be the easy thing - make it easy to do computation in the cloud
- file download will be supported but will not be the recommended use case
- recommendations and examples from the R prompt for interacting with the data when working on EC2
security
- not all data is public
- encryption or clear text?
  - key management
- one time urls?
- intrusion detection
- how to manage ACLs and bucket policies
  - are there scalability upper bounds on ACLs? e.g., can't add more than X AWS accounts to an ACL
auditability
- how to have audit logs
- how to download them and make use of them
human data and regulations
- what recommendations do we make to people getting some data from Sage and some data from dbGaP and co-mingling that data in the cloud
monitoring - what should be monitored
- access patterns
- who
- when
- what
- how much
  - data footprint
  - upload bandwidth
  - download bandwidth
  - archive unused data to cheaper storage
cost
- read vs. write
- cost of allowing writes
- cost of keeping same data in multiple formats
- can we take advantage of the free hosting for http://aws.amazon.com/datasets even though we want to keep an audit log?
- how to meter and bill customers for usage
operations
- how to make it efficient to manage
- reduce the burden of administrative tasks
- how to enable multiple administrators
how long does it take to get files up/down?
- upload speeds - we are on the lambda rail
- shipping hard drives
durability
- data corruption
- data loss
scalability
- if possible, we want to only be the access grantors and then let the hosting provider take care of enforcing access controls and vending data

High Level Use Cases

Users want to:

download a public dataset
download a protected dataset for which the platform has granted them access
use a public dataset on EC2
use a protected dataset for which the platform has granted them access on EC2
use a public dataset on Elastic MapReduce
use a protected dataset for which the platform has granted them access on Elastic MapReduce

The list of considerations above is more exhaustive; what follows are some firm and some loose requirements:

enforce access restrictions on protected datasets
log downloads
log EC2/EMR usage
figure out how to monitor user usage such that we could potentially charge them for usage
think about how to minimize costs
think about how to ensure that users sign a EULA before getting access to data
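
The requirements above could fit together roughly as in the following Python sketch: check the grant, check the EULA, and log every decision. All names here (`AccessGate`, `authorize`, the user and dataset ids) are hypothetical, in-memory sets stand in for whatever store the platform would actually use, and requiring a EULA even for public datasets is an assumption rather than a settled requirement.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Set, Tuple


@dataclass
class AccessGate:
    # (user, dataset) pairs the platform has granted; public datasets need no grant.
    grants: Set[Tuple[str, str]] = field(default_factory=set)
    public_datasets: Set[str] = field(default_factory=set)
    eula_signed: Set[Tuple[str, str]] = field(default_factory=set)
    audit_log: List[dict] = field(default_factory=list)

    def authorize(self, user: str, dataset: str, action: str) -> bool:
        """Decide whether user may perform action on dataset, and log the decision."""
        has_eula = (user, dataset) in self.eula_signed
        has_grant = dataset in self.public_datasets or (user, dataset) in self.grants
        allowed = has_eula and has_grant
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "who": user,
            "what": dataset,
            "action": action,  # e.g. "download", "ec2", "emr"
            "allowed": allowed,
        })
        return allowed


# Hypothetical users and datasets.
gate = AccessGate(public_datasets={"public-set"}, grants={("alice", "protected-set")})
gate.eula_signed.add(("alice", "protected-set"))
print(gate.authorize("alice", "protected-set", "download"))
print(gate.authorize("bob", "protected-set", "download"))
```

Logging denied requests as well as allowed ones keeps the audit trail useful for spotting access attempts on protected data.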

Options to Consider

AWS Public Data Sets

Current Scenario:

Sage currently has two data sets stored as "AWS Public Datasets" in the US West Region.
Users can discover them by browsing public datasets on http://aws.amazon.com/datasets/Biology?browse=1 and also via the platform.
Users can use them for cloud computation by spinning up EC2 instances and mounting the data as EBS volumes.
Users cannot directly download data from these public datasets, but once they have them mounted on an EC2 host, they can certainly scp them to their local system.
Users are not forced to sign a Sage-specified EULA prior to access because they can bypass the platform and access this data directly via normal AWS mechanisms.
Users must have an AWS account to access this data.
There is no mechanism to grant access. All users with AWS accounts are granted access by default.
There is no mechanism to keep an audit log for downloads or other usage of this data.
Users pay for access by paying their own costs for EC2 and bandwidth charges.
The cost of hosting is free.

Future Scenario:

this is currently EBS only but it will also be available for S3 in the future
TODO ask Deepak what other plans they have in mind for the re-launch of AWS Public Datasets.
TODO tell Deepak our suggested features for AWS Public Datasets.

Tech Details:

You create a new "Public Dataset" by
- making an EBS snapshot in each region in which you would like it to be available
- providing the snapshot id(s) and metadata to Deepak (TODO see if this is still the case)
- then you wait for Amazon to get around to it

Pros:

free hosting!
scalable

Cons:

this won't work for public data if it is a requirement that
- all users provide an email address and agree to a EULA prior to access
- we must log downloads
this won't work for protected data unless the future implementation provides more support

Cloud Front Private Content

http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/index.html?PrivateContent.html#CreateFinalURL

This could be very good for the download use cases.

Note that this is likely a bad solution for the EC2/EMR use cases because Cloud Front sits outside of AWS and users will incur the inbound bandwidth charges when they pull the data down to EC2/EMR.

Dev Pay

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?UsingDevPay.html

S3

Skipping a description of public data on S3 because the scenario is very straightforward - if you get the URL you can download the resource. For example: http://s3.amazonaws.com/nicole.deflaux/ElasticMapReduceFun/mapper.R

Protected Data Scenario:

All Sage data is stored on S3 and is not public.
Users can only discover what data is available via the platform.
Users can use the data for cloud computation by spinning up EC2 instances and downloading the files from S3 to the hard drive of their EC2 instance. See below for more details on this.
Users can download the data from S3 to their local system. See below for more details on this.
The platform directs users to sign a Sage-specified EULA prior to gaining access to these files in S3.
Users must have a Sage platform account to access this data for download.
The platform grants access to this data. See below for details.
S3 logs all access to resources and this could serve as an audit log
- TODO list what info is logged
See proposals below regarding how users might pay for usage.
The cost of hosting is not free.

Resources:

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?ServerLogs.html
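
To make the audit-log idea concrete, the following Python sketch parses S3 server access log lines and totals bytes sent per requester. The field order in the pattern is an assumption based on the space-delimited log format as we understand it and should be checked against the ServerLogs page above. The sample line is made up for illustration and is not real log data.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional

# Leading fields of an access log line; any trailing fields are ignored.
LOG_PATTERN = re.compile(
    r'(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<http_status>\S+) (?P<error_code>\S+) '
    r'(?P<bytes_sent>\S+) (?P<object_size>\S+)'
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def bytes_sent_by_requester(lines: Iterable[str]) -> Dict[str, int]:
    """Total the bytes_sent field per requester, treating '-' as zero."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that do not look like access log entries
        sent = record["bytes_sent"]
        totals[record["requester"]] += int(sent) if sent.isdigit() else 0
    return dict(totals)


# Made-up sample line for illustration only.
SAMPLE = ('owner-id-example example-bucket [06/Feb/2011:00:00:38 +0000] 192.0.2.3 '
          'requester-id-example 3E57427F3EXAMPLE REST.GET.OBJECT data/file.tar '
          '"GET /example-bucket/data/file.tar HTTP/1.1" 200 - 1048576 1048576 70 10 '
          '"-" "curl/7.21" -')

print(bytes_sent_by_requester([SAMPLE]))
```

A per-requester byte total like this is one possible input for metering usage, if we decide to charge users for it.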

S3 ACL

This is ruled out for protected data because ACLs can have a max of 100 grants and it appears that these grants cannot be to groups such as groups of arbitrary AWS users.

Open Question:

Confirm that grants do not apply to groups of AWS users.

Resources:

http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?S3_ACLs_UsingACLs.html

S3 Bucket Policies

This is ruled out for protected data because

Resources: http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingBucketPolicies.html

S3 and IAM

With IAM a group of users can be granted access to S3 resources. This will be helpful for managing access for Sage system administrators and Sage employees.

This is ruled out for protected data because IAM is used for managing groups of users all under a particular AWS bill (e.g., all employees of a company).

Open Questions:

Is there a cap on the number of users for IAM?
Confirm that IAM is only intended for managing groups and users where the base assumption is that all activity rolls up to a single AWS bill.

Resources: http://docs.amazonwebservices.com/IAM/latest/UserGuide/index.html?UsingWithS3.html

S3 Pre-Signed URLs for Private Content

Pros:

Cons:

Resources:

"Query String Request Authentication Alternative: You can authenticate certain types of requests by passing the required information as query-string parameters instead of using the Authorization HTTP header. This is useful for enabling direct third-party browser access to your private Amazon S3 data, without proxying the request. The idea is to construct a "pre-signed" request and encode it as a URL that an end-user's browser can retrieve. Additionally, you can limit a pre-signed request by specifying an expiration time."
http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RESTAuthentication.html
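
The quoted passage can be sketched in Python as follows. This is an illustration of the HMAC-SHA1 query-string scheme as we understand it from the page above, not a tested client. The function name, bucket, key and credentials are all hypothetical placeholders. Newer AWS signing schemes differ, so a real implementation should use an AWS SDK rather than hand-rolled signing.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def presign_get_url(bucket: str, key: str, access_key_id: str,
                    secret_key: str, expires_epoch: int) -> str:
    """Build a query-string-authenticated GET URL (legacy HMAC-SHA1 scheme)."""
    resource = "/" + bucket + "/" + quote(key)
    # Verb, Content-MD5, Content-Type, Expires, then the canonicalized resource.
    string_to_sign = "GET\n\n\n" + str(expires_epoch) + "\n" + resource
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return ("https://s3.amazonaws.com" + resource
            + "?AWSAccessKeyId=" + quote(access_key_id, safe="")
            + "&Expires=" + str(expires_epoch)
            + "&Signature=" + quote(signature, safe=""))


# Placeholder credentials and names; the expiry is seconds since the epoch.
url = presign_get_url("example-bucket", "data/file.tar",
                      "AKIDEXAMPLE", "not-a-real-secret", 1300000000)
print(url)
```

Because the expiry time is part of the signed string, a user cannot extend a URL's lifetime by editing the `Expires` parameter.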

EBS snapshot ACL

http://docs.amazonwebservices.com/AWSEC2/latest/APIReference/index.html?ApiReference-query-ModifySnapshotAttribute.html

What is the max grant number for this?

Custom Proxy

All files are kept completely private on S3 and we write a custom proxy that allows users with permission to download files whether to locations outside AWS or to EC2 hosts.

Additional Details

Network Bandwidth

The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.

The Corporation for Education Network Initiatives in California (CENIC) and Pacific NorthWest GigaPoP (PNWGP) announced two 10 Gigabit per second (Gbps) connections to Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2) for the use of CENIC's members in California, as well as PNWGP's multistate K-20 research and education community.

http://findarticles.com/p/news-articles/wireless-news/mi_hb5558/is_20100720/cenic-pacific-northwest-partner-develop/ai_n54489237/
http://www.internet2.edu/maps/network/connectors_participants
http://www.pnw-gigapop.net/partners/index.html
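
As a back-of-the-envelope check on upload times, the following Python sketch converts a dataset size and link speed into transfer time. The 10 Gbps figure comes from the announcement above. The 1 TB dataset size and the 25% utilization factor are assumptions for illustration, not measurements.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float,
                     efficiency: float = 1.0) -> float:
    """Time to move size_bytes over a link, at a given fraction of line rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return (size_bytes * 8) / (link_bits_per_second * efficiency)


ONE_TB = 1e12    # decimal terabyte, in bytes (assumed dataset size)
TEN_GBPS = 10e9  # the 10 Gbps link speed quoted above

ideal = transfer_seconds(ONE_TB, TEN_GBPS)            # full line rate
realistic = transfer_seconds(ONE_TB, TEN_GBPS, 0.25)  # assumed 25% utilization
print(f"{ideal / 60:.1f} min ideal, {realistic / 60:.1f} min at 25% utilization")
```

Numbers like these would need to be compared against measured throughput from our own network before deciding between network upload and shipping hard drives.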