Dataset Hosting Design

Note that we went with pre-signed URLs and plan to switch to IAM tokens (instead of IAM users) when that feature is launched.

Proposal

  • All data is stored on S3 as our hosting partner.
  • All data will be served over SSL/HTTPS.
  • We will have one Identity and Access Management (IAM) group for read-only access to all datasets.
  • We will generate and store IAM credentials for each user that signs any EULA for any dataset. The user will be added to the read-only access IAM group.
  • We never give those IAM credentials out; we only use them to generate pre-signed S3 URLs with an expiry time of roughly an hour.
  • With these pre-signed URLs, users are able to download data directly from S3 using the Web UI, the R client, or even something simple like curl.
  • Our Crowd groups are more granular and specify which users are allowed to have pre-signed URLs for which datasets.
  • The use of IAM allows us to track who has downloaded what simply by reading the S3 access logs.
  • Users can download these files to EC2 hosts (no bandwidth charges for Sage) or to external locations (Sage pays bandwidth charges).
  • For users who want to utilize Elastic MapReduce, which does not currently support IAM, we will add them to the Bucket Policy for the dataset bucket with read-only access.
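The pre-signed URL step in the proposal above can be sketched with S3's query-string request authentication (the AWSAccessKeyId/Expires/Signature scheme, signature version 2). This is a minimal sketch, not the repository service's actual code; the bucket, key, and credentials are placeholders.

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote


def presign_s3_url(bucket, key, access_key, secret_key, expires_in=3600):
    """Build a pre-signed S3 GET URL using query-string authentication.

    The signature is an HMAC-SHA1 over a fixed StringToSign; anyone holding
    the URL can download the object until the Expires timestamp passes.
    """
    expires = int(time.time()) + expires_in
    # StringToSign for a GET: method, empty Content-MD5, empty Content-Type,
    # expiry timestamp, and the path-style resource.
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = quote(base64.b64encode(digest), safe="")
    return ("https://s3.amazonaws.com/%s/%s?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, key, access_key, expires, signature))


# Placeholder credentials; in the proposal these would be the stored IAM
# credentials of the user requesting the download.
url = presign_s3_url("data01.sagebase.org", "human_liver_cohort/readme.txt",
                     "AKIAEXAMPLE", "secretkeyexample")
```

Because the signing is purely local, vending a URL costs the platform nothing beyond the repository-service call; S3 verifies the signature and expiry on each request.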

Sage Employee Use Case

Download Use Case

EC2 Cloud Compute Use Case

Elastic MapReduce Use Case

Assumptions

Where

Assume that the initial cloud we target is AWS but we plan to support additional clouds in the future.

For the near term we are using AWS as our external hosting partner. They have agreed to support our efforts for CTCAP. Over time we anticipate adding additional external hosting partners such as Google and Microsoft. We imagine that different scientists and/or institutions will want to take advantage of different clouds.

We can also imagine that the platform should hold locations of files in internal hosting systems, even though not all users of the platform would have access to files in those locations.

Metadata references to hosted data files should be modelled as a collection of Locations, where a Location could be of many types:

  • an S3 URL
  • a Google Storage URL
  • an Azure Blobstore URL
  • an EBS snapshot id
  • a filepath on a Sage internal server
  • ....
Class Location
    String provider // AWS, Google, Azure, Sage cluster – people will want to set a preferred cloud to work in
    String type     // filepath, download url, S3 url, EBS snapshot name
    String location // the actual uri or path
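The Location sketch above could map to a simple record type; this is an illustrative Python rendering with placeholder values, not a committed schema.

```python
from dataclasses import dataclass


@dataclass
class Location:
    provider: str  # AWS, Google, Azure, Sage cluster
    type: str      # filepath, download url, S3 url, EBS snapshot name
    location: str  # the actual uri or path


# A dataset's metadata holds a collection of Locations across providers.
locations = [
    Location("AWS", "S3 url", "s3://data01.sagebase.org/human_liver_cohort/"),
    Location("AWS", "EBS snapshot name", "snap-0123abcd"),
    Location("Sage cluster", "filepath", "/work/data/human_liver_cohort/"),
]

# A user with a preferred cloud filters to the matching provider.
aws_locations = [loc for loc in locations if loc.provider == "AWS"]
```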

What

For now we are assuming that we are dealing with files. Later on we can also envision providing access to data stored in a database and/or a data warehouse.

We also assume that this data can be stored in S3 in plain text (no encryption).

We also assume that only "unrestricted" and "embargoed" data are hosted by Sage. We assume that "restricted" data is hosted elsewhere such as dbGaP.

Who

Assume tens of thousands of users will eventually use the platform.

How

Assume that users who only want to download do not need to have an AWS account.

Assume that anyone wanting to interact with data on EC2 or Elastic MapReduce will have an AWS account and will be willing to give us their account number (which is a safe piece of info to give out, it is not an AWS credential).

Design Considerations

  • metadata
    • how to ensure we have metadata for all stuff in the cloud
  • file formats
    • tar archives or individual files on S3?
    • EBS block devices per dataset?
  • file layout
    • how to organize what we have
    • how can we enforce a clean layout for files and EBS volumes?
    • how to keep track of what we have
  • access patterns
    • we want to make the right thing be the easy thing - make it easy to do computation in the cloud
    • file download will be supported but will not be the recommended use case
    • recommendations and examples from the R prompt for interacting with the data when working on EC2
  • security
    • not all data is public
    • encryption or clear text?
      • key management
    • one time urls?
    • intrusion detection
    • how to manage ACLs and bucket policies
      • are there scalability upper bounds on ACLs? e.g., can't add more than X AWS accounts to an ACL
  • auditability
    • how to have audit logs
    • how to download them and make use of them
  • human data and regulations
    • what recommendations do we make to people getting some data from Sage and some data from dbGaP and co-mingling that data in the cloud
  • monitoring - what should be monitored
    • access patterns
    • who
    • when
    • what
    • how much
      • data foot print
      • upload bandwidth
      • download bandwidth
      • archive to cheaper storage unused stuff
  • cost
    • read vs. write
    • cost of allowing writes
    • cost of keeping same data in multiple formats
    • can we take advantage of the free hosting for http://aws.amazon.com/datasets even though we want to keep an audit log?
    • how to meter and bill customers for usage
  • operations
    • how to make it efficient to manage
    • reduce the burden of administrative tasks
    • how to enable multiple administrators
  • how long does it take to get files up/down?
    • upload speeds - we are on the lambda rail
    • shipping hard drives
  • durability
    • data corruption
    • data loss
  • scalability
    • if possible, we want to only be the access grantors and then let the hosting provider take care of enforcing access controls and vending data

High Level Use Cases

Users want to:

  • download an unrestricted dataset
  • download an embargoed dataset for which the platform has granted them access
  • use an unrestricted dataset on EC2
  • use an embargoed dataset for which the platform has granted them access on EC2
  • use an unrestricted dataset on Elastic MapReduce
  • use an embargoed dataset for which the platform has granted them access on Elastic MapReduce

There is a more exhaustive list of considerations above, but what follows are some firm and some loose requirements:

  • enforce access restrictions on datasets
  • log downloads
  • log EC2/EMR usage
  • figure out how to monitor user usage such that we could potentially charge them for usage
  • think about how to minimize costs
  • think about how to ensure that users sign a EULA before getting access to data

Options to Consider

AWS Public Data Sets

Scenario:

  • Sage currently has two data sets stored as "AWS Public Datasets" in the US West Region.
  • Users can discover them by browsing public datasets on http://aws.amazon.com/datasets/Biology?browse=1 and also via the platform.
  • Users can use them for cloud computation by spinning up EC2 instances and mounting the data as EBS volumes.
  • Users cannot directly download data from these public datasets, but once they have them mounted on an EC2 host, they can certainly scp them to their local system.
  • Users cannot directly use these files with Elastic MapReduce.  Instead they would need to first upload them to their own S3 bucket.
  • Users are not forced to sign a Sage-specified EULA prior to access because they can bypass the platform and access this data via normal AWS mechanisms.
  • Users must have an AWS account to access this data.
  • There is no mechanism to grant access. All users with AWS accounts are granted access by default.
  • There is no mechanism to keep an audit log for downloads or other usage of this data.
  • Users pay for access by paying their own costs for EC2 and bandwidth charges if they choose to download the data out of AWS.
  • Hosting is free.

Future Considerations:

  • this is currently EBS only but it will also be available for S3 in the future
  • TODO ask Deepak what other plans they have in mind for the re-launch of AWS Public Datasets.
  • TODO tell Deepak our suggested features for AWS Public Datasets.

Tech Details:

  • You create a new "Public Dataset" by
    • making an EBS snapshot in each region in which you would like it to be available
    • providing the snapshot id(s) and metadata to Deepak (TODO see if this is still the case)
    • then you wait for Amazon to get around to it

Pros:

  • free hosting!
  • scalable

Cons:

  • this won't work for public data if it is a requirement that
    • all users provide an email address and agree to a EULA prior to access
    • we must log downloads
  • this won't work for protected data unless the future implementation provides more support

Custom Proxy

All files are kept completely private on S3, and we write a custom proxy that allows authorized users to download files, whether to locations outside AWS or to EC2 hosts.

Scenario:

  • See details below about the general S3 scenario.
  • This would not be compatible with Elastic MapReduce.  Users would have to download files to an EC2 host and then store them in their own bucket in S3.
  • The platform grants access to the data by facilitating the proxy of their request to S3.
  • Users do not need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.

Tech Details:

  • We would have a fleet of proxy servers running on EC2 hosts that authenticate and authorize users and then proxy downloads from S3.
  • We would need to configure auto-scaling so that capacity for this proxy fleet can grow and shrink as needed.
  • We would log events when data access is granted and when data is accessed.

Pros:

  • full flexibility
  • we can accurately track who has downloaded what, and when

Cons:

  • we have another service fleet to administer
  • we are now the scalability bottleneck

S3

Skipping a description of public data on S3 because the scenario is very straightforward - if you have the URL, you can download the resource. For example: http://s3.amazonaws.com/nicole.deflaux/ElasticMapReduceFun/mapper.R

Unrestricted and Embargoed Data Scenario:

  • All Sage data is stored on S3 and is not public.
  • Users can only discover what data is available via the platform.
  • Users can use the data for cloud computation by spinning up EC2 instances and downloading the files from S3 to the hard drive of their EC2 instance.
  • Users can download the data from S3 to their local system.
  • The platform directs users to sign a Sage-specified EULA prior to gaining access to these files in S3.
  • Users must have a Sage platform account to access this data for download. They may need an AWS account depending upon the mechanism we use to grant access.
  • The platform grants access to this data. See below for details about the various ways we might do this.
  • The platform will write to the audit log each time it grants access. S3 can also be configured to log all access to resources, which could serve as a record for billing purposes and for intrusion detection.
    • These two types of logs record different events (granting access vs. using access), so their entries will not have a strict 1-to-1 mapping, but they should overlap substantially.
    • The platform can store anything it likes in its audit log.
    • The S3 log stores normal web access log type data with the following identifiable fields:
      • client IP address is available in the log
      • "anonymous" or the user's AWS canonical user id will appear in the log
      • We can leave these logs on S3 and run Elastic MapReduce jobs on them
        when we need to do data mining. Or we can download them and do data mining locally.
  • See proposals below regarding how users might pay for outbound bandwidth.
  • Hosting is not free.
    • Storage fees will apply.
    • Data import fees:
      • Bandwidth fees apply when data is uploaded.
      • Data can also be shipped via hard drives, in which case AWS Import/Export fees apply.
    • Data export fees:
      • Bandwidth fees apply when data is downloaded out of AWS. There is no charge when it is downloaded inside AWS (e.g., to an EC2 instance).
      • Data can also be shipped via hard drives, in which case AWS Import/Export fees apply.
    • These same fees apply to any S3 log data we keep on S3.

Resources:

  • Best Effort Server Log Delivery:
    • The server access logging feature is designed for best effort. You can expect that most requests against a bucket that is properly configured for logging will result in a delivered log record, and that most log records will be delivered within a few hours of the time that they were recorded.
    • However, the server logging feature is offered on a best-effort basis. The completeness and timeliness of server logging is not guaranteed. The log record for a particular request might be delivered long after the request was actually processed, or it might not be delivered at all. The purpose of server logs is to give the bucket owner an idea of the nature of traffic against his or her bucket. It is not meant to be a complete accounting of all requests.
    • Usage Report Consistency (http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?ServerLogs.html): It follows from the best-effort nature of the server logging feature that the usage reports available at the AWS portal might include usage that does not correspond to any request in a delivered server log.
    • I have it on good authority that the S3 logs are accurate, but delivery can be delayed now and then.
  • Log format details
  • AWS Credentials Primer
  • Custom Access Log Information "You can include custom information to be stored in the access log record for a request by adding a custom query-string parameter to the URL for the request. Amazon S3 will ignore query-string parameters that begin with "x-", but will include those parameters in the access log record for the request, as part of the Request-URI field of the log record. For example, a GET request for "s3.amazonaws.com/mybucket/photos/2006/08/puppy.jpg?x-user=johndoe" will work the same as the same request for "s3.amazonaws.com/mybucket/photos/2006/08/puppy.jpg", except that the "x-user=johndoe" string will be included in the Request-URI field for the associated log record. This functionality is available in the REST interface only."
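The custom-log-information feature quoted above suggests a simple trick: after vending a pre-signed URL, append an "x-" query parameter naming the platform user, so the S3 access log entry can be tied back to a platform account. A hedged sketch (the helper name is ours, not an AWS API):

```python
from urllib.parse import quote


def tag_url_for_logging(presigned_url, username):
    """Append an x-user query parameter to a URL.

    Per the S3 documentation quoted above, S3 ignores query-string
    parameters beginning with "x-" for authentication purposes but records
    them in the access log's Request-URI field, so the signature on a
    pre-signed URL remains valid.
    """
    separator = "&" if "?" in presigned_url else "?"
    return presigned_url + separator + "x-user=" + quote(username, safe="")


tagged = tag_url_for_logging(
    "https://s3.amazonaws.com/mybucket/photos/2006/08/puppy.jpg", "johndoe")
```

When mining the S3 logs later, the x-user value in the Request-URI field identifies the platform user without needing per-user AWS identities.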

Open Questions:

  • can we use the canonical user id to know who the user is if they have
    previously given us their AWS account id? No, but we can ask them to provide
    their canonical id to us.
  • if we stick our own query params on the S3 URL, will they show up in the S3 log? Yes; see the "x-" naming convention above

Options to Restrict Access to S3

S3 Pre-Signed URLs for Private Content

"Query String Request Authentication Alternative: You can authenticate certain types of requests by passing the required information as query-string parameters instead of using the Authorization HTTP header. This is useful for enabling direct third-party browser access to your private Amazon S3 data, without proxying the request. The idea is to construct a "pre-signed" request and encode it as a URL that an end-user's browser can retrieve. Additionally, you can limit a pre-signed request by specifying an expiration time."

Scenario:

  • See above details about the general S3 scenario.
  • This would not be compatible with Elastic MapReduce.  Users would have to download files to an EC2 host and then store them in their own bucket in S3.
  • The platform grants access to the data by creating pre-signed S3 URLs to Sage's private S3 files. These URLs are created on demand and have a short expiry time.
  • Users will not need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.

Tech Details:

  • Repository Service: the service has an API that vends pre-signed S3 URLs for data layer files. Users are authenticated and authorized by the service prior to the service returning a download URL for the layer. The default URL expiry time will be one minute.
  • Web Client: When the user clicks on the download button, the web servlet sends a request to the repository service which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The web servlet returns this URL to the browser as the location value of a 302 redirect. The browser begins download from S3 immediately.
  • R Client: When the user issues the download command at the R prompt, the R client sends a request to the repository service which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The R client uses the returned URL to begin to download the content from S3 immediately.
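The web-client flow above can be sketched as a tiny handler: authenticate/authorize, vend a short-lived URL, answer with a 302 so the browser downloads straight from S3. This is an illustrative stand-in; `vend_presigned_url` represents the repository-service call and is not a real API.

```python
def handle_download_request(user, layer_id, vend_presigned_url):
    """Sketch of the web tier's download handler.

    The repository service (represented here by the vend_presigned_url
    callable) authenticates and authorizes the user, then returns a
    pre-signed S3 URL with roughly a one-minute expiry. We hand that URL
    back as a 302 redirect so S3, not our servers, serves the bytes.
    """
    url = vend_presigned_url(user, layer_id)
    status = "302 Found"
    headers = [("Location", url)]
    return status, headers


# Example wiring with a fake vendor that just formats a placeholder URL.
status, headers = handle_download_request(
    "user@example.org", "human_liver_cohort",
    lambda user, layer: "https://s3.amazonaws.com/data01.sagebase.org/"
                        + layer + "?Signature=placeholder")
```

Because the client follows the redirect immediately, the one-minute expiry window is rarely a problem in practice.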

Pros:

  • S3 takes care of serving the files (once access is granted) so we don't have to run a proxy fleet.
  • Simple! These can be used for both the download use case and the cloud computation use cases.
  • Scalable
  • We control duration of the expiry window.

Cons:

  • Potential for URL reuse: When the url has not yet expired, it is possible for others to use that same URL to download files. 
    • For example, if someone requests a download URL from the repository service (and remember that the repository service confirms that the user is authenticated and authorized before handing out that URL) and then that person emails the URL to his team, his whole team could start downloading the data layer as long as they all kick off their downloads within that one minute window of time.  
    • Of course, they could also download the data locally and then let their whole team have access to it.
    • This isn't much of a concern for unrestricted data because it's actually easier to just make a new account on the system with a new email address.
  • Potential for user confusion: If a user gets her download url, and somehow does not use it right away, she'll need to reload the web page to get another or re-do the R prompt command to get a fresh URL.
    • This isn't much of a concern if we ensure that the workflow of the clients (web or R) is to use the URL upon receipt.
  • Tracking downloads for charge-backs: We know when we vend a URL but it may not be possible to know whether that URL was actually used (no fee) or used more than once (multiple fees).

Open Questions:

  • Does this work with the new support for partial downloads for gigantic files? Currently assuming yes, and that the repository service would need to give out the URL a few times during the duration of the download (re-request the URL for each download chunk)
  • Does this work with torrent-style access? Currently assuming no.
  • Can we limit S3 access to HTTPS only? Currently assuming yes.

Resources:

S3 Bucket Policies

This is the newer mechanism from AWS for access control. We can add AWS accounts and/or IP address masks and other conditions to the policy.


Scenario:

  • See above details about the general S3 scenario.
  • This should work with Elastic MapReduce.  We would have to edit the bucket policy for each user wanting to use the data via EMR.
  • The platform grants access to the data by updating the bucket policy for the dataset to which the user would like access.
  • Users will need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.

Tech Details:

  • The repository service grants access by adding the user's AWS canonical user id to the bucket policy for the desired dataset.
  • The repository service vends unsigned S3 URLs.
  • Users will need to sign those S3 URLs with their AWS credentials using the usual toolkit provided by Amazon.
    • One example tool provided by Amazon is s3curl
    • We can get the R client to do the signing as well. Users will need to make their AWS credentials available to the R client via a config file.
    • We may be able to find a JavaScript library to do the signing as well for Web client use cases. But since we should not store their credentials, users would need to specify them every time.
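A bucket policy for this option might look roughly like the following; the account ARN, bucket, and dataset path are placeholders, and the exact principal form (account ARN vs. canonical user id) would need to be confirmed against the S3 policy documentation. Expressed here as a Python dict serialized to the JSON S3 expects:

```python
import json

# Hypothetical read-only bucket policy for one dataset bucket. The repository
# service would append one principal per user granted EMR access.
dataset_read_policy = {
    "Id": "Data01ReadOnlyPolicy",
    "Statement": [
        {
            "Sid": "GrantReadToApprovedAccounts",
            "Effect": "Allow",
            # Placeholder AWS account; one entry is added per approved user.
            "Principal": {"AWS": ["arn:aws:iam::111122223333:root"]},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::data01.sagebase.org/human_liver_cohort/*",
        }
    ],
}

policy_json = json.dumps(dataset_read_policy, indent=2)
```

Note the scaling concern from the cons list applies directly here: each grant grows the Statement's principal list, and the whole policy must stay under the 20 KB limit cited below.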

Pros:

  • S3 takes care of serving the files (once access is granted) so we don't have to run a proxy fleet.
  • Scalable
  • From the S3 logs we will be able to see who downloaded the data by their canonical user id and compute charge backs for our custom solution.

Cons:

  • This isn't as simple for users as pre-signed URLs because they have to think about their credentials and provide them for signing.
  • This mechanism will not scale to grant access for tens of thousands of individuals; therefore, it will not be sufficient for "unrestricted data". It may scale sufficiently for "embargoed data".
  • Users may not feel comfortable sharing their AWS credentials with Sage tools to assist them to make signing easier.
  • We are trusting users whose AWS accounts we have granted access to protect their own credentials.  (This is the same as trusting someone to protect their Sage username and password.)

Open Questions:

  • What is the upper limit on the number of grants?
  • What is the upper limit on the number of principals that can be listed in a single grant?
  • Is there a JavaScript library we can use to sign S3 URLs?

Resources:

  • The following list describes the restrictions on Amazon S3 policies: from http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?AccessPolicyLanguage_SpecialInfo.html
    • The maximum size of a policy is 20 KB
    • The value for Resource must be prefixed with the bucket name or the bucket name and a path under it (bucket/). If only the bucket name is specified, without the trailing /, the policy applies to the bucket.
    • Each policy must have a unique policy ID (Id)
    • Each statement in a policy must have a unique statement ID (sid)
    • Each policy must cover only a single bucket and resources within that bucket (when writing a policy, don't include statements that refer to other buckets or resources in other buckets)

S3 and IAM

This is the newest mechanism for access control and is under heavy development in 2011 to expand its features. With IAM, a group of users (as opposed to AWS accounts) can be granted access to S3 resources. We decide who those users are. We store credentials on their behalf and use them to pre-sign S3 URLs.  All activity is rolled up to a single bill.  (Another approach would have been to let users hold their IAM credentials and sign their own requests.  Unfortunately this will not scale because we would need a group per dataset to get the permissions right, and the mapping of users to groups scales linearly.)

"AWS Identity and Access Management is a web service that enables Amazon Web Services (AWS) customers to manage Users and User permissions in AWS. The service is targeted at organizations with multiple Users or systems that use AWS products such as Amazon EC2, Amazon SimpleDB, and the AWS Management Console. With IAM, you can centrally manage Users, security credentials such as access keys, and permissions that control which AWS resources Users can access.

Without IAM, organizations with multiple Users and systems must either create multiple AWS Accounts, each with its own billing and subscriptions to AWS products, or employees must all share the security credentials of a single AWS Account. Also, without IAM, you have no control over the tasks a particular User or system can do and what AWS resources they might use.

IAM addresses this issue by enabling organizations to create multiple Users (each User being a person, system, or application) who can use AWS products, each with individual security credentials, all controlled by and billed to a single AWS Account. With IAM, each User is allowed to do only what they need to do as part of the User's job."

Scenario:

  • See above details about the general S3 scenario.
  • This is not compatible with Elastic MapReduce.  IAM currently does not support EMR.
  • The platform grants access to the data by creating a new IAM user and adding them to the group that has access to the dataset.
  • Users do not need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.

Tech Details:

  • The repository service grants access by adding the user to the correct Crowd and IAM groups for the desired dataset. The Crowd groups are "the truth" and the IAM groups are a mirror of that.
  • Log data
    • For IAM user nicole.deflaux my identity is recorded in the log as arn:aws:iam::325565585839:user/nicole.deflaux
    • Note that IAM usernames can be email addresses, so we can easily reconcile them with the email addresses registered with the platform, because we will use the email they gave us when we create their IAM user.
    • Here is what a full log entry looks like:
      d9df08ac799f2859d42a588b415111314cf66d0ffd072195f33b921db966b440 data01.sagebase.org [18/Feb/2011:19:32:56 +0000] 140.107.149.246 arn:aws:iam::325565585839:user/nicole.deflaux C6736AEED375F69E REST.GET.OBJECT human_liver_cohort/readme.txt "GET /data01.sagebase.org/human_liver_cohort/readme.txt?AWSAccessKeyId=AKIAJBHAI75QI6ANOM4Q&Expires=1298057875&Signature=XAEaaPtHUZPBEtH5SaWYPMUptw4%3D&x-amz-security-token=AQIGQXBwVGtupqEus842na80zMbEFbfpPhOsIic7z1ghm0Umjd8kybj4eaOtBCKlwVHMXi2SuasIKxwYljjDA95O%2BfZb5uF7ku4crE6OObz8d/ev7ArPime2G/a5nXRq56Jx2hAt8NDDbhnE8JqyOnKn%2BN308wx2Ud3Q2R3rSqK6t%2Bq/l0UAkhBFNM1gvjR%2BoPGYBV9Jspwfp8ww8CuZVH1Y2P2iid6ZS93K02sbGvQnhU7eCGhorhMI5kxOqy7bTbzvl2HML7zQphXRIa1wqrRSD/sBLfpK5x6A%2BcQnLrgO6FtWJMDo5rTmgEPo6esNIivWnaiI6BvPddLlBMZVtmcx39/cOBbrfK3v0vHmYb3oftseacjvBD/wfyigB5wbSgRUYNhbUu1V HTTP/1.1" 200 - 6300 6300 45 45 "https://s3-console-us-standard.console.aws.amazon.com/GetResource/Console.html?&lpuid=AIDAJDFZWQHFC725MCNDW&lpfn=nicole.deflaux&lpgn=325565585839&lpas=ACTIVE&lpiamu=t&lpak=AKIAJBHAI75QI6ANOM4Q&lpst=AQIGQXBwVGtupqEus842na80zMbEFbfpPhOsIic7z1ghm0Umjd8kybj4eaOtBCKlwVHMXi2SuasIKxwYljjDA95O%2BfZb5uF7ku4crE6OObz8d%2Fev7ArPime2G%2Fa5nXRq56Jx2hAt8NDDbhnE8JqyOnKn%2BN308wx2Ud3Q2R3rSqK6t%2Bq%2Fl0UAkhBFNM1gvjR%2BoPGYBV9Jspwfp8ww8CuZVH1Y2P2iid6ZS93K02sbGvQnhU7eCGhorhMI5kxOqy7bTbzvl2HML7zQphXRIa1wqrRSD%2FsBLfpK5x6A%2BcQnLrgO6FtWJMDo5rTmgEPo6esNIivWnaiI6BvPddLlBMZVtmcx39%2FcOBbrfK3v0vHmYb3oftseacjvBD%2FwfyigB5wbSgRUYNhbUu1V&lpts=1298051257585" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.102 Safari/534.13" -
  • The repository service vends credentials for the users' use.
  • Users will need to sign those S3 URLs with their AWS credentials using the usual toolkit provided by Amazon.
    • One example tool provided by Amazon is s3curl
    • We can get the R client to do the signing as well. We can essentially hide the fact that users have credentials of their own, since the R client can communicate with the repository service and cache credentials in memory.
    • We may be able to find a JavaScript library to do the signing as well for Web client use cases. If not, we could proxy download requests. That solution is described in its own section below.
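The single read-only IAM group in this scheme could carry a policy along these lines; the bucket name is a placeholder and the exact action list is a guess at what "read-only access to all datasets" requires. As with the bucket policy, the dict form serializes to the JSON IAM expects:

```python
import json

# Hypothetical policy attached once to the read-only IAM group. Every IAM
# user we create for a EULA-signing user inherits it via group membership,
# so no per-user or per-dataset policy editing is needed.
read_only_group_policy = {
    "Statement": [
        {
            "Sid": "DatasetReadOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::data01.sagebase.org",
                "arn:aws:s3:::data01.sagebase.org/*",
            ],
        }
    ],
}

group_policy_json = json.dumps(read_only_group_policy, indent=2)
```

The finer-grained question of which user may see which dataset stays in Crowd, as described above; IAM only needs to be coarse because the credentials never leave the platform.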

Pros:

  • S3 takes care of serving the files (once access is granted) so we don't have to run a proxy fleet.
  • Scalable
  • By pre-signing each S3 URL with an individual's credentials, we can accurately match up entries in the audit log and the access log to facilitate a strict accounting of actual downloads.
  • By not giving out IAM credentials to users, we can get away with a smaller number of groups.
  • Users do not need to give us their AWS credentials for any reason.
  • We will already be using IAM to manage access to AWS resources by Sage system administrators and Sage employees.
  • This will scale sufficiently to meet our needs for the embargoed data use case. The default limit is 5,000 users. We can ask Deepak to raise that limit for us. Also note that the default limit will be raised within a year as AWS rolls out expanded support for federated access to AWS resources.
  • More features will come online for IAM over time

Cons:

  • Potential for URL reuse: When the url has not yet expired, it is possible for others to use that same URL to download files. 
    • For example, if someone requests a download URL from the repository service (and remember that the repository service confirms that the user is authenticated and authorized before handing out that URL) and then that person emails the URL to his team, his whole team could start downloading the data layer as long as they all kick off their downloads within that one minute window of time.  
    • Of course, they could also download the data locally and then let their whole team have access to it.
    • This isn't much of a concern for unrestricted data because it's actually easier to just make a new account on the system with a new email address.
  • Potential for user confusion: If a user gets her download url, and somehow does not use it right away, she'll need to reload the web page to get another or re-do the R prompt command to get a fresh URL.
    • This isn't much of a concern if we ensure that the workflow of the clients (web or R) is to use the URL upon receipt.

Resources:

  • A comparison of Bucket Policies, ACLs, and IAM
  • More details on limits
  • Following are the default maximums for your entities:
    • Groups per AWS Account: 100
    • Users per AWS Account: 5000
    • Number of groups per User: 10 (that is, the User can be in this many groups) (also note that this limit can be increased, but it scales linearly)
    • Access keys per User: 2
    • Signing certificates per User: 2
    • MFA devices per User: 1
    • Server certificates per AWS Account: 10
    • You can request to increase the maximum number of Users or groups for your AWS Account.

S3 ACL

This is the older mechanism from AWS for access control. It controls object-level, not bucket-level, access.

This is ruled out for protected data because ACLs can have a max of 100 grants and it appears that these grants cannot be to groups such as groups of arbitrary AWS users.

Open Question:

  • Confirm that grants do not apply to groups of AWS users.

Resources:

Cloud Front Private Content

Cloud Front supports the notion of private content. Cloud Front URLs can be created with access policies such as an expiry time and an IP mask. This is ruled out since S3 provides a similar feature and we do not need a CDN for any of the usual reasons one wants to use a CDN.

Since we do not need a CDN for the normal reason (to move often requested content closer to the user to reduce download times), this does not buy us much over S3 Pre-Signed URLs. It seems like the only added benefit is an IP mask in addition to the expires time.

Pros:

  • This would work for the download of unrestricted data use cases.

Cons:

  • Note that this is likely a bad solution for the EC2/EMR use cases because Cloud Front sits outside of AWS and users will incur the inbound bandwidth charges when they pull the data down to EC2/EMR.
  • There is an additional charge on top of the S3 hosting costs.

Resources:

Options to have customers bear some of the costs

S3 "Requester Pays" Buckets

In this scenario the requester's AWS account would be charged for any download bandwidth charges incurred.

Scenario:

  • See above details about the general S3 scenario and the bucket policy or S3 ACL scenarios.
  • Users must have AWS accounts.

Tech details:

  • We would have to do this in combination with a bucket policy or an S3 ACL since an AWS account needs to be associated with the request.
  • We toggle a flag on the bucket to indicate this charging mechanism.
  • Requests must include an extra header or an extra query parameter to indicate that they know they are paying the cost for the download.
  • Users must sign their S3 URLs in the usual way (see above).
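Signing "in the usual way" means legacy S3 query-string authentication (SigV2, current when this was written). A minimal stdlib-only sketch is below; the bucket, key, and function name are illustrative, not part of any existing codebase, and the signing service would hold the IAM user's secret key so it never reaches the end user:

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def presign_s3_get(bucket, key, access_key, secret_key, expires_in=3600):
    """Build a pre-signed S3 GET URL using legacy query-string auth (SigV2).

    The string to sign for query-string requests is:
        HTTP-Verb \n Content-MD5 \n Content-Type \n Expires \n CanonicalizedResource
    """
    expires = int(time.time()) + expires_in
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    signature = base64.b64encode(
        hmac.new(secret_key.encode("utf-8"),
                 string_to_sign.encode("utf-8"),
                 hashlib.sha1).digest()
    ).decode("ascii")
    return (f"https://{bucket}.s3.amazonaws.com/{quote(key)}"
            f"?AWSAccessKeyId={access_key}"
            f"&Expires={expires}"
            f"&Signature={quote(signature, safe='')}")
```

The resulting URL can be handed to the Web UI, the R client, or plain curl; once the Expires timestamp passes, S3 rejects it.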

Pros:

  • This accurately transfers the bandwidth costs to the requester.

Cons:

  • This has all the same cons as Bucket Policies and S3 ACLs.

Resources:

  • In general, bucket owners pay for all Amazon S3 storage and data transfer costs associated with their bucket. A bucket owner, however, can configure a bucket to be a Requester Pays bucket. With Requester Pays buckets, the requester instead of the bucket owner pays the cost of the request and the data download from the bucket. The bucket owner always pays the cost of storing data.
  • Typically, you configure buckets to be Requester Pays when you want to share data but not incur charges associated with others accessing the data. You might, for example, use Requester Pays buckets when making available large data sets, such as zip code directories, reference data, geospatial information, or web crawling data.
  • Important: If you enable Requester Pays on a bucket, anonymous access to that bucket is not allowed.
  • You must authenticate all requests involving Requester Pays buckets. The request authentication enables Amazon S3 to identify and charge the requester for their use of the Requester Pays bucket.
  • After you configure a bucket to be a Requester Pays bucket, requesters must include x-amz-request-payer in their requests either in the header, for POST and GET requests, or as a parameter in a REST request to show that they understand that they will be charged for the request and the data download.
  • Requester Pays buckets do not support the following:
    • Anonymous requests
    • BitTorrent
    • SOAP requests
    • You cannot use a Requester Pays bucket as the target bucket for end user logging, or vice versa. However, you can turn on end user logging on a Requester Pays bucket where the target bucket is a non Requester Pays bucket.
  • http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html
  • Pre-signed URLs are not an option, from http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBucketConfiguration.html "Bucket owners who give out pre-signed URLs should think twice before configuring a bucket to be Requester Pays, especially if the URL has a very long expiry. The bucket owner is charged each time the requester uses pre-signed URLs that use the bucket owner's credentials."
  • From http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?ObjectsinRequesterPaysBuckets.html "For signed URLs, include x-amz-request-payer=requester in the request".
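The quoted docs above note that x-amz-request-payer must accompany every request. Under header-based SigV2 auth, x-amz-* headers are folded into the string to sign, so the parameter cannot simply be bolted on after signing. A hedged stdlib sketch of what the requester's client would have to do (function name and arguments are illustrative):

```python
import base64
import hashlib
import hmac
from email.utils import formatdate

def sign_requester_pays_get(bucket, key, access_key, secret_key):
    """Build headers for a Requester Pays GET using legacy SigV2 header auth.

    SigV2 string to sign:
        HTTP-Verb \n Content-MD5 \n Content-Type \n Date \n
        CanonicalizedAmzHeaders + CanonicalizedResource
    where x-amz-request-payer appears in CanonicalizedAmzHeaders.
    """
    date = formatdate(usegmt=True)  # RFC 1123 date, e.g. "Tue, 01 Mar 2011 00:00:00 GMT"
    string_to_sign = (
        f"GET\n\n\n{date}\n"
        f"x-amz-request-payer:requester\n"
        f"/{bucket}/{key}"
    )
    signature = base64.b64encode(
        hmac.new(secret_key.encode("utf-8"),
                 string_to_sign.encode("utf-8"),
                 hashlib.sha1).digest()
    ).decode("ascii")
    return {
        "Date": date,
        "x-amz-request-payer": "requester",
        "Authorization": f"AWS {access_key}:{signature}",
    }
```

Note these would be the requester's own AWS credentials, not ours, which is why this option inherits the "users must have AWS accounts" con.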

Dev Pay

Dev Pay only requires Amazon.com accounts, not AWS accounts.

Dev Pay is not appropriate for our use case of providing read-only access to shared data on S3. It requires making a copy of all the S3 data for each customer, since each customer must access the data via their own bucket.

Dev Pay may be appropriate later on when we are facilitating read-write access to user-specific data or data shared by a small group.

Resources

Flexible Payments Service, PayPal, etc.

We can use general services for billing customers. It would be up to us to determine what is a transaction, how much to charge, etc.

We may need to keep our own transaction ledger and issue bills. We would definitely let another company handle credit card operations.

EBS

Data is available as hard-drive snapshots.

Pros:

  • It's a convenient way to access data from EC2 instances.

Cons:

  • This only covers our cloud compute use case, not our download use case.

Recommendation: focus on S3 for now since it can meet all our use cases. Revisit EBS later if customers ask for it because of its convenience in a cloud-compute environment.

EBS snapshot ACL

Open questions:

  • What is the max grant number for this?

Resources

File Organization and Format

S3 bucket names are best chosen as hostnames under our DNS control to ensure global uniqueness.

  • Recommended format: http://data01.sagebase.org.s3.amazonaws.com/(file or directory)
  • The two-digit data01 numbering scheme allows up to 100 numerically defined buckets, the maximum currently allowed per AWS account
  • Use emb-data01/pub-data01 prefixes if embargoed and unrestricted data must be separated into distinct buckets

Logging should be done to its own bucket.
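The naming scheme above can be sketched as a small helper. This is only an illustration of the convention, assuming the data01/emb-/pub- prefixes recommended above; the function itself is hypothetical:

```python
def bucket_name(n, embargoed=None):
    """Map a bucket number to a DNS-unique S3 bucket name under sagebase.org.

    embargoed=None  -> general data bucket (dataNN)
    embargoed=True  -> embargoed data bucket (emb-dataNN)
    embargoed=False -> unrestricted/public data bucket (pub-dataNN)
    """
    if not 0 <= n <= 99:
        # Two-digit numbering caps us at 100 buckets, the current
        # per-account S3 limit.
        raise ValueError("bucket number must be between 0 and 99")
    prefix = {None: "", True: "emb-", False: "pub-"}[embargoed]
    return f"{prefix}data{n:02d}.sagebase.org"
```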

Resources:

Additional Details

Network Bandwidth

Hypothetically, the Hutch (and therefore Sage) is on a high-throughput link to AWS.

The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.

The Corporation for Education Network Initiatives in California (CENIC) and Pacific NorthWest GigaPoP (PNWGP) announced two 10 Gigabit per second (Gbps) connections to Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2) for the use of CENIC's members in California, as well as PNWGP's multistate K-20 research and education community.

  • http://findarticles.com/p/news-articles/wireless-news/mi_hb5558/is_20100720/cenic-pacific-northwest-partner-develop/ai_n54489237/
  • http://www.internet2.edu/maps/network/connectors_participants
  • http://www.pnw-gigapop.net/partners/index.html