...

For now we are assuming that we are dealing with files. Later on we can also envision providing access to data stored in a database and/or a data warehouse.

We also assume that this data can be stored in S3 in plain text (no encryption).

Who

Assume tens of thousands of users will eventually use the platform.

...

Options to Consider

AWS Public Data Sets

Current Scenario:

  • Sage currently has two data sets stored as "AWS Public Datasets" in the US West Region.
  • Users can discover them by browsing public datasets on http://aws.amazon.com/datasets/Biology?browse=1 and also via the platform.
  • Users can use them for cloud computation by spinning up EC2 instances and mounting the data as EBS volumes (a sketch of this follows the list).
  • Users cannot directly download data from these public datasets, but once they have them mounted on an EC2 host, they can certainly scp them to their local system.
  • Users are not forced to sign a Sage-specified EULA prior to access, because they can bypass the platform and access this data directly via normal AWS mechanisms.
  • Users must have an AWS account to access this data.
  • There is no mechanism to grant access. All users with AWS accounts are granted access by default.
  • There is no mechanism to keep an audit log for downloads or other usage of this data.
  • Users pay for access by paying their own costs for EC2 and bandwidth charges if they choose to download the data out of AWS.
  • There is no cost to Sage for hosting.
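
As a rough illustration of the cloud-computation path in the list above, the sketch below uses the AWS SDK for Java to create an EBS volume from a public dataset snapshot and attach it to a running instance. The snapshot ID, instance ID, availability zone, and device name are all placeholders; a real script would also wait for the volume to reach the "available" state before attaching it, and the volume still needs to be mounted from within the instance.

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.AttachVolumeRequest;
    import com.amazonaws.services.ec2.model.CreateVolumeRequest;
    import com.amazonaws.services.ec2.model.CreateVolumeResult;

    public class MountPublicDataset {

        public static void main(String[] args) {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // Placeholder IDs: the public dataset's snapshot and the user's own EC2 instance.
            String snapshotId = "snap-0123456789abcdef0";
            String instanceId = "i-0123456789abcdef0";

            // Create an EBS volume from the public dataset snapshot, in the same
            // availability zone as the instance it will be attached to.
            CreateVolumeResult volume = ec2.createVolume(new CreateVolumeRequest()
                    .withSnapshotId(snapshotId)
                    .withAvailabilityZone("us-west-1a"));

            // In practice, poll until the new volume is "available" before attaching.
            ec2.attachVolume(new AttachVolumeRequest(
                    volume.getVolume().getVolumeId(), instanceId, "/dev/sdf"));
        }
    }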

Future Considerations:

  • This is currently EBS only, but it will also be available for S3 in the future.
  • TODO ask Deepak what other plans they have in mind for the re-launch of AWS Public Datasets.
  • TODO tell Deepak our suggested features for AWS Public Datasets.

...

  • All Sage data is stored on S3 and is not public.
  • Users can only discover what data is available via the platform.
  • Users can use the data for cloud computation by spinning up EC2 instances and downloading the files from S3 to the hard drive of their EC2 instance.
  • Users can download the data from S3 to their local system. See below for more details on this.
  • The platform directs users to sign a Sage-specified EULA prior to gaining access to these files in S3.
  • Users must have a Sage platform account to access this data for download.  They may need an AWS account for the cloud computation use case depending upon the mechanism we use to grant access.
  • The platform grants access to this data. See below for details about the various ways we might do this.
  • The platform will write to the audit log each time it grants access and to whom it granted access. S3 can also be configured to log all access to resources and this could serve as a means of intrusion detection.  
    • These two types of logs log different events (granting access vs. using access) so they will not have a strict 1-to-1 mapping between entries but should have a substantial overlap.  
    • The platform can store anything it likes in its audit log.  
    • The S3 log stores normal web access log type data with the following identifiable fields:
      • client IP address is available in the log
      • "anonymous" or the users AWS canonical user id will appear in the log
    • We can try to appending some other query parameter to the S3 URL to help us line it up with audit log entries.
  • See proposals below regarding how users might pay for usage.
  • Hosting is not free.
    • Storage fees will apply.
    • Bandwidth fees apply when data is uploaded.
    • Data can also be shipped via hard drives and AWS Import fees would apply.
    • Bandwidth fees apply when data is downloaded out of AWS. There is no charge when it is downloaded inside AWS (e.g., to an EC2 instance).
    • These same fees apply to any S3 log data we keep.
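
As a sketch of the query-parameter idea above: tag each vended S3 URL with the id of the corresponding platform audit-log entry, so that the Request-URI recorded in the S3 access log can be joined back to our own audit records. The parameter name x-sage-audit-id is made up, and whether S3 simply ignores an unrecognized parameter (or whether it would need to be folded into the request signature) is exactly what we would need to verify.

    /** Hypothetical helper for correlating S3 access-log entries with the platform audit log. */
    public class AuditUrlTagger {

        public static String tagDownloadUrl(String s3Url, long auditEntryId) {
            // Append the audit entry id as an extra query parameter so that it shows up
            // in the Request-URI field of the S3 server access log.
            String separator = s3Url.contains("?") ? "&" : "?";
            return s3Url + separator + "x-sage-audit-id=" + auditEntryId;
        }

        public static void main(String[] args) {
            String vendedUrl = "https://s3.amazonaws.com/sage-data-layers/layer.tar.gz";
            System.out.println(tagDownloadUrl(vendedUrl, 12345L));
        }
    }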

...

S3 Pre-Signed URLs for Private Content

From the Amazon S3 documentation: "Query String Request Authentication Alternative: You can authenticate certain types of requests by passing the required information as query-string parameters instead of using the Authorization HTTP header. This is useful for enabling direct third-party browser access to your private Amazon S3 data, without proxying the request. The idea is to construct a "pre-signed" request and encode it as a URL that an end-user's browser can retrieve. Additionally, you can limit a pre-signed request by specifying an expiration time."

Scenario:

  • See above details about the general S3 scenario.
  • The platform grants access to the data by creating pre-signed S3 URLs to Sage's private S3 files. These URLs are created on demand and have a short expiry time.
  • The platform has a custom solution for collecting payment from users.

Tech Details:

  • Repository Service: the service has an API that vends pre-signed S3 URLs for data layer files. Users are authenticated and authorized by the service prior to the service returning a download URL for the layer. The default URL expiry time will be one minute. (A sketch of this follows the list.)
  • Web Client: When the user clicks on the download button, the web servlet sends a request to the repository service, which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The web servlet returns this URL to the browser as the location value of a 302 redirect. The browser begins the download from S3 immediately.
  • R Client: When the user issues the download command at the R prompt, the R client sends a request to the repository service, which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The R client uses the returned URL to begin downloading the content from S3 immediately.
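
A minimal sketch of how the repository service might vend such a URL. The class, bucket, and key names are hypothetical; the SDK calls themselves (AmazonS3ClientBuilder, GeneratePresignedUrlRequest, generatePresignedUrl) are standard AWS SDK for Java API.

    import java.net.URL;
    import java.util.Date;

    import com.amazonaws.HttpMethod;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

    public class PresignedUrlVendor {

        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        /** Vend a GET URL for one data layer file, valid for one minute. */
        public URL vendDownloadUrl(String bucket, String key) {
            Date expiration = new Date(System.currentTimeMillis() + 60L * 1000L);
            GeneratePresignedUrlRequest request =
                    new GeneratePresignedUrlRequest(bucket, key)
                            .withMethod(HttpMethod.GET)
                            .withExpiration(expiration);
            return s3.generatePresignedUrl(request);
        }
    }

On the web-client path, the servlet would then call response.sendRedirect(url.toString()), which produces the 302 described above; the R client would simply fetch the returned URL directly.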

Pros:

  • Simple! These can be used for both the download and cloud computation use cases.
  • Scalable
  • We control duration of the expiry window.

Cons:

  • While the URL has not yet expired, it is possible for others to use that same URL to download files. For example, if someone requests a download URL from the repository service (and remember that the repository service confirms that the user is authenticated and authorized before handing out that URL) and then emails it to his team, his whole team could start downloading the data layer, as long as they all kick off their downloads within that one-minute window.
  • If a user gets her download URL and does not use it right away, she will need to reload the web page or re-issue the R prompt command to get a fresh URL.
  • Regarding payments, we know when we vend a URL, but it may not be possible to know whether that URL went unused (in which case no fee should apply) or was used more than once (multiple fees).

Open Questions:

  • How does this work with the new support for partial downloads of gigantic files? Currently assuming it does, and that the repository service would need to vend the URL several times over the course of the download (the client re-requests a fresh URL for each chunk; see the sketch after this list).
  • Does this work with torrent-style access? Currently assuming no.
  • Can we limit S3 access to HTTPS only? Currently assuming yes.
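
If the partial-download assumption holds, the client-side loop might look roughly like the sketch below: ask the repository service for a fresh one-minute URL before each chunk, then fetch that chunk with an HTTP Range header. The UrlSource interface is a stand-in for whatever repository-service API we end up with.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ChunkedDownloader {

        /** Stand-in for the repository-service call that vends a fresh short-lived URL. */
        public interface UrlSource {
            URL fetchFreshUrl() throws Exception;
        }

        /** Download a file of known size in fixed-size chunks, one fresh URL per chunk. */
        public static void download(UrlSource source, long totalBytes, long chunkBytes,
                                    OutputStream out) throws Exception {
            byte[] buffer = new byte[8192];
            for (long start = 0; start < totalBytes; start += chunkBytes) {
                long end = Math.min(start + chunkBytes, totalBytes) - 1;
                HttpURLConnection conn =
                        (HttpURLConnection) source.fetchFreshUrl().openConnection();
                conn.setRequestProperty("Range", "bytes=" + start + "-" + end);
                try (InputStream in = conn.getInputStream()) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                } finally {
                    conn.disconnect();
                }
            }
        }
    }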

Resources:

S3 Bucket Policies

This is the newer mechanism from AWS for access control. We can add AWS accounts and/or IP address masks to the policy.
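
For example, granting every principal in a partner institution's AWS account read access to the objects of one dataset bucket might look like the sketch below (the bucket name and account ID are made up); the policy text is applied with the SDK's setBucketPolicy call.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class GrantInstitutionReadAccess {

        public static void main(String[] args) {
            // Hypothetical bucket and partner-institution AWS account id.
            String bucket = "sage-dataset-example";
            String policy =
                "{\n" +
                "  \"Version\": \"2008-10-17\",\n" +
                "  \"Statement\": [{\n" +
                "    \"Sid\": \"InstitutionReadAccess\",\n" +
                "    \"Effect\": \"Allow\",\n" +
                "    \"Principal\": {\"AWS\": \"arn:aws:iam::111122223333:root\"},\n" +
                "    \"Action\": [\"s3:GetObject\"],\n" +
                "    \"Resource\": \"arn:aws:s3:::" + bucket + "/*\"\n" +
                "  }]\n" +
                "}";

            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            s3.setBucketPolicy(bucket, policy);
        }
    }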

Scenario:

  • See above details about the general S3 scenario.
  • The platform grants access to the data by updating the bucket policy for the dataset to which the user would like access.
  • The platform has a custom solution for collecting payment from users.

Tech Details:

  • We would want to ensure that buckets contain the right granularity of data to which we want to provide access, since it's all or none of the bucket. Therefore, we might have a separate bucket for each data set.

Pros:

  • It will be useful if we want to grant access to an entire institution. For example, we may add ISB's AWS account to have read access to one or more S3 buckets of Sage data layers.

Cons:

  • This mechanism will not scale to grant access for tens of thousands of individuals.

Open Questions:

  • What is the upper limit on the number of grants?
  • What is the upper limit on the number of principals that can be listed in a single grant?

...

This is the older mechanism from AWS for access control. It controls access at the object level, not the bucket level.

This is ruled out for protected data because ACLs can have a maximum of 100 grants, and it appears that these grants cannot be made to groups, such as groups of arbitrary AWS users.

...

All files are kept completely private on S3, and we write a custom proxy that allows users with permission to download files, whether to locations outside AWS or to EC2 hosts.

Tech Details:

  • We would have a fleet of proxy servers running on EC2 hosts that authenticate and authorize users and then proxy downloads from S3 (a rough sketch follows this list).
  • We would need to configure auto-scaling so that capacity for this proxy fleet can grow and shrink as needed.
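
A very rough sketch of what each proxy host might do per request, assuming a Java servlet front end; the authentication, authorization, and audit hooks are stubs standing in for real platform integration, and the bucket name is made up.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3Object;

    public class DataLayerProxyServlet extends HttpServlet {

        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            String user = authenticate(req);          // hypothetical platform authentication
            String key = req.getParameter("layer");   // hypothetical request parameter
            if (user == null || key == null || !isAuthorized(user, key)) {
                resp.sendError(HttpServletResponse.SC_FORBIDDEN);
                return;
            }
            // Stream the private S3 object through this host to the caller.
            S3Object object = s3.getObject("sage-data-layers", key);
            resp.setContentType(object.getObjectMetadata().getContentType());
            try (InputStream in = object.getObjectContent();
                 OutputStream out = resp.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
            recordDownload(user, key);                // hypothetical audit-log write
        }

        // Stubs standing in for real platform integration.
        private String authenticate(HttpServletRequest req) { return "someUser"; }
        private boolean isAuthorized(String user, String key) { return true; }
        private void recordDownload(String user, String key) { }
    }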

Pros:

  • full flexibility
  • we can accurately track who has downloaded what, and when

Cons:

  • we have another service fleet to administer
  • we are now the scalability bottleneck

If all else fails we can do this, but it will be more work operationally to manage a fleet of custom proxy servers.

Options to have customers bear some of the costs

...

The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.

...