...
- All Sage data is stored on S3 and is not public.
- Users can only discover what data is available via the platform.
- Users can use the data for cloud computation by spinning up EC2 instances and downloading the files from S3 to the hard drive of their EC2 instance.
- Users can download the data from S3 to their local system.
- The platform directs users to sign a Sage-specified EULA prior to gaining access to these files in S3.
- Users must have a Sage platform account to access this data for download. They may need an AWS account depending upon the mechanism we use to grant access.
- The platform grants access to this data. See below for details about the various ways we might do this.
- The platform will write to the audit log each time it grants access. S3 can also be configured to log all access to resources and this could serve as a means of intrusion detection.
- These two types of logs will have log entries about different events (granting access vs. using access) so they will not have a strict 1-to-1 mapping between entries but should have a substantial overlap.
- The platform can store anything it likes in its audit log.
- The S3 log stores normal web access log type data with the following identifiable fields:
- client IP address is available in the log
- "anonymous" or the user's AWS canonical user id will appear in the log
- We can try appending some other query parameter to the S3 URL to help us line it up with audit log entries (see the sketch after this list).
- See proposals below regarding how users might pay for usage.
- Hosting is not free.
- Storage fees will apply.
- Bandwidth fees apply when data is uploaded.
- Data can also be shipped via hard drives and AWS Import fees would apply.
- Bandwidth fees apply when data is downloaded out of AWS. There is no charge when it is downloaded inside AWS (e.g., to an EC2 instance).
- These same fees apply to any S3 log data we keep.
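A rough sketch of the query-parameter idea, assuming we vend pre-signed URLs via the AWS SDK for Java; the parameter name is hypothetical and the audit-log write is elided:

    import java.net.URL;
    import java.util.Date;
    import java.util.UUID;

    import com.amazonaws.HttpMethod;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

    public class AuditTaggedUrls {
        /** Vends a pre-signed URL carrying a token that we also record in our audit log. */
        public static URL vend(AmazonS3 s3, String bucket, String key) {
            // The same token is written to the platform audit log when access is
            // granted; it also appears in the request URI recorded by the S3
            // server access log, giving us something to join the two logs on.
            String auditToken = UUID.randomUUID().toString();

            GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest(bucket, key, HttpMethod.GET)
                    .withExpiration(new Date(System.currentTimeMillis() + 60 * 1000));
            request.addRequestParameter("x-sage-audit-token", auditToken); // hypothetical parameter name

            return s3.generatePresignedUrl(request);
        }
    }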
...
- See above details about the general S3 scenario.
- The platform grants access to the data by creating pre-signed S3 URLs to Sage's private S3 files. These URLs are created on demand and have a short expiry time.
- Users will not need AWS accounts.
- The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.
Tech Details:
- Repository Service: the service has an API that vends pre-signed S3 URLs for data layer files. Users are authenticated and authorized by the service prior to the service returning a download URL for the layer. The default URL expiry time will be one minute.
- Web Client: When the user clicks on the download button, the web servlet sends a request to the repository service, which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The web servlet returns this URL to the browser as the location value of a 302 redirect, and the browser begins the download from S3 immediately (a sketch of this path follows this list).
- R Client: When the user issues the download command at the R prompt, the R client sends a request to the repository service, which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The R client uses the returned URL to begin downloading the content from S3 immediately.
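A minimal sketch of the web servlet path, assuming the AWS SDK for Java; the servlet, bucket, and request parameter names are illustrative, and authentication/authorization are elided:

    import java.io.IOException;
    import java.net.URL;
    import java.util.Date;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import com.amazonaws.HttpMethod;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

    public class LayerDownloadServlet extends HttpServlet {

        // Client constructed with Sage's (bucket owner's) AWS credentials.
        private final AmazonS3 s3 = new AmazonS3Client();

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            // Authentication and authorization against the repository service
            // happen before this point (elided), as does the lookup of the
            // bucket and key for the requested data layer.
            String bucket = "sage-data";                 // illustrative bucket name
            String key = req.getParameter("layerKey");   // illustrative parameter

            // Pre-signed GET URL that expires one minute from now.
            Date expires = new Date(System.currentTimeMillis() + 60 * 1000);
            URL url = s3.generatePresignedUrl(
                new GeneratePresignedUrlRequest(bucket, key, HttpMethod.GET).withExpiration(expires));

            // 302 redirect: the browser starts the download directly from S3.
            resp.sendRedirect(url.toString());
        }
    }

The R client path is the same except that the pre-signed URL is returned to the R process, which fetches it directly rather than following a redirect.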
...
- See above details about the general S3 scenario.
- The platform grants access to the data by updating the bucket policy for the dataset to which the user would like access.
- Users will need AWS accounts.
- The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.
Tech Details:
- We would want to ensure that buckets contain the right granularity of data to which we want to provide access, since it's all or none of the bucket. Therefore, we might have a separate bucket for each data set.
- The repository service grants access by adding the user's AWS account number to the bucket policy for the desired dataset (a sketch follows this list).
- The repository service vends S3 URLs.
- Users will need to sign those S3 URLs with their AWS credentials using the usual toolkit provided by Amazon. TODO: try this to understand the user experience; if the granted buckets just show up in the AWS console, that would be ideal.
- One example tool provided by Amazon is s3curl
- I'm sure we can get the R client to do the signing as well. Users will need to make their AWS credentials available to the R client via a config file.
- We may be able to find a JavaScript library to do the signing as well for Web client use cases. But since we should not store their credentials, users would need to specify them every time.
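A rough sketch of the grant step, assuming the AWS SDK for Java; the policy layout and names are illustrative, and a real implementation would merge the new principal into the existing policy rather than overwrite it:

    import com.amazonaws.services.s3.AmazonS3;

    public class BucketPolicyGrants {
        /**
         * Grants read access on a dataset's bucket to the listed AWS account numbers.
         * This sketch writes a fresh policy; merging with the current policy is elided.
         */
        public static void grantRead(AmazonS3 s3, String bucket, String... awsAccountIds) {
            StringBuilder principals = new StringBuilder();
            for (int i = 0; i < awsAccountIds.length; i++) {
                if (i > 0) principals.append(", ");
                principals.append("\"arn:aws:iam::").append(awsAccountIds[i]).append(":root\"");
            }
            String policy =
                "{\n" +
                "  \"Version\": \"2008-10-17\",\n" +
                "  \"Statement\": [{\n" +
                "    \"Effect\": \"Allow\",\n" +
                "    \"Principal\": {\"AWS\": [" + principals + "]},\n" +
                "    \"Action\": \"s3:GetObject\",\n" +
                "    \"Resource\": \"arn:aws:s3:::" + bucket + "/*\"\n" +
                "  }]\n" +
                "}";
            s3.setBucketPolicy(bucket, policy);
        }
    }

The same principal list is where an entire institution's account number would go (the ISB example below).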
Pros:
- It will be useful if we want to grant access to an entire institution. For example, we may add ISB's AWS account to have read access to one or more S3 buckets.
- From the S3 logs we will be able to see who downloaded the data and compute charge backs for our custom solution.
Cons:
- This mechanism isn't as simple for users as pre-signed URLs because they have to think about their credentials and provide them for signing.
- This mechanism will not scale to grant access for tens of thousands of individuals; therefore, it will not be sufficient for "unrestricted data". It may scale sufficiently for "embargoed data" and "restricted data".
- Users may not feel comfortable sharing their AWS credentials with Sage tools to assist them to make signing easier.
Open Questions:
- What is the upper limit on the number of grants?
- What is the upper limit on the number of principals that can be listed in a single grant?
- Is there a JavaScript library we can use to sign S3 URLs?
Resources:
- Bucket Policy Overview
- Evaluation Logic Flow Chart
- "The Principal is the person or persons who receive or are denied permission according to the policy. You must specify the principal by using the principal's AWS account ID (e.g., 1234-5678-9012, with or without the hyphens). You can specify multiple principals, or a wildcard to indicate all possible users." from http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?AccessPolicyLanguage_ElementDescriptions.html
...
- See above details about the general S3 scenario.
- The platform grants access to the data by creating a new IAM user and adding them to the group that has access to the dataset.
- Users will not need AWS accounts. TODO confirm
- The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.
Tech Details:
- TODO try a quick demo to understand what this is like for users granted access.
- TODO make sure any EC2 instances a user spins up are on their own bill.
- This is ruled out for protected data because IAM is used for managing groups of users all under a particular AWS bill (e.g., all employees of a company).
- The repository service grants access by adding the user to the correct group for the desired dataset.
- The repository service vends these credentials to the user (a sketch follows this list).
- Users will need to sign those S3 URLs with their AWS credentials using the usual toolkit provided by Amazon.
- One example tool provided by Amazon is s3curl
- I'm sure we can get the R client to do the signing as well. We can essentially hide the fact that users have their own credentials just for this, since the R client can communicate with the repository service and cache the credentials in memory.
- We may be able to find a JavaScript library to do the signing as well for Web client use cases. If not, we could proxy download requests. That solution is described in its own section below.
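A rough sketch of the grant-and-vend steps, assuming the AWS SDK for Java; the user and group naming convention is illustrative:

    import com.amazonaws.services.identitymanagement.AmazonIdentityManagement;
    import com.amazonaws.services.identitymanagement.model.AccessKey;
    import com.amazonaws.services.identitymanagement.model.AddUserToGroupRequest;
    import com.amazonaws.services.identitymanagement.model.CreateAccessKeyRequest;
    import com.amazonaws.services.identitymanagement.model.CreateUserRequest;

    public class IamGrants {
        /**
         * Creates an IAM user for a platform user, adds it to the group that has
         * read access to the dataset's bucket, and returns credentials that the
         * R client could cache in memory.
         */
        public static AccessKey grant(AmazonIdentityManagement iam, String platformUserId, String datasetGroup) {
            String iamUserName = "sage-user-" + platformUserId;   // illustrative naming convention
            iam.createUser(new CreateUserRequest().withUserName(iamUserName));
            iam.addUserToGroup(new AddUserToGroupRequest()
                .withUserName(iamUserName)
                .withGroupName(datasetGroup));
            return iam.createAccessKey(new CreateAccessKeyRequest().withUserName(iamUserName))
                      .getAccessKey();
        }
    }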
Pros:
- This will be helpful for managing access for Sage system administrators and Sage employees.
- Sage gives out credentials to users so that users can access Sage resources. We can put strict limitations on what capabilities these credentials give to users.
- Users do not need to give us their AWS credentials for any reason.
Cons:
- This has not saved us any work for the download use case if we still need to proxy requests from Web clients.
- This may be confusing for users. They will likely have their own AWS credentials plus separate credentials for each data set to which we have granted them access??? TODO confirm.
- This is currently limited to 1,000 users. We may be able to ask Deepak to raise that limit for us. The limit will be raised within a year as AWS rolls out expanded support for federated access to AWS resources.
...
All files are kept completely private on S3, and we write a custom proxy that allows authorized users to download files, whether to locations outside AWS or to EC2 hosts.
Tech Details:
...
Scenario:
- See above details about the general S3 scenario.
- The platform grants access to the data by proxying the user's download request to S3.
- Users do not need AWS accounts.
- The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.
Tech Details:
- We would have a fleet of proxy servers running on EC2 hosts that authenticate and authorize users and then proxy the download from S3 (a minimal sketch follows this list).
- We would need to configure auto-scaling so that capacity for this proxy fleet can grow and shrink as needed.
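A minimal sketch of one proxy host's download path, assuming the AWS SDK for Java and a servlet container; platform authentication/authorization and the bucket/key lookup are elided, and all names are illustrative:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;

    public class DownloadProxyServlet extends HttpServlet {

        // Client constructed with Sage's credentials; the buckets stay fully private.
        private final AmazonS3 s3 = new AmazonS3Client();

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            // Platform authentication/authorization and the bucket/key lookup
            // for the requested data layer happen here (elided).
            String bucket = "sage-data";                 // illustrative
            String key = req.getParameter("layerKey");   // illustrative

            S3Object object = s3.getObject(bucket, key);
            resp.setContentType(object.getObjectMetadata().getContentType());
            resp.setHeader("Content-Length", String.valueOf(object.getObjectMetadata().getContentLength()));

            // Stream the bytes from S3 through this proxy host to the caller.
            try (InputStream in = object.getObjectContent(); OutputStream out = resp.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        }
    }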
...
In this scenario the requester's AWS account is charged for any download bandwidth incurred. We are currently assuming we would use this in combination with bucket policies or IAM.
TODO try this out
Open Questions:
...
Scenario:
- See above details about the general S3 scenario and the bucket policy or S3 ACL scenarios.
- Users must have AWS accounts.
Tech details:
- We would have to do this in combination with a bucket policy or an S3 ACL since an AWS account needs to be associated with the request.
- We toggle a flag on the bucket to indicate this charging mechanism.
- Requests must include an extra header or an extra query parameter to indicate that the requester knows they are paying the cost for the download (a sketch follows this list).
- Users must sign their S3 URLs in the usual way (see above).
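A rough sketch of a requester-pays download, assuming the AWS SDK for Java's requester-pays support; the requester supplies their own credentials, so the request and bandwidth land on their bill:

    import java.io.File;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.GetObjectRequest;

    public class RequesterPaysDownload {
        /**
         * Downloads an object from a Requester Pays bucket with the requester's own
         * credentials (the AmazonS3 client passed in is theirs, not Sage's).
         */
        public static void download(AmazonS3 requesterS3, String bucket, String key, File target) {
            // The boolean constructor argument marks the request as requester-pays,
            // which causes the x-amz-request-payer header to be sent with the request.
            requesterS3.getObject(new GetObjectRequest(bucket, key, true), target);
        }
    }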
Pros:
- This accurately shifts the bandwidth costs to the requester.
Cons:
- This has all the same cons as Bucket Policies and S3 ACLs.
Resources:
- In general, bucket owners pay for all Amazon S3 storage and data transfer costs associated with their bucket. A bucket owner, however, can configure a bucket to be a Requester Pays bucket. With Requester Pays buckets, the requester instead of the bucket owner pays the cost of the request and the data download from the bucket. The bucket owner always pays the cost of storing data.
- Typically, you configure buckets to be Requester Pays when you want to share data but not incur charges associated with others accessing the data. You might, for example, use Requester Pays buckets when making available large data sets, such as zip code directories, reference data, geospatial information, or web crawling data.
- Important: If you enable Requester Pays on a bucket, anonymous access to that bucket is not allowed.
- You must authenticate all requests involving Requester Pays buckets. The request authentication enables Amazon S3 to identify and charge the requester for their use of the Requester Pays bucket.
- After you configure a bucket to be a Requester Pays bucket, requesters must include x-amz-request-payer in their requests either in the header, for POST and GET requests, or as a parameter in a REST request to show that they understand that they will be charged for the request and the data download.
- Requester Pays buckets do not support the following.
- Anonymous requests
- BitTorrent
- SOAP requests
- You cannot use a Requester Pays bucket as the target bucket for end user logging, or vice versa. However, you can turn on end user logging on a Requester Pays bucket where the target bucket is a non Requester Pays bucket.
- http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html
- Pre-signed URLs are not an option, from http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBucketConfiguration.html "Bucket owners who give out pre-signed URLs should think twice before configuring a bucket to be Requester Pays, especially if the URL has a very long expiry. The bucket owner is charged each time the requester uses pre-signed URLs that use the bucket owner's credentials."
- From http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBucketsObjectsinRequesterPaysBuckets.html "For signed URLs, include x-amz-request-payer=requester in the request".
Dev Pay
Dev Pay only requires Amazon.com accounts, not AWS accounts.
Dev Pay is not appropriate for our use case of providing read-only access to shared data on S3. It requires making a copy of all S3 data for each customer, since each customer has to access their own bucket via Dev Pay.
Dev Pay may be appropriate later on when we are facilitating read-write access to user-specific data or data shared by a small group.
...
The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
The PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at the Pacific Wave International Peering Exchange. See Participant Services for more information.
...