...
- All Sage data is stored on S3 and is not public.
- Users can discover what data is available only via the platform.
- Users can use the data for cloud computation by spinning up EC2 instances and downloading the files from S3 to the hard drive of their EC2 instance.
- Users can download the data from S3 to their local system.
- The platform directs users to sign a Sage-specified EULA prior to gaining access to these files in S3.
- Users must have a Sage platform account to access this data for download. They may need an AWS account for the cloud computation use case depending upon the mechanism we use to grant access.
- The platform grants access to this data. See below for details about the various ways we might do this.
- The platform will write to the audit log each time it grants access and to whom it granted access. S3 can also be configured to log all access to resources and this could serve as a means of intrusion detection.
- These two types of logs will have log entries about different events (granting access vs. using access) so they will not have a strict 1-to-1 mapping between entries but should have a substantial overlap.
- The platform can store anything it likes in its audit log.
- The S3 log stores normal web access log type data with the following identifiable fields:
- client IP address is available in the log
- "anonymous" or the users AWS canonical user id will appear in the log
- We can try to appending some other query parameter to the S3 URL to help us line it up with audit log entries.
- See proposals below regarding how users might pay for usage.
- Hosting is not free.
- Storage fees will apply.
- Bandwidth fees apply when data is uploaded.
- Data can also be shipped via hard drives and AWS Import fees would apply.
- Bandwidth fees apply when data is downloaded out of AWS. There is no charge when it is downloaded inside AWS (e.g., to an EC2 instance).
- These same fees apply to any S3 log data we keep.
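The identifiable fields called out above could be pulled from the S3 server access log and matched against the platform audit log. Below is an illustrative sketch in Python, assuming the standard space-delimited S3 server access log format; the helper name and the correlation step are assumptions, not an existing platform component.

    import re

    # First several fields of a standard S3 server access log record:
    # owner bucket [time] remote_ip requester request_id operation key "request_uri" ...
    LOG_PATTERN = re.compile(
        r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
        r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
        r'(?P<operation>\S+) (?P<key>\S+) "(?P<request_uri>[^"]*)"'
    )

    def parse_s3_log_line(line):
        match = LOG_PATTERN.match(line)
        if match is None:
            return None
        # requester is "-" (anonymous) or the caller's AWS canonical user id;
        # remote_ip, key, and request_uri (including any extra query parameter
        # we append) are what we would try to line up with the audit log entry
        # written when access was granted.
        return match.groupdict()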
...
- See above details about the general S3 scenario.
- The platform grants access to the data by creating pre-signed S3 URLs to Sage's private S3 files. These URLs are created on demand and have a short expiry time.
- Users will not need AWS accounts.
- The platform has a custom solution for collecting payment from users to pay for their usage.
Tech Details:
- Repository Service: the service has an API that vends pre-signed S3 URLs for data layer files. Users are authenticated and authorized by the service prior to the service returning a download URL for the layer. The default URL expiry time will be one minute (see the sketch after these tech details).
- Web Client: When the user clicks on the download button, the web servlet sends a request to the repository service, which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The web servlet returns this URL to the browser as the location value of a 302 redirect. The browser begins downloading from S3 immediately.
- R Client: When the user issues the download command at the R prompt, the R client sends a request to the repository service, which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The R client uses the returned URL to begin downloading the content from S3 immediately.
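For illustration, here is a minimal sketch of the URL-vending step in Python with boto3 (the actual repository service and its clients are not necessarily implemented this way; the bucket and key names are placeholders).

    import boto3

    s3 = boto3.client("s3")

    def vend_download_url(bucket, key, expiry_seconds=60):
        """Return a pre-signed GET URL that expires after expiry_seconds.

        The caller (web servlet or R client handler) is assumed to have already
        authenticated and authorized the user.
        """
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expiry_seconds,  # one minute by default, per the design above
        )

    # The web client would return this URL as the Location of a 302 redirect;
    # the R client would use it to start the download immediately.
    url = vend_download_url("sage-data-layers", "dataset-123/layer-456.tar.gz")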
...
- While the URL has not yet expired, it is possible for others to use that same URL to download files. For example, if someone requests a download URL from the repository service (which confirms that the user is authenticated and authorized before handing out that URL) and then emails the URL to his team, the whole team could download the data layer, as long as they all start their downloads within that one-minute window.
- If a user gets her download URL and somehow does not use it right away, she will need to reload the web page, or re-run the command at the R prompt, to get a fresh URL.
- Regarding payments, we know when we vend a URL but it may not be possible to know whether that URL was actually used (no fee) or used more than once (multiple fees).
- This security model should be fine for "unrestricted data" since the barrier to gain access (providing a verifiable email address) is lower than the effort of stealing these URLs. It may not be sufficient for "embargoed data" or "restricted data".
Open Questions:
- Does this work with the new support for partial downloads for gigantic files? Currently assuming yes, and that the repository service would need to vend the URL a few times during the download (re-request the URL for each chunk); a rough sketch follows this list.
- Does this work with torrent-style access? Currently assuming no.
- Can we limit S3 access to HTTPS only? Currently assuming yes.
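Assuming partial downloads work as described above, the client-side loop might look roughly like this (Python sketch; vend_download_url is the hypothetical helper from the earlier sketch, and the chunk size is arbitrary).

    import requests

    CHUNK = 8 * 1024 * 1024  # 8 MB per range request

    def download_in_chunks(bucket, key, size, out_path):
        with open(out_path, "wb") as out:
            for start in range(0, size, CHUNK):
                end = min(start + CHUNK, size) - 1
                # Re-request a fresh one-minute URL for each chunk.
                url = vend_download_url(bucket, key)
                resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
                resp.raise_for_status()
                out.write(resp.content)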
...
- See above details about the general S3 scenario.
- The platform grants access to the data by updating the bucket policy for the dataset to which the user would like access.
- Users will need AWS accounts.
- The platform has a custom solution for collecting payment from users to pay for their usage.
Tech Details:
- We would want to ensure that buckets contain the right granularity of data to which we want to provide access, since access is granted to all of a bucket or none of it. Therefore, we might have a separate bucket for each dataset.
- The repository service grants access by adding the user's AWS account number to the bucket policy for the desired dataset (a sketch follows these tech details).
- The repository service vends S3 URLs. Users will need to sign those S3 URLs with their AWS credentials using the usual toolkit provided by Amazon.
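A rough sketch of the grant step in Python with boto3 follows; the bucket name, account id, and policy Id/Sid values are placeholders, and a real implementation would merge a new statement into the existing policy (staying under the 20 KB policy size limit noted below) rather than overwrite it.

    import json

    import boto3

    s3 = boto3.client("s3")

    def grant_read_access(dataset_bucket, aws_account_id):
        policy = {
            "Version": "2012-10-17",
            "Id": f"{dataset_bucket}-access-policy",
            "Statement": [
                {
                    "Sid": f"AllowRead{aws_account_id}",
                    "Effect": "Allow",
                    "Principal": {"AWS": f"arn:aws:iam::{aws_account_id}:root"},
                    "Action": ["s3:GetObject"],
                    # Objects within the bucket; a separate statement on the
                    # bucket itself (no trailing /*) would be needed for
                    # s3:ListBucket.
                    "Resource": f"arn:aws:s3:::{dataset_bucket}/*",
                }
            ],
        }
        s3.put_bucket_policy(Bucket=dataset_bucket, Policy=json.dumps(policy))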
Pros:
- It will be useful if we want to grant access to an entire institution. For example, we may add ISB's AWS account to have read access to one or more S3 buckets of Sage data layers.
- From the S3 logs we will be able to see who downloaded the data and compute charge backs for our custom solution.
Cons:
- This mechanism will not scale to grant access for tens of thousands of individuals; therefore, it will not be sufficient for "unrestricted data". It may scale sufficiently for "embargoed data" and "restricted data".
Open Questions:
- What is the upper limit on the number of grants?
- What is the upper limit on the number of principals that can be listed in a single grant?
...
- The following restrictions on Amazon S3 policies are from http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?AccessPolicyLanguage_SpecialInfo.html:
- The maximum size of a policy is 20 KB
- The value for Resource must be prefixed with the bucket name or the bucket name and a path under it (bucket/). If only the bucket name is specified, without the trailing /, the policy applies to the bucket.
- Each policy must have a unique policy ID (Id)
- Each statement in a policy must have a unique statement ID (sid)
- Each policy must cover only a single bucket and resources within that bucket (when writing a policy, don't include statements that refer to other buckets or resources in other buckets)
...
S3 and IAM
...
This is the newest mechanism from AWS for access control and is under heavy development in 2011 to expand its features. With IAM a group of users (as opposed to AWS accounts) can be granted access to S3 resources. We decide who those users are. All activity is rolled up to a single bill.
"AWS Identity and Access Management is a web service that enables Amazon Web Services (AWS) customers to manage Users and User permissions in AWS. The service is targeted at organizations with multiple Users or systems that use AWS products such as Amazon EC2, Amazon SimpleDB, and the AWS Management Console. With IAM, you can centrally manage Users, security credentials such as access keys, and permissions that control which AWS resources Users can access.
Without IAM, organizations with multiple Users and systems must either create multiple AWS Accounts, each with its own billing and subscriptions to AWS products, or employees must all share the security credentials of a single AWS Account. Also, without IAM, you have no control over the tasks a particular User or system can do and what AWS resources they might use.
IAM addresses this issue by enabling organizations to create multiple Users (each User being a person, system, or application) who can use AWS products, each with individual security credentials, all controlled by and billed to a single AWS Account. With IAM, each User is allowed to do only what they need to do as part of the User's job."
Scenario:
- See above details about the general S3 scenario.
- The platform grants access to the data by creating a new IAM user and adding them to the group that has access to the dataset.
- Users will need AWS accounts. TODO confirm
- The platform has a custom solution for collecting payment from users to pay for their usage.
Tech Details:
- TODO try a quick demo to understand what this is like for users granted access (a rough sketch of the grant flow follows these tech details).
- TODO make sure any EC2 instances a user spins up are on their own bill.
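A quick sketch of what the grant flow might look like with Python and boto3 (user and group names are placeholders, and this is an assumption about the approach, not an existing implementation): create an IAM user under Sage's AWS account, add it to the group that has read access to the dataset bucket, and hand the generated credentials back to the user.

    import boto3

    iam = boto3.client("iam")

    def grant_dataset_access(platform_username, dataset_group):
        iam.create_user(UserName=platform_username)
        iam.add_user_to_group(GroupName=dataset_group, UserName=platform_username)
        # The user signs S3 requests with these credentials, which are separate
        # from any AWS account credentials they already own.
        key = iam.create_access_key(UserName=platform_username)["AccessKey"]
        return key["AccessKeyId"], key["SecretAccessKey"]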
This is ruled out for protected data because IAM is used for managing groups of users all under a particular AWS bill (e.g., all employees of a company).
Pros:
- This will be helpful for managing access for Sage system administrators and Sage employees.
Cons:
- This may be confusing for users. They will likely have their own AWS credentials plus separate credentials for each dataset to which we have granted them access. TODO confirm
- This is currently limited to 1,000 users. We may be able to ask Deepak to raise that limit for us. The limit will be raised within a year as AWS rolls out expanded support for federated access to AWS resources.
Resources:
S3 ACL
This is the older mechanism from AWS for access control. It controls object-level, not bucket-level, access.
This is ruled out for protected data because ACLs can have a max of 100 grants and it appears that these grants cannot be to groups such as groups of arbitrary AWS users.
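For reference, an object-level ACL grant to a single AWS user looks roughly like the following (Python/boto3 sketch; the bucket, key, and canonical user id are placeholders). Because each object carries its own ACL with at most 100 grants, this approach does not scale to per-user access for large numbers of users.

    import boto3

    s3 = boto3.client("s3")

    # A real call would also re-grant the bucket owner full control, since
    # setting grants this way replaces the object's existing ACL.
    s3.put_object_acl(
        Bucket="sage-data-layers",
        Key="dataset-123/layer-456.tar.gz",
        GrantRead='id="64-character-canonical-user-id"',
    )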
Open Question:
- Confirm that grants do not apply to groups of AWS users.
Resources:
CloudFront Private Content
CloudFront supports the notion of private content. CloudFront URLs can be created with access policies such as an expiry time and an IP mask (a sketch follows the pros and cons below). This is ruled out since S3 provides a similar feature and we do not need a CDN for any of the usual reasons one would want a CDN.
Pros:
- This would work for the download of protected content use cases.
Cons:
- Note that this is likely a bad solution for the EC2/EMR use cases because CloudFront sits outside of AWS and users will incur the inbound bandwidth charges when they pull the data down to EC2/EMR.
- There is an additional charge on top of the S3 hosting costs.
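For comparison, a signed CloudFront URL with an expiry time and an IP restriction can be produced with botocore's CloudFrontSigner; the key pair id, key file, distribution URL, and IP range below are placeholders, and this is only a sketch of the feature, not something we plan to deploy.

    from datetime import datetime, timedelta

    from botocore.signers import CloudFrontSigner
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    def rsa_signer(message):
        # Sign with the CloudFront key pair's private key.
        with open("cloudfront-private-key.pem", "rb") as f:
            key = serialization.load_pem_private_key(f.read(), password=None)
        return key.sign(message, padding.PKCS1v15(), hashes.SHA1())

    signer = CloudFrontSigner("EXAMPLEKEYPAIRID", rsa_signer)
    url = "https://dxxxxxxxxxxxx.cloudfront.net/dataset-123/layer-456.tar.gz"

    # Custom policy: the URL expires in one hour and only works from one IP range.
    policy = signer.build_policy(
        url,
        date_less_than=datetime.utcnow() + timedelta(hours=1),
        ip_address="203.0.113.0/24",
    )
    signed_url = signer.generate_presigned_url(url, policy=policy)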
...
The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.
...