...
- This won't work for public data if it is a requirement that:
  - all users provide an email address and agree to a EULA prior to access
  - we must log downloads
- This won't work for protected data unless the future implementation provides more support.
S3
Skipping a description of public data on S3 because the scenario is very straightforward - if you get the URL, you can download the resource. For example: http://s3.amazonaws.com/nicole.deflaux/ElasticMapReduceFun/mapper.R
Protected Data Scenario:
- All Sage data is stored on S3 and is not public.
- Users can only discover what data is available via the platform.
- Users can use the data for cloud computation by spinning up EC2 instances and downloading the files from S3 to the hard drive of their EC2 instance.
- Users can download the data from S3 to their local system. See below for more details on this.
- The platform directs users to sign a Sage-specified EULA prior to gaining access to these files in S3.
- Users must have a Sage platform account to access this data for download. They may need an AWS account for the cloud computation use case depending upon the mechanism we use to grant access.
- The platform grants access to this data. See below for details about the various ways we might do this.
- The platform will write to the audit log each time it grants access and to whom it granted access. S3 can also be configured to log all access to resources and this could serve as a means of intrusion detection.
- These two logs record different events (granting access vs. using access), so their entries will not have a strict 1-to-1 mapping, but they should overlap substantially.
- The platform can store anything it likes in its audit log.
- The S3 log stores normal web access log data with the following identifiable fields:
  - the client IP address
  - "anonymous" or the user's AWS canonical user id
- We can try appending some other query parameter to the S3 URL to help us line it up with audit log entries.
- See proposals below regarding how users might pay for usage.
- Hosting is not free.
- Storage fees will apply.
- Bandwidth fees apply when data is uploaded.
- Data can also be shipped via hard drives and AWS Import fees would apply.
- Bandwidth fees apply when data is downloaded out of AWS. There is no charge when it is downloaded inside AWS (e.g., to an EC2 instance).
- These same fees apply to any S3 log data we keep.
Open Questions:
- Can we use the canonical user id to identify the user if they have previously given us their AWS account id?
- If we add our own query params to the S3 URL, will they show up in the S3 log?
Resources:
- Best Effort Server Log Delivery:
- The server access logging feature is designed for best effort. You can expect that most requests against a bucket that is properly configured for logging will result in a delivered log record, and that most log records will be delivered within a few hours of the time that they were recorded.
- However, the server logging feature is offered on a best-effort basis. The completeness and timeliness of server logging is not guaranteed. The log record for a particular request might be delivered long after the request was actually processed, or it might not be delivered at all. The purpose of server logs is to give the bucket owner an idea of the nature of traffic against his or her bucket. It is not meant to be a complete accounting of all requests.
- Usage Report Consistency
- It follows from the best-effort nature of the server logging feature that the usage reports available at the AWS portal might include usage that does not correspond to any request in a delivered server log.
- Log format details: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?LogFormat.html
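To make the identifiable log fields above concrete, here is a sketch of pulling the client IP and requester id out of one access-log entry. The field order assumes the classic S3 server access log format (bucket owner, bucket, [time], remote IP, requester, ...), and the sample line is fabricated:

```python
import re

# One fabricated line in the shape of an S3 server access log entry.
SAMPLE = ('79a5 mybucket [06/Feb/2011:00:00:38 +0000] 192.0.2.3 '
          '65a011a29cdf8ec533ec3d1ccaae921c 3E57427F3EXAMPLE REST.GET.OBJECT '
          'mapper.R "GET /mybucket/mapper.R HTTP/1.1" 200 - 2662992 2662992 '
          '70 47 "-" "curl/7.15.1" -')

# Fields are space-separated, except the [timestamp] and "quoted" fields.
FIELD = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

def parse_log_line(line):
    """Extract the fields useful for audit correlation from one log entry.

    Positions assume the classic format: bucket owner (0), bucket (1),
    time (2), remote IP (3), requester (4), request id (5), operation (6),
    key (7), ...
    """
    fields = FIELD.findall(line)
    return {"remote_ip": fields[3], "requester": fields[4],
            "operation": fields[6], "key": fields[7]}
```

The "requester" field is where the AWS canonical user id (or the anonymous marker) shows up, which is what we would try to line up against the platform audit log.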
Options to Restrict Access to S3
S3 Pre-Signed URLs for Private Content
Pros:
- Simple! These can be used for both the download and the cloud computation use cases.
- Scalable
- We control the duration of the expiry window, e.g., all URLs might be good for one minute.
Cons:
- Until the URL expires, it is possible for others to use that same URL to download files.
- If a user gets her download URL and does not use it right away, she'll need to reload the WebUI page, or re-run the R prompt command, to get another.
Open Questions:
- How does this work with the new support for partial downloads?
- Does this work with torrent-style access?
- Can we limit S3 access to HTTPS only?
Resources:
- "Query String Request Authentication Alternative: You can authenticate certain types of requests by passing the required information as query-string parameters instead of using the Authorization HTTP header. This is useful for enabling direct third-party browser access to your private Amazon S3 data, without proxying the request. The idea is to construct a "pre-signed" request and encode it as a URL that an end-user's browser can retrieve. Additionally, you can limit a pre-signed request by specifying an expiration time."
- http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?S3_QSAuth.html
- http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RESTAuthentication.html
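A minimal sketch of generating such a pre-signed GET URL per the query-string authentication scheme quoted above (HMAC-SHA1 over the string-to-sign). The bucket, key, and credentials are placeholders:

```python
import base64, hashlib, hmac, time
from urllib.parse import quote

def presign_get_url(bucket, key, access_key, secret_key, expires_in=60):
    """Build an S3 query-string-authenticated ("pre-signed") GET URL.

    Follows the S3 REST authentication spec for this scheme: the
    string-to-sign is HTTP-Verb, Content-MD5, Content-Type, Expires, and
    the canonicalized resource, each separated by a newline.
    """
    expires = int(time.time()) + expires_in
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return ("https://s3.amazonaws.com/%s/%s"
            "?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, key, access_key, expires, quote(signature, safe="")))
```

Note that extra query parameters (e.g., an audit correlation id) appended after signing should not invalidate the signature under this scheme, since arbitrary parameters are not part of the string-to-sign; whether they appear in the S3 access log is one of the open questions above.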
S3 Bucket Policies
This is the newer mechanism from AWS for access control.
Open Questions:
- What is the upper limit on the number of grants?
- What is the upper limit on the number of principals that can be listed in a single grant?
Resources:
- Bucket Policy Overview
- Evaluation Logic Flow Chart
- "The Principal is the person or persons who receive or are denied permission according to the policy. You must specify the principal by using the principal's AWS account ID (e.g., 1234-5678-9012, with or without the hyphens). You can specify multiple principals, or a wildcard to indicate all possible users." from http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RESTAuthenticationAccessPolicyLanguage_ElementDescriptions.html
- The following list describes the restrictions on Amazon S3 policies (from http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?AccessPolicyLanguage_SpecialInfo.html):
- The maximum size of a policy is 20 KB
- The value for Resource must be prefixed with the bucket name or the bucket name and a path under it (bucket/). If only the bucket name is specified, without the trailing /, the policy applies to the bucket.
- Each policy must have a unique policy ID (Id)
- Each statement in a policy must have a unique statement ID (sid)
- Each policy must cover only a single bucket and resources within that bucket (when writing a policy, don't include statements that refer to other buckets or resources in other buckets)
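As a sketch of how a bucket policy honoring these restrictions might be assembled - the bucket name, policy Id, and statement Sid below are hypothetical, and the principals are AWS account IDs as described in the quote above:

```python
import json

def build_read_policy(bucket, account_ids, policy_id="sage-protected-data-v1"):
    """Sketch a policy granting s3:GetObject on one bucket's objects."""
    policy = {
        "Version": "2008-10-17",
        "Id": policy_id,                        # each policy needs a unique Id
        "Statement": [{
            "Sid": "AllowGetForApprovedUsers",  # each statement needs a unique Sid
            "Effect": "Allow",
            "Principal": {"AWS": account_ids},  # principals as AWS account IDs
            "Action": "s3:GetObject",
            # Resource must be prefixed with the bucket name; the trailing /*
            # keeps the statement scoped to objects within this single bucket.
            "Resource": "arn:aws:s3:::%s/*" % bucket,
        }],
    }
    doc = json.dumps(policy)
    assert len(doc.encode()) <= 20 * 1024, "policy exceeds the 20 KB limit"
    return doc
```

The open questions above (how many principals a single grant can hold) would in practice be bounded by that 20 KB document limit, if nothing else.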
S3 ACL
This is the older mechanism from AWS for access control.
This is ruled out for protected data because ACLs can have a maximum of 100 grants, and it appears that grants cannot be made to groups, such as groups of arbitrary AWS users.
Open Question:
- Confirm that grants do not apply to groups of AWS users.
Resources:
- http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?AccessPolicyLanguageS3_ElementDescriptionsACLs_UsingACLs.html
S3 and IAM
With IAM a group of users can be granted access to S3 resources. This will be helpful for managing access for Sage system administrators and Sage employees.
This is ruled out for protected data because IAM is used for managing groups of users all under a particular AWS bill (e.g., all employees of a company).
Open Questions:
- Is there a cap on the number of users for IAM?
- Confirm that IAM is only intended for managing groups and users where the base assumption is that all activity rolls up to a single AWS bill.
Resources:
...
CloudFront Private Content
CloudFront supports the notion of private content. CloudFront URLs can be created with access policies such as an expiry time and an IP mask.
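A sketch of the custom-policy document behind such a URL, showing the two controls mentioned above (an expiry time and an IP mask). In the real flow this JSON is RSA-signed with a CloudFront key pair and attached to the URL as query parameters; the signing step is omitted here, and the URL and CIDR below are placeholders:

```python
import json, time

def cloudfront_custom_policy(url, client_ip_cidr, expires_in=60):
    """Build the policy JSON restricting a CloudFront URL by time and IP."""
    return json.dumps({
        "Statement": [{
            "Resource": url,
            "Condition": {
                # Hard expiry for the link.
                "DateLessThan": {"AWS:EpochTime": int(time.time()) + expires_in},
                # IP mask: only clients within this CIDR may use the link.
                "IpAddress": {"AWS:SourceIp": client_ip_cidr},
            },
        }],
    })
```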
Pros:
- This would work for the download of protected content use cases.
Cons:
- This is likely a bad solution for the EC2/EMR use cases because CloudFront sits outside of AWS, and users will incur inbound bandwidth charges when they pull the data down to EC2/EMR.
- There is an additional charge on top of the S3 hosting costs.
Open Questions:
- Since we do not need a CDN for the normal reason (to move often-requested content closer to the user to reduce download times), does this buy us much over S3 Pre-Signed URLs? It seems like the only added benefit is an IP mask in addition to the expiry time.
Resources:
Options to have customers bear some of the costs
S3 "Requester Pays" Buckets
http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html
Dev Pay
http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?UsingDevPay.html
Flexible Payments Service
EBS
Data is available as hard drive snapshots.
Pros:
- It's a convenient way to access data from EC2 instances.
Cons:
- This only covers our cloud compute use case, not our download use case.
EBS snapshot ACL
Open questions:
- What is the maximum number of grants for this?
...
If all else fails we can do this, but it will be more work operationally to manage a fleet of custom proxy servers.
File Organization and Format
TODO Brian will add stuff here
Resources:
Additional Details
Network Bandwidth
Hypothetically, the Hutch (and therefore Sage) is on a high throughput link to AWS.
The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
The PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at the Pacific Wave International Peering Exchange. See Participant Services for more information.
...