Dataset Hosting Design
Note that we went with pre-signed URLs and plan to switch to IAM tokens instead of IAM users when that feature is launched.
Table of Contents
...
- Best Effort Server Log Delivery:
- The server access logging feature is designed for best effort. You can expect that most requests against a bucket that is properly configured for logging will result in a delivered log record, and that most log records will be delivered within a few hours of the time that they were recorded.
- However, the server logging feature is offered on a best-effort basis. The completeness and timeliness of server logging is not guaranteed. The log record for a particular request might be delivered long after the request was actually processed, or it might not be delivered at all. The purpose of server logs is to give the bucket owner an idea of the nature of traffic against his or her bucket. It is not meant to be a complete accounting of all requests.
- Usage Report Consistency (http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?ServerLogs.html): It follows from the best-effort nature of the server logging feature that the usage reports available at the AWS portal might include usage that does not correspond to any request in a delivered server log.
- I have it on good authority that the S3 logs are accurate, but delivery can be delayed now and then.
- Log format details
- AWS Credentials Primer
Custom Access Log Information
"You can include custom information to be stored in the access log record for a request by adding a custom query-string parameter to the URL for the request. Amazon S3 will ignore query-string parameters that begin with "x-", but will include those parameters in the access log record for the request, as part of the Request-URI field of the log record. For example, a GET request for "s3.amazonaws.com/mybucket/photos/2006/08/puppy.jpg?x-user=johndoe" will work the same as the same request for "s3.amazonaws.com/mybucket/photos/2006/08/puppy.jpg", except that the "x-user=johndoe" string will be included in the Request-URI field for the associated log record. This functionality is available in the REST interface only."
Open Questions:
- Can we use the canonical user id to know who the user is if they have previously given us their AWS account id? No, but we can ask them to provide their canonical id to us.
- If we stick our own query params on the S3 URL, will they show up in the S3 log? YES, see the "x-" naming convention above.
Options to Restrict Access to S3
S3 Pre-Signed URLs for Private Content
"Query String Request Authentication Alternative: You can authenticate certain types of requests by passing the required information as query-string parameters instead of using the Authorization HTTP header. This is useful for enabling direct third-party browser access to your private Amazon S3 data, without proxying the request. The idea is to construct a "pre-signed" request and encode it as a URL that an end-user's browser can retrieve. Additionally, you can limit a pre-signed request by specifying an expiration time."
Scenario:
- See above details about the general S3 scenario.
- This would not be compatible with Elastic MapReduce. Users would have to download files to an EC2 host and then store them in their own bucket in S3.
- The platform grants access to the data by creating pre-signed S3 URLs to Sage's private S3 files. These URLs are created on demand and have a short expiry time.
- Users will not need AWS accounts.
- The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.
Tech Details:
- Repository Service: the service has an API that vends pre-signed S3 URLs for data layer files. Users are authenticated and authorized by the service prior to the service returning a download URL for the layer. The default URL expiry time will be one minute.
- Web Client: When the user clicks the download button, the web servlet sends a request to the repository service, which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The web servlet returns this URL to the browser as the location value of a 302 redirect, and the browser begins the download from S3 immediately (see the sketch below).
- R Client: When the user issues the download command at the R prompt, the R client sends a request to the repository service, which constructs and returns a pre-signed S3 URL with an expiry time of one minute. The R client uses the returned URL to begin downloading the content from S3 immediately.
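To make this concrete, here is a minimal sketch of how the repository service could mint a one-minute pre-signed URL and how the servlet could hand it back as a redirect. It assumes the AWS SDK for Java; the class and method names here are ours for illustration, and the bucket and key are borrowed from the log example later in this document.
Code Block
// Hedged sketch: vend a short-lived pre-signed URL and return it as a 302.
// Assumes the AWS SDK for Java; credential plumbing is simplified.
import java.net.URL;
import java.util.Date;
import javax.servlet.http.HttpServletResponse;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class LayerDownloadHelper {
    private static final long EXPIRY_MILLIS = 60 * 1000; // one-minute expiry

    public static void redirectToLayer(HttpServletResponse response,
            String accessKeyId, String secretKey) throws Exception {
        AmazonS3Client s3 = new AmazonS3Client(
                new BasicAWSCredentials(accessKeyId, secretKey));
        Date expiration = new Date(System.currentTimeMillis() + EXPIRY_MILLIS);
        // Only Sage's own credentials can mint a valid signature for this URL
        URL presignedUrl = s3.generatePresignedUrl(
                "data01.sagebase.org", "human_liver_cohort/readme.txt", expiration);
        // 302 redirect so the browser downloads directly from S3
        response.setStatus(HttpServletResponse.SC_MOVED_TEMPORARILY);
        response.setHeader("Location", presignedUrl.toString());
    }
}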
...
- The repository service grants access by adding the user to the correct Crowd and IAM groups for the desired dataset. The Crowd groups are "the truth" and the IAM groups are a mirror of that. At access grant time, we also have to determine and store the mapping between IAM id and canonical user id. The S3 team has a fix for this on their roadmap, but in the interim we have to make a request for a particular known URL as that new user, then retrieve the canonical id from the S3 log and store it in our system.
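As a rough sketch of the grant-mirroring step (the group name below is hypothetical; the calls are from the AWS SDK for Java):
Code Block
// Hedged sketch: mirror a Crowd access grant into IAM by adding the user
// to the dataset's IAM group. "human_liver_cohort_readers" is hypothetical.
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.identitymanagement.AmazonIdentityManagementClient;
import com.amazonaws.services.identitymanagement.model.AddUserToGroupRequest;

public class AccessGrantMirror {
    public static void mirrorGrantToIam(String accessKeyId, String secretKey,
            String iamUserName) {
        AmazonIdentityManagementClient iam = new AmazonIdentityManagementClient(
                new BasicAWSCredentials(accessKeyId, secretKey));
        iam.addUserToGroup(new AddUserToGroupRequest(
                "human_liver_cohort_readers", iamUserName));
    }
}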
- Log data
- For IAM user nicole.deflaux, my identity is recorded in the log as arn:aws:iam::325565585839:user/nicole.deflaux
- Note that IAM usernames can be email addresses, so we can easily reconcile them with the email addresses registered with the platform, because we will use the email they gave us when we create their IAM user.
- Here is what a full log entry looks like:
Code Block d9df08ac799f2859d42a588b415111314cf66d0ffd072195f33b921db966b440 data01.sagebase.org [18/Feb/2011:19:32:56 +0000] 140.107.149.246 arn:aws:iam::325565585839:user/nicole.deflaux C6736AEED375F69E REST.GET.OBJECT human_liver_cohort/readme.txt "GET /data01.sagebase.org/human_liver_cohort/readme.txt?AWSAccessKeyId=AKIAJBHAI75QI6ANOM4Q&Expires=1298057875&Signature=XAEaaPtHUZPBEtH5SaWYPMUptw4%3D&x-amz-security-token=AQIGQXBwVGtupqEus842na80zMbEFbfpPhOsIic7z1ghm0Umjd8kybj4eaOtBCKlwVHMXi2SuasIKxwYljjDA95O%2BfZb5uF7ku4crE6OObz8d/ev7ArPime2G/a5nXRq56Jx2hAt8NDDbhnE8JqyOnKn%2BN308wx2Ud3Q2R3rSqK6t%2Bq/l0UAkhBFNM1gvjR%2BoPGYBV9Jspwfp8ww8CuZVH1Y2P2iid6ZS93K02sbGvQnhU7eCGhorhMI5kxOqy7bTbzvl2HML7zQphXRIa1wqrRSD/sBLfpK5x6A%2BcQnLrgO6FtWJMDo5rTmgEPo6esNIivWnaiI6BvPddLlBMZVtmcx39/cOBbrfK3v0vHmYb3oftseacjvBD/wfyigB5wbSgRUYNhbUu1V HTTP/1.1" 200 - 6300 6300 45 45 "https://s3-console-us-standard.console.aws.amazon.com/GetResource/Console.html?&lpuid=AIDAJDFZWQHFC725MCNDW&lpfn=nicole.deflaux&lpgn=325565585839&lpas=ACTIVE&lpiamu=t&lpak=AKIAJBHAI75QI6ANOM4Q&lpst=AQIGQXBwVGtupqEus842na80zMbEFbfpPhOsIic7z1ghm0Umjd8kybj4eaOtBCKlwVHMXi2SuasIKxwYljjDA95O%2BfZb5uF7ku4crE6OObz8d%2Fev7ArPime2G%2Fa5nXRq56Jx2hAt8NDDbhnE8JqyOnKn%2BN308wx2Ud3Q2R3rSqK6t%2Bq%2Fl0UAkhBFNM1gvjR%2BoPGYBV9Jspwfp8ww8CuZVH1Y2P2iid6ZS93K02sbGvQnhU7eCGhorhMI5kxOqy7bTbzvl2HML7zQphXRIa1wqrRSD%2FsBLfpK5x6A%2BcQnLrgO6FtWJMDo5rTmgEPo6esNIivWnaiI6BvPddLlBMZVtmcx39%2FcOBbrfK3v0vHmYb3oftseacjvBD%2FwfyigB5wbSgRUYNhbUu1V&lpts=1298051257585" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.102 Safari/534.13" -
- For IAM users, the repository service vends credentials for the users' use.
- Users will need to sign those S3 URLs with their AWS credentials using the usual toolkit provided by Amazon.
- One example tool provided by Amazon is s3curl.
- I'm sure we can also get the R client to do the signing (a sketch of the query-string signing scheme follows this list). We can essentially hide the fact that users have their own credentials just for this, since the R client can communicate with the repository service and cache credentials in memory.
- We may be able to find a JavaScript library to do the signing as well for Web client use cases. If not, we could proxy download requests. That solution is described in its own section below.
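For reference, here is a sketch of what that client-side signing amounts to, using the query-string request authentication scheme quoted earlier. The helper name is ours, and the bucket and key are reused from the log example above; a real client would use Amazon's toolkit rather than hand-rolling this.
Code Block
// Hedged sketch of S3 query-string request authentication: HMAC-SHA1 over a
// canonical string, signed with the user's own secret key.
import java.net.URLEncoder;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class QueryStringSigner {
    public static String presign(String accessKeyId, String secretKey)
            throws Exception {
        long expires = System.currentTimeMillis() / 1000 + 60; // one minute out
        // Canonical string for a plain GET: method, blank content-md5 and
        // content-type headers, the expiry timestamp, and the resource path
        String stringToSign = "GET\n\n\n" + expires
                + "\n/data01.sagebase.org/human_liver_cohort/readme.txt";
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
        String signature = URLEncoder.encode(
                Base64.getEncoder().encodeToString(
                        mac.doFinal(stringToSign.getBytes("UTF-8"))), "UTF-8");
        return "https://s3.amazonaws.com/data01.sagebase.org/human_liver_cohort/readme.txt"
                + "?AWSAccessKeyId=" + accessKeyId
                + "&Expires=" + expires
                + "&Signature=" + signature;
    }
}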
...
The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at the Pacific Wave International Peering Exchange. See Participant Services for more information.
...