...

  • this won't work for public data if it is a requirement that
    • all users provide an email address and agree to a EULA prior to access
    • we must log downloads
  • this won't work for protected data unless the future implementation provides more support

...

Custom Proxy

All files are kept completely private on S3 and we write a custom proxy that allows users with permission to download files whether to locations outside AWS or to EC2 hosts.

Scenario:

  • See above details about the general S3 scenario.
  • The platform grants access to the data by proxying the user's request to S3.
  • Users do not need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.

Tech Details:

  • We would have a fleet of proxy servers running on EC2 hosts that authenticate and authorize users and then proxy downloads from S3.
  • We would need to configure auto-scaling so that capacity for this proxy fleet can grow and shrink as needed.
  • We would log events when data access is granted and when data is accessed.

Pros:

  • full flexibility
  • we can accurately track who downloaded what, and when

Cons:

  • we have another service fleet to administer
  • we are now the scalability bottleneck
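A minimal sketch of the authorize, log, then stream flow described in the tech details above. Everything here is hypothetical: the in-memory `PERMISSIONS` and `PRIVATE_STORE` dictionaries stand in for the platform's permission store and for private S3 objects, and `AUDIT_LOG` stands in for the platform audit log. It only illustrates logging one event when access is granted and another when the data is actually read.

```python
import io
import time

# Hypothetical stand-ins: a permission table, private S3 content, and the audit log.
PERMISSIONS = {("alice", "dataset1/file.txt")}
PRIVATE_STORE = {"dataset1/file.txt": b"example bytes"}
AUDIT_LOG = []


def log_event(event, user, key, **details):
    """Append one audit record; a real proxy would write to durable storage."""
    AUDIT_LOG.append({"time": time.time(), "event": event, "user": user, "key": key, **details})


def _stream(user, key, chunk_size):
    """Yield the object in chunks, then record that the data was accessed."""
    source = io.BytesIO(PRIVATE_STORE[key])  # stands in for a GET against S3
    sent = 0
    for chunk in iter(lambda: source.read(chunk_size), b""):
        sent += len(chunk)
        yield chunk
    log_event("data_accessed", user, key, bytes_sent=sent)


def proxy_download(user, key, chunk_size=4):
    """Authorize the user up front and return an iterator over the object's bytes."""
    if (user, key) not in PERMISSIONS:
        log_event("access_denied", user, key)
        raise PermissionError(f"{user} may not download {key}")
    log_event("access_granted", user, key)
    return _stream(user, key, chunk_size)
```

Because authorization happens before any bytes are streamed, the "granted" and "accessed" events stay distinct, which is what makes per-user download tracking possible.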

S3

We skip a description of public data on S3 because the scenario is very straightforward: anyone who has the URL can download the resource. For example: http://s3.amazonaws.com/nicole.deflaux/ElasticMapReduceFun/mapper.R

Unrestricted and Embargoed Data Scenario:

  • All Sage data is stored on S3 and is not public.
  • Users can only discover what data is available via the platform.
  • Users can use the data for cloud computation by spinning up EC2 instances and downloading the files from S3 to the hard drive of their EC2 instance.
  • Users can download the data from S3 to their local system.
  • The platform directs users to sign a Sage-specified EULA prior to gaining access to these files in S3.
  • Users must have a Sage platform account to access this data for download. They may need an AWS account depending upon the mechanism we use to grant access.
  • The platform grants access to this data. See below for details about the various ways we might do this.
  • The platform will write to the audit log each time it grants access. S3
    can also be configured to log all access to resources, and these logs could
    serve as records for billing purposes and for intrusion detection.
    • These two types of logs will have log entries about different events (granting access vs. using access) so they will not have a strict 1-to-1 mapping between entries but should have a substantial overlap.  
    • The platform can store anything it likes in its audit log.  
    • The S3 log stores normal web access log type data with the following identifiable fields:
      • client IP address is available in the log
      • "anonymous" or the user's AWS canonical user id will appear in the log
      • We can leave these logs on S3 and run Elastic MapReduce jobs on them
        when we need to do data mining. Or we can download them and do data mining locally.
  • See proposals below regarding how users might pay for outbound bandwidth.
  • Hosting is not free.
    • Storage fees will apply.
    • Data import fees:
      • Bandwidth fees apply when data is uploaded.
      • Data can also be shipped via hard drives and AWS Import/Export fees would
        apply.
    • Data export fees:
      • Bandwidth fees apply when data is downloaded out of AWS. There is no charge when it is downloaded inside AWS (e.g., to an EC2 instance).
      • Data can also be shipped via hard drives and AWS Import/Export fees would
        apply.
    • These same fees apply to any S3 log data we keep on S3.
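The fee structure in the list above can be sketched as a small estimator. The `Rates` values are placeholders to be filled in from the current AWS price list, not actual AWS prices, and the function names are hypothetical. The sketch charges for storage (including any S3 logs kept on S3), for uploads, and for downloads that leave AWS; downloads that stay inside AWS are counted as free, as stated above. Hard-drive shipments via AWS Import/Export are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Prices in dollars per GB. Fill in from the AWS price list; no defaults are assumed."""
    storage_per_gb_month: float
    upload_per_gb: float
    download_out_per_gb: float


def monthly_cost(rates, data_gb, log_gb, uploaded_gb, downloaded_out_gb, downloaded_in_aws_gb=0.0):
    """Estimate one month of hosting cost under the fee structure described above."""
    storage = (data_gb + log_gb) * rates.storage_per_gb_month  # logs on S3 pay storage too
    upload = uploaded_gb * rates.upload_per_gb
    download = downloaded_out_gb * rates.download_out_per_gb
    # downloaded_in_aws_gb (e.g. to an EC2 instance) is accepted for bookkeeping
    # but does not add a bandwidth charge.
    return storage + upload + download
```

Splitting `downloaded_out_gb` per user would give the numbers needed for any of the proposals in which users pay for their own outbound bandwidth.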

...

  • Best Effort Server Log Delivery:
    • The server access logging feature is designed for best effort. You can expect that most requests against a bucket that is properly configured for logging will result in a delivered log record, and that most log records will be delivered within a few hours of the time that they were recorded.
    • However, the server logging feature is offered on a best-effort basis. The completeness and timeliness of server logging is not guaranteed. The log record for a particular request might be delivered long after the request was actually processed, or it might not be delivered at all. The purpose of server logs is to give the bucket owner an idea of the nature of traffic against his or her bucket. It is not meant to be a complete accounting of all requests.
    • Usage Report Consistency (http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?ServerLogs.html): It follows from the best-effort nature of the server logging feature that the usage reports available at the AWS portal might include usage that does not correspond to any request in a delivered server log.

    • I have it on good authority that the S3 logs are accurate, but delivery can be delayed now and then.
  • Log format details
  • AWS Credentials Primer
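As a rough illustration of mining the S3 access logs for the identifiable fields mentioned earlier (client IP address and requester), the sketch below parses the leading space-delimited fields of a server access log line and totals bytes sent per requester and IP. The sample lines are made-up placeholders rather than real log data, the field layout should be checked against the log format details, and because log delivery is best effort the totals may undercount.

```python
import re
from collections import defaultdict

# Leading fields of an S3 server access log line; trailing fields are ignored here.
LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<status>\S+) (?P<error_code>\S+) (?P<bytes_sent>\S+)'
)

# Illustrative, invented lines (not real log output).
SAMPLE_LINES = [
    'OWNER-ID example-bucket [06/Feb/2011:00:00:38 +0000] 192.0.2.3 USER-CANONICAL-ID '
    'REQ0001 REST.GET.OBJECT data/file1.txt "GET /example-bucket/data/file1.txt HTTP/1.1" '
    '200 - 2048 2048 30 25 "-" "curl/7.21" -',
    'OWNER-ID example-bucket [06/Feb/2011:00:01:02 +0000] 192.0.2.9 Anonymous '
    'REQ0002 REST.GET.OBJECT data/file1.txt "GET /example-bucket/data/file1.txt HTTP/1.1" '
    '403 AccessDenied - - 5 - "-" "curl/7.21" -',
]


def parse_line(line):
    """Return a dict of the leading log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def bytes_downloaded(lines):
    """Total bytes sent for successful object GETs, keyed by (requester, client IP)."""
    totals = defaultdict(int)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        if record["operation"] != "REST.GET.OBJECT" or not record["status"].startswith("2"):
            continue
        if record["bytes_sent"].isdigit():  # the field is "-" when nothing was sent
            totals[(record["requester"], record["remote_ip"])] += int(record["bytes_sent"])
    return dict(totals)
```

The same per-line function could serve as the map step of an Elastic MapReduce job if the logs are left on S3.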

...

  • A comparison of Bucket Policies, ACLs, and IAM
  • More details on limits
  • Following are the default maximums for your entities:
    • Groups per AWS Account: 100
    • Users per AWS Account: 5000
    • Number of groups per User: 10 (that is, the User can be in this many groups)
    • Access keys per User: 2
    • Signing certificates per User: 2
    • MFA devices per User: 1
    • Server certificates per AWS Account: 10
    • You can request to increase the maximum number of Users or groups for your AWS Account.
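If each platform user were mapped to an IAM user, the default maximums above become a planning constraint. The following sketch checks a planned user-to-group assignment against three of the limits listed. The limit values are copied from the list above; the function and dictionary names are hypothetical, and the limits may have been raised on request for a given account.

```python
# Default maximums copied from the list above; an account's real limits may differ.
DEFAULT_LIMITS = {
    "groups_per_account": 100,
    "users_per_account": 5000,
    "groups_per_user": 10,
}


def limit_violations(user_groups, limits=DEFAULT_LIMITS):
    """Return a list of messages for every limit the plan exceeds.

    user_groups maps each user name to the set of group names the user belongs to.
    """
    problems = []
    all_groups = set().union(*user_groups.values())
    if len(user_groups) > limits["users_per_account"]:
        problems.append(f"{len(user_groups)} users exceeds {limits['users_per_account']}")
    if len(all_groups) > limits["groups_per_account"]:
        problems.append(f"{len(all_groups)} groups exceeds {limits['groups_per_account']}")
    for user, groups in sorted(user_groups.items()):
        if len(groups) > limits["groups_per_user"]:
            problems.append(f"{user} is in {len(groups)} groups, limit {limits['groups_per_user']}")
    return problems
```

An empty result means the plan fits within the defaults; anything else indicates where a limit increase would have to be requested.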

S3 ACL

This is the older mechanism from AWS for access control. It controls object-level, not bucket-level, access.

This is ruled out for protected data because an ACL can have at most 100 grants, and it appears that a grant cannot be made to a group, such as a group of arbitrary AWS users.

Open Question:

  • Confirm that grants do not apply to groups of AWS users.

Resources:

CloudFront Private Content

CloudFront supports the notion of private content. CloudFront URLs can be created with access policies such as an expiration time and an IP mask. This is ruled out since S3 provides a similar feature and we do not need a CDN for any of the usual reasons one wants to use a CDN.

Since we do not need a CDN for the normal reason (to move frequently requested content closer to the user to reduce download times), this does not buy us much over S3 Pre-Signed URLs. It seems like the only added benefit is an IP mask in addition to the expiration time.
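For reference, the access policy mentioned above (an expiration time plus an optional IP mask) is expressed as a small JSON document. The sketch below only builds that policy text in the shape used for CloudFront custom policies; the function name and the example URL and address range are placeholders, and the signing step that turns the policy into a usable URL (an RSA signature with a CloudFront key pair) is deliberately left out.

```python
import json


def cloudfront_custom_policy(resource_url, expires_epoch, ip_cidr=None):
    """Build the JSON policy text for a private CloudFront URL (unsigned)."""
    condition = {"DateLessThan": {"AWS:EpochTime": int(expires_epoch)}}
    if ip_cidr is not None:
        # The IP mask is the one restriction S3 pre-signed URLs do not offer.
        condition["IpAddress"] = {"AWS:SourceIp": ip_cidr}
    policy = {"Statement": [{"Resource": resource_url, "Condition": condition}]}
    # Compact separators keep the policy free of extra whitespace before signing.
    return json.dumps(policy, separators=(",", ":"))
```

Calling it with only a URL and an expiration time yields a policy equivalent in effect to what an S3 pre-signed URL already provides, which is why the IP mask is the only added benefit.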

Pros:

  • This would work for the download of unrestricted data use cases.

Cons:

  • Note that this is likely a bad solution for the EC2/EMR use cases because CloudFront sits outside of AWS and users will incur inbound bandwidth charges when they pull the data down to EC2/EMR.
  • There is an additional charge on top of the S3 hosting costs.

Resources:

Options to have customers bear some of the costs

...

The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.

...