...

  • Sage currently has two data sets stored as "AWS Public Datasets" in the US West Region.
  • Users can discover them by browsing public datasets on http://aws.amazon.com/datasets/Biology?browse=1 and also via the platform.
  • Users can use them for cloud computation by spinning up EC2 instances and mounting the data as EBS volumes.
  • Users cannot directly download data from these public datasets, but once they have them mounted on an EC2 host, they can scp them to their local system.
  • Users cannot directly use these files with Elastic MapReduce. Instead they would need to first upload them to their own S3 bucket.
  • Users are not forced to sign a Sage-specified EULA prior to access because they can bypass the platform and access this data directly via normal AWS mechanisms.
  • Users must have an AWS account to access this data.
  • There is no mechanism to grant access. All users with AWS accounts are granted access by default.
  • There is no mechanism to keep an audit log for downloads or other usage of this data.
  • Users pay for access by covering their own EC2 costs, plus bandwidth charges if they choose to download the data out of AWS.
  • Hosting the data is free.

...

All files are kept completely private on S3, and we write a custom proxy that allows users with permission to download files, whether to locations outside AWS or to EC2 hosts.

Scenario:

  • See the details about the general S3 scenario.
  • This would not be compatible with Elastic MapReduce.  Users would have to download files to an EC2 host and then store them in their own bucket in S3.
  • The platform grants access to the data by facilitating the proxy of their request to S3.
  • Users do not need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.

...

  • See above details about the general S3 scenario.
  • This would not be compatible with Elastic MapReduce.  Users would have to download files to an EC2 host and then store them in their own bucket in S3.
  • The platform grants access to the data by creating pre-signed S3 URLs to Sage's private S3 files. These URLs are created on demand and have a short expiry time.
  • Users will not need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.
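
The pre-signing step described above can be sketched with the Python standard library alone. This is a minimal illustration, not the platform's implementation: it assumes S3's legacy query-string authentication (Signature Version 2, the scheme available when this page was written; current SDKs sign with Signature Version 4), and the function name, bucket, key, and credentials are all hypothetical.

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote


def presign_s3_url(access_key, secret_key, bucket, key, ttl_seconds=60, now=None):
    """Return a GET URL for a private S3 object that stops working after ttl_seconds."""
    # 'now' is injectable only so the function can be exercised deterministically.
    expires = int((time.time() if now is None else now) + ttl_seconds)
    resource = "/{}/{}".format(bucket, quote(key))
    # Signature Version 2 string to sign for a plain GET with no extra headers:
    # verb, Content-MD5, Content-Type, expiry time, canonicalized resource.
    string_to_sign = "GET\n\n\n{}\n{}".format(expires, resource)
    digest = hmac.new(
        secret_key.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha1
    ).digest()
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return (
        "https://s3.amazonaws.com{}?AWSAccessKeyId={}&Expires={}&Signature={}"
    ).format(resource, quote(access_key, safe=""), expires, signature)


# Hypothetical credentials and object names, for illustration only.
url = presign_s3_url(
    "AKIAEXAMPLEKEY", "example-secret", "sage-private-data", "layers/dataset1.zip"
)
print(url)
```

Because the expiry time is part of the signed string, a client cannot extend the window by editing the Expires parameter; the platform would create a fresh URL on each authorized request.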

...

  • See above details about the general S3 scenario.
  • This should work with Elastic MapReduce.  We would have to edit the bucket policy for each user wanting to use the data via EMR.
  • The platform grants access to the data by updating the bucket policy for the dataset to which the user would like access.
  • Users will need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.
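
Updating a bucket policy per user, as described above, amounts to adding that user's AWS account to the list of principals allowed to read the dataset. The sketch below only builds the policy document; it does not call AWS. The statement id, bucket name, and account ids are hypothetical, and it grants only s3:GetObject, whereas a real policy may need further actions.

```python
import copy
import json


def grant_read_access(policy, bucket, account_id):
    """Return a copy of policy in which account_id may read objects in bucket."""
    updated = copy.deepcopy(policy)
    principal = "arn:aws:iam::{}:root".format(account_id)
    sid = "SageDatasetReaders"  # hypothetical statement id
    for statement in updated["Statement"]:
        if statement.get("Sid") == sid:
            readers = statement["Principal"]["AWS"]
            if principal not in readers:
                readers.append(principal)
            return updated
    # No reader statement yet: create one with this account as its first principal.
    updated["Statement"].append({
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": [principal]},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::{}/*".format(bucket),
    })
    return updated


# Hypothetical bucket and account ids, for illustration only.
empty_policy = {"Version": "2012-10-17", "Statement": []}
policy = grant_read_access(empty_policy, "sage-dataset-1", "111122223333")
policy = grant_read_access(policy, "sage-dataset-1", "444455556666")
print(json.dumps(policy, indent=2))
```

Each new user grows the principal list of one statement, which is why this approach means editing the policy once for every user who wants access.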

...

  • S3 takes care of serving the files (once access is granted) so we don't have to run a proxy fleet.
  • Scalable
  • From the S3 logs we will be able to see who downloaded the data by their canonical user id or IP address and compute charge backs for our custom solution.
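
The charge-back computation mentioned above can be sketched as a pass over S3 server access log lines, summing bytes sent per requester. This is an illustration under assumptions: the sample log lines, requester ids, and price per gigabyte are invented, and the pattern reads only the leading fields of the access log format. Unauthenticated requests show "-" in the requester field, so the sketch falls back to the remote IP address in that case.

```python
import re
from collections import defaultdict

# Leading fields of an S3 server access log record; later fields are ignored.
LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes_sent>\S+)'
)


def bytes_downloaded_by_requester(log_lines):
    """Sum bytes sent for object downloads, keyed by requester id or IP address."""
    totals = defaultdict(int)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None or match.group("operation") != "REST.GET.OBJECT":
            continue
        sent = match.group("bytes_sent")
        if not sent.isdigit():
            continue  # "-" means no bytes were sent
        requester = match.group("requester")
        if requester == "-":
            requester = match.group("ip")  # anonymous request: fall back to IP
        totals[requester] += int(sent)
    return dict(totals)


def charge_back(totals, dollars_per_gb):
    """Convert byte totals into a charge, using a hypothetical flat rate."""
    return {who: sent / 1e9 * dollars_per_gb for who, sent in totals.items()}


# Invented sample records, for illustration only.
SAMPLE_LOG = [
    'owner-id sage-dataset-1 [06/Feb/2011:00:00:38 +0000] 192.0.2.3 user-a REQ1 '
    'REST.GET.OBJECT layers/data.zip "GET /layers/data.zip HTTP/1.1" 200 - '
    '1000000000 1000000000 70 10 "-" "curl/7.15.1" -',
    'owner-id sage-dataset-1 [06/Feb/2011:00:05:12 +0000] 192.0.2.3 user-a REQ2 '
    'REST.GET.OBJECT layers/more.zip "GET /layers/more.zip HTTP/1.1" 200 - '
    '500000000 500000000 40 9 "-" "curl/7.15.1" -',
    'owner-id sage-dataset-1 [06/Feb/2011:00:07:01 +0000] 192.0.2.9 - REQ3 '
    'REST.GET.OBJECT layers/data.zip "GET /layers/data.zip HTTP/1.1" 200 - '
    '250000000 1000000000 20 8 "-" "curl/7.15.1" -',
    'owner-id sage-dataset-1 [06/Feb/2011:00:09:44 +0000] 192.0.2.7 user-b REQ4 '
    'REST.PUT.OBJECT layers/new.zip "PUT /layers/new.zip HTTP/1.1" 200 - '
    '- 300 15 6 "-" "curl/7.15.1" -',
]

totals = bytes_downloaded_by_requester(SAMPLE_LOG)
print(totals)
print(charge_back(totals, dollars_per_gb=0.15))
```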

...

This is the newest mechanism for access control and is under heavy development in 2011 to expand its features. With IAM, a group of users (as opposed to AWS accounts) can be granted access to S3 resources. We decide who those users are. We store credentials for them on their behalf and use them to pre-sign S3 URLs. All activity is rolled up to a single bill. (Another approach would have been to let users hold their IAM credentials and sign their own requests. Unfortunately this will not scale because we would need a group per dataset to get the permissions right, and the mapping of users to groups scales linearly.)

"AWS Identity and Access Management is a web service that enables Amazon Web Services (AWS) customers to manage Users and User permissions in AWS. The service is targeted at organizations with multiple Users or systems that use AWS products such as Amazon EC2, Amazon SimpleDB, and the AWS Management Console. With IAM, you can centrally manage Users, security credentials such as access keys, and permissions that control which AWS resources Users can access.

...

  • See above details about the general S3 scenario.
  • This is not compatible with Elastic MapReduce.  IAM currently does not support EMR.
  • The platform grants access to the data by creating a new IAM user and adding them to the group that has access to the dataset.
  • Users do not need AWS accounts.
  • The platform has a custom solution for collecting payment from users to pay for outbound bandwidth when downloading outside the cloud.

...

  • S3 takes care of serving the files (once access is granted) so we don't have to run a proxy fleet.
  • Scalable
  • By pre-signing each S3 URL with an individual's credentials, we can accurately match up entries in the audit log and the access log to facilitate a strict accounting of actual downloads.
  • By not giving out IAM credentials to users, we can get away with a smaller number of groups.
  • Users do not need to give us their AWS credentials for any reason.
  • We will already be using IAM to manage access to AWS resources by Sage system administrators and Sage employees.
  • This will scale sufficiently to meet our needs for the embargo data use case. The default limit is 5,000 users. We can ask Deepak to raise that limit for us. Also note that the default limit will be raised within a year as AWS rolls out expanded support for federated access to AWS resources.
  • More features will come online for IAM over time

Cons:

  • We are trusting users to protect the credentials we have given them.  (This is the same as trusting someone to protect their Sage username and password.)
  • For users who want to use the REST API directly, this may be confusing. They will likely have their own AWS credentials plus Sage AWS credentials.
  • This currently will not scale sufficiently for the unrestricted data since a user may only be in 10 groups. (We can work around this by mapping one Sage user to multiple IAM users, but we don't want to do that if we do not have to.)
  • Potential for URL reuse: While the URL has not yet expired, it is possible for others to use that same URL to download files.
    • For example, if someone requests a download URL from the repository service (and remember that the repository service confirms that the user is authenticated and authorized before handing out that URL) and then that person emails the URL to his team, his whole team could start downloading the data layer as long as they all kick off their downloads within that one-minute window of time.
    • Of course, they could also download the data locally and then let their whole team have access to it.
    • This isn't much of a concern for unrestricted data because it's actually easier to just make a new account on the system with a new email address.
  • Potential for user confusion: If a user gets her download URL and somehow does not use it right away, she'll need to reload the web page or redo the R prompt command to get a fresh URL.
    • This isn't much of a concern if we ensure that the workflow of the clients (web or R) is to use the URL upon receipt.

Resources:

  • A comparison of Bucket Policies, ACLs, and IAM
  • More details on limits
  • Following are the default maximums for your entities:
    • Groups per AWS Account: 100
    • Users per AWS Account: 5000
    • Number of groups per User: 10 (that is, the User can be in this many groups) (also note that this limit can be increased, but it scales linearly)
    • Access keys per User: 2
    • Signing certificates per User: 2
    • MFA devices per User: 1
    • Server certificates per AWS Account: 10
    • You can request to increase the maximum number of Users or groups for your AWS Account.
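
To make the scaling concern concrete, the arithmetic behind these limits can be sketched as follows. The sketch assumes the alternative design of one IAM group per dataset, with a Sage user mapped to several IAM users once the 10-group limit is reached; the function names are hypothetical and the constants are the default limits quoted above.

```python
import math

GROUPS_PER_USER = 10      # default limit quoted above
USERS_PER_ACCOUNT = 5000  # default limit quoted above


def iam_users_needed(dataset_count, groups_per_user=GROUPS_PER_USER):
    """IAM users required for one Sage user, assuming one group per dataset."""
    return max(1, math.ceil(dataset_count / groups_per_user))


def sage_users_supported(dataset_count, users_per_account=USERS_PER_ACCOUNT):
    """Sage users that fit in one AWS account if each needs dataset_count groups."""
    return users_per_account // iam_users_needed(dataset_count)


for datasets in (10, 11, 25):
    print(datasets, iam_users_needed(datasets), sage_users_supported(datasets))
```

Under these assumptions, a Sage user with access to 25 datasets would consume three IAM users, which cuts the number of Sage users one account can hold to a third of the default.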

...

The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.

...