...

  • metadata
    • how to ensure we have metadata for all stuff in the cloud
  • file formats
    • tar archives or individual files on S3?
    • EBS block devices per dataset?
  • file layout
    • how to organize what we have
    • how can we enforce a clean layout for files and EBS volumes?
    • how to keep track of what we have
  • access patterns
    • we want to make the right thing be the easy thing - make it easy to do computation in the cloud
    • file download will be supported but will not be the recommended use case
    • recommendations and examples from the R prompt for interacting with the data when working on EC2
  • security
    • not all data is public
    • encryption or clear text?
      • key management
    • one time urls?
    • intrusion detection
    • how to manage ACLs and bucket policies
      • are there scalability upper bounds on ACLs? e.g., can't add more than X AWS accounts to an ACL
  • auditability
    • how to have audit logs
    • how to download them and make use of them
  • human data and regulations
    • what recommendations do we make to people getting some data from Sage and some data from dbGaP and co-mingling that data in the cloud
  • monitoring - what should be monitored
    • access patterns
    • who
    • when
    • what
    • how much
      • data footprint
      • upload bandwidth
      • download bandwidth
      • archive unused data to cheaper storage
  • cost
    • read vs. write
    • cost of allowing writes
    • cost of keeping same data in multiple formats
    • can we take advantage of the free hosting for http://aws.amazon.com/datasets even though we want to keep an audit log?
    • how to meter and bill customers for usage
  • operations
    • how to make it efficient to manage
    • reduce the burden of administrative tasks
    • how to enable multiple administrators
  • how long does it take to get files up/down?
    • upload speeds - we are on the National LambdaRail
    • shipping hard drives
  • durability
    • data corruption
    • data loss
  • scalability
    • if possible, we want to only be the access grantors and then let the hosting provider take care of enforcing access controls and vending data

High Level Use Cases

Assume tens of thousands of users will eventually use the platform.

Assume that the initial cloud we target is AWS but we plan to support additional clouds in the future.

Assume that users who only want to download do not need to have an AWS account.

Assume that anyone wanting to interact with data on EC2 or Elastic MapReduce will have an AWS account and will be willing to give us their account number (which is safe to share because it is not an AWS credential).

Users want to:

  • download a public dataset
  • download a protected dataset for which the platform has granted them access
  • use a public dataset on EC2
  • use a protected dataset for which the platform has granted them access on EC2
  • use a public dataset on Elastic MapReduce
  • use a protected dataset for which the platform has granted them access on Elastic MapReduce

The list above is more exhaustive, but what follows are some firm and some loose requirements:

  • enforce access restrictions on protected datasets
  • log downloads
  • log EC2/EMR usage
  • figure out how to monitor user usage such that we could potentially charge them for usage
  • think about how to minimize costs
  • think about how to ensure that users sign a EULA before getting access to data

Options to Consider

AWS Public Data Sets

This is currently EBS only, but it will also be available for S3 in the future.

This won't work for protected data unless the future implementation provides more support.

CloudFront Private Content

http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/index.html?PrivateContent.html#CreateFinalURL

This could be very good for the download use cases.

Note that this is likely a bad solution for the EC2/EMR use cases: CloudFront serves content from edge locations outside the AWS regions, so users will incur inbound bandwidth charges when they pull the data down to EC2/EMR.

DevPay

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?UsingDevPay.html

S3

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/index.html?ServerLogs.html
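For the "log downloads" and "who / when / what / how much" questions, S3 server access logs could be summarized per requester. The following is a minimal sketch, assuming the space-delimited access log layout described in the S3 documentation linked above; the sample record, the `OWNERID` and `REQUESTERID` placeholders, and the function names are illustrative, not real log data or part of any AWS library.

```python
import re
from collections import defaultdict

# A token is a [bracketed timestamp], a "quoted string", or a run of non-space characters.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def parse_log_line(line):
    """Pick out the audit-relevant fields of one S3 server access log record."""
    t = TOKEN.findall(line)
    return {
        "bucket": t[1],
        "time": t[2].strip("[]"),
        "remote_ip": t[3],
        "requester": t[4],      # canonical user ID, or "-" for anonymous requests
        "operation": t[6],
        "key": t[7],
        "status": t[9],
        "bytes_sent": 0 if t[11] == "-" else int(t[11]),
    }


def bytes_downloaded_by_requester(lines):
    """Sum bytes sent for successful object GETs, grouped by requester."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        rec = parse_log_line(line)
        if rec["operation"] == "REST.GET.OBJECT" and rec["status"].startswith("2"):
            totals[rec["requester"]] += rec["bytes_sent"]
    return dict(totals)


# Illustrative record only; the IDs, bucket, and key are placeholders.
SAMPLE = (
    'OWNERID examplebucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 '
    'REQUESTERID 3E57427F3EXAMPLE REST.GET.OBJECT datasets/example.tar '
    '"GET /examplebucket/datasets/example.tar HTTP/1.1" 200 - 2662992 2662992 70 10 '
    '"-" "curl/7.0" -'
)

if __name__ == "__main__":
    print(bytes_downloaded_by_requester([SAMPLE]))
```

Per-requester byte totals like these are one possible input for the metering and billing question raised above. Note that access logs only cover requests made to S3 itself, not reads from EBS volumes.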

S3 ACL

There is a maximum of 100 grants per ACL.

Can grants be made to groups that we manage ourselves?

S3 Bucket Policies
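Since users are assumed to give us their AWS account numbers, a bucket policy could grant read access to a dataset prefix for a list of accounts. The following is a minimal sketch of generating such a policy document; the bucket name, prefix, account numbers, and the `read_only_policy` helper are hypothetical, and applying the policy to a bucket is left out.

```python
import json


def read_only_policy(bucket, prefix, account_ids):
    """Build a bucket policy that lets the given AWS accounts GET objects under a prefix."""
    for account in account_ids:
        # AWS account numbers are 12 digits; reject anything else early.
        if not (len(account) == 12 and account.isdigit()):
            raise ValueError(f"not a 12-digit AWS account number: {account!r}")
    principals = [f"arn:aws:iam::{a}:root" for a in sorted(set(account_ids))]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrantDatasetRead",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket, prefix, and account numbers.
    policy = read_only_policy(
        "example-datasets", "protected/study1", ["111122223333", "444455556666"]
    )
    print(json.dumps(policy, indent=2))
```

Whether one policy can hold enough principals for tens of thousands of users is an open question, just as it is for the grant limit on ACLs.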

EBS snapshot ACL

http://docs.amazonwebservices.com/AWSEC2/latest/APIReference/index.html?ApiReference-query-ModifySnapshotAttribute.html

What is the maximum number of grants for this?

Custom Proxy

All files are kept completely private on S3, and we write a custom proxy that lets users with permission download files, whether to locations outside AWS or to EC2 hosts.
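One way such a proxy could avoid streaming the bytes itself is to check permission and then hand the user a time-limited S3 URL. The following is a minimal sketch, assuming the legacy query string authentication (signature version 2) scheme from the 2006-03-01 S3 developer guide; newer AWS regions require signature version 4 instead. The `grants` mapping, the user and dataset names, and the credentials are hypothetical placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def presigned_get_url(bucket, key, access_key_id, secret_key, expires_at):
    """Build a signature-version-2 query string URL valid until expires_at (epoch seconds)."""
    resource = f"/{bucket}/{quote(key)}"
    # For a plain GET, the Content-MD5 and Content-Type lines are empty.
    string_to_sign = f"GET\n\n\n{expires_at}\n{resource}"
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    signature = quote(base64.b64encode(digest).decode(), safe="")
    return (
        f"https://s3.amazonaws.com{resource}"
        f"?AWSAccessKeyId={access_key_id}&Expires={expires_at}&Signature={signature}"
    )


def issue_download_url(user, key, grants, bucket, access_key_id, secret_key, expires_at):
    """Hypothetical proxy step: refuse unless the platform has granted this user the dataset."""
    if key not in grants.get(user, set()):
        raise PermissionError(f"{user} has no access to {key}")
    return presigned_get_url(bucket, key, access_key_id, secret_key, expires_at)


if __name__ == "__main__":
    # Placeholder grant table and credentials, for illustration only.
    grants = {"alice": {"datasets/example.tar"}}
    print(issue_download_url("alice", "datasets/example.tar", grants,
                             "examplebucket", "AKIDEXAMPLE", "not-a-real-secret", 1700000000))
```

A URL like this expires but is not strictly one-time: anyone holding it can reuse it until the expiry, so short lifetimes and access logging would still matter.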

Additional Details

Network Bandwidth

The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.
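To put the 10 GbE figure in terms of the "how long does it take to get files up/down?" question, a back-of-the-envelope estimate can help when weighing network upload against shipping hard drives. The following is a minimal sketch; the `efficiency` factor and the dataset sizes are assumptions for illustration, not measurements of this link, and real throughput to AWS will depend on the whole path rather than the backbone link alone.

```python
def transfer_seconds(size_tb, link_gbps, efficiency=1.0):
    """Estimate seconds to move size_tb terabytes (10**12 bytes each) over a link.

    efficiency is the assumed fraction of the nominal link rate actually achieved.
    """
    bits = size_tb * 8e12
    return bits / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    # 1 TB at the full nominal 10 Gb/s rate: 800 seconds.
    print(transfer_seconds(1, 10))
    # Hypothetical 50 TB archive at an assumed 25% of nominal rate, in hours.
    print(transfer_seconds(50, 10, efficiency=0.25) / 3600)
```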

...