Dataset Hosting Design
Assumptions
Where
For the near term we are using AWS as our external hosting partner. They have agreed to support our efforts for CTCAP. Over time we anticipate adding further external hosting partners such as Google and Microsoft, since different scientists will want to take advantage of different clouds.
We can also imagine that the platform should hold locations of files in internal hosting systems, even though not all users of the platform would have access to files in those locations.
Metadata references to hosted data files should be modelled as a collection of Locations, where a Location could be of many types:
- an S3 URL
- a Google Storage URL
- an Azure Blobstore URL
- an EBS snapshot id
- a filepath on a Sage internal server
- ....
Class Location
    String provider // AWS, Google, Azure, Sage cluster – people will want to set a preferred cloud to work in
    String type // filepath, download url, S3 url, EBS snapshot name
    String location // the actual uri or path
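As a minimal sketch of the Location model described above, the following Python shows one way the three fields could be represented and how a user's preferred cloud might be used to choose among several locations for the same file. The `pick_location` helper, the fallback rule, and the example URIs are hypothetical and are not part of the design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    provider: str  # e.g. "AWS", "Google", "Azure", "Sage cluster"
    type: str      # e.g. "filepath", "download url", "S3 url", "EBS snapshot name"
    location: str  # the actual URI or path


def pick_location(locations: List[Location], preferred_provider: str) -> Optional[Location]:
    """Return the first location at the preferred provider.

    Falls back to the first location of any provider (an assumed rule),
    or None when the list is empty.
    """
    for loc in locations:
        if loc.provider == preferred_provider:
            return loc
    return locations[0] if locations else None


# Hypothetical dataset hosted in two places; both URIs are made up.
dataset_locations = [
    Location("Sage cluster", "filepath", "/data/example/dataset.tar"),
    Location("AWS", "S3 url", "s3://example-bucket/dataset.tar"),
]
chosen = pick_location(dataset_locations, preferred_provider="AWS")
```

Keeping the locations as a flat list makes it cheap to add a new provider later: only new `provider` and `type` values are needed, not a schema change.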
What
For now we are assuming that we are dealing with files. Later on we can also envision providing access to data stored in a database and/or a data warehouse.
Design Considerations
- metadata
- how to ensure we have metadata for all stuff in the cloud
- file formats
- tar archives or individual files on S3?
- EBS block devices per dataset?
- file layout
- how to organize what we have
- how can we enforce a clean layout for files and EBS volumes?
- how to keep track of what we have
- access patterns
- we want to make the right thing be the easy thing - make it easy to do computation in the cloud
- file download will be supported but will not be the recommended use case
- recommendations and examples from the R prompt for interacting with the data when working on EC2
- security
- not all data is public
- encryption or clear text?
- key management
- one time urls?
- intrusion detection
- how to manage ACLs and bucket policies
- are there scalability upper bounds on ACLs? e.g., can't add more than X AWS accounts to an ACL
- auditability
- how to have audit logs
- how to download them and make use of them
- human data and regulations
- what recommendations do we make to people who get some data from Sage and some data from dbGaP and co-mingle that data in the cloud?
- monitoring - what should be monitored
- access patterns
- who
- when
- what
- how much
- data footprint
- upload bandwidth
- download bandwidth
- archive unused data to cheaper storage
- cost
- read vs. write
- cost of allowing writes
- cost of keeping same data in multiple formats
- can we take advantage of the free hosting for http://aws.amazon.com/datasets even though we want to keep an audit log?
- operations
- how to make it efficient to manage
- reduce the burden of administrative tasks
- how to enable multiple administrators
- how long does it take to get files up/down?
- upload speeds - we are on National LambdaRail
- shipping hard drives
- durability
- data corruption
- data loss
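One common way to detect the data corruption mentioned above is to record a checksum for each file at upload time and recompute it later. The sketch below uses only the Python standard library; the function names are hypothetical, and the choice of MD5 is an assumption made for illustration, not a decision recorded in this design.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that large dataset files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected_md5: str) -> bool:
    """Compare a file's current checksum with the one recorded at upload time."""
    return file_md5(path) == expected_md5.lower()
```

Storing the expected checksum alongside each Location in the metadata would let the same check run after an upload, a download, or a copy between providers.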
Details
Network Bandwidth
The Pacific Northwest Gigapop is the point of presence for the Internet2/Abilene network in the Pacific Northwest. The PNWGP is connected to the Abilene backbone via a 10 GbE link. In turn, the Abilene Seattle node is connected via OC-192 links to both Sunnyvale, California and Denver, Colorado.
PNWGP offers two types of Internet2/Abilene interconnects: Internet2/Abilene transit services and Internet2/Abilene peering at Pacific Wave International Peering Exchange. See Participant Services for more information.
The Corporation for Education Network Initiatives in California (CENIC) and Pacific NorthWest GigaPoP (PNWGP) announced two 10 Gigabit per second (Gbps) connections to Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2) for the use of CENIC's members in California, as well as PNWGP's multistate K-20 research and education community.
http://findarticles.com/p/news-articles/wireless-news/mi_hb5558/is_20100720/cenic-pacific-northwest-partner-develop/ai_n54489237/
http://www.internet2.edu/maps/network/connectors_participants
http://www.pnw-gigapop.net/partners/index.html
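To put the 10 Gbps figure in context, a rough upper bound on transfer time can be worked out from dataset size and link speed. The sketch below is back-of-the-envelope arithmetic only: the `efficiency` factor is an assumed placeholder for protocol overhead and link sharing, not a measured value, and real throughput to S3 may be far lower than the link rate.

```python
def transfer_hours(size_terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to move size_terabytes over a link_gbps link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits per second).
    efficiency is the assumed fraction of the link actually usable.
    """
    bits = size_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0


# 1 TB over a fully available 10 Gbps link: 8e12 bits / 1e10 bits per second = 800 seconds.
best_case_hours = transfer_hours(1, 10)

# The same transfer if only half of the link were usable (assumed figure).
half_link_hours = transfer_hours(1, 10, efficiency=0.5)
```

Comparing estimates like these against the turnaround time for shipping hard drives is one way to decide which upload route suits a given dataset size.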