User Owned S3 Buckets
Traditionally, Synapse files have been stored only in an AWS S3 bucket under Synapse's control. A need has arisen to support organizing and exposing, through Synapse, files that reside in other buckets owned and controlled by users.
Use Cases
ISB
BRIDGE
The raw data comes into the project from some outside source, where it is generated over time. Examples are a Bridge project, or many of our 'omics projects like TCGA, in which data is generated over time on a per-sample basis. From the point of view of the project the incoming data is considered 'raw': existing data is not likely to change, but more samples / time points will likely be added to the data set over time. This raw data may be large (many TB). The raw data loading process may work most efficiently if it writes data directly to S3. For example, in the case of Bridge, that system may generate data directly in a shared bucket. In the 'omics case, we may set up a high-throughput pipeline from the data generator to S3 (e.g. a protocol like Aspera), or even drop-ship the data on hard drives to Amazon for loading.
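As a rough illustration of the direct-to-S3 loading path, the sketch below uses boto3 to push one per-sample raw file straight into a shared bucket. The bucket name, key prefix, and sample layout are assumptions for illustration only, not part of any existing Bridge or 'omics pipeline.

```python
# Sketch: write newly generated per-sample raw data directly to a shared S3
# bucket, bypassing an upload through Synapse. Bucket name and key layout
# are hypothetical.
import os
import boto3

s3 = boto3.client("s3")

BUCKET = "org-bridge-raw-data"   # hypothetical shared bucket
PREFIX = "study-123/raw"         # hypothetical per-study prefix

def load_sample(sample_id: str, local_path: str) -> str:
    """Upload one sample's raw file and return the S3 key written."""
    key = f"{PREFIX}/{sample_id}/{os.path.basename(local_path)}"
    # upload_file switches to multipart upload automatically for large
    # objects, which matters for multi-GB 'omics files.
    s3.upload_file(local_path, BUCKET, key)
    return key

# Example: load_sample("sample-0001", "/data/sample-0001.bam")
```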
General
Downloading large files
For exporting large BAM datasets, the procedure is as follows (a user-side sketch appears after the list):
- User creates S3 bucket in us-east with permissions for Synapse to copy into
- User indicates which files to copy
- Synapse copies the files to the user's S3 bucket
- User copies data from their bucket to final destination
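A minimal sketch of the user-side steps (1 and 4 above) using boto3. The bucket and destination names are placeholders; the Synapse-side copy (steps 2-3) is out of scope here.

```python
# Sketch of the user-side steps: create a destination bucket in us-east-1,
# then pull the exported objects down once Synapse has copied them in.
# Bucket/key names are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

EXPORT_BUCKET = "my-synapse-export"   # hypothetical bucket the user owns

# Step 1: create the bucket. For us-east-1 no CreateBucketConfiguration
# (LocationConstraint) is supplied; it is the default region.
s3.create_bucket(Bucket=EXPORT_BUCKET)

# ... user indicates files in Synapse, Synapse copies them in (steps 2-3) ...

# Step 4: copy the exported data from the bucket to its final destination,
# here simply downloading each object to local disk.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=EXPORT_BUCKET):
    for obj in page.get("Contents", []):
        s3.download_file(EXPORT_BUCKET, obj["Key"], obj["Key"].replace("/", "_"))
```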
Open Questions
- How is a file handle created from an uploaded file in Synapse today? What file naming conventions or S3 metadata are required (e.g. MD5 hash, content type)? (A metadata sketch follows this list.)
- Is it OK to let Synapse reference an S3 file that it does not unilaterally control? What happens if an S3 object has been modified or is missing when Synapse goes to retrieve it?
- Do we need to support buckets which are read-only (or write-only) to Synapse?
- What are the S3 operations users expect to use on their files? Do they rename, move between buckets, make backups? Do they expect to use S3 versioning features?
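As context for the metadata question above, the sketch below computes an MD5 and guesses a content type for a local file, then attaches them when uploading to S3. Whether Synapse requires exactly these fields, or expects them on the S3 object versus on the file handle, is an assumption to be confirmed; the bucket and key are placeholders.

```python
# Sketch: compute an MD5 and content type for a file and record them when
# uploading to S3. Whether/where Synapse expects this metadata is an open
# question; bucket and key are placeholders.
import boto3
import hashlib
import mimetypes

def upload_with_metadata(local_path: str, bucket: str, key: str) -> dict:
    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    content_type = mimetypes.guess_type(local_path)[0] or "application/octet-stream"

    boto3.client("s3").upload_file(
        local_path, bucket, key,
        ExtraArgs={
            "ContentType": content_type,
            # Stored here as S3 user metadata; a file handle might instead
            # carry the MD5 as one of its own fields.
            "Metadata": {"content-md5": md5.hexdigest()},
        },
    )
    return {"contentMd5": md5.hexdigest(), "contentType": content_type}

# Example: upload_with_metadata("/data/sample.bam", "my-bucket", "raw/sample.bam")
```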
Design Approach
- user would grant permission for Synapse to access their S3 bucket (e.g. by referencing the Synapse IAM user in their bucket's permission settings; see the policy sketch after this list)
- to be flexible, have the user specify both a bucket and a key prefix (e.g. a subfolder)
- a bucket descriptor, once defined, could be applied to multiple Synapse projects.
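One possible shape of the permission grant in the first bullet, sketched with boto3 below: a bucket policy that lets a Synapse IAM principal read and write objects under a chosen key prefix. The principal ARN, bucket name, and prefix are placeholders; the actual principal Synapse would use has to come from Synapse.

```python
# Sketch: grant a Synapse IAM principal access to a single key prefix in a
# user-owned bucket via a bucket policy. The principal ARN, bucket name, and
# prefix are placeholders, not real Synapse values.
import json
import boto3

BUCKET = "my-lab-data"                                        # hypothetical
PREFIX = "synapse-managed/"                                   # hypothetical
SYNAPSE_PRINCIPAL = "arn:aws:iam::123456789012:user/synapse"  # placeholder ARN

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SynapseObjectAccess",
            "Effect": "Allow",
            "Principal": {"AWS": SYNAPSE_PRINCIPAL},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        },
        {
            "Sid": "SynapseListPrefix",
            "Effect": "Allow",
            "Principal": {"AWS": SYNAPSE_PRINCIPAL},
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": f"{PREFIX}*"}},
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```

Restricting the grant to a prefix keeps the rest of the user's bucket private and fits the design note above about specifying both a bucket and a key.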
JIRA
Note: This may quickly become out of date, but it's useful to have quick links to the current issues.
https://sagebionetworks.jira.com/browse/PLFM-3134
https://sagebionetworks.jira.com/browse/PLFM-3149
https://sagebionetworks.jira.com/browse/SYNR-859
https://sagebionetworks.jira.com/browse/SYNPY-179
https://sagebionetworks.jira.com/browse/SWC-1989
https://sagebionetworks.jira.com/browse/SWC-1990