
User Owned S3 Buckets

Traditionally, Synapse files have been stored only in an AWS S3 bucket under Synapse's control.  A need has arisen to support organizing and exposing, through Synapse, files that reside in other buckets owned and controlled by users.

 

Use Cases

ISB

A project has one or a small number of very technical users in a role that we'll call the 'project admin'.  These are people like Larsson, Kristen, etc., who are responsible for managing data coming into a project and making that data available to the rest of the project users.  Project admins can be assumed to be highly technically skilled.  The project admin's job is to make raw data available to the rest of the project users in periodic versioned releases, following the TCGA 'data freeze' example.  The project admin and possibly other users will then be involved in processing the data in Synapse.  All of the processed data would fall into use case #1 (i.e. "#1" = "Synapse is the only writer to the bucket").

BRIDGE

The Raw Data is coming into the project from some outside source in which it is being generated over time. Examples are a Bridge project, or many of our 'omics projects like TCGA in which data is generated over time on a per-sample basis. From the point of view of the project the incoming data is considered 'raw': existing data is not likely to change, but more samples / time points are likely to be added to the data set over time.  This raw data may be large (many TB).  The raw data loading process may work most efficiently if it writes data directly to S3.  For example, in the case of Bridge, that system may generate data directly in a shared bucket.  In the 'omics case, we may set up some sort of high-throughput pipeline from the data generator to S3 (e.g. a protocol like Aspera), or even drop-ship the data on hard drives to Amazon for loading.
For the raw data, I think it will be helpful if the project admin can be responsible for managing a process of staging raw data in S3, and then running a script that makes that data available through Synapse to the rest of the project. We can put some of the responsibility for not destroying / overwriting data through back-end S3 manipulation onto this admin, and I think it can work if done for these fairly controlled sorts of cases.  Then, once we're into actually running an analysis pipeline, that pipeline should only talk to Synapse through the normal APIs.  I realize this is sort of a back door that creates some risk, but the admins are likely to be Sage employees: Bridge engineers or Data Science team members who need the back door as a more efficient way to set things up.
The Bridge use case is that Bridge will write new data to the bucket at regular intervals and Synapse will need to read these updated files (either because some code on the Bridge side calls Synapse to register these file entities, or because Synapse is monitoring this bucket / subdirectory and syncs new files as they show up). In this case, Bridge doesn't expect Synapse to ever write data back to the bucket.
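
To make this concrete, here is a rough sketch of what either path could look like, using the Synapse Python client and boto3.  The bucket name, prefix, parent folder ID, and sync watermark are all placeholders, and linking files as external URLs is just one possible registration mechanism, not a settled design.

```python
# Sketch: detect new objects under a bucket prefix and register them with
# Synapse as external files.  All names/IDs below are placeholders; the
# external-URL approach is one option, not the agreed design.
import datetime

import boto3
import synapseclient
from synapseclient import File

BUCKET = "example-bridge-bucket"   # placeholder bucket
PREFIX = "raw/"                    # placeholder subdirectory
PARENT_FOLDER = "syn000000"        # placeholder Synapse folder ID

syn = synapseclient.Synapse()
syn.login()  # credentials from the Synapse config file

s3 = boto3.client("s3")
# Watermark from the previous sync run; in practice this would be persisted.
last_sync = datetime.datetime(2015, 1, 1, tzinfo=datetime.timezone.utc)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj["LastModified"] <= last_sync:
            continue  # already registered on an earlier pass
        url = "https://{}.s3.amazonaws.com/{}".format(BUCKET, obj["Key"])
        # synapseStore=False links the URL rather than copying the bytes
        # into Synapse-controlled storage.
        syn.store(File(path=url, name=obj["Key"].split("/")[-1],
                       parent=PARENT_FOLDER, synapseStore=False))
```

The same two steps (list what's new, create a file entity pointing at each object) apply whether the loop runs on the Bridge side or as a Synapse-side monitor.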

General

There may be many projects with two categories of data:
  • "Raw Data" written to the bucket by some externally managed process, likely not edited in Synapse.
  • "Processed Data" created as Synapse users read the raw data and write new data into Synapse.
We'd like one external organization to easily foot the bill for both data sets, as well as any large scale computation on that data.  
Maybe this is easily achieved if we configure a Synapse project with 2 external buckets.  The processed data bucket would be the default upload location for the project, so as users work with Synapse all data naturally goes there.  The raw data bucket could be made available as a read-only data resource through Synapse, and it could be populated via some other mechanism, accessible only to the project admins.
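
One way to picture the split is in the permissions each bucket grants to Synapse.  A rough sketch, assuming a hypothetical Synapse principal ARN and placeholder bucket names (how Synapse actually identifies itself to the bucket is still an open design point):

```python
# Sketch: contrasting bucket policies for the two buckets.  The Synapse
# principal ARN and the bucket names are hypothetical placeholders.
import json

import boto3

SYNAPSE_PRINCIPAL = "arn:aws:iam::111122223333:user/synapse"  # hypothetical

def bucket_policy(bucket, object_actions):
    """Grant the Synapse principal ListBucket plus the given object actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Principal": {"AWS": SYNAPSE_PRINCIPAL},
             "Action": "s3:ListBucket",
             "Resource": "arn:aws:s3:::{}".format(bucket)},
            {"Effect": "Allow",
             "Principal": {"AWS": SYNAPSE_PRINCIPAL},
             "Action": object_actions,
             "Resource": "arn:aws:s3:::{}/*".format(bucket)},
        ],
    }

s3 = boto3.client("s3")
# Raw data bucket: Synapse only ever reads from it.
s3.put_bucket_policy(
    Bucket="example-raw-bucket",
    Policy=json.dumps(bucket_policy("example-raw-bucket", ["s3:GetObject"])))
# Processed data bucket: the project's default upload location, so Synapse
# needs to write to it as well.
s3.put_bucket_policy(
    Bucket="example-processed-bucket",
    Policy=json.dumps(bucket_policy("example-processed-bucket",
                                    ["s3:GetObject", "s3:PutObject"])))
```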

Downloading large files

For exporting large BAM datasets, the following procedure applies:

  • User creates S3 bucket in us-east with permissions for Synapse to copy into
  • User indicates which files to copy
  • Synapse copies the files to the user's S3 bucket (see the sketch after this list)
  • User copies data from their bucket to final destination
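
A sketch of the two copy steps, using boto3's managed transfers (which switch to multipart copies for large objects, which matters for multi-GB BAM files).  Bucket names and keys are placeholders; the first copy would run with Synapse's credentials, the second with the user's:

```python
# Sketch: steps 3 and 4 of the procedure above.  Bucket names and keys are
# placeholders.  boto3's managed copy/download handles multipart transfers.
import boto3

s3 = boto3.client("s3")

# Step 3: Synapse copies the requested file from its own bucket into the
# user's export bucket (a server-side copy; the bytes stay inside S3).
s3.copy(
    CopySource={"Bucket": "synapse-controlled-bucket", "Key": "path/to/sample.bam"},
    Bucket="users-export-bucket",
    Key="exports/sample.bam",
)

# Step 4: the user moves the data from their bucket to its final
# destination, e.g. another bucket...
s3.copy(
    CopySource={"Bucket": "users-export-bucket", "Key": "exports/sample.bam"},
    Bucket="users-final-bucket",
    Key="bam/sample.bam",
)
# ...or downloads it to local / cluster storage.
s3.download_file("users-export-bucket", "exports/sample.bam", "/data/sample.bam")
```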

Open Questions

- How is a file handle created from an uploaded file in Synapse today?  What file naming conventions or S3 metadata are required (e.g. MD5-hash, content-type)?

- Is it OK to let Synapse reference an S3 file that it does not unilaterally control?  What happens if an S3 object is modified or missing when Synapse goes to retrieve it?

- Do we need to support buckets which are read-only (or write-only) to Synapse?

- What are the S3 operations users expect to use on their files?  Do they rename files, move them between buckets, or make backups?  Do they expect to use S3 versioning features?

 

Design Approach

  • User would grant permission for Synapse to access their S3 bucket (e.g. by referencing the Synapse IAM user in their bucket's permission settings)
  • To be flexible, have the user specify both bucket and key (e.g. a subfolder)
  • A bucket descriptor, once defined, could be applied to multiple Synapse projects (see the sketch below)
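
As a sketch of the 'bucket descriptor' idea, a descriptor could be little more than a bucket plus a key prefix, reusable across projects, with a quick check that the granted permissions actually allow listing under that prefix.  The field names and the validation step are illustrative, not a committed interface:

```python
# Sketch: a reusable bucket descriptor (bucket + key prefix) that could be
# attached to more than one Synapse project.  Names are illustrative only.
from dataclasses import dataclass

import boto3

@dataclass(frozen=True)
class S3BucketDescriptor:
    bucket: str
    base_key: str  # subfolder / prefix within the bucket; "" means the whole bucket

    def validate_access(self):
        """Check that the caller's credentials (standing in for the Synapse
        IAM user) can at least list objects under the prefix."""
        s3 = boto3.client("s3")
        s3.list_objects_v2(Bucket=self.bucket, Prefix=self.base_key, MaxKeys=1)

# The same descriptor could back the upload location of several projects.
shared_raw_data = S3BucketDescriptor(bucket="example-org-bucket", base_key="raw/")
shared_raw_data.validate_access()
```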

JIRA

Note:  This may quickly become out of date, but it's useful to have quick links to the current issues.

https://sagebionetworks.jira.com/browse/PLFM-3134

https://sagebionetworks.jira.com/browse/PLFM-3149
https://sagebionetworks.jira.com/browse/SYNR-859
https://sagebionetworks.jira.com/browse/SYNPY-179
https://sagebionetworks.jira.com/browse/SWC-1989
https://sagebionetworks.jira.com/browse/SWC-1990