Cloud Object Storage Requirements for Synapse
Introduction
One of the core features of Synapse is to act as a data repository for scientific data. Data contributors typically organize their data into top-level projects that contain various levels of folders and files. Data contributors manage access to their files and folders by setting permissions at the project, folder, or file level. Files can be uploaded and downloaded using any of the Synapse clients: R-Client, Python-Client, Web-Client, or Command-Line-Client. Each client provides high-level features for uploading files to and downloading files from Synapse. All clients implement these high-level upload and download features using the same set of low-level, generic Synapse File REST API calls. A single high-level upload or download request translates to a series of low-level REST API calls.
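For example, a single high-level download with the Python client (synapseclient) might look like the sketch below; the Synapse ID and access token are placeholders, and the comments summarize the kind of low-level REST calls the client issues on the caller's behalf.

```python
import synapseclient

syn = synapseclient.Synapse()
syn.login(authToken="<personal-access-token>")  # placeholder credential

# One high-level call to download a file...
entity = syn.get("syn123", downloadLocation=".")
# ...is implemented by the client as a series of low-level REST calls:
# look up the entity, request a temporary download URL, then fetch the
# file bytes directly from that URL.
```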
While the Synapse File REST API was built on top of Amazon's S3, it was designed to hide all of the S3-specific details from the clients. This design decision was made so that support for additional cloud object storage providers could be added without changing the Synapse REST API. However, there is a required set of features that a cloud object storage provider must support for an integration with Synapse to be successful. This document outlines the required features for a cloud object storage integration with Synapse.
File Download
To download a file from Synapse, a client must request a temporary URL from Synapse. For example, the following call is used to get a temporary URL for a Synapse File entity: GET /entity/{id}/file. If the caller is authorized to access the requested file, Synapse will return a temporary URL to the caller. The temporary URL must directly reference an object in the private object store and should include a signature or token signed using a shared secret between Synapse and the object store provider. The temporary URL must function such that if any browser is redirected to it, the browser will download the file directly using an HTTPS GET.
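As an illustration, the following is a minimal sketch of the download flow using Python and the requests library. The base URL, entity ID, and access token are placeholders, and the sketch assumes Synapse hands back the temporary URL as a redirect, as described above.

```python
import requests

SYNAPSE = "https://repo-prod.prod.sagebase.org/repo/v1"  # assumed base URL
headers = {"Authorization": "Bearer <access-token>"}     # placeholder token

# Ask Synapse for the file; Synapse authorizes the caller and redirects
# to a temporary, signed URL that points directly at the object store.
resp = requests.get(f"{SYNAPSE}/entity/syn123/file",
                    headers=headers, allow_redirects=False)
temporary_url = resp.headers["Location"]  # the temporary (pre-signed) URL

# Anyone holding the temporary URL can fetch the bytes with a plain HTTPS GET
# until the signature expires; no Synapse credentials are needed at this step.
with requests.get(temporary_url, stream=True) as download:
    download.raise_for_status()
    with open("local-copy.bin", "wb") as out:
        for chunk in download.iter_content(chunk_size=1024 * 1024):
            out.write(chunk)
```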
File Upload
File upload must be resilient to temporary network failures. This translates to uploading large files as multiple parts. For example, given a part size of 5 MB, a single 100 MB file would be uploaded as twenty 5 MB parts. A failure while uploading any single part must not invalidate the entire file upload. This means a client is expected to re-try the upload of a part if it fails, but is not forced to re-try the entire file upload. Clients must also be able to upload parts in parallel using multiple threads. After all parts have been uploaded, the client is expected to make a final call to concatenate all of the parts into a single object. A caller must be able to download a multi-part uploaded file using a single temporary URL (not a URL for each part). See Chunked File Upload API.
To upload a single part, the client must request a temporary URL from Synapse that can be used with an HTTPS PUT to upload the part data directly to the object store provider.
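To make these requirements concrete, the following is a minimal sketch of the client-side flow, assuming a 5 MB part size as in the example above. The get_presigned_url_for_part and finalize_upload callables are hypothetical stand-ins for the Synapse chunked-upload REST calls; only the structure (per-part retry, parallel uploads, final concatenation call) reflects the requirements described here.

```python
import concurrent.futures
import os
import requests

PART_SIZE = 5 * 1024 * 1024  # 5 MB parts, as in the example above


def upload_part(presigned_put_url: str, data: bytes, attempts: int = 3) -> None:
    """PUT one part directly to the object store, retrying on failure."""
    for attempt in range(attempts):
        try:
            resp = requests.put(presigned_put_url, data=data)
            resp.raise_for_status()
            return
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # only this part failed; other parts are unaffected


def upload_file(path: str, get_presigned_url_for_part, finalize_upload) -> None:
    """Split the file into parts, upload them in parallel, then concatenate.

    get_presigned_url_for_part(part_number) and finalize_upload(part_count)
    are hypothetical stand-ins for the Synapse chunked-upload REST calls.
    """
    size = os.path.getsize(path)
    part_count = (size + PART_SIZE - 1) // PART_SIZE  # e.g. 100 MB -> 20 parts

    def do_part(part_number: int) -> None:
        with open(path, "rb") as f:
            f.seek((part_number - 1) * PART_SIZE)
            data = f.read(PART_SIZE)
        upload_part(get_presigned_url_for_part(part_number), data)

    # Parts are independent, so they can be uploaded by multiple threads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(do_part, range(1, part_count + 1)))

    # One final call asks the service to concatenate the parts into one object.
    finalize_upload(part_count)
```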
Object Store Feature Requirements
- The object store must be private and secured. Every read and write operation must be authenticated and authorized. Such a feature must be similar to a private S3 bucket.
- Synapse must be able to issue temporary URLs that can be used for direct file download from the object store. Such a feature must be similar to Amazon's S3 pre-signed URLs.
- The object store must support file upload in parts. Such a feature must be similar to Amazon's S3 Multipart Upload.
- Synapse must be able to issue temporary URLs that can be used for direct upload of each part. Amazon did not support pre-signed URLs for the parts of a multipart upload; however, it does support pre-signed URLs for non-multipart file uploads, and it also supports concatenating non-multipart objects into a single object with the multipart API. To make pre-signed multi-part uploads work with S3, Synapse uses a combination of these two supported features, as sketched below.
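The sketch below illustrates that combination using boto3 directly; the bucket name and object keys are hypothetical, and Synapse's actual key layout and signing service will differ. Each part is first PUT as its own temporary object via a pre-signed URL, then the parts are concatenated server-side with the multipart copy API. Note that every part except the last must be at least 5 MB for the copy to succeed.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-synapse-bucket"  # hypothetical bucket name

# 1. Each part is uploaded as its own temporary object using a pre-signed
#    PUT URL (S3 supports pre-signing plain put_object requests).
part_keys = ["uploads/abc/part-1", "uploads/abc/part-2"]  # hypothetical keys
presigned_puts = [
    s3.generate_presigned_url(
        "put_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=900)
    for key in part_keys
]
# ...the client PUTs each part's bytes to its pre-signed URL...

# 2. The parts are then concatenated server-side: a multipart upload is
#    started and each temporary part object is copied in as one part.
final_key = "uploads/abc/final-object"
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=final_key)
parts = []
for number, key in enumerate(part_keys, start=1):
    copy = s3.upload_part_copy(
        Bucket=BUCKET, Key=final_key, UploadId=mpu["UploadId"],
        PartNumber=number, CopySource={"Bucket": BUCKET, "Key": key})
    parts.append({"ETag": copy["CopyPartResult"]["ETag"], "PartNumber": number})

s3.complete_multipart_upload(
    Bucket=BUCKET, Key=final_key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts})
```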