
...

Currently, each file in Synapse must be downloaded individually.  This works well for many use cases, but in some cases it leads to performance issues.  To download a file, a client must first request a pre-signed URL.  The file can then be downloaded using an HTTPS GET on the returned pre-signed URL.  For large files, the time spent on the actual download far exceeds the time spent getting the URL.  For a small file, the request for the pre-signed URL can take as long as the actual file download.  This can be a significant bottleneck for some use cases.  For example, a user may need to download all of the files from a table entity with one or more file columns.  If all of the files are small, then the requests for the pre-signed URLs become the bottleneck.  Therefore, we propose adding a new asynchronous service to allow users to download multiple files with a single request.

...

If user B were to attempt to directly access a pre-signed URL for the same file using GET /fileHandle/1/url, then Synapse would return an unauthorized result (403) even though they are authorized to download it through the association with entityId=123.  The reason for this is that Synapse does not have the means to look up the associated object given a file handle id.  This is based on how the file data is stored in Synapse:

  • Trivial: Given an object.id and object.type lookup the associated fileHandle.id
  • Non-Trivial: Given fileHandle.id lookup the associated object.id and object.type.

We need to change how file data is stored in Synapse to make it trivial to look up the associated object.id and object.type given a fileHandle.id.  Then it would be possible for user B, in the above example, to download the file using GET /fileHandle/1/url.
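The difference between the two lookup directions can be sketched in a few lines of Python. This is only an illustration of the idea of a reverse index, not the actual Synapse storage design; the object type names and the in-memory dictionaries are assumptions made for the example.

```python
from collections import defaultdict

# Forward direction (trivial today): (object.type, object.id) -> fileHandle ids.
# The type names "FileEntity" and "TableEntity" are hypothetical labels.
forward = {
    ("FileEntity", "123"): ["1"],
    ("TableEntity", "456"): ["1", "2"],
}


def build_reverse_index(forward_map):
    """Invert the forward map so a fileHandle id resolves to its associated objects."""
    reverse = defaultdict(set)
    for (object_type, object_id), handle_ids in forward_map.items():
        for handle_id in handle_ids:
            reverse[handle_id].add((object_type, object_id))
    return reverse


reverse = build_reverse_index(forward)
# With the reverse index, authorization for GET /fileHandle/1/url could check
# whether the caller may download any of the objects in reverse["1"].
```

Without such an index, answering the reverse question would mean scanning every object, which is why the lookup is described as non-trivial.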

Asynchronous Bulk File Download

We propose adding a new REST API call that would allow callers to start an asynchronous job that would create a zip file containing multiple requested files.  Upon completion of the job, a pre-signed URL will be returned with the response object.  The response object will also include a summary of each of the requested files.  Each file summary will include a state of SUCCESS or FAILURE to indicate which of the requested files were included in the resulting zip file.  For the case of FAILURE, a reason will be included, such as UNAUTHORIZED or NOT_FOUND.
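As a rough illustration of how a client might consume the per-file summaries, the following Python sketch models a summary and filters the successful entries. The class and field names are hypothetical; only the SUCCESS/FAILURE states and the UNAUTHORIZED and NOT_FOUND reasons come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FileDownloadStatus(Enum):
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


@dataclass
class FileSummary:
    # Hypothetical field names; the real object model may differ.
    file_handle_id: str
    status: FileDownloadStatus
    failure_code: Optional[str] = None  # e.g. "UNAUTHORIZED" or "NOT_FOUND"


def included_handle_ids(summaries: List[FileSummary]) -> List[str]:
    """Return the ids of the files that made it into the zip."""
    return [s.file_handle_id for s in summaries
            if s.status is FileDownloadStatus.SUCCESS]


summaries = [
    FileSummary("1", FileDownloadStatus.SUCCESS),
    FileSummary("2", FileDownloadStatus.FAILURE, "UNAUTHORIZED"),
    FileSummary("3", FileDownloadStatus.FAILURE, "NOT_FOUND"),
]
```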

Zip Results

The resulting zip file will contain an entry for each successful file using the following entry naming scheme:

Code Block
{fileHandleId % 1000}/{fileHandleId}/{fileName}

The zip file entry naming scheme is designed to match the caching schemes used by the R and Python clients such that the zip file can be decompressed directly into a user's local file cache.
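The naming scheme can be expressed as a small helper. This is a minimal sketch; the function name is made up for illustration, and it assumes the file handle id is numeric.

```python
def zip_entry_name(file_handle_id: int, file_name: str) -> str:
    """Build the zip entry path: {fileHandleId % 1000}/{fileHandleId}/{fileName}."""
    return f"{file_handle_id % 1000}/{file_handle_id}/{file_name}"
```

For example, `zip_entry_name(1234567, "data.csv")` yields `567/1234567/data.csv`, so entries are spread across at most 1000 top-level directories.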

Limits

There will be a limit of 1 GB for a single result zip file.  If the sum of the sizes of all of the requested files exceeds this limit, a zip will be generated with as many files as possible up to the limit.  All files that are excluded due to exceeding the limit will be marked as FailureCode.SIZE_LIMIT_EXCEEDED.
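One possible reading of the limit rule is a greedy pass over the requested files, sketched below. Several details are assumptions not stated above: that "1 GB" means 2**30 bytes, that files are considered in request order, that sizes are the uncompressed file sizes, and that a file which does not fit is skipped while later, smaller files may still be added.

```python
MAX_ZIP_BYTES = 1024 ** 3  # assumption: "1 GB" interpreted as 2**30 bytes


def select_files(requested, limit=MAX_ZIP_BYTES):
    """Split requested (file_handle_id, size_bytes) pairs into included and excluded.

    Files are visited in request order; a file is included only if it still
    fits under the limit, otherwise it is marked SIZE_LIMIT_EXCEEDED.
    """
    included, excluded, total = [], [], 0
    for handle_id, size in requested:
        if total + size <= limit:
            included.append(handle_id)
            total += size
        else:
            excluded.append((handle_id, "SIZE_LIMIT_EXCEEDED"))
    return included, excluded
```

A stricter variant could stop at the first file that does not fit; the text leaves that choice open.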

Download Audit Record

For each file included in a bulk download zip file, an audit record will capture the following data:

  • FileHandleId
  • AssociatedObjectId
  • AssociatedObjectType
  • UserId
  • Timestamp
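The audit record fields listed above could be modeled as follows. This is a sketch only: the snake_case field names, the epoch-millisecond timestamp, and the JSON serialization are assumptions, and the sample values are invented.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DownloadAuditRecord:
    file_handle_id: str
    associated_object_id: str
    associated_object_type: str
    user_id: str
    timestamp: int  # assumption: milliseconds since the Unix epoch


# Invented sample values, for illustration only.
record = DownloadAuditRecord(
    file_handle_id="1",
    associated_object_id="123",
    associated_object_type="FileEntity",
    user_id="userB",
    timestamp=1_450_000_000_000,
)

# One record is written per file included in the zip.
line = json.dumps(asdict(record), sort_keys=True)
```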

Object Model

[Image: object model diagram]

REST API

| URL | Method | Request Body | Response Body | Authorization | Description |
| --- | --- | --- | --- | --- | --- |
| /file/bulk/async/start | POST | BulkFileDownloadRequest | AsynchJobId | Anyone is authorized to make this call. ... Authorization for each requested file is done individually.  Any file that the caller is not authorized to download will not be included in the resulting zip file and the. | Call to start an asynchronous job that creates a zip file containing all of the requested files that the caller is authorized to download. |
| /file/bulk/async/get/{asyncToken} | GET | | Job complete: BulkFileDownloadResponse (200).  Job processing: AsynchronousJobStatus (202). | Only the user that started the job is authorized to make this call. | Call used to track the asynchronous job status.  While the job is still running, the job status will be returned (202); when the job is complete, the results will be returned (200). |
| /fileHandle/{handleId}/url | GET | | Pre-signed URL | The creator of the FileHandle is authorized to make this call.  If the caller is authorized to download the file via an associated object, they will also be authorized to make this call. | This method already exists in the API.  However, only the creator of the fileHandle is authorized to make this call presently.  Authorization will be extended to allow anyone that is authorized to download the file via an associated object. |
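From a client's point of view, the start/get pair implies a polling loop: start the job, then call the get endpoint until it returns 200 instead of 202. The sketch below shows only the loop logic. The HTTP call is passed in as a caller-supplied function (`get_status` is a hypothetical name) so no particular client library is assumed, and the interval and attempt limit are arbitrary.

```python
import time


def poll_bulk_download(get_status, async_token, interval_s=1.0,
                       max_attempts=60, sleep=time.sleep):
    """Poll the job until it completes and return the response body.

    get_status(async_token) is a caller-supplied function that wraps
    GET /file/bulk/async/get/{asyncToken} and returns (http_status, body).
    """
    for _ in range(max_attempts):
        status, body = get_status(async_token)
        if status == 200:   # job complete: body is the BulkFileDownloadResponse
            return body
        if status != 202:   # anything other than "still processing" is an error
            raise RuntimeError(f"unexpected status {status}")
        sleep(interval_s)
    raise TimeoutError("bulk download job did not complete in time")
```

Injecting `get_status` and `sleep` keeps the loop easy to test without a network connection.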