Bulk file download function
Downloading from Synapse is currently very slow. We can download in bulk by kicking off a worker that zips together many small files. We have an open issue that suggests removing this unnecessary step by parallelizing downloads from the client. By refactoring downloads into a bulk download function that operates on "chunks" of files, we can speed up the download of both small and large files: large files are fetched in parallel by requesting the URL with an offset, and small files are requested simultaneously. This would likely yield a 4-8x increase in download speed, similar to how uploads are sped up through parallelization.
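A minimal sketch of the "request the URL with an offset" idea for one large file. This is not the client's implementation: `byte_ranges`, `download_in_chunks`, and the `fetch_range` callback are hypothetical names. In real use, `fetch_range` would issue an HTTP GET against the file's URL with a `Range: bytes=start-end` header and return the response body.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size, chunk_size):
    """Return inclusive (start, end) byte offsets that cover `size` bytes."""
    return [
        (start, min(start + chunk_size, size) - 1)
        for start in range(0, size, chunk_size)
    ]


def download_in_chunks(fetch_range, size, chunk_size, max_workers=8):
    """Fetch every byte range concurrently and reassemble the file in order.

    `fetch_range(start, end)` is a caller-supplied (hypothetical) function
    that returns the bytes for the inclusive range [start, end].
    """
    ranges = byte_ranges(size, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the parts join correctly
        # even when the requests complete out of order.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)
```

The same thread pool could also run one task per small file, so large and small files would share a single "chunk" abstraction.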
Indeed, syncFromSynapse will not solve the issue of downloading a bunch of files from a table column of type filehandle.
In the downloadTableColumn function we use the worker approach, but it would be more efficient to download these files directly.
File downloads in the context of a syncFromSynapse will be parallelized in the upcoming release as part of SYNPY-1074. It’s unclear to me from reading this whether this issue will also be resolved as part of that, or whether the desire is to have a separate interface that allows an arbitrary list of files to be downloaded in parallel, rather than the container-based syncFromSynapse.
If the former, then we can adapt the work done for 1074 to accept an arbitrary list to download in parallel as well.
Updated description to match need.
This boils down to an optimization problem that will be hard to solve well.
The bulk file download method we use currently gives us a zip file containing all the FileHandles. According to John, the best use case is many small files, where issuing multiple presignedURL HTTP requests would cost more download time than having the back-end download and zip the files. If we have multiple larger files, downloading a zipped file would take much longer than having the client download each FileHandle directly, because both the back-end and the client would have to download a large amount of data, instead of only the client.
We should either determine a threshold for when we want to get a zipped file versus downloading directly from the client, or just make a new function that always downloads all of the FileHandles from the client, but retrieves all necessary presignedURLs at once using http://docs.synapse.org/rest/POST/fileHandle/batch.html
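To make the threshold option concrete, here is a rough sketch of how the split could look. `plan_downloads` and `size_threshold` are hypothetical, and the right threshold value is exactly the open question above; it would have to be measured rather than guessed.

```python
def plan_downloads(file_sizes, size_threshold):
    """Split FileHandle IDs into a zipped batch and a direct-download batch.

    `file_sizes` maps FileHandle ID -> size in bytes (assumed to be known
    up front). Files smaller than `size_threshold` go to the zip worker;
    everything else is downloaded directly by the client.
    """
    zipped, direct = [], []
    for file_handle_id, size in file_sizes.items():
        if size < size_threshold:
            zipped.append(file_handle_id)
        else:
            direct.append(file_handle_id)
    return zipped, direct
```

With the second option (always download from the client), this split disappears and every FileHandle goes into the direct list, at the cost of one presignedURL request per file after the batch lookup.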