We can parallelize the downloads within a syncFromSynapse call by using a ThreadPoolExecutor shared across the downloads. Currently we support parallelized part downloads of the chunks of a single file, but the synced files themselves are downloaded serially.
Implementation details here: https://github.com/Sage-Bionetworks/synapsePythonClient/pulls
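As a rough illustration of the shared-pool idea (not the code from the pull request), the sketch below submits the parts of every file to one ThreadPoolExecutor, so small files run concurrently with each other and large files are still split into parts. `REMOTE`, `fetch_part`, `sync_download`, and `PART_SIZE` are hypothetical names, and an in-memory dict stands in for remote storage.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for remote storage: file name -> content.
REMOTE = {f"file_{i}.txt": f"contents of file {i}".encode() for i in range(8)}
REMOTE["big.bin"] = bytes(range(256)) * 40  # one larger file, 10240 bytes

PART_SIZE = 4096  # assumed part size, chosen only for this example


def fetch_part(name, start, end):
    """Simulate a ranged read of bytes [start, end) of one remote file."""
    return REMOTE[name][start:end]


def sync_download(names, executor, part_size=PART_SIZE):
    """Download every named file using one shared executor.

    All part tasks are submitted from the calling thread, so no worker
    ever blocks waiting on another task queued in the same pool.
    """
    pending = {}
    for name in names:
        size = len(REMOTE[name])
        ranges = [(s, min(s + part_size, size)) for s in range(0, size, part_size)]
        pending[name] = [executor.submit(fetch_part, name, s, e) for s, e in ranges]
    # Reassemble each file from its parts, in order.
    return {
        name: b"".join(f.result() for f in futures)
        for name, futures in pending.items()
    }


with ThreadPoolExecutor(max_workers=4) as pool:
    downloaded = sync_download(sorted(REMOTE), pool)
```

Submitting the part tasks from the caller rather than from inside worker threads is one way to avoid a pool deadlocking on its own nested tasks; the actual change may handle this differently.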
Here are some performance figures for this change. These were all run on a t2.xlarge (4 cores) in AWS us-east-1.
The client configurations tested are:
SYNPY-1074 (this change)
The current 2.1.1 Synapse client configured for single-threaded downloads (currently the default unless overridden)
The current 2.1.1 Synapse client configured for multi-threaded downloads using the existing multi-threaded download functionality (which supports multi-threaded part downloads of any single file, but not multi-threaded downloads across a sync of files; those are still downloaded serially).
This change should be most beneficial to syncing folders with many small files, since they can be concurrently downloaded. syn22252168 is a Synapse project with 400 text files smaller than 50 bytes distributed in a file hierarchy 3 folders deep with 10 files in every subfolder. This file hierarchy can be recreated with the attached script. A recursive synapse sync was run on this project:
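The attached script is not reproduced in this comment. As a local sketch only, the code below builds one layout that is consistent with the numbers above, under the assumption that the root and each of three nested levels (with three subfolders per folder) hold 10 files each, giving 40 folders and 400 files. The fan-out, file names, and contents are hypothetical, and uploading the tree to Synapse is not shown.

```python
import os
import tempfile


def build_tree(root, depth=3, fanout=3, files_per_folder=10):
    """Write small text files into root and every subfolder down to `depth`.

    Returns the number of files created. The fanout of 3 is an assumption
    chosen so that the total comes to 400 files.
    """
    count = 0
    for i in range(files_per_folder):
        with open(os.path.join(root, f"file_{i}.txt"), "w") as fh:
            fh.write(f"test file {i}\n")  # well under 50 bytes
        count += 1
    if depth > 0:
        for j in range(fanout):
            sub = os.path.join(root, f"folder_{j}")
            os.makedirs(sub)
            count += build_tree(sub, depth - 1, fanout, files_per_folder)
    return count


root_dir = tempfile.mkdtemp(prefix="sync_test_")
total_files = build_tree(root_dir)
```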
In this test the existing client configured with the optional multi-threaded mode actually performs worse than the client in single-threaded mode: the overhead of launching and running each download in threads is not worthwhile for such small files, and in both cases the files themselves are downloaded serially.
The changes in SYNPY-1074 allow files to be downloaded in parallel. Each individual file is still downloaded with a single thread, however, because the download code checks the reported size before deciding which implementation to use, and files this small take the single-threaded path.
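The size check can be pictured with the minimal sketch below. The function name, the return labels, and the 8 MiB threshold are all hypothetical; the client's actual cutoff is not stated in this issue.

```python
# Hypothetical cutoff below which splitting a file into parts is not worthwhile.
MULTI_THREAD_MIN_SIZE = 8 * 1024 * 1024


def choose_download_strategy(reported_size, multi_threaded_enabled,
                             threshold=MULTI_THREAD_MIN_SIZE):
    """Pick a download implementation from the file's reported size.

    Returns "multi_part" only when multi-threaded mode is enabled and the
    size is known and at least `threshold`; otherwise "single_stream".
    """
    if (multi_threaded_enabled
            and reported_size is not None
            and reported_size >= threshold):
        return "multi_part"
    return "single_stream"


small_file = choose_download_strategy(50, multi_threaded_enabled=True)
large_file = choose_download_strategy(7 * 1024 ** 3, multi_threaded_enabled=True)
```

With a check like this, the 50-byte files in the test above each use one thread, and the speedup comes entirely from running many such downloads at once.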
This test uses a real-world folder, syn21903917, with a mix of large and small files that was reported to have problems as part of SYNPY-1078. Some files are multi-gigabyte video files, some are small text files of a few bytes, and some are images of a few hundred kB. The project itself is many terabytes; these tests downloaded 50GB of data before filling the available disk, at which point the elapsed time was measured.
In this case, because of the larger files, 2.1.1 multi-threaded performs better than 2.1.1 single-threaded, but the changes for SYNPY-1074 outperform both.
The last test is a download of a single large file (7GB), syn21919857.
This performance shouldn’t materially change, but since the multi-threaded download code was refactored to support a shared thread pool while syncing, we want to ensure that performance hasn’t degraded.
As expected/hoped, the changes do not materially affect performance over the existing multi-threaded download in this case.
This change has been merged to develop and will be included in the next release candidate for validation.
Can you think of a good internal user to assign as 'validator', i.e. someone who has a need to download batches of files in their work?
I’ve asked Tobias Ross of the German Cancer Research Center to verify improved performance since he had previously had issues with sync performance (now partially resolved with a previous bug fix).
Received the following back:
I followed up to ask if they collected any concrete numbers and will add them to the issue if received, but marking this as closed for validation/release purposes.