We can parallelize multiple uploads from a syncToSynapse by using a ThreadPoolExecutor shared by the uploads (currently we support paraellized part uploads of file chunks, but upload the synced files themselves serially).
One tricky aspect of this is that the provenance in uploaded manifests can refer to files previously appearing in the manifest, so to do this it would be necessary to resolve dependencies and not upload one file from a sync before files it depended on had been uploaded (this becomes an issue after the rows are not serially uploaded).
Could also be useful to create a glob upload that didn’t require a manifest that could take advantage of the same parallelization to support parellelized uploads that didn’t include a manifest, because. An e.g. shell xargs based upload would not be able to take advantage of the parallelization using a common executor since each file is uploaded in a separate process, and a glob pattern based upload would provide an performant alternative.
Performance numbers for this change:
Two configurations tested:
SYNPY-1073 (this change)
baseline (develop as of 8/17/2020, commit 87ae736)
Two scenarios:
100 10 byte files of random text
32 files random binary files, ranging in size 1 byte to 2GB (1 file for each power of 2 bytes)
This tests the situation that this change is optimized for: parallel uploads is more beneficial for small files because it allows resources/threads that would otherwise be idle to work across files. There should be a significant speedup here. The files/upload manifest for this scenario can be generated via the attached create_small_files.py script.
elapsed seconds:
| baseline | SYNPY-1073 |
1 | 187.32 | 45.34 |
2 | 171.31 | 48 |
3 | 194.55 | 43.106 |
Average | 184.39 | 45.48 |
This tests a mixed workload of a range of file sizes. These changes are most benficial to small files but should also improve other upload profiles, just not as significantly. The files/upload manifest for this scenario can be generated via the attached create_mixed_files.py script.
elapsed seconds:
| baseline | SYNPY-1073 |
1 | 139.81 | 97.6 |
2 | 147.25 | 90.49 |
3 | 144.94 | 98.41 |
Average | 144 | 95.5 |
I’ve asked Tobias Ross of the German Cancer Research Center to verify improved performance since he had previously had issues with sync performance (now partially resolved with a previous bug fix).
Received the following back:
I followed up to ask if they collected any concrete numbers and will add them to the issue if received, but marking this as closed for validation/release purposes.