Parallelize upload syncs

Description

We can parallelize multiple uploads from a syncToSynapse by using a ThreadPoolExecutor shared by the uploads. Currently we support parallelized part uploads of file chunks, but upload the synced files themselves serially.
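As a rough illustration of the idea (not the actual synapseclient implementation), the sketch below submits every file of a sync to one shared ThreadPoolExecutor. The names upload_file and upload_all are hypothetical stand-ins, and the upload itself is simulated.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_file(path):
    """Hypothetical stand-in for a single-file upload; it only echoes the path."""
    return path


def upload_all(paths, executor):
    """Submit every file to the shared executor and collect results as they finish."""
    futures = {executor.submit(upload_file, path): path for path in paths}
    uploaded = []
    for future in as_completed(futures):
        # result() re-raises any exception raised inside the worker thread
        uploaded.append(future.result())
    return uploaded


def sync(paths, max_workers=8):
    # One executor is created per sync and shared by all file uploads.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return upload_all(paths, executor)
```

One caveat with sharing a single pool: if a file-level task blocks while waiting on part-level tasks queued to the same pool, the pool can be starved, so a real implementation needs to avoid blocking inside the workers.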

One tricky aspect of this is that the provenance in uploaded manifests can refer to files appearing earlier in the manifest. It would therefore be necessary to resolve dependencies and not upload a file from a sync until the files it depends on have been uploaded. This only becomes an issue once the rows are no longer uploaded serially.
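One possible way to order the uploads, sketched here under assumptions: the provenance references are expressed as a mapping from each file to the set of files it depends on, and the standard library graphlib.TopologicalSorter (Python 3.9+) releases a file only after its dependencies are done. The function name and the upload callable are hypothetical, and this is not necessarily how the change implements it.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def upload_in_dependency_order(dependencies, upload, max_workers=4):
    """Upload files in parallel, never starting a file before its dependencies.

    dependencies maps each path to the set of paths it depends on.
    upload is a hypothetical callable taking a single path.
    Returns the paths in the order their uploads were observed to complete.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular provenance
    finished = []
    pending = {}  # future -> path
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while sorter.is_active():
            # Submit everything whose dependencies have all completed.
            for path in sorter.get_ready():
                pending[executor.submit(upload, path)] = path
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # propagate upload errors
                path = pending.pop(future)
                finished.append(path)
                sorter.done(path)  # unblocks files that depend on this one
    return finished
```

Files with no dependency relationship between them still run concurrently; only the edges in the provenance graph force an ordering.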

It could also be useful to create a glob-based upload that does not require a manifest and takes advantage of the same parallelization. A shell xargs based upload, for example, would not be able to share a common executor since each file is uploaded in a separate process, so a glob pattern based upload would provide a performant alternative.
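A minimal sketch of what such a glob-based upload could look like, assuming a hypothetical upload callable that takes a local path. The function name upload_glob is illustrative only and is not an existing synapseclient API.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor


def upload_glob(pattern, upload, max_workers=8):
    """Expand a glob pattern and upload the matching files through one executor.

    upload is a hypothetical callable taking a local file path.
    Results are returned in sorted path order.
    """
    paths = sorted(
        path for path in glob.glob(pattern, recursive=True) if os.path.isfile(path)
    )
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the input order of paths
        return list(executor.map(upload, paths))
```

Because all matches are uploaded from a single process, they can share one thread pool, which is what a per-file xargs invocation cannot do.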

Environment

None

Activity

Jordan Kiang
August 19, 2020, 4:03 PM
Edited

Performance numbers for this change:

Two configurations tested:

  1. SYNPY-1073 (this change)

  2. baseline (develop as of 8/17/2020, commit 87ae736)

Two scenarios:

  1. 100 10-byte files of random text

  2. 32 random binary files, ranging in size from 1 byte to 2GB (1 file for each power of 2 bytes)

Scenario 1:

This tests the situation that this change is optimized for: parallel uploads are most beneficial for small files because they allow resources/threads that would otherwise be idle to work across files. There should be a significant speedup here. The files/upload manifest for this scenario can be generated via the attached create_small_files.py script.

elapsed seconds:

  run       baseline    SYNPY-1073
  1         187.32      45.34
  2         171.31      48
  3         194.55      43.106
  Average   184.39      45.48

Scenario 2:

This tests a mixed workload across a range of file sizes. These changes are most beneficial to small files but should also improve other upload profiles, just not as significantly. The files/upload manifest for this scenario can be generated via the attached create_mixed_files.py script.


elapsed seconds:

  run       baseline    SYNPY-1073
  1         139.81      97.6
  2         147.25      90.49
  3         144.94      98.41
  Average   144         95.5
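For reference, the speedup implied by the timings reported in this comment can be computed as the ratio of the average elapsed times. The snippet below is just that arithmetic; the variable and function names are illustrative.

```python
from statistics import mean

# Elapsed seconds per run, copied from the two scenarios reported above.
timings = {
    "scenario 1": {
        "baseline": [187.32, 171.31, 194.55],
        "SYNPY-1073": [45.34, 48, 43.106],
    },
    "scenario 2": {
        "baseline": [139.81, 147.25, 144.94],
        "SYNPY-1073": [97.6, 90.49, 98.41],
    },
}


def speedup(baseline, changed):
    """Ratio of average baseline time to average time with the change."""
    return mean(baseline) / mean(changed)


speedups = {
    name: speedup(runs["baseline"], runs["SYNPY-1073"])
    for name, runs in timings.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

This works out to roughly a 4x speedup for the small-file scenario and roughly 1.5x for the mixed workload.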

Jordan Kiang
August 26, 2020, 5:40 PM

I’ve asked Tobias Ross of the German Cancer Research Center to verify the improved performance, since he previously had issues with sync performance (now partially resolved by an earlier bug fix).

Jordan Kiang
August 28, 2020, 3:13 PM

Received the following back:

I followed up to ask if they collected any concrete numbers and will add them to the issue if received, but marking this as closed for validation/release purposes.

Assignee

Jordan Kiang

Reporter

Jordan Kiang

Labels

None

Validator

None

Development Area

Synapse Core Infrastructure

Release Version History

None

Components

Fix versions

Priority

Major