Explore moving to /fileHandle/batch and async over downloadTableColumn

Description

Downloading many small files is still a problem for the clients. We have two options: download each file sequentially, or ask a worker to package the files into a zip file. The former is extremely slow. The latter can also be slow because it requires two operations (download of the files to the workers, followed by zipping and download of the zip); additionally, it can block if the workers are busy. We should benchmark using /fileHandle/batch with either a queue/consumer or an async/await model to download these files in batches directly to the client.

Environment

None

Activity

Larsson Omberg
April 11, 2019, 10:23 PM

Yes. An example is data from mPower, where it takes us 5 days to download all of the data when downloading with 4 parallel threads to an EC2 instance in the same region as Synapse. This would likely be halved or better if we fetched in parallel directly from S3. Of note, we have very similar code in syncToSynapse that could be extended for this specific case, but ideally we would solve this by completing:

and

and then hooking into the same mechanism for table files. This is harder, however: it is relatively easy to solve SYNPY-682 on its own, but ideally we would solve both issues by focusing on both parallelizing big files and parallelizing across many small files.


Meredith Slota
April 11, 2019, 7:13 PM

Do you have any data about download times that could help here? We have expanded our worker army quite a bit, so hopefully you don't see any issue re: workers being too busy, but I am not sure how big a priority this is.

Assignee

Unassigned

Reporter

Larsson Omberg

Labels

None

Validator

None

Development Area

None

Release Version History

None

Slack Channel

None

Components

Priority

Trivial