review contributed download improvements/methods

Description

A collaborator with Gates is investigating different ways to download from Synapse and has written a CLI that implements several of them. Would you like to evaluate this with me?

https://github.com/pcstout/synapse-downloader

Environment

None

Activity

Kenneth Daily
February 25, 2020, 9:31 PM

From Patrick today:

I don't recall the exact numbers but I believe the current code (heavily using asyncio) was the fastest, or at least on par with multiple threads.
I decided to go with asyncio over threads since it greatly simplifies the code and fits the use case (heavy IO).
This package can also use a file view (--with-view, https://github.com/ki-tools/synapse-downloader/blob/master/src/synapse_downloader/download/file_handle_view.py) which speeds it up a lot on huge projects. The file view is used to build a cache of all the dataFileHandleIds.
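The caching idea can be illustrated with a minimal sketch. This is not the code from file_handle_view.py; the function name `build_file_handle_cache` and the row shape (dicts with `id` and `dataFileHandleId` keys) are assumptions made for illustration, standing in for whatever a query against the file view returns.

```python
from typing import Dict, Iterable, Mapping


def build_file_handle_cache(rows: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map each entity id to its dataFileHandleId.

    `rows` is a hypothetical stand-in for the result of querying a file
    view for the `id` and `dataFileHandleId` columns.
    """
    cache: Dict[str, str] = {}
    for row in rows:
        handle_id = row.get("dataFileHandleId")
        if handle_id:  # skip rows that have no file handle
            cache[row["id"]] = str(handle_id)
    return cache


# Illustrative rows only; real rows would come from the file view.
example_rows = [
    {"id": "syn1", "dataFileHandleId": "101"},
    {"id": "syn2", "dataFileHandleId": "102"},
    {"id": "syn3", "dataFileHandleId": ""},
]
cache = build_file_handle_cache(example_rows)
```

With a cache like this, each download can look up its file handle locally, which is presumably where the speedup on huge projects comes from: one view query replaces a per-file metadata request.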

I built some pure async API methods (https://github.com/ki-tools/synapse-downloader/blob/master/src/synapse_downloader/core/synapse_proxy.py#L98) that really helped too, and wrapped some of the existing methods (https://github.com/ki-tools/synapse-downloader/blob/master/src/synapse_downloader/core/synapse_proxy.py#L58) to make them async.
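One common way to make an existing blocking method awaitable is to run it in a thread pool through the event loop. The sketch below shows that pattern under stated assumptions: `blocking_fetch` is a hypothetical stand-in for a synchronous client call, and `fetch_async` / `fetch_all` are illustrative names. It may differ from how synapse_proxy.py actually does the wrapping.

```python
import asyncio
import functools
import time
from typing import List


def blocking_fetch(entity_id: str) -> str:
    """Hypothetical stand-in for a synchronous client call."""
    time.sleep(0.01)  # simulate network latency
    return f"downloaded:{entity_id}"


async def fetch_async(entity_id: str, limit: asyncio.Semaphore) -> str:
    """Run the blocking call in the default thread pool so the loop stays free."""
    async with limit:  # cap the number of calls in flight
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, functools.partial(blocking_fetch, entity_id)
        )


async def fetch_all(entity_ids: List[str], max_concurrency: int = 4) -> List[str]:
    """Fetch all entities concurrently; results keep the input order."""
    limit = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(fetch_async(e, limit) for e in entity_ids))


results = asyncio.run(fetch_all(["syn1", "syn2", "syn3"]))
```

The semaphore is an added assumption here: it bounds concurrency so a project with tens of thousands of files does not issue every request at once.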

We have been using this package daily in one of our production environments without any issues so it should be pretty solid. This code was rushed a bit though and I'm sure there are some improvements that could be made.

I'd be happy to review and chat about this with you and/or your new engineer, just let me know.

Kenneth Daily
February 25, 2020, 9:06 PM

A heads up: Patrick (the code's author) has made some changes, and the repository no longer exposes some of the utilities it previously had. I have a fork of the repository that is still at the state it was in when we were discussing the speed issues:

https://github.com/kdaily/synapse-downloader/

Bruce Hoff
February 25, 2020, 8:59 PM

I think we should assign this to Jordan Kiang once he's on board. For now I will assign it to myself.

Kenneth Daily
October 4, 2019, 10:21 PM

Update from the contributor:

The real speed improvements we've seen are on huge Projects. The one we are testing has 56,000+ files and around 80GB of data.

So far this class is the winner (it uses the entity view, thanks for the tip!).

Also using asyncio appears to make a big difference in speed.

Robert Minneker
October 4, 2019, 5:03 PM

Thanks for reaching out, I would be happy to look into this. I ran some tests of the spccore multithreaded download I wrote on Wednesday, so it would be interesting to compare speed among other things.


Assignee

Jordan Kiang

Reporter

Kenneth Daily

Validator

Kenneth Daily