Improve file transfer speeds (upload, download, bucket to bucket)

Description

We have discussed upload speeds before, and for HTAN we are the data coordination center (DCC): many large files (terabytes of imaging data) will be transferred to us, and people will download data as well. We have gotten complaints about slow uploads from several concerned parties, and Harvard Medical School has benchmarked this:

Here is my report:

I put together a ~3 GB dataset consisting of 10 x ~300 MB files. I tested awscli and the synapse command line client on the campus compute cluster, as it has a faster network connection than my own workstation. I did all file I/O in a tmpfs filesystem on a system with ~200 GB of idle RAM so as not to impose any physical disk I/O bottleneck. The S3 bucket I used for awscli testing is located in us-east-1 and doesn't have any special or unusual configuration. The Synapse test project is syn21211657.

I tested both "aws s3 sync" and "aws s3 cp" to satisfy myself that they gave similar speeds, then did the actual benchmarking with "sync". I recorded the numbers printed in the tools' own progress bars, but I verified those numbers were not overly optimistic by also timing the commands externally with the bash "time" built-in and dividing total file size by elapsed time. The speeds calculated with external timing were always somewhat worse than the progress bar speeds – synapse had significantly more overhead before/after the actual data transfer than awscli.

I was able to test everything repeatedly to get more reliable numbers, except synapse uploads: there appears to be a server-side cache that makes re-uploads instantaneous.
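For reproducibility, here is a minimal sketch of the external-timing method described above (the bucket name is a placeholder; it assumes awscli and the synapse CLI are installed and authenticated, and it measures wall-clock throughput rather than the progress-bar numbers):

```python
import subprocess
import time
from pathlib import Path

DATASET = Path("/dev/shm/benchmark")  # tmpfs-backed copy of the 10 x ~300 MB files
TOTAL_MB = sum(f.stat().st_size for f in DATASET.iterdir()) / 1e6

def timed_mb_per_s(transfer):
    """Run a transfer function and compute throughput from wall-clock time."""
    start = time.monotonic()
    transfer()
    return TOTAL_MB / (time.monotonic() - start)

# awscli upload: sync the whole directory in one call (bucket is a placeholder).
aws_up = timed_mb_per_s(lambda: subprocess.run(
    ["aws", "s3", "sync", str(DATASET), "s3://my-test-bucket/bench/"], check=True))

# synapse upload: one "synapse store" call per file; note that re-uploading
# the same content can hit a server-side cache and finish instantly.
def synapse_upload():
    for f in sorted(DATASET.iterdir()):
        subprocess.run(["synapse", "store", "--parentid", "syn21211657", str(f)],
                       check=True)

syn_up = timed_mb_per_s(synapse_upload)
print(f"awscli upload: {aws_up:.0f} MB/s, synapse upload: {syn_up:.0f} MB/s")
```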

Here are the results:

awscli upload: 100-170 MB/s

awscli download: 130-180 MB/s

synapse upload: 15-40 MB/s

synapse download: 30-60 MB/s

For HTAN we can create an SOP for users familiar with the awscli to upload large data to S3 buckets we create and associate with Synapse. However, we also have issues with smaller (non-terabyte) data from users who are not familiar with the CLI or synapseclient. E.g., we have had users try to upload ~40 GB through the web UI, get frustrated with the setup costs of synapseclient, or hit confusing errors in synapseclient because of file corruption, etc.

One of our HTAN DCC members uses Globus for ITCR, which solves a lot of "last mile" problems. It might increase upload/download speeds and facilitate easier bucket-to-bucket transfer of many files, because you don't have to babysit transfers. We had a preliminary call with the Globus contact, and we are working to set up a test bucket and ask our collaborator to test it.

Are there any other points/use cases to consider? (Dwayne mentioned some of you might have some interest; feel free to tag others if desired!)

Environment

None

Activity

Bruce Hoff
April 15, 2020, 6:25 PM

Regarding the request on Jan 21,
> ...could you recap your takeaways?
My notes were:

  • data releases are at 6 mo. intervals

  • process must be streamlined for the contributing core

  • first release Aug. 2020

  • need some guidance about when to upload/download with the browser, the Python client, or S3 methods, including what speeds to expect and how data size relates to transfer time

  • Not sure if contributors will need to update or delete files, but people always make mistakes

  • Not sure what the breakdown is of files on premises vs. in the cloud.

  • Would Dwayne's (STS) work help here?

  • Can we incorporate custom thumbnails from DSA?

  • When a user goes to DSA to look at images, will DSA download from Synapse on the fly or will it have its own copy already? (If the latter, how will it stay synced with Synapse?)

Bruce Hoff
April 23, 2020, 11:20 PM

Note:
Today we got this from Jeremy at Harvard:

> I uploaded our first large data file release to Synapse the other day with the 2.1 RC client. I set it to use 16 threads and 50 MB chunks and got 50 MB/s overall throughput on a single 80 GB file.
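For anyone reproducing this, a minimal sketch of the threaded-upload configuration in the 2.x Python client (the project ID and file path are placeholders; max_threads controls the concurrent transfer threads, while the part size is chosen by the client's multipart upload layer):

```python
import synapseclient
from synapseclient import File

# Sketch of the setup Jeremy described, assuming synapseclient >= 2.0.
syn = synapseclient.login()  # credentials from ~/.synapseConfig
syn.max_threads = 16         # concurrent threads used for multipart transfer

# syn.store() runs a multipart upload under the hood; with 16 threads,
# parts of a large file are uploaded concurrently.
entity = syn.store(File("/data/release1/big_image.ome.tif", parent="syn21211657"))
print(entity.id, entity.versionNumber)
```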

We discussed the strategy of adding support for the new Synapse "STS token" feature in the Python client. This would allow Jeremy to use the S3 client to upload files to Synapse.
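A sketch of what that flow could look like once supported (the folder ID, bucket, and key below are placeholders, and the folder must live in an STS-enabled storage location):

```python
import boto3
import synapseclient

syn = synapseclient.login()

# Ask Synapse for temporary AWS credentials scoped to the folder's
# underlying S3 location ("syn12345678" is a placeholder folder ID).
creds = syn.get_sts_storage_token("syn12345678", permission="read_write",
                                  output_format="boto")

# Transfer directly with boto3 (or equivalently awscli), so large files
# move at raw S3 speed instead of through the Synapse upload path.
s3 = boto3.client("s3", **creds)
s3.upload_file("/data/release1/big_image.ome.tif",
               "example-sts-bucket", "release1/big_image.ome.tif")
```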

Bruce Hoff
May 21, 2020, 1:40 PM

That should go a long way in addressing the reported problems. Still, I'm keeping this open for now to make sure we address the reported concerns.

Bruce Hoff
June 10, 2020, 8:24 PM

Please close this issue, linking to a Python client issue that tracks the remaining work, i.e. uploading/downloading large numbers of small files.

Jordan Kiang
June 16, 2020, 7:18 PM

Closing per above.

Parallelized multiple file syncs tracked here: https://sagebionetworks.jira.com/browse/SYNPY-1072
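For reference, the bulk helpers that SYNPY-1072 targets are in synapseutils (a sketch; the project ID, local path, and manifest file are placeholders):

```python
import synapseclient
import synapseutils

syn = synapseclient.login()

# Bulk download: recursively pull every file under a project or folder.
files = synapseutils.syncFromSynapse(syn, "syn21211657", path="/data/htan")

# Bulk upload: driven by a tab-separated manifest listing each file's
# local path and Synapse parent (see the synapseclient docs for the format).
synapseutils.syncToSynapse(syn, "/data/htan/manifest.tsv")
```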

Fixed

Assignee

Jordan Kiang

Reporter

Xengie Doan

Labels

Validator

Milen Nikolov

Development Area

None

Release Version History

None

Sprint

None

Priority

Major