Improve file transfer speeds (upload, download, bucket-to-bucket)

Description

I've heard that upload speeds have been discussed before. Along those lines, for HTAN we are the data coordination center (DCC) and will be transferring many large files (terabytes of imaging data), and people will download data as well. We have gotten complaints about slow uploads from several concerned parties, and Harvard Medical School has benchmarked this.

Here is my report:

I put together a ~3 GB dataset consisting of 10 x ~300 MB files. I tested awscli and the synapse command line client on the campus compute cluster, since it has a faster network connection than my own workstation. I did all file I/O in a tmpfs filesystem on a system with ~200 GB of idle RAM, so as not to impose any physical disk I/O bottleneck. The S3 bucket I used for awscli testing is located in us-east-1 and has no special or unusual configuration. The Synapse test project is syn21211657.

I tested both "aws s3 sync" and "aws s3 cp" to satisfy myself that they gave similar speeds, then did the actual benchmarking with "sync". I recorded the numbers printed by the tools' own progress bars, but I checked that those numbers were not overly optimistic by also timing the commands externally with the bash "time" built-in and dividing total file size by elapsed time. The speeds calculated from external timing were always somewhat lower than the progress-bar speeds; synapse had significantly more overhead before and after the actual data transfer than awscli.

I was able to test everything repeatedly to get more reliable numbers, except synapse uploads: there appears to be a server-side cache that makes re-uploads instantaneous.
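For anyone who wants to reproduce the external-timing check, a minimal Python sketch of the approach is below. The staging directory and bucket name are placeholders; the logic is just "time the command wall-clock, divide bytes by seconds":

    import subprocess
    import time
    from pathlib import Path

    SRC = Path("/dev/shm/benchmark")           # tmpfs staging dir with the test files
    DEST = "s3://example-test-bucket/bench/"   # placeholder bucket in us-east-1

    total_bytes = sum(f.stat().st_size for f in SRC.iterdir() if f.is_file())

    start = time.monotonic()
    subprocess.run(["aws", "s3", "sync", str(SRC), DEST], check=True)
    elapsed = time.monotonic() - start

    # External timing includes the tool's startup/teardown overhead,
    # so this is a lower bound on the raw transfer rate.
    print(f"{total_bytes / elapsed / 1e6:.1f} MB/s over {elapsed:.1f} s")

This is why the externally timed numbers come out lower than the progress-bar numbers: the progress bar only measures the data-transfer phase, while the wall clock also counts setup and teardown.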

Here are the results:

awscli upload: 100-170 MB/s

awscli download: 130-180 MB/s

synapse upload: 15-40 MB/s

synapse download: 30-60 MB/s

For HTAN we can create an SOP for users familiar with awscli to upload large data to S3 buckets we create and associate with Synapse (see the sketch below). However, we also have issues with data that isn't terabyte-scale, from users who are not familiar with the CLI or synapseclient. For example, we have had users try to upload ~40 GB through the web UI and then get frustrated by the setup costs of synapseclient, or get confusing errors from synapseclient because of file corruption, etc.
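The speed gap above is consistent with awscli's parallel multipart transfers for large files. If the SOP standardizes on awscli/boto3, a rough sketch of a tuned upload might look like the following; the bucket, key, and file names are placeholders, and the tuning numbers are illustrative rather than recommendations:

    import boto3
    from boto3.s3.transfer import TransferConfig

    # Larger parts and more concurrent part uploads generally help on
    # fast links; the right values depend on the network and files.
    config = TransferConfig(
        multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
        max_concurrency=16,                    # parallel part uploads
    )

    s3 = boto3.client("s3")
    s3.upload_file(
        "image_stack.ome.tiff",         # placeholder local file
        "htan-example-bucket",          # placeholder Synapse-linked bucket
        "project/image_stack.ome.tiff", # placeholder object key
        Config=config,
    )

The same TransferConfig knobs exist on the awscli side as "aws configure" settings, so an SOP could pin them without requiring users to write any Python.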

One of our HTAN DCC members uses Globus for ITCR, and it solves a lot of "last mile" problems. It might improve upload/download speeds and make bucket-to-bucket transfer of many files easier, because you don't have to babysit the transfer. We had a preliminary call with the Globus person, , , and . We are working to set up a test bucket and will ask our collaborator to test it.
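For context on the "no babysitting" point: a Globus transfer is submitted as an asynchronous server-side task rather than a live client session, and the service handles retries and checksum verification. A rough sketch using the globus-sdk Python package, assuming an already-issued transfer-scoped token and with placeholder endpoint UUIDs and paths:

    import globus_sdk

    TRANSFER_TOKEN = "..."                  # obtained via the Globus auth flow
    SRC_ENDPOINT = "<source-endpoint-uuid>" # e.g. the submitter's collection
    DST_ENDPOINT = "<dest-endpoint-uuid>"   # e.g. a Synapse-linked S3 bucket

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
    )

    tdata = globus_sdk.TransferData(
        tc, SRC_ENDPOINT, DST_ENDPOINT, label="HTAN test transfer"
    )
    tdata.add_item("/data/imaging/", "/htan/imaging/", recursive=True)

    # submit_transfer is fire-and-forget: the task runs and retries
    # server-side, so nobody has to keep a terminal open.
    task = tc.submit_transfer(tdata)
    print("task id:", task["task_id"])

The test bucket exercise should tell us whether this actually beats awscli on raw throughput, or only on convenience.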

Are there any other points/use cases to consider? (Dwayne mentioned some of you might have some interest; feel free to tag others if desired!)

Environment

None

Assignee

Bruce Hoff

Reporter

Xengie Doan

Labels

Validator

Milen Nikolov

Development Area

None

Release Version History

None

Priority

Major