Make upload speeds comparable to those of the AWS S3 CLI

Description

focuses on improving download speed. The observations noted here:
https://sagebionetworks.jira.com/browse/SYNPY-682?focusedCommentId=104516&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-104516
show that upload speeds could also be improved.

Environment

None

Activity

Bruce Hoff
March 30, 2020, 6:22 PM

The difference between 'awscli' and 'master' in your results looks much smaller than the 4- to 6-fold difference reported by the user at Harvard in PLFM-6045. Can you reconcile this difference? Could it be because they uploaded 10 files at once rather than one larger file? Ideally we would have a demonstration in which the difference between 'awscli' and 'master' replicates what they saw and in which the difference between 'awscli' and 'refactor' is small.

Jordan Kiang
April 10, 2020, 4:54 PM

To get consistent test timings, I used a script that was shared with some of the people who reported issues in the linked ticket. Even with controlled input I could not reproduce the degree of performance disparity mentioned in some specific cases. However, a disparity of about 1.4-1.6x was easy to reproduce, and 2.1 should narrow that to within 5% for large uploads. Further, the new ability to customize max_threads programmatically or via synapseConfig should allow users with additional CPU and bandwidth to accelerate uploads beyond what they would normally get with vanilla awscli usage.
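As a rough illustration of the configuration route mentioned above, the sketch below writes and reads back a max_threads setting using Python's standard configparser. The `[transfer]` section name, and the `syn.max_threads` attribute noted in the comment, are assumptions based on my recollection of the 2.1 client and should be checked against the release documentation. The helper names and the temporary file path are hypothetical; this does not touch a real `~/.synapseConfig`.

```python
import configparser
import os
import tempfile


def write_max_threads(path, max_threads):
    """Write a max_threads value under [transfer] (section name is an assumption)."""
    config = configparser.ConfigParser()
    config["transfer"] = {"max_threads": str(max_threads)}
    with open(path, "w") as f:
        config.write(f)


def read_max_threads(path, default=None):
    """Read max_threads back; return `default` if the file or option is missing."""
    config = configparser.ConfigParser()
    config.read(path)  # silently skips files that do not exist
    return config.getint("transfer", "max_threads", fallback=default)


# Throwaway location for the demo, not the user's real config file.
demo_path = os.path.join(tempfile.mkdtemp(), "synapseConfig")
write_max_threads(demo_path, 16)
print(read_max_threads(demo_path))

# Programmatic equivalent (assumed, not executed here):
#   syn = synapseclient.Synapse(); syn.max_threads = 16
```

The awscli counterpart used in the comparison is its `max_concurrent_requests` S3 setting.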

Here is a final test run that reflects the ability, available in 2.1, to customize the number of concurrent threads. It ran on an EC2 t3a.2xlarge (8 cores) in the same region (us-east-1) as the destination bucket, with all CPUs idle. The uploads are large-ish (2 GB) and each configuration is averaged over 10 uploads.

awscli - awscli (default configuration)
synapse2.0 - 2.0 release of synapse
synapse2.1 - 2.1 release of synapse (default configuration)
awscli-16 - awscli configured with max_concurrent_requests=16
synapse2.1-16 - 2.1 release of synapse (max_threads=16)
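The measurement procedure described above (time each upload, repeat 10 times, take the mean) can be sketched roughly as follows. This is not the script that was shared in the linked ticket; `time_command` is a hypothetical helper, and the example command in the comment uses placeholder file and bucket names.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=10):
    """Run `argv` `runs` times; return (list of per-run seconds, mean seconds)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so failed runs are not averaged in.
        subprocess.run(argv, check=True)
        durations.append(time.perf_counter() - start)
    return durations, statistics.mean(durations)


# Example (placeholder names, not executed here):
#   durations, mean = time_command(
#       ["aws", "s3", "cp", "test-file.bin", "s3://example-bucket/"])
```

Wall-clock timing of the whole command includes process startup, which matters little for uploads that take tens of seconds.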

 

Run    awscli    synapse2.0   synapse2.1   awscli-16   synapse2.1-16
1      31.235    39.091       34.672       19.049      26.831
2      18.654    32.374       28.912       19.801      33.95
3      32.403    35.441       25.932       19.367      29.583
4      33.538    45.738       27.283       31.062      17.908
5      19.209    31.048       29.32        19.163      16.153
6      29.118    30.514       26.813       18.888      16.732
7      33.889    36.17        28.807       27.006      18.493
8      34.359    42.342       31.737       36.339      18.64
9      18.92     33.616       27.095       32.074      28.67
10     19.58     44.78        28.283       19.155      22.462
Mean   27.0905   37.1114      28.8854      24.1904     22.9422
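To make the comparison with awscli easier to read, the snippet below divides each mean from the table above by the default awscli mean. The numbers are copied from the table; the unit is assumed to be seconds, and `relative_to` is a hypothetical helper name.

```python
# Mean upload times copied from the 2 GB table (seconds assumed).
MEANS = {
    "awscli": 27.0905,
    "synapse2.0": 37.1114,
    "synapse2.1": 28.8854,
    "awscli-16": 24.1904,
    "synapse2.1-16": 22.9422,
}


def relative_to(baseline, means):
    """Return each configuration's mean time divided by the baseline's mean."""
    base = means[baseline]
    return {name: value / base for name, value in means.items()}


ratios = relative_to("awscli", MEANS)
for name, ratio in ratios.items():
    print(f"{name:14s} {ratio:.2f}x the default awscli time")
```

On these means, synapse2.0 takes roughly 1.37 times as long as default awscli and synapse2.1 roughly 1.07 times as long.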

I think we should release these changes for 2.1 and see how they are received.

Jordan Kiang
April 10, 2020, 5:39 PM

Resolving this per above.

Milen Nikolov
April 10, 2020, 5:56 PM

Thanks! After 2.1 is released I will communicate with collaborators on our end that they can test performance.

Jordan Kiang
April 10, 2020, 9:32 PM

Some data between 100MB and 1GB was requested. Here is 500MB, on the same machine as described above (average of 10 runs). Again, when the uploads take on the order of 10s, the extra connections that we make to Synapse (create, fetch urls, complete each url, and complete upload) make the gains less pronounced, but they are still there.

 

 

Run    awscli    synapse2.0   synapse2.1
1      5.742     11.165       10.886
2      22.575    24.869       6.753
3      5.436     10.725       11.851
4      6.123     9.581        8.764
5      17.658    19.398       9.472
6      5.889     10.402       13.942
7      9.803     11.176       8.163
8      6.385     23.574       11.228
9      7.025     9.148        20.386
10     8.033     14.38        8.827
Mean   9.4669    14.4418      11.0272
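One rough way to see the fixed per-upload cost mentioned above is to draw a straight line, time = fixed + per_gb * size, through the 500MB and 2 GB means for each client. This is only an illustration: it uses two noisy averages per client, treats the file sizes as exactly 0.5 and 2.0 GB, and assumes the times are in seconds. `fit_line` is a hypothetical helper.

```python
# Reported mean upload times keyed by approximate upload size in GB.
MEANS = {
    "awscli":     {0.5: 9.4669,  2.0: 27.0905},
    "synapse2.0": {0.5: 14.4418, 2.0: 37.1114},
    "synapse2.1": {0.5: 11.0272, 2.0: 28.8854},
}


def fit_line(points):
    """Fit time = fixed + per_gb * size through exactly two (size, time) points."""
    (s1, t1), (s2, t2) = sorted(points.items())
    per_gb = (t2 - t1) / (s2 - s1)
    fixed = t1 - per_gb * s1
    return fixed, per_gb


fits = {name: fit_line(points) for name, points in MEANS.items()}
for name, (fixed, per_gb) in fits.items():
    print(f"{name:11s} fixed ~{fixed:.1f}s, ~{per_gb:.1f}s per GB")
```

Under this crude model the per-GB rates of awscli and synapse2.1 come out close to each other, while the fixed term is larger for synapse2.1, which is consistent with the extra round trips to Synapse weighing more heavily on smaller uploads.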

 

Assignee

Jordan Kiang

Reporter

Bruce Hoff

Labels

None

Validator

Bruce Hoff

Development Area

None

Release Version History

None

Sprint

None

Fix versions

Priority

Major