How much data is there?

Synapse has an object called a “file handle”: a low-level object that references a cloud storage bucket and key and carries basic metadata such as file name and size. Conceptually, the data “footprint” of Synapse is the sum of the sizes of all the files that Synapse indexes as file handles. The Athena filesnapshots table lists all the file handles in Synapse and can be used to compute count and aggregate size statistics. However, there are duplicate records to address. First, snapshots are taken periodically, so we keep only the latest snapshot of each record. Second, there can be multiple file handle IDs for a given <bucket, key> pair, which we address by taking just the latest (maximum) ID. The following Athena query asks how much file data is indexed in Synapse:

with latest as (
    -- latest snapshot timestamp for each <bucket, key>
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
),
max_id as (
    -- latest file handle ID for each <bucket, key>
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select sum(fs.content_size)/1099511627776 as total_size_tebibytes
from filesnapshots fs
join latest on fs.snapshot_timestamp=latest.snapshot_timestamp and fs.bucket=latest.bucket and fs.key=latest.key
join max_id on fs.id=max_id.id and fs.bucket=max_id.bucket and fs.key=max_id.key
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY

The result is 2793 TiB (about 2.7 pebibytes).
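
The deduplication can also be expressed with a single window function, avoiding the double join. This is just a sketch of an equivalent formulation, not the query that produced the number above:

with ranked as (
    -- rank each <bucket, key>'s records so that the newest snapshot of the
    -- highest file handle ID comes first
    select content_size,
           row_number() over (
               partition by bucket, key
               order by id desc, snapshot_timestamp desc
           ) as rn
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
)
select sum(content_size)/1099511627776 as total_size_tebibytes
from ranked
where rn = 1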

Note that this total includes files that have been ‘garbage collected’ and moved to Glacier. We can drill down by looking at the ‘status’ column, whose values are AVAILABLE, ARCHIVED, and UNLINKED (the values are described in the Synapse documentation). The files that are actually accessible through Synapse are AVAILABLE; the others are in various stages of garbage collection:

with latest as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
),
max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select sum(fs.content_size)/1099511627776 as size_tebibytes, fs.status
from filesnapshots fs
join latest on fs.snapshot_timestamp=latest.snapshot_timestamp and fs.bucket=latest.bucket and fs.key=latest.key
join max_id on fs.id=max_id.id and fs.bucket=max_id.bucket and fs.key=max_id.key
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
group by fs.status

This returns:

size_tebibytes   status
          2271   AVAILABLE
           276   UNLINKED
           236   ARCHIVED

In other words, about 2.2 pebibytes are “available”.

Synapse is a federated data store. The 2+ PiB of available data are not all in one bucket owned by Sage. Rather, users can provision and link their own AWS S3 or Google Cloud Storage buckets, then store objects there, indexed in Synapse. They can even create “external” file handles, for which Synapse simply stores the files' URLs. Let’s divide up the data by bucket, listing just the 25 largest buckets:

with latest as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
),
max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select fs.bucket, sum(fs.content_size)/1099511627776 as size_tebibytes
from filesnapshots fs
join latest on fs.snapshot_timestamp=latest.snapshot_timestamp and fs.bucket=latest.bucket and fs.key=latest.key
join max_id on fs.id=max_id.id and fs.bucket=max_id.bucket and fs.key=max_id.key
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.status='AVAILABLE'
group by fs.bucket
order by size_tebibytes desc
limit 25

bucket                                                        size_tebibytes
proddata.sagebase.org                                                    698
ad-knowledge-portal-main                                                 324
ad-knowledge-portal-large                                                236
amp-mayo-sinai-synapseencryptedexternalbucket-1bmvn8rlwixv2              163
nda-bsmn-scratch                                                         152
htan-dcc-htapp                                                            85
exceptional-longevity                                                     66
diverse-cohorts                                                           52
mpnstwgs                                                                  45
nf-syn23664726-s3-bucket-n9uakf7bowwd                                     43
integrated-immunology                                                     41
gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b                  38
sc-237179673806-pp-hg7rd6vvvd7mk-s3bucket-eu11pjwpxxiv                    37
htan-dcc-chop                                                             36
htan-dcc-stanford                                                         33
strides-ampad-project-tower-bucket                                        24
htan-dcc-hms                                                              20
htan-dcc-vanderbilt                                                       18
htan-dcc-duke                                                             15
ntap-add5-project-tower-bucket                                            12
ctf-swnts-project-tower-bucket                                            11
htan-dcc-dfci                                                              8
htan-dcc-washu                                                             8
sc-237179673806-pp-pisxjdspvwrtg-s3bucket-deoh9m8xdfcc                     8
nda-bsmn                                                                   7

The aggregate size of these 25 buckets is 2180 TiB, or 96% of the AVAILABLE total.
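
As an aside, the 96% share can be computed in the query itself rather than by hand. A sketch, reusing the latest and max_id CTEs from the bucket query above and adding a window function over the per-bucket aggregates:

select fs.bucket,
       sum(fs.content_size)/1099511627776 as size_tebibytes,
       -- each bucket's percentage of the AVAILABLE total
       round(100.0 * sum(fs.content_size) / sum(sum(fs.content_size)) over (), 1) as pct_of_total
from filesnapshots fs
join latest on fs.snapshot_timestamp=latest.snapshot_timestamp and fs.bucket=latest.bucket and fs.key=latest.key
join max_id on fs.id=max_id.id and fs.bucket=max_id.bucket and fs.key=max_id.key
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.status='AVAILABLE'
group by fs.bucket
order by size_tebibytes desc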

proddata.sagebase.org is the primary/default Synapse bucket. The next two largest buckets, ad-knowledge-portal-main and ad-knowledge-portal-large, are in Sage’s AWS STRIDES account, where the costs are paid directly by NIH. All 25 buckets are recognizable by name as being related to a Sage project and residing in a Sage AWS account. We can look at storage from the S3 perspective as a rough validation of these figures: our total S3 footprint should be larger than our Synapse footprint.

We can use S3 Storage Lens to get a summary of our storage usage from the AWS perspective. As of June 22, 2023:

AWS Account(s)                      S3 usage, TiB
Sage Organization (~35 accounts)    1953
*6185: STRIDES AD                   823.3
*5034: (Nextflow Tower)             230.7
Total                               3007
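
For a per-bucket cross-check of the Storage Lens numbers, one could also point Athena at S3 Inventory reports. This is only a sketch: it assumes an Athena table named s3_inventory has been configured over the inventory reports, with the standard bucket, key, and size columns.

-- Sketch: per-bucket totals from a hypothetical s3_inventory table.
select bucket, sum(size)/1099511627776 as size_tebibytes
from s3_inventory
group by bucket
order by size_tebibytes desc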

Since 3007 TiB > 2793 TiB, the values seem plausible. The remaining ~200 TiB may be accounted for by data which Sage owns but which is unrelated to Synapse. Below we reconcile on a per-bucket basis. There are two numerical columns for bucket size as reported by S3, in Sage’s “organization” and in STRIDES respectively, and one column for the amount indexed in Synapse.

Bucket                                                        S3 bucket size (TiB)   STRIDES S3 bucket size (TiB)   Indexed in Synapse (“AVAILABLE”) (TiB)
proddata.sagebase.org                                         937.3                                                 698
ad-knowledge-portal-main                                                             333.1                          324
ad-knowledge-portal-large                                                            236.1                          236
amp-mayo-sinai-synapseencryptedexternalbucket-1bmvn8rlwixv2   312                                                   163
nda-bsmn-scratch *                                                                                                  152
htan-dcc-htapp *                                              0 (bucket is gone)                                    85 (also 62 TiB “UNLINKED”)
exceptional-longevity                                         66.2                                                  66
diverse-cohorts                                               54.9                                                  52
mpnstwgs                                                      45                                                    45
nf-syn23664726-s3-bucket-n9uakf7bowwd                         45.4                                                  43
integrated-immunology                                         41.8                                                  41
gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b      39.2                                                  38
sc-237179673806-pp-hg7rd6vvvd7mk-s3bucket-eu11pjwpxxiv        37.7                                                  37
htan-dcc-chop                                                 35.2                                                  36
htan-dcc-stanford                                             26.9                                                  33
strides-ampad-project-tower-bucket                                                   187.8                          24
htan-dcc-hms                                                  16.1                                                  20
htan-dcc-vanderbilt                                           19.2                                                  18
htan-dcc-duke                                                 15.8                                                  15
ntap-add5-project-tower-bucket                                16.3                                                  12
ctf-swnts-project-tower-bucket                                11.9                                                  11
htan-dcc-dfci *                                               0                                                     8
htan-dcc-washu *                                              0                                                     8
sc-237179673806-pp-pisxjdspvwrtg-s3bucket-deoh9m8xdfcc        6.8                                                   8
nda-bsmn *                                                                                                          7

The five buckets marked with an asterisk appear in Synapse but not in our list of top S3 buckets. They have been emptied or removed by their owners, leaving the index in Synapse ‘stale'. See the child page on stale buckets for more details.
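
Staleness candidates can also be flagged from the Synapse side alone: a bucket whose footprint is mostly UNLINKED or ARCHIVED is likely in some stage of garbage collection. A sketch, reusing the latest and max_id CTEs from the first queries above:

select fs.bucket, fs.status, sum(fs.content_size)/1099511627776 as size_tebibytes
from filesnapshots fs
join latest on fs.snapshot_timestamp=latest.snapshot_timestamp and fs.bucket=latest.bucket and fs.key=latest.key
join max_id on fs.id=max_id.id and fs.bucket=max_id.bucket and fs.key=max_id.key
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
group by fs.bucket, fs.status
order by fs.bucket, size_tebibytes desc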

While relating content size to bucket name gives some insight, it does not explain the usage of the large, default Synapse bucket. We can investigate how this bucket is used, and gain more insight into other data usage, by relating content size to Synapse projects. We do this by joining the filesnapshots table, queried above, with the nodesnapshots table, which has a project_id column and relates to filesnapshots through its file_handle_id column. Let’s first query the total size of data across all projects:

with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
),
latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
),
fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select sum(fs.content_size)/1099511627776 as project_size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
where
ns.node_type='file'
and ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.status='AVAILABLE'

The result is 2108 TiB. This is close to, but 163 TiB less than, the 2271 TiB found by the earlier query on filesnapshots alone. (TODO: Where did the other 163 TiB go?) Now we partition the result by project and bucket:
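
One plausible home for the missing 163 TiB is file handles that are AVAILABLE but not referenced by the latest snapshot of any file node, for example handles attached to tables, wiki pages, or superseded entity versions. A sketch of that anti-join, reusing the latest_ns, latest_fs, and fs_max_id CTEs defined above:

-- Sketch: AVAILABLE storage not attached to any current file node.
select sum(fs.content_size)/1099511627776 as unreferenced_size_tebibytes
from filesnapshots fs
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key
    and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.status='AVAILABLE'
and fs.id not in (
    select ns.file_handle_id
    from nodesnapshots ns
    join latest_ns on ns.id=latest_ns.id and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
    where ns.node_type='file'
    and ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and ns.file_handle_id is not null
)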

with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
),
latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
),
fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select 'https://www.synapse.org/#!Synapse:syn' || cast(ns.project_id as varchar) as project, fs.bucket, sum(fs.content_size)/1099511627776 as project_size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
where
ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.status='AVAILABLE'
group by ns.project_id, fs.bucket
order by project_size_tebibytes desc
limit 50

The top 50 account for 86% of the 2108 TiB:

Project                                        Bucket                                                        Project size, TiB
https://www.synapse.org/#!Synapse:syn2580853   ad-knowledge-portal-main                                      325
https://www.synapse.org/#!Synapse:syn4921369   proddata.sagebase.org                                         249
https://www.synapse.org/#!Synapse:syn15067647  amp-mayo-sinai-synapseencryptedexternalbucket-1bmvn8rlwixv2   154
https://www.synapse.org/#!Synapse:syn5902559   nda-bsmn-scratch                                              150
https://www.synapse.org/#!Synapse:syn2580853   ad-knowledge-portal-large                                     144
https://www.synapse.org/#!Synapse:syn20834712  htan-dcc-htapp                                                 85
https://www.synapse.org/#!Synapse:syn27229419  exceptional-longevity                                          66
https://www.synapse.org/#!Synapse:syn51090228  mpnstwgs                                                       45
https://www.synapse.org/#!Synapse:syn23664726  nf-syn23664726-s3-bucket-n9uakf7bowwd                          43
https://www.synapse.org/#!Synapse:syn25982655  integrated-immunology                                          39
https://www.synapse.org/#!Synapse:syn25986113  sc-237179673806-pp-hg7rd6vvvd7mk-s3bucket-eu11pjwpxxiv         37
https://www.synapse.org/#!Synapse:syn22776798  htan-dcc-chop                                                  33
https://www.synapse.org/#!Synapse:syn2759792   proddata.sagebase.org                                          27
https://www.synapse.org/#!Synapse:syn22860959  diverse-cohorts                                                27
https://www.synapse.org/#!Synapse:syn23511964  htan-dcc-stanford                                              25
https://www.synapse.org/#!Synapse:syn30985329  strides-ampad-project-tower-bucket                             24
https://www.synapse.org/#!Synapse:syn2580853   diverse-cohorts                                                23
https://www.synapse.org/#!Synapse:syn49637038  proddata.sagebase.org                                          21
https://www.synapse.org/#!Synapse:syn1867134   proddata.sagebase.org                                          18
https://www.synapse.org/#!Synapse:syn21050481  htan-dcc-vanderbilt                                            17
https://www.synapse.org/#!Synapse:syn2787333   proddata.sagebase.org                                          17
https://www.synapse.org/#!Synapse:syn11562183  proddata.sagebase.org                                          16
https://www.synapse.org/#!Synapse:syn10337516  proddata.sagebase.org                                          15
https://www.synapse.org/#!Synapse:syn2370594   proddata.sagebase.org                                          14
https://www.synapse.org/#!Synapse:syn22123910  htan-dcc-hms                                                   14
https://www.synapse.org/#!Synapse:syn23511961  htan-dcc-duke                                                  14
https://www.synapse.org/#!Synapse:syn25882123  proddata.sagebase.org                                          10
https://www.synapse.org/#!Synapse:syn5016925   proddata.sagebase.org                                          10
https://www.synapse.org/#!Synapse:syn23548160  gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b        9
https://www.synapse.org/#!Synapse:syn22691098  gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b        9
https://www.synapse.org/#!Synapse:syn26133760  sc-237179673806-pp-pisxjdspvwrtg-s3bucket-deoh9m8xdfcc          8
https://www.synapse.org/#!Synapse:syn9727752   ctf-swnts-project-tower-bucket                                  8
https://www.synapse.org/#!Synapse:syn22255320  htan-dcc-washu                                                  8
https://www.synapse.org/#!Synapse:syn7342718   proddata.sagebase.org                                           8
https://www.synapse.org/#!Synapse:syn17866381  gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b        7
https://www.synapse.org/#!Synapse:syn36938960  proddata.sagebase.org                                           7
https://www.synapse.org/#!Synapse:syn5902559   nda-bsmn                                                        7
https://www.synapse.org/#!Synapse:syn30985329  proddata.sagebase.org                                           7
https://www.synapse.org/#!Synapse:syn22255433  proddata.sagebase.org                                           6
https://www.synapse.org/#!Synapse:syn21984813  proddata.sagebase.org                                           6
https://www.synapse.org/#!Synapse:syn5698493   synodos.eu.frankfurt.penn                                       6
https://www.synapse.org/#!Synapse:syn4939902   ntap-add5-project-tower-bucket                                  6
https://www.synapse.org/#!Synapse:syn23625052  proddata.sagebase.org                                           6
https://www.synapse.org/#!Synapse:syn23511954  htan-dcc-dfci                                                   6
https://www.synapse.org/#!Synapse:syn11374339  ntap-cnf-cell-project-tower-bucket                              5
https://www.synapse.org/#!Synapse:syn18485874  proddata.sagebase.org                                           5
https://www.synapse.org/#!Synapse:syn12299750  proddata.sagebase.org                                           5
https://www.synapse.org/#!Synapse:syn22093319  htan-dcc-ohsu                                                   5
https://www.synapse.org/#!Synapse:syn24984270  htan-dcc-tnp-sardana                                            5
https://www.synapse.org/#!Synapse:syn26720920  proddata.sagebase.org                                           4

This query breaks down the data by is_public, is_controlled, is_restricted:

with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
),
latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
),
fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select ns.is_public, ns.is_controlled, ns.is_restricted, sum(fs.content_size)/1099511627776 as size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
where
ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.status='AVAILABLE'
group by ns.is_public, ns.is_controlled, ns.is_restricted
order by size_tebibytes desc

Result:

is_public   is_controlled   is_restricted   size_tebibytes
FALSE       FALSE           FALSE           827
TRUE        TRUE            TRUE            615
TRUE        FALSE           FALSE           196
TRUE        TRUE            FALSE           155
FALSE       TRUE            FALSE           124
FALSE       FALSE           TRUE             73
TRUE        FALSE           TRUE             63
FALSE       TRUE            TRUE             53

TODO: What are the exact definitions of these terms?

Let’s repeat the project breakdown query but look only at “public” data:

with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
),
latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
),
fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select 'https://www.synapse.org/#!Synapse:syn' || cast(ns.project_id as varchar) as project, fs.bucket, sum(fs.content_size)/1099511627776 as project_size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
where
ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.status='AVAILABLE'
and ns.is_public=TRUE
group by ns.project_id, fs.bucket
order by project_size_tebibytes desc
limit 50

The total drops to 948 TiB, dominated by the AD Portal and PsychEncode, as before. If we restrict the query further to the Synapse ‘default’ bucket, we find 279 TiB of public data. To see how it breaks down by project:

with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
),
latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
),
fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select 'https://www.synapse.org/#!Synapse:syn' || cast(ns.project_id as varchar) as project, fs.bucket, sum(fs.content_size)/1099511627776 as project_size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
where
ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
and fs.status='AVAILABLE'
and fs.bucket='proddata.sagebase.org'
and ns.is_public=TRUE
group by ns.project_id, fs.bucket
order by project_size_tebibytes desc
limit 50

The top 21 projects account for 83% of the data, with PsychEncode at the top of the list:

Project                                        Bucket                  Project size, TiB
https://www.synapse.org/#!Synapse:syn4921369   proddata.sagebase.org   152
https://www.synapse.org/#!Synapse:syn49637038  proddata.sagebase.org    21
https://www.synapse.org/#!Synapse:syn2759792   proddata.sagebase.org    19
https://www.synapse.org/#!Synapse:syn21984813  proddata.sagebase.org     6
https://www.synapse.org/#!Synapse:syn12299750  proddata.sagebase.org     5
https://www.synapse.org/#!Synapse:syn4984604   proddata.sagebase.org     4
https://www.synapse.org/#!Synapse:syn51364943  proddata.sagebase.org     4
https://www.synapse.org/#!Synapse:syn9727752   proddata.sagebase.org     3
https://www.synapse.org/#!Synapse:syn11638893  proddata.sagebase.org     3
https://www.synapse.org/#!Synapse:syn2343195   proddata.sagebase.org     2
https://www.synapse.org/#!Synapse:syn17083203  proddata.sagebase.org     2
https://www.synapse.org/#!Synapse:syn17865732  proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn22255433  proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn5610418   proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn21628800  proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn2395480   proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn32909874  proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn2580853   proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn4939902   proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn23644107  proddata.sagebase.org     1
https://www.synapse.org/#!Synapse:syn22079621  proddata.sagebase.org     1

How much is the data used?

We can use the filedownloadrecords table to see how much each Synapse project is being accessed.

Download volume per month for the previous half year:

with
latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
    group by bucket, key
),
fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
    group by bucket, key
)
select month(fdr.record_date) as month, sum(fs.content_size)/1073741824 as download_volume_gibibytes
from filedownloadrecords fdr
join filesnapshots fs on fdr.file_handle_id=fs.id
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
where
fdr.timestamp > current_timestamp - INTERVAL '180' DAY -- downloads in the last 180 days
and fs.snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
and fs.status='AVAILABLE'
group by month(fdr.record_date)
order by month(fdr.record_date)

returns:

Month   Download volume, GiB
6       2031037
7       4553643

At the time of this writing we have not yet accumulated a full month’s worth of data for July. Let’s compare the current month’s egress to AWS’s stats. For July 1-12, these are the egress charges from AWS Cost Explorer:

Account(s)          Cost    Data out, GB
Sage Organization   $2400   29,800
AD STRIDES          $6500   94,670
NF STRIDES          ~$0     ~0

These numbers are far lower than those reported by Synapse. This could be because the data did not leave the AWS region (in-region transfer is not billed as egress), or because the presigned URL returned by Synapse (the creation of which is what triggers the recording of a download) was never actually used by the client.
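
Repeat downloads would also inflate the Synapse-side number without producing proportional egress. A sketch of a query to gauge repeat traffic for July; it assumes filedownloadrecords carries a user_id column alongside file_handle_id and record_date:

select count(*) as download_records,
       count(distinct fdr.file_handle_id) as distinct_files,
       -- distinct <user, file> pairs; repeat downloads collapse to one pair
       count(distinct concat(cast(fdr.user_id as varchar), ':',
                             cast(fdr.file_handle_id as varchar))) as distinct_user_file_pairs
from filedownloadrecords fdr
where month(fdr.record_date) = 7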

Let’s break down the download volume by project as well as bucket:

with
latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
    group by bucket, key
),
fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
    group by bucket, key
)
select 'https://www.synapse.org/#!Synapse:syn' || cast(fdr.project_id as varchar) as project, fs.bucket, sum(fs.content_size)/1073741824 as download_volume_gibibytes
from filedownloadrecords fdr
join filesnapshots fs on fdr.file_handle_id=fs.id
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
where
fdr.timestamp > current_timestamp - INTERVAL '180' DAY -- downloads in the last 180 days
and fs.snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
and fs.status='AVAILABLE'
and month(fdr.record_date)=7
group by fdr.project_id, fs.bucket
order by download_volume_gibibytes desc
limit 50

15 projects accounted for 99% of the download activity. Only 6 of these projects use the Synapse default bucket:

Project                                        Bucket                                                   Download volume, GiB
https://www.synapse.org/#!Synapse:syn23511964  htan-dcc-stanford                                        1804241
https://www.synapse.org/#!Synapse:syn2580853   ad-knowledge-portal-main                                  846984
https://www.synapse.org/#!Synapse:syn2580853   diverse-cohorts                                           530371
https://www.synapse.org/#!Synapse:syn51090228  mpnstwgs                                                  325521
https://www.synapse.org/#!Synapse:syn3193805   proddata.sagebase.org                                     233216
https://www.synapse.org/#!Synapse:syn21050481  htan-dcc-vanderbilt                                       146573
https://www.synapse.org/#!Synapse:syn23511961  htan-dcc-duke                                             132917
https://www.synapse.org/#!Synapse:syn51364943  proddata.sagebase.org                                     126349
https://www.synapse.org/#!Synapse:syn22093319  htan-dcc-ohsu                                              73147
https://www.synapse.org/#!Synapse:syn2580853   ecmonsen-emorypipeline-lambda-055273631518-us-east-1       61412
https://www.synapse.org/#!Synapse:syn51703277  proddata.sagebase.org                                      51600
https://www.synapse.org/#!Synapse:syn25986113  sc-237179673806-pp-hg7rd6vvvd7mk-s3bucket-eu11pjwpxxiv     50989
https://www.synapse.org/#!Synapse:syn2580853   proddata.sagebase.org                                      50526
https://www.synapse.org/#!Synapse:syn51156910  proddata.sagebase.org                                      31063
https://www.synapse.org/#!Synapse:syn17865732  proddata.sagebase.org                                      26497

This analysis could be further refined to look only at downloads of public data.
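
As a sketch, that refinement would join the download records to nodesnapshots (deduplicated with the latest_ns CTE used in the project-size queries, alongside the latest_fs and fs_max_id CTEs above) and keep only rows where the node is public:

-- Sketch: restrict the per-project download query to public data.
select 'https://www.synapse.org/#!Synapse:syn' || cast(fdr.project_id as varchar) as project,
       sum(fs.content_size)/1073741824 as download_volume_gibibytes
from filedownloadrecords fdr
join filesnapshots fs on fdr.file_handle_id=fs.id
join latest_fs on fs.bucket=latest_fs.bucket and fs.key=latest_fs.key
    and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
join fs_max_id on fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key and fs.id=fs_max_id.id
join nodesnapshots ns on ns.file_handle_id=fs.id
join latest_ns on ns.id=latest_ns.id and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
where fs.status='AVAILABLE'
and ns.is_public=TRUE
and month(fdr.record_date)=7
group by fdr.project_id
order by download_volume_gibibytes desc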