...
```sql
with latest as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key, id, content_size, status
), max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select sum(fs.content_size)/1099511627776 total_size_tebibytes
from filesnapshots fs
join latest on fs.snapshot_timestamp=latest.snapshot_timestamp
    and fs.bucket=latest.bucket and fs.key=latest.key
join max_id on fs.id=max_id.id
    and fs.bucket=max_id.bucket and fs.key=max_id.key
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
```
...
```sql
with latest as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key, id, content_size, status
), max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select sum(fs.content_size)/1099511627776 size_tebibytes, status
from filesnapshots fs
join latest on fs.snapshot_timestamp=latest.snapshot_timestamp
    and fs.bucket=latest.bucket and fs.key=latest.key
join max_id on fs.id=max_id.id
    and fs.bucket=max_id.bucket and fs.key=max_id.key
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
group by fs.status
```
This returns:
| | size_tebibytes | status |
|---|---|---|
| 1 | 2271 | AVAILABLE |
| 2 | 276 | UNLINKED |
| 3 | 236 | ARCHIVED |
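The divisor 1099511627776 in these queries is 1024^4, i.e. bytes per tebibyte. A quick sanity check on the conversion and the per-status shares (figures from the table above):

```python
# Per-status totals in TiB, from the query result above.
status_tib = {"AVAILABLE": 2271, "UNLINKED": 276, "ARCHIVED": 236}

BYTES_PER_TIB = 1024 ** 4  # 1099511627776, the divisor used in the queries

total_tib = sum(status_tib.values())  # total indexed across all statuses
shares = {s: round(100 * v / total_tib) for s, v in status_tib.items()}
print(total_tib)  # 2783
print(shares)     # AVAILABLE dominates at ~82%
```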
...
```sql
with latest as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key, id, content_size
), max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select fs.bucket, sum(fs.content_size)/1099511627776 size_tebibytes
from filesnapshots fs
join latest on fs.snapshot_timestamp=latest.snapshot_timestamp
    and fs.bucket=latest.bucket and fs.key=latest.key
join max_id on fs.id=max_id.id
    and fs.bucket=max_id.bucket and fs.key=max_id.key
where fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and status='AVAILABLE'
group by fs.bucket
order by size_tebibytes desc
limit 25
```
| | bucket | size_tebibytes |
|---|---|---|
| 1 | proddata.sagebase.org | 698 |
| 2 | ad-knowledge-portal-main | 324 |
| 3 | ad-knowledge-portal-large | 236 |
| 4 | amp-mayo-sinai-synapseencryptedexternalbucket-1bmvn8rlwixv2 | 163 |
| 5 | nda-bsmn-scratch | 152 |
| 6 | htan-dcc-htapp | 85 |
| 7 | exceptional-longevity | 66 |
| 8 | diverse-cohorts | 52 |
| 9 | mpnstwgs | 45 |
| 10 | nf-syn23664726-s3-bucket-n9uakf7bowwd | 43 |
| 11 | integrated-immunology | 41 |
| 12 | gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b | 38 |
| 13 | sc-237179673806-pp-hg7rd6vvvd7mk-s3bucket-eu11pjwpxxiv | 37 |
| 14 | htan-dcc-chop | 36 |
| 15 | htan-dcc-stanford | 33 |
| 16 | strides-ampad-project-tower-bucket | 24 |
| 17 | htan-dcc-hms | 20 |
| 18 | htan-dcc-vanderbilt | 18 |
| 19 | htan-dcc-duke | 15 |
| 20 | ntap-add5-project-tower-bucket | 12 |
...
The aggregate size of these 25 buckets is 2180 TiB, or 96% of the total. proddata.sagebase.org is the primary/default Synapse bucket. The next two large buckets, ad-knowledge-portal-main and ad-knowledge-portal-large, are in Sage's AWS STRIDES account, where the costs are paid directly by NIH. All of the buckets are recognizable by name as being related to a Sage project and residing in a Sage AWS account. We can look at storage from the S3 perspective as a rough validation of these figures: our total S3 footprint should be larger than our Synapse footprint.
We can use S3 Storage Lens to get a summary of our storage usage from the AWS perspective. As of June 22, 2023:
| AWS Account(s) | S3 usage, TiB |
|---|---|
| Sage Organization (~35 accounts) | 1953 |
| *6185: Strides AD | 823.3 |
| *5034: (Nextflow Tower) | 230.7 |
| Total | 3007 |
Since 3007 TiB > 2783 TiB, the values seem to be consistent. The remaining ~220 TiB may be accounted for by data which Sage owns but which is unrelated to Synapse. Below we reconcile on a per-bucket basis. There are two numerical columns for bucket size as reported by S3, in Sage's "organization" and in STRIDES respectively, and one column for the amount indexed in Synapse.
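The totals above reduce to simple arithmetic (figures taken from the two preceding tables):

```python
# Synapse-indexed totals by status, TiB (from the filesnapshots query above)
synapse_total_tib = 2271 + 276 + 236      # AVAILABLE + UNLINKED + ARCHIVED

# S3 Storage Lens totals, TiB (Sage org + AD STRIDES + Nextflow Tower)
s3_total_tib = 1953 + 823.3 + 230.7

# The S3 footprint should exceed the Synapse-indexed footprint.
assert s3_total_tib > synapse_total_tib
print(s3_total_tib - synapse_total_tib)   # the unaccounted-for remainder, TiB
```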
| Bucket | S3 bucket size (TiB) | STRIDES S3 bucket size (TiB) | Indexed in Synapse ("AVAILABLE") (TiB) |
|---|---|---|---|
| | 937.3 | | 698 |
| ad-knowledge-portal-main | | 333.1 | 324 |
| ad-knowledge-portal-large | | 236.1 | 236 |
| amp-mayo-sinai-synapseencryptedexternalbucket-1bmvn8rlwixv2 | 312 | | 163 |
| nda-bsmn-scratch | | | 152 |
| htan-dcc-htapp | 0 (bucket is gone) | | 85 (also 62 TiB "UNLINKED") |
| exceptional-longevity | 66.2 | | 66 |
| diverse-cohorts | 54.9 | | 52 |
| mpnstwgs | 45 | | 45 |
| nf-syn23664726-s3-bucket-n9uakf7bowwd | 45.4 | | 43 |
| integrated-immunology | 41.8 | | 41 |
| gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b | 39.2 | | 38 |
| sc-237179673806-pp-hg7rd6vvvd7mk-s3bucket-eu11pjwpxxiv | 37.7 | | 37 |
| htan-dcc-chop | 35.2 | | 36 |
| htan-dcc-stanford | 26.9 | | 33 |
| strides-ampad-project-tower-bucket | | 187.8 | 24 |
| htan-dcc-hms | 16.1 | | 20 |
| htan-dcc-vanderbilt | 19.2 | | 18 |
| htan-dcc-duke | 15.8 | | 15 |
| ntap-add5-project-tower-bucket | 16.3 | | 12 |
| ctf-swnts-project-tower-bucket | 11.9 | | 11 |
| htan-dcc-dfci | 0 | | 8 |
| htan-dcc-washu | 0 | | 8 |
| sc-237179673806-pp-pisxjdspvwrtg-s3bucket-deoh9m8xdfcc | 6.8 | | 8 |
| nda-bsmn | | | 7 |
The five buckets highlighted in red appear in Synapse but not in our list of top S3 buckets. They have been emptied or removed by their owners, leaving the index in Synapse 'stale'. See the child page on stale buckets for more details.
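The stale-bucket condition is easy to express: a bucket whose Synapse index still shows AVAILABLE data but which reports no size in S3. A minimal sketch over the affected rows of the reconciliation table (`None` marks a bucket absent from the S3 listings entirely):

```python
# (bucket, s3_tib, synapse_available_tib) for a sample of rows from the
# reconciliation table above; None = absent from the S3 listing.
rows = [
    ("ad-knowledge-portal-main", 333.1, 324),
    ("nda-bsmn-scratch", None, 152),
    ("htan-dcc-htapp", 0, 85),
    ("htan-dcc-dfci", 0, 8),
    ("htan-dcc-washu", 0, 8),
    ("nda-bsmn", None, 7),
]

# "Stale": Synapse still indexes data, but S3 reports nothing.
stale = [b for b, s3, syn in rows if (s3 is None or s3 == 0) and syn > 0]
print(stale)  # the five buckets whose Synapse index outlived the data
```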
...
```sql
with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
), latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
), fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select sum(fs.content_size)/1099511627776 project_size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id
    and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join fs_max_id on fs.id=fs_max_id.id, latest_fs
where ns.node_type='file'
    and ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.bucket=latest_fs.bucket and fs.key=latest_fs.key
    and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
    and fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key
    and fs.id=fs_max_id.id
    and fs.status='AVAILABLE'
```
The result is 2108 TiB. This is close to the result of the earlier query on filesnapshots alone. (TODO: where did the other 163 TiB go?) Now we partition the result by project and bucket:
```sql
with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
), latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
), fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select 'https://www.synapse.org/#!Synapse:syn' || cast(ns.project_id as varchar),
    fs.bucket,
    sum(fs.content_size)/1099511627776 project_size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id
    and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join fs_max_id on fs.id=fs_max_id.id, latest_fs
where ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.bucket=latest_fs.bucket and fs.key=latest_fs.key
    and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
    and fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key
    and fs.id=fs_max_id.id
    and fs.status='AVAILABLE'
group by ns.project_id, fs.bucket
order by project_size_tebibytes desc
limit 50
```
The top 50 account for 86% of the 2108 TiB:
| Project | Bucket | Project size, TiB |
|---|---|---|
| | ad-knowledge-portal-main | 325 |
| | | 249 |
| | amp-mayo-sinai-synapseencryptedexternalbucket-1bmvn8rlwixv2 | 154 |
| | nda-bsmn-scratch | 150 |
| | ad-knowledge-portal-large | 144 |
| | htan-dcc-htapp | 85 |
| | exceptional-longevity | 66 |
| | mpnstwgs | 45 |
| | nf-syn23664726-s3-bucket-n9uakf7bowwd | 43 |
| | integrated-immunology | 39 |
| | sc-237179673806-pp-hg7rd6vvvd7mk-s3bucket-eu11pjwpxxiv | 37 |
| | htan-dcc-chop | 33 |
| | | 27 |
| | diverse-cohorts | 27 |
| | htan-dcc-stanford | 25 |
| | strides-ampad-project-tower-bucket | 24 |
| | diverse-cohorts | 23 |
| | | 21 |
| | | 18 |
| | htan-dcc-vanderbilt | 17 |
| | | 17 |
| | | 16 |
| | | 15 |
| | | 14 |
| | htan-dcc-hms | 14 |
| | htan-dcc-duke | 14 |
| | | 10 |
| | | 10 |
| | gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b | 9 |
| | gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b | 9 |
| | sc-237179673806-pp-pisxjdspvwrtg-s3bucket-deoh9m8xdfcc | 8 |
| | ctf-swnts-project-tower-bucket | 8 |
| | htan-dcc-washu | 8 |
| | | 8 |
| | gates-ki-001-synapseencryptedexternalbucket-kl8vu7tqto1b | 7 |
| | | 7 |
| | nda-bsmn | 7 |
| | | 7 |
| | | 6 |
| | | 6 |
| | synodos.eu.frankfurt.penn | 6 |
| | ntap-add5-project-tower-bucket | 6 |
| | | 6 |
| | htan-dcc-dfci | 6 |
| | ntap-cnf-cell-project-tower-bucket | 5 |
| | | 5 |
| | | 5 |
| | htan-dcc-ohsu | 5 |
| | htan-dcc-tnp-sardana | 5 |
| | | 4 |
This query breaks down the data by is_public, is_controlled, is_restricted:
```sql
with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
), latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
), fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select ns.is_public, ns.is_controlled, ns.is_restricted,
    sum(fs.content_size)/1099511627776 size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id
    and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join fs_max_id on fs.id=fs_max_id.id, latest_fs
where ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.bucket=latest_fs.bucket and fs.key=latest_fs.key
    and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
    and fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key
    and fs.id=fs_max_id.id
    and fs.status='AVAILABLE'
group by ns.is_public, ns.is_controlled, ns.is_restricted
order by size_tebibytes desc
```
Result:
| is_public | is_controlled | is_restricted | size_tebibytes |
|---|---|---|---|
| FALSE | FALSE | FALSE | 827 |
| TRUE | TRUE | TRUE | 615 |
| TRUE | FALSE | FALSE | 196 |
| TRUE | TRUE | FALSE | 155 |
| FALSE | TRUE | FALSE | 124 |
| FALSE | FALSE | TRUE | 73 |
| TRUE | FALSE | TRUE | 63 |
| FALSE | TRUE | TRUE | 53 |
TODO: What are the exact definitions of these terms?
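Whatever the precise definitions, the table supports a simple public/non-public rollup (the small gap from the 2108 TiB total is per-row rounding):

```python
# (is_public, is_controlled, is_restricted, tebibytes) from the table above
rows = [
    (False, False, False, 827), (True, True, True, 615),
    (True, False, False, 196), (True, True, False, 155),
    (False, True, False, 124), (False, False, True, 73),
    (True, False, True, 63), (False, True, True, 53),
]

public = sum(t for p, c, r, t in rows if p)
private = sum(t for p, c, r, t in rows if not p)
print(public, private, public + private)  # roughly half the data is public
```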
...
```sql
with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
), latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
), fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select 'https://www.synapse.org/#!Synapse:syn' || cast(ns.project_id as varchar),
    fs.bucket,
    sum(fs.content_size)/1099511627776 project_size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id
    and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join fs_max_id on fs.id=fs_max_id.id, latest_fs
where ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.bucket=latest_fs.bucket and fs.key=latest_fs.key
    and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
    and fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key
    and fs.id=fs_max_id.id
    and fs.status='AVAILABLE'
    and ns.is_public=TRUE
group by ns.project_id, fs.bucket
order by project_size_tebibytes desc
limit 50
```
The total drops to 948 TiB, dominated by AD Portal and PsychEncode, as before. If we restrict the query to the Synapse 'default' bucket, we find 279 TiB of data. To see how it breaks down by project:
```sql
with latest_ns as (
    select id, max(snapshot_timestamp) as snapshot_timestamp
    from nodesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by id
), latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
), fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    group by bucket, key
)
select 'https://www.synapse.org/#!Synapse:syn' || cast(ns.project_id as varchar),
    fs.bucket,
    sum(fs.content_size)/1099511627776 project_size_tebibytes
from nodesnapshots ns
join latest_ns on ns.id=latest_ns.id
    and ns.snapshot_timestamp=latest_ns.snapshot_timestamp
join filesnapshots fs on ns.file_handle_id=fs.id
join fs_max_id on fs.id=fs_max_id.id, latest_fs
where ns.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.snapshot_timestamp > current_timestamp - INTERVAL '14' DAY
    and fs.bucket=latest_fs.bucket and fs.key=latest_fs.key
    and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
    and fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key
    and fs.id=fs_max_id.id
    and fs.status='AVAILABLE'
    and fs.bucket='proddata.sagebase.org'
    and ns.is_public=TRUE
group by ns.project_id, fs.bucket
order by project_size_tebibytes desc
limit 50
```
The top 21 projects account for 83% of the data, with PsychEncode at the top of the list:
| _col0 | bucket | project_size_tebibytes |
|---|---|---|
| | | 152 |
| | | 21 |
| | | 19 |
| | | 6 |
| | | 5 |
| | | 4 |
| | | 4 |
| | | 3 |
| | | 3 |
| | | 2 |
| | | 2 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
How much is the data used?
...
```sql
with latest_fs as (
    select bucket, key, max(snapshot_timestamp) as snapshot_timestamp
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
    group by bucket, key
), fs_max_id as (
    select bucket, key, max(id) as id
    from filesnapshots
    where snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
    group by bucket, key
)
select month(fdr.record_date) as month,
    sum(fs.content_size)/1073741824 as download_volume_gigabytes
from filedownloadrecords fdr
join filesnapshots fs on fdr.file_handle_id=fs.id
join fs_max_id on fdr.file_handle_id=fs_max_id.id, latest_fs
where fdr.file_handle_id=fs.id
    and fdr.timestamp > current_timestamp - INTERVAL '180' DAY -- downloads in the last 180 days
    and fs.snapshot_timestamp > current_timestamp - INTERVAL '180' DAY
    and fs.bucket=latest_fs.bucket and fs.key=latest_fs.key
    and fs.snapshot_timestamp=latest_fs.snapshot_timestamp
    and fs.bucket=fs_max_id.bucket and fs.key=fs_max_id.key
    and fs.id=fs_max_id.id
    and fs.status='AVAILABLE'
group by month(fdr.record_date)
order by month(fdr.record_date)
```
returns
| Month | Download volume, GB |
|---|---|
| 6 | 2031037 |
| 7 | 4553643 |
At the time of this writing we have not yet accumulated a full month's worth of data. Let's compare the current month's egress to AWS's statistics. For July 1-12, these are the egress charges from AWS Cost Explorer:
| | Cost | Data out, GB |
|---|---|---|
| Sage Organization | $2400 | 29,800 |
| AD STRIDES | $6500 | 94,670 |
| NF STRIDES | ~$0 | ~0 |
The numbers are far less than those reported by Synapse. This could be because the data did not leave the region or because the presigned URL received from Synapse (the creation of which triggered the recording of a download) was not actually used by the client.
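As a rough plausibility check on the Cost Explorer side, we can compute the implied egress rate per account; the comparison against ~$0.09/GB (the list price for the first S3 internet-egress tier) is an assumption here, not something reported by Cost Explorer:

```python
# (egress_cost_usd, data_out_gb) from the Cost Explorer table above
egress = {"Sage Organization": (2400, 29_800), "AD STRIDES": (6500, 94_670)}

for account, (cost, gb) in egress.items():
    print(account, round(cost / gb, 3), "USD/GB")  # ~0.081 and ~0.069

# Both rates sit below the ~$0.09/GB list price, consistent with mixed
# destinations (in-region traffic is free) and tiered pricing.
```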
...
15 projects accounted for 99% of the download activity. Only 6 of these projects use the Synapse default bucket:
| Project | Bucket | Download volume, GB |
|---|---|---|
| | htan-dcc-stanford | 1804241 |
| | ad-knowledge-portal-main | 846984 |
| | diverse-cohorts | 530371 |
| | mpnstwgs | 325521 |
| | | 233216 |
| | htan-dcc-vanderbilt | 146573 |
| | htan-dcc-duke | 132917 |
| | | 126349 |
| | htan-dcc-ohsu | 73147 |
| | ecmonsen-emorypipeline-lambda-055273631518-us-east-1 | 61412 |
| | | 51600 |
| | sc-237179673806-pp-hg7rd6vvvd7mk-s3bucket-eu11pjwpxxiv | 50989 |
| | | 50526 |
| | | 31063 |
| | | 26497 |
This analysis could be further refined to look just at downloads of public data.