...
To determine the best option, and whether deleting data is worth it, we collected statistics about the unlinked data and estimated the amount of data that is actually accessed (hot data) so that we can compare the various options. The following data was collected for the proddata.sagebase.org bucket (for how it was gathered, refer to https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1629782065/S3+Bucket+Analysis#%5BhardBreak%5DUnlinked-and-Hot-Data ):
Total Size: 748,386,690,923,161 (680.7 TiB)
Number of objects: 44,509,789
Hot Data Count* : 9,802,820
Hot Data Size*: 314,611,245,904,518 (286.1 TiB)
Hot Data (>= 128KB) Count: 4,555,052
Hot Data (>= 128KB) Size: 314,472,063,061,465 (286 TiB)
Unlinked Data Count: 8,762,805
Unlinked Data Size: 100,456,586,079,288 (91.36 TiB)
Unlinked Data Count (>= 128 KB): 2,823,189
Unlinked Data Size (>= 128 KB): 100,408,552,794,768 (91.32 TiB)
Monthly Unlinked Count (>= 128KB): 28,808
Monthly Unlinked Size (>= 128KB): 1,024,577,069,334 (0.93 TiB)
* We considered only the file handles downloaded from tables and/or entities from 2020 until now. This does not include other types of downloads, so we round up to 300 TiB of hot data for the estimates.
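The TiB figures in parentheses follow from the raw byte counts, taking 1 TiB = 2^40 bytes. The sketch below reproduces a few of them as a sanity check; the dictionary keys and the `to_tib` helper are illustrative names, not part of any existing tooling.

```python
# Byte counts copied from the statistics above; key names are illustrative.
TIB = 2 ** 40  # bytes in one tebibyte

sizes_bytes = {
    "total": 748_386_690_923_161,
    "hot": 314_611_245_904_518,
    "unlinked": 100_456_586_079_288,
    "monthly_unlinked_128k": 1_024_577_069_334,
}


def to_tib(n_bytes: int) -> float:
    """Convert a byte count to tebibytes."""
    return n_bytes / TIB


sizes_tib = {name: to_tib(n) for name, n in sizes_bytes.items()}

for name, tib in sizes_tib.items():
    print(f"{name}: {tib:.2f} TiB")
```

The same conversion underlies the rounded inputs used for the estimates (about 680 TiB total and 91 TiB unlinked), while the hot data figure is deliberately rounded up from about 286 TiB to 300 TiB.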
We estimated the monthly cost using the S3 pricing calculator, rounding the total amount of data to 680 TiB (T), the hot data to 300 TiB (H) and the unlinked data to 91 TiB (U):
Leave the data in STD: $15,185
Move current unlinked data (>= 128 KB) to Infrequent Access (IA)*: $14,394
Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in IA
Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)
Eventually move unlinked data to Glacier Deep Archive*: $13,323
Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in Glacier Deep Archive
Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)
Consider the avg of 28,808/month for lifecycle transitions to Glacier and Glacier Deep Archive
Move everything to INT**: $11,775
Consider 44% in INT-Standard (300H/680T)
Consider 56% in INT-IA ((680T-300H)/680T)
Includes the 44M objects monitoring fee
Move to INT + unlinked eventually in INT-deep archive**: $10,734
Assumes that unlinked data is part of the cold data
Consider 44% in INT-Standard (300H/680T)
Consider 43% in INT-IA ((680T - 300H - 91U)/680T)
Consider 13% in INT-Deep Archive (91U/680T)
Includes the 44M objects monitoring fee
Assumes that we have 28,808 tags per month (GET/PUT request + tag costs)
Move to INT + delete unlinked data**: $10,544
Assumes that unlinked data is part of the cold data
Consider 50% in INT-Standard (300H/(680T - 91U))
Consider 50% in INT-IA
Includes the 44M - 8M objects monitoring fee, no tags, no lifecycle transitions
* Note that it does not include the initial cost of moving 2M objects to IA
** Note that this does not include the initial fee of about $400 for moving all the 44M objects to INT (through a lifecycle transition). Once an object is in INT there are no fees for other transitions.
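The tier percentages used in the INT options come directly from the rounded T, H and U values. As a minimal sketch, the arithmetic can be reproduced as follows; it only derives the storage shares and object counts fed into the calculator, and does not attempt to model S3 prices. Variable and key names are illustrative.

```python
# Rounded inputs from the text, in TiB.
T, H, U = 680, 300, 91  # total, hot, unlinked

# Object counts from the statistics above.
TOTAL_OBJECTS = 44_509_789
UNLINKED_OBJECTS = 8_762_805

# Move everything to INT: hot data in INT-Standard, the rest in INT-IA.
int_only = {
    "INT-Standard": H / T,
    "INT-IA": (T - H) / T,
}

# INT with unlinked data eventually in INT-Deep Archive.
int_archive = {
    "INT-Standard": H / T,
    "INT-IA": (T - H - U) / T,
    "INT-Deep Archive": U / T,
}

# INT after deleting unlinked data (assumed to be part of the cold data).
int_delete = {
    "INT-Standard": H / (T - U),
    "INT-IA": (T - H - U) / (T - U),
}

# Objects still subject to the monitoring fee once unlinked data is deleted.
monitored_after_delete = TOTAL_OBJECTS - UNLINKED_OBJECTS

for label, split in [("INT only", int_only),
                     ("INT + archive", int_archive),
                     ("INT + delete", int_delete)]:
    shares = ", ".join(f"{tier}: {frac:.1%}" for tier, frac in split.items())
    print(f"{label}: {shares}")
print(f"Monitored objects after delete: {monitored_after_delete:,}")
```

With these rounded inputs the delete scenario comes out at roughly 51% / 49%, which the estimate above rounds to an even 50/50 split.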
It is clear that just moving everything to INT is cost effective for our use case, where access patterns are unknown: we have data that, even if still linked, is rarely accessed (e.g. older projects, older versions of entities, older tables, messages, etc.). On top of that, archiving unlinked data might be worth it, even though the cost savings are not comparable to those from just using the INT storage class: on average, around 1 TiB of data gets unlinked each month.
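To make the comparison concrete, the monthly estimates listed above can be put side by side with the cost of leaving everything in standard storage. This is a small sketch using only the dollar figures from this page; the option labels are shortened for readability.

```python
# Estimated monthly costs in USD, copied from the options above.
monthly_cost = {
    "Leave in STD": 15_185,
    "Unlinked to IA": 14_394,
    "Unlinked to Glacier Deep Archive": 13_323,
    "Everything to INT": 11_775,
    "INT + unlinked in INT-Deep Archive": 10_734,
    "INT + delete unlinked": 10_544,
}

baseline = monthly_cost["Leave in STD"]

# Monthly savings of each option relative to leaving the data in STD.
savings = {name: baseline - cost for name, cost in monthly_cost.items()}

for name, saved in savings.items():
    print(f"{name}: ${saved:,}/month saved ({saved / baseline:.1%})")

# Extra saving from archiving unlinked data once everything is already in INT.
archive_extra = (monthly_cost["Everything to INT"]
                 - monthly_cost["INT + unlinked in INT-Deep Archive"])
print(f"Archiving on top of INT: ${archive_extra:,}/month")
```

Under these estimates, moving to INT alone saves $3,410 per month, while archiving the unlinked data adds a further $1,041 per month.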
Note that part of the unlinked data consists of temporary file handles that Synapse creates internally (e.g. table queries, table CSVs for uploads, etc.), which might explain the high number (about 30K) of file handles that are unlinked each month. Unfortunately we do not tag this kind of data, so we do not know exactly how much of that 1 TiB is actually temporary file handles (we do know that at least 20 TiB of the total unlinked data is due to this: https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1629782065/S3+Bucket+Analysis#Temporary-File-Handles).
...