...

In order to determine the best option, and whether it is worth deleting data at all, we collected statistics about the unlinked data and estimated the amount of data that is actually accessed (hot data) so that the various options can be compared. The following data was collected for the proddata.sagebase.org bucket (for the methodology see https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1629782065/S3+Bucket+Analysis#%5BhardBreak%5DUnlinked-and-Hot-Data ):

  • Total Size: 748,386,690,923,161 bytes (680.7 TiB)

  • Number of objects: 44,509,789

  • Hot Data Count*: 9,802,820

  • Hot Data Size*: 314,611,245,904,518 bytes (286.1 TiB)

  • Hot Data (>= 128KB) Count: 4,555,052

  • Hot Data (>= 128KB) Size: 314,472,063,061,465 bytes (286 TiB)

  • Unlinked Data Count: 8,762,805

  • Unlinked Data Size: 100,456,586,079,288 bytes (91.36 TiB)

  • Unlinked Data Count (>= 128 KB): 2,823,189

  • Unlinked Data Size (>= 128 KB): 100,408,552,794,768 bytes (91.32 TiB)

  • Monthly Unlinked Count (>= 128KB): 28,808

  • Monthly Unlinked Size (>= 128KB): 1,024,577,069,334 bytes (0.93 TiB)

* We considered the file handles downloaded from tables and/or entities from 2020 to date. This does not include other types of downloads, so we round up to 300 TiB of hot data for the estimates.

We estimated the monthly cost using the S3 pricing calculator, rounding the total amount of data (T) to 680 TiB, the hot data (H) to 300 TiB and the unlinked data (U) to 91 TiB (a back-of-the-envelope version of these calculations is sketched after the footnotes below):

  1. Leave the data in STD: $15,185

  2. Move current unlinked data (>= 128 KB) to Infrequent Access*: $14,394

    1. Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in IA

    2. Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)

  3. Eventually move unlinked data to Glacier Deep Archive*: $13,323

    1. Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in Glacier Deep Archive

    2. Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)

    3. Consider the avg of 28,808/month for lifecycle transitions to Glacier and Glacier Deep Archive

  4. Move everything to INT**: $11,775

    1. Consider 44% in INT-Standard (300H/680T)

    2. Consider 56% in INT-IA ((680T-300H)/680T)

    3. Includes the 44M objects monitoring fee

  5. Move to INT + unlinked eventually in INT-deep archive**: $10,734

    1. Assumes that unlinked data is part of the cold data

    2. Consider 44% in INT-Standard (300H/680T)

    3. Consider 43% in INT-IA ((680T - 300H - 91U)/680T)

    4. Consider 13% in INT-Deep Archive (91U/680T)

    5. Includes the 44M objects monitoring fee

    6. Assumes that we have 28,808 tags per month (GET/PUT request + tag costs)

  6. Move to INT + delete unlinked data**: $10,544

    1. Assumes that unlinked data is part of the cold data

    2. Consider 50% in INT-Standard (300H/(680T - 91U))

    3. Consider 50% in INT-IA

    4. Includes the 44M - 8M objects monitoring fee, no tags, no lifecycle transitions

* Note that it does not include the initial cost of moving 2M objects to IA

** Note that it does not include the initial fee of about $400 for moving all 44M objects to INT (through a lifecycle transition); once an object is in INT there are no fees for further transitions between its access tiers.
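
As a sanity check, the sketch below roughly reproduces options 1 and 4 from the list above using the same rounded figures. The per-GB prices and the INT monitoring fee are assumptions based on us-east-1 list prices at the time of writing; request, transition and retrieval costs are omitted for brevity.

```python
# Rough re-derivation of two of the monthly cost estimates above.
# Prices are assumptions (us-east-1, subject to change).

GB_PER_TIB = 1024  # approximating GiB ~= GB for pricing purposes

STD_TIERS = [               # S3 Standard / INT frequent-access tier ($/GB-month)
    (50 * 1024, 0.023),     # first 50 TB
    (450 * 1024, 0.022),    # next 450 TB
    (float("inf"), 0.021),  # over 500 TB
]
IA_PRICE = 0.0125           # Standard-IA and INT infrequent-access tier ($/GB-month)
MONITORING = 0.0025         # assumed INT monitoring fee per 1,000 objects

def tiered_standard_cost(gb: float) -> float:
    """Monthly cost of storing `gb` gigabytes under the tiered Standard pricing."""
    cost, remaining = 0.0, gb
    for size, price in STD_TIERS:
        used = min(remaining, size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return cost

T = 680 * GB_PER_TIB        # total data (rounded)
H = 300 * GB_PER_TIB        # hot data (rounded)
objects = 44_509_789

# Option 1: leave everything in Standard (~$15,186/month)
print(f"Option 1, all in Standard: ${tiered_standard_cost(T):,.0f}")

# Option 4: everything in INT, hot data in the frequent tier, the rest in the
# IA tier, plus the per-object monitoring fee (~$11,785/month)
int_cost = (tiered_standard_cost(H)
            + (T - H) * IA_PRICE
            + objects / 1000 * MONITORING)
print(f"Option 4, everything in INT: ${int_cost:,.0f}")
```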

It is clear that simply moving everything to INT is cost effective for our use case, where access patterns are unknown and some data, even if still linked, is rarely accessed (e.g. older projects, older versions of entities, older tables, messages etc.). On top of that, archiving unlinked data might still be worth it, even though the cost savings are not comparable to those of just using the INT storage class: on average around 1 TiB of data gets unlinked each month.
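
For reference, the lifecycle transition to INT mentioned above could be configured with a rule along the following lines. This is a minimal sketch using boto3, not the actual rollout plan: the rule ID is arbitrary and the empty filter (which applies the rule to every object) may need to be narrowed for a staged rollout.

```python
# Sketch of a lifecycle rule that transitions every object in the bucket to
# INTELLIGENT_TIERING (the one-time ~$400 transition fee noted above applies).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="proddata.sagebase.org",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "transition-all-to-int",  # arbitrary rule name
                "Status": "Enabled",
                "Filter": {},  # empty filter: applies to all objects
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```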

Of note is that part of the unlinked data consists of temporary file handles that Synapse creates internally (e.g. table queries, table CSVs for uploads etc.); the fact that each month such a high number (~30K) of file handles gets unlinked might be attributed to this. Unfortunately we do not tag this kind of data, so we do not know exactly how much of that 1 TiB is actually temporary file handles (we do know that at least 20 TiB of the total unlinked data is due to this: https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1629782065/S3+Bucket+Analysis#Temporary-File-Handles).

...