A count of the multipart uploads recorded in the database that are still in the UPLOADING state, i.e. started but never completed:

Code Block
languagesql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'

Result: 6353
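
For context, the unfinished uploads can be broken down by target bucket with a query along these lines (a hypothetical follow-up; it only uses the MULTIPART_UPLOAD columns that already appear in the other queries on this page):

Code Block
languagesql
-- Hypothetical breakdown of the unfinished uploads by target bucket
SELECT U.BUCKET, COUNT(*) AS UPLOADS
FROM MULTIPART_UPLOAD U
WHERE U.STATE = 'UPLOADING'
GROUP BY U.BUCKET
ORDER BY UPLOADS DESC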

We can get a rough estimate of the amount of data that has been uploaded in prod but not yet completed:

Code Block
languagesql
WITH UPLOADING AS (
	SELECT U.ID, U.PART_SIZE, COUNT(*) AS PARTS FROM MULTIPART_UPLOAD U JOIN MULTIPART_UPLOAD_PART_STATE P ON U.ID = P.UPLOAD_ID
	WHERE U.STATE = 'UPLOADING' AND U.BUCKET = 'proddata.sagebase.org'
	GROUP BY U.ID
),
UPLOADING_SIZE AS (
	SELECT (PART_SIZE * PARTS) AS SIZE FROM UPLOADING
)
SELECT COUNT(*), SUM(SIZE) FROM UPLOADING_SIZE

| Count | Size |
| ----- | ---- |
| 3037 | 2649792499252 (2.4 TB) |

So we have about 2.4 TB of data that could potentially be freed just by removing the unfinished multipart uploads.

Upon further analysis of the backend code we discovered a bug: a multipart upload is initiated when we create or update a wiki page using the first version of the wiki API, which submits the markdown as a string, and in those cases the multipart upload is never completed: https://sagebionetworks.jira.com/browse/PLFM-6523. Additionally, the new multipart upload implementation that tracks uploads was introduced relatively recently; the previous implementation might have left behind other unfinished multipart uploads.
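
To gauge how many of these stalled uploads never had a single part uploaded, a check along the following lines could be run. This is a hypothetical query that only uses the two tables referenced above and assumes a row appears in MULTIPART_UPLOAD_PART_STATE only once a part has actually been uploaded; it does not by itself identify the wiki-initiated uploads:

Code Block
languagesql
-- Hypothetical: unfinished uploads with no recorded parts
-- (assumes MULTIPART_UPLOAD_PART_STATE only has rows for uploaded parts)
SELECT COUNT(*)
FROM MULTIPART_UPLOAD U
LEFT JOIN MULTIPART_UPLOAD_PART_STATE P ON U.ID = P.UPLOAD_ID
WHERE U.STATE = 'UPLOADING' AND P.UPLOAD_ID IS NULL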

Initial Summary of Results

Working on a snapshot of prod 332 (11/05/2020), we have the following numbers for file handles in the production bucket:

|  | Count | Count in S3 (Unique Keys) | Size in DB | Size in S3 (Unique Keys) | Description |
| --- | --- | --- | --- | --- | --- |
| File Handles | 47,106,657 | 39,426,647 | ~679 TB | ~633 TB | File handles that point to the production bucket |
| Linked File Entities | 4,749,667 | 4,659,729 | ~589.7 TB | ~560 TB | Entities that point to file handles in the production bucket |
| Linked Table Rows | 13,711,427 | 12,004,739 | ~4.1 TB | ~3.4 TB | File handles referenced in tables that point to the production bucket |
| Other Links | ~1,630,418 | ~1,592,049 | ~0.6 TB | ~0.6 TB | Other types of linked file handles that point to the production bucket |
| Temporary Handles | ~4,206,159 | ~4,206,159 | ~20.7 TB | ~20.7 TB | File handles that are not linked, mostly one-time use |

Additionally we have the following figures for S3:

|  | Count | Size | Description |
| --- | --- | --- | --- |
| S3 Objects | ~41,625,517 | ~640 TB | The objects in S3 from the inventory |
| No S3 Objects | 101 | ~10.4 GB | Objects that are referenced by file handles but do not exist in S3 |
| No File Handle | 2,198,971 | ~7.5 TB | Objects that do not have any file handle |

In summary, out of the 47M file handles that point to the production bucket we can account for about 24M (~50%). Out of the 633 TB of indexed data we can account for about 585 TB (92%). The amount of data that can potentially be archived is therefore about 48 TB, referenced by around 23M file handles. Note that the temporary file handles can potentially be archived as well, removing an additional 20.7 TB from the bucket.
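
For reference, these figures follow from the table above (a rough back-of-the-envelope check using the rounded values from the table):

\begin{aligned}
\text{Accounted size} &\approx 560 + 3.4 + 0.6 + 20.7 \approx 585\ \text{TB} \\
\text{Potentially archivable size} &\approx 633 - 585 \approx 48\ \text{TB} \\
\text{Accounted file handles} &\approx 4.75\text{M} + 13.7\text{M} + 1.6\text{M} + 4.2\text{M} \approx 24\text{M} \\
\text{Remaining file handles} &\approx 47.1\text{M} - 24\text{M} \approx 23\text{M}
\end{aligned}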


Unlinked and Hot Data

As of May 2021 we implemented the discovery of associations and the unlinked file handle detection. In order to decide how to proceed with a strategy to archive unlinked data (see https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1620508673/Synapse+S3+Storage+Maintenance#Un-linked-File-Handle-Archival ) we need some estimates from the data we collected. Below we describe the steps taken to collect this data.

Unlinked Data

We wanted to know how much of the unlinked data does not have copies that are still linked; we proceeded as follows:

  1. We executed the unlinked file handle detection on a migrated staging version (Stack 357)

  2. A snapshot of the DB was created in AWS

  3. Created a table to hold the unlinked file handles found for the proddata.sagebase.org bucket, including for each key the count of file handles that are still linked and the count that are unlinked:

    Code Block
    languagesql
    CREATE TABLE `FILES_UNLINKED` (
        `ID` BIGINT(20) NOT NULL,
        `KEY` VARCHAR(700) NOT NULL COLLATE 'utf8mb4_0900_ai_ci',
        `CONTENT_SIZE` BIGINT(20) NOT NULL,
        `CREATED_ON` TIMESTAMP NOT NULL,
        `LINKED_COUNT` BIGINT(20) NOT NULL,
        `UNLINKED_COUNT` BIGINT(20) NOT NULL
    )
  4. Imported into the FILES_UNLINKED table all the file handles that are unlinked in proddata.sagebase.org:

    Code Block
    languagesql
    INSERT INTO FILES_UNLINKED(ID, `KEY`, CONTENT_SIZE, CREATED_ON, LINKED_COUNT, UNLINKED_COUNT)
      SELECT U.ID, U.KEY, MAX(U.CONTENT_SIZE) AS CONTENT_SIZE, MAX(U.CREATED_ON) AS CREATED_ON, SUM(IF(F.`STATUS` = 'AVAILABLE', 1, 0)) AS LINKED_COUNT, SUM(IF(F.`STATUS` = 'UNLINKED', 1, 0)) AS UNLINKED_COUNT
          FROM FILES U JOIN FILES F WHERE U.UPDATED_ON >= NOW() - INTERVAL 5 DAY AND U.BUCKET_NAME='proddata.sagebase.org' AND U.STATUS = 'UNLINKED' 
      AND U.BUCKET_NAME = F.BUCKET_NAME AND U.KEY = F.`KEY`
      GROUP BY U.ID, U.KEY
  5. Computed the unlinked count and size:

    Code Block
    languagesql
     SELECT COUNT(*), SUM(S) FROM (
        SELECT U.`KEY`, MAX(U.CONTENT_SIZE) AS S FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 GROUP BY U.`KEY`
    ) AS T
  6. Computed the unlinked count and size for keys >= 128KB:

    Code Block
    languagesql
    SELECT COUNT(*), SUM(S) FROM (
        SELECT U.`KEY`, MAX(U.CONTENT_SIZE) AS S FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 AND U.CONTENT_SIZE >= 131072 GROUP BY U.`KEY`
    ) AS T
  7. Computed the monthly average count and size of unlinked data:

    Code Block
    languagesql
    SELECT AVG(C), AVG(CONTENT_SIZE) FROM (
        SELECT YEAR(CREATED_ON) AS Y, MONTH(CREATED_ON) AS M, COUNT(*) C, SUM(CONTENT_SIZE) CONTENT_SIZE FROM (
            SELECT MAX(U.CREATED_ON) AS CREATED_ON , MAX(U.CONTENT_SIZE) AS CONTENT_SIZE FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 AND U.CONTENT_SIZE >= 131072 GROUP BY U.KEY
        ) AS T GROUP BY Y, M ORDER BY Y, M DESC
    ) AS W

These are the results:

  • Unlinked Data Count: 8,762,805

  • Unlinked Data Size: 100,456,586,079,288 (91.36 TiB)

  • Unlinked Data Count (>= 128 KB): 2,823,189

  • Unlinked Data Size (>= 128 KB): 100,408,552,794,768 (91.32 TiB)

  • Monthly Unlinked Count (>= 128KB): 28,808

  • Monthly Unlinked Size (>= 128KB): 1,024,577,069,334 (0.93 TiB)

Hot Data

Additionally we wanted a rough estimate of the amount of hot data in our bucket. Unfortunately we never enabled bucket analytics, so we have to work with the data that we collect internally. In particular, we collect the downloads of entities and tables and store the records in S3 in Parquet format; we can query this data with Athena, joining on the file handle data that we now export to S3, to get the count and size (computed for the years 2020 and 2021):

Code Block
languagesql
WITH 
    F AS (SELECT (cast(id as varchar)) as id, MAX(contentsize) AS size FROM prod357filehandledatarecords R WHERE R.bucket = 'proddata.sagebase.org' GROUP BY id),
    D AS (SELECT DISTINCT filehandleid FROM prod357filedownloadsrecords R WHERE R.year IN ('2020', '2021'))
SELECT COUNT(distinct D.filehandleid), SUM(F.size) FROM D JOIN F ON D.filehandleid = F.id

Note that we only consider downloads in our bucket. Additionally, we wanted to know how much of this is for files that are at least 128 KB:

Code Block
languagesql
 WITH 
    F AS (SELECT (cast(id as varchar)) as id, MAX(contentsize) AS size FROM prod357filehandledatarecords R WHERE R.bucket = 'proddata.sagebase.org' GROUP BY id),
    D AS (SELECT DISTINCT filehandleid FROM prod357filedownloadsrecords R WHERE R.year IN ('2020', '2021'))
SELECT COUNT(distinct D.filehandleid), SUM(F.size) FROM D JOIN F ON D.filehandleid = F.id AND F.size >= 131072

The results are as follows:

  • Hot Data Count*: 9,802,820

  • Hot Data Size*: 314,611,245,904,518 (286.1 TiB)

  • Hot Data (>= 128KB) Count: 4,555,052

  • Hot Data (>= 128KB) Size: 314,472,063,061,465 (286 TiB)