Counting the multipart uploads that are still in the UPLOADING state:

Code Block
languagesql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'

Result: 6353

Upon further analysis of the backend code we discovered a bug: a multipart upload is initiated when we create or update a wiki page using the first version of the wiki API, which submitted the markdown as a string, and the multipart upload is never completed in such cases: https://sagebionetworks.jira.com/browse/PLFM-6523. Additionally, the multipart upload tracking was implemented relatively recently; the previous implementation might have left behind other unfinished multipart uploads.

Summary of Results

Working on a snapshot of prod 332 (11/05/2020) for the production bucket we have the following numbers for file handles:

|                      | Count      | Count in S3 (Unique Keys) | Size in DB | Size in S3 (Unique Keys) | Description                                                            |
| File Handles         | 47,106,657 | 39,426,647                | ~679 TB    | ~633 TB                  | File handles that point to the production bucket                       |
| Linked File Entities | 4,749,667  | 4,659,729                 | ~589.7 TB  | ~560 TB                  | Entities that point to file handles in the production bucket           |
| Linked Table Rows    | 13,711,427 | 12,004,739                | ~4.1 TB    | ~3.4 TB                  | File handles referenced in tables that point to the production bucket  |
| Other Links          | ~1,630,418 | ~1,592,049                | ~0.6 TB    | ~0.6 TB                  | Other types of linked file handles that point to the production bucket |
| Temporary Handles    | ~4,206,159 | ~4,206,159                | ~20.7 TB   | ~20.7 TB                 | File handles that are not linked, mostly one-time use                  |

Additionally we have the following figures for S3:

|                | Count       | Size     | Description                                                                 |
| S3 Objects     | ~41,625,517 | ~640 TB  | The objects in S3 from the inventory                                        |
| No S3 Objects  | 101         | ~10.4 GB | Objects that are referenced by existing file handles but do not exist in S3 |
| No File Handle | 2,198,971   | ~7.5 TB  | Objects that do not have any file handle                                    |

In summary, out of the 47M file handles that point to the production bucket, we can account for about 24M (~50%). Out of 633 TB of indexed data, we can account for about 585 TB (92%). The amount of data that can potentially be archived amounts to about 48 TB, referenced by around 23M file handles. Note that the temporary file handles can potentially be archived as well, removing an additional 20.7 TB from the bucket.

We can make a rough estimate of the amount of data that has been uploaded in prod but not yet completed:

Code Block
languagesql
WITH UPLOADING AS (
	SELECT U.ID, U.PART_SIZE, COUNT(*) AS PARTS FROM MULTIPART_UPLOAD U JOIN MULTIPART_UPLOAD_PART_STATE P ON U.ID = P.UPLOAD_ID
	WHERE U.STATE = 'UPLOADING' AND U.BUCKET = 'proddata.sagebase.org'
	GROUP BY U.ID
),
UPLOADING_SIZE AS (
	SELECT (PART_SIZE * PARTS) AS SIZE FROM UPLOADING
)
SELECT COUNT(*), SUM(SIZE) FROM UPLOADING_SIZE

| Count | Size                        |
| 3037  | 2,649,792,499,252 (~2.4 TB) |

So about 2.4 TB of data could potentially be freed just by removing the unfinished multipart uploads.
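The arithmetic in this section can be sanity-checked with a few lines. The per-category figures come from the tables above; attributing the ~24M accounted handles to the sum of the four linked categories is an assumption of this sketch, not a statement from the original analysis:

```python
# Sanity check of the figures in this section, using the S3 unique-key
# columns of the file handle table above.
linked_counts = {
    "file_entities": 4_749_667,
    "table_rows": 13_711_427,
    "other_links": 1_630_418,
    "temporary": 4_206_159,
}
linked_sizes_tb = {
    "file_entities": 560.0,
    "table_rows": 3.4,
    "other_links": 0.6,
    "temporary": 20.7,
}

accounted_handles = sum(linked_counts.values())  # ~24.3M of 47.1M (~50%)
accounted_tb = sum(linked_sizes_tb.values())     # ~584.7 of ~633 TB (92%)
archivable_tb = 633.0 - accounted_tb             # ~48 TB

# The unfinished multipart upload estimate is reported in bytes;
# 2,649,792,499,252 bytes is ~2.4 TiB (1 TiB = 1024**4 bytes).
uploading_tib = 2_649_792_499_252 / 1024**4

print(f"accounted: {accounted_handles:,} handles, {accounted_tb:.1f} TB")
print(f"archivable: {archivable_tb:.1f} TB, unfinished uploads: {uploading_tib:.2f} TiB")
```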



Unlinked and Hot Data

As of May 2021 we implemented the discovery of file handle associations and the detection of unlinked file handles. In order to decide how to proceed with a strategy to archive unlinked data (see https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1620508673/Synapse+S3+Storage+Maintenance#Un-linked-File-Handle-Archival ) we needed to estimate some figures from the data we collected. In the following we provide the steps taken to collect this data.

Unlinked Data

We wanted to know how much unlinked data exists that has no copies still linked; we proceeded as follows:

  1. We executed the unlinked file handle detection on a migrated staging version (Stack 357)

  2. A snapshot of the DB was created in AWS

  3. Created a table to hold the unlinked file handles found for the proddata.sagebase.org bucket, including for each key the count of copies that are still linked and the count that are unlinked:

    Code Block
    languagesql
    CREATE TABLE `FILES_UNLINKED` (
        `ID` BIGINT(20) NOT NULL,
        `KEY` VARCHAR(700) NOT NULL COLLATE 'utf8mb4_0900_ai_ci',
        `CONTENT_SIZE` BIGINT(20) NOT NULL,
        `CREATED_ON` TIMESTAMP NOT NULL,
        `LINKED_COUNT` BIGINT(20) NOT NULL,
        `UNLINKED_COUNT` BIGINT(20) NOT NULL
    )
  4. Imported in the FILES_UNLINKED table all the file handles that are unlinked in proddata.sagebase.org:

    Code Block
    languagesql
    INSERT INTO FILES_UNLINKED(ID, `KEY`, CONTENT_SIZE, CREATED_ON, LINKED_COUNT, UNLINKED_COUNT)
      SELECT U.ID, U.KEY, MAX(U.CONTENT_SIZE) AS CONTENT_SIZE, MAX(U.CREATED_ON) AS CREATED_ON, SUM(IF(F.`STATUS` = 'AVAILABLE', 1, 0)) AS LINKED_COUNT, SUM(IF(F.`STATUS` = 'UNLINKED', 1, 0)) AS UNLINKED_COUNT
          FROM FILES U JOIN FILES F WHERE U.UPDATED_ON >= NOW() - INTERVAL 5 DAY AND U.BUCKET_NAME='proddata.sagebase.org' AND U.STATUS = 'UNLINKED' 
      AND U.BUCKET_NAME = F.BUCKET_NAME AND U.KEY = F.`KEY`
      GROUP BY U.ID, U.KEY
  5. Computed the unlinked count and size:

    Code Block
    languagesql
     SELECT COUNT(*), SUM(S) FROM (
        SELECT U.`KEY`, MAX(U.CONTENT_SIZE) AS S FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 GROUP BY U.`KEY`
    ) AS T
  6. Computed the unlinked count and size for keys >= 128KB:

    Code Block
    languagesql
    SELECT COUNT(*), SUM(S) FROM (
        SELECT U.`KEY`, MAX(U.CONTENT_SIZE) AS S FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 AND U.CONTENT_SIZE >= 131072 GROUP BY U.`KEY`
    ) AS T
  7. Computed the monthly average count and size of unlinked data:

    Code Block
    languagesql
    SELECT AVG(C), AVG(CONTENT_SIZE) FROM (
        SELECT YEAR(CREATED_ON) AS Y, MONTH(CREATED_ON) AS M, COUNT(*) C, SUM(CONTENT_SIZE) CONTENT_SIZE FROM (
            SELECT MAX(U.CREATED_ON) AS CREATED_ON , MAX(U.CONTENT_SIZE) AS CONTENT_SIZE FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 AND U.CONTENT_SIZE >= 131072 GROUP BY U.KEY
        ) AS T GROUP BY Y, M ORDER BY Y, M DESC
    ) AS W
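The essence of steps 4–6 — group copies by key, keep only keys where every copy is UNLINKED, then count and sum once per unique key — can be sketched in plain Python over a few illustrative records (the data here is made up for illustration):

```python
from collections import defaultdict

# Illustrative (key, content_size, status) records for file handle copies
# sharing keys in the bucket, mirroring steps 4-6 above.
records = [
    ("a/1.bin", 200_000, "UNLINKED"),
    ("a/1.bin", 200_000, "AVAILABLE"),  # a linked copy exists -> not counted
    ("b/2.bin", 500_000, "UNLINKED"),
    ("b/2.bin", 500_000, "UNLINKED"),   # all copies unlinked -> counted once
    ("c/3.bin", 50_000, "UNLINKED"),    # below the 128 KiB threshold
]

by_key = defaultdict(list)
for key, size, status in records:
    by_key[key].append((size, status))

# Keys whose copies are all UNLINKED (LINKED_COUNT = 0 in the SQL above),
# keeping the max size per key as in the queries.
fully_unlinked = {
    key: max(size for size, _ in copies)
    for key, copies in by_key.items()
    if all(status == "UNLINKED" for _, status in copies)
}

count = len(fully_unlinked)
total = sum(fully_unlinked.values())
# 131_072 bytes = 128 KiB threshold from step 6
count_128k = sum(1 for s in fully_unlinked.values() if s >= 131_072)
total_128k = sum(s for s in fully_unlinked.values() if s >= 131_072)
print(count, total, count_128k, total_128k)  # 2 550000 1 500000
```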

These are the results:

  • Unlinked Data Count: 8,762,805

  • Unlinked Data Size: 100,456,586,079,288 (91.36 TiB)

  • Unlinked Data Count (>= 128 KB): 2,823,189

  • Unlinked Data Size (>= 128 KB): 100,408,552,794,768 (91.32 TiB)

  • Monthly Unlinked Count (>= 128KB): 28,808

  • Monthly Unlinked Size (>= 128KB): 1,024,577,069,334 (0.93 TiB)
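A quick reading of these numbers (an observation on the figures above, not part of the original analysis): the 128 KiB threshold discards about two thirds of the unique keys while keeping nearly all of the unlinked bytes:

```python
# Unlinked-data figures from the results above; sizes are byte counts,
# reported in binary TiB (1 TiB = 1024**4 bytes).
TIB = 1024**4
total_count, total_bytes = 8_762_805, 100_456_586_079_288
large_count, large_bytes = 2_823_189, 100_408_552_794_768  # >= 128 KiB

print(f"{total_bytes / TIB:.2f} TiB total unlinked")
print(f">= 128 KiB: {large_count / total_count:.1%} of keys, "
      f"{large_bytes / total_bytes:.2%} of bytes")
```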

Hot Data

Additionally we wanted a rough estimate of the amount of hot data in our bucket. Unfortunately we never enabled bucket analytics, so we have to work with the data that we collect internally. In particular we collect the downloads from entities and tables and store the records in S3 in parquet format; we can query this data with Athena, joining on the file handle data that we now export to S3 to get the count and size (we computed this for the years 2020 and 2021):

Code Block
languagesql
WITH 
    F AS (SELECT (cast(id as varchar)) as id, MAX(contentsize) AS size FROM prod357filehandledatarecords R WHERE R.bucket = 'proddata.sagebase.org' GROUP BY id),
    D AS (SELECT DISTINCT filehandleid FROM prod357filedownloadsrecords R WHERE R.year IN ('2020', '2021'))
SELECT COUNT(distinct D.filehandleid), SUM(F.size) FROM D JOIN F ON D.filehandleid = F.id

Note that we only consider downloads from our bucket. Additionally we wanted to know how much of this is for files of at least 128 KB:

Code Block
languagesql
 WITH 
    F AS (SELECT (cast(id as varchar)) as id, MAX(contentsize) AS size FROM prod357filehandledatarecords R WHERE R.bucket = 'proddata.sagebase.org' GROUP BY id),
    D AS (SELECT DISTINCT filehandleid FROM prod357filedownloadsrecords R WHERE R.year IN ('2020', '2021'))
SELECT COUNT(distinct D.filehandleid), SUM(F.size) FROM D JOIN F ON D.filehandleid = F.id AND F.size >= 131072

The results are as follows:

  • Hot Data Count*: 9,802,820

  • Hot Data Size*: 314,611,245,904,518 (286.1 TiB)

  • Hot Data (>= 128KB) Count: 4,555,052

  • Hot Data (>= 128KB) Size: 314,472,063,061,465 (286 TiB)
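The same pattern is visible in the hot data (again an observation on the figures above, not a claim from the original analysis): files of at least 128 KiB are under half of the distinct hot file handles but essentially all of the hot bytes:

```python
# Hot-data figures from the results above.
total_hot_count, total_hot_bytes = 9_802_820, 314_611_245_904_518
large_hot_count, large_hot_bytes = 4_555_052, 314_472_063_061_465  # >= 128 KiB

byte_share = large_hot_bytes / total_hot_bytes
count_share = large_hot_count / total_hot_count
print(f">= 128 KiB: {count_share:.1%} of hot handles, {byte_share:.2%} of hot bytes")
```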