...
Code Block language sql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'
Result: 6353
Upon further analysis of the backend code we discovered a bug where a multipart upload is initiated when we create or update a wiki page through the first version of the wiki API, which submitted the markdown as a string; the multipart upload is never completed in such cases: https://sagebionetworks.jira.com/browse/PLFM-6523. Additionally, the new multipart upload implementation that tracks the uploads was introduced relatively recently, and the previous implementation might have left behind other unfinished multipart uploads.
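To see whether these unfinished uploads keep accumulating or mostly date back to the older implementation, a breakdown by start date could be computed along the following lines (a sketch only, assuming the MULTIPART_UPLOAD table records when each upload was started in a STARTED_ON column):
Code Block language sql
-- Sketch (not part of the original analysis): count the unfinished uploads by the month they were started.
-- STARTED_ON is an assumed column on MULTIPART_UPLOAD.
SELECT YEAR(U.STARTED_ON) AS Y, MONTH(U.STARTED_ON) AS M, COUNT(*) AS UPLOADS
FROM MULTIPART_UPLOAD U
WHERE U.STATE = 'UPLOADING'
GROUP BY Y, M
ORDER BY Y, M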
We can have a rough estimate of the amount of data that has been uploaded in prod but not yet completed:
Code Block language sql
WITH UPLOADING AS (
  SELECT U.ID, U.PART_SIZE, COUNT(*) AS PARTS FROM MULTIPART_UPLOAD U JOIN MULTIPART_UPLOAD_PART_STATE P ON U.ID = P.UPLOAD_ID
  WHERE U.STATE = 'UPLOADING' AND U.BUCKET = 'proddata.sagebase.org'
  GROUP BY U.ID
),
UPLOADING_SIZE AS (
  SELECT (PART_SIZE * PARTS) AS SIZE FROM UPLOADING
)
SELECT COUNT(*), SUM(SIZE) FROM UPLOADING_SIZE
Count | Size |
---|---|
3037 | 2649792499252 (2.4TB) |
So we have about 2.4 TB of data that could potentially be freed just by removing the unfinished multipart uploads.
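If we decide to clean these up, the candidate uploads can be listed directly from the same table and then aborted through the S3 AbortMultipartUpload API; a minimal sketch, assuming the table exposes the S3 key, the upload token and a last-updated timestamp in `KEY`, UPLOAD_TOKEN and UPDATED_ON columns:
Code Block language sql
-- Sketch: unfinished multipart uploads in the production bucket that have not been touched
-- for more than 30 days (KEY, UPLOAD_TOKEN and UPDATED_ON are assumed columns); each row
-- would then be aborted with the corresponding S3 API call.
SELECT U.ID, U.BUCKET, U.`KEY`, U.UPLOAD_TOKEN
FROM MULTIPART_UPLOAD U
WHERE U.STATE = 'UPLOADING'
AND U.BUCKET = 'proddata.sagebase.org'
AND U.UPDATED_ON < NOW() - INTERVAL 30 DAY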
Initial Summary of Results
Working on a snapshot of prod 332 (11/05/2020), we have the following numbers for file handles in the production bucket:
 | Count | Count in S3 (Unique Keys) | Size in DB | Size in S3 (Unique Keys) | Description |
---|---|---|---|---|---|
File Handles | 47,106,657 | 39,426,647 | ~679 TB | ~633 TB | File handles that point to the production bucket |
Linked File Entities | 4,749,667 | 4,659,729 | ~589.7 TB | ~560 TB | Entities that point to file handles in the production bucket |
Linked Table Rows | 13,711,427 | 12,004,739 | ~4.1 TB | ~3.4 TB | File handles referenced in tables that point to the production bucket |
Other Links | ~1,630,418 | ~1,592,049 | ~0.6 TB | ~0.6 TB | Other type of linked file handles that point to the production bucket |
Temporary Handles | ~4,206,159 | ~4,206,159 | ~20.7 TB | ~20.7 TB | File handles that are not linked, and mostly one time use |
Additionally we have the following figures for S3:
 | Count | Size | Description |
---|---|---|---|
S3 Objects | ~41,625,517 | ~640 TB | The objects in S3 from the inventory |
No S3 Objects | 101 | ~10.4 GB | Objects that are referenced by file handles but do not exist in S3 |
No File Handle | 2,198,971 | ~7.5 TB | Objects that do not have any file handle |
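The last two rows are the result of comparing the file handle keys against the S3 inventory; a comparison of this kind could be sketched in Athena roughly as follows (prod_s3_inventory is a hypothetical name for the inventory table, and the file handle export is assumed to have the same shape as the prod357filehandledatarecords table used later for the hot data estimate):
Code Block language sql
-- Sketch: objects present in the S3 inventory that no file handle points to.
-- prod_s3_inventory is an assumed Athena table backed by the bucket inventory; the file
-- handle export is assumed to expose the bucket and key of every file handle.
SELECT COUNT(*) AS objects, SUM(I.size) AS total_size
FROM prod_s3_inventory I
LEFT JOIN prod357filehandledatarecords F
  ON F.bucket = 'proddata.sagebase.org' AND F."key" = I."key"
WHERE F."key" IS NULL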
In summary, out of the 47M file handles that point to the production bucket we can account for about 24M (~50%). Out of the 633 TB of indexed data, we can account for about 585 TB (92%). The amount of data that can potentially be archived amounts to about 48 TB, referenced by around 23M file handles. Note that the temporary file handles can potentially be archived as well, removing an additional 20.7 TB from the bucket.
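To make the accounting explicit: summing the "Size in S3" column for the accounted-for categories gives roughly 560 + 3.4 + 0.6 + 20.7 ≈ 585 TB of the ~633 TB of unique keys, leaving the ~48 TB quoted above, while summing the "Count" column (4,749,667 + 13,711,427 + ~1,630,418 + ~4,206,159) gives the ~24M accounted-for file handles out of 47M.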
Unlinked and Hot Data
As of May 2021 we implemented the discovery of associations and the unlinked file handle detection. In order to decide how to proceed with a strategy to archive unlinked data (see https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1620508673/Synapse+S3+Storage+Maintenance#Un-linked-File-Handle-Archival) we need some estimates based on the data we collected. In the following we provide the steps taken to collect this data.
Unlinked Data
We wanted to know how much of the unlinked data has no copies that are still linked; we proceeded as follows:
We executed the unlinked file handle detection on a migrated staging version (Stack 357)
A snapshot of the DB was created in AWS
Created a table to hold the unlinked file handles found for the proddata.sagebase.org bucket, including for each one the count of file handles with the same key that are still linked and the count that are unlinked:
Code Block language sql
CREATE TABLE `FILES_UNLINKED` (
  `ID` BIGINT(20) NOT NULL,
  `KEY` VARCHAR(700) NOT NULL COLLATE 'utf8mb4_0900_ai_ci',
  `CONTENT_SIZE` BIGINT(20) NOT NULL,
  `CREATED_ON` TIMESTAMP NOT NULL,
  `LINKED_COUNT` BIGINT(20) NOT NULL,
  `UNLINKED_COUNT` BIGINT(20) NOT NULL
)
Imported into the FILES_UNLINKED table all the file handles that are unlinked in proddata.sagebase.org:
Code Block language sql
INSERT INTO FILES_UNLINKED(ID, `KEY`, CONTENT_SIZE, CREATED_ON, LINKED_COUNT, UNLINKED_COUNT)
SELECT U.ID, U.KEY, MAX(U.CONTENT_SIZE) AS CONTENT_SIZE, MAX(U.CREATED_ON) AS CREATED_ON,
  SUM(IF(F.`STATUS` = 'AVAILABLE', 1, 0)) AS LINKED_COUNT,
  SUM(IF(F.`STATUS` = 'UNLINKED', 1, 0)) AS UNLINKED_COUNT
FROM FILES U JOIN FILES F
WHERE U.UPDATED_ON >= NOW() - INTERVAL 5 DAY
  AND U.BUCKET_NAME = 'proddata.sagebase.org'
  AND U.STATUS = 'UNLINKED'
  AND U.BUCKET_NAME = F.BUCKET_NAME
  AND U.KEY = F.`KEY`
GROUP BY U.ID, U.KEY
Computed the unlinked count and size:
Code Block language sql
SELECT COUNT(*), SUM(S) FROM (
  SELECT U.`KEY`, MAX(U.CONTENT_SIZE) AS S FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 GROUP BY U.`KEY`
) AS T
Computed the unlinked count and size for keys >= 128KB:
Code Block language sql
SELECT COUNT(*), SUM(S) FROM (
  SELECT U.`KEY`, MAX(U.CONTENT_SIZE) AS S FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 AND U.CONTENT_SIZE >= 131072 GROUP BY U.`KEY`
) AS T
Computed the monthly average count and size of unlinked data (for keys >= 128 KB):
Code Block language sql
SELECT AVG(C), AVG(CONTENT_SIZE) FROM (
  SELECT YEAR(CREATED_ON) AS Y, MONTH(CREATED_ON) AS M, COUNT(*) C, SUM(CONTENT_SIZE) CONTENT_SIZE FROM (
    SELECT MAX(U.CREATED_ON) AS CREATED_ON, MAX(U.CONTENT_SIZE) AS CONTENT_SIZE FROM FILES_UNLINKED U
    WHERE U.LINKED_COUNT = 0 AND U.CONTENT_SIZE >= 131072 GROUP BY U.KEY
  ) AS T GROUP BY Y, M ORDER BY Y, M DESC
) AS W
These are the results:
Unlinked Data Count: 8,762,805
Unlinked Data Size: 100,456,586,079,288 (91.36 TiB)
Unlinked Data Count (>= 128 KB): 2,823,189
Unlinked Data Size (>= 128 KB): 100,408,552,794,768 (91.32 TiB)
Monthly Unlinked Count (>= 128KB): 28,808
Monthly Unlinked Size (>= 128KB): 1,024,577,069,334 (0.93 TiB)
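Note that almost all of the unlinked size comes from keys of at least 128 KB: the roughly 5.9M unlinked keys below that threshold account for only about 0.04 TiB in total (91.36 − 91.32 TiB), so an archival strategy restricted to keys of at least 128 KB gives up essentially none of the potential savings.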
Hot Data
Additionally we wanted to have a rough estimate of the amount of hot data in our bucket. Unfortunately we never enabled bucket analytics, so we have to work with the data that we collect internally. In particular, we collect the download records for entities and tables and store them in S3 in Parquet format; we can query this data with Athena, joining on the file handle data that we now export to S3, to get the count and size (computed for the years 2020 and 2021):
Code Block language sql
WITH
F AS (SELECT (cast(id as varchar)) as id, MAX(contentsize) AS size FROM prod357filehandledatarecords R WHERE R.bucket = 'proddata.sagebase.org' GROUP BY id),
D AS (SELECT DISTINCT filehandleid FROM prod357filedownloadsrecords R WHERE R.year IN ('2020', '2021'))
SELECT COUNT(distinct D.filehandleid), SUM(F.size) FROM D JOIN F ON D.filehandleid = F.id
Note that we only consider downloads from our bucket. Additionally, we wanted to know how much of this is for files of at least 128 KB:
Code Block language sql
WITH
F AS (SELECT (cast(id as varchar)) as id, MAX(contentsize) AS size FROM prod357filehandledatarecords R WHERE R.bucket = 'proddata.sagebase.org' GROUP BY id),
D AS (SELECT DISTINCT filehandleid FROM prod357filedownloadsrecords R WHERE R.year IN ('2020', '2021'))
SELECT COUNT(distinct D.filehandleid), SUM(F.size) FROM D JOIN F ON D.filehandleid = F.id AND F.size >= 131072
The results are as follows:
Hot Data Count*: 9,802,820
Hot Data Size*: 314,611,245,904,518 (286.1 TiB)
Hot Data (>= 128KB) Count: 4,555,052
Hot Data (>= 128KB) Size: 314,472,063,061,465 (286 TiB)