...
Upon further analysis of the backend code we discovered a bug: a multipart upload is initiated whenever a wiki page is created or updated through the first version of the wiki API, which submitted the markdown as a string, and in those cases the multipart upload is never completed (https://sagebionetworks.jira.com/browse/PLFM-6523). Additionally, the new multipart upload implementation that tracks uploads was introduced relatively recently, so the previous implementation might have left behind other unfinished multipart uploads.
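One way to gauge how many unfinished multipart uploads are left in a bucket is to filter the in-progress uploads reported by S3 by their initiation time. The following is a minimal sketch, not the approach used in this analysis: the function name `stale_uploads` and the bucket name in the commented usage are hypothetical, and the page layout (an `Uploads` list whose entries carry an `Initiated` timestamp) is assumed to match what boto3's `list_multipart_uploads` paginator returns.

```python
from datetime import datetime, timedelta, timezone


def stale_uploads(pages, older_than, now=None):
    """Yield in-progress multipart uploads initiated before `now - older_than`.

    `pages` is any iterable of ListMultipartUploads-style response dicts.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - older_than
    for page in pages:
        # A page without in-progress uploads may omit the "Uploads" key.
        for upload in page.get("Uploads", []):
            if upload["Initiated"] < cutoff:
                yield upload


# Hypothetical usage against a real bucket (requires boto3 and credentials):
#   import boto3
#   pages = boto3.client("s3").get_paginator("list_multipart_uploads").paginate(
#       Bucket="example-bucket")  # bucket name is a placeholder
#   for upload in stale_uploads(pages, timedelta(days=7)):
#       print(upload["Key"], upload["UploadId"], upload["Initiated"])
```

Taking the pages as a parameter keeps the filter independent of the S3 client, so it can be exercised with recorded responses before being pointed at the production bucket.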
Summary of Results
How much data and how many file handles are potentially unlinked?
Working on a snapshot of prod 332 (11/05/2020), we have the following numbers for file handles in the production bucket:
| | Count | Count in S3 (Unique Keys) | Size in DB | Size in S3 (Unique Keys) | Description |
---|---|---|---|---|---|
File Handles | 47,106,657 | 39,426,647 | ~679 TB | ~633 TB | File handles that point to the production bucket |
Linked File Entities | 4,749,667 | 4,659,729 | ~589.7 TB | ~560 TB | Entities that point to file handles in the production bucket |
Linked Table Rows | 13,711,427 | 12,004,739 | ~4.1 TB | ~3.4 TB | File handles referenced in tables that point to the production bucket |
Other Links | ~1,630,418 | ~1,592,049 | ~0.6 TB | ~0.6 TB | Other types of linked file handles that point to the production bucket |
Temporary Handles | ~4,206,159 | ~4,206,159 | ~20.7 TB | ~20.7 TB | File handles that are not linked, and mostly one time use |
Additionally we have the following figures for S3:
| | Count | Size | Description |
---|---|---|---|
S3 Objects | ~41,625,517 | ~640 TB | The objects in S3 from the inventory |
No S3 Objects | 101 | ~10.4 GB | Objects that are referenced by existing file handles but do not exist in S3 |
No File Handle | 2,198,971 | ~7.5 TB | Objects that do not have any file handle |
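The two tables can be cross-checked against each other: the unique keys referenced by file handles, plus the objects without any file handle, minus the referenced keys that are missing from S3, should match the inventory count. A minimal sketch of that check, using only the figures above; the variable names are illustrative, and it assumes that the unique-key count for file handles includes the 101 keys that no longer exist in S3:

```python
# Figures copied from the two tables above (prod 332 snapshot).
unique_keys_in_file_handles = 39_426_647  # File Handles, "Count in S3 (Unique Keys)"
objects_without_file_handle = 2_198_971   # "No File Handle"
keys_missing_from_s3 = 101                # "No S3 Objects"
inventory_objects = 41_625_517            # "S3 Objects" (reported as approximate)

# Assumption: the unique-key count includes keys that are missing from S3,
# so they are subtracted to obtain the number of objects actually in the bucket.
expected_objects = (
    unique_keys_in_file_handles
    + objects_without_file_handle
    - keys_missing_from_s3
)
difference = expected_objects - inventory_objects

print(f"expected objects:  {expected_objects:,}")
print(f"inventory objects: {inventory_objects:,}")
print(f"difference:        {difference:,}")
```

Under that assumption the counts reconcile exactly. The sizes agree as well within rounding: ~633 TB of indexed data plus ~7.5 TB without a file handle is close to the ~640 TB reported by the inventory.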
In summary, out of the 47M file handles that point to the production bucket, we can account for about 24M (~50%) as linked or temporary. Out of 633 TB of indexed data, we can account for about 585 TB (92%). The data that can potentially be archived amounts to about 48 TB, referenced by around 23M file handles. Note that the temporary file handles can potentially be archived as well, removing an additional 20.7 TB from the bucket.
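The summary figures can be reproduced from the first table. The sketch below assumes that "accounted for" means the sum of the linked and temporary rows, which is what the reported totals imply; the dictionary keys are illustrative labels, and the sizes are the rounded "Size in S3 (Unique Keys)" values.

```python
# Counts per category, copied from the file handle table.
accounted_handles = {
    "file_entities": 4_749_667,
    "table_rows": 13_711_427,
    "other_links": 1_630_418,
    "temporary": 4_206_159,
}
# Size in S3 (unique keys) per category, in TB (rounded values from the table).
accounted_tb = {
    "file_entities": 560.0,
    "table_rows": 3.4,
    "other_links": 0.6,
    "temporary": 20.7,
}
total_handles = 47_106_657  # all file handles pointing to the production bucket
total_tb = 633.0            # indexed data in S3 (unique keys)

handles_accounted = sum(accounted_handles.values())      # 24,297,671
handles_unaccounted = total_handles - handles_accounted  # 22,808,986
tb_accounted = sum(accounted_tb.values())
tb_unaccounted = total_tb - tb_accounted

print(f"handles accounted for: {handles_accounted:,} "
      f"({handles_accounted / total_handles:.1%})")
print(f"handles not accounted: {handles_unaccounted:,}")
print(f"data accounted for:    {tb_accounted:.1f} TB "
      f"({tb_accounted / total_tb:.1%})")
print(f"data not accounted:    {tb_unaccounted:.1f} TB")
```

Because the table sizes are rounded, the result is only good to about a terabyte, which is consistent with the rounded 585 TB and 48 TB quoted in the summary.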