Upon further analysis of the backend code, we discovered a bug: a multipart upload is initiated whenever a wiki page is created or updated through the first version of the wiki API, which submitted the markdown as a string. In such cases the multipart upload is never completed (https://sagebionetworks.jira.com/browse/PLFM-6523). Additionally, the new multipart upload implementation that tracks the uploads was introduced relatively recently, so the previous implementation might have left behind other unfinished multipart uploads.
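One way to look for such leftovers is to list the multipart uploads still in progress on the bucket and flag the old ones. The following is a minimal sketch, not the backend's own code: it assumes a boto3 S3 client is passed in, and the 7-day threshold, the function names and the sample keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def iter_multipart_uploads(s3_client, bucket):
    """Yield the in-progress multipart uploads of a bucket.

    Assumes s3_client is a boto3 S3 client. Each yielded dict carries
    "Key", "UploadId" and "Initiated", as in the ListMultipartUploads response.
    """
    paginator = s3_client.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        yield from page.get("Uploads", [])


def stale_uploads(uploads, now, max_age=timedelta(days=7)):
    """Return (key, upload_id) pairs for uploads initiated more than max_age ago."""
    cutoff = now - max_age
    return [(u["Key"], u["UploadId"]) for u in uploads if u["Initiated"] < cutoff]


# Hypothetical sample records, shaped like ListMultipartUploads entries.
now = datetime(2020, 11, 5, tzinfo=timezone.utc)
sample = [
    {"Key": "wiki/markdown-1", "UploadId": "upload-a",
     "Initiated": datetime(2020, 1, 10, tzinfo=timezone.utc)},
    {"Key": "data/file-2", "UploadId": "upload-b",
     "Initiated": datetime(2020, 11, 4, tzinfo=timezone.utc)},
]
stale = stale_uploads(sample, now)
print(stale)
```

Against a real bucket, `stale_uploads(iter_multipart_uploads(client, bucket_name), now)` would give the candidates to review before aborting anything.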

Summary of Results

How much data and how many file handles are potentially unlinked? Working on a snapshot of prod 332 (11/05/2020), we have the following numbers for file handles in the production bucket:

| | Count | Count in S3 (Unique Keys) | Size in DB | Size in S3 (Unique Keys) | Description |
|---|---|---|---|---|---|
| File Handles | 47,106,657 | 39,426,647 | ~679 TB | ~633 TB | File handles that point to the production bucket |
| Linked File Entities | 4,749,667 | 4,659,729 | ~589.7 TB | ~560 TB | Entities that point to file handles in the production bucket |
| Linked Table Rows | 13,711,427 | 12,004,739 | ~4.1 TB | ~3.4 TB | File handles referenced in tables that point to the production bucket |
| Other Links | ~1,630,418 | ~1,592,049 | ~0.6 TB | ~0.6 TB | Other types of linked file handles that point to the production bucket |
| Temporary Handles | ~4,206,159 | ~4,206,159 | ~20.7 TB | ~20.7 TB | File handles that are not linked and are mostly for one-time use |

Additionally we have the following figures for S3:

| | Count | Size | Description |
|---|---|---|---|
| S3 Objects | ~41,625,517 | ~640 TB | The objects in S3 from the inventory |
| No S3 Objects | 101 | ~10.4 GB | Objects that are referenced by existing file handles but do not exist in S3 |
| No File Handle | 2,198,971 | ~7.5 TB | Objects that do not have any file handle |
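The "No S3 Objects" and "No File Handle" rows are, in effect, the two set differences between the keys referenced by file handles and the keys in the S3 inventory. The sketch below illustrates that comparison on a few hypothetical keys, then checks that the reported counts fit together. It assumes that the 39,426,647 unique keys of the file handles include the 101 keys missing from S3; the function and variable names are illustrative only.

```python
def compare_keys(file_handle_keys, inventory_keys):
    """Split the keys into those missing from S3 and those without a file handle."""
    handles = set(file_handle_keys)
    inventory = set(inventory_keys)
    return {
        "no_s3_object": handles - inventory,    # referenced in the DB, absent from S3
        "no_file_handle": inventory - handles,  # present in S3, no file handle
    }


# Hypothetical keys, for illustration only.
result = compare_keys(
    file_handle_keys=["a/1", "a/2", "a/3"],
    inventory_keys=["a/2", "a/3", "b/9"],
)
print(result)

# Consistency check on the reported figures (assumption: the unique keys
# of the file handles include the 101 keys that are missing from S3).
unique_file_handle_keys = 39_426_647
missing_in_s3 = 101
no_file_handle = 2_198_971
expected_inventory = unique_file_handle_keys - missing_in_s3 + no_file_handle
print(expected_inventory)
```

Under that assumption the result matches the ~41,625,517 objects reported for the inventory.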

In summary, out of the 47M file handles that point to the production bucket, we can account for about 24M (~50%). Out of 633 TB of indexed data, we can account for about 585 TB (92%). The data that can potentially be archived amounts to about 48 TB, pointed to by around 23M file handles. Note that the temporary file handles can potentially be archived as well, removing an additional 20.7 TB from the bucket.
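The summary figures can be reproduced from the table of file handles. This is a minimal worked example, assuming that the "accounted for" totals are the sum of the linked file entities, linked table rows, other links and temporary handles, and that sizes are taken from the "Size in S3 (Unique Keys)" column. The variable names are illustrative.

```python
# Counts and S3 sizes (in TB) taken from the file handle figures above.
total_handles = 47_106_657
total_size_tb = 633.0

accounted = {
    "Linked File Entities": (4_749_667, 560.0),
    "Linked Table Rows": (13_711_427, 3.4),
    "Other Links": (1_630_418, 0.6),
    "Temporary Handles": (4_206_159, 20.7),
}

accounted_count = sum(count for count, _ in accounted.values())
accounted_size_tb = sum(size for _, size in accounted.values())

unaccounted_count = total_handles - accounted_count
unaccounted_size_tb = total_size_tb - accounted_size_tb

print(f"accounted handles:   {accounted_count:,} "
      f"({accounted_count / total_handles:.0%})")
print(f"accounted size:      {accounted_size_tb:.1f} TB "
      f"({accounted_size_tb / total_size_tb:.0%})")
print(f"unaccounted handles: {unaccounted_count:,}")
print(f"unaccounted size:    {unaccounted_size_tb:.1f} TB")
```

This gives roughly 24.3M accounted handles and 584.7 TB accounted data, leaving about 22.8M handles and 48.3 TB unaccounted, in line with the rounded numbers in the summary.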