...
The inventory also reports whether a file was uploaded as multipart; this gives us an idea of how many objects were uploaded without going through the standard synapse upload API:
...
This result is surprising: only 5.7M objects seem to be multipart uploads, but we have an order of magnitude more than that in the database. What is going on?
On further analysis we checked a few of those files and could see that they were in fact normal multipart uploads in the DB, with the corresponding file handles. The reason for this inconsistency is that we encrypted the S3 bucket back in 2019; this was most likely done using a PUT copy of each object onto itself. This belief is reinforced by the fact that the modified dates on those objects are consistent with the timeline of the encryption, while the original upload in synapse happened before that. If the python API was used for the copy, most likely all objects smaller than a certain size were “copied” over without multipart.
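A few of these objects can be cross-referenced directly in Athena. The sketch below is only illustrative: the inventory table name and the created_on/key columns on the file handle export are assumptions, not the exact names used.

```sql
-- Hypothetical spot check (table/column names are assumptions): non-multipart
-- objects whose S3 last modified date is later than the synapse upload date
-- would be consistent with an in-place PUT copy such as the 2019 encryption.
SELECT i.key,
       i.last_modified_date,
       f.created_on
FROM inventory i
JOIN file_handles_with_d_id f ON f.key = i.key
WHERE i.is_multipart_uploaded = false
  AND i.last_modified_date > f.created_on
LIMIT 10;
```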
...
The job on the table with around 46M rows took around 20 minutes (on a smaller instance with 16GB of RAM and 4 vCPUs). We then created a crawler to discover the table from S3:
...
This copied around 10.5K tables, out of which about 1.3K contained file handles. The process took roughly 4 minutes and exported 36173253 file handles. The biggest table contains 7093160 file handles. The number of distinct file handles is 27624848 (the duplicates are probably due to table snapshots).
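For reference, the totals above boil down to a simple aggregation over the crawled tables data. This is a sketch only; the table name files_tables and the column file_handle_id are assumptions about how the crawler registered the export.

```sql
-- Total vs distinct file handle ids referenced by synapse tables
SELECT COUNT(*)                       AS total_file_handles,
       COUNT(DISTINCT file_handle_id) AS distinct_file_handles
FROM files_tables;
```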
We then created an additional FILES_TABLE_DISTINCT table to store the distinct file handle ids:
...
The space taken by the file handles that are linked both in tables and in entities (which should already be included in the storage report above):
...
These are interesting numbers: it looks like the table links only claim about 3.4TB of data. The total number of file handles that are referenced by synapse tables and that are in the production bucket is 13760593 (49166 + 13711427). Out of the 27624848 unique file handles referenced by synapse tables, about 50% are in the prod bucket (the rest might be file handles that are stored elsewhere).
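A sketch of the kind of join behind these numbers; the ID column on the file handle export and the exact join are assumptions, not the query that was actually run.

```sql
-- Distinct file handles referenced by synapse tables that also exist in the
-- prod bucket, and the total space they claim.
SELECT COUNT(*), SUM(F.CONTENT_SIZE)
FROM FILES_TABLE_DISTINCT T
JOIN file_handles_with_d_id F ON F.ID = T.FILE_HANDLE_ID;
```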
Other Linked Objects
TODO, especially:
File handles for messages
Wiki markdown plus attachments
Submission files
Temporary Objects
TODO
...
, data uploaded by the backend that creates file handles using the local upload to the prod bucket. This data is linked in asynchronous job responses and there is no FileHandleAssociationProvider.
Storage reports (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job-%.csv')
CSV query results (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job-%.tsv'); the previous query already includes CSV query results (they use the same pattern). Note that these results are cached.
Bulk file downloads (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job%.zip')
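The three patterns above can also be summarized in a single pass; this is a sketch using the same table and key patterns as the queries above.

```sql
SELECT CASE
         WHEN F.KEY LIKE '%Job-%.csv' THEN 'storage reports / csv query results'
         WHEN F.KEY LIKE '%Job-%.tsv' THEN 'tsv query results'
         ELSE 'bulk file downloads'
       END AS CATEGORY,
       COUNT(*),
       SUM(F.CONTENT_SIZE)
FROM file_handles_with_d_id F
WHERE F.KEY LIKE '%Job-%.csv'
   OR F.KEY LIKE '%Job-%.tsv'
   OR F.KEY LIKE '%Job%.zip'
GROUP BY 1;
```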
...
```sql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'
```
Result: 6353
Upon further analysis of the backend code we discovered a bug where a multipart upload is initiated when we create or update a wiki page using the first version of the wiki API, which submitted the markdown as a string. The multipart upload is never completed in such cases: https://sagebionetworks.jira.com/browse/PLFM-6523. Additionally, the new multipart upload implementation that tracks the uploads was introduced relatively recently; the previous implementation might have left behind other unfinished multipart uploads.
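To separate uploads that are genuinely in progress from those left behind, the same count can be restricted to old uploads. This is a sketch; the STARTED_ON column name is an assumption about the MULTIPART_UPLOAD table schema.

```sql
-- Multipart uploads stuck in the UPLOADING state for more than 30 days are
-- almost certainly abandoned (e.g. the wiki markdown case in PLFM-6523).
SELECT COUNT(*)
FROM MULTIPART_UPLOAD U
WHERE U.STATE = 'UPLOADING'
  AND U.STARTED_ON < NOW() - INTERVAL 30 DAY;
```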
Summary of Results
How much data and how many file handles are potentially unlinked?