...
These are interesting numbers: it looks like the table links only claim about 3.4TB of data. The total number of file handles that are referenced by Synapse tables and live in the production bucket is 13760593 (49166 + 13711427). Out of the 27624848 unique file handles referenced by Synapse tables, about 50% are in the prod bucket (the rest might be file handles that are stored elsewhere).
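As a quick sanity check, the fraction can be recomputed directly from the counts above; a minimal arithmetic sketch in plain SQL (no table access needed):

```sql
-- Back-of-the-envelope check of the numbers quoted above.
SELECT 49166 + 13711427 AS handles_in_prod,               -- 13760593
       (49166 + 13711427) / 27624848.0 AS prod_fraction;  -- ~0.498, i.e. about 50%
```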
Other Linked Objects
TODO, especially:
File handles for messages
Wiki markdown plus attachments
Submissions files
Temporary Objects
TODO: data uploaded by the backend, which creates file handles using the local upload to the prod bucket. This data is linked in asynchronous job responses and there is no FileHandleAssociationProvider.
...
Storage reports (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job-%.csv')
...
CSV query results (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job-%.tsv'); the previous query already includes the CSV query results (they use the same pattern). Note that these results are cached.
...
We have several other objects in Synapse that link back explicitly to file handles; the list of these relations is described by https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html. Most of those are relatively small sets; in the following we gather statistics about the associations where we see relatively big numbers:
MessageAttachment: File handles for user messages
WikiMarkdown: File handles that contain the markdown of wikis
WikiAttachment: File handles attached to wikis
SubmissionAttachment: File handles submitted as part of evaluation submissions (these are submitted as entities, but we keep the list of submitted file handles; since entities can be deleted we need to compute the delta)
The above objects track their file handle associations in dedicated tables, so we can figure out the count and size, taking care to de-duplicate:
MessageAttachment
Messages store their file handles in the MESSAGE_CONTENT table. The file handles are either uploaded internally (e.g. system generated emails) or externally (the message file is first uploaded and the message is then sent through the API; an example is the evaluation orchestrator):
We can count and compute the file size directly from the DB:
```sql
WITH U_F AS (
  SELECT DISTINCT(M.FILE_HANDLE_ID) FROM MESSAGE_CONTENT M
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;
```
Count | Size |
---|---|
437029 | 2431601540 (2.26GB) |
Just for completeness, we can de-duplicate further by the object key (since file handles are exposed in the API, external processes might make a copy of the same file handle):
```sql
WITH U_SIZE AS (
  SELECT MAX(CONTENT_SIZE) AS CONTENT_SIZE FROM MESSAGE_CONTENT M JOIN FILES_PROD F ON M.FILE_HANDLE_ID = F.ID GROUP BY `KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE;
```
Count | Size |
---|---|
436974 | 2431579468 (2.26GB) |
WikiMarkdown and WikiAttachment
The file handles for both the wiki markdown and the wiki attachments are tracked in the table V2_WIKI_ATTACHMENT_RESERVATION:
```sql
WITH U_F AS (
  SELECT DISTINCT(FILE_HANDLE_ID) FROM V2_WIKI_ATTACHMENT_RESERVATION
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;
```
Count | Size |
---|---|
1150410 | 22359954476 (20.8GB) |
De-duplicating by object key for the size:
```sql
WITH U_SIZE AS (
  SELECT MAX(CONTENT_SIZE) AS CONTENT_SIZE FROM V2_WIKI_ATTACHMENT_RESERVATION W JOIN FILES_PROD F ON W.FILE_HANDLE_ID = F.ID GROUP BY `KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE;
```
Count | Size |
---|---|
1112096 | 16977090003 (15.8GB) |
SubmissionAttachment
The file handles of evaluation submissions are tracked in the table JDOSUBMISSION_FILE:
```sql
WITH U_F AS (
  SELECT DISTINCT(FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;
```
Count | Size |
---|---|
93198 | 13031793753688 (11.8TB) |
Note however that submissions are linked to entities; we need to compute the count and size of submissions whose file handle is NOT linked to an entity:
```sql
WITH U_F AS (
  SELECT DISTINCT(S.FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE S
  LEFT JOIN JDOREVISION R ON S.FILE_HANDLE_ID = R.FILE_HANDLE_ID
  WHERE R.FILE_HANDLE_ID IS NULL
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;
```
Count | Size |
---|---|
42979 | 613597243955 (0.55TB) |
Therefore, out of 93198 submissions with file entities, 42979 (46%) are not linked to entities and take up 0.55TB of data (0.47% of the total). Note that we used the table with file handles that are only in prod (i.e. the temporary table FILES_PROD created previously); the numbers are similar if we include all other storage locations (+/- 300GB).
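Checking that last claim amounts to re-running the anti-join against a file handle table that is not filtered by bucket; a sketch, assuming a hypothetical FILES_ALL temporary table built like FILES_PROD but without the bucket filter (FILES_ALL is not part of the original analysis):

```sql
-- Variant of the previous query ignoring the storage location.
-- FILES_ALL is a hypothetical temp table: FILES_PROD without the bucket filter.
WITH U_F AS (
  SELECT DISTINCT(S.FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE S
  LEFT JOIN JDOREVISION R ON S.FILE_HANDLE_ID = R.FILE_HANDLE_ID
  WHERE R.FILE_HANDLE_ID IS NULL
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_ALL F ON U_F.FILE_HANDLE_ID = F.ID;
```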
Temporary File Handles
Synapse also uses file handles for what I refer to as temporary file handles: file handles that are most likely used only once. These file handles can be generated internally by the system or uploaded externally; for the latter we cannot control where the file handle ends up (e.g. in our bucket).
In particular, we identified 4 places in the code base where we use file handles that might end up in our bucket (the main reference is the usage of the multipartUploadLocalFile method: https://github.com/Sage-Bionetworks/Synapse-Repository-Services/blob/e2b84bfbbd1d1450e955f360199f5448a1b620c3/services/repository-managers/src/main/java/org/sagebionetworks/repo/manager/file/MultipartManagerImpl.java#L222). For these cases we do not have an association, but they still make up a chunk of the data in prod:
Storage reports: CSV reports that contain a breakdown of the data consumed by entities in each project.
Query results: Table and View query results that are stored in CSV or TSV files. They are used in some cases as a cache.
Bulk downloads: Zipped packages for bulk downloads up to 2GB.
Table uploads: These are CSV files uploaded by the user in order to update tables or views.
All of the above are linked to asynchronous job requests, whose body is stored as a blob in the database. Unfortunately the asynchronous job status is not a migratable table, therefore we lose the file handle links at each release cycle.
For some of these we can estimate the size since we know the file name pattern used by the backend when the file handles are created. For this we used Athena on the exported file handle table, since the results can be computed much faster; the individual queries are detailed in the subsections below.
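For an at-a-glance view, the same patterns can also be combined into a single breakdown query; a sketch in Athena SQL, reusing the file_handles_with_d_id export and the file name patterns from the subsections that follow:

```sql
-- Breakdown of temporary job files by file name pattern
-- (patterns as described in the subsections below).
SELECT
  CASE
    WHEN regexp_like(F.KEY, '\d+/.+/Job-\d+\.csv$') THEN 'storage report / CSV query result'
    WHEN regexp_like(F.KEY, '\d+/.+/Job-\d+\.tsv$') THEN 'TSV query result'
    WHEN regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')  THEN 'bulk download'
    ELSE 'other'
  END AS JOB_FILE_TYPE,
  COUNT(*) AS FILES,
  SUM(F.CONTENT_SIZE) AS TOTAL_SIZE
FROM file_handles_with_d_id F
GROUP BY 1;
```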
Storage reports
They are initiated from a DownloadStorageReportRequest; a background job creates a file with the pattern Job-{jobId}.csv. Using Athena we ran the following query:
```sql
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.csv$')
```
Count | Size |
---|---|
2532705 | 2206719851653 (2TB) |
Note that this also includes query results in CSV format, since they follow the same pattern.
Query Results
They are initiated from a DownloadFromTableRequest; a background job creates either a CSV or TSV file with the pattern Job-{jobId}.{csv|tsv}. Since the previous query already included the csv pattern, we ran the following query:
```sql
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.tsv$')
```
Count | Size |
---|---|
4085 | 85852089838 (80GB) |
Bulk Downloads
They are initiated from a BulkFileDownloadRequest; a background job creates a zipped package using the file pattern Job{jobId}.zip. This is only partially accurate, as the request can override the filename. We ran the following query with Athena:
```sql
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')
```
Count | Size |
---|---|
1669369 | 20575241255094 (18.7TB) |
Since this is a lot of data, to make sure those are not file handles used in entities or tables we additionally ran the following query:
```sql
WITH F AS (
  SELECT DISTINCT F.ID FROM file_handles_with_d_id F WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')
)
SELECT COUNT(*) FROM F JOIN nodes_file_handles N ON F.ID = N.FILE_HANDLE_ID
UNION
SELECT COUNT(*) FROM F JOIN tables_file_handles T ON F.ID = T.ID
```
Both counts came back as 0, meaning that with high probability these file handles are not linked anywhere else.
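The same cross-check could in principle be extended to the other associations discussed earlier; a sketch, assuming the MESSAGE_CONTENT, V2_WIKI_ATTACHMENT_RESERVATION and JDOSUBMISSION_FILE tables were also exported to Athena under those names (which was not done as part of this analysis):

```sql
-- Hypothetical extension of the check above to the other association tables
-- (assumes these MySQL tables were exported to Athena, which was not done here).
WITH F AS (
  SELECT DISTINCT F.ID FROM file_handles_with_d_id F WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')
)
SELECT 'message' AS ASSOCIATION, COUNT(*) AS LINKED FROM F JOIN MESSAGE_CONTENT M ON F.ID = M.FILE_HANDLE_ID
UNION ALL
SELECT 'wiki', COUNT(*) FROM F JOIN V2_WIKI_ATTACHMENT_RESERVATION W ON F.ID = W.FILE_HANDLE_ID
UNION ALL
SELECT 'submission', COUNT(*) FROM F JOIN JDOSUBMISSION_FILE S ON F.ID = S.FILE_HANDLE_ID;
```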
Table Uploads
They are initiated from an UploadToTableRequest; this request can even be embedded in a TableUpdateTransactionRequest as part of the change set. Unfortunately, at the moment we do not have a way to estimate their size or numbers; in particular we verified that:
The asynchronous job status table is not migratable, therefore we lose any information in the original request
The data warehouse does not store the request body, and the id of the file handle is only recorded as part of the request body
We checked the web client implementation to see if there was a common file name pattern used when uploading the CSV, but the name of the original file is used instead
Multipart Uploads
In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about those:
...