...

These are interesting numbers: it looks like the table links only claim about 3.4TB of data. The total number of file handles that are referenced by Synapse tables and stored in the production bucket is 13760593 (49166 + 13711427). Out of 27624848 unique file handles referenced by Synapse tables, about 50% are in the prod bucket (the rest might be file handles that are stored elsewhere).
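As a sanity check, the prod-bucket fraction can be recomputed with a single query. The following is only a sketch: it assumes the unique table-linked file handle ids are in tables_file_handles (keyed by ID, as used later on this page) and that the FILES_PROD temporary table is accessible in the same environment.

Code Block
languagesql
-- Sketch: count the unique file handles referenced by Synapse tables, and how many
-- of them live in the prod bucket (rows without a match have a NULL F.ID after the LEFT JOIN).
WITH U_T AS (
  SELECT DISTINCT T.ID FROM tables_file_handles T
)
SELECT COUNT(*) AS TOTAL, COUNT(F.ID) AS IN_PROD
FROM U_T LEFT JOIN FILES_PROD F ON U_T.ID = F.ID;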

Other Linked Objects

TODO, especially:

  • File handles for messages

  • Wiki markdown plus attachments

  • Submission files

Temporary Objects

TODO: data uploaded by the backend that creates file handles using the local upload to the prod bucket. This data is linked in asynchronous job responses and there is no FileHandleAssociationProvider.

...

Storage reports (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job-%.csv')

...

Query results (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job-%.tsv'); the previous query already includes the query results in CSV format (they use the same pattern). Note that these results are cached.

...

We have several other objects in Synapse that link back explicitly to file handles; the list of these relations is described by https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html. Most of those are relatively small sets; in the following we gather statistics about the associations where we see relatively big numbers:

  • MessageAttachment: File handles for user messages

  • WikiMarkdown: File handles that contain the markdown of wikis

  • WikiAttachment: File handles attached to wikis

  • SubmissionAttachment: File handles submitted as part of evaluation submissions (these are submitted as entities, but we keep the list of file handles submitted; since entities can be deleted we need to compute the delta)

The above objects track their file handle associations in dedicated tables, so we can figure out the count and size, taking care to de-duplicate:

MessageAttachment

Messages store their file handles in the MESSAGE_CONTENT table. The file handles are either uploaded internally (e.g. system-generated emails) or externally (the message content is uploaded first and then the message is sent through the API; an example is the evaluation orchestrator):

We can count and compute the file size directly from the DB:

Code Block
languagesql
WITH U_F AS (
  SELECT DISTINCT(M.FILE_HANDLE_ID) FROM MESSAGE_CONTENT M
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;

Count | Size
437029 | 2431601540 (2.26GB)

Just for completeness, we can de-duplicate further by the object key (since file handles are exposed in the API, external processes might make a copy of the same file handle):

Code Block
languagesql
WITH U_SIZE AS (
  SELECT MAX(CONTENT_SIZE) AS CONTENT_SIZE FROM MESSAGE_CONTENT M JOIN FILES_PROD F ON M.FILE_HANDLE_ID = F.ID GROUP BY `KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE;

Count | Size
436974 | 2431579468 (2.26GB)

WikiMarkdown and WikiAttachment

The file handles for both the wiki markdown and the wiki attachments are tracked in the table V2_WIKI_ATTACHMENT_RESERVATION:

Code Block
languagesql
WITH U_F AS (
  SELECT DISTINCT(FILE_HANDLE_ID) FROM V2_WIKI_ATTACHMENT_RESERVATION
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;

Count | Size
1150410 | 22359954476 (20.8GB)

De-duplicating by object key for the size:

Code Block
languagesql
WITH U_SIZE AS (
  SELECT MAX(CONTENT_SIZE) AS CONTENT_SIZE FROM V2_WIKI_ATTACHMENT_RESERVATION W JOIN FILES_PROD F ON W.FILE_HANDLE_ID = F.ID GROUP BY `KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE;

Count | Size
1112096 | 16977090003 (15.8GB)

SubmissionAttachment

The file handles of evaluation submissions are tracked in the table JDOSUBMISSION_FILE:

Code Block
languagesql
WITH U_F AS (
  SELECT DISTINCT(FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;

Count | Size
93198 | 13031793753688 (11.8TB)

Note however that submissions are linked to entities; we need to compute the count and size of the submissions whose file handle is NOT linked to an entity:

Code Block
languagesql
WITH U_F AS (
 SELECT DISTINCT(S.FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE S LEFT JOIN JDOREVISION R ON S.FILE_HANDLE_ID = R.FILE_HANDLE_ID WHERE R.FILE_HANDLE_ID IS NULL
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;

Count | Size
42979 | 613597243955 (0.55TB)

Therefore, out of 93198 submissions with file entities, 42979 (46%) are not linked to entities and take 0.55TB of data (0.47% of the total). Note that we used the table with the file handles that are only in prod (i.e. the temporary table FILES_PROD created previously); the numbers are similar if we include all other storage locations (+/- 300GB).
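For reference, the all-locations variant is the same delta query joined against the full file handle table rather than FILES_PROD. A sketch (FILES_ALL is a hypothetical stand-in for that table):

Code Block
languagesql
-- Sketch: same delta as above, but FILES_ALL stands in for a table with the file
-- handles from every storage location, not just the prod bucket.
WITH U_F AS (
  SELECT DISTINCT(S.FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE S
  LEFT JOIN JDOREVISION R ON S.FILE_HANDLE_ID = R.FILE_HANDLE_ID
  WHERE R.FILE_HANDLE_ID IS NULL
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_ALL F ON U_F.FILE_HANDLE_ID = F.ID;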

Temporary File Handles

Synapse also uses file handles for what I refer to as temporary file handles: file handles that are most likely used only once. These file handles can be generated internally by the system or uploaded externally; for the latter we cannot control where the file handle ends up (e.g. in our bucket).

In particular, we identified 4 places in the code base where we use file handles that might end up in our bucket (the main reference is the usage of the multipartUploadLocalFile method: https://github.com/Sage-Bionetworks/Synapse-Repository-Services/blob/e2b84bfbbd1d1450e955f360199f5448a1b620c3/services/repository-managers/src/main/java/org/sagebionetworks/repo/manager/file/MultipartManagerImpl.java#L222). For these cases we do not have an association, but they still make up a chunk of the data in prod:

  • Storage reports: CSV reports that contain a breakdown of the data consumed by the entities in each project.

  • Query results: Table and View query results that are stored in CSV or TSV files. They are used in some cases as a cache.

  • Bulk downloads: Zipped packages for bulk downloads up to 2GB.

  • Table uploads: These are CSV files uploaded by the user in order to update tables or views.

All of the above are linked to asynchronous job requests; the body of the request is stored as a blob in the database. Unfortunately the asynchronous job status table is not migratable, therefore we lose the file handle links at each release cycle.

For some of these we can estimate the size, since we know the file name pattern used by the backend when the file handles are created. For this we used Athena on the exported file handle table, since the results can be computed much faster:
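For context, the exported table can be declared in Athena along the following lines. This is only a sketch: the export location, storage format, and exact column set are assumptions, apart from the ID, KEY, and CONTENT_SIZE columns used by the queries below.

Code Block
languagesql
-- Hypothetical DDL for the exported file handle table; the bucket/prefix and the
-- Parquet format are assumed, not the actual export configuration.
CREATE EXTERNAL TABLE IF NOT EXISTS file_handles_with_d_id (
  ID BIGINT,
  `KEY` STRING,
  CONTENT_SIZE BIGINT
)
STORED AS PARQUET
LOCATION 's3://example-export-bucket/file_handles_with_d_id/';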

Storage reports

They are initiated from a DownloadStorageReportRequest; a background job creates a file with the pattern Job-{jobId}.csv. Using Athena we ran the following query:

Code Block
languagesql
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.csv$')

Count | Size
2532705 | 2206719851653 (2TB)

Note that this also includes query results in CSV format, since they follow the same pattern.

Query Results

They are initiated from a DownloadFromTableRequest; a background job creates either a CSV or TSV file with the pattern Job-{jobId}.{csv|tsv}. Since the previous query already included the csv pattern, we ran the following query:

Code Block
languagesql
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.tsv$')

Count | Size
4085 | 85852089838 (80GB)
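The two Athena queries above can also be combined into a single breakdown by extension (a sketch, equivalent to running them separately):

Code Block
languagesql
-- Sketch: count and size of Job-{jobId} result files, grouped by csv/tsv extension.
SELECT regexp_extract(F.KEY, '\.(csv|tsv)$', 1) AS EXTENSION,
       COUNT(*), SUM(CONTENT_SIZE)
FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.(csv|tsv)$')
GROUP BY 1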

Bulk Downloads

They are initiated from a BulkFileDownloadRequest; a background job creates a zipped package using the file pattern Job{jobId}.zip. This is only partially true, as the request can override the filename. We ran the following query with Athena:

Code Block
languagesql
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')

Count | Size
1669369 | 20575241255094 (18.7TB)

Since this is a lot of data, to make sure those are not file handles used in entities or tables, we additionally ran the following query:

Code Block
languagesql
WITH F AS (
  SELECT DISTINCT F.ID FROM file_handles_with_d_id F WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')
) 
SELECT COUNT(*) FROM F JOIN nodes_file_handles N ON F.ID = N.FILE_HANDLE_ID
UNION
SELECT COUNT(*) FROM F JOIN tables_file_handles T ON F.ID = T.ID

That returned 0 for both counts, meaning that with high probability these file handles are not linked anywhere else.

Table Uploads

They are initiated from an UploadToTableRequest; this request can even be embedded in a TableUpdateTransactionRequest as part of the change set. Unfortunately, at the moment we do not have a way to estimate their size or numbers; in particular we verified that:

  • The asynchronous job status table is not migratable, therefore we lose any information about the original request

  • The data warehouse does not store the request body of requests, and the id of the file handle is only available as part of the request body

  • We checked the web client implementation to see if there was a common file name pattern used when uploading the CSV, but the name of the original file is used instead

Multipart Uploads

In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about those:

...