Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
SELECT COUNT(DISTINCT F.ID) FROM FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID WHERE R.IN_SYNAPSE_STORAGE = TRUE

We have 49169 49166 file handles referenced in tables that are also used in file entities. We can update the table above with this information:

Code Block
languagesql
UPDATE FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID WHERE R.IN_SYNAPSE_STORAGE = TRUE SET REUSED = TRUE

We created a Glue job to export this table to S3 so that we can join on the S3 data to compute the size claimed by tables, the ETL script is as follows:

...

Count

Size

13711427

4545704827408 (4.1TB)

The space taken by the This number is the unique file handles that are linked both in tables and in entities:

...

referenced by tables that is stored in the production bucket, if we want to compute the actual size (e.g. excluding file handle copies):

Code Block
languagesql
WITH U_SIZE AS (
  SELECT MAX(F.CONTENT_SIZE) CONTENT_SIZE FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE FALSE GROUP BY F.KEY
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE

Count

Size

4916612004739

66005128040194 3707513625379 (60TB)

...

3.4TB)

The space taken by the file handles that are linked both in tables and in entities (which should be part of the storage report reported above):

Code Block
SELECT COUNT(*), SUM(F.CONTENT_SIZE) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE

Count

13760593

...

Size

49166

66005128040194 (60TB)

Trying to deduplicate from copied file handles does not yield different results (meaning that this file handles were never copied):

Code Block
languagesql
WITH U_SIZE AS (
  SELECT MAX(F.CONTENT_SIZE) AS CONTENT_SIZE FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE GROUP BY F.KEY
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE

Count

Size

49166

66005128040194 (60TB)

These are interesting numbers, it looks like the table links are only claiming about 3.4TB of data. The total number of file handles referenced in tables that are in the production bucket is 13760593 (49166 + 13711427), of 27624848 unique file handles in synapse tables 50% are in the prod bucket (The rest might be file handles that are stored elsewhere).

Other Linked Objects

TODO

Temporary Objects

TODO

  • Storage reports (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job-%.csv')

  • CSV query results (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job-%.tsv), the previous query includes csv query results

  • Bulk file downloads (SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F WHERE F.KEY LIKE '%Job%.zip')

Multipart Uploads

In prod we do not clean up unfinished multipart uploads, the following script was used to fetch some statistics about those:

...