...
The inventory also reports whether a file was uploaded as multipart; this would give us an idea of how many objects were uploaded without going through the standard synapse upload API:
...
This result is surprising: only 5.7M objects appear to be multipart uploads, but we have an order of magnitude more than that in the database. What is going on?
On further analysis we checked a few of those files and we could see that they were in fact normal multipart uploads in the DB with the related file handles. The reason for this inconsistency is that we encrypted the S3 bucket back in 2019, most likely using a PUT copy of each object onto itself. This belief is reinforced by the fact that the modified dates on those objects are consistent with the timeline of the encryption, while the original upload in synapse happened earlier. If the python API was used, most likely all the objects smaller than a certain size were “copied” over without multipart.
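As a reference, a server-side re-encryption of an existing object is typically performed with a copy request that targets the same bucket and key; for objects up to 5GB this is a single (non multipart) PUT copy, which would explain why the copied objects are no longer flagged as multipart in the inventory. The following is a minimal sketch (not the script actually used for the 2019 encryption), assuming the AWS SDK for Java v1 and placeholder bucket/key values:
Code Block language java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class EncryptInPlaceExample {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        String bucket = "proddata.sagebase.org";
        // Placeholder key, in practice this would be done for each object in the bucket
        String key = "example/key";

        // Ask S3 to encrypt the destination copy with SSE-S3 (AES256)
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setSSEAlgorithm(ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION);

        // Copying the object onto itself rewrites it server side; for objects up to 5GB
        // this is a single PUT copy, so the new object is not marked as a multipart upload.
        s3.copyObject(new CopyObjectRequest(bucket, key, bucket, key).withNewObjectMetadata(metadata));
    }
}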
...
File Handles
We created a table that contains only the data pointing to the proddata.sagebase.org bucket and that also includes a de-duplication identifier:
...
The job on the table, which had around 46M rows, took around 20 minutes (on a smaller instance with 16GB of RAM and 4 vCPUs). We then created a crawler to discover the table from S3:
...
Note that this number most likely contains temporary objects that were never deleted (e.g. temporary multipart upload files, old tests, staging data, etc.).
...
File Entities
From the synapse storage report that is generated monthly and stored in a synapse table we can get an idea of how much data is used by synapse entities in projects. The following query gets the aggregated sum of the size in bytes for the last 10 months:
...
This copied around 10.5K tables, out of which about 1.3K contained file handles. The process took roughly 4 minutes and exported 36173253 file handles. The biggest table contains 7093160 file handles. The number of distinct file handles is 27624848 (the difference is probably due to table snapshots).
We then created an additional FILES_TABLE_DISTINCT table to store the distinct file handle ids:
...
Code Block |
---|
SELECT COUNT(DISTINCT F.ID) FROM FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID WHERE R.IN_SYNAPSE_STORAGE = TRUE |
We have 49166 file handles referenced in tables that are also used in file entities. We can update the table above with this information:
Code Block | ||
---|---|---|
| ||
UPDATE FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID SET F.REUSED = TRUE WHERE R.IN_SYNAPSE_STORAGE = TRUE |
We created a Glue job to export this table to S3 so that we can join on the S3 data to compute the size claimed by tables; the ETL script is as follows:
...
Count | Size |
---|---|
13711427 | 4545704827408 (4.1TB) |
This number is the unique file handles that are linked both in tables and in entities:
...
referenced by tables that is stored in the production bucket, if we want to compute the actual size (e.g. excluding file handle copies):
Code Block | ||
---|---|---|
| ||
WITH U_SIZE AS ( SELECT MAX(F.CONTENT_SIZE) CONTENT_SIZE FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = FALSE GROUP BY F.KEY ) SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE |
Count | Size |
---|---|
12004739 | 3707513625379 (3.4TB) |
...
The space taken by the file handles that are linked both in tables and in entities (and that are included in the storage report reported above):
Code Block |
---|
SELECT COUNT(*), SUM(F.CONTENT_SIZE) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE |
...
Count | Size |
---|---|
49166 | 66005128040194 (60TB) |
Trying to de-duplicate the copied file handles does not yield different results (meaning that these file handles were never copied):
Code Block |
---|
WITH U_SIZE AS (
SELECT MAX(F.CONTENT_SIZE) AS CONTENT_SIZE FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE GROUP BY F.KEY
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE |
Count | Size |
---|---|
49166 | 66005128040194 (60TB) |
These are interesting numbers: it looks like the table links are only claiming about 3.4TB of data. The total number of file handles referenced by synapse tables that are in the production bucket is 13760593 (49166 + 13711427). Out of 27624848 unique file handles referenced by synapse tables, 50% are in the prod bucket (the rest might be file handles that are stored elsewhere).
Other Linked Objects
We have several other objects in Synapse that link back explicitly to file handles; the list of these relations is described by https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html. Most of those are relatively small sets; in the following we gather statistics about the associations where we see relatively big numbers:
MessageAttachment: File handles for user messages
WikiMarkdown: File handles that contain the markdown of wikis
WikiAttachment: File handles attached to wikis
SubmissionAttachment: File handles submitted as part of evaluation submissions (these are submitted as entities, but we keep the list of the submitted file handles; since entities can be deleted we need to compute the delta)
The above objects track their file handle associations in dedicated tables, so we can figure out the count and size, taking care of de-duplicating:
MessageAttachment
Messages store their file handles in the MESSAGE_CONTENT table; the file handles are either uploaded internally (e.g. system generated emails) or externally (the message is first uploaded and then sent through the API; an example is the evaluation orchestrator).
We can count and compute the file size directly from the DB:
Code Block | ||
---|---|---|
| ||
WITH U_F AS (
SELECT DISTINCT(M.FILE_HANDLE_ID) FROM MESSAGE_CONTENT M
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID; |
Count | Size |
---|---|
437029 | 2431601540 (2.26GB) |
Just for completeness, we can de-duplicate further by the object key (since file handles are exposed in the API, external processes might actually make a copy of the same file handle):
Code Block | ||
---|---|---|
| ||
WITH U_SIZE AS (
SELECT MAX(CONTENT_SIZE) AS CONTENT_SIZE FROM MESSAGE_CONTENT M JOIN FILES_PROD F ON M.FILE_HANDLE_ID = F.ID GROUP BY `KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE; |
Count | Size |
---|---|
436974 | 2431579468 (2.26GB) |
WikiMarkdown and WikiAttachment
The file handles for both the wiki markdown and the wiki attachments are tracked in the table V2_WIKI_ATTACHMENT_RESERVATION:
Code Block | ||
---|---|---|
| ||
WITH U_F AS (
SELECT DISTINCT(FILE_HANDLE_ID) FROM V2_WIKI_ATTACHMENT_RESERVATION
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID; |
Count | Size |
---|---|
1150410 | 22359954476 (20.8GB) |
De-duplicating for the size:
Code Block | ||
---|---|---|
| ||
WITH U_SIZE AS (
SELECT MAX(CONTENT_SIZE) AS CONTENT_SIZE FROM V2_WIKI_ATTACHMENT_RESERVATION W JOIN FILES_PROD F ON W.FILE_HANDLE_ID = F.ID GROUP BY `KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE; |
Count | Size |
---|---|
1112096 | 16977090003 (15.8GB) |
SubmissionAttachment
Evaluation submissions file handles are tracked in the table JDOSUBMISSION_FILE:
Code Block | ||
---|---|---|
| ||
WITH U_F AS (
SELECT DISTINCT(FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID; |
Count | Size |
---|---|
93198 | 13031793753688 (11.8TB) |
Note however that submissions are linked to entities, so we need to compute the count and size of submissions for which the file handle is NOT linked to an entity:
Code Block | ||
---|---|---|
| ||
WITH U_F AS (
SELECT DISTINCT(S.FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE S LEFT JOIN JDOREVISION R ON S.FILE_HANDLE_ID = R.FILE_HANDLE_ID WHERE R.FILE_HANDLE_ID IS NULL
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID; |
Count | Size |
---|---|
42979 | 613597243955 (0.55TB) |
Therefore, out of 93198 submissions with file entities, 42979 (46%) are not linked to entities and take 0.55TB of data (0.47% of the total). Note that we used the table with the file handles that are only in prod (i.e. the temporary FILES_PROD table created previously); the numbers are similar if we include all other storage locations (+/- 300GB).
Temporary File Handles
Synapse also uses file handles for what I refer to as temporary file handles: file handles that are most likely used only once. These file handles can be generated internally by the system or uploaded externally; for the latter we cannot control where the file handle ends up (e.g. in our bucket).
In particular we identified 4 places in the code base where we use file handles that might end up in our bucket (the main reference is the usage of the multipartUploadLocalFile method: https://github.com/Sage-Bionetworks/Synapse-Repository-Services/blob/e2b84bfbbd1d1450e955f360199f5448a1b620c3/services/repository-managers/src/main/java/org/sagebionetworks/repo/manager/file/MultipartManagerImpl.java#L222). For these cases we do not have an association, but they still make up a chunk of the data in prod:
Storage reports: CSV reports that contain a breakdown of the data consumed by entities in each project.
Query results: Table and View query results that are stored in CSV or TSV files. They are used in some cases as a cache.
Bulk downloads: Zipped packages for bulk downloads up to 2GB.
Table uploads: These are CSV files uploaded by the user in order to update tables or views.
All of the above are linked to asynchronous job requests, and the body of the request is stored as a blob in the database. Unfortunately the asynchronous job status is not a migratable table, therefore we lose the file handle links at each release cycle.
For some of these we can estimate the size since we know the file name pattern used by the backend when the file handles are created. For this task we used Athena on the exported file handle table in S3, since the results can be computed much faster:
Storage reports
They are initiated from a DownloadStorageReportRequest; a background job creates a file with the pattern Job-{jobId}.csv. Using Athena we ran the following query:
Code Block | ||
---|---|---|
| ||
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.csv$') |
Count | Size |
---|---|
2532705 | 2206719851653 (2TB) |
Note that these also include query results in CSV format, since they follow the same pattern.
Query Results
They are initiated from a DownloadFromTableRequest; a background job creates either a CSV or TSV file with the pattern Job-{jobId}.{csv|tsv}. Since the previous query already included the csv pattern, we ran the following query:
Code Block | ||
---|---|---|
| ||
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.tsv$') |
Count | Size |
---|---|
4085 | 85852089838 (80GB) |
Bulk Downloads
They are initiated from a BulkFileDownloadRequest; a background job creates a zipped package using the file pattern Job{jobId}.zip. This is only partially accurate, as the request can override the filename. We ran the following query with Athena:
Code Block |
---|
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$') |
Count | Size |
---|---|
1669369 | 20575241255094 (18.7TB) |
Since this is a lot of data, to make sure those are not file handles used in entities or tables we additionally ran the following query:
Code Block | ||
---|---|---|
| ||
WITH F AS (
SELECT DISTINCT F.ID FROM file_handles_with_d_id F WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')
)
SELECT COUNT(*) FROM F JOIN nodes_file_handles N ON F.ID = N.FILE_HANDLE_ID
UNION
SELECT COUNT(*) FROM F JOIN tables_file_handles T ON F.ID = T.ID |
That returned 0 results, meaning that with high probability these file handles are not linked anywhere else.
Table Uploads
They are initiated from an UploadToTableRequest; this request can even be embedded in a TableUpdateTransactionRequest as part of the change set. Unfortunately, at the moment we do not have a way to estimate their size or numbers; in particular we verified that:
The asynchronous job status table is not migratable, therefore we lose any information in the original request
The data warehouse does not store the request body for requests, and the id of the file handle is only available as part of the request body
We checked the web client implementation to see if there was a common file name pattern used when uploading the CSV, but the name of the original file is used instead
Multipart Uploads
In prod we do not clean up unfinished multipart uploads, the following script was used to fetch some statistics about those:
Code Block | ||
---|---|---|
| ||
package org.sagebionetworks.project;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

import org.sagebionetworks.aws.MultiFactorAuthenticationCredentialProvider;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public class MultipartUploadsScanner {

    private AmazonS3 s3Client;

    public static void main(String[] args) {
        String mfaDeviceArn = "mfaDeviceArn";
        AmazonS3 s3Client = buildS3Client(mfaDeviceArn);
        String bucket = "proddata.sagebase.org";
        new MultipartUploadsScanner(s3Client).scan(bucket);
    }

    private static AmazonS3 buildS3Client(String mfaDeviceArn) {
        AmazonS3ClientBuilder builder = AmazonS3ClientBuilder.standard();
        builder.withCredentials(new MultiFactorAuthenticationCredentialProvider(mfaDeviceArn));
        builder.withRegion(Regions.US_EAST_1);
        return builder.build();
    }

    public MultipartUploadsScanner(AmazonS3 s3Client) {
        this.s3Client = s3Client;
    }

    void scan(String bucket) {
        ListMultipartUploadsRequest request = new ListMultipartUploadsRequest(bucket);

        boolean hasNext = true;
        String keyMarker = null;
        String uploadIdMarker = null;
        int totalCount = 0;
        int totalOldCount = 0;

        Date oneMonthAgo = Date.from(Instant.now().minus(30, ChronoUnit.DAYS));

        while (hasNext) {
            request.setKeyMarker(keyMarker);
            request.setUploadIdMarker(uploadIdMarker);
            MultipartUploadListing result = s3Client.listMultipartUploads(request);
            int count = result.getMultipartUploads().size();
            long oldCount = result.getMultipartUploads().stream().filter(u -> u.getInitiated().before(oneMonthAgo)).count();
            totalCount += count;
            totalOldCount += oldCount;
            System.out.println(String.format("Batch Count: %s, Old Count: %s (Total: %s, Total Old: %s)", count, oldCount, totalCount, totalOldCount));
            if (result.isTruncated()) {
                keyMarker = result.getNextKeyMarker();
                uploadIdMarker = result.getNextUploadIdMarker();
                System.out.println(String.format("Has next: %s, %s", uploadIdMarker, keyMarker));
            } else {
                hasNext = false;
            }
        }

        System.out.println(String.format("Number of multipart uploads: %s (Old: %s)", totalCount, totalOldCount));
    }
} |
Not Completed (to date) | Not Completed (started more than 30 days ago) |
---|---|
1417823 | 1414593 |
So we have about 1.4M multipart uploads that are potentially taking up storage but were never completed; we could not compute the size of the uploads/parts. Since we upload a temporary object for each part, this might also make up the difference between the size reported in S3 and the size indexed in the file handles table.
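For reference, the parts already uploaded for a single incomplete upload can be listed (and their sizes summed) with the list parts API; the following is a minimal sketch for one upload, with placeholder key and upload id values taken from a ListMultipartUploads result:
Code Block language java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListPartsRequest;
import com.amazonaws.services.s3.model.PartListing;
import com.amazonaws.services.s3.model.PartSummary;

public class IncompleteUploadSizeExample {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Placeholders: in practice these come from a ListMultipartUploads result
        String bucket = "proddata.sagebase.org";
        String key = "example/key";
        String uploadId = "exampleUploadId";

        long totalSize = 0;
        ListPartsRequest request = new ListPartsRequest(bucket, key, uploadId);
        PartListing listing;

        do {
            listing = s3.listParts(request);
            // Each part uploaded so far is billed as storage until the upload is completed or aborted
            totalSize += listing.getParts().stream().mapToLong(PartSummary::getSize).sum();
            request.setPartNumberMarker(listing.getNextPartNumberMarker());
        } while (listing.isTruncated());

        System.out.println(String.format("Incomplete upload %s for key %s holds %s bytes in parts", uploadId, key, totalSize));
    }
}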
The indexed multipart uploads in the DB give us a very different picture:
Code Block | ||
---|---|---|
| ||
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING' |
Result: 6353
We can get a rough estimate of the amount of data that has been uploaded in prod but not yet completed:
Code Block | ||
---|---|---|
| ||
WITH UPLOADING AS (
SELECT U.ID, U.PART_SIZE, COUNT(*) AS PARTS FROM MULTIPART_UPLOAD U JOIN MULTIPART_UPLOAD_PART_STATE P ON U.ID = P.UPLOAD_ID
WHERE U.STATE = 'UPLOADING' AND U.BUCKET = 'proddata.sagebase.org'
GROUP BY U.ID
),
UPLOADING_SIZE AS (
SELECT (PART_SIZE * PARTS) AS SIZE FROM UPLOADING
)
SELECT COUNT(*), SUM(SIZE) FROM UPLOADING_SIZE |
Count | Size |
---|---|
3037 | 2649792499252 (2.4TB) |
So we have about 2.4TB of data that could be potentially freed just removing the unfinished multipart uploads.
Upon further analysis of the backend code we discovered a bug where a multipart upload is initiated when we create or update a wiki page using the first version of the wiki API, which submitted the markdown as a string; in such cases the multipart upload is never completed: https://sagebionetworks.jira.com/browse/PLFM-6523. Additionally, the multipart upload implementation that tracks the uploads in the database is relatively recent; the previous implementation might have left behind other unfinished multipart uploads.
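If we decide to clean up the old unfinished uploads, they can be aborted explicitly, which deletes the parts already uploaded and stops the related storage costs. The following is a minimal sketch (assuming the same AWS SDK v1 client used by the scanner above and a 30 day cutoff), not an implementation we currently run:
Code Block language java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public class AbortOldMultipartUploads {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "proddata.sagebase.org";
        Date cutoff = Date.from(Instant.now().minus(30, ChronoUnit.DAYS));

        ListMultipartUploadsRequest request = new ListMultipartUploadsRequest(bucket);
        MultipartUploadListing listing;

        do {
            listing = s3.listMultipartUploads(request);
            // Abort every upload started before the cutoff: S3 deletes its parts
            listing.getMultipartUploads().stream()
                .filter(u -> u.getInitiated().before(cutoff))
                .forEach(u -> s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, u.getKey(), u.getUploadId())));
            // Advance the pagination markers
            request.setKeyMarker(listing.getNextKeyMarker());
            request.setUploadIdMarker(listing.getNextUploadIdMarker());
        } while (listing.isTruncated());
    }
}
Alternatively, the same cleanup can be automated with an S3 lifecycle rule that aborts incomplete multipart uploads a configurable number of days after initiation.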
Initial Summary of Results
Working on a snapshot of prod 332 (11/05/2020) for the production bucket we have the following numbers for file handles:
| Count | Count in S3 (Unique Keys) | Size in DB | Size in S3 (Unique Keys) | Description |
---|---|---|---|---|---|
File Handles | 47,106,657 | 39,426,647 | ~679 TB | ~633 TB | File handles that point to the production bucket |
Linked File Entities | 4,749,667 | 4,659,729 | ~589.7 TB | ~560 TB | Entities that point to file handles in the production bucket |
Linked Table Rows | 13,711,427 | 12,004,739 | ~4.1 TB | ~3.4 TB | File handles referenced in tables that point to the production bucket |
Other Links | ~1,630,418 | ~1,592,049 | ~0.6 TB | ~0.6 TB | Other type of linked file handles that point to the production bucket |
Temporary Handles | ~4,206,159 | ~4,206,159 | ~20.7 TB | ~20.7 TB | File handles that are not linked, and mostly one time use |
Additionally we have the following figures for S3:
| Count | Size | Description |
---|---|---|---|
S3 Objects | ~41,625,517 | ~640 TB | The objects in S3 from the inventory |
No S3 Objects | 101 | ~10.4 GB | Objects that are referenced by file handles but do not exist in S3 |
No File Handle | 2,198,971 | ~7.5 TB | Objects that do not have any file handle |
In summary, out of the 47M file handles that point to the production bucket, we can account for about 24M (~50%). Out of 633TB of indexed data, we can account for about 585 TB (92%): roughly 560 TB from entities, 3.4 TB from tables, 0.6 TB from other links and 20.7 TB from temporary handles. The amount of data that can potentially be archived amounts to about 48 TB, referenced by around 23M file handles. Note that the temporary file handles can potentially be archived as well, removing an additional 20.7 TB from the bucket.
Unlinked and Hot Data
As of May 2021 we implemented the discovery of associations and the unlinked file handle detection. In order to decide how to proceed with a strategy to archive unlinked data (see https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1620508673/Synapse+S3+Storage+Maintenance#Un-linked-File-Handle-Archival ) we need to make an estimate of some of the data we collected. In the following we provide the steps taken to collect this data.
Unlinked Data
We wanted to know how much unlinked data does not have copies that are still linked; we proceeded as follows:
We executed the unlinked file handle detection on a migrated staging version (Stack 357)
A snapshot of the DB was created in AWS
Created a table to hold the unlinked file handles found for the proddata.sagebase.org bucket, including the count of keys that are still linked and the count that are unlinked:
Code Block language sql
CREATE TABLE `FILES_UNLINKED` (
  `ID` BIGINT(20) NOT NULL,
  `KEY` VARCHAR(700) NOT NULL COLLATE 'utf8mb4_0900_ai_ci',
  `CONTENT_SIZE` BIGINT(20) NOT NULL,
  `CREATED_ON` TIMESTAMP NOT NULL,
  `LINKED_COUNT` BIGINT(20) NOT NULL,
  `UNLINKED_COUNT` BIGINT(20) NOT NULL
)
Imported in the FILES_UNLINKED table all the file handles that are unlinked in proddata.sagebase.org:
Code Block language sql
INSERT INTO FILES_UNLINKED(ID, `KEY`, CONTENT_SIZE, CREATED_ON, LINKED_COUNT, UNLINKED_COUNT)
SELECT U.ID, U.KEY, MAX(U.CONTENT_SIZE) AS CONTENT_SIZE, MAX(U.CREATED_ON) AS CREATED_ON,
  SUM(IF(F.`STATUS` = 'AVAILABLE', 1, 0)) AS LINKED_COUNT,
  SUM(IF(F.`STATUS` = 'UNLINKED', 1, 0)) AS UNLINKED_COUNT
FROM FILES U JOIN FILES F
WHERE U.UPDATED_ON >= NOW() - INTERVAL 5 DAY
  AND U.BUCKET_NAME = 'proddata.sagebase.org'
  AND U.STATUS = 'UNLINKED'
  AND U.BUCKET_NAME = F.BUCKET_NAME
  AND U.KEY = F.`KEY`
GROUP BY U.ID, U.KEY
Computed the unlinked count and size:
Code Block language sql
SELECT COUNT(*), SUM(S) FROM (
  SELECT U.`KEY`, MAX(U.CONTENT_SIZE) AS S FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 GROUP BY U.`KEY`
) AS T
Computed the unlinked count and size for keys >= 128KB:
Code Block language sql
SELECT COUNT(*), SUM(S) FROM (
  SELECT U.`KEY`, MAX(U.CONTENT_SIZE) AS S FROM FILES_UNLINKED U WHERE U.LINKED_COUNT = 0 AND U.CONTENT_SIZE >= 131072 GROUP BY U.`KEY`
) AS T
Computed the monthly average count and size of unlinked data:
Code Block language sql
SELECT AVG(C), AVG(CONTENT_SIZE) FROM (
  SELECT YEAR(CREATED_ON) AS Y, MONTH(CREATED_ON) AS M, COUNT(*) C, SUM(CONTENT_SIZE) CONTENT_SIZE FROM (
    SELECT MAX(U.CREATED_ON) AS CREATED_ON, MAX(U.CONTENT_SIZE) AS CONTENT_SIZE FROM FILES_UNLINKED U
    WHERE U.LINKED_COUNT = 0 AND U.CONTENT_SIZE >= 131072 GROUP BY U.KEY
  ) AS T GROUP BY Y, M ORDER BY Y, M DESC
) AS W
These are the results:
Unlinked Data Count: 8,762,805
Unlinked Data Size: 100,456,586,079,288 (91.36 TiB)
Unlinked Data Count (>= 128 KB): 2,823,189
Unlinked Data Size (>= 128 KB): 100,408,552,794,768 (91.32 TiB)
Monthly Unlinked Count (>= 128KB): 28,808
Monthly Unlinked Size (>= 128KB): 1,024,577,069,334 (0.93 TiB)
Hot Data
Additionally we wanted to have a rough estimate of the amount of hot data in our bucket. Unfortunately we never enabled the bucket analytics, so we have to work with the data that we collect internally. In particular, we collect the downloads from entities and tables and store the records in S3 in parquet format; we can query this data with Athena, joining on the file handle data that we now export to S3, to get the count and size (we computed it for the years 2020 and 2021):
Code Block | ||
---|---|---|
| ||
WITH
F AS (SELECT (cast(id as varchar)) as id, MAX(contentsize) AS size FROM prod357filehandledatarecords R WHERE R.bucket = 'proddata.sagebase.org' GROUP BY id),
D AS (SELECT DISTINCT filehandleid FROM prod357filedownloadsrecords R WHERE R.year IN ('2020', '2021'))
SELECT COUNT(distinct D.filehandleid), SUM(F.size) FROM D JOIN F ON D.filehandleid = F.id |
Note that we only consider downloads in our bucket. Additionally we wanted to know how much of this is for files that are bigger than 128KB:
Code Block | ||
---|---|---|
| ||
WITH
F AS (SELECT (cast(id as varchar)) as id, MAX(contentsize) AS size FROM prod357filehandledatarecords R WHERE R.bucket = 'proddata.sagebase.org' GROUP BY id),
D AS (SELECT DISTINCT filehandleid FROM prod357filedownloadsrecords R WHERE R.year IN ('2020', '2021'))
SELECT COUNT(distinct D.filehandleid), SUM(F.size) FROM D JOIN F ON D.filehandleid = F.id AND F.size >= 131072 |
The results are as follows:
Hot Data Count*: 9,802,820
Hot Data Size*: 314,611,245,904,518 (286.1 TiB)
Hot Data (>= 128KB) Count: 4,555,052
Hot Data (>= 128KB) Size: 314,472,063,061,465 (286 TiB)