...
Count | Size |
---|---|
39426647 | 695938050033098 (633TB) |
There are 7680010 * 2 file handles that are duplicated (pairs of file handles pointing to the same key); the size of those extra copies is around 46TB.
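A count like this can be computed directly from the file handle table; a minimal sketch, assuming a FILES table with `KEY` and CONTENT_SIZE columns (the actual table and column names may differ):

```sql
-- Keys referenced by more than one file handle, and the size taken by the extra copies
SELECT COUNT(*) AS DUPLICATED_KEYS, SUM((HANDLE_COUNT - 1) * CONTENT_SIZE) AS COPIES_SIZE
FROM (
    SELECT `KEY`, COUNT(*) AS HANDLE_COUNT, MAX(CONTENT_SIZE) AS CONTENT_SIZE
    FROM FILES
    GROUP BY `KEY`
    HAVING COUNT(*) > 1
) D
```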
S3 Inventory vs File Handles
...
So out of the 39426647 distinct file handles in prod that take up 633TB of data, 4659729 (~12%) file handles are used by entities and account for 559.5TB (~88%) of the data. The remaining 34766918 (~88%) file handles, taking 73TB (~12%) of the data, are potentially linked in tables or other objects, or are unlinked file handles.
...
Note that we used a flag to indicate whether a file handle is “reused”: since we have the object replication data, we can verify which of the file handles linked in tables are also used by entities with a query such as the one below (we first created an index on FILE_ID in the object replication table and on ID in FILES_TABLE_DISTINCT):
...
```sql
UPDATE FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID SET REUSED = TRUE
```
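For reference, the indexes mentioned above could be created with statements along these lines (the index names are made up):

```sql
CREATE INDEX IDX_REPLICATION_FILE_ID ON OBJECT_REPLICATION (FILE_ID);
CREATE INDEX IDX_FILES_DISTINCT_ID ON FILES_TABLE_DISTINCT (ID);
```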
We created a Glue job similar to the above to export this table to S3, so that we can join it with the S3 data and compute the size claimed by tables (TODO).
The ETL script is as follows:
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

connection_mysql8 = {
    "url": "jdbc:mysql://restored-prod-332-table-0-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332",
    "dbtable": "FILES_TABLE_DISTINCT",
    "user": "prod332user",
    "password": "secret",
    "customJdbcDriverS3Path": "s3://analytics.sagebase.org/mysql/mysql-connector-java-8.0.22.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
    "hashexpression": "ID"
}

## @type: DataSource
## @args: [connection_type = "mysql", connection_options = connection_mysql8, transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8, transformation_ctx="datasource0")

## @type: ApplyMapping
## @args: [mapping = [("ID", "BIGINT", "id", "BIGINT"), ("REUSED", "TINYINT", "reused", "TINYINT")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame=datasource0, mappings=[("ID", "BIGINT", "id", "BIGINT"), ("REUSED", "TINYINT", "reused", "TINYINT")], transformation_ctx="applymapping1")

## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame=applymapping1, specs=[("reused", "cast:boolean")], transformation_ctx="resolvechoice2")

## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/tables_file_handles"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = resolvechoice2]
datasink4 = glueContext.write_dynamic_frame.from_options(frame=resolvechoice2, connection_type="s3", connection_options={"path": "s3://analytics.sagebase.org/prod332/tables_file_handles"}, format="parquet", transformation_ctx="datasink4")

job.commit()
```
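The parquet output still needs to be registered in the Glue catalog before Athena can query it (e.g. with a crawler); a manual DDL along these lines should be equivalent (a sketch, using the columns from the mapping above):

```sql
CREATE EXTERNAL TABLE tables_file_handles (
    id BIGINT,
    reused BOOLEAN
)
STORED AS PARQUET
LOCATION 's3://analytics.sagebase.org/prod332/tables_file_handles/'
```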
With this table catalogued in Glue, we can now join it with the data in S3 using Athena (note: for this we used the dump of the file handle table computed above, but including the file handles with duplicate keys, since we are joining on ids). We compute the sum of the sizes of the files that are linked in tables:
```sql
SELECT COUNT(*), SUM(F.CONTENT_SIZE) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = FALSE
```
Note that we filter on reused = FALSE, since the reused file handles are already counted by file entities.
Count | Size |
---|---|
13711427 | 4545704827408 (4.1TB) |
The space taken by the file handles that are linked both in tables and in entities:
```sql
SELECT COUNT(*), SUM(F.CONTENT_SIZE) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE
```
Count | Size |
---|---|
49166 | 66005128040194 (60TB) |
These are interesting numbers: it looks like the table links only claim about 4.1TB of data. Let us count how many of the file handles referenced in tables actually exist among the file handles stored in prod:
```sql
SELECT COUNT(*) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID
```
Count |
---|
13760593 |
Out of 27624848 unique file handles referenced in Synapse tables, about 50% are in the prod bucket (the rest might be file handles that are stored elsewhere).
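The 27624848 total comes from the tables export itself; assuming one row per distinct file handle id, it is simply the row count, and the 50% figure is the ratio of the join count above to this total:

```sql
-- 13760593 / 27624848 ≈ 50%
SELECT COUNT(*) FROM tables_file_handles
```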
Multipart Uploads
In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about them:
```java
package org.sagebionetworks.project;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

import org.sagebionetworks.aws.MultiFactorAuthenticationCredentialProvider;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public class MultipartUploadsScanner {

    private AmazonS3 s3Client;

    public static void main(String[] args) {
        String mfaDeviceArn = "mfaDeviceArn";
        AmazonS3 s3Client = buildS3Client(mfaDeviceArn);
        String bucket = "proddata.sagebase.org";
        new MultipartUploadsScanner(s3Client).scan(bucket);
    }

    private static AmazonS3 buildS3Client(String mfaDeviceArn) {
        AmazonS3ClientBuilder builder = AmazonS3ClientBuilder.standard();
        builder.withCredentials(new MultiFactorAuthenticationCredentialProvider(mfaDeviceArn));
        builder.withRegion(Regions.US_EAST_1);
        return builder.build();
    }

    public MultipartUploadsScanner(AmazonS3 s3Client) {
        this.s3Client = s3Client;
    }

    void scan(String bucket) {
        ListMultipartUploadsRequest request = new ListMultipartUploadsRequest(bucket);

        boolean hasNext = true;
        String keyMarker = null;
        String uploadIdMarker = null;
        int totalCount = 0;
        int totalOldCount = 0;

        // Uploads initiated more than 30 days ago are counted as "old"
        Date oneMonthAgo = Date.from(Instant.now().minus(30, ChronoUnit.DAYS));

        while (hasNext) {
            request.setKeyMarker(keyMarker);
            request.setUploadIdMarker(uploadIdMarker);

            MultipartUploadListing result = s3Client.listMultipartUploads(request);

            int count = result.getMultipartUploads().size();
            long oldCount = result.getMultipartUploads().stream().filter(u -> u.getInitiated().before(oneMonthAgo)).count();

            totalCount += count;
            totalOldCount += oldCount;

            System.out.println(String.format("Batch Count: %s, Old Count: %s (Total: %s, Total Old: %s)", count, oldCount, totalCount, totalOldCount));

            // The listing is paginated: carry the markers over to the next request
            if (result.isTruncated()) {
                keyMarker = result.getNextKeyMarker();
                uploadIdMarker = result.getNextUploadIdMarker();
                System.out.println(String.format("Has next: %s, %s", uploadIdMarker, keyMarker));
            } else {
                hasNext = false;
            }
        }

        System.out.println(String.format("Number of multipart uploads: %s (Old: %s)", totalCount, totalOldCount));
    }
}
```
...
```sql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'
```
Result: 6353
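The S3 scan above also breaks down uploads initiated more than 30 days ago; a similar age breakdown could be computed on the database side, assuming the MULTIPART_UPLOAD table records when an upload was started (the STARTED_ON column name is an assumption):

```sql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U
WHERE U.STATE = 'UPLOADING' AND U.STARTED_ON < NOW() - INTERVAL 30 DAY
```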