...

Count: 39426647
Size: 695938050033098 (633TB)

There are 7680010 * 2 file handles that are duplicated (pointing to the same key); the size of those copies is around 46TB.
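
As a reference, the duplicated handles can be surfaced with a query along these lines; the table and column names used here (a file handle dump with object_key and content_size columns) are assumptions, not the actual schema:

Code Block
languagesql
-- Sketch only: handles that share a key with at least one other handle, and the size of the redundant copies
SELECT SUM(handle_count) AS duplicated_handles,
       SUM((handle_count - 1) * content_size) AS redundant_bytes
FROM (
    SELECT object_key, MAX(content_size) AS content_size, COUNT(*) AS handle_count
    FROM file_handles
    GROUP BY object_key
    HAVING COUNT(*) > 1
) t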

S3 Inventory vs File Handles

...

So out of 39426647 distinct file handles in prod that take 633TB of data, 4659729 (~12%) file handles are used by file entities and account for 559.5TB (~88%) of the data. The remaining 34766918 (88%) file handles, taking 73TB (~12%) of the data, are potentially linked in tables or other objects, or are unlinked file handles.

...

Note that we used a flag to indicate whether a file handle is “reused”: since we have the object replication data, we can verify whether any of the file handles linked in tables are also used by entities with a query such as the following (we first created an index on FILE_ID in OBJECT_REPLICATION and on ID in FILES_TABLE_DISTINCT):

...

Code Block
languagesql
UPDATE FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID SET REUSED = TRUE
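
The indexes mentioned above can be created with statements along these lines (the index names are illustrative):

Code Block
languagesql
CREATE INDEX OBJECT_REPLICATION_FILE_ID_IDX ON OBJECT_REPLICATION (FILE_ID);
CREATE INDEX FILES_TABLE_DISTINCT_ID_IDX ON FILES_TABLE_DISTINCT (ID);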

We created a Glue job similar to the one above to export this table to S3 so that we can join it with the S3 inventory data and compute the size claimed by tables.

The ETL script is as follows:

Code Block
languagepy

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

connection_mysql8 = {
    "url": "jdbc:mysql://restored-prod-332-table-0-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332",
    "dbtable": "FILES_TABLE_DISTINCT",
    "user": "prod332user",
    "password": "secret",
    "customJdbcDriverS3Path": "s3://analytics.sagebase.org/mysql/mysql-connector-java-8.0.22.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
    "hashexpression": "ID"
}

## @type: DataSource
## @args: [connection_type = "mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0")

## @type: ApplyMapping
## @args: [mapping = [("ID", "BIGINT", "id", "BIGINT"), ("REUSED", "TINYINT ", "reused", "TINYINT")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("ID", "BIGINT", "id", "BIGINT"), ("REUSED", "TINYINT", "reused", "TINYINT")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, specs = [("reused", "cast:boolean")], transformation_ctx = "resolvechoice2")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/tables_file_handles"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = resolvechoice2]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = resolvechoice2, connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/tables_file_handles"}, format = "parquet", transformation_ctx = "datasink4")

job.commit()
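
With the export complete, the output needs to be catalogued in Glue (for example via a crawler pointed at the output path). A manual registration in Athena would look roughly like the following sketch, where the column names and types follow the mapping in the job above and the table name is assumed:

Code Block
languagesql
-- Sketch only: register the Parquet export as an external table
CREATE EXTERNAL TABLE tables_file_handles (
  id BIGINT,
  reused BOOLEAN
)
STORED AS PARQUET
LOCATION 's3://analytics.sagebase.org/prod332/tables_file_handles/'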

With this table catalogued in Glue, we can now join it with the S3 data using Athena and compute the sum of the sizes of the file handles that are linked in tables (note: for this we used the dump of the file handle table computed above, but including the file handles with duplicate keys, since we are joining on ids):

Code Block
languagesql
SELECT COUNT(*), SUM(F.CONTENT_SIZE) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = FALSE

Note that we filter by reused = FALSE since the reused file handles are already counted by the file entities.

Count: 13711427
Size: 4545704827408 (4.1TB)

The space taken by the file handles that are linked both in tables and in entities:

Code Block
languagesql
SELECT COUNT(*), SUM(F.CONTENT_SIZE) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE

Count: 49166
Size: 66005128040194 (60TB)

These are interesting numbers: it looks like the table links only claim about 4.1TB of data. Let us count how many of the file handles referenced in tables actually exist among the file handles stored in prod:

Code Block
languagesql
SELECT COUNT(*) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID

Count: 13760593

Out of 27624848 unique file handles in Synapse tables, about 50% are in the prod bucket (the rest might be file handles that are stored elsewhere).
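
For reference, the 27624848 total corresponds to a distinct count over the file handle ids referenced in tables, e.g. with a query along these lines (assuming the count was taken from the exported tables_file_handles table):

Code Block
languagesql
SELECT COUNT(DISTINCT id) FROM tables_file_handles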

Multipart Uploads

In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about them:

Code Block
languagejava
package org.sagebionetworks.project;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

import org.sagebionetworks.aws.MultiFactorAuthenticationCredentialProvider;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public class MultipartUploadsScanner {
	
	private AmazonS3 s3Client;
	
	public static void main(String[] args) {

		String mfaDeviceArn = "mfaDeviceArn";
		
		AmazonS3 s3Client = buildS3Client(mfaDeviceArn);
		
		String bucket = "proddata.sagebase.org";
		
		new MultipartUploadsScanner(s3Client).scan(bucket);
	}
		
	private static AmazonS3 buildS3Client(String mfaDeviceArn) {	
		AmazonS3ClientBuilder builder = AmazonS3ClientBuilder.standard();
		builder.withCredentials(new MultiFactorAuthenticationCredentialProvider(mfaDeviceArn));
		builder.withRegion(Regions.US_EAST_1);
		return builder.build();
	}

	
	public MultipartUploadsScanner(AmazonS3 s3Client) {
		this.s3Client = s3Client;
	}
	
	// Pages through all in-progress multipart uploads in the bucket and counts how many were initiated more than 30 days ago
	void scan(String bucket) {
		
		ListMultipartUploadsRequest request = new ListMultipartUploadsRequest(bucket);
		
		boolean hasNext = true;
		String keyMarker = null;
		String uploadIdMarker = null;
		
		int totalCount = 0;
		int totalOldCount = 0;
		
		// Uploads initiated more than 30 days ago are considered old (likely abandoned)
		Date oneMonthAgo = Date.from(Instant.now().minus(30, ChronoUnit.DAYS));
		
		while (hasNext) {
			
			request.setKeyMarker(keyMarker);
			request.setUploadIdMarker(uploadIdMarker);
			
			MultipartUploadListing result = s3Client.listMultipartUploads(request);
			
			int count = result.getMultipartUploads().size();
			long oldCount = result.getMultipartUploads().stream().filter( u-> u.getInitiated().before(oneMonthAgo)).count();
			
			totalCount += count;
			totalOldCount += oldCount;
			
			System.out.println(String.format("Batch Count: %s, Old Count: %s (Total: %s, Total Old: %s)", count, oldCount, totalCount, totalOldCount));
			
			if (result.isTruncated()) {
				keyMarker = result.getNextKeyMarker();
				uploadIdMarker = result.getNextUploadIdMarker();
				System.out.println(String.format("Has next: %s, %s", uploadIdMarker, keyMarker));
			} else {
				hasNext = false;
			}
		}
		
		System.out.println(String.format("Number of multipart uploads: %s (Old: %s)", totalCount, totalOldCount));
	}

}

...

Code Block
languagesql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'

Result: 6353

Summary of Results