...

  • A snapshot of both the prod database and the tables database was taken on stack 332 on November 5th.

  • The S3 inventory was enabled on production and a snapshot was produced on November 5th (a configuration sketch follows below).
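
For reference, a daily Parquet inventory like the one above can be enabled with boto3 as sketched below. The bucket names, inventory id, prefix and optional fields are placeholders for illustration, not the actual production configuration.

Code Block
languagepy
import boto3

s3 = boto3.client("s3")

# Placeholder names: the real production bucket, destination bucket and
# inventory id are not shown here.
s3.put_bucket_inventory_configuration(
    Bucket="prod-data-bucket",
    Id="prod-inventory",
    InventoryConfiguration={
        "Id": "prod-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        # Parquet output so the inventory can be queried directly with Athena
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::inventory-destination-bucket",
                "Prefix": "inventory",
                "Format": "Parquet",
            }
        },
        # Fields included in each inventory record (illustrative selection)
        "OptionalFields": ["Size", "LastModifiedDate", "ETag"],
    },
)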

S3 Bucket

Data reported by the S3 metrics on November 5th (from the CloudWatch metrics for the bucket); note that these results might not be precise (a sketch of fetching the metrics follows the list):

  • Storage: 707451754672065 bytes (643TB)

  • Number of objects: 43,536,933
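
These figures come from the daily S3 storage metrics published to CloudWatch; a minimal sketch of retrieving them with boto3 is below (the bucket name is a placeholder).

Code Block
languagepy
import boto3
from datetime import datetime

cloudwatch = boto3.client("cloudwatch")

def bucket_metric(bucket, metric_name, storage_type):
    # S3 storage metrics are published once per day, so query with a 1-day period
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": storage_type},
        ],
        StartTime=datetime(2020, 11, 4),
        EndTime=datetime(2020, 11, 6),
        Period=86400,
        Statistics=["Average"],
    )
    return response["Datapoints"]

# Placeholder bucket name, not the actual production bucket
print(bucket_metric("prod-data-bucket", "BucketSizeBytes", "StandardStorage"))
print(bucket_metric("prod-data-bucket", "NumberOfObjects", "AllStorageTypes"))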

...

Count	Size
39426647	695938050033098 (633TB)

S3 Inventory vs File Handles

Since we have the S3 inventory in Parquet format, we decided to also export the snapshot of the production file handles table (created above) to S3 in Parquet format, so that we can analyze the data further. The file handles table has a considerable size, so we used an AWS Glue job to export it. There is a caveat: at the time of writing Glue does not fully support MySQL 8, and this is not well documented. The workaround is to customize the data source JDBC connection with a compatible MySQL driver (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc).

We uploaded the JDBC driver JAR for version 8.0.22 to the bucket used for analytics and referenced it in the connection (through the "customJdbcDriverS3Path" and "customJdbcDriverClassName" connection properties). The job was parallelized using a hash expression on the id with 7 worker nodes; note that we also filter out the duplicated file handles. The ETL script used to export the data to S3 is as follows:

Code Block
languagepy
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

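# Connection options for the restored MySQL 8 source: the custom JDBC driver JAR
# and class work around the missing built-in MySQL 8 support in Glue, while the
# hash expression both parallelizes the read on the ID and filters out the
# duplicated file handles (the D_ID = 1 condition).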
connection_mysql8 = {
    "url": "jdbc:mysql://restored-prod-332-db-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332",
    "dbtable": "FILES_PROD",
    "user": "prod332user",
    "password": "secret",
    "customJdbcDriverS3Path": "s3://analytics.sagebase.org/mysql/mysql-connector-java-8.0.22.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
    "hashexpression": "D_ID = 1 AND ID"
}

## @type: DataSource
## @args: [connection_type = "mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0")

## @type: ApplyMapping
## @args: [mapping = [("ID", "BIGINT", "id", "BIGINT"), ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"), ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"), ("KEY", "string", "key", "string"), ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"), ("CONTENT_MD5", "string", "CONTENT_MD5", "string"), ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"), ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT"), transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("ID", "BIGINT", "id", "BIGINT"), ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"), ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"), ("KEY", "string", "key", "string"), ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"), ("CONTENT_MD5", "string", "CONTENT_MD5", "string"), ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"), ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

The job on the table, which had around 46M rows, took around 20 minutes. We then created a crawler to discover the table from S3 (an equivalent boto3 setup is sketched after the configuration below):

Code Block
Name	prod332FileHandlesCrawler
Description	
Create a single schema for each S3 path	false
Security configuration	
Tags	
-
State	Ready
Schedule	
Last updated	Fri Nov 13 12:13:21 GMT-800 2020
Date created	Fri Nov 13 10:40:11 GMT-800 2020
Database	files_inventory
Service role	AWSGlueServiceRoleDefault
Selected classifiers	
Data store	S3
Include path	s3://analytics.sagebase.org/prod332/file_handles
Connection	
Exclude patterns	
Configuration options
Schema updates in the data store	Update the table definition in the data catalog.
Object deletion in the data store	Mark the table as deprecated in the data catalog.
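
For reference, an equivalent crawler could be created programmatically; below is a minimal boto3 sketch based on the configuration above (it is not how the crawler was actually created).

Code Block
languagepy
import boto3

glue = boto3.client("glue")

# Mirrors the console configuration above: S3 target, schema updates written to
# the data catalog and deleted objects marked as deprecated.
glue.create_crawler(
    Name="prod332FileHandlesCrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="files_inventory",
    Targets={
        "S3Targets": [
            {"Path": "s3://analytics.sagebase.org/prod332/file_handles"}
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
)

# Run the crawler on demand to discover the exported table
glue.start_crawler(Name="prod332FileHandlesCrawler")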

Once the table is added to the catalog we can run queries with Athena joining the S3 inventory and the file handles table. We computed the count and size of objects that are indexed in file handles but for which we do not have data in prod (a sketch for running these queries programmatically follows the results below):

Code Block
languagesql
SELECT COUNT(*) AS MISSING_COUNT, SUM(F.CONTENT_SIZE) AS MISSING_SIZE 
FROM file_handles F LEFT JOIN prod_inventory I ON F.KEY = I.KEY
WHERE I.KEY IS NULL

Count	Size
101	11252921200 (11TB)
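
For reference, these Athena queries can also be run programmatically with boto3; a minimal sketch is below. The query results output location is a placeholder, not the actual setup.

Code Block
languagepy
import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT COUNT(*) AS MISSING_COUNT, SUM(F.CONTENT_SIZE) AS MISSING_SIZE
FROM file_handles F LEFT JOIN prod_inventory I ON F.KEY = I.KEY
WHERE I.KEY IS NULL
"""

# Placeholder output location for the Athena query results
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "files_inventory"},
    ResultConfiguration={"OutputLocation": "s3://some-results-bucket/athena-results/"},
)
execution_id = execution["QueryExecutionId"]

# Poll until the query finishes, then read back the single result row
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows[1])  # rows[0] is the header row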

We can also run the inverse query, to find which data is in prod but not indexed (which should return a result similar to previous findings); note that in this case we filter on the date of the inventory snapshot:

Code Block
languagesql
SELECT COUNT(*) AS MISSING_COUNT, SUM(I.SIZE) AS MISSING_SIZE 
FROM prod_inventory I LEFT JOIN file_handles F ON I.KEY = F.KEY
WHERE I.dt = '2020-11-05-00-00' AND F.KEY IS NULL

Count	Size
2198971	8286060045314 (7.5TB)

Multipart Uploads

In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about those:

...

So we have about 1.4M multipart uploads that are potentially taking up storage but are not completed; we could not compute the size of the uploads/parts. Since we upload temporary objects for each part, this might also make up the difference between the size reported by S3 and the size indexed in the file handles table (a sketch of how the part sizes could be estimated follows below).
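
The script itself is elided above; as a rough sketch (not the script that was actually used), the unfinished uploads can be listed with boto3, and their part sizes could in principle be summed with an additional list_parts call per upload, which is what makes the size expensive to compute at this scale. The bucket name below is a placeholder.

Code Block
languagepy
import boto3

s3 = boto3.client("s3")
BUCKET = "prod-data-bucket"  # placeholder, not the actual production bucket

count = 0
total_parts_size = 0

# Iterate over all in-progress multipart uploads in the bucket
paginator = s3.get_paginator("list_multipart_uploads")
for page in paginator.paginate(Bucket=BUCKET):
    for upload in page.get("Uploads", []):
        count += 1
        # Summing the part sizes requires one list_parts call per upload
        # (only the first page of parts is read here), which is impractical
        # for ~1.4M unfinished uploads.
        parts = s3.list_parts(Bucket=BUCKET, Key=upload["Key"], UploadId=upload["UploadId"])
        total_parts_size += sum(part["Size"] for part in parts.get("Parts", []))

print(count, total_parts_size)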

The indexed multipart uploads in the DB give us a very different picture:

Code Block
languagesql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'

Result: 6353