...

  • A snapshot of both the prod database and the tables database was taken on stack 332 on November 5th.

  • The S3 inventory was enabled on production and a snapshot was produced on November 5th (a configuration sketch follows below).
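
For reference, a daily Parquet inventory like the one above can be enabled with boto3 as sketched below. The bucket names, inventory id, prefix and optional fields are placeholders for illustration, not the actual production configuration.

Code Block
languagepy
import boto3

s3 = boto3.client("s3")

# Placeholder names: the real production bucket, destination bucket and
# inventory id are not shown here.
s3.put_bucket_inventory_configuration(
    Bucket="prod-data-bucket",
    Id="prod-inventory",
    InventoryConfiguration={
        "Id": "prod-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        # Parquet output so the inventory can be queried directly with Athena
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::inventory-destination-bucket",
                "Prefix": "inventory",
                "Format": "Parquet",
            }
        },
        # Fields included in each inventory record (illustrative selection)
        "OptionalFields": ["Size", "LastModifiedDate", "ETag"],
    },
)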

S3 Bucket

Data reported by the S3 metrics on November 5th (from the CloudWatch metrics for the bucket); note that these results might not be precise (a sketch of fetching the metrics follows the list):

  • Storage: 707451754672065 bytes (643TB)

  • Number of objects: 43,536,933
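
These figures come from the daily S3 storage metrics published to CloudWatch; a minimal sketch of retrieving them with boto3 is below (the bucket name is a placeholder).

Code Block
languagepy
import boto3
from datetime import datetime

cloudwatch = boto3.client("cloudwatch")

def bucket_metric(bucket, metric_name, storage_type):
    # S3 storage metrics are published once per day, so query with a 1-day period
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": storage_type},
        ],
        StartTime=datetime(2020, 11, 4),
        EndTime=datetime(2020, 11, 6),
        Period=86400,
        Statistics=["Average"],
    )
    return response["Datapoints"]

# Placeholder bucket name, not the actual production bucket
print(bucket_metric("prod-data-bucket", "BucketSizeBytes", "StandardStorage"))
print(bucket_metric("prod-data-bucket", "NumberOfObjects", "AllStorageTypes"))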

...

Count	Size
39426647	695938050033098 (633TB)

S3 Inventory vs File Handles

Since we have the S3 inventory in Parquet format, we decided to also export the snapshot of the production file handles table (created above) to S3 in Parquet format, so that we can analyze the data further. The file handles table has a considerable size, so we used an AWS Glue job to export it. There is a caveat: at the time of writing Glue does not fully support MySQL 8, and this is not well documented. The workaround is to customize the data source JDBC connection with a compatible MySQL driver (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc).

We uploaded the JDBC driver JAR for version 8.0.22 to the bucket used for analytics and referenced it in the connection (through the "customJdbcDriverS3Path" and "customJdbcDriverClassName" connection properties). The job was parallelized using a hash expression on the id with 7 worker nodes; note that we also filter out the duplicated file handles. The ETL script used to export the data to S3 is as follows:

Code Block
languagepy
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

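# Connection options for the restored MySQL 8 source: the custom JDBC driver JAR
# and class work around the missing built-in MySQL 8 support in Glue, while the
# hash expression both parallelizes the read on the ID and filters out the
# duplicated file handles (the D_ID = 1 condition).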
connection_mysql8 = {
    "url": "jdbc:mysql://restored-prod-332-db-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332",
    "dbtable": "FILES_PROD",
    "user": "prod332user",
    "password": "secret",
    "customJdbcDriverS3Path": "s3://analytics.sagebase.org/mysql/mysql-connector-java-8.0.22.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
    "hashexpression": "D_ID = 1 AND ID"
}

## @type: DataSource
## @args: [connection_type = "mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0")

## @type: ApplyMapping
## @args: [mapping = [("ID", "BIGINT", "id", "BIGINT"), ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"), ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"), ("KEY", "string", "key", "string"), ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"), ("CONTENT_MD5", "string", "CONTENT_MD5", "string"), ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"), ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT"), transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("ID", "BIGINT", "id", "BIGINT"), ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"), ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"), ("KEY", "string", "key", "string"), ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"), ("CONTENT_MD5", "string", "CONTENT_MD5", "string"), ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"), ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

The job on the table, which had around 46M rows, took around 20 minutes. We then created a crawler to discover the table from S3 (an equivalent boto3 setup is sketched after the configuration below):

Code Block
Name	prod332FileHandlesCrawler
Description	
Create a single schema for each S3 path	false
Security configuration	
Tags	
-
State	Ready
Schedule	
Last updated	Fri Nov 13 12:13:21 GMT-800 2020
Date created	Fri Nov 13 10:40:11 GMT-800 2020
Database	files_inventory
Service role	AWSGlueServiceRoleDefault
Selected classifiers	
Data store	S3
Include path	s3://analytics.sagebase.org/prod332/file_handles
Connection	
Exclude patterns	
Configuration options
Schema updates in the data store	Update the table definition in the data catalog.
Object deletion in the data store	Mark the table as deprecated in the data catalog.
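
For reference, an equivalent crawler could be created programmatically; below is a minimal boto3 sketch based on the configuration above (it is not how the crawler was actually created).

Code Block
languagepy
import boto3

glue = boto3.client("glue")

# Mirrors the console configuration above: S3 target, schema updates written to
# the data catalog and deleted objects marked as deprecated.
glue.create_crawler(
    Name="prod332FileHandlesCrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="files_inventory",
    Targets={
        "S3Targets": [
            {"Path": "s3://analytics.sagebase.org/prod332/file_handles"}
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
)

# Run the crawler on demand to discover the exported table
glue.start_crawler(Name="prod332FileHandlesCrawler")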

Once the table is added to the catalog we can run queries with Athena joining the S3 inventory and the file handles table. We computed the count and size of objects that are indexed in file handles but for which we do not have data in prod (a sketch for running these queries programmatically follows the results below):

Code Block
languagesql
SELECT COUNT(*) AS MISSING_COUNT, SUM(F.CONTENT_SIZE) AS MISSING_SIZE 
FROM file_handles F LEFT JOIN prod_inventory I ON F.KEY = I.KEY
WHERE I.KEY IS NULL

Count	Size
101	11252921200 (11TB)
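
For reference, these Athena queries can also be run programmatically with boto3; a minimal sketch is below. The query results output location is a placeholder, not the actual setup.

Code Block
languagepy
import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT COUNT(*) AS MISSING_COUNT, SUM(F.CONTENT_SIZE) AS MISSING_SIZE
FROM file_handles F LEFT JOIN prod_inventory I ON F.KEY = I.KEY
WHERE I.KEY IS NULL
"""

# Placeholder output location for the Athena query results
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "files_inventory"},
    ResultConfiguration={"OutputLocation": "s3://some-results-bucket/athena-results/"},
)
execution_id = execution["QueryExecutionId"]

# Poll until the query finishes, then read back the single result row
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows[1])  # rows[0] is the header row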

We can also run the inverse query, to find which data is in prod but not indexed (which should return a result similar to previous findings); note that in this case we filter on the date of the inventory snapshot:

Code Block
languagesql
SELECT COUNT(*) AS MISSING_COUNT, SUM(I.SIZE) AS MISSING_SIZE 
FROM prod_inventory I LEFT JOIN file_handles F ON I.KEY = F.KEY
WHERE I.dt = '2020-11-05-00-00' AND F.KEY IS NULL

Count	Size
2198971	8286060045314 (7.5TB)

Multipart Uploads

In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about those:

...

So we have about 1.4M multipart uploads that are potentially taking up storage but are not completed; we could not compute the size of the uploads/parts. Since we upload temporary objects for each part, this might also make up the difference between the size reported by S3 and the size indexed in the file handles table (a sketch of how the part sizes could be estimated follows below).
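
The script itself is elided above; as a rough sketch (not the script that was actually used), the unfinished uploads can be listed with boto3, and their part sizes could in principle be summed with an additional list_parts call per upload, which is what makes the size expensive to compute at this scale. The bucket name below is a placeholder.

Code Block
languagepy
import boto3

s3 = boto3.client("s3")
BUCKET = "prod-data-bucket"  # placeholder, not the actual production bucket

count = 0
total_parts_size = 0

# Iterate over all in-progress multipart uploads in the bucket
paginator = s3.get_paginator("list_multipart_uploads")
for page in paginator.paginate(Bucket=BUCKET):
    for upload in page.get("Uploads", []):
        count += 1
        # Summing the part sizes requires one list_parts call per upload
        # (only the first page of parts is read here), which is impractical
        # for ~1.4M unfinished uploads.
        parts = s3.list_parts(Bucket=BUCKET, Key=upload["Key"], UploadId=upload["UploadId"])
        total_parts_size += sum(part["Size"] for part in parts.get("Parts", []))

print(count, total_parts_size)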

The indexed multipart uploads in the DB give us a very different picture:

Code Block
languagesql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'

Result: 6353