...
A snapshot of both the prod database and the tables database was taken on stack 332 on November 5th.
The S3 inventory was enabled on the production bucket and a snapshot was produced on November 5th.
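For reference, enabling a daily inventory in Parquet format on a bucket can be done through the S3 API; the sketch below is an illustrative example using boto3 (the source bucket name, inventory id and destination prefix are placeholders, not the actual production configuration):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder names, for illustration only
SOURCE_BUCKET = "prod-bucket"
DESTINATION_BUCKET_ARN = "arn:aws:s3:::analytics.sagebase.org"

s3.put_bucket_inventory_configuration(
    Bucket=SOURCE_BUCKET,
    Id="prod-inventory",
    InventoryConfiguration={
        "Id": "prod-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        # A snapshot is produced daily
        "Schedule": {"Frequency": "Daily"},
        # Parquet output so that the snapshot can be queried directly with Athena/Glue
        "Destination": {
            "S3BucketDestination": {
                "Bucket": DESTINATION_BUCKET_ARN,
                "Format": "Parquet",
                "Prefix": "inventory",
            }
        },
        "OptionalFields": ["Size", "LastModifiedDate", "ETag"],
    },
)
```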
S3 Bucket
Data reported by S3 metrics on November 5th (from the CloudWatch metrics for the bucket); note that these results might not be precise:
Storage: 707451754672065 bytes (643TB)
Number of objects: 43,536,933
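These numbers can also be fetched programmatically from CloudWatch; below is an illustrative sketch using boto3 (the bucket name and the time range are placeholders):

```python
import boto3
from datetime import datetime

cloudwatch = boto3.client("cloudwatch")

# Placeholder bucket name, for illustration only
BUCKET = "prod-bucket"

def bucket_metric(metric_name, storage_type):
    # S3 publishes its daily storage metrics to the AWS/S3 namespace
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "BucketName", "Value": BUCKET},
            {"Name": "StorageType", "Value": storage_type},
        ],
        StartTime=datetime(2020, 11, 4),
        EndTime=datetime(2020, 11, 6),
        Period=86400,
        Statistics=["Average"],
    )
    datapoints = response["Datapoints"]
    return max(dp["Average"] for dp in datapoints) if datapoints else None

storage_bytes = bucket_metric("BucketSizeBytes", "StandardStorage")
object_count = bucket_metric("NumberOfObjects", "AllStorageTypes")
print(storage_bytes, object_count)
```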
...
| Count | Size |
| --- | --- |
| 39426647 | 695938050033098 (633TB) |
S3 Inventory vs File Handles
Since the S3 inventory is delivered in Parquet format, we decided to export the snapshot of the production file handles table created above to S3 in Parquet format as well, so that we can analyze the data further. The file handles table is considerably large, so we used an AWS Glue job to export it. There is a caveat: at the time of writing Glue does not fully support MySQL 8, and this is not well documented. The workaround is to customize the data source JDBC connection with a compatible MySQL driver (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc).
We uploaded the JDBC driver JAR for version 8.0.22 to the bucket used for analytics and referenced it in the connection (through the "customJdbcDriverS3Path" and "customJdbcDriverClassName" connection properties). The job was parallelized using a hash expression on the id with 7 worker nodes; note that we also filter out the duplicated file handles. The ETL script used to export the data to S3 is as follows:
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
connection_mysql8 = {
    "url": "jdbc:mysql://restored-prod-332-db-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332",
    "dbtable": "FILES_PROD",
    "user": "prod332user",
    "password": "secret",
    # Custom MySQL 8 compatible JDBC driver, see the workaround described above
    "customJdbcDriverS3Path": "s3://analytics.sagebase.org/mysql/mysql-connector-java-8.0.22.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
    # Hash expression used to parallelize the read on the ID, the D_ID = 1 predicate also filters out the duplicated file handles
    "hashexpression": "D_ID = 1 AND ID"
}
## @type: DataSource
## @args: [connection_type = "mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("ID", "BIGINT", "id", "BIGINT"), ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"), ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"), ("KEY", "string", "key", "string"), ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"), ("CONTENT_MD5", "string", "CONTENT_MD5", "string"), ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"), ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT"), transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("ID", "BIGINT", "id", "BIGINT"), ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"), ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"), ("KEY", "string", "key", "string"), ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"), ("CONTENT_MD5", "string", "CONTENT_MD5", "string"), ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"), ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()
```
The job on the table, which had around 46M rows, took around 20 minutes. We then created a crawler to discover the table from S3:
```
Name prod332FileHandlesCrawler
Description
Create a single schema for each S3 path false
Security configuration
Tags
-
State Ready
Schedule
Last updated Fri Nov 13 12:13:21 GMT-800 2020
Date created Fri Nov 13 10:40:11 GMT-800 2020
Database files_inventory
Service role AWSGlueServiceRoleDefault
Selected classifiers
Data store S3
Include path s3://analytics.sagebase.org/prod332/file_handles
Connection
Exclude patterns
Configuration options
Schema updates in the data store Update the table definition in the data catalog.
Object deletion in the data store Mark the table as deprecated in the data catalog.
```
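The same crawler could also be created programmatically; the following sketch mirrors the console configuration above using the Glue API via boto3:

```python
import boto3

glue = boto3.client("glue")

# Mirrors the crawler configuration reported above
glue.create_crawler(
    Name="prod332FileHandlesCrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="files_inventory",
    Targets={"S3Targets": [{"Path": "s3://analytics.sagebase.org/prod332/file_handles"}]},
    SchemaChangePolicy={
        # Schema updates in the data store: update the table definition in the data catalog
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        # Object deletion in the data store: mark the table as deprecated in the data catalog
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
)

# Run the crawler once to discover the exported table
glue.start_crawler(Name="prod332FileHandlesCrawler")
```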
Once the table is added to the catalog we can run queries with Athena that join the S3 inventory and the file handles table. We computed the count and size of objects that are referenced by file handles but for which we do not have data in prod:
```sql
SELECT COUNT(*) AS MISSING_COUNT, SUM(F.CONTENT_SIZE) AS MISSING_SIZE
FROM file_handles F LEFT JOIN prod_inventory I ON F.KEY = I.KEY
WHERE I.KEY IS NULL
```
| Count | Size |
| --- | --- |
| 101 | 11252921200 (11GB) |
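The same query can also be submitted programmatically through the Athena API; below is an illustrative sketch using boto3 (it assumes both tables live in the files_inventory database created by the crawler, and the query result output location is a placeholder):

```python
import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT COUNT(*) AS MISSING_COUNT, SUM(F.CONTENT_SIZE) AS MISSING_SIZE
FROM file_handles F LEFT JOIN prod_inventory I ON F.KEY = I.KEY
WHERE I.KEY IS NULL
"""

# The output location for the query results is a placeholder path
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "files_inventory"},
    ResultConfiguration={"OutputLocation": "s3://analytics.sagebase.org/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state, then fetch the single result row
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if status["State"] == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[1])  # The first row contains the column headers
```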
We can also run the inverse query, to find which data is in prod but not indexed (which should yield a result similar to previous findings); note that in this case we filter on the date of the inventory snapshot:
```sql
SELECT COUNT(*) AS MISSING_COUNT, SUM(I.SIZE) AS MISSING_SIZE
FROM prod_inventory I LEFT JOIN file_handles F ON I.KEY = F.KEY
WHERE I.dt = '2020-11-05-00-00' AND F.KEY IS NULL
```
| Count | Size |
| --- | --- |
| 2198971 | 8286060045314 (7.5TB) |
Multipart Uploads
In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about those:
...
So we have about 1.4M multipart uploads that are potentially taking up storage but are not completed; we could not compute the size of the uploads/parts. Since we upload temporary objects for each part, this might also account for the difference between the size reported by S3 and the size indexed in the file handles table.
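The actual script is omitted above; for illustration, the following minimal sketch shows how the in-progress uploads can be counted with boto3 (the bucket name is a placeholder, and it only counts the uploads, it does not compute the size of the parts):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name, for illustration only
BUCKET = "prod-bucket"

# list_multipart_uploads only returns uploads that were started but never completed or aborted
paginator = s3.get_paginator("list_multipart_uploads")

count = 0
for page in paginator.paginate(Bucket=BUCKET):
    count += len(page.get("Uploads", []))

print(f"In-progress multipart uploads: {count}")
```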
The indexed multipart uploads in the DB give us a very different picture:
```sql
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'
```
Result: 6353