This document collects the various steps taken to gauge the amount of data in the S3 proddata.sagebase.org bucket and to compare it with the data indexed in the Synapse database.
A snapshot of both the prod database and the tables database was taken on stack 332 on November 5th.
The S3 inventory was enabled on production and a snapshot was produced on November 5th.
S3 Bucket
Data reported by the S3 CloudWatch metrics for the bucket on November 5th (note that these metrics might not be precise):
Storage: 707451754672065 bytes (643TB)
Number of objects: 43,536,933
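The figures above were read from the CloudWatch console. For reference, a minimal boto3 sketch that fetches the same two metrics; the dimension values are the standard ones S3 uses for its daily storage metrics, and the date range is an assumption:

```python
import boto3
from datetime import datetime

cw = boto3.client("cloudwatch", region_name="us-east-1")

def bucket_metric(metric_name, storage_type):
    # S3 publishes daily storage metrics under the AWS/S3 namespace
    response = cw.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "BucketName", "Value": "proddata.sagebase.org"},
            {"Name": "StorageType", "Value": storage_type},
        ],
        StartTime=datetime(2020, 11, 4),
        EndTime=datetime(2020, 11, 6),
        Period=86400,
        Statistics=["Average"],
    )
    return response["Datapoints"]

print(bucket_metric("BucketSizeBytes", "StandardStorage"))
print(bucket_metric("NumberOfObjects", "AllStorageTypes"))
```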
S3 Inventory
The inventory was enabled (See https://sagebionetworks.jira.com/browse/PLFM-6426 ) on the bucket.
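For reference, a boto3 sketch of an inventory configuration matching the destination used below; the inventory id and destination prefix are taken from the table LOCATION further down, while the daily schedule is an assumption (the actual setup is described in PLFM-6426):

```python
import boto3

s3 = boto3.client("s3")

# Enable a Parquet inventory for the prod bucket, delivered to the
# prod.inventory.sagebase.org destination bucket.
s3.put_bucket_inventory_configuration(
    Bucket="proddata.sagebase.org",
    Id="defaultInventory",
    InventoryConfiguration={
        "Id": "defaultInventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "All",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::prod.inventory.sagebase.org",
                "Prefix": "inventory",
                "Format": "Parquet",
            }
        },
        "OptionalFields": [
            "Size",
            "LastModifiedDate",
            "ETag",
            "IsMultipartUploaded",
        ],
    },
)
```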
We created a database named files_inventory in the Glue catalog. In Athena we created an external table for the S3 inventory:
```sql
CREATE EXTERNAL TABLE prod_inventory(
  `bucket` string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  is_multipart_uploaded boolean
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://prod.inventory.sagebase.org/inventory/proddata.sagebase.org/defaultInventory/hive';
```
We ran the repair on the table to load the partitions:
MSCK REPAIR TABLE prod_inventory
We checked how much data and how many objects are reported (dt is the date when the inventory was generated):
SELECT dt, COUNT(*), SUM(size) FROM prod_inventory GROUP BY dt
Results:
dt | count | size |
---|---|---|
2020-11-05-00-00 | 41625517 | 704212857157112 (640.4TB) |
2020-11-08-00-00 | 41654050 | 708573173690177 (644.4TB) |
The inventory also reports whether a file was uploaded as multipart; this tells us how many objects were uploaded outside of the standard Synapse upload API:
SELECT dt, COUNT(*), SUM(size) FROM prod_inventory WHERE is_multipart_uploaded = true GROUP BY dt
dt | count | size |
---|---|---|
2020-11-05-00-00 | 5705476 | 527196059092709 (479.4TB) |
2020-11-08-00-00 | 5712315 | 531513092323401 (483.4TB) |
This result is surprising: only 5.7M objects appear to be multipart uploads, while we expected an order of magnitude more. What is going on?
On further analysis we checked a few of those files and could see that they were in fact normal multipart uploads in the DB, with the corresponding file handles. The likely reason for this inconsistency is that we encrypted the S3 bucket back in 2019, most probably using a PUT copy of each object onto itself. This belief is reinforced by the fact that the modified dates on those objects are consistent with the timeline of the encryption, while the original upload dates in Synapse are earlier. If the Python API was used for the encryption, all the objects smaller than a certain size were most likely “copied” over without multipart.
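To make the hypothesis concrete, a sketch of what such an encryption pass may have looked like: an in-place copy_object re-writes the object with a single PUT (for objects up to 5 GB), which clears the is_multipart_uploaded flag in later inventories and updates the last modified date. This is illustrative only and not the script that was actually run in 2019; the SSE mode shown is an assumption:

```python
import boto3

s3 = boto3.client("s3")
bucket = "proddata.sagebase.org"

def encrypt_in_place(key):
    # An in-place copy with SSE enabled rewrites the object as a single PUT,
    # so the inventory no longer reports it as a multipart upload and its
    # last_modified_date moves to the time of the copy.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        ServerSideEncryption="AES256",
        MetadataDirective="COPY",
    )
```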
Synapse File Handles
We created a table that contains only the file handles pointing to the proddata.sagebase.org bucket, with an added de-duplication identifier:
CREATE TABLE FILES_PROD AS SELECT *, ROW_NUMBER() OVER(PARTITION BY BUCKET_NAME, `KEY` ORDER BY ID) AS D_ID FROM FILES WHERE BUCKET_NAME = 'proddata.sagebase.org'
Count and size of the file handles that are indexed for the prod bucket:
SELECT COUNT(*) AS TOTAL_COUNT, SUM(CONTENT_SIZE) AS TOTAL_SIZE FROM FILES_PROD
Count | Size |
---|---|
47106657 | 746862657773610 (679TB) |
Excluding duplicated file handles pointing to the same key:
SELECT COUNT(*) AS TOTAL_COUNT, SUM(CONTENT_SIZE) AS TOTAL_SIZE FROM FILES_PROD WHERE D_ID = 1
Count | Size |
---|---|
39426647 | 695938050033098 (633TB) |
S3 Inventory vs File Handles
Since we have the S3 inventory in parquet format, we decided to also export the snapshot of the production file handles table created above to S3 in parquet format, so that we can analyze the data further. The file handles table is of considerable size, so we used an AWS Glue job to export it. There are some caveats: at the time of writing Glue does not fully support MySQL 8 and this is not well documented. The workaround is to customize the data source JDBC connection using a compatible MySQL driver (See https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc).
We uploaded the JDBC driver JAR for version 8.0.22 to the bucket used for analytics and referenced it in the connection (through the "customJdbcDriverS3Path" and "customJdbcDriverClassName" properties of the connection). The job was parallelized using a hash expression on the id with 7 worker nodes; note that the expression also filters out the duplicated file handles. The ETL script used for exporting the data to S3 is as follows:
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

connection_mysql8 = {
    "url": "jdbc:mysql://restored-prod-332-db-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332",
    "dbtable": "FILES_PROD",
    "user": "prod332user",
    "password": "secret",
    "customJdbcDriverS3Path": "s3://analytics.sagebase.org/mysql/mysql-connector-java-8.0.22.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
    "hashexpression": "D_ID = 1 AND ID"
}

## @type: DataSource
## @args: [connection_type = "mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8, transformation_ctx="datasource0")

## @type: ApplyMapping
## @args: [mapping = [...], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame=datasource0, mappings=[
    ("ID", "BIGINT", "id", "BIGINT"),
    ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"),
    ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"),
    ("KEY", "string", "key", "string"),
    ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"),
    ("CONTENT_MD5", "string", "CONTENT_MD5", "string"),
    ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"),
    ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT")
], transformation_ctx="applymapping1")

## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame=applymapping1, choice="make_struct", transformation_ctx="resolvechoice2")

## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame=resolvechoice2, transformation_ctx="dropnullfields3")

## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame=dropnullfields3, connection_type="s3", connection_options={"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format="parquet", transformation_ctx="datasink4")

job.commit()
```
The job on the table, which had around 46M rows, took around 20 minutes. We then created a crawler to discover the table from S3:
Name: prod332FileHandlesCrawler
Create a single schema for each S3 path: false
State: Ready
Last updated: Fri Nov 13 12:13:21 GMT-800 2020
Date created: Fri Nov 13 10:40:11 GMT-800 2020
Database: files_inventory
Service role: AWSGlueServiceRoleDefault
Data store: S3
Include path: s3://analytics.sagebase.org/prod332/file_handles
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Mark the table as deprecated in the data catalog
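The crawler was created through the console; a boto3 sketch that would be roughly equivalent to the configuration above (offered as a reference, not the commands that were actually run):

```python
import boto3

glue = boto3.client("glue")

# Crawl the exported parquet files and register the table in the
# files_inventory database, mirroring the console configuration above.
glue.create_crawler(
    Name="prod332FileHandlesCrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="files_inventory",
    Targets={
        "S3Targets": [
            {"Path": "s3://analytics.sagebase.org/prod332/file_handles"}
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
)
glue.start_crawler(Name="prod332FileHandlesCrawler")
```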
Once the table is added to the catalog we can run Athena queries joining the S3 inventory and the file handles table. First we computed the count and size of objects that are indexed in file handles but for which we do not have data in the prod bucket:
SELECT COUNT(*) AS MISSING_COUNT, SUM(F.CONTENT_SIZE) AS MISSING_SIZE FROM file_handles F LEFT JOIN prod_inventory I ON F.KEY = I.KEY WHERE I.KEY IS NULL
Count | Size |
---|---|
101 | 11252921200 (10.4GB) |
We can also run the inverse query, to find the data that is in prod but not indexed (which should return a result similar to previous findings); note that in this case we filter on the date of the inventory snapshot:
SELECT COUNT(*) AS MISSING_COUNT, SUM(I.SIZE) AS MISSING_SIZE FROM prod_inventory I LEFT JOIN file_handles F ON I.KEY = F.KEY WHERE I.dt = '2020-11-05-00-00' AND F.KEY IS NULL
Count | Size |
---|---|
2198971 | 8286060045314 (7.5TB) |
Note that this number most likely includes temporary objects that were never deleted (e.g. temporary multipart upload files, old tests, staging data, etc.).
Synapse Storage Report
From the Synapse storage report, which is generated monthly and stored in a Synapse table, we can get an idea of how much data is used by Synapse entities in projects. The following query gets the aggregated sum of the size in bytes for the last 10 months:
SELECT `Date`, SUM(sizeInBytes) FROM syn18406644 GROUP BY `Date` ORDER BY `Date` DESC LIMIT 10
Date | SUM(sizeInBytes) |
---|---|
11/07/2020 8:19 AM | 625884553013513 (569TB) |
10/07/2020 9:19 AM | 601258482253040 |
09/07/2020 9:18 AM | 582341331814681 |
08/07/2020 9:17 AM | 578802072068115 |
07/07/2020 9:16 AM | 571523659700716 |
06/07/2020 9:16 AM | 565440327890856 |
05/07/2020 9:15 AM | 553787060613178 |
04/07/2020 9:14 AM | 548567403971529 |
03/07/2020 8:14 AM | 515099665980791 |
02/07/2020 8:13 AM | 496359704897457 |
Note that this report does not account for file handles that are referenced in tables or other objects, nor for file handles that are not linked to anything; together these seem to account for about 70TB of data.
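As an aside, the report query above can also be run programmatically with the Synapse Python client (a minimal sketch, assuming synapseclient is installed and credentials are configured locally):

```python
import synapseclient

syn = synapseclient.Synapse()
syn.login()  # assumes cached credentials or a configured ~/.synapseConfig

# Aggregate the monthly storage report by snapshot date, as in the query above.
results = syn.tableQuery(
    "SELECT `Date`, SUM(sizeInBytes) FROM syn18406644 "
    "GROUP BY `Date` ORDER BY `Date` DESC LIMIT 10"
)
for row in results:
    print(row)
```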
Synapse Tables
TODO: Create a table that contains all the file handles for all the T%F tables in the tables database, export it to S3 and join on the file handles table to get the size.
Multipart Uploads
In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about them:
```java
package org.sagebionetworks.project;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

import org.sagebionetworks.aws.MultiFactorAuthenticationCredentialProvider;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public class MultipartUploadsScanner {

    private AmazonS3 s3Client;

    public static void main(String[] args) {
        String mfaDeviceArn = "mfaDeviceArn";
        AmazonS3 s3Client = buildS3Client(mfaDeviceArn);
        String bucket = "proddata.sagebase.org";
        new MultipartUploadsScanner(s3Client).scan(bucket);
    }

    private static AmazonS3 buildS3Client(String mfaDeviceArn) {
        AmazonS3ClientBuilder builder = AmazonS3ClientBuilder.standard();
        builder.withCredentials(new MultiFactorAuthenticationCredentialProvider(mfaDeviceArn));
        builder.withRegion(Regions.US_EAST_1);
        return builder.build();
    }

    public MultipartUploadsScanner(AmazonS3 s3Client) {
        this.s3Client = s3Client;
    }

    void scan(String bucket) {
        ListMultipartUploadsRequest request = new ListMultipartUploadsRequest(bucket);

        boolean hasNext = true;
        String keyMarker = null;
        String uploadIdMarker = null;
        int totalCount = 0;
        int totalOldCount = 0;

        Date oneMonthAgo = Date.from(Instant.now().minus(30, ChronoUnit.DAYS));

        while (hasNext) {
            request.setKeyMarker(keyMarker);
            request.setUploadIdMarker(uploadIdMarker);
            MultipartUploadListing result = s3Client.listMultipartUploads(request);
            int count = result.getMultipartUploads().size();
            long oldCount = result.getMultipartUploads().stream().filter(u -> u.getInitiated().before(oneMonthAgo)).count();
            totalCount += count;
            totalOldCount += oldCount;
            System.out.println(String.format("Batch Count: %s, Old Count: %s (Total: %s, Total Old: %s)", count, oldCount, totalCount, totalOldCount));
            if (result.isTruncated()) {
                keyMarker = result.getNextKeyMarker();
                uploadIdMarker = result.getNextUploadIdMarker();
                System.out.println(String.format("Has next: %s, %s", uploadIdMarker, keyMarker));
            } else {
                hasNext = false;
            }
        }

        System.out.println(String.format("Number of multipart uploads: %s (Old: %s)", totalCount, totalOldCount));
    }
}
```
Not Completed (to date) | Not Completed (started more than 30 days ago) |
---|---|
1417823 | 1414593 |
So we have about 1.4M multipart uploads that were never completed and are potentially taking up storage; we could not compute the size of the uploads/parts. Since we upload temporary objects for each part, this might also account for part of the difference between the size reported by S3 and the size indexed in the file handles table.
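Computing the actual storage taken by those parts would require a ListParts call for each of the ~1.4M pending uploads, which is presumably why it was not attempted. A sketch of how it could be done for a sample of pending uploads (boto3, illustrative only):

```python
import boto3

s3 = boto3.client("s3")
bucket = "proddata.sagebase.org"

def pending_upload_size(key, upload_id):
    """Sum the size of the parts already uploaded for one pending multipart upload."""
    total = 0
    paginator = s3.get_paginator("list_parts")
    for page in paginator.paginate(Bucket=bucket, Key=key, UploadId=upload_id):
        for part in page.get("Parts", []):
            total += part["Size"]
    return total

# Example: sample one page of pending uploads and sum their part sizes.
listing = s3.list_multipart_uploads(Bucket=bucket)
sample_total = sum(
    pending_upload_size(u["Key"], u["UploadId"])
    for u in listing.get("Uploads", [])
)
print(sample_total)
```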
The indexed multipart uploads in the DB give us a very different picture:
SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'
Result: 6353