Skip to end of banner
Go to start of banner

S3 Bucket Analysis

Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 19 Next »

We collect in this document the various steps taken to gauge the amount of data in the S3 proddata.sagebase.org bucket comparing it to the data indexed in the synapse database.

  • A snapshot of both the prod database and the tables database was taken on stack 332 on November 5th.

  • The S3 inventory was enabled on production and a snapshot was produced on November 5th

S3 Bucket

Data reported by S3 metrics on November 5th (From the cloud watch metrics for the bucket), note that results might not be precise:

  • Storage: 707451754672065 bytes (643TB)

  • Number of objects: 43,536,933

S3 Inventory

The inventory was enabled (See https://sagebionetworks.jira.com/browse/PLFM-6426 ) on the bucket.

We created a database named files_inventory in the Glue catalog. In Athena we created an external table for the S3 inventory:

CREATE EXTERNAL TABLE prod_inventory(
  `bucket` string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  is_multipart_uploaded boolean
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://prod.inventory.sagebase.org/inventory/proddata.sagebase.org/defaultInventory/hive';

We ran the repair on the table:

MSCK REPAIR TABLE prod_inventory

We check how much data and objects are reported (dt is the date when the inventory is generated):

SELECT dt, COUNT(*), SUM(size) FROM prod_inventory GROUP BY dt

Results:

dt

count

size

2020-11-05-00-00

41625517

704212857157112 (640.4TB)

2020-11-08-00-00

41654050

708573173690177 (644.4TB)

The inventory reports also if a file was uploaded as multipart, this would provide us with how many objects are uploaded without going through the standard synapse upload API:

SELECT dt, COUNT(*), SUM(size) FROM prod_inventory WHERE is_multipart_uploaded = true 
GROUP BY dt

2020-11-08-00-00

5712315

531513092323401 (483.4TB)

2020-11-05-00-00

5705476

527196059092709 (479.4TB)

This result is surprising, only 5.7M objects seems to be multipart uploads but we do have an order of magnitude more than that in the database, what is going on?

On further analysis we checked a few of those files and we could see that they were in fact normal multipart uploads in the DB with the relative file handles. The reason for this inconsistency is that we encrypted the S3 bucket back in 2019, this most likely was done using a PUT copy of the same object. This belief is reinforced by the fact that the modified dates on those objects seem to be consistent with the timeline of the encryption, while the original upload date in synapse was done prior. If the python API was used most likely all the objects that were smaller than a certain size were “copied” over without multipart.

File Handles

We created a table that contains only data pointing to the proddata.sagebase.org bucket that also includes a de-duplication identifier:

CREATE TABLE FILES_PROD AS 
SELECT *, ROW_NUMBER() OVER(PARTITION BY BUCKET_NAME, `KEY` ORDER BY ID) AS D_ID 
FROM FILES WHERE BUCKET_NAME = 'proddata.sagebase.org'

Count and size of the file handles that are indexed for the prod bucket:

SELECT COUNT(*) AS TOTAL_COUNT, SUM(CONTENT_SIZE) AS TOTAL_SIZE FROM FILES_PROD

Count

Size

47106657

746862657773610 (679TB)

Excluding duplicated file handles pointing to the same key:

SELECT COUNT(*) AS TOTAL_COUNT, SUM(CONTENT_SIZE) AS TOTAL_SIZE FROM FILES_PROD WHERE D_ID = 1

Count

Size

39426647

695938050033098 (633TB)

There are 7680010 * 2 file handles that are duplicated (pointing to the same key), the size of those copies is around 46TB.

S3 Inventory vs File Handles

Since we have the S3 inventory in parquet format we decided to export the snapshot of the production file handles table as created above to S3 in parquet format, so that we can analyze further the data. The file handles table has a considerable size and we decided to use an AWS Glue job to export the table. There are some caveat since apparently at the time of writing Glue does not fully support MySQL 8 and this is not fully documented. The workaround is to customize the data source JDBC connection using a compatible MySQL driver (See https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc).

We uploaded the JDBC driver JAR for version 8.0.22 in the bucket used for analytics and used in the connection (using the "customJdbcDriverS3Path" and "customJdbcDriverClassName" properties of the connection). The Job was parallelized using an hash expression on the id with 7 worker nodes, note that we filter out the duplicated file handles as well. Below the ETL script used for exporting the data to S3 in this way is as follows:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

connection_mysql8 = {
    "url": "jdbc:mysql://restored-prod-332-db-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332",
    "dbtable": "FILES_PROD",
    "user": "prod332user",
    "password": "secret",
    "customJdbcDriverS3Path": "s3://analytics.sagebase.org/mysql/mysql-connector-java-8.0.22.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
    "hashexpression": "D_ID = 1 AND ID"
}

## @type: DataSource
## @args: [connection_type = "mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0")

## @type: ApplyMapping
## @args: [mapping = [("ID", "BIGINT", "id", "BIGINT"), ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"), ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"), ("KEY", "string", "key", "string"), ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"), ("CONTENT_MD5", "string", "CONTENT_MD5", "string"), ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"), ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT"), transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("ID", "BIGINT", "id", "BIGINT"), ("CREATED_ON", "timestamp", "CREATED_ON", "timestamp"), ("CREATED_BY", "BIGINT", "CREATED_BY", "BIGINT"), ("KEY", "string", "key", "string"), ("CONTENT_SIZE", "BIGINT", "CONTENT_SIZE", "BIGINT"), ("CONTENT_MD5", "string", "CONTENT_MD5", "string"), ("IS_PREVIEW", "boolean", "IS_PREVIEW", "boolean"), ("PREVIEW_ID", "BIGINT", "PREVIEW_ID", "BIGINT")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/file_handles"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

The Job on the the table that had around 46M rows took around 20 minutes (on an smaller instance with 16GB of ram and 4 vcpu). We then created a crawler to discover the table from S3:

Name	prod332FileHandlesCrawler
Description	
Create a single schema for each S3 path	false
Security configuration	
Tags	
-
State	Ready
Schedule	
Last updated	Fri Nov 13 12:13:21 GMT-800 2020
Date created	Fri Nov 13 10:40:11 GMT-800 2020
Database	files_inventory
Service role	AWSGlueServiceRoleDefault
Selected classifiers	
Data store	S3
Include path	s3://analytics.sagebase.org/prod332/file_handles
Connection	
Exclude patterns	
Configuration options
Schema updates in the data store	Update the table definition in the data catalog.
Object deletion in the data store	Mark the table as deprecated in the data catalog.

Once the table is added to the catalog we can now run queries with Athena joining the S3 inventory and the file handles table, we computed the count and size of objects that are stored in file handles but for which we do not have data in prod:

SELECT COUNT(*) AS MISSING_COUNT, SUM(F.CONTENT_SIZE) AS MISSING_SIZE 
FROM file_handles F LEFT JOIN prod_inventory I ON F.KEY = I.KEY
WHERE I.KEY IS NULL

Count

Size

101

11252921200 (10.4GB)

We can also run the inverse, to find which data is in prod but not indexed (which should retain a result similar to previous findings), note that we filter on the date of the inventory snapshot in this case:

SELECT COUNT(*) AS MISSING_COUNT, SUM(I.SIZE) AS MISSING_SIZE 
FROM prod_inventory I LEFT JOIN file_handles F ON I.KEY = F.KEY
WHERE I.dt = '2020-11-05-00-00' AND F.KEY IS NULL

Count

Size

2198971

8286060045314 (7.5TB)

Note that this number most likely contains temporary objects never deleted (e.g. temporary multipart upload files, old tests, staging data etc).

File Entities

From the synapse storage report that is generated monthly and stored in a synapse table we can get an idea of how much data is used by synapse entities in projects, the following query gets the aggregated sum of the size in bytes for the last 10 months:

SELECT `Date`, SUM(sizeInBytes) FROM syn18406644 GROUP BY `Date` ORDER BY `Date` DESC LIMIT 10

Date

SUM(sizeInBytes)

11/07/2020 8:19 AM

625884553013513 (569TB)

10/07/2020 9:19 AM

601258482253040

09/07/2020 9:18 AM

582341331814681

08/07/2020 9:17 AM

578802072068115

07/07/2020 9:16 AM

571523659700716

06/07/2020 9:16 AM

565440327890856

05/07/2020 9:15 AM

553787060613178

04/07/2020 9:14 AM

548567403971529

03/07/2020 8:14 AM

515099665980791

02/07/2020 8:13 AM

496359704897457

Note that this report does not account for file handles that are referenced in tables or other objects or file handles that are not linked to anything.

Running the following query should retains similar results, but more inline with the snapshot we are using and excluding eventual duplicates:

WITH 
U_R_F AS (
	SELECT DISTINCT R.FILE_HANDLE_ID FROM JDOREVISION R WHERE R.FILE_HANDLE_ID IS NOT NULL
),
U_F AS (
	SELECT MAX(F.CONTENT_SIZE) AS CONTENT_SIZE FROM U_R_F R JOIN FILES_PROD F ON R.FILE_HANDLE_ID = F.ID GROUP BY F.`KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F

Count

Size

4659729

615280584612739 (559.5TB)

So out of 39426647 distinct file handles in prod that take 633TB of data, 4659729 (~12%) file handles are used as entities and make up for 559.5TB (~88%) of the data. So about 34766918 (88%) file handles taking 73TB (~12%) of data are potentially linked in tables or other objects or used by unlinked file handles.

Synapse Tables

Each synapse table might link to file handles and for data access checks we keep for each table a companion table that stores the file handle id for each table. To get an idea of how much data is accounted for by synapse tables we created a table that contains the union of all the tables.

We created a container table:

CREATE TABLE `FILES_TABLES` (
	`TABLE_ID` BIGINT(20) NOT NULL,
	`ID` BIGINT(20) NOT NULL
)

And we run the following script, we also record the origin id of the table referencing the file handle:

package org.sagebionetworks.project;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.sql.DataSource;

import org.apache.commons.dbcp2.BasicDataSource;
import org.apache.commons.lang3.time.StopWatch;
import org.springframework.jdbc.core.JdbcTemplate;

public class SynapseTablesFileHandlesScanner {

	private static final String DB_DRIVER = "com.mysql.cj.jdbc.Driver";
	private static final String DB_URL = "jdbc:mysql://restored-prod-332-table-0-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332";
	private static final String DB_USER = "prod332user";
	private static final String DB_PASSWORD = "secret";
	private static final String SCHEMA = "prod332";
	
	public static void main(String[] args) {
		DataSource dataSource = getDataSource();
		JdbcTemplate jdbcTemplate = getTemplate(dataSource);
		ExecutorService executorService = Executors.newFixedThreadPool(20);
		ScheduledExecutorService schedulerService = Executors.newScheduledThreadPool(1);
		
		final SynapseTablesFileHandlesScanner scanner = new SynapseTablesFileHandlesScanner(jdbcTemplate);
		
		int limit = 20000;
		
		final List<FileTable> tables = scanner.getFileTables(limit);
		final int size = tables.size();
		final AtomicInteger counter = new AtomicInteger();
		
		final StopWatch watch = StopWatch.createStarted();
		
		System.out.println(String.format("Copying data from %d tables...", size));
		
		tables.forEach(table -> {
			executorService.submit(() -> {
				timedRunnable(() -> {
					int rowNum = scanner.copyFileHandles(table);
					System.out.println(String.format("Finished copying %d rows from table %s", rowNum, table.getName()));
					counter.incrementAndGet();
				}, String.format("Copying table %s (Id: %d, Est Size: %d)", table.getName(), table.getId(), table.getSize()));
			});
			System.out.println(String.format("Table %s submitted.", table.getName()));
		});	
		
		schedulerService.scheduleWithFixedDelay(() -> {
			int currentCount = counter.get();
			try {
				System.out.println(String.format("-----\n%d/%d tables processed (%d%%, Elapsed Time: %d sec)\n-----", currentCount, size, (currentCount * 100 / size), watch.getTime(TimeUnit.SECONDS)));
			} catch (Exception e) {
				e.printStackTrace();
			}
		}, 5, 5, TimeUnit.SECONDS);
				
		terminateExecutorService(executorService, 5, TimeUnit.HOURS);
		terminateExecutorService(schedulerService, 5, TimeUnit.SECONDS);
		
		watch.stop();
		
		System.out.println(String.format("Copying data from %d tables...DONE (Took: %d mins)", size, watch.getTime(TimeUnit.MINUTES)));
	}
	
	private static void terminateExecutorService(ExecutorService executorService, long timeout, TimeUnit unit) {
		executorService.shutdown();
		
		try {
		    if (!executorService.awaitTermination(timeout, unit)) {
		        executorService.shutdownNow();
		    } 
		} catch (InterruptedException e) {
		    executorService.shutdownNow();
		}
	}
	
	private static void timedRunnable(Runnable runnable, String text) {
		StopWatch watch = StopWatch.createStarted();
		System.out.println(String.format("%s...", text));
		try {
			runnable.run();
		} catch (Exception e) {
			System.out.println(String.format("%s...FAILED: %s", text, e.getMessage()));
			e.printStackTrace();
		}
		watch.stop();
		System.out.println(String.format("%s...DONE (Took: %d sec)", text, watch.getTime(TimeUnit.SECONDS)));
	}
	
	private static JdbcTemplate getTemplate(DataSource dataSource) {
		return new JdbcTemplate(dataSource);
	}

	private static DataSource getDataSource() {

		BasicDataSource ds = new BasicDataSource();

		ds.setDriverClassName(DB_DRIVER);
		ds.setUsername(DB_USER);
		ds.setPassword(DB_PASSWORD);
		ds.setUrl(DB_URL);
		ds.setMinIdle(5);
		ds.setMaxIdle(50);
		ds.setMaxTotal(50);

		return ds;
	}
	
	private int copyFileHandles(FileTable table) {
		return jdbcTemplate.update("INSERT INTO FILES_TABLES SELECT " + table.getId() + ", FILE_ID FROM " + table.getName());
	}

	private JdbcTemplate jdbcTemplate;
	
	public SynapseTablesFileHandlesScanner(JdbcTemplate jdbcTemplate) {
		this.jdbcTemplate = jdbcTemplate;
	}

	public List<FileTable> getFileTables(int limit) {
		StringBuilder sqlBuilder = new StringBuilder()
				.append("SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES T WHERE T.TABLE_SCHEMA = \"").append(SCHEMA).append("\"")
				.append(" AND T.TABLE_NAME LIKE 'T%F'")
				.append(" ORDER BY T.TABLE_NAME LIMIT ").append(limit);
		
		return jdbcTemplate.query(sqlBuilder.toString(), (rs, rNum) -> new FileTable(rs.getString(1), rs.getLong(2)));
	}

	private static class FileTable {

		private static final Pattern FILES_TABLE_PATTERN = Pattern.compile("T(\\d+)(_\\d+)?F");
		
		private Long id;
		private String name;
		private Long size;

		public FileTable(String name, Long size) {
			this.id = parseId(name);
			this.name = name;
			this.size = size;
		}
				
		private Long getId() {
			return id;
		}

		public String getName() {
			return name;
		}

		public Long getSize() {
			return size;
		}
				
		private static Long parseId(String name) {
			Matcher matcher = FILES_TABLE_PATTERN.matcher(name);
			if (matcher.matches()) {
				return Long.valueOf(matcher.group(1));
			}
			throw new IllegalArgumentException("Invalid table name " + name);
		}

	}
}

This copied around 10.5K tables, out of which about 1.3k contained file handles. The process took roughly 4 minutes and exported 36173253 file handles. The biggest table contains 7093160 file handles. The distinct number of file handles is 27624848 (This is probably due to table snapshots).

We then created an additional FILES_TABLE_DISTINCT table to stored the the distinct file handles ids:

CREATE TABLE `FILES_TABLE_DISTINCT` (
	`ID` BIGINT(20) NOT NULL,
	`REUSED` TINYINT(4) NOT NULL DEFAULT '0'
)
INSERT INTO FILES_TABLE_DISTINCT SELECT DISTINCT F.ID, 0 FROM FILES_TABLES F

Note that we used a flag to indicate if the file handle is “reused”, since we have the object replication data we can verify if any of the file handles in tables are also reused by entities using a query such as (We created an index on both the FILE_ID in the object replication and on ID in the FILES_TABLE_DISTINCT first):

SELECT COUNT(DISTINCT F.ID) FROM FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID WHERE R.IN_SYNAPSE_STORAGE = TRUE

We have 49166 file handles referenced in tables that are also used in file entities. We can update the table above with this information:

UPDATE FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID WHERE R.IN_SYNAPSE_STORAGE = TRUE SET REUSED = TRUE

We created a Glue job to export this table to S3 so that we can join on the S3 data to compute the size claimed by tables, the ETL script is as follows:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

connection_mysql8 = {
    "url": "jdbc:mysql://restored-prod-332-table-0-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332",
    "dbtable": "FILES_TABLE_DISTINCT",
    "user": "prod332user",
    "password": "secret",
    "customJdbcDriverS3Path": "s3://analytics.sagebase.org/mysql/mysql-connector-java-8.0.22.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
    "hashexpression": "ID"
}

## @type: DataSource
## @args: [connection_type = "mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8, transformation_ctx = "datasource0")

## @type: ApplyMapping
## @args: [mapping = [("ID", "BIGINT", "id", "BIGINT"), ("REUSED", "TINYINT ", "reused", "TINYINT")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("ID", "BIGINT", "id", "BIGINT"), ("REUSED", "TINYINT", "reused", "TINYINT")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, specs = [("reused", "cast:boolean")], transformation_ctx = "resolvechoice2")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/tables_file_handles"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = resolvechoice2, connection_type = "s3", connection_options = {"path": "s3://analytics.sagebase.org/prod332/tables_file_handles"}, format = "parquet", transformation_ctx = "datasink4")

job.commit()

With this table catalogued in Glue, we can now join the data in S3 using Athena (Note: for this we used a dump of the file handle table computed above, but including the file handles with duplicate key since we are joining on ids), we compute the sum of the file sizes that are linked in the tables:

SELECT COUNT(*), SUM(F.CONTENT_SIZE) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = FALSE

Note that we filter by the reused = False since those file handles are already counted by file entities.

Count

Size

13711427

4545704827408 (4.1TB)

This number is the unique file handles referenced by tables that is stored in the production bucket, if we want to compute the actual size (e.g. excluding file handle copies):

WITH U_SIZE AS (
  SELECT MAX(F.CONTENT_SIZE) CONTENT_SIZE FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = FALSE GROUP BY F.KEY
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE

Count

Size

12004739

3707513625379 (3.4TB)

The space taken by the file handles that are linked both in tables and in entities (and that are included in the storage report reported above):

SELECT COUNT(*), SUM(F.CONTENT_SIZE) FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE

Count

Size

49166

66005128040194 (60TB)

Trying to deduplicate from copied file handles does not yield different results (meaning that this file handles were never copied):

WITH U_SIZE AS (
  SELECT MAX(F.CONTENT_SIZE) AS CONTENT_SIZE FROM tables_file_handles T JOIN file_handles_with_d_id F ON T.ID = F.ID WHERE T.reused = TRUE GROUP BY F.KEY
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE

Count

Size

49166

66005128040194 (60TB)

These are interesting numbers, it looks like the table links are only claiming about 3.4TB of data. The total number of file handles referenced by synapse tables and that are in the production bucket is 13760593 (49166 + 13711427). Out of 27624848 unique file handles referenced by synapse tables 50% are in the prod bucket (The rest might be file handles that are stored elsewhere).

Other Linked Objects

We have several other objects in Synapse that link back explicitly to file handles, the list of these relations is described by https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html. Most of those are relatively small sets, in the following we gather the statistics about the following associations where we see relatively big numbers:

  • MessageAttachment: File handles for user messages

  • WikiMarkdown: File handles that contain the markdown of wikis

  • WikiAttachment: File handles attached to wikis

  • SubmissionAttachment: File handles submitted as part of evaluation submissions (This are submitted as entities, but we keep the list of file handles submitted. Since entities can be deleted we need to compute the delta)

The above objects track their file handle associations into dedicated tables and we can figure out the count and size, taking care of de-duplicating:

MessageAttachement

The messages store their file handles in the MESSAGE_CONTENT table, the file handles are either uploaded internally (e.g. system generated emails) or externally (first the message is uploaded and then the message is sent through the API, an example is the evaluation orchestrator):

We can count and compute the file size directly from the DB:

WITH U_F AS (
  SELECT DISTINCT(M.FILE_HANDLE_ID) FROM MESSAGE_CONTENT M
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;

Count

Size

437029

2431601540 (2.26GB)

Just for completeness we can actually de-duplicate further by the object key (since file handles are exposed in the API external processes might actually make a copy of the same file handle):

WITH U_SIZE AS (
  SELECT MAX(CONTENT_SIZE) AS CONTENT_SIZE FROM MESSAGE_CONTENT M JOIN FILES_PROD F ON M.FILE_HANDLE_ID = F.ID GROUP BY `KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE;

Count

Size

436974

2431579468 (2.26GB)

WikiMarkdown and WikiAttachment

The file handles for both the wiki markdown and the wiki attachments are tracked in the table V2_WIKI_ATTACHMENT_RESERVATION:

WITH U_F AS (
  SELECT DISTINCT(FILE_HANDLE_ID) FROM V2_WIKI_ATTACHMENT_RESERVATION
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;

Count

Size

1150410

22359954476 (20.8GB)

De-duplicating for the size:

WITH U_SIZE AS (
  SELECT MAX(CONTENT_SIZE) AS CONTENT_SIZE FROM V2_WIKI_ATTACHMENT_RESERVATION W JOIN FILES_PROD F ON W.FILE_HANDLE_ID = F.ID GROUP BY `KEY`
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_SIZE;

Count

Size

1112096

16977090003 (15.8GB)

SubmissionAttachment

Evaluation submissions file handles are tracked in the table JDOSUBMISSION_FILE:

WITH U_F AS (
  SELECT DISTINCT(FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;

Count

Size

93198

13031793753688 (11.8TB)

Note however that submissions are linked to entities, we need to compute the count and size of submissions for which the file handle is NOT linked to an entity:

WITH U_F AS (
 SELECT DISTINCT(S.FILE_HANDLE_ID) FROM JDOSUBMISSION_FILE S LEFT JOIN JDOREVISION R ON S.FILE_HANDLE_ID = R.FILE_HANDLE_ID WHERE R.FILE_HANDLE_ID IS NULL
)
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F JOIN FILES_PROD F ON U_F.FILE_HANDLE_ID = F.ID;

Count

Size

42979

613597243955 (0.55TB)

Therefore out of 93198 submissions with file entities, 42979 (46%) are not linked to entities and take 0.55TB of data (0.47% of the total). Note that we used the table with file handles that are only in prod (e.g. the temporary table FILES_PROD created previously), the numbers are similar if we include all other storage locations (+/- 300GB).

Temporary File Handles

Synapse also uses file handles for what I refer to as temporary file handles, this are file handles used most likely only once. These file handles can be generated internally by the system or uploaded externally, for the latter we cannot control where the file handle ends up (e.g. in our bucket).

In particular we identified in the code base 4 places where we use file handles that might end up in our bucket (The main reference is the usage of the multipartUploadLocalFile method: https://github.com/Sage-Bionetworks/Synapse-Repository-Services/blob/e2b84bfbbd1d1450e955f360199f5448a1b620c3/services/repository-managers/src/main/java/org/sagebionetworks/repo/manager/file/MultipartManagerImpl.java#L222) for this cases we do not have an associations but still make up a chunk of the data in prod:

  • Storage reports: CSV reports that contains a breakdown of data consumed by entities in each project.

  • Query results: Table and View query results that are stored in CSV or TSV files. They are used in some cases as a cache.

  • Bulk downloads: Zipped packages for bulk downloads up to 2GB.

  • Tables uploads: This are CSV files uploaded by the user in order to update tables or views.

All of the above are linked to asynchronous job requests, the body of the request is stored as a blob in the database. Unfortunately the asynchronous job status is not a migratable table therefore we lose the file handle links at each release cycle.

For some of these we can estimate the size since we know the file name pattern used by the backend when the file handles are created, for this task we used Athena on the exported file handle table in S3 since the results can be computed much faster:

Storage reports

They are initiated from a DownloadStorageReportRequest, a background job creates a file with the pattern Job-{jobId}.csv. Using Athena we ran the following query:

SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.csv$')

Count

Size

2532705

2206719851653 (2TB)

Note that these includes also query results in csv format, since they follow the same pattern.

Query Results

They are initiated from a DownloadFromTableRequest, a background job creates either a CSV or TSV file with the pattern Job-{jobId}.{csv|tsv}. Since the previous query included the csv pattern we ran the following query:

SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job-\d+\.tsv$')

Count

Size

4085

85852089838 (80GB)

Bulk Downloads

They are initiated from a BulkFileDownloadRequest, a background job created a zipped package using the file pattern Job{jobId}.zip. This is partially true as the request can override the filename. We ran the following query with Athena:

SELECT COUNT(*), SUM(CONTENT_SIZE) FROM file_handles_with_d_id F
WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')

Count

Size

1669369

20575241255094 (18.7TB)

Since this is lot of data, to make sure those are not file handles used in entities or tables we additionally ran the following query:

WITH F AS (
  SELECT DISTINCT F.ID FROM file_handles_with_d_id F WHERE regexp_like(F.KEY, '\d+/.+/Job\d+\.zip$')
) 
SELECT COUNT(*) FROM F JOIN nodes_file_handles N ON F.ID = N.FILE_HANDLE_ID
UNION
SELECT COUNT(*) FROM F JOIN tables_file_handles T ON F.ID = T.ID

That returned 0 results. Meaning that with a high probability this file handles are not linked anywhere else.

Table Uploads

They are initiated from an UploadToTableRequest, this request can even be embedded in a TableUpdateTransactionRequest as part of the change set. Unfortunately at the moment we do not have a way to estimate their size or numbers, in particular we verified that:

  • The asynchronous job status table is not migratable therefore we lose any information in the original request

  • The data warehouse does not store the request body for requests and the id of the file handle is stored as part of the request body

  • We checked the web client implementation to see if there was a common file name pattern used when uploading the CSV but the name of the original file is used instead

Multipart Uploads

In prod we do not clean up unfinished multipart uploads, the following script was used to fetch some statistics about those:

package org.sagebionetworks.project;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

import org.sagebionetworks.aws.MultiFactorAuthenticationCredentialProvider;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public class MultipartUploadsScanner {
	
	private AmazonS3 s3Client;
	
	public static void main(String[] args) {

		String mfaDeviceArn = "mfaDeviceArn";
		
		AmazonS3 s3Client = buildS3Client(mfaDeviceArn);
		
		String bucket = "proddata.sagebase.org";
		
		new MultipartUploadsScanner(s3Client).scan(bucket);
	}
		
	private static AmazonS3 buildS3Client(String mfaDeviceArn) {	
		AmazonS3ClientBuilder builder = AmazonS3ClientBuilder.standard();
		builder.withCredentials(new MultiFactorAuthenticationCredentialProvider(mfaDeviceArn));
		builder.withRegion(Regions.US_EAST_1);
		return builder.build();
	}

	
	public MultipartUploadsScanner(AmazonS3 s3Client) {
		this.s3Client = s3Client;
	}
	
	void scan(String bucket) {
		
		ListMultipartUploadsRequest request = new ListMultipartUploadsRequest(bucket);
		
		boolean hasNext = true;
		String keyMarker = null;
		String uploadIdMarker = null;
		
		int totalCount = 0;
		int totalOldCount = 0;
		
		Date oneMonthAgo = Date.from(Instant.now().minus(30, ChronoUnit.DAYS));
		
		while (hasNext) {
			
			request.setKeyMarker(keyMarker);
			request.setUploadIdMarker(uploadIdMarker);
			
			MultipartUploadListing result = s3Client.listMultipartUploads(request);
			
			int count = result.getMultipartUploads().size();
			long oldCount = result.getMultipartUploads().stream().filter( u-> u.getInitiated().before(oneMonthAgo)).count();
			
			totalCount += count;
			totalOldCount += oldCount;
			
			System.out.println(String.format("Batch Count: %s, Old Count: %s (Total: %s, Total Old: %s)", count, oldCount, totalCount, totalOldCount));
			
			if (result.isTruncated()) {
				keyMarker = result.getNextKeyMarker();
				uploadIdMarker = result.getNextUploadIdMarker();
				System.out.println(String.format("Has next: %s, %s", uploadIdMarker, keyMarker));
			} else {
				hasNext = false;
			}
		}
		
		System.out.println(String.format("Number of multipart uploads: %s (Old: %s)", totalCount, totalOldCount));
	}

}

Not Completed (to date)

Not Completed (started more than 30 days ago)

1417823

1414593

So we have about 1.4 M multipart uploads that are potentially taking storage but are not completed, we could not compute the size of the uploads/parts. Since we upload temporary objects for each part it might also make up the difference reported by the size in S3 and the size in indexed in the file handles table.

The indexed multipart uploads in the DB give us a very different picture:

SELECT COUNT(*) FROM MULTIPART_UPLOAD U WHERE U.STATE = 'UPLOADING'

Result: 6353

We can have a rough estimate of the amount of data that has been uploaded in prod but not yet completed:

WITH UPLOADING AS (
	SELECT U.ID, U.PART_SIZE, COUNT(*) AS PARTS FROM MULTIPART_UPLOAD U JOIN MULTIPART_UPLOAD_PART_STATE P ON U.ID = P.UPLOAD_ID
	WHERE U.STATE = 'UPLOADING' AND U.BUCKET = 'proddata.sagebase.org'
	GROUP BY U.ID
),
UPLOADING_SIZE AS (
	SELECT (PART_SIZE * PARTS) AS SIZE FROM UPLOADING
)
SELECT COUNT(*), SUM(SIZE) FROM UPLOADING_SIZE

Count

Size

3037

2649792499252 (2.4TB)

So we have about 2.4TB of data that could be potentially freed just removing the unfinished multipart uploads.

Upon further analysis of the backend code we discovered a bug where a multipart upload is initiated when we create or update a wiki page using the first version of the wiki API that submitted the markdown as a string. The multipart upload is never completed for such cases: https://sagebionetworks.jira.com/browse/PLFM-6523. Additionally the new multipart upload that tracks the uploads was implemented relatively recently, the previous implementation might have left behind other unfinished multipart uploads.

Summary of Results

Working on a snapshot of prod 332 (11/05/2020) for the production bucket we have the following numbers for file handles:

Count

Count in S3 (Unique Keys)

Size in DB

Size in S3 (Unique Keys)

Description

File Handles

47,106,657

39,426,647

~679 TB

~633 TB

File handles that point to the production bucket

Linked File Entities

4,749,667

4,659,729

~589.7 TB

~560 TB

Entities that point to file handles in the production bucket

Linked Table Rows

13,711,427

12,004,739

~4.1 TB

~3.4 TB

File handles referenced in tables that point to the production bucket

Other Links

~1,630,418

~1,592,049

~0.6 TB

~0.6 TB

Other type of linked file handles that point to the production bucket

Temporary Handles

~4,206,159

~4,206,159

~20.7 TB

~20.7 TB

File handles that are not linked, and mostly one time use

Additionally we have the following figures for S3:

Count

Size

Description

S3 Objects

~41,625,517

~640 TB

The objects in S3 from the inventory

No S3 Objects

101

~10.4 GB

Objects that are referenced by file handles but do not exist in S3

No File Handle

2,198,971

~7.5 TB

Objects that do not have any file handle

In summary out of the 47M file handles that point to the production bucket, we can account for about 24M (~50%). Out of 633TB of indexed data, we can account for about 585 TB (92%). The amount of data that can potentially be archived amounts to about 48 TB, referenced by around 23M file handles. Note that the temporary file handles can potentially be archived as well removing an additional 20.7 TB from the bucket.

  • No labels