...
Note that this number most likely includes temporary objects that were never deleted (e.g. temporary multipart upload files, old tests, staging data, etc.).
Synapse Storage Report and Entities
From the Synapse storage report, which is generated monthly and stored in a Synapse table, we can get an idea of how much data is used by Synapse entities in projects. The following query gets the aggregated sum of the size in bytes for the last 10 months:
...
Note that this report does not account for file handles that are referenced in tables or other objects, nor for file handles that are not linked to anything; together these seem to account for about 70TB of data.
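As a rough cross-check, the unaccounted portion could be estimated with an anti-join of the file handles against the entity revisions. This is a sketch only, using the same FILES_PROD file handle data and JDOREVISION revision table that the entity query on this page relies on; it ignores duplicate copies of the same S3 key, which the entity query does account for:

Code Block

-- Sketch: count and size of file handles that are not referenced by any entity revision
SELECT COUNT(*), SUM(F.CONTENT_SIZE)
FROM FILES_PROD F
  LEFT JOIN JDOREVISION R ON F.ID = R.FILE_HANDLE_ID
WHERE R.FILE_HANDLE_ID IS NULL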
Synapse Tables
TODO: Create a table that contains all the file handles for all the T%F tables in the tables database, export it to S3 and join on the file handles table to get the size.
Running the following query should return similar results, but more in line with the snapshot we are using and excluding any duplicates:
Code Block
WITH
  U_R_F AS (
    SELECT DISTINCT R.FILE_HANDLE_ID FROM JDOREVISION R WHERE R.FILE_HANDLE_ID IS NOT NULL
  ),
  U_F AS (
    SELECT MAX(F.CONTENT_SIZE) AS CONTENT_SIZE FROM U_R_F R JOIN FILES_PROD F ON R.FILE_HANDLE_ID = F.ID GROUP BY F.`KEY`
  )
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F
Count | Size |
---|---|
4659729 | 615280584612739 (559.5TB) |
So out of the 39426647 distinct file handles in prod, which take up 633TB of data, 4659729 (~12%) are used by entities and account for 559.5TB (~88%) of the data. The remaining ~73TB (~12%) of data is potentially linked in tables or other objects, or used by unlinked file handles.
Synapse Tables
Each Synapse table might link to file handles; for data access checks we keep, alongside each table, a companion table that stores the ids of the file handles the table references. To get an idea of how much data is accounted for by Synapse tables we created a table that contains the union of all these companion tables.
We created a container table:
Code Block
CREATE TABLE `FILES_TABLES` (
  `TABLE_ID` BIGINT(20) NOT NULL,
  `ID` BIGINT(20) NOT NULL
)
We then ran the following script, which also records the id of the origin table referencing each file handle:
Code Block
package org.sagebionetworks.project;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.sql.DataSource;

import org.apache.commons.dbcp2.BasicDataSource;
import org.apache.commons.lang3.time.StopWatch;
import org.springframework.jdbc.core.JdbcTemplate;

public class SynapseTablesFileHandlesScanner {

	private static final String DB_DRIVER = "com.mysql.cj.jdbc.Driver";
	private static final String DB_URL = "jdbc:mysql://restored-prod-332-table-0-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332";
	private static final String DB_USER = "prod332user";
	private static final String DB_PASSWORD = "secret";
	private static final String SCHEMA = "prod332";

	public static void main(String[] args) {
		DataSource dataSource = getDataSource();
		JdbcTemplate jdbcTemplate = getTemplate(dataSource);

		// Pool of workers copying one companion table each, plus a single-thread scheduler for progress reporting
		ExecutorService executorService = Executors.newFixedThreadPool(20);
		ScheduledExecutorService schedulerService = Executors.newScheduledThreadPool(1);

		final SynapseTablesFileHandlesScanner scanner = new SynapseTablesFileHandlesScanner(jdbcTemplate);

		int limit = 20000;

		final List<FileTable> tables = scanner.getFileTables(limit);
		final int size = tables.size();
		final AtomicInteger counter = new AtomicInteger();
		final StopWatch watch = StopWatch.createStarted();

		System.out.println(String.format("Copying data from %d tables...", size));

		tables.forEach(table -> {
			executorService.submit(() -> {
				timedRunnable(() -> {
					int rowNum = scanner.copyFileHandles(table);
					System.out.println(String.format("Finished copying %d rows from table %s", rowNum, table.getName()));
					counter.incrementAndGet();
				}, String.format("Copying table %s (Id: %d, Est Size: %d)", table.getName(), table.getId(), table.getSize()));
			});
			System.out.println(String.format("Table %s submitted.", table.getName()));
		});

		// Print progress every 5 seconds
		schedulerService.scheduleWithFixedDelay(() -> {
			int currentCount = counter.get();
			try {
				System.out.println(String.format("-----\n%d/%d tables processed (%d%%, Elapsed Time: %d sec)\n-----", currentCount, size, (currentCount * 100 / size), watch.getTime(TimeUnit.SECONDS)));
			} catch (Exception e) {
				e.printStackTrace();
			}
		}, 5, 5, TimeUnit.SECONDS);

		terminateExecutorService(executorService, 5, TimeUnit.HOURS);
		terminateExecutorService(schedulerService, 5, TimeUnit.SECONDS);

		watch.stop();

		System.out.println(String.format("Copying data from %d tables...DONE (Took: %d mins)", size, watch.getTime(TimeUnit.MINUTES)));
	}

	private static void terminateExecutorService(ExecutorService executorService, long timeout, TimeUnit unit) {
		executorService.shutdown();
		try {
			if (!executorService.awaitTermination(timeout, unit)) {
				executorService.shutdownNow();
			}
		} catch (InterruptedException e) {
			executorService.shutdownNow();
		}
	}

	private static void timedRunnable(Runnable runnable, String text) {
		StopWatch watch = StopWatch.createStarted();
		System.out.println(String.format("%s...", text));
		try {
			runnable.run();
		} catch (Exception e) {
			System.out.println(String.format("%s...FAILED: %s", text, e.getMessage()));
			e.printStackTrace();
		}
		watch.stop();
		System.out.println(String.format("%s...DONE (Took: %d sec)", text, watch.getTime(TimeUnit.SECONDS)));
	}

	private static JdbcTemplate getTemplate(DataSource dataSource) {
		return new JdbcTemplate(dataSource);
	}

	private static DataSource getDataSource() {
		BasicDataSource ds = new BasicDataSource();
		ds.setDriverClassName(DB_DRIVER);
		ds.setUsername(DB_USER);
		ds.setPassword(DB_PASSWORD);
		ds.setUrl(DB_URL);
		ds.setMinIdle(5);
		ds.setMaxIdle(50);
		ds.setMaxTotal(50);
		return ds;
	}

	// Copies the (origin table id, file handle id) pairs from the given companion table into FILES_TABLES
	private int copyFileHandles(FileTable table) {
		return jdbcTemplate.update("INSERT INTO FILES_TABLES SELECT " + table.getId() + ", FILE_ID FROM " + table.getName());
	}

	private JdbcTemplate jdbcTemplate;

	public SynapseTablesFileHandlesScanner(JdbcTemplate jdbcTemplate) {
		this.jdbcTemplate = jdbcTemplate;
	}

	// Lists the T%F companion tables (with their estimated row count) from the information schema
	public List<FileTable> getFileTables(int limit) {
		StringBuilder sqlBuilder = new StringBuilder()
				.append("SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES T WHERE T.TABLE_SCHEMA = \"").append(SCHEMA).append("\"")
				.append(" AND T.TABLE_NAME LIKE 'T%F'")
				.append(" ORDER BY T.TABLE_NAME LIMIT ").append(limit);
		return jdbcTemplate.query(sqlBuilder.toString(), (rs, rNum) -> new FileTable(rs.getString(1), rs.getLong(2)));
	}

	private static class FileTable {

		private static final Pattern FILES_TABLE_PATTERN = Pattern.compile("T(\\d+)(_\\d+)?F");

		private Long id;
		private String name;
		private Long size;

		public FileTable(String name, Long size) {
			this.id = parseId(name);
			this.name = name;
			this.size = size;
		}

		private Long getId() {
			return id;
		}

		public String getName() {
			return name;
		}

		public Long getSize() {
			return size;
		}

		// Extracts the source table id from the companion table name (e.g. T123F or T123_2F)
		private static Long parseId(String name) {
			Matcher matcher = FILES_TABLE_PATTERN.matcher(name);
			if (matcher.matches()) {
				return Long.valueOf(matcher.group(1));
			}
			throw new IllegalArgumentException("Invalid table name " + name);
		}
	}
}
The script scanned around 10.5K tables, out of which about 1.3K contained file handles. The process took roughly 4 minutes and exported 36173253 file handle references; the biggest table contains 7093160 file handles, and the number of distinct file handles is 27624848.
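The figures above can be re-derived from the FILES_TABLES container table with aggregations along these lines (a sketch, not part of the original run):

Code Block

-- Tables that contained at least one file handle
SELECT COUNT(DISTINCT TABLE_ID) FROM FILES_TABLES;
-- Total file handle references copied
SELECT COUNT(*) FROM FILES_TABLES;
-- Biggest table by number of file handles
SELECT TABLE_ID, COUNT(*) AS CNT FROM FILES_TABLES GROUP BY TABLE_ID ORDER BY CNT DESC LIMIT 1;
-- Distinct file handles referenced by tables
SELECT COUNT(DISTINCT ID) FROM FILES_TABLES;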
We then created an additional FILES_TABLE_DISTINCT table to store the distinct file handle ids:
Code Block
CREATE TABLE `FILES_TABLE_DISTINCT` (
  `ID` BIGINT(20) NOT NULL,
  `REUSED` TINYINT(4) NOT NULL DEFAULT '0'
)
Code Block
INSERT INTO FILES_TABLE_DISTINCT SELECT DISTINCT F.ID, 0 FROM FILES_TABLES F
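As a quick sanity check (not recorded as part of the original run), the row count of the new table should match the distinct file handle count reported above:

Code Block

-- Expected to return 27624848, the number of distinct file handles referenced by tables
SELECT COUNT(*) FROM FILES_TABLE_DISTINCT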
Note that we added a flag to indicate whether a file handle is “reused”: since we have the object replication data we can verify whether any of the file handles referenced in tables are also used by file entities, using a query such as the following (we first created an index on FILE_ID in the object replication table and on ID in FILES_TABLE_DISTINCT):
Code Block
SELECT COUNT(DISTINCT F.ID) FROM FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID
We have 49169 file handles referenced in tables that are also used in file entities. We can update the table above with this information:
Code Block
UPDATE FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID SET REUSED = TRUE
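After the update, the split between reused and table-only file handles can be checked with a simple aggregation (a sketch, not part of the original analysis):

Code Block

-- REUSED = 1 means the file handle is also used by a file entity; expect 49169 such rows
SELECT REUSED, COUNT(*) FROM FILES_TABLE_DISTINCT GROUP BY REUSED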
We created a Glue job similar to the one above to export this table to S3 so that we can join on the S3 data to compute the size claimed by tables (TODO).
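Once the export is in place, the size claimed exclusively by tables could be computed with a join along these lines (a sketch only; it reuses the FILES_PROD file handle data from the entity query above and excludes the file handles already counted as entities):

Code Block

-- Count and size of file handles referenced by tables but not by file entities
SELECT COUNT(*), SUM(F.CONTENT_SIZE)
FROM FILES_TABLE_DISTINCT D
  JOIN FILES_PROD F ON D.ID = F.ID
WHERE D.REUSED = 0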
Multipart Uploads
In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about them:
...