...
Note that this number most likely includes temporary objects that were never deleted (e.g. temporary multipart upload files, old tests, staging data, etc.).
Synapse Storage Report and Entities
From the Synapse storage report, which is generated monthly and stored in a Synapse table, we can get an idea of how much data is used by Synapse entities in projects. The following query gets the aggregated sum of the size in bytes for the last 10 months:
...
Note that this report does not account for file handles that are referenced in tables or other objects, nor for file handles that are not linked to anything; together these seem to account for about 70TB of data.
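As a rough cross-check, the unaccounted portion could be estimated with an anti-join of the file handles against the entity revisions. This is a sketch only, using the same FILES_PROD file handle data and JDOREVISION revision table that the entity query on this page relies on; it ignores duplicate copies of the same S3 key, which the entity query does account for:

Code Block

-- Sketch: count and size of file handles that are not referenced by any entity revision
SELECT COUNT(*), SUM(F.CONTENT_SIZE)
FROM FILES_PROD F
  LEFT JOIN JDOREVISION R ON F.ID = R.FILE_HANDLE_ID
WHERE R.FILE_HANDLE_ID IS NULL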
Synapse Tables
TODO: Create a table that contains all the file handles for all the T%F tables in the tables database, export it to S3 and join on the file handles table to get the size.
Running the following query should return similar results, but more in line with the snapshot we are using and excluding any duplicates:
Code Block
WITH
  U_R_F AS (
    SELECT DISTINCT R.FILE_HANDLE_ID FROM JDOREVISION R WHERE R.FILE_HANDLE_ID IS NOT NULL
  ),
  U_F AS (
    SELECT MAX(F.CONTENT_SIZE) AS CONTENT_SIZE FROM U_R_F R JOIN FILES_PROD F ON R.FILE_HANDLE_ID = F.ID GROUP BY F.`KEY`
  )
SELECT COUNT(*), SUM(CONTENT_SIZE) FROM U_F
Count | Size |
---|---|
4659729 | 615280584612739 (559.5TB) |
So out of the 39426647 distinct file handles in prod, which take up 633TB of data, 4659729 (~12%) are used by entities and account for 559.5TB (~88%) of the data. The remaining ~73TB (~12%) of data is potentially linked in tables or other objects, or used by unlinked file handles.
Synapse Tables
Each Synapse table might link to file handles; for data access checks we keep, alongside each table, a companion table that stores the ids of the file handles the table references. To get an idea of how much data is accounted for by Synapse tables we created a table that contains the union of all these companion tables.
We created a container table:
Code Block
CREATE TABLE `FILES_TABLES` (
  `TABLE_ID` BIGINT(20) NOT NULL,
  `ID` BIGINT(20) NOT NULL
)
We then ran the following script, which also records the id of the origin table referencing each file handle:
Code Block
package org.sagebionetworks.project;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.sql.DataSource;

import org.apache.commons.dbcp2.BasicDataSource;
import org.apache.commons.lang3.time.StopWatch;
import org.springframework.jdbc.core.JdbcTemplate;

public class SynapseTablesFileHandlesScanner {

	private static final String DB_DRIVER = "com.mysql.cj.jdbc.Driver";
	private static final String DB_URL = "jdbc:mysql://restored-prod-332-table-0-20201105.c5sxx7pot9i8.us-east-1.rds.amazonaws.com:3306/prod332";
	private static final String DB_USER = "prod332user";
	private static final String DB_PASSWORD = "secret";
	private static final String SCHEMA = "prod332";

	public static void main(String[] args) {
		DataSource dataSource = getDataSource();
		JdbcTemplate jdbcTemplate = getTemplate(dataSource);

		// Pool of workers copying one companion table each, plus a single-thread scheduler for progress reporting
		ExecutorService executorService = Executors.newFixedThreadPool(20);
		ScheduledExecutorService schedulerService = Executors.newScheduledThreadPool(1);

		final SynapseTablesFileHandlesScanner scanner = new SynapseTablesFileHandlesScanner(jdbcTemplate);

		int limit = 20000;

		final List<FileTable> tables = scanner.getFileTables(limit);
		final int size = tables.size();
		final AtomicInteger counter = new AtomicInteger();
		final StopWatch watch = StopWatch.createStarted();

		System.out.println(String.format("Copying data from %d tables...", size));

		tables.forEach(table -> {
			executorService.submit(() -> {
				timedRunnable(() -> {
					int rowNum = scanner.copyFileHandles(table);
					System.out.println(String.format("Finished copying %d rows from table %s", rowNum, table.getName()));
					counter.incrementAndGet();
				}, String.format("Copying table %s (Id: %d, Est Size: %d)", table.getName(), table.getId(), table.getSize()));
			});
			System.out.println(String.format("Table %s submitted.", table.getName()));
		});

		// Print progress every 5 seconds
		schedulerService.scheduleWithFixedDelay(() -> {
			int currentCount = counter.get();
			try {
				System.out.println(String.format("-----\n%d/%d tables processed (%d%%, Elapsed Time: %d sec)\n-----", currentCount, size, (currentCount * 100 / size), watch.getTime(TimeUnit.SECONDS)));
			} catch (Exception e) {
				e.printStackTrace();
			}
		}, 5, 5, TimeUnit.SECONDS);

		terminateExecutorService(executorService, 5, TimeUnit.HOURS);
		terminateExecutorService(schedulerService, 5, TimeUnit.SECONDS);

		watch.stop();

		System.out.println(String.format("Copying data from %d tables...DONE (Took: %d mins)", size, watch.getTime(TimeUnit.MINUTES)));
	}

	private static void terminateExecutorService(ExecutorService executorService, long timeout, TimeUnit unit) {
		executorService.shutdown();
		try {
			if (!executorService.awaitTermination(timeout, unit)) {
				executorService.shutdownNow();
			}
		} catch (InterruptedException e) {
			executorService.shutdownNow();
		}
	}

	private static void timedRunnable(Runnable runnable, String text) {
		StopWatch watch = StopWatch.createStarted();
		System.out.println(String.format("%s...", text));
		try {
			runnable.run();
		} catch (Exception e) {
			System.out.println(String.format("%s...FAILED: %s", text, e.getMessage()));
			e.printStackTrace();
		}
		watch.stop();
		System.out.println(String.format("%s...DONE (Took: %d sec)", text, watch.getTime(TimeUnit.SECONDS)));
	}

	private static JdbcTemplate getTemplate(DataSource dataSource) {
		return new JdbcTemplate(dataSource);
	}

	private static DataSource getDataSource() {
		BasicDataSource ds = new BasicDataSource();
		ds.setDriverClassName(DB_DRIVER);
		ds.setUsername(DB_USER);
		ds.setPassword(DB_PASSWORD);
		ds.setUrl(DB_URL);
		ds.setMinIdle(5);
		ds.setMaxIdle(50);
		ds.setMaxTotal(50);
		return ds;
	}

	// Copies the (origin table id, file handle id) pairs from the given companion table into FILES_TABLES
	private int copyFileHandles(FileTable table) {
		return jdbcTemplate.update("INSERT INTO FILES_TABLES SELECT " + table.getId() + ", FILE_ID FROM " + table.getName());
	}

	private JdbcTemplate jdbcTemplate;

	public SynapseTablesFileHandlesScanner(JdbcTemplate jdbcTemplate) {
		this.jdbcTemplate = jdbcTemplate;
	}

	// Lists the T%F companion tables (with their estimated row count) from the information schema
	public List<FileTable> getFileTables(int limit) {
		StringBuilder sqlBuilder = new StringBuilder()
				.append("SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES T WHERE T.TABLE_SCHEMA = \"").append(SCHEMA).append("\"")
				.append(" AND T.TABLE_NAME LIKE 'T%F'")
				.append(" ORDER BY T.TABLE_NAME LIMIT ").append(limit);
		return jdbcTemplate.query(sqlBuilder.toString(), (rs, rNum) -> new FileTable(rs.getString(1), rs.getLong(2)));
	}

	private static class FileTable {

		private static final Pattern FILES_TABLE_PATTERN = Pattern.compile("T(\\d+)(_\\d+)?F");

		private Long id;
		private String name;
		private Long size;

		public FileTable(String name, Long size) {
			this.id = parseId(name);
			this.name = name;
			this.size = size;
		}

		private Long getId() {
			return id;
		}

		public String getName() {
			return name;
		}

		public Long getSize() {
			return size;
		}

		// Extracts the source table id from the companion table name (e.g. T123F or T123_2F)
		private static Long parseId(String name) {
			Matcher matcher = FILES_TABLE_PATTERN.matcher(name);
			if (matcher.matches()) {
				return Long.valueOf(matcher.group(1));
			}
			throw new IllegalArgumentException("Invalid table name " + name);
		}
	}
}
The script scanned around 10.5K tables, out of which about 1.3K contained file handles. The process took roughly 4 minutes and exported 36173253 file handle references; the biggest table contains 7093160 file handles, and the number of distinct file handles is 27624848.
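The figures above can be re-derived from the FILES_TABLES container table with aggregations along these lines (a sketch, not part of the original run):

Code Block

-- Tables that contained at least one file handle
SELECT COUNT(DISTINCT TABLE_ID) FROM FILES_TABLES;
-- Total file handle references copied
SELECT COUNT(*) FROM FILES_TABLES;
-- Biggest table by number of file handles
SELECT TABLE_ID, COUNT(*) AS CNT FROM FILES_TABLES GROUP BY TABLE_ID ORDER BY CNT DESC LIMIT 1;
-- Distinct file handles referenced by tables
SELECT COUNT(DISTINCT ID) FROM FILES_TABLES;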
We then created an additional FILES_TABLE_DISTINCT table to store the distinct file handle ids:
Code Block
CREATE TABLE `FILES_TABLE_DISTINCT` (
  `ID` BIGINT(20) NOT NULL,
  `REUSED` TINYINT(4) NOT NULL DEFAULT '0'
)
Code Block
INSERT INTO FILES_TABLE_DISTINCT SELECT DISTINCT F.ID, 0 FROM FILES_TABLES F
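As a quick sanity check (not recorded as part of the original run), the row count of the new table should match the distinct file handle count reported above:

Code Block

-- Expected to return 27624848, the number of distinct file handles referenced by tables
SELECT COUNT(*) FROM FILES_TABLE_DISTINCT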
Note that we added a flag to indicate whether a file handle is “reused”: since we have the object replication data we can verify whether any of the file handles referenced in tables are also used by file entities, using a query such as the following (we first created an index on FILE_ID in the object replication table and on ID in FILES_TABLE_DISTINCT):
Code Block
SELECT COUNT(DISTINCT F.ID) FROM FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID
We have 49169 file handles referenced in tables that are also used in file entities. We can update the table above with this information:
Code Block
UPDATE FILES_TABLE_DISTINCT F JOIN OBJECT_REPLICATION R ON F.ID = R.FILE_ID SET REUSED = TRUE
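After the update, the split between reused and table-only file handles can be checked with a simple aggregation (a sketch, not part of the original analysis):

Code Block

-- REUSED = 1 means the file handle is also used by a file entity; expect 49169 such rows
SELECT REUSED, COUNT(*) FROM FILES_TABLE_DISTINCT GROUP BY REUSED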
We created a Glue job similar to the one above to export this table to S3 so that we can join on the S3 data to compute the size claimed by tables (TODO).
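Once the export is in place, the size claimed exclusively by tables could be computed with a join along these lines (a sketch only; it reuses the FILES_PROD file handle data from the entity query above and excludes the file handles already counted as entities):

Code Block

-- Count and size of file handles referenced by tables but not by file entities
SELECT COUNT(*), SUM(F.CONTENT_SIZE)
FROM FILES_TABLE_DISTINCT D
  JOIN FILES_PROD F ON D.ID = F.ID
WHERE D.REUSED = 0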
Multipart Uploads
In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about them:
...