S3 Bucket Analysis

This document collects the various steps taken to gauge the amount of data in the S3 proddata.sagebase.org bucket and to compare it with the data indexed in the Synapse database.

  • A snapshot of both the prod database and the tables database was taken on stack 332 on November 5th.

  • The S3 inventory was enabled on production and a snapshot was produced on November 5th.

S3 Bucket

Data reported by the S3 metrics on November 5th (from the CloudWatch metrics for the bucket):

  • Storage: 646TB

  • Number of objects: 43,536,933

S3 Inventory

The inventory was enabled on the bucket (see https://sagebionetworks.jira.com/browse/PLFM-6426).

We created a database named files_inventory in the Glue catalog. In Athena we created an external table for the S3 inventory:

CREATE EXTERNAL TABLE prod_inventory(
  `bucket` string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  is_multipart_uploaded boolean
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://prod.inventory.sagebase.org/inventory/proddata.sagebase.org/defaultInventory/hive';

We ran the repair on the table:

MSCK REPAIR TABLE prod_inventory

We checked how many objects and how much data are reported (dt is the date when the inventory snapshot was generated):

SELECT dt, COUNT(*), SUM(size) FROM prod_inventory GROUP BY dt

Results:

dt               | count    | size
2020-11-05-00-00 | 41625517 | 704212857157112
2020-11-08-00-00 | 41654050 | 708573173690177
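
For a rough comparison with the CloudWatch storage metric above, the same aggregation can also report the size in TiB (a minimal variant of the previous query, not part of the original results; 704212857157112 bytes is roughly 640 TiB):

SELECT dt, COUNT(*) AS object_count, SUM(size) / POW(1024, 4) AS size_tib
FROM prod_inventory GROUP BY dt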

The inventory also reports whether an object was uploaded as a multipart upload; this would give us an idea of how many objects were not uploaded through the standard Synapse upload API:

SELECT dt, COUNT(*), SUM(size) FROM prod_inventory WHERE is_multipart_uploaded = true 
GROUP BY dt

dt               | count   | size
2020-11-05-00-00 | 5705476 | 527196059092709
2020-11-08-00-00 | 5712315 | 531513092323401

This result is surprising: only 5.7M objects appear to be multipart uploads, while we would expect an order of magnitude more. What is going on?

On further analysis we checked a few of those files and could see that they were in fact normal multipart uploads in the DB, with their relative file handles. The reason for this inconsistency is that we encrypted the S3 bucket back in 2019, most likely using a PUT copy of the same object. This is supported by the fact that the modified dates on those objects are consistent with the timeline of the encryption, while the original upload in Synapse happened earlier. If the Python API was used, most likely all the objects smaller than a certain size were copied over without multipart.
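
A query along the following lines could be used to check this theory (a sketch against the same prod_inventory table; its output was not recorded here): if the encryption copy explains the flag, the last-modified dates of the non-multipart objects should cluster around 2019.

SELECT year(last_modified_date) AS modified_year, COUNT(*) AS object_count, SUM(size) AS total_size
FROM prod_inventory
WHERE dt = '2020-11-05-00-00' AND is_multipart_uploaded = false
GROUP BY year(last_modified_date)
ORDER BY modified_year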

Synapse File Handles

We created a table that contains only the file handles pointing to the proddata.sagebase.org bucket and that also includes a de-duplication identifier:

CREATE TABLE FILES_PROD AS 
SELECT *, ROW_NUMBER() OVER(PARTITION BY BUCKET_NAME, `KEY` ORDER BY ID) AS D_ID 
FROM FILES WHERE BUCKET_NAME = 'proddata.sagebase.org'

Count and size of the file handles that are indexed for the prod bucket:

SELECT COUNT(*) AS TOTAL_COUNT, SUM(CONTENT_SIZE) AS TOTAL_SIZE FROM FILES_PROD

Count    | Size
47106657 | 746862657773610 (679TB)

Excluding duplicated file handles pointing to the same key:

SELECT COUNT(*) AS TOTAL_COUNT, SUM(CONTENT_SIZE) AS TOTAL_SIZE FROM FILES_PROD WHERE D_ID = 1

Count    | Size
39426647 | 695938050033098 (633TB)
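
For completeness, the duplicated portion alone (file handles whose key is already counted by another file handle) can be obtained directly with a query like the following (a sketch, not part of the original analysis):

SELECT COUNT(*) AS DUPLICATE_COUNT, SUM(CONTENT_SIZE) AS DUPLICATE_SIZE FROM FILES_PROD WHERE D_ID > 1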

Multipart Uploads

In prod we do not clean up unfinished multipart uploads; the following script was used to fetch some statistics about them:

package org.sagebionetworks.project;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

import org.sagebionetworks.aws.MultiFactorAuthenticationCredentialProvider;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public class MultipartUploadsScanner {
	
	private AmazonS3 s3Client;
	
	public static void main(String[] args) {

		String mfaDeviceArn = "mfaDeviceArn";
		
		AmazonS3 s3Client = buildS3Client(mfaDeviceArn);
		
		String bucket = "proddata.sagebase.org";
		
		new MultipartUploadsScanner(s3Client).scan(bucket);
	}
		
	private static AmazonS3 buildS3Client(String mfaDeviceArn) {	
		AmazonS3ClientBuilder builder = AmazonS3ClientBuilder.standard();
		builder.withCredentials(new MultiFactorAuthenticationCredentialProvider(mfaDeviceArn));
		builder.withRegion(Regions.US_EAST_1);
		return builder.build();
	}

	
	public MultipartUploadsScanner(AmazonS3 s3Client) {
		this.s3Client = s3Client;
	}
	
	void scan(String bucket) {
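		// Pages through all the in-progress multipart uploads in the bucket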
		
		ListMultipartUploadsRequest request = new ListMultipartUploadsRequest(bucket);
		
		boolean hasNext = true;
		String keyMarker = null;
		String uploadIdMarker = null;
		
		int totalCount = 0;
		int totalOldCount = 0;
		
		// Threshold used to flag uploads initiated more than 30 days ago
		Date oneMonthAgo = Date.from(Instant.now().minus(30, ChronoUnit.DAYS));
		
		while (hasNext) {
			
			request.setKeyMarker(keyMarker);
			request.setUploadIdMarker(uploadIdMarker);
			
			MultipartUploadListing result = s3Client.listMultipartUploads(request);
			
			int count = result.getMultipartUploads().size();
			long oldCount = result.getMultipartUploads().stream().filter( u-> u.getInitiated().before(oneMonthAgo)).count();
			
			totalCount += count;
			totalOldCount += oldCount;
			
			System.out.println(String.format("Batch Count: %s, Old Count: %s (Total: %s, Total Old: %s)", count, oldCount, totalCount, totalOldCount));
			
			if (result.isTruncated()) {
				keyMarker = result.getNextKeyMarker();
				uploadIdMarker = result.getNextUploadIdMarker();
				System.out.println(String.format("Has next: %s, %s", uploadIdMarker, keyMarker));
			} else {
				hasNext = false;
			}
		}
		
		System.out.println(String.format("Number of multipart uploads: %s (Old: %s)", totalCount, totalOldCount));
	}

}

Not Completed (to date) | Not Completed (started more than 30 days ago)
1417823                 | 1414593

So we have about 1.4M multipart uploads that are taking up storage but were never completed; we could not compute the size of the uploaded parts. Since we upload temporary objects for each part, this might also account for the difference between the size reported by S3 and the size indexed in the file handles table.
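
A possible follow-up, sketched below but not run as part of this analysis, would be to extend the MultipartUploadsScanner above to sum the sizes of the parts already uploaded for each in-progress upload, using the S3 listParts API (additional imports from com.amazonaws.services.s3.model: ListPartsRequest, PartListing, PartSummary):

	// Sketch: sums the sizes of the parts already uploaded for a single in-progress
	// multipart upload; the key and upload id come from each MultipartUpload returned
	// by listMultipartUploads in the scan loop above.
	long sumPartsSize(String bucket, String key, String uploadId) {
		ListPartsRequest request = new ListPartsRequest(bucket, key, uploadId);
		long totalSize = 0;
		PartListing listing;
		do {
			listing = s3Client.listParts(request);
			totalSize += listing.getParts().stream().mapToLong(PartSummary::getSize).sum();
			// Continue from the last part number when the listing is truncated
			request.setPartNumberMarker(listing.getNextPartNumberMarker());
		} while (listing.isTruncated());
		return totalSize;
	}

Summing this value over all the uploads reported by the scan would give an estimate of how much of the S3 storage is taken by parts of uploads that were never completed.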
