...

Storing the data in an archiving tier (e.g. Glacier) could be an option, but it complicates restores (the objects first need to be restored) and access to the data becomes expensive and slow. S3 Intelligent-Tiering automatically monitors usage patterns and moves objects between access tiers; fetching the data has no associated cost, but we would pay for monitoring ($0.1 per million objects/month). For this use case it makes more sense to store the data directly in the infrequent access storage class, rather than starting in standard storage, and to use custom lifecycle rules to move it to other classes (optionally an archiving tier) or delete the objects, since the Intelligent-Tiering class does not (yet) offer fine-grained control over how the data is moved between classes.
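As a rough illustration of the custom lifecycle approach, the sketch below builds a lifecycle configuration in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID, the prefix and the day counts are hypothetical placeholders, not values decided in this design.

```python
from typing import Any, Dict, Optional


def build_lifecycle_configuration(
    prefix: str,
    archive_after_days: Optional[int] = None,
    expire_after_days: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a single-rule S3 lifecycle configuration.

    The rule optionally transitions objects under `prefix` to an archiving
    tier and/or deletes them after a number of days.
    """
    if archive_after_days is None and expire_after_days is None:
        # S3 rejects a rule that defines no action at all.
        raise ValueError("at least one of archive/expire must be set")

    rule: Dict[str, Any] = {
        "ID": "unlinked-file-handles",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
    }
    if archive_after_days is not None:
        rule["Transitions"] = [
            {"Days": archive_after_days, "StorageClass": "GLACIER"}
        ]
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


if __name__ == "__main__":
    # Hypothetical prefix and retention values, for illustration only.
    config = build_lifecycle_configuration(
        "archive/", archive_after_days=90, expire_after_days=365
    )
    print(config)
    # With boto3 installed and credentials configured, this could be applied
    # with put_bucket_lifecycle_configuration(Bucket=...,
    # LifecycleConfiguration=config). Objects can be written straight to the
    # infrequent access class by passing StorageClass="STANDARD_IA" to
    # put_object or copy_object.
```

Keeping the rule construction separate from the S3 call makes it easy to review the exact transitions before they are applied to a bucket.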

We introduce a STATUS column in the file handle table that can have the following values:

...

TODO: We enabled the S3 inventory to check how much data is not indexed. According to https://sagebionetworks.jira.com/browse/PLFM-510, as of Sept 2020 21% of files were not accounted for by Synapse projects (668 TB in the bucket vs 530 TB); the rest might be file handles in tables or other associated objects (see S3 Bucket Analysis for further analysis).

This might happen because of various reasons:

  • Unfinished multipart uploads: We currently store parts in temporary objects in S3; the backend later copies each part over to the multipart upload and deletes the temporary parts when the multipart upload is completed. If the multipart upload is never finished, the parts are never deleted. Additionally, S3 keeps the multipart upload data “hidden” until the multipart upload is completed or aborted. As of November 10th we have 1417823 uncompleted multipart uploads, of which 1414593 were initiated more before October 30st 11th (see attached script S3 Bucket Analysis).

  • Data that can be considered temporary: For example, bulk download packages might end up stored in the production bucket; these zip packages are most likely used once and never re-downloaded.

  • Data from staging: Data created in staging is removed after migrations, but of course the data in S3 is left intact.

  • Old data already present in the bucket and never cleaned up.
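To make the count of unfinished multipart uploads concrete, here is a minimal sketch that selects uploads initiated before a cutoff date. It assumes records shaped like the entries of the `Uploads` list returned by S3's ListMultipartUploads (as exposed by boto3, with `Key`, `UploadId` and a timezone-aware `Initiated`); the sample records and the cutoff are made up, and this is not the attached analysis script.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List


def uploads_initiated_before(
    uploads: Iterable[Dict[str, Any]], cutoff: datetime
) -> List[Dict[str, Any]]:
    """Return the uploads whose `Initiated` timestamp is strictly before `cutoff`."""
    return [u for u in uploads if u["Initiated"] < cutoff]


if __name__ == "__main__":
    # Hypothetical sample data in the shape of ListMultipartUploads entries.
    sample = [
        {"Key": "a/part", "UploadId": "id-1",
         "Initiated": datetime(2020, 1, 15, tzinfo=timezone.utc)},
        {"Key": "b/part", "UploadId": "id-2",
         "Initiated": datetime(2020, 11, 9, tzinfo=timezone.utc)},
    ]
    # Hypothetical cutoff, chosen only for the example.
    cutoff = datetime(2020, 6, 1, tzinfo=timezone.utc)
    stale = uploads_initiated_before(sample, cutoff)
    print(len(stale), [u["UploadId"] for u in stale])
    # Each stale upload could then be aborted with boto3's
    # abort_multipart_upload(Bucket=..., Key=..., UploadId=...), which
    # releases the hidden part data kept by S3.
```

In practice the records would come from paginating over the bucket's multipart uploads rather than from an in-memory list.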

...

An alternative is to set up a Glue job ** that periodically dumps the files table to S3, similar to the S3 inventory; a second job would then join the two tables using Athena to find un-indexed data. Note, however, that we need to make sure that no temporary data (e.g. the multipart upload parts) is included in the join, e.g. by filtering on sensible dates.

In general we should make sure that any data that ends up in the prod bucket is actually indexed in file handles; for example, I would move the temporary objects used for the multipart upload into their own dedicated bucket.
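The logic of that join can be sketched in memory as an anti-join between the inventory and the file handle keys, with a date filter to leave out recent (possibly temporary) objects. This is only an illustration of the idea, not an Athena query; the record shapes, the function name and the cutoff are assumptions.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Set, Tuple

# (object key, last modified timestamp), as one might read from an inventory.
InventoryRecord = Tuple[str, datetime]


def find_unindexed_keys(
    inventory: Iterable[InventoryRecord],
    indexed_keys: Set[str],
    modified_before: datetime,
) -> List[str]:
    """Return inventory keys that have no file handle.

    Objects modified on or after `modified_before` are skipped, so that
    in-flight data such as multipart upload parts is not reported.
    """
    return [
        key
        for key, last_modified in inventory
        if last_modified < modified_before and key not in indexed_keys
    ]


if __name__ == "__main__":
    # Hypothetical inventory and file handle keys, for illustration only.
    inventory = [
        ("123/file-a", datetime(2019, 5, 1, tzinfo=timezone.utc)),
        ("123/file-b", datetime(2019, 6, 1, tzinfo=timezone.utc)),
        ("456/upload-part", datetime(2020, 11, 9, tzinfo=timezone.utc)),
    ]
    indexed = {"123/file-a"}
    cutoff = datetime(2020, 10, 1, tzinfo=timezone.utc)
    print(find_unindexed_keys(inventory, indexed, cutoff))
```

In Athena the same result would typically be expressed as a left join from the inventory table to the dumped files table, keeping the rows with no match.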

** Unfortunately at the time of writing AWS Glue does not seem to support a connection to MySQL 8.

...


...