...

  • Unfinished multipart uploads: We currently store parts in temporary objects in S3. The backend later copies each part into the multipart upload and deletes the temporary parts once the multipart upload is completed. If the multipart upload is never finished, the temporary parts are never deleted. Additionally, S3 keeps the multipart upload data "hidden" until the upload is completed or aborted. As of November 10th we have 1417823 uncompleted multipart uploads, of which 1414593 were initiated before October 30th (see attached script).

  • Data that can be considered temporary: For example, bulk download packages might end up stored in the production bucket. These zip packages are most likely used once and never re-downloaded.

  • Data from staging: Data created in staging is removed after migrations, but of course the data in S3 is left intact.

  • Old data already present in the bucket and never cleaned up.
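
As a rough illustration of how the unfinished multipart uploads could be split by age, the sketch below filters an in-memory list of uploads by their initiation time. It is not the attached script: the `PendingUpload` record, its fields, and the `StaleUploadFilter` class are hypothetical names, and the list is assumed to have been fetched from S3 beforehand by some other means.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory view of one unfinished multipart upload.
// The field names are assumptions made for this sketch.
record PendingUpload(String key, String uploadId, Instant initiated) {}

class StaleUploadFilter {
    // Returns the uploads that were initiated strictly before the cutoff.
    static List<PendingUpload> initiatedBefore(List<PendingUpload> uploads, Instant cutoff) {
        return uploads.stream()
                .filter(u -> u.initiated().isBefore(cutoff))
                .collect(Collectors.toList());
    }
}
```

Calling `StaleUploadFilter.initiatedBefore(uploads, cutoff)` with a cutoff of October 30th would give the subset that is old enough to be considered abandoned; an upload initiated exactly at the cutoff is not included.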

...

An alternative is to set up a Glue job that periodically dumps the files table to S3, similar to the S3 inventory. A second job would then join the two tables using Athena to find un-indexed data. Note, however, that we need to make sure that no temporary data (e.g. the multipart upload parts) is included in the join, e.g. by filtering on sensible dates.
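
The join logic can be sketched in memory to make the filtering explicit. This is only an illustration of the anti-join, not the Athena query itself: `InventoryEntry`, `UnindexedFinder`, the `tempPrefix` parameter, and the idea that temporary parts live under a single key prefix are all assumptions made for the example.

```java
import java.time.Instant;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// One row of a hypothetical S3 inventory listing (assumed shape).
record InventoryEntry(String key, Instant lastModified) {}

class UnindexedFinder {
    // Returns the keys present in the inventory but absent from the files table.
    // Objects under tempPrefix, or modified at or after the cutoff, are skipped
    // so that in-flight temporary data is not reported as un-indexed.
    static List<String> findUnindexed(List<InventoryEntry> inventory,
                                      Set<String> indexedKeys,
                                      String tempPrefix,
                                      Instant cutoff) {
        return inventory.stream()
                .filter(e -> !e.key().startsWith(tempPrefix))
                .filter(e -> e.lastModified().isBefore(cutoff))
                .filter(e -> !indexedKeys.contains(e.key()))
                .map(InventoryEntry::key)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

The date cutoff plays the role of the "sensible dates" filter: anything modified recently may still belong to an upload in progress, so it is left out of the result rather than flagged.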

Attached file: MultipartUploadsScanner.java