...
Unfinished multipart uploads: We currently store parts in temporary objects in S3; the backend later copies each part over to the multipart upload and deletes the temporary parts when the multipart upload is completed. If the multipart upload is never finished, the temporary parts are never deleted. Additionally, S3 keeps the multipart upload data “hidden” until the multipart upload is completed or aborted. As of November 10th we have 1417823 uncompleted multipart uploads, of which 1414593 were initiated before October 30th.
Data that can be considered temporary: For example, bulk download packages might end up stored in the production bucket; these zip packages are most likely used once and never re-downloaded.
Data from staging: Data created in staging is removed after migrations, but the corresponding data in S3 is left intact.
Old data already present in the bucket and never cleaned up.
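For the unfinished multipart uploads item above, a count like the one quoted could be reproduced with a sketch along these lines. It assumes boto3 and suitable AWS credentials; the function names and the bucket name in the usage note are hypothetical and are not taken from the backend code.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator


def count_stale_uploads(uploads: Iterable[dict], cutoff: datetime) -> int:
    """Count uploads whose 'Initiated' timestamp is strictly before the cutoff."""
    return sum(1 for upload in uploads if upload["Initiated"] < cutoff)


def iter_incomplete_uploads(bucket: str) -> Iterator[dict]:
    """Yield every in-progress multipart upload in the bucket.

    Requires boto3 and AWS credentials; imported lazily so that the
    counting helper above can be used without AWS access.
    """
    import boto3

    paginator = boto3.client("s3").get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        # 'Uploads' is absent when a page has no in-progress uploads.
        yield from page.get("Uploads", [])
```

A call such as `count_stale_uploads(iter_incomplete_uploads("example-prod-bucket"), datetime(2020, 10, 30, tzinfo=timezone.utc))` would count the uploads initiated before October 30th. The cutoff needs to be timezone-aware, since boto3 returns timezone-aware timestamps.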
...
Enable a lifecycle rule on the S3 bucket that automatically removes unfinished multipart uploads (See https://sagebionetworks.jira.com/browse/PLFM-6462). This would also require expiring multipart uploads in the backend (e.g. if the lifecycle rule deleted incomplete uploads after 2 months, the multipart uploads in the backend should be forcibly restarted after, for example, one month).
Refactor the multipart upload to avoid uploading to temporary objects in the bucket (See https://sagebionetworks.jira.com/browse/PLFM-6412)
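As an illustration of the lifecycle item above, a rule that aborts incomplete multipart uploads could be built and applied roughly as follows. This is a minimal sketch assuming boto3; the rule ID, the function names and the 60-day value are illustrative assumptions, not decisions recorded in this document.

```python
def build_abort_incomplete_rule(days_after_initiation: int) -> dict:
    """Build a lifecycle configuration that aborts incomplete multipart uploads."""
    if days_after_initiation < 1:
        raise ValueError("days_after_initiation must be at least 1")
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: applies to the whole bucket
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": days_after_initiation
                },
            }
        ]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Apply the configuration (requires boto3 and AWS credentials).

    Note: this call replaces any lifecycle configuration already set on
    the bucket, so existing rules would have to be merged in first.
    """
    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

With a 2 month window this would be `apply_lifecycle(bucket, build_abort_incomplete_rule(60))`. The rule only aborts the multipart uploads themselves; parts stored as regular temporary objects are not covered by it and would still need separate cleanup.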
...
Streams to S3 the keys of the file handles that point to the prod bucket and were created between two given dates (e.g. between the last time the job was run and one month before the current time)
Use Athena to join such data on the latest inventory (filtering again by the given dates) to identify un-indexed data
Delete from S3 the un-indexed data (or move it to a low cost storage class bucket, with an automatic deletion policy)
An alternative is to set up a Glue job that periodically dumps the files table to S3, similar to the S3 inventory; a second job would then join the two tables using Athena to find un-indexed data. Note, however, that we need to make sure that no temporary data (e.g. the multipart upload parts) is included in the join, e.g. by filtering on sensible dates.
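The Athena join described above amounts to an anti-join between the inventory and the file handle keys, restricted to a date window. The sketch below shows one possible shape for it. The table names `s3_inventory` and `file_handle_keys`, the column names, and the assumption that `last_modified_date` is a timestamp column are all hypothetical; selecting the latest inventory partition is left out.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_unindexed_query(
    start: datetime,
    end: datetime,
    inventory_table: str = "s3_inventory",      # hypothetical table name
    file_keys_table: str = "file_handle_keys",  # hypothetical table name
) -> str:
    """Build a query listing inventory keys that have no matching file handle."""
    if start >= end:
        raise ValueError("start must be before end")
    start_literal = start.strftime(TIMESTAMP_FORMAT)
    end_literal = end.strftime(TIMESTAMP_FORMAT)
    # LEFT JOIN plus IS NULL keeps only inventory rows without a file handle.
    # The date window keeps recent temporary data (e.g. parts of in-flight
    # multipart uploads) out of the result.
    return (
        f"SELECT inv.key\n"
        f"FROM {inventory_table} inv\n"
        f"LEFT JOIN {file_keys_table} fh ON inv.key = fh.key\n"
        f"WHERE fh.key IS NULL\n"
        f"  AND inv.last_modified_date >= TIMESTAMP '{start_literal}'\n"
        f"  AND inv.last_modified_date < TIMESTAMP '{end_literal}'"
    )


def start_query(query: str, database: str, output_location: str) -> str:
    """Submit the query to Athena and return its execution id (requires boto3)."""
    import boto3

    response = boto3.client("athena").start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The resulting keys would be candidates for deletion or for a move to a low cost storage class, and should be reviewed before any destructive step is automated.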