...

  • UnlinkedFileHandleDetectionWorker: This worker scans the CREATED and LINKED file handles and flags as UNLINKED those whose last-seen time (STATUS_TIMESTAMP) is more than X days earlier than the last time the scan was performed, as long as no other scan is ongoing and the scan completed successfully. The job should be triggered by a remote request and run periodically. It can also be parameterized with the bucket name (e.g. for now we can use it on the prod bucket, but we might want to enable it for other buckets as well).

  • UnlinkedFileHandleArchiveWorker: This job will fetch the UNLINKED file handles whose STATUS_TIMESTAMP is older than X days (e.g. only archive files that have been unlinked for 30 days) and:

    • Update their status to ARCHIVING

    • Copy the file from the source bucket to the archive bucket. The destination key is prefixed with the source bucket name, e.g. proddata.sagebase.org/key1 → archive.sagebase.org/proddata.sagebase.org/key1

    • Update their status to ARCHIVED

    • Delete the file from the source bucket
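
The two workers above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the class name, the method names, the in-memory FileHandle and the ObjectStore interface (a stand-in for the real S3 calls) are all hypothetical. Only the status names, the X-day threshold and the destination key layout come from the description above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: every name here is illustrative, not taken from the real code base.
final class FileHandleWorkersSketch {

    enum Status { CREATED, LINKED, UNLINKED, ARCHIVING, ARCHIVED }

    /** Hypothetical storage abstraction; a real worker would call S3 here. */
    interface ObjectStore {
        void copy(String srcBucket, String srcKey, String dstBucket, String dstKey);
        void delete(String bucket, String key);
    }

    /** Minimal in-memory stand-in for a file handle record. */
    static final class FileHandle {
        final String bucket;
        final String key;
        Status status;

        FileHandle(String bucket, String key, Status status) {
            this.bucket = bucket;
            this.key = key;
            this.status = status;
        }
    }

    /**
     * Detection rule: a CREATED or LINKED handle is a candidate for UNLINKED when its
     * STATUS_TIMESTAMP is more than the threshold (X days) before the last scan,
     * provided no scan is running and the last scan completed successfully.
     */
    static boolean isUnlinkedCandidate(Status status, Instant statusTimestamp,
                                       Instant lastScanTime, Duration threshold,
                                       boolean lastScanSucceeded, boolean scanInProgress) {
        if (scanInProgress || !lastScanSucceeded) {
            return false;
        }
        if (status != Status.CREATED && status != Status.LINKED) {
            return false;
        }
        return statusTimestamp.isBefore(lastScanTime.minus(threshold));
    }

    /** The destination key is the source key prefixed with the source bucket name. */
    static String archiveKey(String sourceBucket, String sourceKey) {
        return sourceBucket + "/" + sourceKey;
    }

    /** Archive steps in the order listed: ARCHIVING, copy, ARCHIVED, delete. */
    static void archive(FileHandle handle, String archiveBucket, ObjectStore store) {
        if (handle.status != Status.UNLINKED) {
            throw new IllegalStateException("Only UNLINKED handles can be archived");
        }
        handle.status = Status.ARCHIVING;
        store.copy(handle.bucket, handle.key, archiveBucket, archiveKey(handle.bucket, handle.key));
        handle.status = Status.ARCHIVED;
        store.delete(handle.bucket, handle.key);
    }
}
```

The sketch leaves out persistence, batching and failure handling. In particular, a handle that stays in ARCHIVING after a failed copy would need a recovery path that is not shown here.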

...

An alternative is to set up a Glue job** that periodically dumps the files table to S3, similar to the S3 inventory; a second job would then join the two tables using Athena to find un-indexed data. Note, however, that we need to make sure that no temporary data (e.g. the multipart upload parts) is included in the join, e.g. by filtering on sensible dates.

** Unfortunately at the time of writing AWS Glue does not seem to support a connection to MySQL 8.
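
The logic of that join can be illustrated with a small in-memory sketch. It is not an Athena query, only the same anti-join expressed in Java: keys that appear in the inventory but not in the files table dump are reported, and entries modified on or after a cutoff are skipped so that in-flight temporary data is not counted. The class name, method name and the cutoff parameter are assumptions made for illustration.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the anti-join; in the proposal this would run as an Athena query.
final class UnindexedDataSketch {

    /**
     * @param inventory      object key -> last modified time, as listed in the S3 inventory
     * @param indexedKeys    keys present in the dump of the files table
     * @param modifiedBefore cutoff used to exclude recent (possibly temporary) objects
     * @return sorted keys that are in the inventory, older than the cutoff, and not indexed
     */
    static List<String> findUnindexedKeys(Map<String, Instant> inventory,
                                          Set<String> indexedKeys,
                                          Instant modifiedBefore) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : inventory.entrySet()) {
            boolean oldEnough = entry.getValue().isBefore(modifiedBefore);
            if (oldEnough && !indexedKeys.contains(entry.getKey())) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```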

Attached file: MultipartUploadsScanner.java

...