...

  • The file scanner continuously exports associations (currently all associations are scanned every 6 days, in both prod and staging), using Kinesis and a Glue table to convert the records to Parquet

  • A worker exports the file handle data, including id and updatedOn, using Kinesis and a Glue table to convert the records to Parquet

  • Once a month (at night on the first Monday of the month) we run an Athena query that joins the two Glue tables to find all the file handles that have no association (a left join with a null check). The id of the query is sent to the backend so that the results can be processed

  • A worker processes the Athena query results and flags the file handles as unlinked (only handles that are in the available status)
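
The last two steps can be sketched as follows. This is a minimal illustration, not the actual implementation: the table names, column names, the `AVAILABLE` status literal, and both helper functions are assumptions made for the example. The SQL string shows the shape of a left join with a null check, and the Python functions reproduce the same logic in memory.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical Athena query: table and column names are assumptions.
UNLINKED_QUERY = """
SELECT f.id
FROM file_handles f
LEFT JOIN associations a ON a.file_handle_id = f.id
WHERE a.file_handle_id IS NULL
"""


def unlinked_ids(file_handle_ids: Iterable[int],
                 associated_ids: Iterable[int]) -> Set[int]:
    """In-memory equivalent of the left join with a null check:
    keep the file handle ids that appear in no association."""
    linked = set(associated_ids)
    return {fid for fid in file_handle_ids if fid not in linked}


def flag_unlinked(statuses: Dict[int, str], candidates: Set[int]) -> List[int]:
    """Flag only candidates that are currently in the available status.
    Mutates `statuses` and returns the ids that were flagged."""
    flagged = []
    for fid in sorted(candidates):
        if statuses.get(fid) == "AVAILABLE":
            statuses[fid] = "UNLINKED"
            flagged.append(fid)
    return flagged


# Example: handles 2 and 3 have no association, but only 2 is available.
statuses = {1: "AVAILABLE", 2: "AVAILABLE", 3: "ARCHIVED"}
candidates = unlinked_ids(statuses.keys(), [1, 1])
flagged = flag_unlinked(statuses, candidates)
```

Keeping the status check in the worker rather than in the query mirrors the split described above: the query only discovers candidates, and the worker decides which of them may be flagged.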

Alternatives Considered:
  1. When we run the scanner we update a timestamp on the file handle directly in the DB; a worker then collects the file handles that have not been updated in more than 30 days and flags them as unlinked. This was the original option. It was dismissed because of the risk of trashing migration.

  2. Keep a consistent list of links (e.g. add/remove links according to the objects that work on file handles). We dismissed it because of the complexity of maintaining a reference count index; additionally, we would duplicate data in the DB and migration would suffer from it.

  3. Use of S3 infrastructure and object tags: The idea was to use object tags and a lifecycle configuration, basically tagging objects when they were scanned. We dismissed this because updating a tag does not update the last-modified date of the object, and lifecycles in S3 buckets are based on it; we would have had to come up with a complex mechanism to update the lifecycle configuration periodically. Additionally, the cost of the tagging operations would add up over time, especially if scans are run frequently on all the objects.
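
For comparison, the selection rule of the first alternative (a scan timestamp stored on each file handle) could look like the sketch below. The function name and the shape of the input mapping are hypothetical; only the 30-day threshold comes from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_handle_ids(last_scanned: Dict[int, datetime],
                     now: datetime,
                     max_age: timedelta = timedelta(days=30)) -> List[int]:
    """Return ids whose last scan timestamp is more than `max_age` old."""
    cutoff = now - max_age
    return sorted(fid for fid, ts in last_scanned.items() if ts < cutoff)


# Example with a fixed "now" so the result is deterministic.
now = datetime(2021, 6, 1, tzinfo=timezone.utc)
last_scanned = {
    1: now - timedelta(days=5),    # scanned recently
    2: now - timedelta(days=31),   # not touched for more than 30 days
    3: now - timedelta(days=30),   # exactly 30 days, not yet stale
}
stale = stale_handle_ids(last_scanned, now)
```

The rule itself is simple. The cost lies in the scanner writing a timestamp to the DB for every file handle on every scan, which is the part that conflicts with migration.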

...