...
The file scanner continuously exports associations (currently all the associations are scanned every 6 days, in both prod and staging), using Kinesis and a Glue table to convert the records to Parquet.
A worker exports the file handle data, including id and updatedOn, using Kinesis and a Glue table to convert the records to Parquet.
Once a month (at night on the 1st Monday of the month) we run an Athena query that joins the two Glue tables to discover all the file handles that do not have any associations (left join with a null check). The id of the query is sent to the backend so that the results can be processed.
A worker processes the Athena query results and flags the file handles as unlinked (only handles that are in the available status).
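The join and the flagging step above can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database instead of Athena, and the table names (`file_handles`, `associations`), the column names and the helper functions are hypothetical, not the actual Glue schema or backend code.

```python
import sqlite3

# Hypothetical schema: stands in for the two Glue tables described above.
UNLINKED_QUERY = """
    SELECT f.id, f.status
    FROM file_handles f
    LEFT JOIN associations a ON a.file_handle_id = f.id
    WHERE a.file_handle_id IS NULL  -- null check: no association matched
    ORDER BY f.id
"""


def build_demo_db():
    """Create a small in-memory database with sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE file_handles (id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE associations (object_id TEXT, file_handle_id INTEGER);
        INSERT INTO file_handles VALUES
            (1, 'AVAILABLE'), (2, 'AVAILABLE'), (3, 'UNLINKED'), (4, 'AVAILABLE');
        INSERT INTO associations VALUES ('obj-a', 1), ('obj-b', 1), ('obj-c', 4);
        """
    )
    return conn


def find_handles_without_associations(conn):
    """Return (id, status) for every file handle with no association."""
    return conn.execute(UNLINKED_QUERY).fetchall()


def handles_to_flag(rows):
    """Keep only the handles that are still in the available status."""
    return [handle_id for handle_id, status in rows if status == "AVAILABLE"]


if __name__ == "__main__":
    rows = find_handles_without_associations(build_demo_db())
    print(handles_to_flag(rows))
```

In this sample data handle 1 has two associations and handle 4 has one, so neither is returned by the query. Handles 2 and 3 have none, but only handle 2 is still available and would be flagged.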
Considered Alternatives
...
:
Update a timestamp on the file handle directly in the DB when the scanner runs; a worker then collects the file handles that have not been updated in more than 30 days and flags them as unlinked: This was the original option. It was dismissed because of the risk of trashing migration.
Keep a consistent list of links (e.g. add/remove links according to the objects that work on file handles): We dismissed it because of the complexity of maintaining a reference count index. Additionally, we would duplicate data in the DB and migration would suffer from it.
Use of S3 infrastructure and object tags: The idea was to use object tags and a lifecycle configuration, basically tagging objects when they were scanned. We dismissed this because updating a tag does not change the last-modified date of the object, and lifecycle rules in S3 buckets are based on that date; we would have had to come up with a complex mechanism to update the lifecycle configuration periodically. Additionally, the cost of the tagging operations would add up over time, especially if all the objects are scanned frequently.
...
Another clear point that surfaced is that deleting the unlinked data is not worth the effort: if we move it to the cheapest archive tier, it is actually worth keeping the file handles around just in case we need to recover them.
Open questions
...
Even though we flag file handles as UNLINKED, we allow file handles to be copied, so the database contains multiple file handles that point to the same key. In some cases some of the copies might be flagged as UNLINKED while one or more other copies for the same key are still linked. This poses a technical challenge: we cannot archive this data since the key is still linked through another copy, but the records remain in the database as UNLINKED. If we do not change their status or date, we will keep encountering these file handles until all the links are eventually removed. For example:
F1 (k1, AVAILABLE)
F2 (k1, UNLINKED)
F3 (k1, UNLINKED)
F4 (k2, UNLINKED)
If we collect a batch of unlinked file handles and limit the batch size to 2, we will fetch F2 and F3. Both point to the same key but cannot be archived, because F1, which references the same key, is still linked. Should we delete F2 and F3 in this case, so that the next run can process F4? If we do not delete them we could update their timestamp, but then we risk updating them over and over. Maybe we could use a special status? Why would we want to keep those around anyway?
It is really not clear if lifecycle transitions generate S3 notifications, for example if we wanted to get a notification when a temporary object is automatically expired so that we can remove the file handle record. It looks like this was not possible a few years ago, but the reference in the documentation has disappeared, so we will have to test this or contact support.
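The batching problem in the F1 to F4 example can be reproduced with a small sketch. It assumes each file handle is represented as an `(id, key, status)` tuple and that all the handles for a key are visible at once; the function names are hypothetical and only illustrate the check, not the actual worker.

```python
from collections import defaultdict

# The example from the text: (file handle id, S3 key, status).
HANDLES = [
    ("F1", "k1", "AVAILABLE"),
    ("F2", "k1", "UNLINKED"),
    ("F3", "k1", "UNLINKED"),
    ("F4", "k2", "UNLINKED"),
]


def next_unlinked_batch(handles, limit):
    """Naive batch: the first `limit` handles flagged as UNLINKED."""
    unlinked = [h for h in handles if h[2] == "UNLINKED"]
    return unlinked[:limit]


def archivable_keys(handles):
    """A key can be archived only if every handle pointing to it is UNLINKED."""
    statuses_by_key = defaultdict(set)
    for _handle_id, key, status in handles:
        statuses_by_key[key].add(status)
    return sorted(
        key for key, statuses in statuses_by_key.items() if statuses == {"UNLINKED"}
    )


if __name__ == "__main__":
    batch = next_unlinked_batch(HANDLES, limit=2)
    safe_keys = archivable_keys(HANDLES)
    # The batch contains F2 and F3, but k1 is not archivable because of F1,
    # so nothing in the batch can be processed and F4 is never reached.
    print([h[0] for h in batch], safe_keys)
```

With a batch size of 2 the naive batch never gets past F2 and F3, even though k2 is the only key that could actually be archived. This is why the records for k1 need to be deleted, re-dated or moved to a different status.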
...