...
The request specifies a batch size and a number of days to look back. This is needed to limit the amount of data that we need to scan per job and to avoid expensive table scans (maybe using a separate table for the status would be better, but we would have to refactor every file handle query to use a left join).
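For concreteness, the request might look something like this (a sketch only; the class and field names are illustrative, not the actual Synapse model):

```java
// Hypothetical request model; class and field names are illustrative only.
public class ArchiveUnlinkedFileHandlesRequest {
    // Maximum number of unique keys to fetch and enqueue in one run.
    private long batchSize;
    // Days to look back before T (T = now - 30 days), bounding the scan window.
    private int daysLookBack;
    // Getters and setters omitted for brevity.
}
```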
The worker will fetch a batch (LIMIT) of unique keys that (a query sketch follows this list):
Have been updated more than 30 days ago (let us call this point in time T) but no earlier than T - <the input days> (if we call this job every 10 minutes we can limit the look-back to a day to avoid big scans)
Are flagged as UNLINKED
Whose bucket is proddata.sagebase.org (for now we process only our bucket, we can later extend it to other buckets, maybe through a storage location configuration)
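A minimal sketch of the selection query, assuming a Spring JdbcTemplate and a FILES table with BUCKET_NAME, KEY, STATUS and UPDATED_ON columns (the table and column names are assumptions, not the actual Synapse schema):

```java
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

public class UnlinkedKeyDao {

    private final JdbcTemplate jdbcTemplate;

    public UnlinkedKeyDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Fetches up to batchSize unique UNLINKED keys in our bucket whose last
    // update falls in the window [T - daysLookBack, T], with T = now - 30 days.
    public List<String> fetchUnlinkedKeys(int daysLookBack, long batchSize) {
        String sql = "SELECT DISTINCT `KEY` FROM FILES"
            + " WHERE BUCKET_NAME = ?"
            + " AND STATUS = 'UNLINKED'"
            + " AND UPDATED_ON < NOW() - INTERVAL 30 DAY"
            + " AND UPDATED_ON >= NOW() - INTERVAL ? DAY"
            + " LIMIT ?";
        // 30 + daysLookBack days ago is T - daysLookBack.
        return jdbcTemplate.queryForList(sql, String.class,
            "proddata.sagebase.org", 30 + daysLookBack, batchSize);
    }
}
```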
A message that includes a batch of keys (e.g. 100) and T is put on a queue. Another worker will process each key in the message in a separate transaction (a sketch of the per-key processing follows these steps); for each key it:
Updates all the file handles for that key that are older than T and UNLINKED, setting their status to ARCHIVED
Fetches the count of file handles for that key that are AVAILABLE or (UNLINKED and have been updated after T)
If the count is > 0 then the key is still available, or another file handle is UNLINKED but not yet past the 30-day window
If the count is 0 then we can archive the object in S3: we first get the tag set of the object and then merge in a new tag such as “synapse-status=archive”, with a configuration for the INT storage class that automatically moves the objects tagged as such into the archive tier and deep archive tier if they were not accessed for more than 90 and 180 days respectively.
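A sketch of what the per-key worker might look like, assuming the AWS SDK for Java v1 and the same hypothetical FILES table as above (the class and query details are assumptions; each processKey call would run in its own transaction):

```java
import java.sql.Timestamp;
import java.util.List;
import java.util.stream.Collectors;

import org.springframework.jdbc.core.JdbcTemplate;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectTaggingRequest;
import com.amazonaws.services.s3.model.ObjectTagging;
import com.amazonaws.services.s3.model.SetObjectTaggingRequest;
import com.amazonaws.services.s3.model.Tag;

public class KeyArchiveWorker {

    private static final String STATUS_TAG_KEY = "synapse-status";
    private static final String STATUS_TAG_VALUE = "archive";

    private final JdbcTemplate jdbcTemplate;
    private final AmazonS3 s3Client;

    public KeyArchiveWorker(JdbcTemplate jdbcTemplate, AmazonS3 s3Client) {
        this.jdbcTemplate = jdbcTemplate;
        this.s3Client = s3Client;
    }

    // Processes one key from the queued message; runs in its own transaction.
    public void processKey(String bucket, String key, Timestamp t) {
        // 1. Archive every UNLINKED file handle for the key older than T.
        jdbcTemplate.update(
            "UPDATE FILES SET STATUS = 'ARCHIVED'"
            + " WHERE BUCKET_NAME = ? AND `KEY` = ? AND STATUS = 'UNLINKED' AND UPDATED_ON < ?",
            bucket, key, t);

        // 2. Count handles that still block archiving the S3 object:
        //    AVAILABLE ones, or UNLINKED ones updated after T.
        Long blocking = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM FILES"
            + " WHERE BUCKET_NAME = ? AND `KEY` = ?"
            + " AND (STATUS = 'AVAILABLE' OR (STATUS = 'UNLINKED' AND UPDATED_ON >= ?))",
            Long.class, bucket, key, t);

        if (blocking != null && blocking == 0) {
            tagObjectForArchival(bucket, key);
        }
    }

    // 3. Merge the archive tag into the existing tag set of the object.
    private void tagObjectForArchival(String bucket, String key) {
        List<Tag> tags = s3Client.getObjectTagging(new GetObjectTaggingRequest(bucket, key))
            .getTagSet().stream()
            // Drop a pre-existing synapse-status tag so we do not duplicate the key.
            .filter(tag -> !STATUS_TAG_KEY.equals(tag.getKey()))
            .collect(Collectors.toList());
        tags.add(new Tag(STATUS_TAG_KEY, STATUS_TAG_VALUE));
        s3Client.setObjectTagging(new SetObjectTaggingRequest(bucket, key, new ObjectTagging(tags)));
    }
}
```

Filtering out a pre-existing synapse-status tag before re-adding it keeps the tag merge idempotent, which matters because the same key can end up in the queue more than once (see the note at the end of this section).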
All the previews of the ARCHIVED file handles can be deleted, as long as they are not used by other available file handles (yeah, this is unfortunately a possibility). The S3 object of a preview is deleted iff the preview file handle is the last one pointing to its key.
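The two checks this implies might look like this (again a sketch; PREVIEW_ID and ID are assumed columns linking a file handle to its preview file handle):

```java
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch of the preview cleanup checks (table/column names are assumptions).
public class PreviewCleanupDao {

    private final JdbcTemplate jdbcTemplate;

    public PreviewCleanupDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // True if some AVAILABLE file handle still uses this preview, in which
    // case the preview file handle cannot be deleted yet.
    public boolean isPreviewStillUsed(long previewId) {
        Long count = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM FILES WHERE PREVIEW_ID = ? AND STATUS = 'AVAILABLE'",
            Long.class, previewId);
        return count != null && count > 0;
    }

    // True if the preview file handle is the last one pointing to its key,
    // in which case the preview's S3 object can be deleted too.
    public boolean isLastHandleOnKey(long previewFileHandleId, String bucket, String key) {
        Long count = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM FILES WHERE BUCKET_NAME = ? AND `KEY` = ? AND ID <> ?",
            Long.class, bucket, key, previewFileHandleId);
        return count != null && count == 0;
    }
}
```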
Note that since we do not care much about keeping track of the job itself, the job completes when the batch has finished sending messages to the queue (i.e. not when all the keys in the batch have been processed). If we call this job too frequently we risk putting the same key in the queue multiple times, but I don’t see any side effects.
...