...
The request specifies a batch size and a number of days to look back. This is needed to limit the amount of data that we need to scan per job and to avoid expensive table scans (maybe using a separate table for the status would be better, but we would have to refactor every file handle query to use a left join).
The worker will fetch a batch (LIMIT) of unique keys that (see the sketch after this list):
Have been updated more than 30 days ago (let us call this point in time T) and no earlier than T - <the input days> (if we call this job every 10 minutes we can limit this to a day to avoid big scans)
Are flagged as UNLINKED
Whose bucket is proddata.sagebase.org (for now we process only our bucket, we can later extend it to other buckets, maybe through a storage location configuration)
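A minimal sketch of the batch query, assuming a hypothetical FILES table with BUCKET, KEY, STATUS and UPDATED_ON columns and a Spring JdbcTemplate (the actual schema and data access layer will differ):

    import java.sql.Timestamp;
    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.List;
    import org.springframework.jdbc.core.JdbcTemplate;

    // Hypothetical batch query: T = now - 30 days; the lower bound keeps the
    // scanned window small when the job runs frequently (e.g. one day).
    List<String> fetchUnlinkedKeyBatch(JdbcTemplate jdbc, String bucket, int lookBackDays, int batchSize) {
        Instant t = Instant.now().minus(30, ChronoUnit.DAYS);
        return jdbc.queryForList("SELECT DISTINCT `KEY` FROM FILES"
            + " WHERE BUCKET = ?"
            + " AND STATUS = 'UNLINKED'"
            + " AND UPDATED_ON < ?"  // updated more than 30 days ago
            + " AND UPDATED_ON >= ?" // but no earlier than T - lookBackDays
            + " LIMIT ?",
            String.class, bucket, Timestamp.from(t),
            Timestamp.from(t.minus(lookBackDays, ChronoUnit.DAYS)), batchSize);
    }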
A message that includes a batch of keys (e.g. 100) and T is put on a queue. Another worker will process each key in the message in a separate transaction (a sketch follows the list below); for each key it:
Updates all the file handles for that key older than T and that are UNLINKED so that their status is ARCHIVED
Fetches the count of file handles for that key that are AVAILABLE or (UNLINKED and have been updated after T)
If the count is > 0 then the key is still available, or another file handle is unlinked but not yet past the 30-day window
If the count is 0 then we can archive the object in S3: we first get the tag set of the object and then merge it with a new tag such as “synapse-status=archive”. A configuration for the INT storage class then automatically moves the objects tagged as such to the archive tier and the deep archive tier if they were not accessed for more than 90 and 180 days respectively.
All the previews of the ARCHIVED file handles can be deleted, as long as they are not used by other available file handles (yeah, this is unfortunately a possibility). The S3 object of a preview is deleted iff the preview file handle is the last one pointing to its key.
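A hedged sketch of the per-key processing, reusing the hypothetical FILES table above and assuming an AWS SDK for Java v2 S3 client (preview deletion is omitted; all names are illustrative):

    import java.sql.Timestamp;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;
    import org.springframework.jdbc.core.JdbcTemplate;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectTaggingRequest;
    import software.amazon.awssdk.services.s3.model.PutObjectTaggingRequest;
    import software.amazon.awssdk.services.s3.model.Tag;
    import software.amazon.awssdk.services.s3.model.Tagging;

    // Hypothetical per-key processing, run in its own DB transaction.
    void processKey(JdbcTemplate jdbc, S3Client s3, String bucket, String key, Instant t) {
        Timestamp ts = Timestamp.from(t);
        // 1. Flag every UNLINKED file handle for the key older than T as ARCHIVED.
        jdbc.update("UPDATE FILES SET STATUS = 'ARCHIVED'"
            + " WHERE BUCKET = ? AND `KEY` = ? AND STATUS = 'UNLINKED' AND UPDATED_ON < ?",
            bucket, key, ts);
        // 2. Count the file handles that still keep the key alive: AVAILABLE ones,
        //    or UNLINKED ones updated after T (not yet past the 30-day window).
        Long alive = jdbc.queryForObject("SELECT COUNT(*) FROM FILES"
            + " WHERE BUCKET = ? AND `KEY` = ?"
            + " AND (STATUS = 'AVAILABLE' OR (STATUS = 'UNLINKED' AND UPDATED_ON >= ?))",
            Long.class, bucket, key, ts);
        if (alive != null && alive == 0) {
            // 3. Merge the existing tag set with the archive tag: fetch the tags,
            //    drop any previous synapse-status tag, and write the set back.
            List<Tag> tags = new ArrayList<>(s3.getObjectTagging(
                GetObjectTaggingRequest.builder().bucket(bucket).key(key).build()).tagSet());
            tags.removeIf(tag -> tag.key().equals("synapse-status"));
            tags.add(Tag.builder().key("synapse-status").value("archive").build());
            s3.putObjectTagging(PutObjectTaggingRequest.builder().bucket(bucket).key(key)
                .tagging(Tagging.builder().tagSet(tags).build()).build());
        }
    }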
Note that since we do not care much about keeping track of the job itself, the job completes when the batch has finished sending messages to the queue (i.e. not when all the keys in the batch have been processed). If we call this job too frequently we might put the same key in the queue multiple times, but I don’t see any side effects.
...
Why can we make “copies” of file handles in the first place? Maybe we should not allow it in S3 or GC buckets? Who is using this? If we got rid of this, would it make things simpler? → ✅ This seems to be used heavily by scientists (from DW data); it is unrealistic to get rid of it and/or deduplicate them.
I have no idea when the last access date is updated for the INT class, nor a way to test it quickly; does tagging the object reset this and move the object back to INT-Standard? If so we might have a weird lifecycle where an uploaded object might go: INT-Standard → (Day 30) INT-IA → UNLINKED → (Day 60, tagging) INT-Standard → (Day 90) INT-IA → (Day 150) INT-Archive → (Day 240) INT-DeepArchive. Basically we might have a month where we go back to paying for INT-Standard; in the long run the cost would be amortized since objects will eventually move to the archive tiers. → ✅ I verified that tagging does NOT in fact push the object back to INT-Standard. In my own bucket I have objects that are INT-IA; I tagged those objects from the console (therefore fetching the tags as well) and the metrics reported after a day did not change the count of objects in INT-IA. This is great news as the lifecycle is now clear and more cost effective: INT-Standard → (Day 30) INT-IA → UNLINKED → (Day 60, tagging) → (Day 90) INT-Archive → (Day 180) INT-DeepArchive.
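For reference, a sketch of what the bucket-level INT archive configuration could look like with the AWS SDK for Java v2 (the configuration id and the exact tag are assumptions; the 90/180 day tiers match the lifecycle above):

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.IntelligentTieringAccessTier;
    import software.amazon.awssdk.services.s3.model.IntelligentTieringConfiguration;
    import software.amazon.awssdk.services.s3.model.IntelligentTieringFilter;
    import software.amazon.awssdk.services.s3.model.IntelligentTieringStatus;
    import software.amazon.awssdk.services.s3.model.PutBucketIntelligentTieringConfigurationRequest;
    import software.amazon.awssdk.services.s3.model.Tag;
    import software.amazon.awssdk.services.s3.model.Tiering;

    // Illustrative INT archive configuration: objects tagged
    // synapse-status=archive move to the Archive tier after 90 days without
    // access and to the Deep Archive tier after 180 days.
    void configureArchiveTiers(S3Client s3, String bucket) {
        s3.putBucketIntelligentTieringConfiguration(
            PutBucketIntelligentTieringConfigurationRequest.builder()
                .bucket(bucket)
                .id("synapse-archive-unlinked")
                .intelligentTieringConfiguration(IntelligentTieringConfiguration.builder()
                    .id("synapse-archive-unlinked")
                    .status(IntelligentTieringStatus.ENABLED)
                    .filter(IntelligentTieringFilter.builder()
                        .tag(Tag.builder().key("synapse-status").value("archive").build())
                        .build())
                    .tierings(
                        Tiering.builder().accessTier(IntelligentTieringAccessTier.ARCHIVE_ACCESS)
                            .days(90).build(),
                        Tiering.builder().accessTier(IntelligentTieringAccessTier.DEEP_ARCHIVE_ACCESS)
                            .days(180).build())
                    .build())
                .build());
    }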
Should we delete the copies of the file handles if at least one is AVAILABLE? If so, if all copies are UNLINKED should we just keep one around? Which one? → ✅ No, copies effectively make the ownership a 1-to-N relationship.
It is really not clear if lifecycle transitions generate S3 notifications, for example if we wanted to get a notification when a temporary object is automatically expired so that we can remove the file handle record. It looks like this was not possible a few years ago, but the reference in the documentation has disappeared; we will have to test this or contact support.
I haven’t even started thinking about GC or other types of file handles. The vast majority of unlinked data is in prod though.
...