...
We could introduce an explicit link at the time the data is uploaded and maintain a consistent index that enforces a one-to-one relationship. This is hard to implement in practice on top of the current architecture: at upload time we do not know what the data will be linked to (we could generate expiring tokens for uploads, to be used when linking), and it would be a breaking change that would probably take years to introduce. Additionally, we found several million file handles that are already shared between different associations.
We could maintain a single index with all the associations and try to keep it consistent: each time a link is established, a record is added with the type of association, and when an association is broken the index is updated. When the last link is removed, the file handle can be flagged as potentially un-linked and archived after a certain amount of time unless it is linked back. This requires keeping the index up to date (potentially with eventual consistency), and it does not work well with deletes: it can be extremely complex to know when an association is broken and to communicate the change back to the index. For example, when a folder is deleted, all the files under the folder might be deleted as well (even though technically we already traverse the hierarchy when purging the trashcan). It also adds overhead for each type of association, since the handling needs to be done in each place where file handles are used. The advantage of this approach is that if an association is broken and the link is not removed, the worst that can happen is that the data is kept (so no harm is done). Additionally, there are fewer moving parts and points of failure compared to the next solution. We would still have the migration problem, since this would turn into a big migratable table, but we could technically store the index in a separate database that does not migrate. A good candidate that would scale well is DynamoDB, using the file handle id as the partition key (PK) and the association as the sort key (SK).
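To make the lifecycle of such an index concrete, here is a minimal in-memory sketch. It is not a proposed implementation: the class `AssociationIndex`, its methods, the `grace_period` parameter and the association strings are all hypothetical names chosen for illustration, and a plain dictionary stands in for the real store (file handle id as partition key, association as sort key).

```python
from dataclasses import dataclass, field


@dataclass
class AssociationIndex:
    """In-memory stand-in for the association index (hypothetical API)."""

    # Seconds an un-linked file handle waits before it may be archived.
    grace_period: float
    # file handle id (partition key) -> set of associations (sort keys)
    links: dict = field(default_factory=dict)
    # file handle id -> time at which its last link was removed
    unlinked_since: dict = field(default_factory=dict)

    def add_link(self, file_handle_id: str, association: str) -> None:
        self.links.setdefault(file_handle_id, set()).add(association)
        # Linking a handle back cancels any pending archival.
        self.unlinked_since.pop(file_handle_id, None)

    def remove_link(self, file_handle_id: str, association: str, now: float) -> None:
        associations = self.links.get(file_handle_id)
        if associations is None:
            return  # Unknown handle: nothing to update.
        associations.discard(association)
        if not associations:
            del self.links[file_handle_id]
            # Last link gone: flag as potentially un-linked, do not delete.
            self.unlinked_since.setdefault(file_handle_id, now)

    def archive_candidates(self, now: float) -> list:
        # Handles whose last link was removed at least grace_period ago.
        return sorted(
            fh for fh, since in self.unlinked_since.items()
            if now - since >= self.grace_period
        )
```

Note that a missed `remove_link` call only leaves a stale record behind, so the handle is never reported as a candidate and the data is kept, which matches the "no harm" property described above.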
Another solution is to periodically scan all the file handle associations and ask which ids are linked. With this information we can build an “index” that can be queried to identify the file handles that are un-linked and archive them for deletion, with the possibility of restoring them within a given period of time. This approach has the advantage that its implementation has a lower impact on the rest of the system and is isolated to a specific task, potentially reducing the risk of wrong deletions. It is not immune to mistakes, since a developer could still make an error when implementing the scan of an association, but we can design it so that we are alerted when mistakes happen, before the actual archival or deletion occurs.
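The scan-based approach can be sketched roughly as follows. Everything here is an assumption made for illustration: the function `find_unlinked`, the per-association scanner callables, and the `max_unlinked_fraction` threshold, which is just one possible way of raising an alert before anything is archived.

```python
from typing import Callable, Dict, Iterable, Set

# A scanner returns the file handle ids referenced by one association type.
Scanner = Callable[[], Iterable[str]]


def find_unlinked(
    all_handles: Set[str],
    scanners: Dict[str, Scanner],
    max_unlinked_fraction: float = 0.5,  # hypothetical safety threshold
) -> Set[str]:
    """Return the file handle ids that no association scanner reported."""
    linked: Set[str] = set()
    for scan in scanners.values():
        linked.update(scan())

    unlinked = all_handles - linked

    # Safety net: an unusually large un-linked set more likely points to a
    # broken or missing scanner than to real garbage, so stop and alert
    # instead of handing the ids over for archival.
    if all_handles and len(unlinked) / len(all_handles) > max_unlinked_fraction:
        raise RuntimeError(
            f"{len(unlinked)} of {len(all_handles)} file handles look "
            "un-linked; refusing to archive"
        )
    return unlinked
```

A run could look like `find_unlinked({"1", "2", "3", "4"}, {"FileEntity": lambda: ["1", "2"], "TableEntity": lambda: ["2", "3"]})`, which reports only handle `"4"`. If a scanner is missing or returns nothing, the threshold check fails the run rather than flagging most of the data for archival.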
...