...
While a file handle is in the UNLINKED, ARCHIVING, or ARCHIVED state it can be restored: the process moves the data back to the original key (if the object is in an archive tier such as Glacier, we first need to request the restore from AWS, so the process is a bit more involved). We can introduce dedicated statuses (e.g. RESTORING → AVAILABLE) to track the restore request while it is in progress.
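Below is a minimal sketch of what this restore path could look like, assuming the AWS SDK for Java v1; the status values, the key layout, and the inGlacier flag are illustrative assumptions, not the actual Synapse implementation:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.RestoreObjectRequest;

public class FileHandleRestorer {

	private final AmazonS3 s3;

	public FileHandleRestorer(AmazonS3 s3) {
		this.s3 = s3;
	}

	/**
	 * Restores the data of an UNLINKED, ARCHIVING or ARCHIVED file handle and
	 * returns the (hypothetical) status to assign to it.
	 */
	public String restore(String bucket, String archiveKey, String originalKey, boolean inGlacier) {
		if (inGlacier) {
			// Glacier objects must first be made readable again: issue an asynchronous
			// restore request and keep the file handle in a RESTORING state; a worker
			// would later detect completion and copy the data back to the original key.
			s3.restoreObjectV2(new RestoreObjectRequest(bucket, archiveKey, /* expirationInDays */ 2));
			return "RESTORING";
		}
		// For objects in a standard tier we can copy the data back to the original key directly.
		s3.copyObject(bucket, archiveKey, bucket, originalKey);
		return "AVAILABLE";
	}
}
```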
Alternative Implementation (Dismissed)
The process proposed above has the advantage of being mostly done externally to Synapse, but it also introduces several points of failure that might be difficult to handle correctly. Additionally, it adds infrastructure and technologies that the backend team has limited knowledge of, and in general it is a substantial engineering effort that might span several months of development.
...
Every time a link is added (e.g. a file entity is created, a revision is updated, etc.) a dedicated service is invoked to inform the system about the association: we simply add a record to a dedicated table that maps the file handle to the object type/id. This should be done in the same transaction (a sketch of the whole flow follows these steps).
When a link is removed, we invoke the service again to remove the link.
Periodically we scan this table to detect un-linked file handles (i.e. file handles that do not have any record in this table) and record this information in another table together with a timestamp.
Periodically we scan the “un-linked” table for file handles that have been un-linked for more than 30 days; we update those file handles and remove the records from this table.
When a link is added we make sure that the record is removed from the “un-linked” table as well.
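A minimal sketch of this flow, assuming a MySQL-backed schema and Spring's JdbcTemplate; the table names (FILE_HANDLE_LINK, FILE_HANDLE_UNLINKED), the column names, and the UNLINKED status value are all hypothetical, not the actual Synapse schema:

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

public class FileHandleLinkTracker {

	private final JdbcTemplate jdbc;

	public FileHandleLinkTracker(JdbcTemplate jdbc) {
		this.jdbc = jdbc;
	}

	/** Invoked in the same transaction that creates the association. */
	@Transactional
	public void addLink(long fileHandleId, String objectType, long objectId) {
		jdbc.update("INSERT IGNORE INTO FILE_HANDLE_LINK (FILE_HANDLE_ID, OBJECT_TYPE, OBJECT_ID) VALUES (?, ?, ?)",
				fileHandleId, objectType, objectId);
		// A re-linked file handle must no longer be considered un-linked.
		jdbc.update("DELETE FROM FILE_HANDLE_UNLINKED WHERE FILE_HANDLE_ID = ?", fileHandleId);
	}

	/** Invoked in the same transaction that deletes the association. */
	@Transactional
	public void removeLink(long fileHandleId, String objectType, long objectId) {
		jdbc.update("DELETE FROM FILE_HANDLE_LINK WHERE FILE_HANDLE_ID = ? AND OBJECT_TYPE = ? AND OBJECT_ID = ?",
				fileHandleId, objectType, objectId);
	}

	/** Periodic scan: record file handles that no longer have any link, with a timestamp. */
	public void recordUnlinked() {
		// The real implementation would also need to exclude file handles that were
		// just created and not yet linked (e.g. with a grace period on the creation date).
		jdbc.update("INSERT IGNORE INTO FILE_HANDLE_UNLINKED (FILE_HANDLE_ID, UNLINKED_ON)"
				+ " SELECT F.ID, NOW() FROM FILE_HANDLE F"
				+ " LEFT JOIN FILE_HANDLE_LINK L ON F.ID = L.FILE_HANDLE_ID"
				+ " WHERE L.FILE_HANDLE_ID IS NULL AND F.STATUS = 'AVAILABLE'");
	}

	/** Periodic scan: flag file handles un-linked for more than 30 days, dropping the tracking records. */
	@Transactional
	public void flagExpired() {
		jdbc.update("UPDATE FILE_HANDLE F JOIN FILE_HANDLE_UNLINKED U ON F.ID = U.FILE_HANDLE_ID"
				+ " SET F.STATUS = 'UNLINKED'"
				+ " WHERE U.UNLINKED_ON < NOW() - INTERVAL 30 DAY");
		jdbc.update("DELETE FROM FILE_HANDLE_UNLINKED WHERE UNLINKED_ON < NOW() - INTERVAL 30 DAY");
	}
}
```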
...
Of course this has the drawback of introducing another big table that needs to be migrated, and the existing links need to be backfilled. We could store this in a dedicated DB that does not migrate, but in that case the “un-linked” discovery would not take advantage of joining against the file handle table.
Note: We decided to dismiss this simpler implementation for now. It is problematic due to the duplication of data (especially for tables) in the main DB. Additionally, handling deletions might become too complex: versioning means that we would need to either keep a record for each version, or check every time a version is deleted whether this would free up a file handle (which might still be referenced by other node versions). Finally, even though this is a simpler solution, we would need to backfill the data anyway, leading to a process similar to the file handle association scanning described above.
Unindexed Data
From the S3 Bucket Analysis it turns out that we have around 7.5TB of data in S3 for which there is no file handle.
...