Main Features
- A new API that deletes an entity into a trash can. Once an entity is in the trash can, it becomes invisible to the normal CRUD operations. The current delete API will be kept as an alternative that permanently deletes an entity. (A sketch of this surface follows the feature list.)
- Each user has a dedicated trash can. The user can view deleted entities in his/her trash can.
- The user can restore deleted entities from the trash can.
- The user can purge the whole trash can or purge individual entities.
- A worker periodically scans the trash cans and purges entities that have been in the trash can for more than a month.
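The features above suggest a small service surface. The following Java interface is only a hypothetical sketch of that surface; none of these names or signatures exist in Synapse, and they are meant purely to make the feature list concrete.

```java
import java.util.List;

/**
 * Hypothetical sketch of the trash can surface implied by the features above.
 * All names and signatures are illustrative only.
 */
public interface TrashCanService {

    /** The new delete API: moves an entity (and its dependents) into the user's trash can. */
    void moveToTrash(String userId, String entityId);

    /** Lists the deleted entities in the user's dedicated trash can. */
    List<String> viewTrashCan(String userId);

    /** Restores a deleted entity from the trash can. */
    void restore(String userId, String entityId);

    /** Permanently purges one entity from the trash can; the monthly sweep
     *  worker would call this for entities older than a month. */
    void purge(String userId, String entityId);

    /** Permanently purges the whole trash can. */
    void purgeAll(String userId);
}
```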
Challenges
- Need to define a reasonable boundary for what data should go into the trash can so that an entity can be restored as completely as possible. The boundary obviously goes beyond the entity/node itself: we also need to trash the revisions (which hold the annotations, references, and version numbers) and the ACLs together with the entity. What else? When deciding what data to trash, also keep in mind the impact on maintenance and future development.
- Need to handle hierarchies. In particular, each entity has a parent and a benefactor, and deleting a node has the cascading effect of also deleting its descendants and beneficiaries. We must trash these dependents as well so that they can be restored together with the deleted node. This requirement can hurt performance badly if we end up deleting a large tree of nodes. One approach is to cap the number of entities that can be moved to the trash can at once: if we count more than 100 entities, for example, we throw an exception ("Too large to fit into the trash can.") and prompt the user to permanently delete the entities instead (see the sketch after this list).
- Need to cope with changes. Once an entity is in the trash can, it is frozen, but in the meantime the data in Synapse keeps changing. When we restore an entity from the trash can, its surroundings may already have changed; for example, its parent may no longer exist, or the access requirements may have changed. And it is not only data: schemas change too. Moreover, the schemas are not standardized, not versioned, not persisted, and not well detached from the code logic. There is therefore a risk that the items put in the trash can today will fail to restore a month later due to incompatible schemas. We should at least be able to detect such conflicts and fail the restore with exceptions (also sketched after this list), or, perhaps better, cut off the conflicting parts and let the user restore the rest manually.
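The second and third challenges both reduce to runnable checks, one before trashing and one before restoring. Below is a minimal sketch in Java; NodeDao, the method names, and the limit of 100 are assumptions made for illustration, not existing Synapse code.

```java
import java.util.List;

/** Hypothetical, minimal node accessor assumed for this sketch. */
interface NodeDao {
    List<String> getChildrenIds(String nodeId);
    boolean exists(String nodeId);
}

final class TrashCanChecks {

    /** Assumed cap on how many entities a single delete may move to the trash can. */
    static final int MAX_TRASHABLE = 100;

    /** Challenge 2: refuse to trash a tree larger than the cap. */
    static void assertFitsInTrashCan(NodeDao dao, String rootId) {
        if (countTree(dao, rootId, MAX_TRASHABLE + 1) > MAX_TRASHABLE) {
            throw new IllegalStateException("Too large to fit into the trash can.");
        }
    }

    /** Counts the node and its descendants, stopping early once the budget is spent. */
    private static int countTree(NodeDao dao, String nodeId, int budget) {
        int count = 1;
        for (String child : dao.getChildrenIds(nodeId)) {
            if (count >= budget) {
                break; // already over the cap, no need to keep counting
            }
            count += countTree(dao, child, budget - count);
        }
        return count;
    }

    /** Challenge 3: detect an obvious conflict (missing parent) and fail the restore. */
    static void assertRestorable(NodeDao dao, String originalParentId) {
        if (!dao.exists(originalParentId)) {
            throw new IllegalStateException(
                    "Cannot restore: the original parent no longer exists.");
        }
    }
}
```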
Proposed Solutions
The flag approach
Add a boolean column IS_DELETED. Data is not deleted but is flagged as "deleted". A simple idea, but it requires adding a filter to all the non-trash-can queries.
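To make that cost concrete, here is a minimal sketch of the flag approach with hypothetical table and column names (NODE, IS_DELETED, and so on). The point is that every existing read path must gain an IS_DELETED filter, which is easy to miss.

```java
/** Hypothetical queries under the flag approach; schema names are assumptions. */
public final class FlagApproachQueries {

    /** "Deleting" becomes an update instead of a DELETE. */
    static final String TRASH_NODE =
            "UPDATE NODE SET IS_DELETED = TRUE WHERE ID = ?";

    /** Every pre-existing query must now exclude trashed rows... */
    static final String LIST_CHILDREN =
            "SELECT ID, NAME FROM NODE WHERE PARENT_ID = ? AND IS_DELETED = FALSE";

    /** ...while the trash can views select the other side of the flag. */
    static final String LIST_TRASH_CAN =
            "SELECT ID, NAME FROM NODE WHERE IS_DELETED = TRUE";

    private FlagApproachQueries() {}
}
```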
The backup approach
Delete by backing up the entity to a different place. Because the trash data lives elsewhere, none of the existing queries are affected.
One approach is to move deleted data to separate tables of the same schema. For example, an entry in the table REVISION would move to a table REVISION_TRASH with the identical schema, as sketched below.
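A minimal sketch of this variant using JDBC; the column OWNER_NODE_ID and the exact schema are assumptions. The copy and the delete must run in a single transaction so that a failure cannot leave the revision in both tables or in neither.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Sketch of moving rows between same-schema tables; names are assumptions. */
public class RevisionTrashDao {

    // REVISION_TRASH is assumed to have exactly the same schema as REVISION.
    private static final String COPY =
            "INSERT INTO REVISION_TRASH SELECT * FROM REVISION WHERE OWNER_NODE_ID = ?";
    private static final String DELETE =
            "DELETE FROM REVISION WHERE OWNER_NODE_ID = ?";

    /** Moves all revisions of one node into the trash table, atomically. */
    public void moveToTrash(Connection con, long nodeId) throws SQLException {
        final boolean autoCommit = con.getAutoCommit();
        con.setAutoCommit(false);
        try (PreparedStatement copy = con.prepareStatement(COPY);
             PreparedStatement delete = con.prepareStatement(DELETE)) {
            copy.setLong(1, nodeId);
            copy.executeUpdate();
            delete.setLong(1, nodeId);
            delete.executeUpdate();
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(autoCommit);
        }
    }
}
```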
Another, perhaps better, approach is to take advantage of the existing backup mechanism. Its main benefit is that challenge 1 is already taken care of: the necessary backup data are wrapped into NodeBackup and NodeRevisionBackup, NodeBackupManager.getNode() and NodeBackupManager.getNodeRevision() create the backup objects, and NodeBackupDriver.writeBackup() serializes the backup objects to XML files, recursively writing the whole tree. However, a limit must be added here for the trash can feature: backup/restore can afford to run for hours, while moving entities into a trash can must finish quickly. The trash can in this approach can be implemented on Amazon S3, for example as sketched below.
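If the backup XML produced by NodeBackupDriver is the unit of trash, the trash can itself could be a per-user prefix in an S3 bucket. A minimal sketch with the AWS SDK for Java; the bucket name and key scheme are assumptions, and the step that serializes the entity to backupXml is elided.

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

/** Sketch of an S3-backed trash can; bucket and key scheme are assumptions. */
public class TrashCanS3Store {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final String bucket = "synapse-trash-can"; // hypothetical bucket

    /** Stores the backup XML of one trashed entity under the user's prefix. */
    public void put(String userId, String entityId, File backupXml) {
        s3.putObject(bucket, key(userId, entityId), backupXml);
    }

    /** Permanently purges one trashed entity (e.g. from the monthly sweep worker). */
    public void purge(String userId, String entityId) {
        s3.deleteObject(bucket, key(userId, entityId));
    }

    private static String key(String userId, String entityId) {
        return "trash/" + userId + "/" + entityId + ".xml";
    }
}
```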
The folder approach
Move the deleted data to a trash folder. Details below.
The Trash Folder Approach in Detail
(To be written)