JIRA: https://sagebionetworks.jira.com/browse/PLFM-510 (description outdated, left for reference).
Related JIRAs:
Related Previous Work:
Introduction
Synapse by default stores most of its data in an encrypted AWS S3 bucket. Additionally, external buckets owned by users can be linked to projects and folders so that data can be stored outside the main bucket and its associated costs can be billed separately. Note that Synapse supports different types of storage, such as a bucket provisioned in Google Cloud, and can link data that is stored elsewhere as long as it is dereferenceable through a URL. The scope of this document is limited to S3 storage, as it is the main type of storage currently being adopted.
Whenever a file is uploaded through Synapse, a reference to the file is maintained in an index, which is a table in an RDS MySQL instance. We refer to a record in this index table as a File Handle. A file handle is simply a pointer to where the physical data is stored and provides no context about where the data is actually used. For this reason there is no explicit access model; instead, the user that created the file handle "owns" it (we only allow the owner to delete the file handle or access it directly). Additionally, from a user perspective a file handle is immutable and its metadata cannot be updated; this simplifies the internal handling of files in Synapse and provides a generic abstraction over file management.
Once a file handle is created it can be referenced directly in multiple places, including file entities, records in a Synapse table, user and team profile pictures, wiki attachments, user messages etc. The complete list of reference types is maintained in what we call a file handle associate type:
https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html.
Through its associations a file can be shared with users and teams and downloaded, usually through a pre-signed URL.
From a maintenance perspective there are two main categories of data that could lead to potential unwanted costs:
Unlinked Data
The current architecture has the advantage that the infrastructure used to manage file uploads is unified across the whole system, but it has the drawback that it separates the indexing of the data from the usage of the data. This can lead to data that is potentially un-linked and effectively unused, but for which we would still pay the associated storage costs.
Unindexed Data
Another category of data in our bucket is worth mentioning: data that is in the S3 bucket but not indexed in Synapse (e.g. no file handle exists for a key). This is mostly a concern for the default bucket used by Synapse (as external buckets are managed by their owners).
Ideally the system would be designed in a way that the amount of unlinked and unindexed data is kept as low as possible and could self-heal from potential abuse.
Unlinked Data
There are various scenarios when this might happen:
Updates: for example when a user updates their profile picture, or when a version of a file entity is updated with a new file handle. Another example is records in a Synapse table that point to a file handle: a table might be updated to use different file handles without deleting the old ones.
Deletions: for example when a project is deleted and then purged from the trashcan, all the entities in the project are deleted, but the file handles are left intact and the data remains in the bucket.
No association: a file could be uploaded and a file handle created, but never linked to anything, making the data inaccessible to anyone but the uploading user.
Data migration: a recent use case might lead to several terabytes of un-linked data. When users want to move their data to an external bucket, they might download from one bucket and re-upload to another (or copy the data over using the recently developed APIs), updating the file handles in the entities and leaving behind a trail of file handles and data that is no longer used.
There are various ways that we could tackle this problem:
We could introduce an explicit link at the time the data is uploaded and maintain a consistent index: this is hard to implement in practice on top of the current architecture, since we do not know in advance where the uploaded data will be linked. It would also be a breaking change that would probably take years to introduce.
Auto-expire file uploads: when a file is initially uploaded we can flag it with a temporary state; after X days the file is automatically archived/deleted unless it is explicitly linked through some association and the state is explicitly maintained. This requires updating the state of the file any time a link is created, but it does not work well with deletes: it can be extremely complex to know when an association is broken and to communicate the change back to the file handle state. For example, when a folder is deleted all the files under the folder might be deleted, but since a file handle can technically be linked to different objects (e.g. in a file entity and at the same time in a record of a Synapse table), the complexity of handling this correctly makes it an unviable solution.
A potential solution uses a different approach instead: we can periodically scan all the file handle associations in Synapse and ask which ids are linked. With this information we can build an "index" that can be queried to identify the file handles that are un-linked, and archive them for deletion with the possibility to restore them within a given period of time. This approach has the advantage that its implementation has a lower impact on the rest of the system and is isolated to a specific task, potentially reducing the risk of wrong deletions. It is not immune to mistakes, since a developer could still make a mistake when scanning the associations, but we can design it so that we are alerted when mistakes happen, before the actual archival/deletion occurs.
In the following we propose a design that revolves around this solution. There are three main phases:
Discovery of the file handles links
Detection of the un-linked data
Archival of the un-linked data
File Handle Associations Discovery
The first step in deciding what is un-linked or unused is to discover the linked data. In the backend we already maintain a set of references, generally using dedicated tables with foreign keys back to the file handles table that keep track of the current associations. This goes back to the https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html and in theory we track every type of association a file handle can have, in order to handle access permissions (e.g. who can download what through which link). In practice this is a bit more complicated: sometimes we do not have a foreign key but rather a field in a serialized form in the associated record (e.g. profile pictures), or even a separate database (e.g. file references in Synapse tables are maintained in dedicated tables, one for each Synapse table).
The proposed approach is as follows:
Extend the current FileHandleAssociationProvider interface that each of the "associable" object types implements, to allow streaming ALL of the file handle ids that the object type has an association with (including entities that are still in the trashcan); a sketch follows this list.
Create an administrative job that is periodically invoked and uses the aforementioned method to stream all the file handle ids for each type (this could be done in parallel, but we can keep it simple in the beginning and see how long it takes to run serially).
For each id in the stream we "touch" the relative file handle, recording the last time it was "seen" and updating its status (see next section).
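A minimal sketch of what the streaming extension and the driving job could look like follows. All the interface, class, and method names below are assumptions for illustration, not the current Synapse API:

```java
import java.time.Instant;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical streaming extension of FileHandleAssociationProvider. */
interface FileHandleAssociationScanner {
    /** Streams the ids of ALL file handles linked by this association type, trashcan included. */
    Iterator<Long> streamAllAssociatedFileHandleIds();
}

/** Hypothetical DAO method used to "touch" a scanned file handle. */
interface FileHandleStatusDao {
    void touchLinked(long fileHandleId, Instant seenOn);
}

/** Driver job: scans every association type serially and touches each referenced id. */
class FileHandleAssociationScanJob {
    private final Map<String, FileHandleAssociationScanner> scannersByAssociateType;
    private final FileHandleStatusDao statusDao;

    FileHandleAssociationScanJob(Map<String, FileHandleAssociationScanner> scannersByAssociateType,
            FileHandleStatusDao statusDao) {
        this.scannersByAssociateType = scannersByAssociateType;
        this.statusDao = statusDao;
    }

    void run(Instant scanStartedOn) {
        for (FileHandleAssociationScanner scanner : scannersByAssociateType.values()) {
            Iterator<Long> ids = scanner.streamAllAssociatedFileHandleIds();
            while (ids.hasNext()) {
                // Records the last time the id was seen and sets its status to LINKED
                statusDao.touchLinked(ids.next(), scanStartedOn);
            }
        }
    }
}
```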
In this way we effectively build an index that can be queried to fetch the file handles that were not seen after a certain amount of time and can be flagged for archival. An alternative is to build a dedicated migratable companion table without touching the file handles table; this might be less strenuous for the system as it would not affect reads on the file handles. The important point is that we need to keep track of when the driving job that scans the associations started and completed successfully, in order to avoid querying stale data.
For example, if the scan was done at time T and we want to identify files to archive, we can define a delta DT after which we consider a file un-linked. The query should only consider file handles created before T - DT. This still leaves some edge cases, for example when a file is linked just after the scan is performed, un-linked before the next scan runs, and re-linked again after that scan. This is probably a rare occurrence, but to avoid such issues we could send an async message when a file is linked, which is processed to update the status of the file handle. A sketch of the selection query is shown below.
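As a rough sketch, assuming a FILES table with STATUS, STATUS_TIMESTAMP and CREATED_ON columns (all names here are illustrative, not the actual schema), the selection of archival candidates could look like this:

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: the FILES table layout and the column names are assumptions. */
class UnlinkedCandidateQuery {

    static String buildSql() {
        return "SELECT ID FROM FILES"
                + " WHERE STATUS IN ('CREATED', 'LINKED')" // never seen, or seen only by older scans
                + " AND STATUS_TIMESTAMP < ?"              // not touched by the scan started at T
                + " AND CREATED_ON < ?";                   // only handles created before T - DT
    }

    public static void main(String[] args) {
        Instant scanStartedOn = Instant.parse("2020-11-30T00:00:00Z"); // T: last successful scan start
        Instant cutoff = scanStartedOn.minus(Duration.ofDays(30));     // T - DT, with DT = 30 days
        // Bind scanStartedOn to the first placeholder and cutoff to the second
        // when executing the statement, e.g. through a JdbcTemplate.
        System.out.println(buildSql() + " -- params: " + scanStartedOn + ", " + cutoff);
    }
}
```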
A potential issue with this approach is that the timing of migration might be affected by the many updates performed each time the scan runs. On the other hand, the job should have a predictable timeline, and we could schedule it to run, for example, just after the release so that the migration process picks up the changes at the beginning of the week.
Un-linked File Handle Detection and Archival
Deleting user data is a tricky business, especially if we cannot be 100% sure that the data is not used. We therefore propose an approach that goes in stages: first the un-linked data is detected but left accessible, and only after a certain amount of time do we start archiving it and eventually delete it. The archived data will be stored in a dedicated bucket with a life cycle policy that deletes objects after X months/years.
Additionally, we can set the storage class of the objects in this bucket to S3 Standard - Infrequent Access (see https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-class-intro.html) to reduce storage costs. Using the infrequent access tier translates into a reduced storage cost but additional costs to add and retrieve data:
Storage cost is $0.0125/GB (vs standard $0.023/GB for the first 50TB, $0.022/GB for the next 450TB, and $0.021/GB over 500TB): e.g. 1TB is ~$12.5 vs ~$23
PUT/POST/COPY/LIST cost is $0.01/1000 requests (vs standard $0.005/1000 requests): e.g. a million objects is ~$10 vs $5
GET/SELECT cost is $0.001/1000 requests (vs standard $0.0004/1000 requests): e.g. a million objects is ~$1 vs ~$0.4
Life cycle cost is $0.01/1000 requests (e.g. automatic deletion): a million objects is ~$10
Data retrieval cost is $0.01/GB (e.g. if we want to restore): 1TB is ~$10
Storing the data in an archiving tier (e.g. Glacier) could be an option, but it complicates restores (the objects first need to be restored) and access to the data is expensive and slow. S3 Intelligent-Tiering automatically monitors usage patterns and changes the access tiers; while fetching the data has no associated cost, we would pay ($0.1 per million objects/month) for monitoring. For this use case it makes more sense to store the data directly in the infrequent access storage class rather than starting in the standard storage class and optionally moving the objects to an archiving tier later.
We introduce a STATUS column in the file handles table that can have the following values:
STATUS | Description | File Handle Accessible |
---|---|---|
CREATED | Default status | Yes |
LINKED | A scan was performed and the file handle was found in at least one association | Yes |
UNLINKED | The file handle has been identified as un-linked; if a pre-signed URL is requested for such an object we trigger an alarm (this should never happen unless we mistakenly identified a linked object) | Yes, trigger alarm |
ARCHIVING | The file is being archived | No, throw not found, trigger alarm |
ARCHIVED | The file has been archived (e.g. moved from the original bucket to the archive bucket) | No, throw not found, trigger alarm |
DELETED | The file has been deleted. We can set up notifications in the S3 bucket and change the status when a deleted object matches the key of the archived object(s). An alternative to deletion would be to store the objects in S3 Glacier or Deep Archive for low long-term storage costs. | No, throw not found, trigger alarm |
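As an illustration, the accessibility column of the table above could be enforced with a check along these lines (a sketch only; the exception type and the alarm mechanism are assumptions):

```java
/** Illustrative enforcement of the accessibility rules above; all names are assumptions. */
class FileHandleAccessCheck {

    static class NotFoundException extends RuntimeException {
        NotFoundException(String message) { super(message); }
    }

    /** Returns normally if a pre-signed URL may be issued, throws otherwise. */
    void checkPreSignedUrlAllowed(String status) {
        switch (status) {
            case "CREATED":
            case "LINKED":
                return;
            case "UNLINKED":
                // Should never happen for truly un-linked data: alert, but still serve the URL.
                triggerAlarm(status);
                return;
            case "ARCHIVING":
            case "ARCHIVED":
            case "DELETED":
                triggerAlarm(status);
                throw new NotFoundException("File handle is no longer available");
            default:
                throw new IllegalStateException("Unknown status: " + status);
        }
    }

    private void triggerAlarm(String status) {
        // Placeholder: e.g. publish a metric that a CloudWatch alarm is configured on.
        System.err.println("Unexpected pre-signed URL request for status " + status);
    }
}
```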
Additionally, we keep track of the status update timestamp in a new column STATUS_TIMESTAMP. We use this timestamp to decide both when to move a file handle to the UNLINKED status and when to move an UNLINKED file handle to the archive.
We propose introducing two new workers:
UnlinkedFileHandleDetectionWorker: scans the CREATED and LINKED file handles and marks as UNLINKED those whose last seen time (STATUS_TIMESTAMP) is more than X days before the start of the last scan, as long as no other scan is on-going and the last scan completed successfully. The job should be triggered by a remote request and run periodically. Additionally, it can be parameterized with the bucket name (e.g. for now we can use it on the prod bucket, but we might want to enable it for other buckets as well).
UnlinkedFileHandleArchiveWorker: fetches the UNLINKED file handles whose STATUS_TIMESTAMP is older than X days (e.g. only archive files that have been unlinked for 30 days) and, as sketched after these steps:
Update their status to ARCHIVING
Copy the file from the source bucket to the archive bucket; the destination key is prefixed with the source bucket name
Update their status to ARCHIVED
Delete the file from the source bucket
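A minimal sketch of the copy-and-delete step using the AWS SDK for Java follows; the status updates and DAO interactions are omitted, and the class and bucket names are assumptions:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import com.amazonaws.services.s3.model.StorageClass;

/** Sketch of the archive step; all names here are illustrative assumptions. */
class FileHandleArchiveStep {

    private final AmazonS3 s3Client;
    private final String archiveBucket; // dedicated bucket with a deletion life cycle

    FileHandleArchiveStep(AmazonS3 s3Client, String archiveBucket) {
        this.s3Client = s3Client;
        this.archiveBucket = archiveBucket;
    }

    void archive(String sourceBucket, String key) {
        // The destination key is prefixed with the source bucket name.
        String archiveKey = sourceBucket + "/" + key;
        // Note: objects larger than 5 GB would need a multipart copy instead.
        CopyObjectRequest copy = new CopyObjectRequest(sourceBucket, key, archiveBucket, archiveKey)
                // Store archived objects directly in the infrequent access tier.
                .withStorageClass(StorageClass.StandardInfrequentAccess);
        s3Client.copyObject(copy);
        // Only delete the source object after the copy succeeded.
        s3Client.deleteObject(sourceBucket, key);
    }
}
```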
Considerations and open questions:
We might want to split the latter worker into multiple workers: a driver that changes the status to (an additional status) ARCHIVE_REQUESTED and sends an async message to another set of workers that do the moving, so that it can be parallelized. Note that we could externalize the move to avoid using our workers, but since we drive the copy and it will be relatively infrequent, for the moment we can use the standard workers infrastructure.
An alternative to using solely the STATUS_TIMESTAMP for archiving unlinked files is to use a counter: every time the UnlinkedFileHandleDetectionWorker runs we also re-process the UNLINKED file handles and increment a counter; we only archive file handles that, on top of being UNLINKED for X days, have a counter over a given value Y.
Does it make sense to store the history of the file handle status instead? E.g. we could have a dedicated table that stores statuses and timestamps, where a missing entry is considered the default CREATED. Queries can get complicated, especially when computing the last status and the joins involved.
If a pre-signed URL is requested for an UNLINKED file handle, should we update its status back to LINKED?
Restore is left out for now; it will work the same as archiving but in reverse. This becomes a bit more complicated with different storage classes (e.g. if we never want to delete).
Unindexed Data
TODO: We enabled the S3 inventory to check how much data is not indexed. From https://sagebionetworks.jira.com/browse/PLFM-510, as of September 2020, 21% of the data was not accounted for by Synapse projects (668 TB in the bucket vs 530 TB accounted for); the rest might be file handles in tables or other associated objects.
This might happen because of various reasons:
Unfinished multipart uploads: we currently store parts in temporary objects in S3; the backend later copies each part over to the multipart upload and deletes the temporary parts when the upload is completed. If the multipart upload is never finished, the parts are never deleted. Additionally, S3 keeps the multipart upload data "hidden" until the upload is finished or aborted. As of November 10th we have 1,417,823 incomplete multipart uploads, of which 1,414,593 were initiated before October 30th.
Data that can be considered temporary: for example bulk download packages might end up stored in the production bucket; these zip packages are most likely downloaded once and never re-downloaded.
Data from staging: Data created in staging is removed after migrations, but of course the data in S3 is left intact.
Old data already present in the bucket and never cleaned up.
The amount of data in this category is most likely negligible compared to the un-linked data and probably not worth tackling at the moment, but potential solutions are still worth mentioning. The first point, unfinished multipart uploads, is the most relevant, and we can adopt solutions to avoid future costs, such as:
Enable a life cycle rule in the S3 bucket that automatically removes un-finished multipart uploads (see https://sagebionetworks.jira.com/browse/PLFM-6462). This would also mean expiring our multipart uploads in the backend (e.g. if the life cycle deletes incomplete uploads after 2 months, the multipart uploads in the backend should be forcibly restarted, for example, after a month). A sketch of such a rule follows this list.
Refactor the multipart upload to avoid uploading to temporary objects in the bucket (See https://sagebionetworks.jira.com/browse/PLFM-6412)
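As an illustration, such a life cycle rule could be configured through the AWS SDK for Java along these lines (the bucket name and the 60 day window are examples, not decisions):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AbortIncompleteMultipartUpload;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate;

class AbortIncompleteMultipartLifecycle {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // The 60 day window is an example; it must be longer than the backend's own
        // multipart expiration so the backend always restarts uploads first.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                .withId("abort-incomplete-multipart-uploads")
                .withFilter(new LifecycleFilter(new LifecyclePrefixPredicate(""))) // whole bucket
                .withAbortIncompleteMultipartUpload(
                        new AbortIncompleteMultipartUpload().withDaysAfterInitiation(60))
                .withStatus(BucketLifecycleConfiguration.ENABLED);
        // The bucket name below is an example placeholder.
        s3.setBucketLifecycleConfiguration("example-prod-bucket",
                new BucketLifecycleConfiguration().withRules(rule));
    }
}
```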
We enabled the S3 inventory for our bucket, and we can write a job that compares the file handle index with the inventory to delete un-indexed data. A potential approach is a job that:
Streams to S3 the keys of the file handles that point to the prod bucket and were created between two given dates (e.g. the last time the job was run and one month in the past from now)
Uses Athena to join this data with the latest inventory (filtering again by the given dates) to identify un-indexed data
Deletes the un-indexed data from S3 (or moves it to a low cost storage class bucket with an automatic deletion policy)
An alternative is to set up a Glue job that periodically dumps the files table to S3, similar to the S3 inventory, and then a job that joins the two tables using Athena to find un-indexed data. Note however that we need to make sure no temporary data (e.g. the multipart upload parts) is included in the join, e.g. by filtering on sensible dates. An example of such a query is sketched below.
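For illustration, assuming the inventory and the file handles dump are registered as Glue tables named s3_inventory and file_handles (all names, columns, and dates below are assumptions), the Athena query could be sketched as follows:

```java
/** Illustrative only: database, table, and column names are assumptions about the Glue setup. */
class UnindexedDataQuery {

    static final String UNINDEXED_KEYS_SQL =
              "SELECT inv.key"
            + " FROM s3_inventory inv"
            + " LEFT JOIN file_handles fh ON fh.key = inv.key"
            + " WHERE fh.key IS NULL" // in the bucket but not in the file handle index
            // Exclude temporary data such as multipart upload parts by restricting dates.
            + " AND inv.last_modified_date BETWEEN timestamp '2020-10-01 00:00:00'"
            + " AND timestamp '2020-11-01 00:00:00'";

    public static void main(String[] args) {
        // The query would be submitted through the Athena StartQueryExecution API
        // and the resulting keys fed to the deletion (or archival) step.
        System.out.println(UNINDEXED_KEYS_SQL);
    }
}
```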