JIRA: https://sagebionetworks.jira.com/browse/PLFM-510 (description outdated, left for reference).
Related JIRAs:
Related Previous Work:
Introduction
Synapse by default stores most of its data in an encrypted AWS S3 bucket. Additionally, external buckets owned by users can be linked to projects and folders so that data can be stored outside the main bucket and its associated costs can be billed separately. Note that Synapse supports different types of storage, such as a bucket provisioned in Google Cloud, and can link data that is stored elsewhere as long as it is dereferenceable through a URL. The scope of this document is limited to S3 storage, as it is the main type of storage currently in use.
Whenever a file is uploaded through Synapse, a reference to the file is maintained in an index, which is a table in an RDS MySQL instance. We refer to a record in this index table as a File Handle. A file handle is simply a pointer to where the physical data is stored; it does not provide any context about where the data is actually used, and for this reason there is no explicit access model: instead, the user that created the file handle “owns” it (only the owner is allowed to delete the file handle or access it directly). Additionally, from a user perspective a file handle is immutable and its metadata cannot be updated; this simplifies the internal handling of files in Synapse and provides a generic abstraction over file management.
Once a file handle is created it can be referenced directly in multiple places, including file entities, records in a Synapse table, user and team profile pictures, wiki attachments, user messages, etc. The complete list of reference types is maintained in what we call a file handle association type:
https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html.
Through its associations a file can be shared with users and teams and downloaded, usually through a pre-signed URL.
From a maintenance perspective there are two main categories of data that could lead to potential unwanted costs:
Unlinked Data
The current architecture has the advantage that the infrastructure used to manage file uploads is unified across the whole system, but has the drawback that it separates the indexing of the data from the usage of the data, leading to data that can potentially become un-linked and effectively unused, but for which we would still pay the associated storage costs.
Unindexed Data
It is worth mentioning another category of data in our bucket: data that is present in the S3 bucket but not indexed in Synapse (i.e. no file handle exists for the key). This is mostly a concern for the default bucket used by Synapse (as external buckets are managed by their owners).
Ideally the system would be designed in a way that keeps the amount of unlinked and unindexed data as low as possible and allows it to self-heal from potential abuse.
Unlinked Data
There are various scenarios when this might happen:
Updates: for example, when a user updates their profile picture or when a version of a file entity is updated with a new file handle. Another example is records in a Synapse table that point to file handles: a table might be updated to use different file handles without deleting the old ones.
Deletions: for example, when a project is deleted and then purged from the trash can, all the entities in the project are deleted, but the file handles are left intact and the data is maintained in the bucket.
No association: a file could be uploaded and a file handle created but never linked to anything, making the data inaccessible to anyone but the user who uploaded it.
Data migration: a recent use case might lead to several terabytes of un-linked data. When users want to move their data to an external bucket they might download from one bucket and re-upload to another (or copy the data over to other buckets using the recently developed APIs), updating the file handles in the entities and leaving behind a trail of file handles and data that is no longer used.
There are various ways that we could tackle this problem:
We could introduce an explicit link at the time the data is uploaded and maintain a consistent index enforcing a one-to-one relationship: this is hard to implement in practice on top of the current architecture, since we do not know where the uploaded data will be linked (we could generate expiring tokens for uploads to be used when linking), and it would be a breaking change that would probably take years to introduce. Additionally, we found several million file handles already shared between different associations.
We could maintain a single index with all the associations and try to keep it consistent: each time a link is established a record is added with the type of association, and when an association is broken the index is updated. When the last link is removed the file handle can be flagged as potentially un-linked and archived after a certain amount of time unless it is linked back. This would require keeping the index up to date (potentially eventually consistent), but it does not work well with deletes, as it might be extremely complex to know when an association is broken and communicate the change back to the index: e.g. when a folder is deleted all the files under the folder might be deleted (even though technically we already traverse the hierarchy when purging the trash can). This brings overhead for each type of association (e.g. the handling needs to be done in every place where file handles are used). The advantage of this approach is that if an association is broken and the link is not removed, the worst that can happen is that the data is kept (so no harm). Additionally, there are fewer moving parts and points of failure (compared to the next solution). At the same time we would have the migration problem, as this would turn into a big migratable table; we could technically store it in a separate database that does not migrate, e.g. a candidate that would scale well would be DynamoDB (using the file handle id as PK and the association as SK), though there seem to be limitations in how data can be deleted.
Another solution is instead to periodically scan all the file handle associations and ask which ids are linked; with this information we can build an “index” that can be queried to identify the file handles that are un-linked and archive them for deletion, with the possibility to restore them within a given period of time. This approach has the advantage that its implementation has a lower impact on the rest of the system and is isolated to a specific task, potentially reducing the risk of wrong deletions. It is not immune to mistakes, since a developer could still make a mistake when scanning the associations, but we can design it so that we are alerted when mistakes happen, before the actual archival/deletion occurs.
In the following we propose a design that revolves around this last solution. There are three main phases that take place:
A Discovery phase, where the associations are scanned so that links to file handles can be recorded
A Detection phase, where the information from the previous phase is used to establish which file handles are not linked
An Archival phase, where the file handles that are deemed un-linked are placed into an archive state and will eventually be deleted
File Handle Associations Discovery
The first step is to discover the existing associations to file handles. In the backend we already maintain a set of references, generally using dedicated tables with foreign keys back to the file handles table that keep track of the current associations. This maps back to https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html, and in theory we keep all the types of associations we have for file handles in order to handle access permissions (e.g. who can download what through which link). In practice this is a bit more complicated: sometimes we do not have a foreign key, but rather a field in a serialized form in the associated record (e.g. profile pictures), or even in a separate database (e.g. file references in Synapse tables are maintained as dedicated tables, one for each Synapse table).
In the following table we provide the list of known associations to file handles along with a description of how the association is stored:
Association Type | Table | Foreign Key (ON DELETE action) | Description | Current Size | Unlinking |
---|---|---|---|---|---|
FileEntity | JDOREVISION | FILE_HANDLE_ID (RESTRICT) | Each file entity revision has a FK back to the referenced file handle; a single file entity (node) might be linked to multiple file handles through its revisions. | ~11M | The association can be broken in several ways; note that even if a revision is deleted or its file handle is changed, other revisions might still refer to the same file handle. |
TableEntity | Each table entity has an associated table with its file handles; these tables are not migratable and not consistent. The data is also stored in the transactions used to build the tables, kept in a dedicated S3 bucket. | No, the tables are stored in a separate DB | Each table might reference multiple file handles: when a table is built each transaction is processed, and any file handle found in a transaction is added to a dedicated table, one for each Synapse table. Unfortunately this table is not migratable and is rebuilt every week. We keep a migratable table with all the table transactions; the table row changes are packaged in a zip file and stored in a dedicated S3 bucket. | ~36M, distributed over ~10K tables (~1.3K non-empty) | The association can be broken only when the table is deleted (removed from the trash can). |
WikiAttachment | V2_WIKI_ATTACHMENT_RESERVATION, V2_WIKI_MARKDOWN | FILE_HANDLE_ID (RESTRICT); no FK for the copy contained in the ATTACHMENT_ID_LIST blob | The attachments of a wiki page; the table includes both the file handles storing the wiki page and its attachments. The list of attachments is also stored in the V2_WIKI_MARKDOWN table in a blob with the list of ids. | ~1M | The association might be broken in several ways. |
WikiMarkdown | V2_WIKI_MARKDOWN | FILE_HANDLE_ID (RESTRICT) | The markdown of a wiki page. | ~770K | See above |
UserProfileAttachment | JDOUSERPROFILE | PICTURE_ID (SET NULL) | The user profile image. | ~60K | The association can be broken when the profile image is changed. |
TeamAttachment | TEAM | No, contained in the PROPERTIES blob that stores a serialized version of the team object (icon property) | The team picture. | ~4.5K | The association can be broken when the team picture is changed. |
MessageAttachment | MESSAGE_CONTENT | FILE_HANDLE_ID (RESTRICT - NO ACTION) | The messages to users content. | ~460K | The association can be broken if the message is deleted (only admins). |
SubmissionAttachment | JDOSUBMISSION_FILE | FILE_HANDLE_ID (RESTRICT - NO ACTION) | The file handles that are part of an evaluation submission, in particular the file handles associated with a file entity that is part of a submission (e.g. all the versions or a specific version). | ~110K | The association can be broken when the submission is deleted or when the evaluation is deleted. |
VerificationSubmission | VERIFICATION_FILE | FILE_HANDLE_ID (RESTRICT) | The files that are submitted as part of the user verification. Note that when a user is approved or rejected the association is removed. | <10 | The association is broken when the submission is approved or rejected. |
AccessRequirementAttachment | ACCESS_REQUIREMENT_REVISION | No, a file handle might be contained in the SERIALIZED_ENTITY blob that stores a managed access requirement (the ducTemplateFileHandleId property) | A managed access requirement might have a file handle pointing to a DUC template. | ~5K | The association is broken when the access requirement is deleted or updated with a new file handle. |
DataAccessRequestAttachment | DATA_ACCESS_REQUEST | No, various file handles are referenced in the REQUEST_SERIALIZED blob that stores a serialized version of the access request. | A data access request might have multiple files attached for the approval phase (e.g DUC, IRB approval and other attachments). | ~2K | The association is broken when the request is updated with different file handles. |
DataAccessSubmissionAttachment | DATA_ACCESS_SUBMISSION | No, various file handles are referenced in the SUBMISSION_SERIALIZED blob that stores a serialized version of the submission. | Same as above, but for the actual submission. | ~3K | Never? |
FormData | FORM_DATA | FILE_HANDLE_ID (RESTRICT) | The data of a form. | ~300 | When the form is deleted. |
Note: An interesting observation (also from the S3 Bucket Analysis) is that most of the data is referenced by file entities (as expected). Most of our concerns could therefore revolve around the un-linking of entities rather than of other objects. For this reason it might actually be worth implementing option 2. above instead: we could simply register the links when we encounter them and only take care of un-registering entity and table links for cleanup.
The idea would be to periodically scan over all the associations and record the last time a file handle was “seen” (e.g. as associated). In this way we can build an index that can be queried to fetch the last time a file handle association was seen so that file handles can be flagged as un-linked.
There are three aspects of this process that require consideration:
How to perform the scan
Where to store the result
When to perform the scan
Scanning
As we can see from the table above, the way the associations are stored is not consistent across types: some types use dedicated tables, others use a column referring to the file handle id, and others store the information embedded in a serialized field. Additionally, the size distribution is widely uneven: we have tables with a few thousand associations and tables with millions of records.
The simplest approach would be to run a driver job that, for each type of association, sends an SQS message to start a sub-job that scans all the associations of that type. Each sub-job performs a scan reading the data in batches. This approach is problematic because, while in most cases each job might take a few seconds, in some cases it might take several hours (e.g. tables). If a worker goes down while scanning and the job is put back in the queue, the job would have to start from scratch.
We can instead split each association type into independent sub-jobs that run in parallel and only scan a batch of data (e.g. dedicating a small fleet of workers to this task), similar to what happens when we migrate (where we scan in batches of 100K records). One idea is to divide the scan for a given association type into partitions of a given size: for example, considering an id used in a table and its min and max values, we can divide the work into sub-jobs over potentially evenly distributed batches. Each batch can be processed by a dedicated worker. This approach can probably be generalized for most of the association types above, providing a default implementation with some variations (e.g. some objects will need to be de-serialized). The tricky part is dividing the scan into partitions: in some cases this can be approximated using a unique id of the table being scanned, and in other cases (e.g. nodes) we could partition using the file handle id itself.
The idea is to rely on the fault tolerance of both SQS and the worker architecture: if each batch is small enough and a recoverable failure happens, the worker can restart the same batch, assuming that all the information driving the worker is contained in the SQS message.
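A minimal sketch of how a driver could partition an association scan into fixed-size id ranges and enqueue one SQS message per range (Python with boto3 purely for illustration, since the actual workers would live in the Java backend; the queue URL, batch size and message fields are hypothetical):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/000000000000/FILE_HANDLE_SCAN_QUEUE"  # hypothetical queue
BATCH_SIZE = 100_000  # records per sub-job, similar to the migration batch size


def dispatch_scan(association_type, min_id, max_id):
    """Split the [min_id, max_id] id range of an association table into fixed-size
    sub-ranges and enqueue one SQS message per sub-range."""
    lower = min_id
    while lower <= max_id:
        upper = min(lower + BATCH_SIZE - 1, max_id)
        message = {"associationType": association_type, "minId": lower, "maxId": upper}
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))
        lower = upper + 1


# The driver job would first read SELECT MIN(id), MAX(id) from the association table
# and then dispatch the sub-jobs, e.g.:
dispatch_scan("FileEntity", min_id=1, max_id=11_000_000)
```

Each worker that consumes a message only needs the association type and the id range, so a failed batch can simply be retried from the queue.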
Synapse tables problem
The most problematic part is scanning the associations in tables:
We have several thousand tables, each one with its own file association table
Tables are rebuilt every week as they are not migratable, and in some cases they might actually fail to build altogether, rendering the file association table useless.
Each table is built using “transactions” which contain the set of changes to the table; we store the history of transactions in a migratable table. If a transaction succeeds we store the change set in S3 as a compressed serialized object. When a table is built, it goes through the history of transactions, reading each change set from S3 and applying it to the table. Each table might have several thousand changes applied to it. Each change set might contain a file handle association, and this is when we record the file handle in the table's file association table.
One way we could approach scanning the associations for tables is to use the TABLE_ROW_CHANGE table that contains the history of change sets for each table: we can read the changes directly from this table and use them to create sub-jobs that each process a batch of changes. Each worker will read its batch of change sets from S3 and look for file handles in the change sets themselves.
Note that the file handles in this case are freed only when the table is deleted and the changes are dropped. This is unavoidable at the moment due to how the tables are built.
An alternative that would make the scanning much faster is to keep a single migratable table with the file handle associations, populated when the changes are processed; this table would need to be back-filled with previous data. We can keep this as an optimization if the first approach does not work well.
Storage
We have various options to store this information so that it can be used later in the process.
We could store the “linked” status in the file handle table itself along with the timestamp: this would be ideal, as we would only have to work with one table and queries would be relatively efficient. Unfortunately this is not possible, as the file handle table is our biggest table and is migrated every week, and updating the file handles in such a way would most likely lead to unsustainable delays in the migration process. An option would be to move this table to a different “non-migratable” DB, maybe building a dedicated service over file handles; this is a huge task and would require quite a lot of refactoring (e.g. we rely on joining this table).
We could store the last time a file handle was linked in a companion migratable table and use it in a similar fashion: this has a similar limitation, as it would probably lead to delays in migration since the table would grow substantially over time.
We can store this information in another type of external storage.
At the moment option 3. seems the most viable: we can leverage existing technology to store this information as an append-only log in S3 directly and perform queries using Athena:
We can set up a Kinesis Firehose stream that delivers the data to a sort of append-only log in S3: each time a file handle is scanned we send an event to the stream, which can be configured to target a dedicated S3 bucket and transform the data into a columnar format (e.g. Apache Parquet) before storage. Each record would contain the triple: object type, file handle id, timestamp (a sketch of sending such a record is shown after this list).
We can set up a Glue table over this S3 destination so that the data can be queried with Athena
The bucket can be configured so that old data is automatically removed (e.g. after 60 days)
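A minimal sketch of what sending a scanned association to the stream could look like (Python with boto3 for illustration; the stream name and record field names are hypothetical):

```python
import json
import time
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "fileHandleAssociationScanStream"  # hypothetical delivery stream


def record_association(association_type, file_handle_id):
    """Send one scanned association to the Kinesis Firehose stream; Firehose buffers
    the records, converts them to Parquet and delivers them to the S3 log bucket."""
    record = {
        "associationType": association_type,
        "fileHandleId": file_handle_id,
        "scannedOn": int(time.time() * 1000),
    }
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```

In practice a worker scanning a batch would use put_record_batch to send up to 500 records per call instead of one record at a time.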
During the S3 bucket analysis we used Athena heavily to query data in S3 and the results are very promising: joining on mid-size tables directly in S3 (~50M records) is extremely fast (less than 20 seconds) and the results can be stored in S3 directly as target tables. It also has the advantage that once the data is in S3 all the computation can be done externally from a dedicated job (e.g. the data can be joined in S3 to find the un-linked file handles and stored as a dedicated table that can be queried separately).
Timing
An important aspect to consider is that scanning the associations might take a long time. Even if we do it in parallel, we need to limit the amount of data scanned at any given time to avoid overloading the DB. We can dedicate a small number of workers (e.g. 10) to process batches of data; the data is then sent to a Kinesis stream that will eventually deliver it to S3.
This process is not deterministic and failures can occur during the scanning (e.g. a worker goes down, network connectivity is lost, etc.). We would still need to have an idea of the scanning progress, or at least when it started and roughly when it completed. For example, we do not want to start several scans of all the associations in parallel; ideally we would scan once in a while (e.g. once a week) and use the start time of the job to filter the file handles considered for further processing. Before querying the log we would need to know roughly when the last scan finished:
For this we can monitor the stream to check if records are still being processed; if the stream has been empty for more than, say, 24 hours we can consider the running job as “finished”. We can also monitor the queue dedicated to the workers and check whether errors happen during the process (e.g. messages end up in a dead letter queue). In this case we maintain a table with the scanning jobs and ensure that only one is running at any given time.
Another aspect to consider is how to start a scan:
Automatically, according to the status of the current job
Externally, using a Jenkins job or some other trigger (e.g. a CloudWatch event with a lambda) that periodically starts the scan (e.g. every week)
Option 1. would be ideal as the system would be self-contained, but at the moment Option 2. is preferred to avoid the usual prod vs staging problem.
Un-linked File Handle Detection
Once one or more scans are performed we end up with a log of the last time each file handle was seen as linked. This data lives in S3 and can be queried with Athena. We can now use this information to check whether a file handle has not been linked for more than a given amount of time, assuming that a scan was performed recently. Note that we only consider file handles that were last modified before a certain amount of time, e.g. only file handles modified more than 30 days ago.
Since the file handles are in the main DB we need a way to get this information, and we could simply run several Athena queries, each with a batch of ids. This is not ideal, as it is a slow process with several limitations: Athena queries are limited in size (e.g. we can probably ask for a batch of 10K ids), we have a limit on the number of concurrent queries (e.g. 20/s), and handling the results is tricky (e.g. queries are asynchronous). Where Athena shines is when a single query is performed on a big dataset.
We can instead periodically export the file handle table to S3 and join the data directly using Athena. We can write a script that weekly does the following:
Restore a recent snapshot of the DB
Run a Glue job on the snapshot to export the file handle table to S3 in Parquet format (from an initial experiment this can take around 10-20 minutes); we can filter on the modification date (e.g. only file handles that were last modified more than 30 days before the last scan was started) and avoid including previews (which are technically linked file handles).
Join this table with the data delivered by the Kinesis stream to detect un-linked file handles, producing a new table in S3 that contains only the un-linked file handles (a sketch of such a query is shown after this list)
Send a message to the backend to inform it that the results are ready; a worker can read the results in batches to populate a table with the un-linked file handle ids.
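A sketch of what the detection query could look like, assuming hypothetical Glue tables file_handles_snapshot (the weekly export, already filtered by modification date and with previews excluded) and file_handle_links (the data delivered by the Firehose stream); all table, column and bucket names are illustrative:

```python
import boto3

athena = boto3.client("athena")

# CTAS query: produce a new Parquet table in S3 with only the un-linked file handles.
DETECT_UNLINKED = """
CREATE TABLE unlinked_file_handles
WITH (external_location = 's3://dev.gc.bucket/unlinked/run_2020_11_16/', format = 'PARQUET') AS
SELECT fh.id
FROM file_handles_snapshot fh
LEFT JOIN file_handle_links l ON fh.id = l.filehandleid
GROUP BY fh.id
HAVING count(l.filehandleid) = 0
"""

response = athena.start_query_execution(
    QueryString=DETECT_UNLINKED,
    QueryExecutionContext={"Database": "file_handle_gc"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://dev.gc.bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # the query runs asynchronously
```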
Note that, given the uncertainty in the process (e.g. unexpected failures, missed scanned data, etc.), when a file handle is detected as un-linked we do not actually perform any other action. We can simply have a counter (or several records) that tracks how many times a file handle was detected as un-linked. This means that we can boost our confidence over time: for example, if a file handle was detected as un-linked in 3 subsequent scans we can flag it as UNLINKED for subsequent ARCHIVAL. We can have a worker that periodically counts how many times a file handle was detected as un-linked, updates the file handle status and removes the records from this table (e.g. to avoid reprocessing).
An interesting alternative approach to test would be to append the results to S3 directly (Athena allows an INSERT INTO kind of operation), building a log of un-linked detections; we can then run a query on this dataset directly in S3: for example, we can group by file handle id in the un-linked log over a window of the previous 30 days and count how many times each file handle was detected as un-linked. The result would effectively be the file handles that were detected as un-linked at least 3 times in the span of a month (assuming we run the detection often enough).
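A sketch of what such a query could look like, assuming a hypothetical unlinked_log table with filehandleid and detectedon columns:

```python
# Flag only the file handles seen as un-linked at least 3 times in the previous 30 days;
# the log is appended to with INSERT INTO after each detection run.
FLAG_FOR_ARCHIVAL = """
SELECT filehandleid
FROM unlinked_log
WHERE detectedon >= date_add('day', -30, current_timestamp)
GROUP BY filehandleid
HAVING count(*) >= 3
"""
```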
Un-linked File Handle Archival
Deleting user data is a tricky business, especially if we cannot be 100% sure that the data is not used. Instead we propose an approach that goes in stages: first the un-linked data is detected but left accessible, and only after a certain amount of time do we start archiving it and eventually delete it, with the option to restore it before it is actually garbage collected. The archived data will be stored in a dedicated bucket with a life cycle policy to delete objects after X months/years.
The reason to move the data to a dedicated bucket is that we can define life cycle policies at the bucket level, keeping the data well organized.
We can enforce the storage tier of the objects in this bucket to be S3 Standard - Infrequent Access (see https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-class-intro.html) to reduce storage costs; using the infrequent access tier translates into a reduced storage cost but an additional cost to add and retrieve data:
Storage cost is $0.0125/GB (vs standard is $0.023/GB for the first 50TB, $0.022/GB for the next 450TB and $0.021/GB for over 500TB): E.g. 1TB is ~$12.5 vs $25
PUT/POST/COPY/LIST cost is $0.01/1000 requests (vs standard $0.005/1000 requests): e.g. a million objects is ~$10 vs $5
GET/SELECT cost is $0.001/1000 requests (vs standard $0.0004/1000 requests): e.g. a million objects is ~$1 vs $0.4
Life cycle cost is $0.01/1000 requests (e.g. automatic deletion): e.g. a million objects is ~$10
Fetch cost is $0.01/GB (e.g. if we want to restore): 1TB is ~$10
Storing the data in an archiving tier (e.g. Glacier) could be an option, but it complicates restores (the objects first need to be restored) and access to the data becomes expensive and slow. A newer offering from AWS is S3 Intelligent-Tiering, which automatically monitors the usage patterns of object keys and changes their access tiers; while fetching the data has no associated cost, we would pay ($0.1 per million objects/month) for monitoring. For this use case it makes more sense to store the data directly in the infrequent access storage class with custom life cycles to move it to other classes or delete the objects, since the intelligent tiering class does not (yet) offer fine-grained control over how data is moved between classes.
We can introduce a new STATUS column in the file handles table:
STATUS | Description | File Handle Accessible |
---|---|---|
AVAILABLE | Default status | Yes |
UNLINKED | The file handle has been identified as un-linked, if a pre-signed URL is requested for such an object we trigger an alarm (e.g. this should never happen unless we mistakenly identified a linked object) | Yes, trigger alarm. Should we instead consider this as not accessible? |
ARCHIVING | The file is being archived | No, throw not found, trigger alarm |
ARCHIVED | The file has been archived (e.g. moved from the original bucket to the archive bucket) | No, throw not found, trigger alarm |
DELETED | The file has been deleted, we can setup notifications in the S3 bucket and change the status when an object is deleted matching the key of the archived object(s). An alternative to deletion would be to store the objects in the S3 Glacier or Deep Archive for low long term storage costs. | No, throw not found, trigger alarm |
Additionally, we keep track of the status update timestamp with a new column STATUS_TIMESTAMP. We use this timestamp to decide when to move an UNLINKED file handle to the archive, e.g. we can keep the file handle un-linked for 30 days before we move it to the archive.
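A sketch of the resulting state machine (Python for illustration; the exact set of transitions is an interpretation of the statuses above and of the restore path described below, not a final design):

```python
from enum import Enum


class FileHandleStatus(Enum):
    AVAILABLE = "AVAILABLE"
    UNLINKED = "UNLINKED"
    ARCHIVING = "ARCHIVING"
    ARCHIVED = "ARCHIVED"
    RESTORING = "RESTORING"
    DELETED = "DELETED"


# Assumed transitions: an UNLINKED file handle can be restored in place, while
# ARCHIVING/ARCHIVED file handles go through RESTORING; DELETED is terminal.
ALLOWED_TRANSITIONS = {
    FileHandleStatus.AVAILABLE: {FileHandleStatus.UNLINKED},
    FileHandleStatus.UNLINKED: {FileHandleStatus.ARCHIVING, FileHandleStatus.AVAILABLE},
    FileHandleStatus.ARCHIVING: {FileHandleStatus.ARCHIVED, FileHandleStatus.RESTORING},
    FileHandleStatus.ARCHIVED: {FileHandleStatus.RESTORING, FileHandleStatus.DELETED},
    FileHandleStatus.RESTORING: {FileHandleStatus.AVAILABLE},
    FileHandleStatus.DELETED: set(),
}


def can_transition(current, target):
    return target in ALLOWED_TRANSITIONS[current]
```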
A worker will simply periodically scan the UNLINKED file handles whose status was updated more than 30 days in the past:
The worker will check whether other file handles that are still AVAILABLE use the same key (e.g. it is a logical copy); in this case we cannot move the underlying data yet and the archival is delayed, so we can update the timestamp (so that the file handle is re-processed in the next 30 days). Note: this is a tricky edge case; we could delete the file handle at this point, but then we would only allow the 30-day window to restore it in the case of a copy. Maybe we can use a special status instead? What happens if the last copy is archived? What if this copy is then restored?
For eligible file handles the worker will set the status to ARCHIVING and send a message for another worker to start the archiving; that worker will move the data to the archive bucket and set the status to ARCHIVED once done. We can keep the key of the file handle the same, but prefixed with the source bucket (e.g. archive.bucket/proddata.sagebase.org/originalKey); in this way it is easy to restore the file handle to its original key (a sketch of this step is shown below). Note that previews can be deleted here: if a file handle is restored we can re-trigger the preview generation.
While a file handle is in the UNLINKED, ARCHIVING or ARCHIVED state it can be restored; the process would move the data back to its original key (if the object is in an archiving tier such as Glacier we would first need to request the restore from AWS and the process is a bit more involved). We can introduce dedicated statuses (e.g. RESTORING → AVAILABLE) while the restore request is in progress.
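A sketch of the archiving step, reusing the archive bucket naming from the example above (Python with boto3 purely for illustration; the real implementation would be a Java backend worker, and objects larger than 5GB would need a multipart copy):

```python
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "proddata.sagebase.org"
ARCHIVE_BUCKET = "archive.bucket"  # name taken from the example above, not a real bucket


def archive_object(key):
    """Copy the object to the archive bucket (prefixing the key with the source bucket)
    in the Infrequent Access storage class, then delete the original object."""
    archive_key = f"{SOURCE_BUCKET}/{key}"
    s3.copy_object(
        Bucket=ARCHIVE_BUCKET,
        Key=archive_key,
        CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        StorageClass="STANDARD_IA",
    )
    s3.delete_object(Bucket=SOURCE_BUCKET, Key=key)
    return archive_key
```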
Alternative Implementation
The process proposed above has the advantage of being mostly done externally to Synapse, but it also introduces several points of failure that might be difficult to handle correctly. Additionally, it adds infrastructure and technologies that the backend team has limited knowledge of, and in general it is a substantial engineering effort that might span several months of development.
It is worth mentioning a simpler alternative that reuses familiar technologies, especially considering that, from the bucket analysis, what we are most concerned about is data linked in file entities and tables. The process might go as follows:
Every time a link is added (e.g. a file entity is created, a revision is updated, etc.) a dedicated service is invoked to inform the system about the association. We simply add a record to a dedicated table that maps the file handle to the object type/id; this should be done in the same transaction.
When a link is removed we invoke the service again to remove the link
Periodically we scan this table to detect un-linked file handles (e.g. file handles that do not have records in this table) and record this information in another table with a timestamp.
Periodically we scan the “un-linked” table for file handles that have been un-linked for more than 30 days, update the file handles and remove the records from this table
When a link is added we make sure that the record is removed from the previous table as well
This is a simpler solution; the drawback is that we specifically have to “add the link” when we work with file handles, which adds a bit of overhead (e.g. an additional insert to the DB), but we already do this in one way or another. Forgetting to clean up the links does not lead to any data loss. Additionally, we can incrementally add the cleanup for each type if needed, for example focusing only on entities and tables for now. The deletion can be asynchronous, taking the deletion timestamp into account (e.g. delete only if the deletion timestamp is after any existing link), since deleting an entity or a table might lead to several thousands or millions of deletions that need to be done in a background job.
Note that this would not necessarily replace the existing associations used for the access checks.
Of course this has the drawback of introducing another big table that needs to be migrated, and the existing links would need to be backfilled; we could store this in a dedicated DB that does not migrate, but in that case the “un-linked” discovery could not take advantage of joining against the file handle table.
Unindexed Data
From the S3 Bucket Analysis it turns out that we have around 7.5TB of data in S3 for which there is no file handle.
This might happen for various reasons:
Unfinished multipart uploads: we currently store parts in temporary objects in S3; the backend later copies each part over to the multipart upload and deletes the temporary parts when the upload is completed. If the multipart upload is never finished the parts are never deleted. Additionally, S3 keeps the multipart upload data “hidden” until the upload is finished or aborted. As of November 10th we have 1417823 uncompleted multipart uploads, of which 1414593 were initiated before October 11th (see S3 Bucket Analysis). A rough estimate of the currently uploaded unfinished multipart parts accounts for 2.4TB of data.
Data from staging: Data created in staging is removed after migrations, but of course the data in S3 is left intact.
Old data already present in the bucket and never cleaned up.
The amount of data in this category is most likely negligible compared to the un-linked data category and is probably not worth tackling at the moment, but it is still worth mentioning potential solutions. The first point about multipart uploads is the most relevant, and we can put solutions in place to avoid future costs, such as:
Enable a life cycle rule in the S3 bucket that automatically removes unfinished multipart uploads (see https://sagebionetworks.jira.com/browse/PLFM-6462). This would also mean expiring our multipart uploads in the backend (e.g. if the life cycle rule deletes incomplete uploads after 2 months, the multipart uploads in the backend should be forcibly restarted, for example, after a month). A sketch of such a rule is shown after this list.
Refactor the multipart upload to avoid uploading to temporary objects in the bucket (See https://sagebionetworks.jira.com/browse/PLFM-6412)
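A minimal sketch of such a life cycle rule, applied to the whole bucket with a 60-day threshold (Python with boto3 for illustration; the bucket name and threshold are only examples):

```python
import boto3

s3 = boto3.client("s3")

# Abort incomplete multipart uploads 60 days after initiation; the backend would then
# need to expire/restart its own multipart uploads well before this threshold.
s3.put_bucket_lifecycle_configuration(
    Bucket="proddata.sagebase.org",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 60},
            }
        ]
    },
)
```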
We enabled the S3 inventory for our bucket, and we can write a job that compares the file handle index with the inventory to delete un-indexed data. A potential approach is a job that:
Streams to S3 the keys of the file handles that point to the prod bucket and were created between two given dates (e.g. the last time the job was run and a month in the past from now)
Uses Athena to join this data with the latest inventory (filtering again by the given dates) to identify un-indexed data (a sketch of such a query is shown after this list)
Deletes from S3 the un-indexed data (or moves it to a low-cost storage class bucket with an automatic deletion policy)
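A sketch of what the un-indexed data query could look like, assuming hypothetical Glue tables prod_inventory (the S3 inventory of the prod bucket) and file_handle_keys (the exported file handle keys pointing to the prod bucket); column names depend on the inventory configuration and the date window is only an example:

```python
# Keys present in the inventory but with no matching file handle, restricted to the
# agreed date window so that in-flight uploads are not picked up.
FIND_UNINDEXED = """
SELECT inv.key, inv.size
FROM prod_inventory inv
LEFT JOIN file_handle_keys fh ON inv.key = fh.key
WHERE fh.key IS NULL
  AND inv.last_modified_date
      BETWEEN timestamp '2020-10-01 00:00:00' AND timestamp '2020-11-10 00:00:00'
"""
```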
An alternative is to set up a Glue job that periodically dumps the files table to S3, similar to the S3 inventory, and then a job that joins the two tables using Athena to find un-indexed data; note however that we need to make sure that no temporary data (e.g. the multipart upload parts) is included in the join, e.g. by filtering on sensible dates.
In general we should make sure that any data that ends up in the prod bucket is actually indexed in file handles; for example, I would move the temporary objects used for the multipart upload into their own dedicated bucket.