...

  1. We could introduce an explicit link at the time the data is uploaded and maintain a consistent index, enforcing a one to one relationship. This is hard to implement in practice on top of the current architecture: we do not know where the uploaded data will be linked to (we could generate expiring tokens for uploads to be used when linking) and it would be a breaking change that would probably take years to be introduced. A variation is to auto-expire file uploads: when a file is initially uploaded we flag it with a temporary state, and after X days the file is automatically archived/deleted unless it is explicitly linked through some association and the state is explicitly maintained. This requires updating the state of the file any time a link is created. Additionally, we found several million file handles already shared between different associations.

  2. We could maintain a single index with all the associations and try to keep it consistent: each time a link is established a record is added with the type of association, and when an association is broken the index is updated. When the last link is removed the file handle can be flagged as potentially un-linked and archived after a certain amount of time unless it is linked back. This would require keeping the index up to date (potentially eventually consistent), but it does not work well with deletes: it might be extremely complex to know when an association is broken and communicate the change back to the file handle index. E.g. when a folder is deleted all the files under the folder might be deleted, but a file handle can technically be linked to different objects (even though we already traverse the hierarchy when purging the trashcan). This brings overhead for each type of association (e.g. a file handle might be referenced in a file entity and at the same time as a record in a synapse table); the complexity of handling this correctly makes it so that this is not a viable solution.

  3. A potential solution is instead to use a different approach: we can periodically scan all the file handle associations in Synapse and ask which ids are linked; with this information we can build an “index” that can be queried to identify the file handles that are un-linked and archive them for deletion, with the possibility to restore them within a given period of time. This approach has the advantage that its implementation has a lower impact on the rest of the system and is isolated to a specific task, potentially reducing the risk of wrong deletions. It is not immune to mistakes, since a developer could still make a mistake when scanning the associations, but we can design it so that we are alerted when mistakes happen before the actual archival/deletion occurs.

In the following we propose a design that revolves around this solution. There are three main phases:

  1. Discovery of the file handles links

  2. Detection of the un-linked data

  3. Archival of the un-linked data

File Handle Associations Discovery

The first step to decide what is un-linked or unused is to discover the linked data. In the backend we already maintain a set of references, generally using dedicated tables with foreign keys back to the file handles table that keep track of the current associations. This goes back to the https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html and theoretically we keep all the types of associations we have for file handles in order to handle access permissions (e.g. who can download what through which link). This is a bit more complicated in practice: sometimes we do not have a foreign key, but rather a field in a serialized form in the associated record (e.g. profile pictures), or even in a separate database (e.g. file references in Synapse tables are maintained as dedicated tables, one for each Synapse table).

The proposed approach is as follows:

  • Extend the current FileHandleAssociationProvider interface, which each of the “associable” objects implements, to allow streaming ALL of the file handle ids that the type of object has an association with (including entities that are still in the trashcan); see the sketch after this list.

  • Create an administrative job, periodically invoked, that uses the aforementioned method to stream all the file handle ids for each type (this can be done in parallel, but we can keep it simple in the beginning and see how long it takes to do serially).

  • For each id in the stream we “touch” the corresponding file handle, recording the last time it was “seen” and updating its status (see next section).
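
A minimal sketch of what the extended interface could expose; the interface name, method name and use of a Stream are illustrative assumptions, not the actual Synapse API:

```java
import java.util.stream.Stream;

// Hypothetical extension of FileHandleAssociationProvider: each "associable" type exposes a
// way to stream ALL the file handle ids it has an association with (trashcan included).
public interface FileHandleAssociationScanner {

    // Streams every file handle id referenced by this association type. Implementations
    // should read in batches to avoid loading entire tables into memory.
    Stream<Long> streamAllAssociatedFileHandleIds();
}
```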

In this way we effectively build an index that can be queried to fetch the file handles that were not seen after a certain amount of time and can be flagged for archival. An alternative is to instead build a dedicated migratable companion table without touching the file handles table; this might be less strenuous for the system as it would not affect the reads on the file handles. The important part to consider is that we need to keep track of when the driving job scanning the associations was started and completed successfully, in order to avoid querying stale data.

For example, if the scan was done at time T and we want to identify files to archive, we can define a delta DT after which we consider a file un-linked; the query should only consider file handles created before T-DT. This still introduces some edge cases, for example when a file is linked just after the scan is performed, un-linked before the next scan is done and re-linked again after the next scan. This is probably a rare occurrence, but to avoid such issues we could send an async message when a file is linked that is processed to update the status of the file handle.
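
To make the T and DT filter concrete, a minimal sketch of the eligibility check (the names and the exact rule are illustrative assumptions):

```java
import java.time.Duration;
import java.time.Instant;

public class UnlinkedCutoff {

    // Given the start time T of the last successful scan and a delta DT, a file handle is a
    // candidate only if it was created before T - DT and was not seen at or after T - DT.
    public static boolean isArchivalCandidate(Instant scanStartedOn, Duration delta,
            Instant createdOn, Instant lastSeenOn) {
        Instant cutoff = scanStartedOn.minus(delta);
        boolean oldEnough = createdOn.isBefore(cutoff);
        boolean notSeenRecently = lastSeenOn == null || lastSeenOn.isBefore(cutoff);
        return oldEnough && notSeenRecently;
    }
}
```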

A potential issue with this approach is that the timing of the migration might be affected by the many updates performed each time the scan runs. On the other hand, the job might have a predictable timeline and we could time it to run, for example, just after the release so that the migration process picks up the changes at the beginning of the week.

Un-linked File Handle Detection and Archival

Deleting user data is a tricky business, especially if we cannot be 100% sure that the data is not used. Instead we propose an approach that goes in stages: first the un-linked data is detected but left accessible, and only after a certain amount of time do we start archiving it and eventually delete it. The archived data will be stored in a dedicated bucket with a life cycle policy to delete objects after X months/years.

...

  • Storage cost is $0.0125/GB (vs standard $0.023/GB for the first 50TB, $0.022/GB for the next 450TB and $0.021/GB for over 500TB): e.g. 1TB is ~$12.5 vs ~$25

  • PUT/POST/COPY/LIST cost is $0.01/1000 requests (vs standard $0.005/1000 requests): e.g. a million objects is ~$10 vs $5

  • GET/SELECT cost is $0.001/1000 requests (vs standard $0.0004/1000 requests): e.g. a million objects is ~$1 vs $0.4

  • Life cycle cost is $0.01/1000 requests (e.g. to automatically delete): e.g. a million objects is ~$10

  • Fetch cost is $0.01/GB (e.g. if we want to restore): 1TB is ~$10

Storing the data in an archiving tier (e.g. Glacier) could be an option but it complicates restores (the objects first need to be restored) and access to the data becomes expensive and slow. S3 Intelligent-Tiering automatically manages and monitors usage patterns to change the access tier, and while fetching the data has no associated cost we would pay for monitoring ($0.1 per million objects/month). For this use case it makes sense to directly store the data in the infrequent access storage class with custom life cycles to move it to other classes or delete the objects, since the intelligent tiering class does not (yet) offer fine-grained control on how the data is moved between classes.

We introduce a STATUS column in the file handle table that can have the following values:

| STATUS | Description | File Handle Accessible |
| --- | --- | --- |
| CREATED | Default status | Yes |
| LINKED | When a scan is performed and the file handle is found in at least one association | Yes |
| UNLINKED | The file handle has been identified as un-linked; if a pre-signed URL is requested for such an object we trigger an alarm (this should never happen unless we mistakenly identified a linked object) | Yes, trigger alarm |
| ARCHIVING | The file is being archived | No, throw not found, trigger alarm |
| ARCHIVED | The file has been archived (e.g. moved from the original bucket to the archive bucket) | No, throw not found, trigger alarm |
| DELETED | The file has been deleted; we can set up notifications in the S3 bucket and change the status when an object matching the key of the archived object(s) is deleted. An alternative to deletion would be to store the objects in S3 Glacier or Deep Archive for low long-term storage costs. | No, throw not found, trigger alarm |

Additionally we keep track of the status update timestamp with a new column STATUS_TIMESTAMP. We use this timestamp to decide both when to move to the UNLINKED status and when to move an UNLINKED file handle to the archive.
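
A sketch of how the status could be modelled in the backend; the enum mirrors the table above, while the enum name itself is an assumption:

```java
// Possible values of the new STATUS column on the file handle table.
public enum FileHandleStatus {
    CREATED,   // default status at upload time
    LINKED,    // seen in at least one association during the last scan
    UNLINKED,  // not seen by the scans for longer than the configured threshold
    ARCHIVING, // the object is being copied to the archive bucket
    ARCHIVED,  // the object now lives in the archive bucket
    DELETED    // the archived object has been deleted (e.g. by the bucket life cycle policy)
}
```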

We propose to introduce two new workers:

  • UnlinkedFileHandleDetectionWorker: This will scan the CREATED and LINKED file handles and mark as UNLINKED those whose last seen time (STATUS_TIMESTAMP) is earlier than X days before the last successfully completed scan, as long as no other scan is on-going. The job should be triggered by a remote request and run periodically. Additionally it can be parameterized with the bucket name (e.g. for now we can use it on the prod bucket, but we might want to enable it for other buckets as well).

  • UnlinkedFileHandleArchiveWorker: This job will fetch the UNLINKED file handles whose STATUS_TIMESTAMP is older than X days (e.g. only archive files that have been unlinked for 30 days) and:

    • Update their status to ARCHIVING

    • Copy the file from the source bucket to the archive bucket; the destination key will be prefixed with the source bucket name, e.g. proddata.sagebase.org/key1 → archive.sagebase.org/proddata.sagebase.org/key1 (see the sketch after this list)

    • Update their status to ARCHIVED

    • Delete the file from the source bucket
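
A minimal sketch of the copy-and-delete step using the AWS SDK for Java; bucket names follow the example above, and error handling plus multipart copies for very large objects are omitted:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class FileHandleArchiver {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Copies the object to the archive bucket, prefixing the key with the source bucket name
    // (e.g. proddata.sagebase.org/key1 -> archive.sagebase.org/proddata.sagebase.org/key1),
    // then deletes the original. The ARCHIVING/ARCHIVED status updates happen around this call.
    public void archive(String sourceBucket, String key, String archiveBucket) {
        String archiveKey = sourceBucket + "/" + key;
        s3.copyObject(sourceBucket, key, archiveBucket, archiveKey);
        s3.deleteObject(sourceBucket, key);
    }
}
```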

Considerations and open questions:

  • We might want to split the latter worker into multiple workers: a driver that changes the status to (an additional status) ARCHIVE_REQUESTED and sends an async message for another set of workers that do the actual moving, so that it can be parallelized. Note that we could externalize the move to avoid using our workers, but since we drive the copy and it will be relatively infrequent, for the moment we can use the standard workers infrastructure.

  • An alternative to using solely the STATUS_TIMESTAMP for archiving unlinked files is to use a counter: every time the UnlinkedFileHandleDetectionWorker runs we also re-process the UNLINKED file handles and increment a counter, and we only archive file handles that, on top of being UNLINKED for X days, have the counter over a given value Y.

  • Does it make sense to store the history of the file handle status instead? E.g. we can have a dedicated table that stores the status and timestamps; a missing entry could be considered the default CREATED. Queries can get complicated, especially when considering the last status and joins.

  • If a pre-signed URL is requested for an UNLINKED file handle, should we update its status to LINKED?

  • I left out the restore for now which will work the same as archiving but in reverse. This becomes a bit more complicated with different storage classes (e.g. if we never want to delete).

Unindexed Data

...

  1. the handling needs to be done in each place where file handles are used). The advantage of this approach is that if an association is broken and the link is not removed, the worst that can happen is that the data is kept (so no harm). At the same time we would have the migration problem: this would turn into a big migratable table, but we could technically store it in a separate database that does not migrate; a good candidate that would scale well would be to store this index in DynamoDB (using the file handle id as PK and the association as SK).

  2. Another solution is instead to periodically scan all the file handle associations and ask which ids are linked; with this information we can build an “index” that can be queried to identify the file handles that are un-linked and archive them for deletion, with the possibility to restore them within a given period of time. This approach has the advantage that its implementation has a lower impact on the rest of the system and is isolated to a specific task, potentially reducing the risk of wrong deletions. It is not immune to mistakes, since a developer could still make a mistake when scanning the associations, but we can design it so that we are alerted when mistakes happen before the actual archival/deletion occurs.

In the following we propose a design that revolves around this last solution. There are three main phases that take place:

  1. A Discovery phase, where the associations are scanned so that links to file handles can be recorded

  2. A Detection phase, where with the information from the previous phase we can establish which file handles are not linked

  3. An Archival phase, where the file handles that are deemed un-linked are placed into an archive state and will eventually be deleted

File Handle Associations Discovery

The first step is to discover the existing associations to file handles. In the backend we already maintain a set of references, generally using dedicated tables with foreign keys back to the file handles table that keep track of the current associations. This goes back to the https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html and theoretically we keep all the types of associations we have for file handles in order to handle access permissions (e.g. who can download what through which link). This is a bit more complicated in practice: sometimes we do not have a foreign key, but rather a field in a serialized form in the associated record (e.g. profile pictures), or even in a separate database (e.g. file references in Synapse tables are maintained as dedicated tables, one for each Synapse table).

In the following table we provide the list of known associations to file handles along with a description of how the association is stored:

| Association Type | Table | Foreign Key (ON DELETE CASCADE) | Description | Current Size |
| --- | --- | --- | --- | --- |
| FileEntity | JDOREVISION | FILE_HANDLE_ID (RESTRICT) | Each file entity revision has a FK back to the referenced file handle; a single file entity can have multiple revisions (e.g. one node might be linked to multiple file handles through its revisions). | ~11M |
| TableEntity | Each table entity has an associated table with the file handles. These are not migratable tables and are not consistent. The data is also stored in the various transactions used to build the tables, kept in S3 in a dedicated bucket. | No, the tables are stored in a separate DB | Each table might reference multiple file handles: when a table is built each transaction is processed and if a file handle is in the transaction it is added to a dedicated table, one for each Synapse table. Unfortunately this table is not migratable and is rebuilt every week. We keep a migratable table with all the table transactions; the table row changes are packaged in a zip file and stored in a dedicated S3 bucket. | ~36M, distributed in around ~10K tables (~1.3K non empty) |
| WikiAttachment | V2_WIKI_ATTACHMENT_RESERVATION, V2_WIKI_MARKDOWN | FILE_HANDLE_ID (RESTRICT); No, contained in the ATTACHMENT_ID_LIST blob | The attachments to a wiki page; the table includes both the file handles storing the wiki page and its attachments. The list of attachments is also stored in the V2_WIKI_MARKDOWN table in a blob with the list of ids. | ~1M |
| WikiMarkdown | V2_WIKI_MARKDOWN | FILE_HANDLE_ID (RESTRICT) | The markdown of a wiki page. | ~770K |
| UserProfileAttachment | JDOUSERPROFILE | PICTURE_ID (SET NULL) | The user profile image. | ~60K |
| TeamAttachment | TEAM | No, contained in the PROPERTIES blob that stores a serialized version of the team object (icon property) | The team picture. | ~4.5K |
| MessageAttachment | MESSAGE_CONTENT | FILE_HANDLE_ID (RESTRICT - NO ACTION) | The content of messages to users. | ~460K |
| SubmissionAttachment | JDOSUBMISSION_FILE | FILE_HANDLE_ID (RESTRICT - NO ACTION) | The file handles that are part of an evaluation submission; in particular these are the file handles associated with a file entity that is part of a submission (e.g. all the versions or a specific version). | ~110K |
| VerificationSubmission | VERIFICATION_FILE | FILE_HANDLE_ID (RESTRICT) | The files that are submitted as part of the user verification. Note that when a user is approved or rejected the association is removed. | <10 |
| AccessRequirementAttachment | ACCESS_REQUIREMENT_REVISION | No, a file handle might be contained in the SERIALIZED_ENTITY blob that stores a managed access requirement (the ducTemplateFileHandleId property) | A managed access requirement might have a file handle pointing to a DUC template. | ~5K |
| DataAccessRequestAttachment | DATA_ACCESS_REQUEST | No, various file handles are referenced in the REQUEST_SERIALIZED blob that stores a serialized version of the access request | A data access request might have multiple files attached for the approval phase (e.g. DUC, IRB approval and other attachments). | ~2K |
| DataAccessSubmissionAttachment | DATA_ACCESS_SUBMISSION | No, various file handles are referenced in the SUBMISSION_SERIALIZED blob that stores a serialized version of the submission | Same as above, but for the actual submission. | ~3K |
| FormData | FORM_DATA | FILE_HANDLE_ID (RESTRICT) | The data of a form. | ~300 |

The idea would be to periodically scan over all the associations and record the last time a file handle was “seen” (e.g. as associated). In this way we can build an index that can be queried to fetch the last time a file handle association was seen so that file handles can be flagged as un-linked.

There are three aspects that require consideration for this process:

  1. How to perform the scan

  2. Where to store the result

  3. When to perform the scan

Scanning

As we can see from the table above, the way the associations are stored is not consistent across types: some types use dedicated tables, others use a column referring to the file handle id and others store the information embedded in a serialized field. Additionally the size distribution is widely uneven: we have tables with a few thousand associations and tables with millions of records.

The simplest approach would be to run a driver job that, for each type of association, sends an SQS message to start a sub-job that scans all the associations of that type. Each sub-job performs a scan reading the data in batches. This approach is problematic because, while in most cases each job might take a few seconds, in some cases it might take several hours (e.g. tables). If a worker goes down while scanning and the job is put back in the queue, the job would have to start from scratch.

We can instead split each association type into independent sub-jobs that run in parallel and each scan only a batch of data (and dedicate a small fleet of workers to this task), similar to what happens when we migrate (where we scan in batches of 100K records). An idea is to divide the scan for a given association type into partitions of a given size: for example, considering an id used in a table and its min and max value, we can divide the work into sub-jobs over potentially evenly distributed batches. Each batch can be processed by a dedicated worker. This approach can probably be generalized for most of the association types above, providing a default implementation with some variations (e.g. some objects will need to be de-serialized). The tricky part is dividing the scanning into partitions: in some cases this can be approximated using a unique id of the table being scanned, in some other cases (e.g. nodes) we could partition using the file handle id itself.
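
A rough sketch of the partitioning, assuming the backing table exposes a numeric id with a known min/max (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Splits the scan of one association type into evenly sized id ranges; each range can become
// an SQS message processed independently by a worker, similar to the batching used for migration.
public class ScanPartitioner {

    public static List<long[]> partition(long minId, long maxId, long batchSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += batchSize) {
            long end = Math.min(start + batchSize - 1, maxId);
            ranges.add(new long[] { start, end });
        }
        return ranges;
    }
}
```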

The idea is to rely on the fault tolerance of both SQS and the worker architecture: if each batch is small enough and a recoverable failure happens, the worker can simply re-process the same batch, since all the information is contained in the SQS message driving the worker.

Synapse tables problem

The most problematic part is scanning the associations in tables:

  1. We have several thousand tables, each one with its own file association table

  2. Tables are rebuilt every week since they are not migratable, and in some cases they might actually fail to build altogether, rendering the file association table useless.

Each table is built using “transactions” which contain the set of changes to the table; we store the history of transactions in a migratable table. If a transaction succeeds we store the change set in S3 as a compressed serialized object. When a table is built it goes through the history of the transactions, reading the change set from S3 and applying it to the table. Each table might have several thousands of changes applied to it. Each change set might contain a file handle association, and this is when we record the file handle in the table file association table.

One way we could approach scanning the associations for tables is to use the TABLE_ROW_CHANGE table that contains the history of change sets for each table: we can read the changes directly from this table and use it to create sub-jobs that process a batch of changes. Each worker will read its batch of change sets from S3 and look for file handles in the change sets themselves.

Note that the file handles in this case are freed only when the table is deleted and the changes are dropped. This is unavoidable at the moment due to how the tables are built.

An alternative that would make the scanning much faster is to keep a single migratable table with the file handle associations that is populated when the changes are processed; this table would need to be back-filled with previous data. We can keep this as an optimization if the first approach does not work well.

Storage

We have various options to store this information so that it can be used later in the process.

  1. We could store the “linked” status in the file handle table itself along with the timestamp: this would be ideal as we would have to work with only one table and queries would be relatively efficient. Unfortunately this is not possible, as the file handle table is our biggest table, it is migrated every week, and updating the file handles in such a way would most likely lead to unsustainable delays in the migration process. An option would be to move this table to a different “non-migratable” DB, maybe building a dedicated service over file handles, but this is a huge task and would require quite a lot of refactoring (e.g. we rely on joining this table).

  2. We could store the last time a file handle is linked in a companion migratable table and use it in a similar fashion: this has a similar limitation, as it would probably lead to delays in migration since the table would grow substantially over time.

  3. We can store this information in another type of external storage.

At the moment option 3. seems the most viable: we can leverage existing technology to store this information as an append-only log in S3 directly and perform queries using Athena:

  1. We can set up a Kinesis Firehose stream that delivers the data to a sort of append-only log in S3: each time a file handle is scanned we send an event to the stream, which can be configured to target a dedicated S3 bucket and transform the data into a columnar format (e.g. Apache Parquet) before storage. The records would contain the triple: object type, file handle id, timestamp (see the sketch after this list).

  2. We can set up a Glue table over this S3 destination so that the data can be queried with Athena

  3. The bucket can be set up so that old data is automatically removed (e.g. after 60 days)
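
A sketch of the event emission using the AWS SDK for Java; the delivery stream name and the JSON record layout are assumptions, and in practice the records would likely be batched with putRecordBatch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;

public class FileHandleScanEventEmitter {

    private final AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

    // Sends the (object type, file handle id, timestamp) triple as a JSON record; Firehose
    // buffers the records, converts them to Parquet and delivers them to the S3 log bucket.
    public void recordAssociation(String objectType, long fileHandleId, long scannedOn) {
        String json = String.format("{\"objectType\":\"%s\",\"fileHandleId\":%d,\"scannedOn\":%d}",
                objectType, fileHandleId, scannedOn);
        firehose.putRecord(new PutRecordRequest()
                .withDeliveryStreamName("fileHandleAssociationsStream")
                .withRecord(new Record().withData(ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8)))));
    }
}
```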

During the S3 bucket analysis we used Athena heavily to query data in S3 and the results are very promising: joining mid-size tables directly in S3 (~50M records) is extremely fast (less than 20 seconds) and the results can be stored in S3 directly as target tables. It also has the advantage that once the data is in S3 all the computation can be done externally from a dedicated job (e.g. the data can be joined in S3 to find out the un-linked file handles and stored as a dedicated table that can be queried separately).

Timing

An important aspect to consider is that the scanning of the associations might take a long time. Even if we do it in parallel we need to limit the amount of data scanned at a given time to avoid overloading the DB. We can dedicate a small number of workers (e.g. 10) to process batches of data; the data is then sent to a Kinesis stream that will eventually deliver it to S3.

This process is not deterministic and failures can occur during the scanning (e.g. a worker goes down, network connectivity is lost, etc.). We would still need to have an idea of the scanning progress, or at least when it started and more or less when it completed. For example, we do not want to start several scans of all the associations in parallel; ideally we would scan once in a while (e.g. once a week) and use the start time of the job to filter the file handles considered for further processing. Before querying the log we would need to know more or less when the last scan finished.

For this we can monitor the stream to check if records are still being processed; if the stream has been empty for more than 24 hours, for example, we can consider the job that was running as “finished”. We can also monitor the queue dedicated to the workers and check if errors happen during the process (e.g. messages end up in a dead letter queue). In this case we maintain a table with the scanning jobs and ensure that only one is running at a given time.

Another aspect to consider is how to start a scan:

  1. Automatically, according to the status of the current job

  2. Externally, using a Jenkins job or some other trigger (e.g. a CloudWatch event with a lambda) that periodically starts the scan (e.g. every week)

Option 1. would be ideal as the system would be self-contained, but at the moment Option 2. is preferred to avoid the usual prod vs staging problem.

Un-linked File Handle Detection

Once one or more scans have been performed we end up with a log of the last time each file handle was seen as linked. This data will live in S3 and can be queried with Athena. We can now use this information to check if any file handle has not been linked for more than a given amount of time, assuming that a scan was performed recently. Note that we only consider file handles that were last modified before a certain amount of time, e.g. only file handles modified more than 30 days ago.

Since the file handles are in the main DB we need a way to get this information, and we could simply run several Athena queries with batches of ids. This is not ideal as it is a slow process with several limitations: Athena queries are limited in size (e.g. we can probably ask for a 10K batch of ids), we have a limit on the number of concurrent queries that can be performed (e.g. 20/s) and handling the results is tricky (e.g. queries are asynchronous). Where Athena shines is when a single query is performed on a big dataset.

We can instead periodically export the file handle table in S3 and join the data directly using Athena. We can write a script that weekly does the following:

  1. Restore a recent snapshot of the DB

  2. Run a Glue job on the snapshot to export the file handle table to S3 in Parquet format (from an initial experiment this can take around 10-20 minutes); we can filter on the modification date (e.g. only file handles that have been modified 30 days prior to the start of the last scan) and avoid including previews (which are technically linked file handles).

  3. Join this table with the data delivered by the Kinesis stream to detect un-linked file handles, producing a new table in S3 with only the file handles that are un-linked (see the query sketch after this list)

  4. Send a message to the backend to inform it that the results are ready; a worker can read these results in batches to populate a table with the un-linked file handle ids.
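
A sketch of how the detection query could be submitted through the Athena SDK for Java; the database, table and column names as well as the output location are assumptions about how the Glue tables could be set up:

```java
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;

public class UnlinkedDetectionQuery {

    // File handles exported from the snapshot that never appear in the association log.
    private static final String QUERY =
        "SELECT f.id FROM file_handles f " +
        "LEFT JOIN file_handle_associations a ON f.id = a.file_handle_id " +
        "WHERE a.file_handle_id IS NULL";

    public static String start() {
        AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();
        StartQueryExecutionRequest request = new StartQueryExecutionRequest()
                .withQueryString(QUERY)
                .withQueryExecutionContext(new QueryExecutionContext().withDatabase("file_handle_scans"))
                .withResultConfiguration(new ResultConfiguration()
                        .withOutputLocation("s3://unlinked-file-handles-results/"));
        // Athena is asynchronous: the execution id is used to poll for completion and to
        // locate the results written to the output location.
        return athena.startQueryExecution(request).getQueryExecutionId();
    }
}
```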

Note that, given the uncertainty in the process (e.g. unexpected failures, missed scanned data, etc.), when a file handle is detected as un-linked we do not immediately perform any other action. We can simply have a counter (or several records) that tracks how many times a file handle was detected as un-linked. This means that we can boost our confidence over time: for example, if a file handle was detected as un-linked in 3 subsequent scans we can flag the file handle as UNLINKED for subsequent archival. We can have a worker that periodically counts how many times a file was detected as un-linked, updates the file handle status and removes the records from this table (e.g. to avoid reprocessing).

An interesting alternative approach to test would be to append the results directly in S3 (Athena supports an INSERT INTO kind of operation), building a log of un-linked detections; we can then run a query on this dataset directly in S3. For example, we can group the un-linked log by file handle id over a window of the previous 30 days and count how many times each file handle was detected as un-linked: the result would effectively be the file handles that were detected as un-linked at least 3 times in the span of a month (assuming we run the detection often enough). A sketch of this query follows.
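
A possible shape for the grouping query, kept as a constant to be submitted through the same Athena client as above (the unlinked_detections table and its columns are assumptions):

```java
public class UnlinkedDetectionLog {

    // Counts, over the last 30 days of the un-linked detection log, how many runs flagged each
    // file handle; only ids detected at least 3 times would be promoted to UNLINKED.
    public static final String UNLINKED_CANDIDATES_QUERY =
        "SELECT file_handle_id, COUNT(*) AS detections " +
        "FROM unlinked_detections " +
        "WHERE detected_on >= date_add('day', -30, current_date) " +
        "GROUP BY file_handle_id " +
        "HAVING COUNT(*) >= 3";
}
```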

Un-linked File Handle Archival

Deleting user data is a tricky business, especially if we cannot be 100% sure that the data is not used. Instead we propose an approach that goes in stages: first the un-linked data is detected but left accessible, and only after a certain amount of time do we start archiving it and eventually delete it, with the option to restore it before it is actually garbage collected. The archived data will be stored in a dedicated bucket with a life cycle policy to delete objects after X months/years.

The reason to move the data to a dedicated bucket is that we can define life cycle policies at the bucket level, keeping the data well organized.

We can enforce the storage class of the objects in this bucket to be S3 Standard - Infrequent Access (see https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-class-intro.html) to reduce storage costs; the infrequent access class translates into a reduced storage cost but an additional cost to add and retrieve data:

  • Storage cost is $0.0125/GB (vs standard $0.023/GB for the first 50TB, $0.022/GB for the next 450TB and $0.021/GB for over 500TB): e.g. 1TB is ~$12.5 vs ~$25

  • PUT/POST/COPY/LIST cost is $0.01/1000 requests (vs standard $0.005/1000 requests): e.g. a million objects is ~$10 vs $5

  • GET/SELECT cost is $0.001/1000 requests (vs standard $0.0004/1000 requests): e.g. a million objects is ~$1 vs $0.4

  • Life cycle cost is $0.01/1000 requests (e.g. to automatically delete): e.g. a million objects is ~$10

  • Fetch cost is $0.01/GB (e.g. if we want to restore): 1TB is ~$10

Storing the data in an archiving tier (e.g. Glacier) could be an option but it complicates restores (the objects first need to be restored) and access to the data becomes expensive and slow. A new offering from AWS is S3 Intelligent-Tiering, which automatically manages and monitors usage patterns of object keys to change their access tier; while fetching the data has no associated cost, we would pay for monitoring ($0.1 per million objects/month). For this use case it makes sense to directly store the data in the infrequent access storage class with custom life cycles to move it to other classes or delete the objects, since the intelligent tiering class does not (yet) offer fine-grained control on how the data is moved between classes.

We can introduce a new STATUS column in the file handles table:

| STATUS | Description | File Handle Accessible |
| --- | --- | --- |
| AVAILABLE | Default status | Yes |
| UNLINKED | The file handle has been identified as un-linked; if a pre-signed URL is requested for such an object we trigger an alarm (e.g. this should never happen unless we mistakenly identified a linked object) | Yes, trigger alarm. Should we instead consider this as not accessible? |
| ARCHIVING | The file is being archived | No, throw not found, trigger alarm |
| ARCHIVED | The file has been archived (e.g. moved from the original bucket to the archive bucket) | No, throw not found, trigger alarm |
| DELETED | The file has been deleted; we can set up notifications in the S3 bucket and change the status when an object is deleted matching the key of the archived object(s). An alternative to deletion would be to store the objects in S3 Glacier or Deep Archive for low long-term storage costs. | No, throw not found, trigger alarm |

Additionally we keep track of the status update timestamp with a new column STATUS_TIMESTAMP. We use this timestamp to decide when to move an UNLINKED file handle to the archive. E.g. we can keep the file handle un-linked for 30 days before we move it to the archive.

A worker will simply periodically scan the UNLINKED file handles whose status was updated more than 30 days in the past:

  1. The worker will check if other file handles that are still AVAILABLE use the same key (e.g. it is a logical copy); in this case we cannot move the underlying data yet and the archival is delayed: we can update the timestamp so that the file handle is re-processed within the next 30 days (see the sketch after this list). Note: this is a tricky edge case; we could delete the file handle at this point, but then we would only allow the 30 day window to restore it in the case of a copy. Maybe we can use a special status instead? What happens if the last copy is archived? What if this copy is then restored?

  2. For eligible file handles the worker will set the status to ARCHIVING and send a message for another worker that will move the data to the archive bucket and set the status to ARCHIVED once done. We can keep the key of the file handle the same but prefixed with the source bucket (e.g. archive.bucket/proddata.sagebase.org/originalKey); in this way it is easy to restore the file handle to the original key. Note that the previews can be deleted here; if a file handle is restored we can trigger the preview generation again.
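
A sketch of the eligibility check in step 1, assuming a Spring JdbcTemplate over the file handles table (the table and column names are simplified assumptions):

```java
import org.springframework.jdbc.core.JdbcTemplate;

public class ArchivalEligibilityCheck {

    private final JdbcTemplate jdbcTemplate;

    public ArchivalEligibilityCheck(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // The data can be moved only if no other AVAILABLE file handle points to the same
    // bucket/key, i.e. the file handle is not a logical copy of a still available one.
    public boolean canArchive(long fileHandleId, String bucket, String key) {
        Long copies = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM FILES WHERE BUCKET_NAME = ? AND `KEY` = ? AND STATUS = 'AVAILABLE' AND ID <> ?",
            Long.class, bucket, key, fileHandleId);
        return copies != null && copies == 0;
    }
}
```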

While a file handle is in the UNLINKED, ARCHIVING or ARCHIVED state it can be restored; the process would move the data back to the original key (if the object is in an archiving tier such as Glacier we first need to request the restore from AWS and the process is a bit more involved). We can introduce dedicated statuses (e.g. RESTORING → AVAILABLE) while the restore request is in progress.

Unindexed Data

From the S3 Bucket Analysis it turns out that we have around 7.5TB of data in S3 for which there is no file handle.

This might happen because of various reasons:

  • Unfinished multipart uploads: We currently store parts in temporary objects in S3 and the backend later copies the parts over to the multipart upload and deletes the temporary parts when the multipart is completed; if the multipart upload is never finished the parts are never deleted. Additionally S3 keeps the multipart upload data “hidden” until the multipart is finished or aborted. As of November 10th we have 1417823 uncompleted multipart uploads, of which 1414593 were initiated before October 11th (see S3 Bucket Analysis). A rough estimate of the parts uploaded for the currently unfinished multipart uploads accounts for 2.4TB of data.

  • Data that can be considered temporary: For example bulk download packages might end up stored in the production bucket; these zip packages are most likely used once and never re-downloaded.

  • Data from staging: Data created in staging is removed after migrations, but of course the data in S3 is left intact.

  • Old data already present in the bucket and never cleaned up.

...