...

  1. Leave the data in STD: $15,185

  2. Move current unlinked data (>= 128KB) to Infrequent Access*: $14,394

    1. Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in IA

    2. Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)

  3. Eventually move unlinked data to Glacier Deep Archive*: $13,323

    1. Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in Glacier Deep Archive

    2. Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)

    3. Consider the avg of 28,808/month for lifecycle transitions to Glacier and Glacier Deep Archive

  4. Move everything to INT**: $11,775

    1. Consider 44% in INT-Standard (300H/680T)

    2. Consider 56% in INT-IA ((680T-300H)/680T)

    3. Includes the 44M objects monitoring fee

  5. Move to INT + unlinked eventually in INT-deep archive**: $10,734

    1. Assumes that unlinked data is part of the cold data

    2. Consider 44% in INT-Standard (300H/680T)

    3. Consider 43% in INT-IA ((680T - 300H - 91U)/680T)

    4. Consider 13% in INT-Deep Archive (91U/680T)

    5. Includes the 44M objects monitoring fee

    6. Assumes that we have 28,808 tags per month (GET/PUT request + tag costs)

  6. Move to INT + delete unlinked data***: $10,544

    1. Assumes that unlinked data is part of the cold data

    2. Consider 50% in INT-Standard (300H/(680T - 91U))

    3. Consider 50% in INT-IA

    4. Includes the 44M - 8M objects monitoring fee, no tags, no lifecycle transitions
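
The tier percentages used in options 4 to 6 are plain ratios. A minimal sketch of that arithmetic, assuming that T, H and U in the list above stand for the total, hot and unlinked data volumes in the same unit (the constant and function names are illustrative and not part of the original analysis):

```python
# Illustrative only: T, H and U are read from the list above as the total,
# hot and unlinked data volumes, all expressed in the same unit.
TOTAL = 680      # T
HOT = 300        # H
UNLINKED = 91    # U


def share_pct(part: float, whole: float) -> float:
    """Return part/whole as a percentage with one decimal."""
    return round(100 * part / whole, 1)


# Option 4: everything in INT, hot data in INT-Standard and the rest in INT-IA.
opt4 = {
    "INT-Standard": share_pct(HOT, TOTAL),
    "INT-IA": share_pct(TOTAL - HOT, TOTAL),
}

# Option 5: unlinked data (assumed cold) ends up in INT-Deep Archive.
opt5 = {
    "INT-Standard": share_pct(HOT, TOTAL),
    "INT-IA": share_pct(TOTAL - HOT - UNLINKED, TOTAL),
    "INT-Deep Archive": share_pct(UNLINKED, TOTAL),
}

# Option 6: unlinked data is deleted, so the total shrinks by U.
remaining = TOTAL - UNLINKED
opt6 = {
    "INT-Standard": share_pct(HOT, remaining),
    "INT-IA": share_pct(remaining - HOT, remaining),
}

if __name__ == "__main__":
    for name, option in (("4", opt4), ("5", opt5), ("6", opt6)):
        print(f"Option {name}: {option}")
```

The unrounded values sit close to the whole percentages quoted in the list; the split in option 6 is only approximately 50/50.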

...

** Note that this does not include the initial fee of about $400 for moving all the 44M objects to INT (through a lifecycle transition); once an object is in INT there are no fees for other transitions. Additionally, in order to detect UNLINKED file handles we rely heavily on a few additional services: Kinesis Firehose to stream data to S3, Athena for querying, Step Functions for orchestration and EventBridge for periodic triggering. The most expensive of these is Kinesis Firehose, since we push a lot of data through it. According to the billing for the last few months, though, this adds up to a few tens of dollars, so we excluded it from the estimates (see the cost explorer for Kinesis).

*** Note that there is the possibility that a lot of data becomes unlinked in a single month (e.g.

...

a large amount of data is copied to another bucket). We would pay the overhead of this data for a few months until it moves to cheaper tiers and is eventually archived into deep archive. Given that this is not a common use case (e.g. it might happen once or twice every few years), and given the complexities around how we store file handles and the current limitations of S3, we do not plan to support a “speed up” of the archival. The user can still request the deletion of their data through the appropriate APIs.

It is clear that just moving everything to INT is cost effective for our use case: our access patterns are unknown, and we have data that, even if still linked, is rarely accessed (e.g. older projects, older versions of entities, older tables, messages etc). On top of that, archiving unlinked data might be worth it even though the cost savings are not comparable to those of just using the INT storage class: on average around 1TiB of data gets unlinked each month.

Of note is that part of the unlinked data consists of temporary file handles that Synapse creates internally (e.g. table queries, table CSVs for uploads etc); the high number (30K) of file handles that are unlinked each month might be attributed to this. Unfortunately we do not tag this kind of data, so we do not know exactly how much of that 1TiB is actually temporary file handles (we do know that at least 20TiB of the current unlinked data is most likely due to this: https://sagebionetworks.jira.com/wiki/spaces/PLFM/pages/1629782065/S3+Bucket+Analysis#Temporary-File-Handles).

To solve this we should upload this data either with a special object tag that expires the object through a lifecycle rule, or into a dedicated bucket (e.g. another storage location) with a lifecycle that expires everything after 30 days. These temporary objects could be flagged in the database with a special status (e.g. TEMPORARY) so that they are not collected as UNLINKED but rather removed either through S3 notifications (see open questions below) or through a worker.
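
As a rough illustration of the tag-based variant, a lifecycle rule that expires tagged objects after 30 days could be described as below. This is a sketch under assumptions: the tag name `synapse-temporary` and the rule id are hypothetical, and only the dictionary is built here (nothing is sent to S3).

```python
import json

# Hypothetical tag that the upload code would attach to temporary objects.
TEMPORARY_TAG = {"Key": "synapse-temporary", "Value": "true"}


def temporary_expiration_rule(days: int = 30) -> dict:
    """Build a lifecycle rule that expires objects carrying TEMPORARY_TAG."""
    return {
        "ID": "expire-temporary-objects",   # hypothetical rule id
        "Status": "Enabled",
        "Filter": {"Tag": TEMPORARY_TAG},   # only tagged objects are affected
        "Expiration": {"Days": days},       # delete N days after creation
    }


# A bucket lifecycle configuration is a list of rules.
lifecycle_configuration = {"Rules": [temporary_expiration_rule()]}

if __name__ == "__main__":
    print(json.dumps(lifecycle_configuration, indent=2))
```

The dictionary has the shape accepted by boto3's `put_bucket_lifecycle_configuration` (as the `LifecycleConfiguration` argument). Keep in mind that this call replaces the whole lifecycle configuration of the bucket, so existing rules would have to be included as well.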

Another clear point that surfaced is that deleting the unlinked data is not worth the effort: if we move it to the cheapest archive tier, it is actually worth keeping the file handles around just in case we need to recover them.

Proposed Implementation

Archival

In the FILES table we store the status of the file handle, which is set to AVAILABLE on creation; file handles that have been detected as UNLINKED are flagged as such. Since we allow creating copies of the same file handle that point to the same key, we have the following possible scenarios for a file handle that is flagged as UNLINKED:

  1. At least one copy exists for the same key that is still AVAILABLE

  2. All of the copies for the same key are flagged as UNLINKED

Note that a file handle can be copied (not the physical data) by different users through a dedicated API; in other words, we can have multiple file handles pointing to the same bucket and key that could have been “created” by different users.

We want to have a window of time (e.g. 30 days) before we actually archive an object; this provides an additional safety net against potential bugs and misdetections. If a file handle that is not AVAILABLE is accessed, the system will send an alert for further investigation. Additionally, it will not be possible to perform certain operations on a file handle that is not AVAILABLE (e.g. it cannot be copied; eventually we will disable any kind of access to file handles that are not AVAILABLE).

An additional aspect that we need to consider is that the staging database shares the same file handle pointers as production, but at the same time it might be only partially in sync with prod. Therefore it is not possible to have an automated scheduled job that is part of the stack perform the archival.

Instead we propose to add a new dedicated (admin only) asynchronous job that will be invoked periodically by an external process (e.g. a Jenkins job):

  • The request specifies a batch size and the number of days to look back. This is needed to limit the amount of data that we scan per job and to avoid expensive table scans (using a separate table for the status might be better, but we would have to refactor every file handle query to use a left join).

  • The worker will fetch a batch (LIMIT) of unique keys that:

    • Were last updated more than 30 days ago (let us call this point in time T) but no earlier than T - <the input days> (if we call this job every 10 minutes we can limit this to a day to avoid big scans)

    • Are flagged as UNLINKED

    • Are in the proddata.sagebase.org bucket (for now we process only our bucket, we can later extend it to other buckets, maybe through a storage location configuration)

  • A message that includes a batch of keys (e.g. 100) and T is put on a queue. Another worker processes each key in the message in a separate transaction; for each key it:

    • Updates all the file handles for that key that are UNLINKED and older than T so that their status is ARCHIVED

    • Fetches the count of file handles for that key that are AVAILABLE OR (UNLINKED and updated after T)

      • If the count is > 0 then the key is still available, or another file handle is unlinked but not yet past the 30-day window

      • If the count is 0 then we can archive the object in S3: we first get the tag set of the object and then merge it with a new tag such as “synapse-status=archive”. A configuration for the INT storage class then automatically moves the objects tagged as such to the archive tier and the deep archive tier if they were not accessed for more than 90 and 180 days respectively.

    • Deletes all the previews of the ARCHIVED file handles, as long as they are not used by other available file handles (unfortunately this is a possibility). The S3 object of a preview is deleted if and only if the file handle is the last one pointing to its key
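
The per-key steps above can be sketched as follows. This is a simplified in-memory illustration, not the actual worker: the `FileHandle` class, the function names and the configuration id are hypothetical, the database is replaced by a list, and no S3 call is made. The tag set uses the list-of-`Key`/`Value` shape that S3 returns for object tags, and the tiering dictionary follows the shape accepted by boto3's `put_bucket_intelligent_tiering_configuration`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class FileHandle:
    id: int
    key: str
    status: str            # AVAILABLE, UNLINKED or ARCHIVED
    updated_on: datetime


def archive_key(handles: List[FileHandle], key: str, t: datetime) -> bool:
    """Flag UNLINKED handles of `key` older than T as ARCHIVED.

    Returns True when no handle still needs the object, meaning that the
    S3 object can be tagged for archival.
    """
    same_key = [h for h in handles if h.key == key]
    for h in same_key:
        if h.status == "UNLINKED" and h.updated_on < t:
            h.status = "ARCHIVED"
    # Handles that keep the object alive: AVAILABLE ones, or UNLINKED ones
    # that are still inside the safety window (updated at or after T).
    still_needed = sum(
        1 for h in same_key
        if h.status == "AVAILABLE"
        or (h.status == "UNLINKED" and h.updated_on >= t)
    )
    return still_needed == 0


def merge_tag(tag_set: List[Dict[str, str]], key: str, value: str) -> List[Dict[str, str]]:
    """Return a copy of an S3 tag set with `key` set to `value`."""
    merged = [tag for tag in tag_set if tag["Key"] != key]
    merged.append({"Key": key, "Value": value})
    return merged


# Intelligent-Tiering configuration that archives only the tagged objects.
ARCHIVE_TIERING_CONFIGURATION = {
    "Id": "synapse-archive",   # hypothetical configuration id
    "Status": "Enabled",
    "Filter": {"Tag": {"Key": "synapse-status", "Value": "archive"}},
    "Tierings": [
        {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
        {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
    ],
}
```

Merging rather than overwriting the tag set matters because writing object tags to S3 replaces the whole set, so any tags already on the object would otherwise be lost.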

Note that since we do not care much about keeping track of the job itself, the job completes when the batch has finished sending messages to the queue (i.e. not when all the keys in the batch have been processed). If we call this job again too frequently we risk putting the same key in the queue multiple times, but I don’t see any side effects.

Note that we do not delete file handles; this way they can be restored at a later time by their ids if needed (since different users might have created copies of the same file handle). Additionally, we want to skip tagging files smaller than 128KB (and therefore moving them to the archive tiers) for two reasons:

  • Even if we enable a lifecycle to automatically move objects to INT, those will not be moved (we can upload new ones directly in the INT class without limitation, but they are not automatically moved into INT-IA after 30 days)

  • The amount of data (a few GB) is too small to be worth tagging and moving to the archive tier. We would pay more for the GET/PUT requests and object tags than for standard storage (e.g. we have ~6M objects < 128KB: the tags alone cost $6/month, while storing the 100GB in Standard costs $2.5).
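
The comparison in the second point can be reproduced with a back-of-the-envelope calculation. The unit prices below are assumptions (list prices that vary by region and over time), so the result only approximates the figures quoted above:

```python
# Assumed unit prices, not taken from the original analysis.
TAG_PRICE_PER_10K_TAGS_MONTH = 0.01    # USD per 10,000 object tags per month
STANDARD_PRICE_PER_GB_MONTH = 0.023    # USD per GB per month in S3 Standard

SMALL_OBJECT_COUNT = 6_000_000         # ~6M objects smaller than 128KB
SMALL_OBJECT_DATA_GB = 100             # their total size


def monthly_tag_cost(objects: int, tags_per_object: int = 1) -> float:
    """Monthly cost of keeping `tags_per_object` tags on each object."""
    return objects * tags_per_object / 10_000 * TAG_PRICE_PER_10K_TAGS_MONTH


def monthly_standard_cost(size_gb: float) -> float:
    """Monthly cost of keeping `size_gb` in S3 Standard."""
    return size_gb * STANDARD_PRICE_PER_GB_MONTH


if __name__ == "__main__":
    print(f"tags:    ${monthly_tag_cost(SMALL_OBJECT_COUNT):.2f}/month")
    print(f"storage: ${monthly_standard_cost(SMALL_OBJECT_DATA_GB):.2f}/month")
```

With these assumed prices the tags alone cost more than twice as much as simply leaving the small objects in Standard, before counting the GET/PUT requests needed to tag them.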

Restore

Since we delete neither the data nor the file handles, but potentially store the data in an archive tier, we need a way to perform a restore (once an object is in the archive tier, restoring it requires an asynchronous operation that might take a few hours).

We propose to add a new dedicated API that performs the restore through an asynchronous job:

  • Takes as input a batch of ids (max 1000) to restore

  • For each id, checks the file handle status (verifying that the user is the creator or an admin):

    • If UNLINKED, updates it and all the matching file handles to AVAILABLE.

    • If ARCHIVED, fetches the archive status of the S3 object (a HEAD request should be enough, or we can batch this into a list objects call):

      • If none, then the object is still in INT-IA or INT-Standard: the file handle can be updated to AVAILABLE along with all the copies, and the synapse-status tag can be cleared from the object

      • If present, updates it and the matching file handles to RESTORING and sends a restore request for the S3 object (this is an asynchronous call). We enable the S3 notifications, and when a restore completes a worker sets all the file handles that match the key to AVAILABLE and clears the synapse-status tag from the object

    • Any other status is preserved

Note that even though the job completes, some of the file handles might still be RESTORING. For each file handle in input we will include the result in the output of the job with a restore_status: FORBIDDEN, NOT_FOUND, NO_ACTION, RESTORED, RESTORING. We do not wait for the actual restore of the archive tier files, because we would either need to refactor the asynchronous job machinery to support migratable jobs or add dedicated custom code. Hopefully we will never have to use this job.
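
A minimal sketch of how the restore job could decide the outcome for a single id. The function name and signature are hypothetical, and the mapping of each branch to a restore_status value is an assumption, since the text lists the values without spelling out when each one applies. The `archive_status` argument stands for the archive status read from the S3 object (`None` when the object is not in an archive tier).

```python
from typing import Optional, Tuple


def restore_outcome(
    status: Optional[str],
    authorized: bool,
    archive_status: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Return (new file handle status, restore_status) for one requested id.

    `status` is None when the file handle does not exist.
    """
    if status is None:
        return None, "NOT_FOUND"
    if not authorized:
        # The caller is neither the creator nor an admin.
        return status, "FORBIDDEN"
    if status == "UNLINKED":
        # The object was never archived, flipping the status is enough.
        return "AVAILABLE", "RESTORED"
    if status == "ARCHIVED":
        if archive_status is None:
            # Still in INT-Standard or INT-IA: no S3 restore is needed.
            return "AVAILABLE", "RESTORED"
        # In an archive tier: an asynchronous S3 restore must be requested.
        return "RESTORING", "RESTORING"
    # Any other status (e.g. AVAILABLE or RESTORING) is preserved.
    return status, "NO_ACTION"
```

In the real job the status change would also be applied to all the file handles sharing the same key, and the synapse-status tag would be cleared once the object is available again.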

Note that when we flag a file handle as UNLINKED we need to make sure we only do so if its status is AVAILABLE and its current updatedOn is more than 30 days old; otherwise we risk flagging restored objects as UNLINKED again.

Open questions

  • Why can we make “copies” of file handles in the first place? Maybe we should not allow it in S3 or GC buckets? Who is using this? If we get rid of this does it make it simpler? → ✅ This seems to be used heavily by scientists (from DW data); it is unrealistic to get rid of it and/or deduplicate them.

  • I have no idea when the last access date is updated for the INT class, nor do I have a way to test it quickly: does tagging the object reset it and move the object back to INT-Standard? If so we might have a weird lifecycle where an uploaded object goes: INT-Standard → (Day 30) INT-IA → UNLINKED → (Day 60, tagging) INT-Standard → (Day 90) INT-IA → (Day 150) INT-Archive → (Day 240) INT-DeepArchive. Basically we might have a month where we go back to paying for INT-Standard; in the long run the cost would be amortized since objects will eventually move to the archive tier. → ✅ I verified that tagging does NOT in fact push the object back to INT-Standard. In my own bucket I have objects that are in INT-IA; I tagged those objects from the console (therefore fetching the tags as well) and the metrics reported after a day showed no change in the count of objects in INT-IA. This is great news as the lifecycle is now clear and more cost effective: INT-Standard → (Day 30) INT-IA → UNLINKED → (Day 60, Tagging) INT-IA → (Day 90) INT-Archive → (Day 180) INT-DeepArchive.

  • Should we delete the copies of the file handles if at least one is AVAILABLE? If so, if all copies are UNLINKED should we just keep one around? Which one? → ✅ No, copies effectively make the ownership a 1 to N relationship

  • It is really not clear whether lifecycle transitions generate S3 notifications, for example if we wanted to get a notification when a temporary object is automatically expired so that we can remove the file handle record. It looks like this was not possible a few years ago, but the reference in the documentation has disappeared; we will have to test this or contact support.

  • I didn’t even start thinking about GC or other types of file handles. The vast majority of unlinked data is in prod though.

...