Synapse S3 Storage Maintenance


JIRA: https://sagebionetworks.jira.com/browse/PLFM-510 (description outdated, left for reference).

Related JIRAs:

Related Previous Work:

Introduction

Synapse by default stores most of its data in an encrypted AWS S3 bucket. Additionally, external buckets owned by users can be linked to projects and folders so that data can be stored outside the main bucket and its associated costs can be billed separately. Note that Synapse supports different types of storage, such as a bucket provisioned in Google Cloud, and can link data that is stored elsewhere as long as it is dereferenceable through a URL. The scope of this document is limited to S3 storage, as it is the main type of storage currently adopted.

Whenever a file is uploaded through Synapse, a reference to the file is maintained in an index, which is a table in an RDS MySQL instance. We refer to a record in this index table as a File Handle. A file handle is simply a pointer to where the physical data is stored and does not provide context about where the data is actually used; for this reason there is no explicit access model and instead the user that created the file handle “owns” it (we only allow the owner to delete the file handle or access it directly). Additionally, from a user perspective a file handle is immutable and its metadata cannot be updated; this simplifies the internal handling of files in Synapse and provides a generic abstraction over file management.

Once a file handle is created it can be referenced directly in multiple places, including file entities, records in a Synapse table, user and team profile pictures, wiki attachments, user messages, etc. The complete list of reference types is maintained in what we call the file handle associate type:

https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html.

Through its associations a file can be shared with users and teams and downloaded, usually through a pre-signed URL.

From a maintenance perspective there are two main categories of data that could lead to potential unwanted costs:

  • Unlinked Data
    The current architecture has the advantage that the infrastructure used to manage file uploads is unified across the whole system, but has the drawback that it separates the indexing of the data from the usage of the data, leading to data that can potentially be un-linked and effectively unused but for which we would still pay the associated storage costs.

  • Unindexed Data
    Another category of data in our bucket worth mentioning is data that is in the S3 bucket but not indexed in Synapse (e.g. no file handle exists for a key); this is mostly a concern for the default bucket used by Synapse (as external buckets are managed by other users).

Ideally the system would be designed in a way that the amount of unlinked and unindexed data is kept as low as possible and could self-heal from potential abuse.

Unlinked Data

There are various scenarios when this might happen:

  • Updates: for example when a user updates their profile picture or when a version of a file entity is updated with a new file handle. Another example is records in a Synapse table that point to a file handle: a table might be updated to use different file handles without deleting the old ones.

  • Deletions: for example when a project is deleted and then purged from the trashcan, all the entities in the project are deleted, but the file handles are left intact and the data remains in the bucket.

  • No association: a file could be uploaded and a file handle created but never linked to anything, making the data accessible only to the user that created it.

  • Data migration: a recent use case might lead to several terabytes of un-linked data, when users want to move their data to an external bucket: users might download from one bucket and re-upload to another bucket (or copy the data over to other buckets using the recently developed APIs), updating the file handles in the entities and leaving behind a trail of file handles and data that is not used.

There are various ways that we could tackle this problem:

  1. We could introduce an explicit link at the time the data is uploaded and maintain a consistent index enforcing a one-to-one relationship: this is hard to implement in practice on top of the current architecture, since we do not know at upload time where the data will be linked (we could generate expiring tokens for uploads to be used when linking), and it would be a breaking change that would probably take years to introduce. Additionally, we found several million file handles already shared between different associations.

  2. We could maintain a single index with all the associations and try to keep it consistent: each time a link is established a record is added with the type of association, and when an association is broken the index is updated. When the last link is removed the file handle can be flagged as potentially un-linked and archived after a certain amount of time unless it is linked back. This would require keeping the index up to date (potentially eventually consistent), but it does not work well with deletes, as it might be extremely complex to know when an association is broken and communicate the change back to the index: e.g. when a folder is deleted all the files under the folder might be deleted (even though technically we already traverse the hierarchy when purging the trashcan). It also brings overhead for each type of association (the handling needs to be done in each place where file handles are used). The advantage of this approach is that if an association is broken and the link is not removed, the worst that can happen is that the data is kept (so no harm). Additionally, there are fewer moving parts and points of failure (compared to the next solution). At the same time we would have the migration problem: this would turn into a big migratable table, but we could technically store it in a separate database that does not migrate; a potential candidate that would scale well is DynamoDB, using the file handle id as the PK and the association as the SK (see the sketch after this list), although there seem to be limitations in how data can be deleted.

  3. Another solution is to periodically scan all the file handle associations and ask which ids are linked; with this information we can build an “index” that can be queried to identify the file handles that are un-linked and archive them for deletion, with the possibility to restore them within a given period of time. This approach has the advantage that its implementation has a lower impact on the rest of the system and is isolated to a specific task, potentially reducing the risk of wrong deletions. It is not immune to mistakes, since a developer could still make a mistake when scanning an association, but we can design it so that we are alerted when mistakes happen before the actual archival/deletion occurs.
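Not the chosen approach, but for reference a minimal sketch of how the index in option 2 could be keyed in DynamoDB, assuming the AWS SDK for Java v1; the table name and attribute names are illustrative, not an actual Synapse schema:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.BillingMode;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

public class FileHandleLinkIndex {

    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        // One item per (file handle, association) link: the PK is the file handle id,
        // the SK encodes the association (e.g. "FileEntity#syn123.3").
        client.createTable(new CreateTableRequest()
            .withTableName("FILE_HANDLE_LINKS")
            .withKeySchema(
                new KeySchemaElement("fileHandleId", KeyType.HASH),
                new KeySchemaElement("association", KeyType.RANGE))
            .withAttributeDefinitions(
                new AttributeDefinition("fileHandleId", ScalarAttributeType.N),
                new AttributeDefinition("association", ScalarAttributeType.S))
            .withBillingMode(BillingMode.PAY_PER_REQUEST));
    }
}
```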

In the following we propose a design that revolves around this last solution. There are three main phases that take place:

  1. A Discovery phase, where the associations are scanned so that links to file handles can be recorded

  2. A Detection phase, where with the information from the previous phase we can establish which file handles are not linked

  3. An Archival phase, where the file handles that are deemed un-linked are placed into an archive state and will eventually be deleted

File Handle Associations Discovery

The first step is to discover the existing associations to file handles. In the backend we already maintain a set of references, generally using dedicated tables with foreign keys back to the file handles table that keep track of the current associations. This goes back to the https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/FileHandleAssociateType.html and, theoretically, we keep track of all the types of associations we have for file handles in order to handle access permissions (e.g. who can download what through which link). This is a bit more complicated in practice: sometimes we do not have a foreign key, but rather a field in a serialized form in the associated record (e.g. profile pictures), or even in a separate database (e.g. file references in Synapse tables are maintained as dedicated tables, one for each Synapse table).

In the following table we provide the list of known associations to file handles along with a description of how the association is stored:

FileEntity

  • Table: JDOREVISION
  • Foreign Key (ON DELETE CASCADE): FILE_HANDLE_ID (RESTRICT)
  • Description: Each file entity revision has a FK back to the referenced file handle; a single file entity can be linked to multiple file handles through its revisions.
  • Current Size: ~11M
  • Unlinking: The association can be broken in several ways: the entity is deleted and purged from the trashcan; a revision is deleted; the file handle id is updated. Note that even if a revision is deleted or a file handle is changed, other revisions might still refer to the file handle.

TableEntity

  • Table: Each table entity has an associated table with its file handles; these tables are not migratable and not consistent. The data is also stored in the various transactions used to build the tables, kept in a dedicated S3 bucket.
  • Foreign Key (ON DELETE CASCADE): No, the tables are stored in a separate DB
  • Description: Each table might reference multiple file handles: when a table is built each transaction is processed and if a file handle is in the transaction it is added to a dedicated table, one for each Synapse table. Unfortunately this table is not migratable and is rebuilt every week. We keep a migratable table with all the table transactions, and the table row changes are packaged in a zip file and stored in a dedicated S3 bucket.
  • Current Size: ~36M, distributed across ~10K tables (~1.3K non empty)
  • Unlinking: The association can be broken only when the table is deleted (removed from the trashcan).

WikiAttachment

  • Tables: V2_WIKI_ATTACHMENT_RESERVATION, V2_WIKI_MARKDOWN
  • Foreign Key (ON DELETE CASCADE): FILE_HANDLE_ID (RESTRICT) in V2_WIKI_ATTACHMENT_RESERVATION; no FK in V2_WIKI_MARKDOWN, where the attachments are contained in the ATTACHMENT_ID_LIST blob
  • Description: The attachments to a wiki page; the table includes both the file handles storing the wiki page and its attachments. The list of attachments is also stored in the V2_WIKI_MARKDOWN table in a blob with the list of ids.
  • Current Size: ~1M
  • Unlinking: The association might be broken in several ways: a wiki page is deleted; the attachments are updated.

WikiMarkdown

  • Table: V2_WIKI_MARKDOWN
  • Foreign Key (ON DELETE CASCADE): FILE_HANDLE_ID (RESTRICT)
  • Description: The markdown of a wiki page.
  • Current Size: ~770K
  • Unlinking: See above.

UserProfileAttachment

  • Table: JDOUSERPROFILE
  • Foreign Key (ON DELETE CASCADE): PICTURE_ID (SET NULL)
  • Description: The user profile image.
  • Current Size: ~60K
  • Unlinking: The association can be broken when the profile image is changed.

TeamAttachment

  • Table: TEAM
  • Foreign Key (ON DELETE CASCADE): No, contained in the PROPERTIES blob that stores a serialized version of the team object (icon property)
  • Description: The team picture.
  • Current Size: ~4.5K
  • Unlinking: The association can be broken when the team picture is changed.

MessageAttachment

  • Table: MESSAGE_CONTENT
  • Foreign Key (ON DELETE CASCADE): FILE_HANDLE_ID (RESTRICT - NO ACTION)
  • Description: The content of messages to users.
  • Current Size: ~460K
  • Unlinking: The association can be broken if the message is deleted (only admins).

SubmissionAttachment

  • Table: JDOSUBMISSION_FILE
  • Foreign Key (ON DELETE CASCADE): FILE_HANDLE_ID (RESTRICT - NO ACTION)
  • Description: The file handles that are part of an evaluation submission, in particular the file handles associated with a file entity that is part of a submission (e.g. all the versions or a specific version).
  • Current Size: ~110K
  • Unlinking: The association can be broken when the submission is deleted or when the evaluation is deleted.

VerificationSubmission

  • Table: VERIFICATION_FILE
  • Foreign Key (ON DELETE CASCADE): FILE_HANDLE_ID (RESTRICT)
  • Description: The files that are submitted as part of the user verification. Note that when a user is approved or rejected the association is removed.
  • Current Size: <10
  • Unlinking: The association is broken when the submission is approved or rejected.

AccessRequirementAttachment

  • Table: ACCESS_REQUIREMENT_REVISION
  • Foreign Key (ON DELETE CASCADE): No, a file handle might be contained in the SERIALIZED_ENTITY blob that stores a managed access requirement (the ducTemplateFileHandleId property)
  • Description: A managed access requirement might have a file handle pointing to a DUC template.
  • Current Size: ~5K
  • Unlinking: The association is broken when the access requirement is deleted or updated with a new file handle.

DataAccessRequestAttachment

  • Table: DATA_ACCESS_REQUEST
  • Foreign Key (ON DELETE CASCADE): No, various file handles are referenced in the REQUEST_SERIALIZED blob that stores a serialized version of the access request
  • Description: A data access request might have multiple files attached for the approval phase (e.g. DUC, IRB approval and other attachments).
  • Current Size: ~2K
  • Unlinking: The association is broken when the request is updated with different file handles.

DataAccessSubmissionAttachment

  • Table: DATA_ACCESS_SUBMISSION
  • Foreign Key (ON DELETE CASCADE): No, various file handles are referenced in the SUBMISSION_SERIALIZED blob that stores a serialized version of the submission
  • Description: Same as above, but for the actual submission.
  • Current Size: ~3K
  • Unlinking: Never?

FormData

  • Table: FORM_DATA
  • Foreign Key (ON DELETE CASCADE): FILE_HANDLE_ID (RESTRICT)
  • Description: The data of a form.
  • Current Size: ~300
  • Unlinking: When the form is deleted.

Note: An interesting observation (also from the S3 Bucket Analysis) is that most of the data is referenced by file entities (as expected). Most of our concerns could revolve around un-linking entities rather than other objects. For this reason it might actually be worth implementing option 2 above instead: we could simply register the links when we encounter them and only take care of un-registering table and entity links for cleanup.

The idea would be to periodically scan over all the associations and record the last time a file handle was “seen” (e.g. as associated). In this way we can build an index that can be queried to fetch the last time a file handle association was seen so that file handles can be flagged as un-linked.

There are three aspects that require consideration for this process:

  1. How to perform the scan

  2. Where to store the result

  3. When to perform the scan

Scanning

As we can see from the table above, the way the associations are stored is not consistent across types: some types use dedicated tables, others use a column referring to the file handle id and others store the information embedded in a serialized field. Additionally, the size distribution is widely uneven: we have tables with a few thousand associations and tables with millions of records.

The simplest approach would be to run a driver job that, for each type of association, sends an SQS message to start a sub-job that scans all the associations of that type. Each sub-job performs a scan reading the data in batches. This approach is problematic because, while in most cases each job might take a few seconds, in some cases it might take several hours (e.g. tables). If a worker goes down while scanning and the job is put back in the queue, the job would have to start from scratch.

We can instead split each association type into independent sub-jobs that run in parallel and each scan only a batch of data (e.g. dedicating a small fleet of workers to this task), similar to what happens when we migrate (where we scan in batches of 100K records). One idea is to divide the scan for a given association type into partitions of a given size: for example, given an id used in the scanned table and its min and max values, we can divide the work into sub-jobs over potentially evenly distributed batches. Each batch can be processed by a dedicated worker. This approach can probably be generalized for most of the association types above, providing a default implementation with some variations (e.g. some objects will need to be de-serialized). The tricky part is dividing the scan into partitions: in some cases this can be approximated using a unique id of the table being scanned, in other cases (e.g. nodes) we could partition using the file handle id itself.
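A minimal sketch of the id-range partitioning described above (the class and method names are illustrative, not the actual backend implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class AssociationScanPartitioner {

    /**
     * Splits the [minId, maxId] range of the table being scanned into
     * sub-ranges of at most batchSize ids, one per SQS-driven sub-job.
     */
    public static List<long[]> partition(long minId, long maxId, long batchSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += batchSize) {
            ranges.add(new long[] { start, Math.min(start + batchSize - 1, maxId) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // E.g. ids 1..1_000_000 scanned in batches of 100K -> 10 sub-jobs
        partition(1, 1_000_000, 100_000)
            .forEach(r -> System.out.println("Scan ids " + r[0] + " to " + r[1]));
    }
}
```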

The idea is to rely on the fault tolerance of both SQS and the worker architecture: if each batch is small enough and a recoverable failure happens, the worker can restart the same batch, since all the information it needs is contained in the SQS message driving the worker.

Synapse tables problem

The most problematic part is scanning the associations in tables:

  1. We have several thousand tables, each one with its own file association table

  2. Tables are rebuilt every week, as they are not migratable, and in some cases they might actually fail to build altogether, rendering the file association table useless.

Each table is built using “transactions” which contain the set of changes to the table; we store the history of transactions in a migratable table. If a transaction succeeds we store the change set in S3 as a compressed serialized object. When a table is built it goes through the history of transactions, reading each change set from S3 and applying it to the table. Each table might have several thousand changes applied to it. Each change set might contain a file handle association, and this is when we record the file handle in the table file association table.

One way we could approach scanning the associations for tables is to use the TABLE_ROW_CHANGE table that contains the history of change sets for each table: we can read the changes directly from this table and use it to create sub-jobs that each process a batch of changes. Each worker reads its batch of change sets from S3 and looks for file handles in the change sets themselves.
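A rough sketch of such a worker, assuming the AWS SDK for Java v1; TableRowChange, FileHandleCollector and the change set parsing are placeholders for the corresponding backend abstractions, not the actual Synapse classes:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

import java.io.InputStream;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class TableChangeScanner {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    /**
     * Processes one batch of rows from TABLE_ROW_CHANGE: each row points to a
     * change set stored in S3; the worker downloads it and records the file
     * handle ids that the change set references.
     */
    public void scanBatch(List<TableRowChange> batch, FileHandleCollector collector) {
        for (TableRowChange change : batch) {
            try (S3Object object = s3.getObject(change.getBucket(), change.getKey())) {
                // The change set is a compressed serialized object; extracting the
                // file handle ids depends on the internal format (placeholder here).
                Set<Long> fileHandleIds = extractFileHandleIds(object.getObjectContent());
                fileHandleIds.forEach(id -> collector.record("TableEntity", id));
            } catch (Exception e) {
                // Let the message go back to the SQS queue so the batch is retried
                throw new RuntimeException("Could not process change set " + change.getKey(), e);
            }
        }
    }

    private Set<Long> extractFileHandleIds(InputStream changeSet) {
        return Collections.emptySet(); // placeholder for the real de-serialization
    }

    // Placeholders for the backend abstractions
    public interface TableRowChange { String getBucket(); String getKey(); }
    public interface FileHandleCollector { void record(String associationType, Long fileHandleId); }
}
```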

Note that the file handles in this case are freed only when the table is deleted and the changes are dropped. This is unavoidable at the moment due to how the tables are built.

An alternative that would make the scanning much faster is to keep a single migratable table with the file handle associations that is populated when the changes are processed; this table would need to be back-filled with previous data. We can keep this as an optimization if the first approach does not work well.

Storage

We have various options to store this information so that it can be used later in the process.

  1. We could store the “linked” status in the file handle table itself along with the timestamp: this would be ideal, as we would have to work with only one table and queries would be relatively efficient. Unfortunately this is not possible, as the file handle table is our biggest table, it is migrated every week, and updating the file handles in such a way would most likely lead to unsustainable delays in the migration process. An option would be to move this table to a different “non-migratable” DB, maybe building a dedicated service over file handles, but this is a huge task that would require quite a lot of refactoring (e.g. we rely on joining this table).

  2. We could store the last time a file handle was linked in a companion migratable table and use it in a similar fashion: this has a similar limitation, as it would probably lead to delays in migration since the table would grow substantially over time.

  3. We can store this information in another type of external storage.

At the moment option 3 seems the most viable: we can leverage existing technology to store this information as an append-only log directly in S3 and perform queries using Athena:

  1. We can set up a Kinesis Firehose stream that delivers the data to a sort of append-only log in S3: each time a file handle is scanned we send an event to the stream, which can be set up to target a dedicated S3 bucket and transform the data into a columnar format (e.g. Apache Parquet) before storage. The records contain the triple: object type, file handle id, timestamp (see the sketch after this list)

  2. We can set up a Glue table over this S3 destination so that the data can be queried with Athena

  3. The bucket can be set up so that old data is automatically removed (e.g. after 60 days)
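A minimal sketch of step 1, assuming the AWS SDK for Java v1; the stream name and the JSON layout of the record are illustrative:

```java
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

public class FileHandleScanRecordSender {

    private final AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
    // Hypothetical delivery stream backed by the S3 append-only log bucket
    private static final String STREAM_NAME = "fileHandleAssociationsStream";

    /** Sends one (association type, file handle id, timestamp) triple to the stream. */
    public void recordAssociation(String associationType, long fileHandleId) {
        String json = String.format("{\"associateType\":\"%s\",\"fileHandleId\":%d,\"timestamp\":\"%s\"}%n",
            associationType, fileHandleId, Instant.now());
        firehose.putRecord(new PutRecordRequest()
            .withDeliveryStreamName(STREAM_NAME)
            .withRecord(new Record().withData(ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8)))));
    }
}
```

In practice the records would be batched (e.g. with putRecordBatch) rather than sent one at a time.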

During the S3 bucket analysis we used Athena heavily to query data in S3 and the results are very promising: joining mid-size tables directly in S3 (~50M records) is extremely fast (less than 20 seconds) and the results can be stored in S3 directly as target tables. It also has the advantage that once the data is in S3, all the computation can be done externally from a dedicated job (e.g. the data can be joined in S3 to find out the un-linked file handles and stored as a dedicated table that can be queried separately).

Timing

An important aspect to consider is that scanning the associations might take a long time. Even if we do it in parallel, we need to limit the amount of data scanned at a given time to avoid overloading the DB. We can dedicate a small number of workers (e.g. 10 workers) to process the batches of data; the data is then sent to a Kinesis stream that will eventually deliver it to S3.

This process is not deterministic and failures can occur during the scanning (e.g. a worker goes down, network connectivity is lost, etc.). We would still need to have an idea of the scanning progress, or at least when it started and more or less when it completed. For example, we do not want to start several scans of all the associations in parallel; ideally we would scan once in a while (e.g. once a week) and use the start time of the job to filter the file handles considered for further processing. Before querying the log we would need to know more or less when the last scan finished:

For this we can monitor the stream to check if records are still being processed: if the stream has been empty for more than 24 hours, for example, we can consider the running job as “finished”. We can also monitor the queue dedicated to the workers and check if errors happen during the process (e.g. messages end up in a dead letter queue). In this case we maintain a table with the scanning jobs and ensure that only one is running at a given time.

Another aspect to consider is how to start a scan:

  1. Automatically, according to the status of the current job

  2. Externally, using a Jenkins job or some other trigger (e.g. a CloudWatch event with a lambda) that periodically starts the scan (e.g. every week)

Option 1 would be ideal, as the system would be self-contained, but at the moment option 2 is preferred to avoid the usual prod vs staging problem.

Un-linked File Handle Detection

Once one or more scans are performed we end up with a log of the last time each file handle was seen as linked. This data lives in S3 and can be queried with Athena. We can now use this information to check if any file handle has not been linked for more than a given amount of time, assuming that a scan was performed recently. Note that we only consider file handles that were last modified before a certain amount of time, e.g. more than 30 days ago; this provides a window for scanning the associations multiple times.

Since the file handles are in the main DB we need a way to get this information, and we could simply run several Athena queries, each with a batch of ids. This is not ideal, as it is a slow process with several limitations: Athena queries are limited in size (e.g. we can probably ask for a batch of 10K ids), there is a limit on the number of concurrent queries that can be performed (e.g. 20/s) and handling the results is tricky (e.g. queries are asynchronous). Where Athena shines is when a single query is performed on a big dataset.

We can instead periodically export the file handle table to S3 and join the data directly using Athena: we can reuse the current infrastructure that replays the change messages on file handles and keep exporting the file handle data to a dedicated Glue table using Kinesis.

The process is as follows (note: the data resides in a dedicated bucket that automatically deletes objects after 30 days):

  • The file scanner continuously exports associations (currently all the associations are scanned every 6 days, both in prod and staging), using Kinesis and a Glue table to convert to Parquet

  • A worker exports the file handle data, including id and updatedOn, using Kinesis and a Glue table to convert to Parquet

  • Once a month (the 1st Monday of the month, at night) we run an Athena query that joins the two Glue tables to discover all the file handles that do not have an association (a left join with a null check, see the query sketch after this list); the id of the query is sent to the backend so that the results can be processed

  • A worker processes the Athena query results to flag the file handles as UNLINKED (only handles that are in the AVAILABLE status)
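A sketch of the kind of Athena query referenced above, wrapped in the AWS SDK for Java v1 call that starts it; the Glue table and column names are assumptions, not the actual schema:

```java
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;

public class UnlinkedFileHandleQuery {

    // Left join of the exported file handles with the scanned associations:
    // a null association means the file handle was never seen as linked.
    private static final String QUERY =
        "SELECT f.id "
        + "FROM file_handles f "
        + "LEFT JOIN file_handle_associations a ON f.id = a.file_handle_id "
        + "WHERE a.file_handle_id IS NULL "
        + "AND f.updated_on < date_add('day', -30, now())";

    /** Starts the query and returns its execution id so the backend can poll for the results. */
    public static String start(AmazonAthena athena, String database, String outputLocation) {
        return athena.startQueryExecution(new StartQueryExecutionRequest()
                .withQueryString(QUERY)
                .withQueryExecutionContext(new QueryExecutionContext().withDatabase(database))
                .withResultConfiguration(new ResultConfiguration().withOutputLocation(outputLocation)))
            .getQueryExecutionId();
    }
}
```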

Alternatives Considered:
  1. When we run the scanner we update a timestamp of the file handle directly in the DB, and a worker collects the file handles that have not been updated in more than 30 days and flags them as unlinked: This was the original option. It was dismissed because of the risk of thrashing migration.

  2. Keep a consistent list of links (e.g. add/remove links according to the objects that work on file handles): We dismissed it because of the complexity of maintaining a reference count index; additionally we would duplicate data in the DB and migration would suffer from it.

  3. Use of S3 infrastructure and object tags: The idea was to use object tags and a lifecycle configuration, basically tagging objects when they were scanned. We dismissed this because updating a tag does not update the modification date of the object, which is what S3 bucket lifecycles are based on, so we would have had to come up with a complex mechanism to update the lifecycle configuration periodically. Additionally the tagging operation costs would add up over time, especially if scans are done frequently on all the objects.

See additional notes at

Un-linked File Handle Archival

Deleting user data is a tricky business, especially if we cannot be 100% sure that the data is not used. Instead we propose an approach that goes in stages: first the un-linked data is detected but left accessible, and only after a certain amount of time do we start archiving it and eventually delete it, with the option to restore it before it is finally deleted.

We considered various options for archiving the unlinked data:

  1. Archive bucket + lifecycle
    The unlinked file handles can be stored in a dedicated bucket, starting with the S3 Standard - Infrequent Access storage class. A lifecycle configuration in the archive bucket can be used to move objects to cheaper storage classes a certain amount of time after their upload and eventually delete them. The process would be as follows:

    • Periodically collect a batch of eligible UNLINKED file handles for which no copy that is still AVAILABLE exists and send a message to a dedicated SQS queue to archive the key

    • A worker uses the transfer manager to copy the object (if bigger than 128KB, to avoid paying the minimum size overhead of the infrequent access class) to the destination bucket; the file handle status is updated accordingly

    • Through S3 notifications, when an object is deleted by the lifecycle we can remove the corresponding file handle

    Restoring an archived file handle in this case would involve checking the storage class, performing a restore request if the object is in an archive tier, processing the S3 notification when the object is restored and moving the object back to its original location.

  2. Intelligent Tiering + tagging
    Instead of using the infrequent access storage class, we can use the new intelligent tiering class (INT). This special class automatically enables monitoring of the objects for access and moves them automatically between standard and infrequent access according to access patterns. If an object is not accessed for 30 days it is moved automatically to INT-infrequent access. If the object is accessed again it is automatically moved back to INT-standard.
    Additionally, a new feature allows enabling a specific lifecycle configuration for the INT storage class that moves objects that have not been accessed for more than X days (min 90) to an archive tier and after Y days (min 180) to a deep archive tier. The archive tiers require a restore operation to access the objects, which moves them back to the standard class. Additionally, this particular lifecycle can be set up with a rule that uses object tags, so we can enable it only for UNLINKED data.

    The idea would be the following:

    • Move everything in our bucket to INT (we can use a lifecycle configuration for this); objects that are smaller than 128KB are not moved (not cost effective)

    • Any object uploaded to our bucket is automatically uploaded as INT (if > 128KB); this can be part of the storage location metadata

    • Set up a lifecycle configuration for INT so that objects tagged with some special value (e.g. synapse-status=UNLINKED) are moved to the archive tier after 90 days and to the deep archive tier after 180 days

    • Set up a worker that, as in option 1, collects a batch of eligible UNLINKED file handles, tags them appropriately and sets their status to ARCHIVED

    • For deletions we can have a worker that processes the archived data older than 1 year. Given the recent development of the INT class (e.g. support for the archive tier) I expect AWS to implement this for us (e.g. add to the lifecycle the option to delete objects, similar to the normal lifecycles); we can wait 1 year before implementing this and see if Amazon delivers.

    Restoring an archived file handle in this case would involve checking the storage class, performing a restore request if the object is in an archive tier and processing the S3 notification when the object is restored (no move is involved; there is no temporary object for archived data in this case as it is moved automatically back to the standard class).

There are several advantages of option 2 over option 1: if our objects are all in INT we should see additional cost savings not only for UNLINKED data but also for data that is linked but not accessed, while keeping the same durability and performance. The fact that INT works on the access date rather than the creation date is a good fit for our use case, where we do not know the access patterns of our data. Additionally, even though the INT class moves objects to infrequent storage, it does not have additional access/retrieval fees, while the infrequent access and archive classes such as Glacier can get expensive if we have to access the objects. The biggest advantage is that S3 moves the files around for us according to access date and tagging; we only have to tag the objects that we want to archive. The restore operation is also simpler.
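A minimal sketch of the tagging step of option 2, assuming the AWS SDK for Java v1; the tag key and value are the illustrative ones used above (synapse-status=UNLINKED):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectTagging;
import com.amazonaws.services.s3.model.SetObjectTaggingRequest;
import com.amazonaws.services.s3.model.Tag;

import java.util.Collections;

public class UnlinkedObjectTagger {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    /**
     * Tags the object backing an eligible UNLINKED file handle so that the
     * INT lifecycle configuration (filtered on this tag) moves it to the
     * archive tiers. Note that this call replaces the whole tag set of the object.
     */
    public void tagAsUnlinked(String bucket, String key) {
        s3.setObjectTagging(new SetObjectTaggingRequest(bucket, key,
            new ObjectTagging(Collections.singletonList(new Tag("synapse-status", "UNLINKED")))));
    }
}
```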

Cost Analysis

In order to determine which option is best and whether it is worth deleting data, we collected statistics about the unlinked data and estimated the amount of data that is actually accessed (hot data) so that we can compare the various options. The following data was collected for the proddata.sagebase.org bucket:

  • Total Size: 748,386,690,923,161 (680.7 TiB)

  • Number of objects: 44,509,789

  • Hot Data Count* : 9,802,820

  • Hot Data Size*: 314,611,245,904,518 (286.1 TiB)

  • Hot Data (>= 128KB) Count: 4,555,052

  • Hot Data (>= 128KB) Size: 314,472,063,061,465 (286 TiB)

  • Unlinked Data Count: 8,762,805

  • Unlinked Data Size: 100,456,586,079,288 (91.36 TiB)

  • Unlinked Data Count (>= 128 KB): 2,823,189

  • Unlinked Data Size (>= 128 KB): 100,408,552,794,768 (91.32 TiB)

  • Monthly Unlinked Count (>= 128KB): 28,808

  • Monthly Unlinked Size (>= 128KB): 1,024,577,069,334 (0.93 TiB)

* We considered the file handles downloaded from tables and/or entities from 2020 till now. This does not include other types of downloads, so we round up to 300TiB of hot data for the estimates.

We estimated the monthly cost using the S3 pricing calculator, rounding the total amount of data to 680 TiB (T), the hot data to 300 TiB (H) and the unlinked data to 91 TiB (U):

  1. Leave the data in STD: $15,185

  2. Move current unlinked (>=128 KB) to infrequent Access*: $14,394

    1. Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in IA

    2. Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)

  3. Eventually move unlinked data to Glacier deep archive*: $13,323

    1. Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in Glacier Deep Archive

    2. Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)

    3. Consider the avg of 28,808/month for lifecycle transitions to Glacier and Glacier Deep Archive

  4. Move everything to INT**: $11,775

    1. Consider 44% in INT-Standard (300H/680T)

    2. Consider 56% in INT-IA ((680T-300H)/680T)

    3. Includes the 44M objects monitoring fee

  5. Move to INT + unlinked eventually in INT-deep archive**: $10,734

    1. Assumes that unlinked data is part of the cold data

    2. Consider 44% in INT-Standard (300H/680T)

    3. Consider 43% in INT-IA ((680T - 300H - 91U)/680T)

    4. Consider 13% in INT-Deep Archive (91U/680T)

    5. Includes the 44M objects monitoring fee

    6. Assumes that we have 28,808 tags per month (GET/PUT request + tag costs)

  6. Move to INT + delete unlinked data**: $10,544

    1. Assumes that unlinked data is part of the cold data

    2. Consider 50% in INT-Standard (300H/(680T - 91U))

    3. Consider 50% in INT-IA

    4. Includes the 44M - 8M objects monitoring fee, no tags, no lifecycle transitions

It is clear that just moving everything to INT is cost effective for our use case, where access patterns are unknown. On top of that, archiving unlinked data might be worth it even though the cost savings are not comparable to just using the INT storage class: we have around 1TiB of data on average that gets unlinked each month.

Of note is that part of the unlinked data consists of temporary file handles that Synapse creates internally (e.g. table query results, table CSVs for uploads, etc.); the fact that each month we have such a high number (30K) of unlinked file handles might be attributed to this. Unfortunately we do not tag this kind of data, so we do not know how much of that 1TiB is actually temporary file handles.

To solve this we should upload this data either with a special object tag that expires the object through a lifecycle rule, or into a dedicated bucket (e.g. another storage location) with a dedicated lifecycle that expires everything after 30 days. These temporary objects could be flagged in the database with a special status (e.g. TEMPORARY) so that they are not collected as UNLINKED but rather removed either through S3 notifications (see the open questions below) or through a worker.
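A minimal sketch of such an expiration rule, assuming the AWS SDK for Java v1; the tag key/value and the 30 day window are illustrative:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.Tag;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
import com.amazonaws.services.s3.model.lifecycle.LifecycleTagPredicate;

public class TemporaryDataLifecycle {

    public static void apply(AmazonS3 s3, String bucket) {
        // Expire objects tagged as temporary (e.g. table query results, CSV uploads)
        // 30 days after creation.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
            .withId("expire-temporary-objects")
            .withFilter(new LifecycleFilter(new LifecycleTagPredicate(new Tag("synapse-temporary", "true"))))
            .withExpirationInDays(30)
            .withStatus(BucketLifecycleConfiguration.ENABLED);
        s3.setBucketLifecycleConfiguration(bucket, new BucketLifecycleConfiguration().withRules(rule));
    }
}
```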

Another clear point that surfaced is that deleting the unlinked data is not worth the effort: if we move it to the cheapest archive tier it is actually worth keeping the file handles around in case we need to recover them.

Open questions:

  • Even though we flag file handles as UNLINKED, we allow file handles to be copied and we have in the database multiple file handles that point to the same key. In some cases some of the copies might be flagged as UNLINKED while one or more copies of the same key might still be linked. This poses a technical challenge: we cannot archive the data since the key is still linked through another copy, but the UNLINKED record stays in the database, and if we do not change its status or date we will keep encountering these file handles unless all the links are eventually removed, for example:
    F1 (k1, AVAILABLE)
    F2 (k1, UNLINKED)
    F3 (k1, UNLINKED)
    F4 (k2, UNLINKED)
    If we collect a batch of file handles that are unlinked and we limit the batch to a size of 2, we will fetch F2 and F3. They both point to the same key but cannot be archived because F1, which references the same key, is still linked. Should we in this case delete F2 and F3 (so that in the next run we can process F4)? If we do not delete them we could update their timestamp, but then we risk updating them forever; maybe use a special status? Why would we want to keep those around anyway?

  • It is really not clear if lifecycle transitions generate S3 notifications, for example if we wanted to get a notification when a temporary object is automatically expired so that we can remove the file handle record. It looks like this was not possible a few years ago, but the reference in the documentation has disappeared; we will have to test this or contact support.

Unindexed Data

From the S3 Bucket Analysis it turns out that we have around 7.5TB of data in S3 for which there is no file handle.

This might happen because of various reasons:

  • Unfinished multipart uploads: We currently store parts in temporary objects in S3; the backend later copies the parts over to the multipart upload and deletes the temporary parts when the multipart is completed, so if the multipart upload is never finished the parts are never deleted. Additionally S3 keeps the multipart upload data “hidden” until the multipart is finished or aborted. As of November 10th we have 1,417,823 uncompleted multipart uploads, of which 1,414,593 were initiated before October 11th (see S3 Bucket Analysis). A rough estimate of the currently unfinished multipart parts that were uploaded accounts for 2.4TB of data.

  • Data from staging: Data created in staging is removed after migrations, but of course the data in S3 is left intact.

  • Old data already present in the bucket and never cleaned up.

In this case the amount of data, compared to the un-linked data category, is most likely irrelevant and probably not worth tackling at the moment, but potential solutions are still worth mentioning. The first point about multipart uploads is the most relevant, and we can enable solutions to avoid future costs, such as:

  • Enable a lifecycle rule in the S3 bucket that automatically removes unfinished multipart uploads (see https://sagebionetworks.jira.com/browse/PLFM-6462 and the sketch after this list). This would also mean expiring our multipart uploads in the backend (e.g. if the lifecycle deletes incomplete uploads after 2 months, the multipart uploads in the backend should be forcibly restarted, for example, after a month).

  • Refactor the multipart upload to avoid uploading to temporary objects in the bucket (See https://sagebionetworks.jira.com/browse/PLFM-6412)
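A minimal sketch of the lifecycle rule mentioned in the first point, assuming the AWS SDK for Java v1; the 60 day window is illustrative:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AbortIncompleteMultipartUpload;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;

public class MultipartCleanupLifecycle {

    public static void apply(AmazonS3 s3, String bucket) {
        // Remove the hidden parts of multipart uploads that were never completed;
        // the backend would need to forcibly restart its own multipart uploads
        // well before this window expires.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
            .withId("abort-incomplete-multipart-uploads")
            .withFilter(new LifecycleFilter()) // applies to the whole bucket
            .withAbortIncompleteMultipartUpload(
                new AbortIncompleteMultipartUpload().withDaysAfterInitiation(60))
            .withStatus(BucketLifecycleConfiguration.ENABLED);
        s3.setBucketLifecycleConfiguration(bucket, new BucketLifecycleConfiguration().withRules(rule));
    }
}
```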

We enabled the S3 inventory for our bucket and we can write a job that compares the file handle index with the inventory to delete un-indexed data. A potential approach is to write a job that:

  • Streams to S3 the keys of the file handles that point to the prod bucket and were created between two given dates (e.g. the last time the job was run and one month in the past from now)

  • Uses Athena to join this data with the latest inventory (filtering again by the given dates) to identify un-indexed data

  • Deletes the un-indexed data from S3 (or moves it to a low cost storage class bucket with an automatic deletion policy)

An alternative is to set up a Glue job that periodically dumps the file handles table to S3, similar to the S3 inventory, and then a job that joins the two tables using Athena to find un-indexed data; note however that we need to make sure that no temporary data (e.g. the multipart upload parts) is included in the join, e.g. by filtering on sensible dates.
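A sketch of the kind of Athena join described above; the Glue table names, columns and date window are assumptions:

```java
public final class UnindexedDataQuery {

    /**
     * Joins the S3 inventory with the exported file handle keys: an inventory
     * key with no matching file handle is un-indexed. The date filter keeps
     * recent temporary data (e.g. multipart upload parts) out of the result.
     */
    public static final String QUERY =
        "SELECT i.key, i.size "
        + "FROM s3_inventory i "
        + "LEFT JOIN file_handle_keys f ON i.key = f.key "
        + "WHERE f.key IS NULL "
        + "AND i.last_modified_date < date_add('day', -30, now())";

    private UnindexedDataQuery() {}
}
```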

In general we should make sure that any data that ends up in the prod bucket is actually indexed in file handles; for example, the temporary objects used for the multipart upload could be moved to their own dedicated bucket.
