...

Whenever a file is uploaded through Synapse, a reference to the file is maintained in an index, which is a table in an RDS MySQL instance. We refer to a record in this index table as a File Handle. A file handle is simply a pointer to where the physical data is stored and does not provide any context about where the data is actually used; for this reason there is no explicit access model and the user that created the file handle “owns” it (only the owner can delete the file handle or access it directly). Additionally, from a user perspective a file handle is immutable and its metadata cannot be updated; this simplifies the internal handling of files in Synapse and provides a generic abstraction over file management.
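For illustration only, a file handle record can be thought of as the following minimal structure (the field names are illustrative, not the actual schema; the design below only relies on an id, the S3 location and the modification timestamp):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # immutable from the user's perspective
class FileHandle:
    id: int                # primary key in the index table
    bucket: str            # S3 bucket holding the physical data
    key: str               # S3 key of the physical data
    created_by: int        # the user that "owns" the file handle
    created_on: datetime
    updated_on: datetime   # used later to exclude recently modified handles
```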

...

Once one or more scans are performed we end up with a log of the last time each file handle was seen as linked. This data will live in S3 and can be queried with Athena. We can then use this information to check whether any file handle has not been linked for more than a given amount of time, assuming that a scan was performed recently. Note that we can restrict the check to file handles that were last modified before a certain point in time, e.g. only file handles modified more than 30 days ago; this provides a window for scanning the associations multiple times.

Since the file handles are in the main DB we need a way to get this information, and we could simply run several Athena queries, each with a batch of ids. This is not ideal as it is a slow process with several limitations: Athena queries are limited in size (e.g. we can probably ask for a batch of 10K ids at a time), there is a limit on the number of concurrent queries (e.g. 20/s), and handling the results is tricky (e.g. queries are asynchronous). Where Athena shines is when a single query is performed over a big dataset.

We can instead periodically export the file handle table to S3 and join the data directly with Athena. We can write a script that runs weekly and does the following:

  1. Restore a recent snapshot of the DB

  2. Run a Glue job on the snapshot to export the file handle table to S3 in parquet format (from an initial experiment this can take around 10-20 minutes). We can filter on the modification date (e.g. only file handles that were modified at least 30 days before the last scan was started) and avoid including previews (which are technically linked file handles).

  3. Join this table with the data delivered by the Kinesis stream to detect un-linked file handles, producing a new table in S3 that contains only the un-linked file handles (a query sketch follows this list)

  4. Send a message to the backend to inform it that the results are ready; a worker can read these results in batches to populate a table with the un-linked file handle ids.
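A minimal sketch of what the join in step 3 could look like with boto3; the database, table and column names are placeholders for whatever the Glue catalog ends up using:

```python
import boto3

# Hypothetical Glue tables: "file_handles" (the exported snapshot) and
# "file_handle_associations" (the scanner output delivered by Kinesis),
# both in a placeholder "warehouse" database.
UNLINKED_QUERY = """
SELECT F.id
FROM file_handles F
LEFT JOIN file_handle_associations A ON F.id = A.file_handle_id
WHERE A.file_handle_id IS NULL
  AND F.updated_on < date_add('day', -30, current_date)
"""

athena = boto3.client('athena')

response = athena.start_query_execution(
    QueryString=UNLINKED_QUERY,
    QueryExecutionContext={'Database': 'warehouse'},
    ResultConfiguration={'OutputLocation': 's3://example-athena-results/'},  # placeholder bucket
)
# The query id is what would be sent to the backend so a worker can page through the results.
print(response['QueryExecutionId'])
```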

Note that given the uncertainty in the process (e.g. unexpected failures, missed scan data, etc.), when a file handle is detected as un-linked we do not actually perform any other action. We can simply have a counter (or several records) that tracks how many times a file handle was detected as un-linked. This means that we can boost our confidence over time: for example, if a file handle was detected as un-linked in 3 subsequent scans we can flag it as UNLINKED for subsequent archival. We can have a worker that periodically counts how many times a file was detected as un-linked, updates the file handle status and removes the records from this table (e.g. to avoid reprocessing).

An interesting alternative approach to test would be to append the results to S3 directly (Athena supports an INSERT INTO kind of operation), building a log of un-linked detections; we can then run a query on this dataset directly in S3. For example, we can group by the id of the file handle in the un-linked log over a window of the previous 30 days and count how many times the file handle was detected as un-linked: the result would effectively be the file handles that were detected as un-linked at least 3 times in the span of a month (assuming we run the detection often enough).
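A sketch of the aggregation described above, assuming a hypothetical unlinked_detections Glue table to which every detection run appends (file_handle_id, detected_on) rows:

```python
# Hypothetical Athena/Presto query over the un-linked detection log: keep only the
# file handles seen as un-linked at least 3 times within the last 30 days.
UNLINKED_CANDIDATES_QUERY = """
SELECT file_handle_id
FROM unlinked_detections
WHERE detected_on > date_add('day', -30, current_date)
GROUP BY file_handle_id
HAVING count(*) >= 3
"""
```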

We can reuse the current infrastructure that replays the change messages on file handles and keep exporting the file handle data to a dedicated Glue table using Kinesis. The process is as follows (note: the data resides in a dedicated bucket that automatically deletes objects after 30 days):

  • The file scanner continuously exports associations (currently all the associations are scanned every 6 days, both in prod and staging), using Kinesis and a Glue table to convert to parquet

  • A worker exports the file handle data, including id and updatedOn, using Kinesis and a Glue table to convert to parquet

  • Once a month (first Monday of the month, at night) we run an Athena query that joins the two Glue tables to discover all the file handles that do not have any association (left join with null check); the id of the query is sent to the backend so that the results can be processed

  • A worker processes the Athena query results to flag the file handles as UNLINKED (only handles that are in the AVAILABLE status); see the sketch below
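A sketch of that last step, assuming the backend worker receives the Athena query id in the message; flag_unlinked stands in for the actual DAO update:

```python
import boto3

athena = boto3.client('athena')


def flag_unlinked(ids):
    # Placeholder for the DAO call that sets STATUS = 'UNLINKED'
    # (only for file handles that are still in the AVAILABLE status).
    print(f"Flagging {len(ids)} file handles as UNLINKED")


def process_unlinked_results(query_execution_id, batch_size=10_000):
    """Page through the Athena results and flag the returned ids in batches."""
    paginator = athena.get_paginator('get_query_results')
    batch = []
    header_skipped = False
    for page in paginator.paginate(QueryExecutionId=query_execution_id):
        for row in page['ResultSet']['Rows']:
            if not header_skipped:
                header_skipped = True  # the first row of the result set holds the column names
                continue
            batch.append(int(row['Data'][0]['VarCharValue']))
            if len(batch) >= batch_size:
                flag_unlinked(batch)
                batch = []
    if batch:
        flag_unlinked(batch)
```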

Alternatives Considered:

  1. When we run the scanner we update a timestamp of the file handle directly in the DB; a worker then collects the file handles that have not been updated in more than 30 days and flags them as unlinked. This was the original option. It was dismissed because of the risk of thrashing migration.

  2. Keep a consistent list of links (e.g. add/remove links from the objects that work on file handles): we dismissed this because of the complexity of maintaining a reference count index; additionally, we would duplicate data in the DB and migration would suffer from it.

  3. Use of S3 infrastructure and object tags: the idea was to use object tags and a lifecycle configuration, basically trying to tag objects when they were scanned. We dismissed this because updating a tag does not update the last-modified date of the object, and lifecycles in S3 buckets are based on that date; we would have had to come up with a complex mechanism to update the lifecycle configuration periodically. Additionally, the tagging operation costs would add up over time, especially if scans are done frequently on all the objects.

See additional notes at: https://docs.google.com/document/d/1JgkziMyEtSXlaYbECcOPX0KZ6cvPwKfEbYGFD1K3MZI/edit#heading=h.94pafhl4kn4f

Un-linked File Handle Archival

Deleting user data is a tricky business, especially if we cannot be 100% sure that the data is not used. Instead we propose an approach that goes in stages: first the un-linked data is detected but left accessible, and only after a certain amount of time do we start archiving it, with the option to restore it before it is actually garbage collected. The archived data will eventually be deleted.

We considered various options for archiving the unlinked data:

  1. Archive bucket + lifecycle
    The unlinked file handles can be stored in a dedicated bucket

...

The reason to move the data to a dedicated bucket is that we can define lifecycle policies at the bucket level, keeping the data well organized.

...

  1. …starting with the S3 Standard - Infrequent Access storage class:

...

  • Storage cost is $0.0125/GB (vs standard $0.023/GB for the first 50TB, $0.022/GB for the next 450TB and $0.021/GB for over 500TB): e.g. 1TB is ~$12.5 vs $25

  • PUT/POST/COPY/LIST cost is $0.01/1000 requests (vs standard $0.005/1000 requests): e.g. a million objects is ~$10 vs $5

  • GET/SELECT cost is $0.001/1000 requests (vs standard $0.0004/1000 requests): e.g. a million objects is ~$1 vs $0.4

  • Lifecycle cost is $0.01/1000 requests (e.g. to automatically delete): e.g. a million objects is ~$10

  • Fetch cost is $0.01/GB (e.g. if we want to restore): 1TB is ~$10
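As a back-of-the-envelope comparison of the storage line item alone for the ~91 TiB of unlinked data measured below (requests, lifecycle transitions and retrievals excluded; the standard price is kept flat at the first-tier rate for simplicity):

```python
# Rough monthly storage cost for ~91 TiB of unlinked data, using the per-GB prices above.
unlinked_gb = 91 * 1024                  # ~93,184 GB

standard = unlinked_gb * 0.023           # S3 Standard (first 50TB tier rate, kept flat)
infrequent = unlinked_gb * 0.0125        # S3 Standard - Infrequent Access

print(f"Standard:    ~${standard:,.0f}/month")               # ~$2,143/month
print(f"Standard-IA: ~${infrequent:,.0f}/month")             # ~$1,165/month
print(f"Difference:  ~${standard - infrequent:,.0f}/month")  # ~$978/month
```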

Storing the data in an archive tier (e.g. Glacier) could be an option, but it complicates restores (the objects first need to be restored) and access to the data becomes expensive and slow. A newer offering from AWS is S3 Intelligent-Tiering, which automatically monitors the usage patterns of object keys and changes their access tiers; while fetching the data has no cost associated, we would pay a monitoring fee ($0.1 per million objects/month). For this use case it makes more sense to directly store the data in the Infrequent Access storage class with custom lifecycles to move it to other classes or delete the objects, since the Intelligent-Tiering class does not (yet) have fine grained control on how the data is moved between classes.

We can introduce a new STATUS column on the file handles table:

STATUS | Description | File Handle Accessible
AVAILABLE | Default status | Yes
UNLINKED | The file handle has been identified as un-linked; if a pre-signed URL is requested for such an object we trigger an alarm (this should never happen unless we mistakenly identified a linked object) | Yes, trigger alarm. Should we instead consider this as not accessible?
ARCHIVING | The file is being archived | No, throw not found, trigger alarm
ARCHIVED | The file has been archived (e.g. moved from the original bucket to the archive bucket) | No, throw not found, trigger alarm
DELETED | The file has been deleted; we can set up notifications in the S3 bucket and change the status when a deleted object matches the key of the archived object(s). An alternative to deletion would be to store the objects in S3 Glacier or Deep Archive for low long-term storage costs | No, throw not found, trigger alarm

Additionally we keep track of the status update timestamp with a new column STATUS_TIMESTAMP. We use this timestamp to decide when to move an UNLINKED file handle to the archive. E.g. we can keep the file handle un-linked for 30 days before we move it to the archive.

A worker will simply periodically scan the UNLINKED file handles whose status was updated more than 30 days in the past:

  1. The worker will check whether other file handles that are still AVAILABLE use the same key (e.g. it’s a logical copy); in this case we cannot move the underlying data yet and the archival is delayed. We can update the timestamp so that the file handle is re-processed in the next 30 days. Note: this is a tricky edge case. We could delete the file handle at this point, but then we would only allow the 30 days window to restore it in the case of a copy. Maybe we can use a special status instead? What happens if the last copy is archived? What if this copy is then restored?

  2. For eligible file handles the worker will set the status to ARCHIVING and send a message for another worker to start the archiving; that worker will move the data to the archive bucket and set the status to ARCHIVED once done (see the sketch after this list). We can keep the key of the file handle the same but prefixed with the source bucket (e.g. archive.bucket/proddata.sagebase.org/originalKey); in this way it’s easy to restore the file handle to the original key. Note that the previews can be deleted here; if a file handle is restored we can trigger the preview generation again.
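A minimal sketch of the move performed by the archiving worker, using boto3; the archive bucket name and the key-prefix convention follow the example above, and the status updates are only hinted at in the comments:

```python
import boto3

s3 = boto3.client('s3')

SOURCE_BUCKET = 'proddata.sagebase.org'
ARCHIVE_BUCKET = 'archive.bucket'  # placeholder name for the dedicated archive bucket


def archive_object(key):
    """Move the object backing an UNLINKED file handle into the archive bucket.

    Assumes the caller already verified that no AVAILABLE file handle points to the
    same key and that the file handle status was set to ARCHIVING.
    """
    archive_key = f'{SOURCE_BUCKET}/{key}'  # prefix with the source bucket so restores are trivial
    # client.copy uses the transfer manager, so large objects are copied with multipart copy
    s3.copy(
        CopySource={'Bucket': SOURCE_BUCKET, 'Key': key},
        Bucket=ARCHIVE_BUCKET,
        Key=archive_key,
        ExtraArgs={'StorageClass': 'STANDARD_IA'},
    )
    s3.delete_object(Bucket=SOURCE_BUCKET, Key=key)
    # ...then set the file handle status to ARCHIVED and delete the preview file handles
```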

While a file handle is in the UNLINKED, ARCHIVING or ARCHIVED state it can be restored; the process would move the data back to the original key (if the object is in an archive tier such as Glacier we first need to request a restore from AWS, and the process is a bit more involved). We can introduce dedicated statuses (e.g. RESTORING → AVAILABLE) while the restore request is in progress.
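If the archived object has already been transitioned to an archive tier such as Glacier, the restore request could look like the following sketch (bucket and key are placeholders following the example above); the RESTORING → AVAILABLE transition would then be driven by the corresponding S3 notification:

```python
import boto3

s3 = boto3.client('s3')

# Ask S3 to make an archived object temporarily readable again; the operation is
# asynchronous and can take minutes to hours depending on the retrieval tier.
s3.restore_object(
    Bucket='archive.bucket',                       # placeholder archive bucket
    Key='proddata.sagebase.org/originalKey',       # key layout from the example above
    RestoreRequest={
        'Days': 7,                                 # how long the restored copy stays readable
        'GlacierJobParameters': {'Tier': 'Bulk'},  # cheapest (and slowest) retrieval option
    },
)
# Once the s3:ObjectRestore:Completed notification arrives, the data can be copied
# back to the original key and the file handle set back to AVAILABLE.
```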

Alternative Implementation (Dismissed)

The process proposed above has the advantage of being mostly done externally to Synapse, but it also introduces several points of failure that might be difficult to handle correctly. Additionally, it adds infrastructure and technologies that the backend team has limited knowledge of, and in general it’s a substantial engineering effort that might span several months of development.

It’s worth mentioning a simpler alternative that reuses familiar technologies, especially considering that, from the bucket analysis, what we are most concerned about is data linked in file entities and tables. The process might go as follows:

  1. Every time a link is added (e.g. a file entity is created, a revision is updated, etc.) a dedicated service is invoked to inform the system about the association. We simply add a record to a dedicated table that maps the file handle to the object type/id; this should be done in the same transaction.

  2. When a link is removed we invoke the service again to remove the link

  3. Periodically we scan this table to detect unlinked file handles (e.g. file handles that do not have records in this table) and record this information in another table with a timestamp.

  4. Periodically we scan the “un-linked” table for file handles that have been un-linked for more than 30 days, update the file handles and remove the records from this table

  5. When a link is added we make sure that the record is removed from the previous table as well

This is a simpler solution; the drawback is that we specifically have to “add the link” when we work with file handles, and it adds a bit of overhead (e.g. an additional insert in the DB), but we already do something similar one way or another. Forgetting to clean up the links does not lead to any data loss. Additionally, we can incrementally add the cleanup for each type if needed; for example, we can focus on entities and tables only for now. The deletion can be asynchronous, taking care of the deletion timestamp (e.g. delete only if the deletion timestamp is after any existing link), because deleting an entity or a table might lead to several thousands or millions of deletions that need to be done in a background job.

Note that this would not necessarily replace the existing associations used for the access checks.

Of course this has the drawback of introducing another big table that needs to be migrated, and the existing links need to be backfilled. We could store this in a dedicated DB that does not migrate, but in that case the “un-linked” discovery would not take advantage of joining the file handle table.

...

  1. …A lifecycle configuration in the archive bucket can be used to move objects, after a certain amount of time from their upload, to cheaper storage classes and eventually delete them. The process would be as follows:

    • Periodically collect a batch of eligible UNLINKED file handles for which no copy that is still AVAILABLE exists, and send a message to a dedicated SQS queue to archive the key

    • A worker uses the transfer manager to copy the object (if bigger than 128KB, to avoid paying the overhead of the Infrequent Access class) to the destination bucket; the file handle status is updated accordingly

    • Through S3 notifications, when an object is deleted by the lifecycle we can remove the corresponding file handle

    Restoring an archived file handle in this case would involve checking the storage class, performing a restore request if the object is in an archive tier, processing the S3 notification when the object is restored and moving the object back to the original location.

  2. Intelligent Tiering + tagging
    Instead of using the Infrequent Access storage class, we can use the new Intelligent-Tiering class (INT). This special class enables monitoring of object access and automatically moves objects between standard and infrequent access according to access patterns. If an object is not accessed for 30 days it is moved automatically to INT-infrequent access; if the object is accessed again it is automatically moved back to INT-standard.
    Additionally, a new feature allows enabling a specific lifecycle configuration for the INT storage class that moves objects that have not been accessed for more than X days (min 90) to an archive tier and after Y days (min 180) to a deep archive tier. The archive tiers require a restore operation to access the objects, which moves them back to the standard class. This particular lifecycle can also be set up with a rule that uses object tags, so we can enable it only for UNLINKED data.

    The idea would be the following:

    • Move everything in our bucket to INT (we can use a lifecycle configuration for this); objects that are smaller than 128KB are not moved (not cost effective)

    • Any object uploaded to our bucket is automatically uploaded as INT (if > 128KB); this can be part of the storage location metadata

    • Set up a lifecycle configuration for INT so that objects tagged with some special value (e.g. synapse-status=UNLINKED) are moved to the archive tier after 90 days and to the deep archive tier after 180 days (a configuration sketch follows the comparison below)

    • Set up a worker that, as in option 1, collects a batch of eligible UNLINKED file handles, tags them appropriately and sets the status to ARCHIVED

    • For deletions we can have a worker that processes the archived data older than 1 year. Given the recent development of the INT class (e.g. support for the archive tier) I expect AWS to implement this for us (e.g. add to the lifecycle the option to delete objects, similar to the normal lifecycles); we can wait 1 year before implementing this and see if Amazon delivers.

    Restoring an archived file handle in this case would involve checking the storage class, performing a restore request if the object is in the archive tier and processing the S3 notification when the object is restored (no move involved; there is no temporary object for archived data in this case as it is moved automatically back to the standard class).

There are several advantages of option 2 vs option 1: if our objects are all in INT we should see additional cost savings, not only for UNLINKED data but also for data that is linked but not accessed, while keeping the same durability and performance. The fact that INT works on the access date rather than the creation date is a good fit for our use case, where we do not know the access patterns of our data. Additionally, the INT class, even though it moves objects to infrequent storage, does not have additional access/retrieval fees, while the Infrequent Access and archive classes such as Glacier can get expensive if we have to access the objects. The biggest advantage is that S3 moves the files around for us according to access date and tagging; we only have to tag the objects that we want to archive. The restore operation is also simpler.
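A minimal sketch of what option 2 amounts to in boto3: an Intelligent-Tiering archive configuration scoped to a tag (synapse-status=UNLINKED is the example tag value from above), plus the tagging call performed by the worker. Bucket, configuration id and key are placeholders:

```python
import boto3

s3 = boto3.client('s3')
BUCKET = 'proddata.sagebase.org'

# Archive configuration for the INT storage class: tagged objects that are not accessed
# for 90/180 days are moved to the Archive / Deep Archive access tiers.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket=BUCKET,
    Id='archive-unlinked',
    IntelligentTieringConfiguration={
        'Id': 'archive-unlinked',
        'Status': 'Enabled',
        'Filter': {'Tag': {'Key': 'synapse-status', 'Value': 'UNLINKED'}},
        'Tierings': [
            {'Days': 90, 'AccessTier': 'ARCHIVE_ACCESS'},
            {'Days': 180, 'AccessTier': 'DEEP_ARCHIVE_ACCESS'},
        ],
    },
)

# The worker then only needs to tag the eligible keys (note: this call replaces any
# existing tag set on the object) and set the file handle status to ARCHIVED.
s3.put_object_tagging(
    Bucket=BUCKET,
    Key='some/unlinked/key',  # placeholder key
    Tagging={'TagSet': [{'Key': 'synapse-status', 'Value': 'UNLINKED'}]},
)
```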

Cost Analysis

In order to determine which option is best and whether it is worth deleting data, we collected statistics about the unlinked data and made an estimate of the amount of data that is actually accessed (hot data) so that we can compare the various options. The following data was collected for the proddata.sagebase.org bucket:

  • Total Size: 748,386,690,923,161 (680.7 TiB)

  • Number of objects: 44,509,789

  • Hot Data Count* : 9,802,820

  • Hot Data Size*: 314,611,245,904,518 (286.1 TiB)

  • Hot Data (>= 128KB) Count: 4,555,052

  • Hot Data (>= 128KB) Size: 314,472,063,061,465 (286 TiB)

  • Unlinked Data Count: 8,762,805

  • Unlinked Data Size: 100,456,586,079,288 (91.36 TiB)

  • Unlinked Data Count (>= 128 KB): 2,823,189

  • Unlinked Data Size (>= 128 KB): 100,408,552,794,768 (91.32 TiB)

  • Monthly Unlinked Count (>= 128KB): 28,808

  • Monthly Unlinked Size (>= 128KB): 1,024,577,069,334 (0.93 TiB)

* We considered the file handles downloaded from tables and/or entities from 2020 till now. This does not include other types of downloads, so we will round up to 300TiB of hot data for the estimates.

We estimated the monthly cost using the S3 pricing calculator, rounding the total amount of data to 680 TiB (T), the hot data to 300 TiB (H) and the unlinked data to 91 TiB (U):

  1. Leave the data in STD: $15,185

  2. Move current unlinked (>=128 KB) to infrequent Access*: $14,394

    1. Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in IA

    2. Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)

  3. Eventually move unlinked data to Glacier deep archive*: $13,323

    1. Using the unlinked data >= 128KB: (680T - 91U) in standard + 91U in Glacier Deep Archive

    2. Consider an avg of 28,808/month unlinked files for moving data to IA (PUTs cost)

    3. Consider the avg of 28,808/month for lifecycle transitions to Glacier and Glacier Deep Archive

  4. Move everything to INT**: $11,775

    1. Consider 44% in INT-Standard (300H/680T)

    2. Consider 56% in INT-IA ((680T-300H)/680T)

    3. Includes the 44M objects monitoring fee

  5. Move to INT + unlinked eventually in INT-deep archive**: $10,734

    1. Assumes that unlinked data is part of the cold data

    2. Consider 44% in INT-Standard (300H/680T)

    3. Consider 43% in INT-IA ((680T - 300H - 91U)/680T)

    4. Consider 13% in INT-Deep Archive (91U/680T)

    5. Includes the 44M objects monitoring fee

    6. Assumes that we have 28,808 tags per month (GET/PUT request + tag costs)

  6. Move to INT + delete unlinked data**: $10,544

    1. Assumes that unlinked data is part of the cold data

    2. Consider 50% in INT-Standard (300H/(680T - 91U))

    3. Consider 50% in INT-IA

    4. Includes the 44M - 8M objects monitoring fee, no tags, no lifecycle transitions

It is clear that just moving everything to INT is cost effective for our use case where the access patterns are unknown. On top of that, archiving unlinked data might be worth it even though the cost savings are not comparable to just using the INT storage class; we have around 1TiB of data on average that gets unlinked each month.

Of note is that part of the unlinked data consists of temporary file handles that Synapse creates internally (e.g. table queries, table CSVs for uploads, etc.); the fact that each month we have such a high number (30K) of unlinked file handles might be attributed to this. Unfortunately we do not tag this kind of data, so we do not know how much of that 1TiB is actually temporary file handles.

To solve this we should upload this data either with a special object tag that expires the object through a lifecycle rule, or in a dedicated bucket (e.g. another storage location) with a dedicated lifecycle that expires everything after 30 days (see the sketch below). These temporary objects could be flagged in the database with a special status (e.g. TEMPORARY) so that they are not collected as UNLINKED but rather removed either by S3 notifications (see open questions below) or through a worker.
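For the dedicated-bucket variant, a sketch of the expiration rule (the bucket name is a placeholder for a hypothetical temporary storage location):

```python
import boto3

s3 = boto3.client('s3')

# Expire everything in the temporary bucket 30 days after upload; the matching
# TEMPORARY file handle records would be cleaned up separately (S3 notifications or a worker).
s3.put_bucket_lifecycle_configuration(
    Bucket='prodtmp.sagebase.org',  # placeholder bucket for the temporary storage location
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire-temporary-objects',
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},    # apply to the whole bucket
                'Expiration': {'Days': 30},
            }
        ]
    },
)
```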

Another clear point that surfaced is that deleting the unlinked data is not worth the effort; if we move it to the cheapest archive tier it is actually worth keeping the file handles around just in case we need to recover them.

Open questions:

  • Even though we flag file handles as UNLINKED, we allow file handles to be copied, so we have multiple file handles in the database that point to the same key. In some cases some of the copies might be flagged as UNLINKED while one or more existing copies of the same key might still be linked. This poses a technical challenge: we cannot archive this data since the key is still linked, but the record in the database is still there as UNLINKED; if we do not change its status or date we will keep encountering these file handles unless all the links are eventually removed. For example:
    F1 (k1, AVAILABLE)
    F2 (k1, UNLINKED)
    F3 (k1, UNLINKED)
    F4 (k2, UNLINKED)
    If we collect a batch of unlinked file handles and we limit the batch to a size of 2, we will fetch F2 and F3. They point to the same key but cannot be archived because F1, which references the same key, is still linked. Should we in this case delete F2 and F3 (so that in the next run we can process F4)? If we do not delete them we could update their timestamp, but then we risk updating it forever; maybe use a special status instead? Why would we want to keep those around anyway?

  • It is really not clear whether lifecycle transitions generate S3 notifications, for example if we wanted to get a notification when a temporary object is automatically expired so that we can remove the file handle record. It looks like this was not possible a few years ago, but the reference in the documentation has disappeared; we will have to test this or contact support.

Unindexed Data

From the S3 Bucket Analysis it turns out that we have around 7.5TB of data in S3 for which there is no file handle.

...