Service to migrate content at the S3/Storage Level: Use Cases

A design document for building a solution for use case 1 can be found here: Synapse Storage Reports: API Design Document

Use Cases

Use Case 1: Move projects between Synapse-managed S3 buckets

A handful of internally managed Synapse projects store so much data (tens to hundreds of TB) that there are benefits (namely, itemized S3 billing) to placing each project in its own storage location. It would be useful to be able to move all of the content in one of these projects to its own bucket. Additionally, we can encrypt this data as we move it into the new bucket, dramatically reducing the amount of data we need to encrypt in the main bucket as part of PLFM-5201.

This is a high-priority use case and is driving this proposal. This issue "blocks" tickets related to certification remediation (namely, storing PHI on AWS). Those tickets do not necessarily need this service (we can come up with workarounds), but this could simplify the work that needs to be done later.

Requirements
  • Project-level resolution is necessary (i.e. we must be able to get all of the file handles that are "owned" by a project)
  • Synapse must transfer the underlying S3 files to the new bucket
  • It is necessary that the original files (and their previews) are eventually deleted from Synapse storage, since we are trying to shift the costs in the main S3 bucket to other S3 buckets.
  • All file versions must be moved
  • It is not acceptable to create new file handle IDs if the actual file does not change.

Use Case 2: Moving content that is in non-S3 storage to another non-S3 location

Users have stored files on an SFTP server that is expected to be decommissioned, so they must move them to another storage location (an external object store). The data is not managed by Synapse, so users of that storage must transfer the files manually. Because the content of the files has not changed, but the location has, users would like to update all file handles tied to the old storage location (or an enumerated list of particular file handles, if only a subset of the data is being moved) to point to the new file location.

Priority is not currently high, but this would be very useful for some users and would save the engineering team time, since on at least one occasion this has been done manually on the database. It may also be fairly trivial to implement given how similar its requirements are to those of the other use cases.

Requirements
  • Synapse does not manage any underlying file transfer here.
  • Users must have a way to identify all of the file handles they can/should update.
    • What are all of my file handles on storage location X?
    • How can I move all file handles on storage location X, even if I am not the owner of these file handles?
  • File handle IDs should not be updated, for the same reasons outlined in Use Case 1.
  • We should consider validating that a matching file is in the new location before updating the file handle (e.g. via md5? might need to be credentialed as well)
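The validation idea in the last requirement can be sketched as follows. This is only an illustration under stated assumptions: the helper names md5_hex and matches_file_handle are hypothetical, and the sketch assumes Synapse (or a credentialed client) can open the file at the new location as a binary stream and that the file handle already records an MD5.

```python
import hashlib
import io


def md5_hex(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_file_handle(stream, recorded_md5):
    """True if the content at the new location matches the MD5 on the file handle.

    Hypothetical helper: `recorded_md5` stands in for the checksum stored
    with the existing file handle.
    """
    return md5_hex(stream) == recorded_md5.lower()


# Usage: only repoint the file handle if the copied content matches.
original = b"example file content"
recorded = hashlib.md5(original).hexdigest()
new_location = io.BytesIO(original)  # stands in for the file at the new location
if matches_file_handle(new_location, recorded):
    print("safe to update file handle")
```

Reading in chunks keeps memory use flat for large files. If the new location is not readable by Synapse, the same check would have to run on the client side.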

Use Case 3: Moving content that is in SFTP storage to a managed S3-bucket

Synapse is deprecating SFTP storage. To simplify migration (and accelerate deprecation), we can give SFTP users the option to migrate their data into Synapse storage.

Requirements
  • Synapse must be able to transfer files from the SFTP server (authentication).
    • We should consider the implications of cases where Synapse cannot access the SFTP server 
    • This might need to be done through a client that can connect and authenticate to the SFTP server and then upload the file into Synapse.
  • File handles would probably want to be selected based on storage location.
  • We would not delete the old files, because these should be managed by the owner of the SFTP server.

High-level service proposal that would cover use cases, if feasible

Based on the use cases above, we would probably need to:

  1. Create a way to find all file handles that
    1. Are "owned" by a project
    2. Belong to a particular storage location (or more likely, do not belong to the destination storage location)
    3. Have not already been processed/migrated
  2. In cases where the destination is Synapse-managed, be able to sync the content from the original storage location to a specified S3 bucket.
    1. Log this sync and its final status, timestamp, etc.
  3. Update a file handle to point to a new location. Atomically, this includes:
    1. Verifying that the origin and destination files match (e.g. a checksum provided by S3, or that we can calculate if Synapse has access to both files).
    2. If the files match, update the file handle (excluding the ID and eTag).
    3. Log the file handle change
  4. In cases where the origin is Synapse-managed, mark the origin file for eventual deletion.
  5. Create a log of all file transfers in case we later need to review the history of what was done.
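Steps 1 and 3 above can be sketched in memory as follows. This is not the Synapse data model: the FileHandle and ChangeLogEntry classes, their fields, and the select_candidates and repoint functions are all hypothetical names chosen for illustration. The destination checksum is assumed to be supplied by the caller (for example, from S3), and a real service would perform the update and the log write in one database transaction.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import List, Optional, Set


@dataclass(frozen=True)
class FileHandle:  # hypothetical, simplified record
    id: str
    etag: str
    project_id: str
    storage_location_id: int
    bucket: str
    key: str
    md5: str


@dataclass
class ChangeLogEntry:  # hypothetical log row for step 3.3
    file_handle_id: str
    old_location: str
    new_location: str
    timestamp: datetime


def select_candidates(handles: List[FileHandle], project_id: str,
                      destination_id: int, migrated_ids: Set[str]) -> List[FileHandle]:
    """Step 1: handles owned by the project, not yet in the destination, not yet migrated."""
    return [h for h in handles
            if h.project_id == project_id
            and h.storage_location_id != destination_id
            and h.id not in migrated_ids]


def repoint(handle: FileHandle, destination_id: int, bucket: str, key: str,
            destination_md5: str, log: List[ChangeLogEntry]) -> Optional[FileHandle]:
    """Step 3: verify, update everything except ID and eTag, then log the change."""
    if destination_md5 != handle.md5:
        return None  # files differ: leave the handle untouched, log nothing
    updated = replace(handle, storage_location_id=destination_id, bucket=bucket, key=key)
    log.append(ChangeLogEntry(handle.id, f"{handle.bucket}/{handle.key}",
                              f"{bucket}/{key}", datetime.now(timezone.utc)))
    return updated


# Usage
log: List[ChangeLogEntry] = []
h = FileHandle("101", "etag-a", "syn1", 1, "main-bucket", "101/data.csv", "abc123")
for candidate in select_candidates([h], "syn1", destination_id=2, migrated_ids=set()):
    moved = repoint(candidate, 2, "project-bucket", candidate.key, "abc123", log)
    print(moved.id == candidate.id, moved.bucket)
```

Keeping the selection and the update as separate functions mirrors how the use cases would combine them: a Synapse-managed migration would call both, while a user-driven update might only need the second.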

Use Case 1 could probably be one operation called by a project owner or bucket owner. This operation could handle (1-4).

Use Case 3 could be similar, but would only involve (1-3), since the old files are not deleted.

Use Case 2 could involve a user calling (1) to get a list of file handles and (3) to update the file handles. One suggestion was to provide an interface similar to the Python client's copyFileHandles method; an updateFileHandles method might make sense here from a user perspective.

Questions

Issues we need to resolve that may guide implementation or refine use cases.

Who is authorized to initiate transfers? Who can move a particular file to another storage location, or modify a file handle?

An individual user should not be able to modify file handles that they do not own. Migrations that involve file handles owned by multiple users should only be performed by an admin after determining that the.

Users should be permitted to update their own file handles in cases where Synapse does not manage the storage.
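One possible reading of these rules is sketched below. It is not a decided policy: the HandleInfo class and the may_update function are hypothetical, and the sketch assumes that a non-admin user may not repoint handles in Synapse-managed storage, which this document leaves open.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class HandleInfo:  # hypothetical, minimal view of a file handle
    id: str
    owner_id: str
    synapse_managed: bool


def may_update(user_id: str, is_admin: bool, handles: Iterable[HandleInfo]) -> bool:
    """Return True if the caller may update every handle in the batch."""
    if is_admin:
        # Migrations that span multiple owners are reserved for admins.
        return True
    # Assumption: non-admins may only update their own handles, and only
    # where Synapse does not manage the underlying storage.
    return all(h.owner_id == user_id and not h.synapse_managed for h in handles)


# Usage
batch = [HandleInfo("1", "alice", False), HandleInfo("2", "bob", False)]
print(may_update("alice", False, batch))  # mixed owners, so an admin is required
```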

Which file handles in a project should be moved?

This JSON file outlines all of the objects that can be tied to a file handle

Of these, we can walk through a project and pick out these objects and migrate/modify their associated file handles.

FileEntity

TableEntity

WikiAttachment

WikiMarkdown

We must consider the implications of moving all file handles belonging to entities structured under a project. If we move all files referenced by a project to an S3 bucket that isn't managed by Sage, the S3 bucket owner will be able to delete files referenced in Synapse. If those files are owned by a different Synapse user and are used in other places in Synapse, they could be deleted or modified even though the user who owns the file handle assumed they were safe in Synapse storage. For this reason, we should initially only consider moving files between buckets managed by Sage, until we are sure there are no risks of this kind.