
...

Related JIRA issue: PLFM-5227

There has been a demonstrated need to move content in Synapse to different S3 buckets, for various reasons. This document will contain the use cases and design plan.


Background Jiras

PLFM-5082, PLFM-5085, PLFM-5108, PLFM-5201

Background

In PLFM-5082, there was a need to move content from Synapse storage to an encrypted, private bucket managed by the user. For this use case, SynapseBucketMover was developed. The tool is sufficient for that case, but it has undesired behavior related to versioning (it creates a new version of every file it moves), and its behavior is undefined when files are accessed while they are being moved. This is undesirable when a large project with many collaborators must be moved.

In PLFM-5108, we plan to move content from the main Synapse storage bucket to another internally managed bucket. This lays the groundwork for projects like PLFM-5085. These are the main use cases.

A reason for the prioritization of these use cases is that in PLFM-5201, we plan to encrypt all content in the default Synapse storage bucket. Our top 10 projects account for about three quarters of the data in that bucket. By moving this content to encrypted buckets (per PLFM-5108), we can dramatically reduce the amount of data that we need to encrypt for PLFM-5201.

Revised Use Cases and Possible Implementation Notes

Use Case 1: Moving projects for cost/billing purposes

A handful of Synapse projects store so much data (10+ TB, up to ~150 TB) that there are concerns about how we should deal with their growing costs. It would be useful to have a way to measure the impact of each project on our S3 bill.

Use Case 1a: Determine the individual storage and egress usage of projects that store massive amounts (10+ TB) of data for the purpose of itemizing costs by project

This is the primary use case. Currently, all of the large projects in Synapse are internally managed. We need a way to approximate the storage/egress costs of each of these projects.

Use Case 1b: Service to move files into externally managed storage so we no longer have to pay costs

Considering this use case could drive major implementation decisions. We currently have no content that we wish to move into buckets that are not managed internally by Sage, but it is foreseeable that an external third party could use an excessive amount of Synapse storage, and we would want to offload their data so that they assume the costs of storage and egress in their own S3 bucket.

Considering needs and relative urgency/priority, use cases 2 and 3 below are considered unrelated and out of scope. Their requirements will not be considered while designing this service. We may approach them in a separate task in the future.

Implementation

AWS Cost Allocation Tags and Cost Allocations in Synapse

AWS has a feature called cost allocation tags that we could leverage to separate billing. Cost allocation tags can only be applied at the bucket level.
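
As a hedged illustration of how bucket-level tags could feed itemized reporting (Use Case 1a), the sketch below queries the AWS Cost Explorer API for monthly S3 cost grouped by a hypothetical tag key; the tag key and dates are assumptions, not decisions. Egress could be broken out similarly by filtering on data-transfer usage types.

    # Sketch: monthly S3 cost per cost allocation, grouped by a hypothetical
    # tag key ("synapse-cost-allocation"); dates and tag key are assumptions.
    import boto3

    ce = boto3.client("ce")  # Cost Explorer

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2019-01-01", "End": "2019-02-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "SERVICE",
                               "Values": ["Amazon Simple Storage Service"]}},
        GroupBy=[{"Type": "TAG", "Key": "synapse-cost-allocation"}],
    )

    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            tag_value = group["Keys"][0]  # e.g. "synapse-cost-allocation$ampad"
            cost = group["Metrics"]["UnblendedCost"]["Amount"]
            print(tag_value, cost)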

We can create a new Cost Allocation service in Synapse that can be used by members of a "Synapse Cost Allocation Team", a bootstrapped team with users that have the authority to manage these cost allocations.

The members of the Cost Allocation Team can create new cost allocations with a descriptive name (e.g. ampad, nih-r01-12345) that matches how costs should be broken down in Synapse. When a new cost allocation is created, a new bucket is provisioned with an AWS Cost Allocation Tag.

After a cost allocation is created, a project can be assigned to it. Cost allocations can have many projects, but a project can be assigned to at most one cost allocation. When a project is assigned to a cost allocation, the underlying files associated with that project will eventually be moved to the new S3 bucket, and the file handles will be updated. After all files have been moved, the AWS bill and cost explorer tools can facet costs by the cost allocations, as desired.
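
A minimal sketch of the provisioning step that could run when a cost allocation is created. The bucket naming scheme, tag key, and region are illustrative assumptions only.

    # Sketch: provision an encrypted S3 bucket tagged for a cost allocation.
    # Bucket naming scheme, tag key, and region are assumptions.
    import boto3

    def provision_cost_allocation_bucket(name: str, region: str = "us-east-1") -> str:
        s3 = boto3.client("s3", region_name=region)
        bucket = "org.sagebase.synapse.cost." + name  # hypothetical naming scheme

        s3.create_bucket(Bucket=bucket)
        # Tag the bucket so the AWS bill/cost explorer can facet by cost allocation.
        s3.put_bucket_tagging(
            Bucket=bucket,
            Tagging={"TagSet": [{"Key": "synapse-cost-allocation", "Value": name}]},
        )
        # Encrypt new content by default (relevant to PLFM-5201).
        s3.put_bucket_encryption(
            Bucket=bucket,
            ServerSideEncryptionConfiguration={
                "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
            },
        )
        return bucket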

API

...

None

...

Gets the cost allocation on an entity, if there is one.

This information may also be added to the entity bundle.

...

Path parameter

name: String

the name of the cost allocation

...

AsyncJobId

...

Removes the cost allocation tied to a project. The contents of the project that are in the cost allocation storage location will be moved to the default Synapse storage.

After all of the contents have been moved, the project is removed from the cost allocation.

...

CostAllocationAsyncJobStatus (ext AsynchronousJobStatus)

...

Requirements and Design Strategies
  • Retrieval of file handle metadata, particularly project association and file size 
    • File Replication Table
      • Columns from File Table: ID, ETAG, PREVIEW_ID, CREATED_ON, CREATED_BY, METADATA_TYPE, CONTENT_TYPE, CONTENT_SIZE, CONTENT_MD5, BUCKET_NAME, NAME, KEY, STORAGE_LOCATION_ID, ENDPOINT
      • Primary key ID
    • Creation of this table allows retrieval of file metadata by joining it with the entity replication table. This allows us to find all of the file handles and their metadata in one database call. Without this table, we must query the tables database to find the entities in a project, and then separately query the repo database to retrieve the metadata of those files.
  • A way to enumerate cost allocations
    • Cost allocation table: ID, NAME, CREATED_BY, CREATED_ON
  • A way to associate cost allocations and projects
    • Cost Allocation association table
      • Columns: COST_ALLOCATION_ID, PROJECT_ID
      • Primary key: PROJECT_ID
  • Overriding uploads to default Synapse storage to redirect to the cost allocation bucket.
  • An eventually consistent workflow for moving underlying files to a new bucket, and then modifying the file handle to reflect the move
    • Asynchronous worker scans file replication table for files that are in the project with the cost allocation AND are in Synapse storage AND are not in the correct cost allocation bucket
      • This worker creates bundles of files that should be updated and sends these bundles to another asynchronous worker
    • Asynchronous worker finds each file in the bundle in the files table and verifies that it is not in the correct cost allocation bucket
      • The worker copies the file to the new S3 bucket and updates the file handle, archiving/deleting the old underlying file (see the sketch after this list).
      • The actual copying may be fairly trivial with AWS Batch Operations for S3 Buckets and Copy Objects Between S3 Buckets With Lambda.
        • We still must somehow mark the old file for deletion, perhaps by moving it to low-cost storage with a lifecycle deletion policy.
        • If we can track the batch operations, we can modify file handles one by one. Otherwise, we may have to find another solution or wait for the batch operation to complete.
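
A rough sketch of the per-file step of the second worker referenced above; the database handle, FILES table access, and the pending-delete tag are assumptions layered on the columns listed earlier, not an existing implementation.

    # Sketch: move one file handle's object into the cost allocation bucket,
    # update the file handle, and tag the old object for lifecycle deletion.
    # The db handle, SQL, and tag name are illustrative assumptions.
    import boto3

    s3 = boto3.client("s3")

    def migrate_file_handle(db, file_handle_id, dest_bucket):
        row = db.query("SELECT BUCKET_NAME, `KEY` FROM FILES WHERE ID = %s",
                       (file_handle_id,))
        if row.BUCKET_NAME == dest_bucket:
            return  # already in the correct cost allocation bucket

        # Server-side copy; the object's bytes never pass through the worker.
        s3.copy(CopySource={"Bucket": row.BUCKET_NAME, "Key": row.KEY},
                Bucket=dest_bucket, Key=row.KEY)

        # Re-point the file handle; the ID does not change.
        db.execute("UPDATE FILES SET BUCKET_NAME = %s WHERE ID = %s",
                   (dest_bucket, file_handle_id))

        # Tag the old object so a lifecycle rule can expire it later,
        # rather than deleting it immediately.
        s3.put_object_tagging(
            Bucket=row.BUCKET_NAME, Key=row.KEY,
            Tagging={"TagSet": [{"Key": "synapse-pending-delete", "Value": "true"}]},
        )
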
Flowchart

This gives a general overview of the process required to apply a cost allocation to a project.

Questions/Concerns
  • AWS has a soft cap of 100 buckets per account, which can be raised to 200 for an approved use case. Do we anticipate needing more than 100-200 internally managed projects that are 10+ TB or larger? This may be a distant/unlikely problem, and there are also many buckets that we can probably remove from our prod account (we currently have ~65).
Component 1 - Create a File Replication Table
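
A minimal sketch of the DDL this component might introduce in the tables database, based on the columns listed under Requirements and Design Strategies; column types are assumptions and would need to mirror the repo FILES table.

    # Sketch: file replication table DDL (column types are assumptions).
    FILE_REPLICATION_DDL = """
    CREATE TABLE IF NOT EXISTS FILE_REPLICATION (
        ID                  BIGINT NOT NULL,
        ETAG                CHAR(36) NOT NULL,
        PREVIEW_ID          BIGINT,
        CREATED_ON          BIGINT NOT NULL,
        CREATED_BY          BIGINT NOT NULL,
        METADATA_TYPE       VARCHAR(50),
        CONTENT_TYPE        VARCHAR(255),
        CONTENT_SIZE        BIGINT,
        CONTENT_MD5         VARCHAR(100),
        BUCKET_NAME         VARCHAR(255),
        NAME                VARCHAR(255),
        `KEY`               VARCHAR(700),
        STORAGE_LOCATION_ID BIGINT,
        ENDPOINT            VARCHAR(255),
        PRIMARY KEY (ID)
    )
    """

    def create_file_replication_table(connection):
        # 'connection' is any DB-API connection to the tables database.
        with connection.cursor() as cursor:
            cursor.execute(FILE_REPLICATION_DDL)
        connection.commit()
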
Component 2 - Implement Cost Allocation System

Once file handles are associated with their parent projects, we can build the cost allocation tool described in the API above to move those file handles that belong to a particular project and are stored in Synapse default storage.

We can do this by:

  1. Bootstrapping a Synapse Cost Allocation Team
  2. Building a way to provision S3 buckets (underneath the abstraction of creating a cost allocation)
  3. Finding all of the file handles that need to be moved
  4. Transferring/copying the files to the new bucket
  5. Modifying file handles to point to the new file location in the new bucket
  6. Archiving/marking for deletion the old copy of the file
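
For step 3, a sketch of the kind of query the file replication table enables; the entity replication table and its column names are assumptions about the existing schema.

    # Sketch: find file handles owned by a project that are still in default
    # Synapse storage (entity replication table/column names are assumptions).
    FIND_FILES_TO_MOVE = """
    SELECT F.ID, F.BUCKET_NAME, F.`KEY`, F.CONTENT_SIZE
      FROM ENTITY_REPLICATION E
      JOIN FILE_REPLICATION F ON F.ID = E.FILE_ID
     WHERE E.PROJECT_ID = %s
       AND F.BUCKET_NAME = %s   -- default Synapse storage bucket
    """

    def find_files_to_move(connection, project_id, synapse_bucket):
        with connection.cursor() as cursor:
            cursor.execute(FIND_FILES_TO_MOVE, (project_id, synapse_bucket))
            return cursor.fetchall()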

...

Note

A design document for building a solution for use case 1 can be found here: 

Use Cases

Use Case 1: Move projects between Synapse-managed S3 buckets

A handful of Synapse projects that are managed internally store so much data (10s-100s of TB) that there are benefits (namely, itemized S3 billing) to placing each project in its own storage location. It would be useful to be able to move all of the content in one of these projects to its own bucket. Additionally, we can encrypt this data as we move it into that bucket, dramatically reducing the amount of data we need to encrypt in the main bucket as a part of PLFM-5201.

This is a high-priority use case and is driving this proposal. This issue "blocks" tickets related to certification remediation (namely, storing PHI on AWS). Those tickets do not necessarily need this service (we can come up with workarounds), but this could simplify the work that needs to be done later.

Requirements
  • Project-level resolution is necessary (i.e. we must be able to get all of the file handles that are "owned" by a project)
  • Synapse must transfer the underlying S3 files to the new bucket
  • It is necessary that the original files (and their previews) are eventually deleted from Synapse storage, since we are trying to shift the costs in the main S3 bucket to other S3 buckets.
  • All file versions must be moved
  • It is not acceptable to create new file handle IDs if the actual file does not change.

Use Case 2: Moving content that is in non-S3 storage to another non-S3 location

Users have stored files on an SFTP server that is expected to be decommissioned, so they must move them to another storage location (an external object store). The data is not managed by Synapse, so users must transfer the files manually. Because the content of the files has not changed, but their location has, users would like to update all file handles tied to the old storage location (or an enumerated list of particular file handles, if only a subset of the data is being moved) to point to the new file location.

Priority is not currently high, but this would be very useful for some users and would save the engineering team time, since on at least one occasion this has been done manually on the database. It may also be fairly trivial to implement given how similar its requirements are to those of the other use cases.

Requirements
  • Synapse does not manage any underlying file transfer here.
  • Users must have a way to identify all of the file handles they can/should update.
    • What are all of my file handles on storage location X?
    • How can I move all file handles on storage location X, even if I am not the owner of these file handles?
  • File handle IDs should not be updated, for the same reasons outlined in Use Case 1.
  • We should consider validating that a matching file is in the new location before updating the file handle (e.g. via an MD5 check, which may also require credentials for the new location; see the sketch below)
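
A sketch of the validation suggested in the last requirement. Note that an S3 ETag equals the MD5 only for single-part, non-KMS uploads, so a real check would need a fallback; the credential handling here is assumed.

    # Sketch: check that the object in the new location matches the file
    # handle's recorded MD5 before re-pointing the file handle.
    import boto3

    def matches_file_handle_md5(bucket, key, expected_md5):
        s3 = boto3.client("s3")  # would need credentials for the new location
        etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
        if "-" in etag:
            # Multipart upload: the ETag is not an MD5. A full download-and-hash
            # (or a checksum supplied by the uploader) would be needed instead.
            return False
        return etag.lower() == expected_md5.lower()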

Use Case 3: Moving content that is in SFTP storage to a managed S3-bucket

Synapse is deprecating SFTP storage. To simplify migration (and accelerate deprecation), we can give SFTP users the option to migrate their data into Synapse storage.

Requirements
  • Synapse must be able to transfer files from the SFTP server (this requires authentication).
    • We should consider the implications of cases where Synapse cannot access the SFTP server.
    • This might need to be done through a client that can connect and authenticate to the SFTP server and then upload the files into Synapse (see the sketch after this list).
  • File handles would probably want to be selected based on storage location.
  • We would not delete the old files, because these should be managed by the owner of the SFTP server.
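
A rough sketch of the client-mediated path mentioned in the requirements above, using paramiko for the SFTP download and the Synapse Python client for the upload; host, credentials, and the parent entity are placeholders. A real migration would re-point the existing file handle rather than create a new entity, but the transfer/upload leg would look similar.

    # Sketch: pull one file from an SFTP server and upload it into Synapse
    # storage via the Python client. Host, paths, and parent are placeholders.
    import paramiko
    import synapseclient
    from synapseclient import File

    def migrate_sftp_file(host, username, password, remote_path, local_path, parent_id):
        # Download from the SFTP server (Synapse itself never contacts the server).
        transport = paramiko.Transport((host, 22))
        transport.connect(username=username, password=password)
        sftp = paramiko.SFTPClient.from_transport(transport)
        try:
            sftp.get(remote_path, local_path)
        finally:
            sftp.close()
            transport.close()

        # Upload into Synapse default storage under the given parent entity.
        syn = synapseclient.Synapse()
        syn.login()  # assumes cached credentials
        return syn.store(File(local_path, parent=parent_id))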

High-level service proposal that would cover use cases, if feasible

Based on the use cases above, we would probably need to:

  1. Create a way to find all file handles that
    1. Are "owned" by a project
    2. Belong to a particular storage location (or more likely, do not belong to the destination storage location)
    3. Have not already been processed/migrated
  2. In cases where the destination is Synapse-managed, be able to sync the content from the original storage location to a specified S3 bucket.
    1. Log this sync and its final status, timestamp, etc.
  3. Update a file handle to point to a new location. Atomically, this includes (see the sketch after this list):
    1. Verifying that the origin and destination files match (e.g. via a checksum provided by S3, or one that we can calculate if Synapse has access to both files).
    2. If the files match, updating the file handle (excluding the ID and eTag).
    3. Logging the file handle change.
  4. In cases where the origin is Synapse-managed, mark the origin file for eventual deletion.
  5. Create a log of all file transfers in case we later need to review the history of what was done.
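
To make step 3 concrete (see the note added to that item), a sketch of the atomic update; the database handle, table names, and audit table are hypothetical.

    # Sketch: atomically re-point a file handle after verifying checksums.
    # 'db', the FILES columns, and the audit table are hypothetical.
    import json
    import time

    def update_file_handle_location(db, file_handle_id, new_bucket, new_key, new_md5):
        with db.transaction():
            row = db.query("SELECT CONTENT_MD5 FROM FILES WHERE ID = %s FOR UPDATE",
                           (file_handle_id,))
            # (3a) verify the origin and destination files match
            if row.CONTENT_MD5.lower() != new_md5.lower():
                raise ValueError("Checksum mismatch for file handle %s" % file_handle_id)

            # (3b) update location fields only; the ID and eTag are left untouched
            db.execute("UPDATE FILES SET BUCKET_NAME = %s, `KEY` = %s WHERE ID = %s",
                       (new_bucket, new_key, file_handle_id))

            # (3c) log the change so the migration history can be reviewed later
            db.execute(
                "INSERT INTO FILE_HANDLE_MIGRATION_LOG (FILE_HANDLE_ID, DETAILS, LOGGED_ON)"
                " VALUES (%s, %s, %s)",
                (file_handle_id,
                 json.dumps({"bucket": new_bucket, "key": new_key}),
                 int(time.time() * 1000)),
            )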

Use Case 1 could probably be one operation called by a project owner or bucket owner. This operation could handle (1-4).

...

Use Case 3 could involve a user calling (1) to get a list of file handles and (3) to update them. One suggestion was to offer an interface similar to the Python client's copyFileHandles method; perhaps an updateFileHandles method would make sense here from a user perspective (see the sketch below).
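
As a purely hypothetical illustration of that suggestion (no such function exists in synapseclient or synapseutils today), the proposed interface might look like:

    # Hypothetical interface only -- sketches the updateFileHandles suggestion;
    # it is not an existing synapseclient/synapseutils function.
    from typing import List

    def updateFileHandles(syn, fileHandleIds: List[str], newStorageLocationId: int,
                          validateMd5: bool = True) -> List[dict]:
        """Proposed: re-point each file handle at newStorageLocationId without
        changing its ID or creating a new file handle, optionally verifying
        that a matching file (by MD5) exists in the new location first."""
        raise NotImplementedError("proposal sketch only")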

Questions

Issues we need to resolve that may guide implementation or refine use cases.

Who is authorized to initiate transfers? Who can move a particular file to another storage location, or modify a file handle?

An individual user should not be able to modify file handles that they do not own. Migrations that involve file handles owned by multiple users should only be performed by an admin after determining that the migration is appropriate.

Users should be permitted to update their own file handles in cases where Synapse does not manage the storage.

Which file handles in a project should be moved? 

This JSON file outlines all of the objects that can be tied to a file handle

...

We must consider the implications of moving all file handles belonging to entities structured under a project. By moving all files referenced by a project to an S3 bucket that isn't managed by Sage, the S3 bucket owner will be able to delete files referenced in Synapse. If those files are owned by a different Synapse user and used in other places in Synapse, they could be deleted or modified even though the user owning the file handle assumed they were safe in Synapse storage. For this reason, we should initially only consider moving files between buckets managed by Sage until we are sure there are no risks like this.

How do we handle race conditions?

Download

As long as we delay the deletion of the files we plan to remove, there should be no issue. Depending on how S3 handles deletion, an existing download may complete even if the file is deleted from S3 before the download finishes. We should confirm how S3 behaves here to determine our deletion policy.

Edit Entity/Upload

When migrating to a managed S3 bucket, the project storage location should be updated to the new bucket before migration. All new files would then be uploaded to the new bucket, and should be excluded from the migration.
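
A sketch of how that redirection could be done with existing project settings, via the Python client's generic REST helpers. The concrete types shown are the ones documented for custom storage locations; whether an internally managed bucket would use the same setting type is an assumption to verify.

    # Sketch: register the new bucket as a storage location and make it the
    # project's upload destination before migration begins, so new uploads
    # land there and are excluded. Payload shapes are assumptions to verify.
    import json
    import synapseclient

    syn = synapseclient.Synapse()
    syn.login()

    storage_location = syn.restPOST("/storageLocation", body=json.dumps({
        "concreteType": "org.sagebionetworks.repo.model.project.ExternalS3StorageLocationSetting",
        "bucket": "org.sagebase.synapse.cost.ampad",  # hypothetical bucket name
        "uploadType": "S3",
    }))

    syn.restPOST("/projectSettings", body=json.dumps({
        "concreteType": "org.sagebionetworks.repo.model.project.UploadDestinationListSetting",
        "settingsType": "upload",
        "projectId": "syn123",  # placeholder project
        "locations": [storage_location["storageLocationId"]],
    }))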

Are there other race conditions not covered here?

...

At a minimum, we should keep logs of the work we plan to do, and the work we have done in each migration.