...
...
...
...
Related Jira: PLFM-5227 (System JIRA)
There has been a demonstrated need to move content in Synapse to different S3 buckets, for various reasons. This document will contain the use cases and design plan.
Background
Several related Jira tickets provide the background for this work; one of them in particular motivates the prioritization of these use cases.
Revised Use Cases and Possible Implementation Notes
Use Case 1: Moving projects for cost/billing purposes
A handful of Synapse projects store so much data (10+ TB, up to ~150 TB) that their growing costs are a concern. It would be useful to have a way to measure the impact of each project on our S3 bill.
Use Case 1a: Determine the individual storage and egress usage of projects that store massive amounts (10+ TB) of data for the purpose of itemizing costs by project
This is the primary use case. Currently, all of the large projects in Synapse are internally managed. We need a way to approximate the storage/egress costs of each of these projects.
Use Case 1b: Service to move files into externally managed storage so we no longer have to pay costs
Considering this use case could drive major implementation decisions. We do not currently wish to move any data out of Sage-managed storage, but it is foreseeable that an external third party could consume an excessive amount of Synapse storage, in which case we would want to offload their data and have them assume the storage and egress costs in their own S3 bucket.
Given needs and relative urgency/priority, use cases 2 and 3 below are unrelated and out of scope. Their requirements will not be considered while designing this service; we may approach them in a separate task in the future.
Implementation
AWS Cost Allocation Tags and Cost Allocations in Synapse
AWS has a feature called cost allocation tags that we could leverage to separate billing. Cost allocation tags can only be applied at the bucket level.
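As a sketch of what bucket provisioning with a cost allocation tag might look like: the tag key "synapse-cost-allocation" and the bucket name below are hypothetical naming choices, not part of the design above.

```python
# Build the S3 Tagging payload for a new cost-allocation bucket.
# The tag key "synapse-cost-allocation" is an illustrative assumption.
def cost_allocation_tagging(allocation_name: str) -> dict:
    return {"TagSet": [{"Key": "synapse-cost-allocation",
                        "Value": allocation_name}]}

# The actual provisioning call could then be (not run here):
# import boto3
# s3 = boto3.client("s3")
# s3.create_bucket(Bucket="synapse-costalloc-ampad")
# s3.put_bucket_tagging(Bucket="synapse-costalloc-ampad",
#                       Tagging=cost_allocation_tagging("ampad"))
```

Note that a user-defined tag must also be activated as a cost allocation tag in the AWS Billing console before it shows up in Cost Explorer and the bill.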
We can create a new Cost Allocation service in Synapse that can be used by members of a "Synapse Cost Allocation Team", a bootstrapped team with users that have the authority to manage these cost allocations.
The members of the Cost Allocation Team can create new cost allocations with a descriptive name (e.g. ampad, nih-r01-12345) that matches how costs should be broken down in Synapse. When a new cost allocation is created, a new bucket is provisioned with an AWS Cost Allocation Tag.
After a cost allocation is created, a project can be assigned to it. Cost allocations can have many projects, but a project can be assigned to at most one cost allocation. When a project is assigned to a cost allocation, the underlying files associated with that project will eventually be moved to the new S3 bucket, and the file handles will be updated. After all files have been moved, the AWS bill and cost explorer tools can facet costs by the cost allocations, as desired.
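The assignment rules above (a cost allocation holds many projects; a project belongs to at most one cost allocation) can be sketched as an in-memory registry. The class and method names are illustrative, not the actual Synapse implementation.

```python
# Minimal sketch of the project-to-cost-allocation assignment constraint.
class CostAllocationRegistry:
    def __init__(self):
        self._allocation_by_project = {}  # PROJECT_ID -> COST_ALLOCATION_ID

    def assign(self, project_id: str, allocation_id: str) -> None:
        """A project may be assigned to at most one cost allocation."""
        existing = self._allocation_by_project.get(project_id)
        if existing is not None and existing != allocation_id:
            raise ValueError(
                f"{project_id} is already assigned to cost allocation {existing}")
        self._allocation_by_project[project_id] = allocation_id

    def projects_in(self, allocation_id: str) -> list:
        """A cost allocation can have many projects."""
        return [p for p, a in self._allocation_by_project.items()
                if a == allocation_id]
```

In the real service this constraint would be enforced by the association table's primary key rather than application code.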
API
...
None
...
Gets the cost allocation on an entity, if there is one.
This information may also be added to the entity bundle.
...
Path parameters
- name (String): the name of the cost allocation
...
AsyncJobId
...
Removes the cost allocation tied to a project. The contents of the project that are in the cost allocation storage location will be moved to the default Synapse storage.
After all of the contents have been moved, the project is removed from the cost allocation.
...
CostAllocationAsyncJobStatus (extends AsynchronousJobStatus)
...
Requirements and Design Strategies
- Retrieval of file handle metadata, particularly project association and file size
  - File Replication Table
    - Columns from File Table: ID, ETAG, PREVIEW_ID, CREATED_ON, CREATED_BY, METADATA_TYPE, CONTENT_TYPE, CONTENT_SIZE, CONTENT_MD5, BUCKET_NAME, NAME, KEY, STORAGE_LOCATION_ID, ENDPOINT
    - Primary key: ID
    - Creation of this table allows retrieval of file metadata by joining it with the entity replication table. This allows us to find all of the file handles and their metadata in one database call. Without this table, we must query the tables database to find the entities in a project, and then separately query the repo database to retrieve the metadata of those files.
- A way to enumerate cost allocations
  - Cost allocation table: ID, NAME, CREATED_BY, CREATED_ON
- A way to associate cost allocations and projects
  - Cost Allocation association table
    - Columns: COST_ALLOCATION_ID, PROJECT_ID
    - Primary key: PROJECT_ID (a project can belong to at most one cost allocation)
- Overriding uploads to default Synapse storage to redirect to the cost allocation bucket.
- An eventually consistent workflow for moving underlying files to a new bucket, and then modifying the file handle to reflect the move
  - An asynchronous worker scans the file replication table for files that are in a project with a cost allocation AND are in Synapse storage AND are not in the correct cost allocation bucket
    - This worker creates bundles of files that should be updated and sends these bundles to another asynchronous worker
  - A second asynchronous worker finds each file in the bundle in the files table and verifies that it is not already in the correct cost allocation bucket
    - The worker copies the file to the new S3 bucket and updates the file handle, archiving or deleting the old underlying file
    - The actual copying may be fairly trivial with AWS Batch Operations for S3 Buckets and Copy Objects Between S3 Buckets With Lambda
    - We still must somehow mark the old file for deletion, perhaps by moving it to low-cost storage with a lifecycle deletion policy
    - If we can track the batch operations, we can modify file handles one-by-one; otherwise we may have to find another solution, or wait for the batch operation to complete
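The first worker's scan-and-bundle step can be sketched as follows. The default bucket name and row fields are assumptions for illustration; in the real worker this filter would be a query against the file replication table rather than a Python list comprehension.

```python
# Sketch of the scanning worker: given the file replication rows for one
# cost-allocated project, select files still in default Synapse storage
# that are not yet in the target cost allocation bucket, then bundle them
# for the second (mover) worker.
SYNAPSE_DEFAULT_BUCKET = "proddata.sagebase.org"  # assumed default bucket name

def files_to_move(file_rows, allocation_bucket):
    """file_rows: dicts with at least 'id' and 'bucket', for one project."""
    return [r for r in file_rows
            if r["bucket"] == SYNAPSE_DEFAULT_BUCKET   # still in Synapse storage
            and r["bucket"] != allocation_bucket]      # not yet in target bucket

def bundle(rows, size):
    """Split candidate files into fixed-size bundles for the second worker."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]
```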
Flowchart
This gives a general overview of the process required to apply a cost allocation to a project.
Questions/Concerns
- AWS has a 100-bucket soft cap. They can increase the cap to 200 if they approve of your use case. Do we anticipate needing to store more than 100-200 internally managed projects that are 10+TB or larger? This may be a distant/unlikely problem. There are also many buckets that we can probably remove from our prod account (we currently have ~65)
Component 1 - Create a File Replication Table
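The point of the File Replication Table is that project-level metadata retrieval becomes a single join with the entity replication table. An in-memory sketch of that lookup (column names follow the design above; this is not the actual Synapse DAO):

```python
# Sketch: the join the File Replication Table enables. Entity replication
# maps entities to their project and file handle; file replication carries
# the file handle metadata (bucket, key, size, ...).
def project_file_metadata(entity_rows, file_rows, project_id):
    """entity_rows: dicts with ENTITY_ID, PROJECT_ID, FILE_HANDLE_ID.
    file_rows: dicts with ID, BUCKET_NAME, KEY, CONTENT_SIZE, ..."""
    by_id = {f["ID"]: f for f in file_rows}  # index on the primary key
    return [by_id[e["FILE_HANDLE_ID"]]
            for e in entity_rows
            if e["PROJECT_ID"] == project_id and e["FILE_HANDLE_ID"] in by_id]
```

In SQL terms this is one query joining the two replication tables on FILE_HANDLE_ID, replacing the current two-database round trip.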
Component 2 - Implement Cost Allocation System
Once file handles are associated with their parent projects, we can build the cost allocation tool described in the API above to move those file handles that belong to a particular project and are stored in Synapse default storage.
We can do this by
- Bootstrapping a Synapse Cost Allocation Team
- Building a way to provision S3 buckets (underneath the abstraction of creating a cost allocation)
- Finding all of the file handles that need to be moved
- Transferring/copying the files to the new bucket
- Modifying file handles to point to the new file location in the new bucket
- Archiving/marking for deletion the old copy of the file
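The last three steps above (transfer, repoint, archive) can be sketched per file. The field names are assumptions; the S3 copy is shown as a comment since this is bookkeeping logic only, not the real mover worker.

```python
# Sketch of the per-file move: copy the object, repoint the file handle
# (keeping its ID), and record a tombstone for the old copy so it can be
# archived and eventually deleted (e.g. via an S3 lifecycle policy).
def move_file_handle(handle, target_bucket):
    """handle: dict with 'id', 'bucketName', 'key'. Returns (updated, tombstone)."""
    old = {"bucket": handle["bucketName"], "key": handle["key"]}
    # s3.copy_object(CopySource=f"{old['bucket']}/{old['key']}",
    #                Bucket=target_bucket, Key=handle["key"])
    updated = dict(handle, bucketName=target_bucket)   # file handle ID unchanged
    tombstone = {**old, "action": "archive-then-delete"}
    return updated, tombstone
```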
...
Note: A design document for building a solution for use case 1 can be found here:
Use Cases
Use Case 1: Move projects between Synapse-managed S3 buckets
A handful of internally managed Synapse projects store so much data (10s-100s of TB) that there are benefits (namely, itemized S3 billing) to placing each project in its own storage location. It would be useful to be able to move all of the content in one of these projects to its own bucket. Additionally, we can encrypt this data as we move it into this bucket, dramatically reducing the amount of data we need to encrypt in the main bucket as part of the encryption effort tracked in a related Jira ticket.
This is a high-priority use case and is driving this proposal. This issue "blocks" tickets related to certification remediation (namely, storing PHI on AWS). Those tickets do not necessarily need this service (we can come up with workarounds), but this could simplify the work that needs to be done later.
Requirements
- Project-level resolution is necessary (i.e. we must be able to get all of the file handles that are "owned" by a project)
- Synapse must transfer the underlying S3 files to the new bucket
- It is necessary that the original files (and their previews) are eventually deleted from Synapse storage, since we are trying to shift the costs in the main S3 bucket to other S3 buckets.
- All file versions must be moved
- It is not acceptable to create new file handle IDs if the actual file does not change.
- This breaks provenance records
- This invalidates users' local caches (it could trigger users to re-download TBs of unchanged files). See Common Client Command set and Cache ("C4").
Use Case 2: Moving content that is in non-S3 storage to another non-S3 location
Users have stored files on an SFTP server that is expected to be decommissioned, so they must move them to another storage location (an external object store). The data is not managed by Synapse, so users must transfer the files manually. Because the content of the files has not changed, but the location has, users would like to update all file handles tied to the old storage location (or an enumerated list of particular file handles, if only a subset of the data is being moved) to point to the new file location.
Priority is not currently high, but this would be very useful for some users (and would save the engineering team time, since on at least one occasion this has been done manually on the database). It would save a lot of time down the road, and may be fairly trivial to implement given the similarity of its requirements to those of the other use cases.
Requirements
- Synapse does not manage any underlying file transfer here.
- Users must have a way to identify all of the file handles they can/should update.
- What are all of my file handles on storage location X?
- How can I move all file handles on storage location X, even if I am not the owner of these file handles?
- File handle IDs should not be updated, for the same reasons outlined in Use Case 1.
- We should consider validating that a matching file is in the new location before updating the file handle (e.g. via md5? might need to be credentialed as well)
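The md5 check suggested above can be sketched as follows. Fetching the bytes from the (possibly credentialed) external store is out of scope here; the sketch assumes the new location's content is available, and that the file handle carries a contentMd5 hex digest as Synapse file handles do.

```python
import hashlib

# Sketch: before repointing a file handle at a new storage location,
# confirm that the file found there has the MD5 recorded on the handle.
def matches_file_handle(handle, new_location_bytes):
    """handle: dict carrying the original 'contentMd5' hex digest."""
    return hashlib.md5(new_location_bytes).hexdigest() == handle["contentMd5"]
```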
Use Case 3: Moving content that is in SFTP storage to a managed S3-bucket
Synapse is deprecating SFTP storage. To simplify migration (and accelerate deprecation), we can give SFTP users the option to migrate their data into Synapse storage.
Requirements
- Synapse must be able to transfer files from the SFTP server (authentication).
- We should consider the implications of cases where Synapse cannot access the SFTP server
- This might need to be done through a client that can connect and authenticate to the SFTP server and then upload the files into Synapse.
- File handles would probably be selected based on storage location.
- We would not delete the old files, because these should be managed by the owner of the SFTP server.
High-level service proposal that would cover use cases, if feasible
Based on the use cases above, we would probably need to
1. Create a way to find all file handles that:
   - Are "owned" by a project
   - Belong to a particular storage location (or, more likely, do not belong to the destination storage location)
   - Have not already been processed/migrated
2. In cases where the destination is Synapse-managed, sync the content from the original storage location to a specified S3 bucket.
   - Log this sync and its final status, timestamp, etc.
3. Update a file handle to point to a new location. Atomically, this includes:
   - Verifying that the origin and destination files match (e.g. a checksum provided by S3, or one we can calculate if Synapse has access to both files).
   - If the files match, updating the file handle (excluding the ID and eTag).
   - Logging the file handle change.
4. In cases where the origin is Synapse-managed, mark the origin file for eventual deletion.
5. Create a log of all file transfers in case we later need to review the history of what was done.
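The update-a-file-handle capability, with its verification and logging, can be sketched as one operation. Field names and the shape of the log entry are assumptions for illustration.

```python
import time

# Sketch: atomically update a file handle to a new location, preserving its
# ID, and append an audit record. In the real service this would be a single
# transactional update rather than in-place dict mutation.
def update_file_handle(handle, new_bucket, new_key, checksums_match, log):
    """checksums_match: result of verifying origin and destination files."""
    if not checksums_match:
        raise ValueError("origin and destination files do not match; aborting")
    before = dict(handle)
    handle["bucketName"], handle["key"] = new_bucket, new_key  # ID untouched
    log.append({"fileHandleId": handle["id"],
                "from": before["bucketName"],
                "to": new_bucket,
                "timestamp": time.time()})
    return handle
```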
Use Case 1 could probably be one operation called by a project owner or bucket owner. This operation could handle (1-4).
...
Use Case 3 could involve a user calling (1) to get a list of file handles and (3) to update the file handles. (One suggestion was to have an interface similar to the Python client's copyFileHandles method; perhaps an updateFileHandles method would make sense here from a user perspective.)
Questions
Issues we need to resolve that may guide implementation or refine use cases.
Who is authorized to initiate transfers? Who can move a particular file to another storage location, or modify a file handle?
An individual user should not be able to modify file handles that they do not own. Migrations that involve file handles owned by multiple users should only be performed by an administrator after determining that the migration is appropriate.
Users should be permitted to update their own file handles in cases where Synapse does not manage the storage.
Which file handles in a project should be moved?
This JSON file outlines all of the objects that can be tied to a file handle
...
We must consider the implications of moving all file handles belonging to entities under a project. If we move all files referenced by a project to an S3 bucket that is not managed by Sage, the bucket owner will be able to delete files referenced in Synapse. If those files are owned by a different Synapse user and used in other places in Synapse, they could be deleted or modified even though the user owning the file handles assumed they were safe in Synapse storage. For this reason, we should initially only consider moving files between buckets managed by Sage until we are sure there are no risks of this kind.
How do we handle race conditions?
Download
As long as we delay the deletion of files we plan to delete, there should be no issue. Depending on how S3 handles deletion, an in-progress download may complete even if the file is deleted from S3 before the download finishes. We should confirm S3's behavior before settling on a deletion policy.
Edit Entity/Upload
When migrating to a managed S3 bucket, the project storage location should be updated to the new bucket before migration. All new files would then be uploaded to the new bucket, and should be excluded from the migration.
Are there other race conditions not considered here?
...
this
...
At a minimum, we should keep logs of the work we plan to do, and the work we have done in each migration.