Data Migration
JIRA
- PLFM-6282
Related:
- PLFM-5620
- PLFM-6227
- PLFM-5756: We should do this in any case, as we are paying unnecessary NAT costs
- PLFM-6165
Applicable Use Cases
- Migrate existing data to the newly introduced STS storage locations, so that compute can be attached directly to the S3 location of existing data indexed in Synapse
- A general need to migrate existing data between buckets (e.g. the STRIDES program) without creating new entities or updating their version
Overview
Synapse allows users to upload data to its own S3 storage, where the data is organized around a user-centric partitioning (e.g. each object in the default S3 storage is prefixed with the id of the user uploading the data). Additionally, external buckets (both S3 and Google Cloud, GC) can be linked to a project or a folder so that data can be uploaded to a user-owned bucket. Other options exist to index external data that Synapse does not control.
In the past there have been instances where users decided to move data from the Synapse S3 storage to an external location. Today this is handled as a one-off task by a backend engineer, since we do not provide an automatic way to migrate existing data to a new location.
STS Storage Locations
With the addition of STS (See AWS Security Token Service) storage locations, users can now set up a special storage location that points either to the Synapse bucket using a common (generated) prefix, or to an external storage location where the user allows access to the bucket (with an optional common prefix), so that compute can be attached directly to S3. Using the AWS STS service, Synapse can generate temporary AWS credentials that include an inline session policy limiting the (read or read-write) access to the specified bucket and prefix of the storage location.
These special storage locations can then be associated with a project or folder as the default upload destination, so any file uploaded through Synapse ends up in the bucket under a common prefix, organizing the data in a partitioned S3 space (either the entire bucket or a common prefix). The need for a common prefix is due to a technical limitation of the policies that can be attached to the STS request: the inline policy is limited to a total of 2048 characters (excluding spaces), so having a prefix makes it possible to provide a grant with a short policy.
Setting a new storage location on a project or folder affects only future uploads, therefore existing data might not reside under the object prefix of the new storage location. Note however that for an existing external storage location one could simply create a new STS storage location in the same bucket, associate it with a new folder and automatically gain access through the Synapse STS service. In the first implementation iteration a limitation was put in place: an STS storage location cannot be associated with a non-empty project/folder.
STRIDES
TODO: Gather more information about this, e.g. amount of files, number of projects, number of storage locations, types of storage locations, number of buckets, max length of object keys, max file size etc.
- 500TB of data
- Spread across multiple projects and storage locations
- Storage cost of ~11K a month
Technical Considerations
Providing a data migration service requires taking into account several technical aspects, of both the Synapse platform and the components Synapse depends on, that today prevent doing this through its API. Note that we do not consider Google Cloud at this time.
Synapse Technical Considerations
- A file handle in Synapse is an abstraction over the physical file location. File handles are immutable and reusable by the same user (e.g. the same file handle can be associated with different objects)
- File handles do not have a permissions model (e.g. they do not use ACLs as entities do) and, due to the 1 to N relationship with different objects, Synapse allows their deletion only by the user that created them
- Due to the permission model, deletion of a file handle can be performed only by the creator (or uploader) of the file handle itself (See https://rest-docs.synapse.org/rest/DELETE/fileHandle/handleId.html). This can lead to an actual data deletion in S3 or GC iff the bucket and key used by the file handle are unique among file handles (See https://github.com/Sage-Bionetworks/Synapse-Repository-Services/blob/2bb0ca3a77c89b4c3182a8f515f015389cebc508/services/repository-managers/src/main/java/org/sagebionetworks/repo/manager/file/FileHandleManagerImpl.java#L329).
- Like file handles, storage locations are an abstraction over the places where data can be stored; they are immutable, reusable by the user, and belong to the user that created them
- When uploading a file through Synapse a key is automatically generated using the following scheme: {baseKey/}{userId}/{randomUUID}/filename. The base key might not be present, in which case the common prefix is the id of the user uploading the files
- For external storage locations where the user links the bucket with Synapse (e.g. ExternalS3StorageLocationSetting), data can either be uploaded through Synapse, where Synapse generates the key following the scheme above, or uploaded externally and linked to Synapse using the dedicated APIs for creating external file handles (e.g. https://rest-docs.synapse.org/rest/POST/externalFileHandle/s3.html)
- When a storage location is associated with a folder or a project only future uploads are affected, making the distribution of keys and buckets for a project or folder unpredictable
- When updating an entity with a new file handle id, a new version is automatically created. Additionally, previous versions of entities cannot be updated
- A user can technically upload a file handle to any storage location and use it in any entity, regardless of the storage location set on the folder/project (the exception is STS projects/folders, where there is a check in place on the id of the storage location)
- An STS storage location cannot be created in a non-empty project/folder
- An STS storage location can be created in the Synapse S3 bucket, but a prefix (called base key) is automatically generated by the backend (See S3StorageLocationSetting). If STS is not enabled, the user CANNOT set a base key for this type of storage location
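The key scheme described above can be illustrated with a minimal sketch. The class and method names are hypothetical and are not taken from the Synapse code base; the sketch only assumes that the optional base key is prepended with a trailing slash when present.

```java
import java.util.UUID;

/** Hypothetical helper mirroring the {baseKey/}{userId}/{randomUUID}/filename scheme. */
final class UploadKeySketch {

    /** Builds an object key; when baseKey is null or empty the key starts with the user id. */
    static String buildKey(String baseKey, long userId, UUID uuid, String fileName) {
        StringBuilder key = new StringBuilder();
        if (baseKey != null && !baseKey.isEmpty()) {
            key.append(baseKey).append('/');
        }
        return key.append(userId).append('/')
                  .append(uuid).append('/')
                  .append(fileName)
                  .toString();
    }

    public static void main(String[] args) {
        UUID uuid = UUID.randomUUID();
        // Default storage: the common prefix is the uploader id
        System.out.println(buildKey(null, 12345L, uuid, "data.csv"));
        // Storage location with a base key: the base key becomes the common prefix
        System.out.println(buildKey("generated-prefix", 12345L, uuid, "data.csv"));
    }
}
```

Since the UUID is random per upload, two uploads of the same file by the same user end up under different keys, which is part of why the key distribution within a project is hard to predict.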
S3 Considerations
- S3 does not provide a move operation; a move is achieved with a copy followed by a delete request
- Copying an object in S3 is the same operation as an upload (a PUT request), with a special header that specifies the source to copy from
- Copying objects does not incur data transfer costs for same-region operations (See https://aws.amazon.com/s3/pricing); the costs are for the requests themselves (e.g. COPY, 0.005 per 1000 requests)
- Copying an object in a single operation is limited to 5GB per file; larger files require a multipart upload where each part is copied from the source key with a specified byte range
- It is possible to generate pre-signed URLs for copy requests, including the multipart copy-part requests (See the details in the attached test file). This is not a documented “feature” of S3, but it is not a “hack” either; rather it is a consequence of how pre-signed URLs work in AWS
- S3 provides a few ways to copy a large number of files, for example the batch operation service (https://docs.aws.amazon.com/AmazonS3/latest/dev/batch-ops-basics.html) or the more expensive and complex setup using EMR and S3DistCP (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html). These options are suited for batch operations on a set of objects residing in the same bucket and in general have limitations and costs associated. For example, an S3 batch operation job can specify a single operation to apply to each key in the inventory provided for the job. If the job is a copy operation each file is limited to 5GB, since the job itself performs a PUT copy. As an alternative, a lambda can be invoked for each key to perform the actual copy (within the limits of the lambda itself)
- In general the fastest way to copy data between buckets is an S3 copy, no matter how it is invoked. AWS provides some automatic ways to do this, but when flexibility is needed for each key a more complex setup with dedicated infrastructure is required. Note that the key to speed is parallelization of the copies, both across different objects and, for a multipart copy, across the parts of a single object
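As an illustration of the multipart copy mentioned above, the following sketch splits an object into the byte ranges that each copy-part request would carry in its x-amz-copy-source-range header. The class name and the part size are assumptions made for the example, and the 5GB threshold is assumed to mean 5 * 1024^3 bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper computing the byte ranges for a multipart copy. */
final class CopyPartRanges {

    /** Assumed single-operation copy limit (5GB expressed as 5 * 1024^3 bytes). */
    static final long MAX_SINGLE_COPY_BYTES = 5L * 1024 * 1024 * 1024;

    /** True when the object is too large for a single PUT copy. */
    static boolean needsMultipartCopy(long objectSize) {
        return objectSize > MAX_SINGLE_COPY_BYTES;
    }

    /** Returns one inclusive range per part, formatted as "bytes=first-last". */
    static List<String> ranges(long objectSize, long partSize) {
        if (objectSize <= 0 || partSize <= 0) {
            throw new IllegalArgumentException("Sizes must be positive");
        }
        List<String> ranges = new ArrayList<>();
        for (long first = 0; first < objectSize; first += partSize) {
            // The last part may be shorter than partSize
            long last = Math.min(first + partSize, objectSize) - 1;
            ranges.add("bytes=" + first + "-" + last);
        }
        return ranges;
    }

    public static void main(String[] args) {
        long size = 12L * 1024 * 1024 * 1024;     // a 12 GiB object
        long partSize = 1024L * 1024 * 1024;      // 1 GiB parts, chosen arbitrarily
        System.out.println(needsMultipartCopy(size));
        ranges(size, partSize).forEach(System.out::println);
    }
}
```

Each range is independent of the others, so the parts can be copied in parallel, which is where most of the speed-up for large objects would come from.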
Potential Solutions
The basic idea is to provide a way to migrate existing data to a new storage location while making the process transparent to the user consuming the data and avoiding breaking changes to the entities.
Ideally we would simply move the data and update the path in the file handles, making the move invisible to the end user. Unfortunately this is not possible, as the system relies on the immutability of the file handle itself and the permission model does not fit this scenario (e.g. the creator owns the file handle). This would be possible only at the file handle level, if the creator were the one asking for the operation.
Instead we can treat this as an update to the file handle pointer in the entity itself.
There are two main potential paths identified to approach the problem:
- Client driven migration
- Backend driven migration
Client Driven Migration
The idea behind this is to allow clients to copy the data they have access to, giving them the flexibility to set up their own infrastructure and perform the migration themselves.
This path can be enabled as follows:
1. Allow an entity update to change the file handle id without automatically increasing the version (e.g. a soft MD5 check could be performed)
2. Allow updating the file handle id of a specific version of an entity
3. Remove the STS limitation that prevents associating an STS storage location with a non-empty folder. For storage in the Synapse bucket this should not be an issue, since a prefix is automatically assigned and the user cannot change it. For external storage locations, maintaining consistency would in any case be the responsibility of the user
4. Similar to the current https://rest-docs.synapse.org/rest/POST/externalFileHandle/s3.html API, add a service to create file handles in an S3StorageLocation where STS is enabled (to allow linking data that is stored IN the Synapse S3 bucket but in an STS enabled storage location). There does not seem to be a clear use case for an STS storage location in the Synapse S3 bucket at the moment (See PLFM-6227). This feature would be useful if a user cannot set up their own external S3 bucket
5. Provide a service to copy an existing S3 key to a new file handle:
   a. Extend the current STS service to provide credentials that allow the direct S3 copy of a specific version of the entity to a given (S3) storage location, if the file handle resides in the Synapse S3 bucket. The destination storage location in this case should be owned by the user and be either a Synapse STS enabled storage location or a connected S3 storage location. Checks on the bucket regions should be in place to avoid transfer costs.
   b. As an alternative, which might be more flexible and better integrated with how Synapse works, we could extend the current multipart upload API to support a multipart upload that copies from an existing file handle. This would be very similar to how S3 itself implements the multipart copy, where the normal multipart upload is used but the source bucket and key are specified in the headers of the various requests (See Copying objects using the multipart upload API - Amazon Simple Storage Service). The pre-signed URLs for the parts would be specially crafted around the copy-part requests, so no upload is involved. When the upload completes a new file handle is created, and we can track the source file handle in the file handle itself. The drawback (which might be seen as an advantage) of this solution is that the target bucket and keys are generated by Synapse (e.g. the file handle creation is handled by Synapse; note however that for external buckets we might want to allow the definition of the target key). The special pre-signed URL can be generated including additional custom headers (e.g. x-amz-copy-source and x-amz-copy-source-range) in the signature; the client would need to include those headers for the request to be authorized by AWS. Advantages of this solution over the STS solution include:
      - Fewer limitations imposed on the source file handles (See note below)
      - Easier auditing of the specific operation
      - Fewer security concerns (See note below)
      - Automatic creation of the file handles, which also ensures that the client creates a trackable file handle for the copied data
      - Easier integration with the clients
Note about point 5.a: when creating temporary credentials an inline session policy can be attached to limit the access to the resources, in particular for a copy operation between two S3 buckets the permissions needed are at least s3:GetObject and s3:ListBucket on the source and s3:PutObject and s3:ListBucket on the destination (See https://aws.amazon.com/premiumsupport/knowledge-center/s3-troubleshoot-copy-between-buckets/). Additionally multipart related operations should be enabled on the destination bucket.
An example of a policy that allows the copy is as follows (some boilerplate removed):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SourceFileAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${sourceBucket}/12353635534532/79a86795-f0d8-45b4-b3e1-6a6850b8b0a0/some-file"
      ]
    },
    {
      "Sid": "TargetWriteAccess",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${targetBucket}",
        "arn:aws:s3:::${targetBucket}/*"
      ]
    }
  ]
}
There are some limitations to this approach; in particular the policy that can be attached is limited to 2048 characters (excluding spaces). Additionally, the real limit is computed by AWS on a compressed binary version of the policy, making the maximum allowed size difficult to predict.
We do not set a limit on the length of the key used by an S3 object (e.g. the file name). The limits imposed by AWS are 1024 characters for the key and 63 for the bucket name, which already seems to exceed the policy limit when all the boilerplate is included (tested empirically on a randomly generated key). Therefore we run the risk of not being able to create such a token at all.
For keys created automatically by Synapse this is not an issue, since each file has a unique prefix:
{userId}/{UUID}/file_name
We can effectively remove the file_name from the policy and use {userId}/{UUID}/* as the prefix in the policy. This can be a problem for externally created file handles, where the user decides the key name. On the other hand, if the bucket is owned by the user there is no need for Synapse to provide this service, as they can already perform the copy themselves; this service would therefore be limited to files stored in the Synapse S3 bucket.
Note that if the data is stored in the Synapse S3 bucket but the storage location is STS enabled, and we provide a service (Point 4) to create a file handle that points to this location with a user-defined key, the STS token allowing the copy will potentially work only at the prefix level. E.g. we can extend the current STS API with a “copy” type of permission that vends a token giving read access to the storage location prefix and write access to the destination bucket, since we would not be able to predict the length of the keys chosen by the user. This could be a generic API for copying data from an STS storage location. We could still provide the same service at the file level, with the caveat that the service might fail if the key is too long for the policy.
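The effect of replacing the file name with a prefix wildcard can be sketched as follows. This is an illustrative helper with hypothetical names: it builds a compact version of the policy shown earlier and counts its non-whitespace characters against the documented 2048 limit. It does not model the packed binary size that AWS actually enforces, and it assumes that the file name contains no slash and that keys need no JSON escaping.

```java
/** Hypothetical helper that builds the inline session policy for a copy and measures it. */
final class CopyPolicySketch {

    /** Documented character limit (spaces excluded); the packed binary limit may be stricter. */
    static final int POLICY_CHAR_LIMIT = 2048;

    /** Turns {userId}/{UUID}/file_name into {userId}/{UUID}/* (assumes no '/' in the file name). */
    static String uniquePrefix(String key) {
        int lastSlash = key.lastIndexOf('/');
        if (lastSlash < 0) {
            throw new IllegalArgumentException("Key has no prefix: " + key);
        }
        return key.substring(0, lastSlash + 1) + "*";
    }

    /** Builds the policy; sourceResource is a full key or a prefix ending in '*'. No JSON escaping is done. */
    static String policy(String sourceBucket, String sourceResource, String targetBucket) {
        return "{\"Version\":\"2012-10-17\",\"Statement\":["
            + "{\"Sid\":\"SourceFileAccess\",\"Effect\":\"Allow\","
            + "\"Action\":[\"s3:GetObject\",\"s3:ListBucket\"],"
            + "\"Resource\":[\"arn:aws:s3:::" + sourceBucket + "/" + sourceResource + "\"]},"
            + "{\"Sid\":\"TargetWriteAccess\",\"Effect\":\"Allow\","
            + "\"Action\":[\"s3:PutObject\",\"s3:ListBucket\"],"
            + "\"Resource\":[\"arn:aws:s3:::" + targetBucket + "\",\"arn:aws:s3:::" + targetBucket + "/*\"]}"
            + "]}";
    }

    /** Counts the characters of the policy, ignoring whitespace. */
    static int lengthWithoutWhitespace(String policy) {
        return (int) policy.chars().filter(c -> !Character.isWhitespace(c)).count();
    }

    /** Checks the documented limit only; a policy that passes may still be rejected by AWS. */
    static boolean fitsDocumentedLimit(String policy) {
        return lengthWithoutWhitespace(policy) <= POLICY_CHAR_LIMIT;
    }

    public static void main(String[] args) {
        String key = "12353635534532/79a86795-f0d8-45b4-b3e1-6a6850b8b0a0/" + "a".repeat(900);
        String byKey = policy("source-bucket", key, "target-bucket");
        String byPrefix = policy("source-bucket", uniquePrefix(key), "target-bucket");
        System.out.println("full key: " + lengthWithoutWhitespace(byKey) + " characters");
        System.out.println("prefix:   " + lengthWithoutWhitespace(byPrefix) + " characters");
    }
}
```

With the prefix form the policy size no longer depends on the file name, so it stays constant for every key generated by Synapse, while a user-defined key of arbitrary length keeps the size unpredictable.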
Potential estimate for the synapse backend: 2~3 Weeks * (the unknown coefficient)
Backend Driven Migration
The idea would be to introduce a new service that does the job for the user: given an entity and an (optional) version, it migrates the data to a given storage location. Potentially this could be done on a whole hierarchy (e.g. a project or folder). Note however that in this case too we should only handle data that is stored in the Synapse S3 bucket, since we should avoid dedicating platform compute to data that is stored elsewhere.
Today’s Synapse workers infrastructure might not be suitable for this (even though we do perform a similar operation when we prepare a zipped download). Potentially we could have an asynchronous job that starts by creating a manifest of the data to be copied (with all the necessary checks on the source and destination) and pushes the operations to be performed to a queue; another set of (potentially non-Synapse) workers might take on the copy operation itself.
Synapse can update the file handles in the copied entities before completing the job. Since this happens inside the Synapse infrastructure it does not have to be implemented in the Synapse backend itself, other than the services required as glue (e.g. performing the file handle updates in the affected entities).
Ideally we would offload this to AWS “server-less” infrastructure as much as possible; for example a lambda could perform the copy operation itself so that the Synapse workers do not sit idle during the copy. We could expose the API to the client from Synapse itself (e.g. /entity/{id}/migrate), but the only thing that would happen is that a queue is signaled in order to offload the actual job. Note that a lambda has a run time limit of 15 minutes; additionally, a file bigger than 5GB needs to be copied with a multipart copy, where each lambda invocation might request the copy of a single part. An alternative to consider is AWS Batch (See https://docs.aws.amazon.com/batch/latest/userguide/Batch_GetStarted.html).
This approach has the advantage of reducing the potential security risks of vending STS tokens, since everything would happen inside the Synapse infrastructure where we already have access. On the other hand, the whole infrastructure needs to be built.
Potential estimate for the synapse backend: 2~3 Months * (the unknown coefficient)
Note: Even if we decide to add backend infrastructure to do the copy, points 1 - 4 proposed in the previous section might still be needed and potentially suffice for certain use cases, for example when the source and destination locations (not only S3) are controlled by the user and they just want to update the references to the file handles in the entities. For the STS use case point 3 seems to be unavoidable.
Storage Costs Consideration
Due to the nature of file handles, data deletion is impractical: the user that created a file handle can already invoke its deletion, which leads to an actual object deletion if the same bucket and key are not used by another file handle. The deletion is not possible if the handle is referenced directly by other objects (e.g. a file entity or a wiki; this does not apply if file handles are linked in table rows).
This is a general (known) issue to be investigated (See PLFM-6165) since we never delete data. We could build a system to index the file handle references and periodically clean up the unused ones (e.g. moving them to a low frequency access storage for a period X), so it can be treated as a separate issue.
Given the permission model, the migration might not include deleting the data (unless the deletion is invoked by the users themselves), meaning that we would still pay the storage costs.
A potential approach, if we wanted to get rid of the data as soon as possible, is a dedicated administrative task that deletes data given a manifest of file handles (e.g. generated during a migration process). This would need to go through some sort of approval and a check on the creators of the file handles, and would not be exposed to end users.
API Design and Backend Changes
From a discussion with the interested parties it seems that for now the client driven migration is the most flexible solution. Further changes might be introduced at a later time if the need arises.
In order to support the client driven data migration we propose the following changes to the backend:
| API | New | Description |
|---|---|---|
| | No | Current: The backend automatically updates the version of the entity if the file handle id provided in the body of the request differs from the file handle id associated with the (current) revision of the entity (ignoring the newVersion parameter value). Changes: |
| PUT /entity/{id}/version/{versionNumber}/filehandle | Yes | Current: We do not have an API to update the file handle of a specific entity version. Changes: Introduce the new API so that the file handle of a specific version of a FileEntity can be updated. In this case no “auto-versioning” is involved; the body of the request (FileHandleUpdateRequest) will need to include the current file handle id (to ensure a non-conflicting update) and the new file handle id. The MD5 of the two file handles should match for this operation to be successful. |
| | No | Current: When the UploadDestinationListSetting is set for a project or folder and the storage location is an STS enabled storage location, validation is performed on the target folder/project so that it must be empty. Changes: Remove the validation for the empty folder/project. |
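The checks described for the new PUT /entity/{id}/version/{versionNumber}/filehandle API could look roughly like the sketch below. The class names, the field names (oldFileHandleId, newFileHandleId) and the exception types are assumptions made for illustration; they are not the actual Synapse model objects.

```java
/** Hypothetical sketch of the checks behind updating the file handle of an entity version. */
final class FileHandleUpdateSketch {

    /** Minimal stand-in for a file handle: only the fields needed by the checks. */
    static final class FileHandle {
        final String id;
        final String contentMd5;

        FileHandle(String id, String contentMd5) {
            this.id = id;
            this.contentMd5 = contentMd5;
        }
    }

    /** Assumed request shape; the field names are illustrative only. */
    static final class FileHandleUpdateRequest {
        final String oldFileHandleId;
        final String newFileHandleId;

        FileHandleUpdateRequest(String oldFileHandleId, String newFileHandleId) {
            this.oldFileHandleId = oldFileHandleId;
            this.newFileHandleId = newFileHandleId;
        }
    }

    /** Returns the id to store on the entity version, or throws when the update must be rejected. */
    static String validate(FileHandle current, FileHandle replacement, FileHandleUpdateRequest request) {
        // Conflict check: the caller must know which file handle is currently on the version
        if (!current.id.equals(request.oldFileHandleId)) {
            throw new IllegalStateException("The version no longer points to file handle " + request.oldFileHandleId);
        }
        if (!replacement.id.equals(request.newFileHandleId)) {
            throw new IllegalArgumentException("Unexpected replacement file handle " + replacement.id);
        }
        // Content check: the data must be the same, only its location changes
        if (current.contentMd5 == null || !current.contentMd5.equalsIgnoreCase(replacement.contentMd5)) {
            throw new IllegalArgumentException("The MD5 of the two file handles does not match");
        }
        return replacement.id;
    }

    public static void main(String[] args) {
        FileHandle current = new FileHandle("101", "9b2cf535f27731c974343645a3985328");
        FileHandle copy = new FileHandle("202", "9b2cf535f27731c974343645a3985328");
        System.out.println(validate(current, copy, new FileHandleUpdateRequest("101", "202")));
    }
}
```

Requiring the current file handle id in the request works as an optimistic lock: two clients migrating the same version concurrently cannot silently overwrite each other.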
Multipart Copy
Point 5.b as the chosen alternative:
| API | New | Description |
|---|---|---|
| | No | The current API can be altered to accept a new type of request body MultipartUploadCopyRequest with the following property: Other properties in common with the current MultipartUploadRequest: An interface MultipartRequest will need to be introduced that is implemented by both the MultipartUploadCopyRequest and the MultipartUploadRequest. Additionally, to maintain backward compatibility the API needs to support a default implementation for the new interface, since we verified that the python client does not include the concreteType property when initiating a multipart upload request, meaning that the backend would not be able to discriminate between a MultipartUploadCopyRequest and a MultipartUploadRequest during de-serialization. A MultipartUploadCopyRequest would initiate a multipart copy request from the given source if the user is allowed to download the file. The rest of the multipart process will be the same, but the pre-signed URLs for the parts will be specially crafted for a copy request and the client would simply execute the PUT request as indicated by the backend, including the special copy headers, without any data upload (See API change below). Completing the copy upload would yield a new file handle that is copied from the source; a new property in the file handle stores the id of the original file handle (we could store the copy reference in a companion table or in the file handle itself as a nullable column). |
| | No | Currently returns a BatchPresignedUploadUrlResponse containing a list of PartPresignedUrl, each containing the part number and the pre-signed URL. We would need to extend this in the copy case to provide the correct headers that the client needs to send in the PUT request, e.g. a map of <String,String>. |
An example in Java using the standard S3 SDK that uses pre-signed URLs to perform the copy (both direct and through multipart), tested with a source bucket in dev and a target bucket in a different account (same region) configured to allow dev read/write access:
In summary this can be achieved as follows:
1. Start the multipart upload as usual on the target bucket and key:
   // The canned ACL gives the owner of the target bucket full control over the copied object
   InitiateMultipartUploadResult multiPart = s3Client.initiateMultipartUpload(
       new InitiateMultipartUploadRequest(targetBucket, objectKey)
           .withCannedACL(CannedAccessControlList.BucketOwnerFullControl)
   );
2. For each part generate a pre-signed URL for the generated upload id and part number, for a PUT request that includes in the signature the x-amz-copy-source header pointing to the source bucket/key and the x-amz-copy-source-range header with the byte range of the part.
3. The client performs a PUT request to the given URL (with no request body) including the additional signed headers (x-amz-copy-source and x-amz-copy-source-range). The client can directly copy those headers and values from the batch response API above. The response from AWS will be a CopyPartResult containing the ETag of the copied part.
4. The ETag is used as usual to add the part to Synapse: PUT /file/multipart/{uploadId}/add/{partNumber}
5. The upload is completed as usual: PUT /file/multipart/{uploadId}/complete
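From the client side, the copy-part step could be sketched as follows, using only the JDK HTTP client (Java 11+). The class and method names are hypothetical, the header map stands in for the extended batch pre-signed URL response described above, and stripping the quotes from the ETag is an assumption about what the add-part call expects.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical client-side helper for the copy-part step of a multipart copy. */
final class CopyPartClientSketch {

    private static final Pattern ETAG = Pattern.compile("<ETag>(.*?)</ETag>");

    /** Builds the body-less PUT request, replaying the signed headers exactly as received. */
    static HttpRequest buildCopyPartRequest(String presignedUrl, Map<String, String> signedHeaders) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(presignedUrl))
            .PUT(HttpRequest.BodyPublishers.noBody());
        // e.g. x-amz-copy-source and x-amz-copy-source-range: they are part of the signature
        signedHeaders.forEach(builder::header);
        return builder.build();
    }

    /** Extracts the ETag from a CopyPartResult XML body, dropping the surrounding quotes. */
    static String extractETag(String copyPartResultXml) {
        Matcher matcher = ETAG.matcher(copyPartResultXml);
        if (!matcher.find()) {
            throw new IllegalArgumentException("No ETag found in the response");
        }
        return matcher.group(1).replace("&quot;", "").replace("\"", "");
    }

    public static void main(String[] args) {
        HttpRequest request = buildCopyPartRequest(
            "https://target-bucket.s3.amazonaws.com/some/key?partNumber=1&uploadId=example",
            Map.of("x-amz-copy-source", "source-bucket/123/uuid/file",
                   "x-amz-copy-source-range", "bytes=0-1048575"));
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
        // Sending it would be: HttpClient.newHttpClient().send(request, BodyHandlers.ofString())
    }
}
```

The request carries no body because S3 reads the bytes from the source object itself; the client only relays the signed headers and reports the resulting ETag back to Synapse.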
STS Copy (Previous Solution)
Point 5.a as the previously considered solution:
| API | New | Description |
|---|---|---|
| POST /fileHandle/sts/copy | Yes | Current: The current STS API (GET /entity/{id}/sts) can be invoked on a folder or project to gain read or write access (using the permission parameter) if the storage location associated with the entity is STS enabled. Changes: The new API will be specific to the copy operation; the body of the request (FileHandleStsCopyRequest) will include: The operation will vend STS credentials with WRITE access to the given target storage location prefix and READ access to the unique prefix of the file handle. If the target bucket is in a different region this operation will fail. The response body FileHandleStsCopyResponse will include a property credentials whose value is an object that extends StsCredentials (e.g. StsCopyCredentials) with two new properties, destinationBucket and destinationBaseKey. |
Notes
- I didn’t add an API to address PLFM-6227. We can create a custom storage location in the Synapse S3 bucket with STS enabled, but we have no way to manually create a file handle pointing to it. This means that you cannot migrate existing data into it. From the comments it seems that there is no use case for this yet? Should we remove this option altogether? Should we just wait for the use case? We decided not to consider this scenario for now for lack of a clear use case.
UPDATE: Going for the multipart copy solution this is a non-issue.
- I assumed that there is no current need for copying an entire STS enabled storage location, e.g. the file handle id in the request is mandatory. We could extend this to work on folder and project entities with no source file handle id property to give access to the whole storage location, but I’d wait for a use case.
UPDATE: Going for the multipart copy solution this is a non-issue.
- Data deletion: I initially thought we could add a parameter when updating an entity file handle to ask for the removal if the current user is the creator of the file handle, but the user can already do this through the dedicated API (https://rest-docs.synapse.org/rest/DELETE/fileHandle/handleId.html), and adding a new parameter that has meaning only in specific circumstances does not bring any clear advantage.