Multipart copy integration

Description

To support migrating existing data to different buckets, we extended the multipart API to support copying data referenced by existing file handles to a given storage location; see the updated File Services API documentation.

With the multipart copy we can use the underlying S3 machinery to quickly move large amounts of data from one bucket to another, using the copy operation provided by S3, without downloading and re-uploading.

Some notes about the new API:

  • Only copying from/to S3 buckets is supported (GC is not supported)

  • Only in-region copies are supported; if the source and target reside in different regions, the copy cannot be initiated

  • The user must be the owner of the target storage location

  • The user must have read and download access to the source file handle through the provided association

The idea is to provide an integration in the Python client. Below are some suggestions:

At a lower level, the API integration should work more or less the same as the multipart upload. I suggest using a larger part size than for a normal upload, since Amazon performs the copy and no data is transferred from the client (aside from the requests). The TransferManager in the Java AWS SDK switches to multipart copies when the file is > 5GB, so I suggest starting from something similar and testing out various options.
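As a starting point for experimenting with part sizes, here is a minimal sketch of how the client could split a file into copy parts. The function name plan_copy_parts and the 1 GiB default are assumptions made for illustration, not part of the existing client. The three limits are the ones Amazon documents for S3 multipart operations; the Synapse API may impose its own.

```python
import math

# Amazon S3 multipart limits, as documented by AWS.
MIN_PART_SIZE = 5 * 1024 ** 2   # 5 MiB (applies to every part except the last)
MAX_PART_SIZE = 5 * 1024 ** 3   # 5 GiB
MAX_PARTS = 10_000

# Hypothetical default for copies: a value to tune, not a measured optimum.
DEFAULT_COPY_PART_SIZE = 1024 ** 3  # 1 GiB


def plan_copy_parts(file_size, part_size=DEFAULT_COPY_PART_SIZE):
    """Return (part_number, first_byte, last_byte) tuples covering the file.

    Byte ranges are inclusive and part numbers start at 1.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")

    # Clamp the requested part size to the S3 limits.
    part_size = min(max(part_size, MIN_PART_SIZE), MAX_PART_SIZE)
    # Grow the part size if the file would otherwise need too many parts.
    part_size = max(part_size, math.ceil(file_size / MAX_PARTS))

    parts = []
    first_byte = 0
    part_number = 1
    while first_byte < file_size:
        last_byte = min(first_byte + part_size, file_size) - 1
        parts.append((part_number, first_byte, last_byte))
        first_byte = last_byte + 1
        part_number += 1
    return parts
```

A file smaller than the part size yields a single part, which is the case where the file handle MD5 could be reused for the whole copy.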

At a higher level, I would suggest adding a new operation in the Python client, such as "migrate" for entities, that can be used to "re-align" an entity to the storage location of its folder (or optionally to a storage location given as input). This would need to fetch the ids of the file handles for each revision of the entity, use the copy API to create new file handles, and then update the file handle of each entity revision using the dedicated API: https://rest-docs.synapse.org/rest/PUT/entity/id/version/versionNumber/filehandle.html.
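The flow described above could look roughly like the sketch below. Everything here is hypothetical: migrate_entity, copy_file_handle and update_revision are placeholder names, and the two callables stand in for the multipart copy API and the revision file handle update API, so that the control flow can be shown without assuming any client method names.

```python
from typing import Callable, Iterable, List, Tuple


def migrate_entity(
    revisions: Iterable[Tuple[int, str]],
    storage_location_id: int,
    copy_file_handle: Callable[[str, int], str],
    update_revision: Callable[[int, str], None],
) -> List[Tuple[int, str, str]]:
    """Copy the file handle of each revision and point the revision at the copy.

    revisions: (version_number, file_handle_id) pairs for one entity.
    copy_file_handle: hypothetical wrapper around the multipart copy API;
        takes (file_handle_id, storage_location_id), returns the new handle id.
    update_revision: hypothetical wrapper around the revision file handle
        update API; takes (version_number, new_file_handle_id).

    Returns (version_number, old_handle_id, new_handle_id) for each revision.
    """
    migrated = []
    for version_number, old_handle_id in revisions:
        new_handle_id = copy_file_handle(old_handle_id, storage_location_id)
        update_revision(version_number, new_handle_id)
        migrated.append((version_number, old_handle_id, new_handle_id))
    return migrated
```

Passing the API calls in as callables also makes the operation easy to exercise with fakes before the real integration exists.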

There are some aspects to take into account:

  • Users might want to create new versions of the entity, rather than updating the previous revision (and I would set this as the default). In this case the normal PUT entity with newVersion=true should be used to link the new file handle.

  • Users might want to delete the previous file handle after a revision is updated with the new one; this is possible using the dedicated API: https://rest-docs.synapse.org/rest/DELETE/fileHandle/handleId.html. Note however that only the creator of a file handle can delete it.

  • Users might want to have a recursive option and run it on an entire folder or project, but I would leave this as optional since the process might be done in a cluster where each entity is copied separately.

  • Files that are already in the provided storage location should be skipped, i.e. the operation should be idempotent (and I just realized that the backend should do this check and throw a 40x, I'm pretty sure I forgot to implement it).

  • I suggest we add an option that controls whether to stop on failures or continue. For example, if some of the revisions use file handles that are not in a compatible S3 location (e.g. different region, or not a linked S3 location) the operation might fail. Depending on the parameter, the operation would stop or continue with the next file. We could produce a summary of the operation (e.g. a CSV file).

  • We might want to support a manifest of entities to move

  • Finally, even though Amazon confirmed that this is not necessary when the source file does not change, users might want to force the MD5 check of the copied parts. This is possible by providing the MD5 checksum of the parts when requesting the pre-signed URLs: https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/file/BatchPresignedUploadUrlRequest.html. If the part size is bigger than the file size then the file handle MD5 can be used; otherwise the part MD5 must be computed either from a cached version of the file, or by sending a range request for the part to the pre-signed URL of the source file and computing it on the fly. I would put a warning about this and add an explicit parameter such as forceMD5Check. Maybe an interactive user confirmation should also be in place, with an optional parameter to skip the interaction, so that the user understands that files might need to be downloaded in order to do this.
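For the forced MD5 check in the last point, a minimal sketch of computing one part's checksum on the fly might look like this. The names part_md5_hex and fetch_range are hypothetical; fetch_range stands in for an HTTP GET against the source pre-signed URL with the given Range header, returning the body in chunks. Whether the API expects the checksum as hex or in another encoding should be checked against the REST docs.

```python
import hashlib
from typing import Callable, Iterable, Optional


def range_header(first_byte: int, last_byte: int) -> str:
    """Build an HTTP Range header value for an inclusive byte range."""
    return f"bytes={first_byte}-{last_byte}"


def part_md5_hex(
    fetch_range: Callable[[str], Iterable[bytes]],
    first_byte: int,
    last_byte: int,
    file_size: int,
    file_md5_hex: Optional[str] = None,
) -> str:
    """Return the hex MD5 of one part of the source file.

    fetch_range: hypothetical callable that takes a Range header value and
        yields the response body in chunks.
    file_md5_hex: MD5 stored on the file handle, reused when the part spans
        the whole file so that nothing has to be downloaded.
    """
    covers_whole_file = first_byte == 0 and last_byte == file_size - 1
    if covers_whole_file and file_md5_hex is not None:
        return file_md5_hex

    digest = hashlib.md5()
    # Hash chunk by chunk so a large part never has to fit in memory.
    for chunk in fetch_range(range_header(first_byte, last_byte)):
        digest.update(chunk)
    return digest.hexdigest()
```

Only the multi-part case triggers a download, which is the situation the warning and the forceMD5Check parameter would need to cover.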

Related APIs that might be useful:

Environment

None

Activity

Marco Marasca
October 13, 2020, 7:28 PM

Note: The new API is deployed on stack-330 (staging at the time of writing).

Jordan Kiang
January 8, 2021, 7:26 PM

I’d also appreciate it if you could review the documentation for the migration utility when you have a chance. It can be previewed here:

https://jkiang13.github.io/synapsePythonClient/build/html/S3Storage.html#storage-location-migration

Jordan Kiang
yesterday
Edited

As we discussed yesterday, I would also appreciate any feedback you have on the above and on the trial version I sent you when you have a chance, since it sounds like you will also be a user of this feature.

Assignee

Jordan Kiang

Reporter

Marco Marasca

Labels

None

Validator

Bruce Hoff

Development Area

Synapse Core Infrastructure

Release Version History

None

Fix versions

Priority

Major