...

  • S3 does not allow moving an object; instead, a move is achieved with a copy followed by a delete request

  • Copying an object in S3 is the same operation as an upload (a PUT request) with a special header that specifies the source to copy from

  • Copying objects does not incur data transfer costs for same-region operations (see https://aws.amazon.com/s3/pricing); the costs are for the requests themselves (e.g. COPY, $0.005 per 1000 requests)

  • Copying an object in a single operation is limited to 5GB per object; for larger files a multipart upload is required, where each part is copied from the source key with a specified byte range (see the sketch after this list)

  • It is possible to generate pre-signed URLs for the copy requests, including the multipart copy-part requests (see the details in the attached test file). This is not a documented “feature” of S3, but it is not a “hack” either; rather it is a consequence of how pre-signed URLs work in AWS.

  • S3 provides a few ways to copy a large number of files, for example the S3 Batch Operations service (https://docs.aws.amazon.com/AmazonS3/latest/dev/batch-ops-basics.html) or the more expensive and complex setup using EMR and S3DistCp (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html). These options are suited for batch operations on a set of objects residing in the same bucket and in general have limitations and costs associated with them. For example, an S3 batch operations job specifies a single operation to apply to each key in the inventory provided for the job; if the job is a copy operation each file is limited to 5GB, since the job itself performs a PUT copy. As an alternative a lambda can be invoked for each key to perform the actual copy (within the limits of the lambda itself)

  • In general the fastest way to copy data between buckets is an S3 copy, no matter how it is invoked. AWS provides some automated ways to do this, but when flexibility is needed for each key a more complex setup with dedicated infrastructure is required. Note that the key to speed is parallelization of the copies, both across different objects and, for a multipart copy, across the different parts of a single object
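
A minimal sketch of the two copy paths described above, using Python and boto3 (the bucket names, keys and part size are placeholders): a single CopyObject call is used up to the 5GB limit, otherwise a multipart upload is created and each part is copied from the source key with a byte range. The parts are copied sequentially here for brevity, but in practice they would be copied in parallel.

    import boto3

    s3 = boto3.client("s3")

    SRC = {"Bucket": "source-bucket", "Key": "source/key"}          # placeholder source
    DST_BUCKET, DST_KEY = "destination-bucket", "destination/key"   # placeholder target
    PART_SIZE = 100 * 1024 * 1024  # copy parts must be at least 5MB, except the last one

    def copy(size_bytes):
        if size_bytes <= 5 * 1024 ** 3:
            # Single operation copy: a PUT with the x-amz-copy-source header under the hood
            s3.copy_object(Bucket=DST_BUCKET, Key=DST_KEY, CopySource=SRC)
            return
        # Multipart copy: each part copies a byte range of the source key
        upload_id = s3.create_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY)["UploadId"]
        parts = []
        for number, start in enumerate(range(0, size_bytes, PART_SIZE), start=1):
            end = min(start + PART_SIZE, size_bytes) - 1
            result = s3.upload_part_copy(
                Bucket=DST_BUCKET, Key=DST_KEY, UploadId=upload_id, PartNumber=number,
                CopySource=SRC, CopySourceRange="bytes={}-{}".format(start, end))
            parts.append({"PartNumber": number, "ETag": result["CopyPartResult"]["ETag"]})
        s3.complete_multipart_upload(
            Bucket=DST_BUCKET, Key=DST_KEY, UploadId=upload_id,
            MultipartUpload={"Parts": parts})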

...

  1. Allowing an entity update to change the file handle id, without automatically increasing the version (e.g. maybe a soft md5 check can be performed)

  2. Allowing the file handle id of a specific version of an entity to be updated

  3. Removing the STS limitation that an STS storage location cannot be associated with a non-empty folder: for Synapse storage this should not be an issue, since a prefix is automatically assigned and the user cannot change it; for external storage locations maintaining consistency would in any case be a responsibility of the user

  4. Similar to the current https://rest-docs.synapse.org/rest/POST/externalFileHandle/s3.html API, adding a service to allow creating file handles in an S3StorageLocation where STS is enabled (this would allow linking data that is stored IN the Synapse S3 bucket but in an STS-enabled storage location). There does not seem to be a clear use case for an STS storage location in the Synapse S3 bucket at the moment (see PLFM-6227). This feature would be useful if a user cannot set up their own external S3 bucket.

  5. Provide a service to copy an existing S3 key to a new file handle:

    1. Extend the current STS service to provide credentials that allow a direct S3 copy of a specific version of an entity to a given (S3) storage location, if the file handle resides in the Synapse S3 bucket. The destination storage location in this case should be owned by the user and be either a Synapse STS-enabled storage location or a connected S3 storage location. Checks on the bucket regions should be in place to avoid transfer costs.

    2. As an alternative, which might be more flexible and more integrated into how Synapse works: we could extend the current multipart upload API to support a multipart upload that copies from an existing file handle. This would be very similar to how S3 actually implements the multipart copy, where the normal multipart upload is used but the source bucket and key are specified in the headers of the various requests (Copying objects using the multipart upload API - Amazon Simple Storage Service). The pre-signed URLs for the parts would be specially crafted around the copy-part requests so no upload is involved; when the upload completes a new file handle is created and we can track the source file handle in the file handle itself. The drawback (which might be seen as an advantage) of this solution is that the target bucket and keys are generated by Synapse (i.e. the file handle creation is handled by Synapse; note however that for external buckets we might want to allow the definition of the target key). The special pre-signed URL can be generated including additional custom headers (e.g. x-amz-copy-source and x-amz-copy-source-range) in the signature; the client would need to include those headers in order for the request to be authorized by AWS (see the sketch after this list). Advantages of this solution over the STS solution include:
      - Fewer limitations imposed on the source file handles (see note below)
      - Easier auditing of the specific operation
      - Fewer security concerns (see note below)
      - Automatic creation of the file handles; this also has the advantage that we are sure the client will create a trackable file handle for the copied data
      - Easier integration with the clients
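
A minimal sketch (Python, boto3, placeholder bucket and key names) of how such a pre-signed copy-part URL could be generated and consumed. As noted above, this relies on how pre-signed URLs work rather than on a documented S3 feature, so the exact behaviour should be verified against the attached test file. The copy-source headers are included in the signature, so the client must send them with the same values used when signing; completing the multipart upload is omitted for brevity.

    import boto3
    import requests

    s3 = boto3.client("s3")

    # Service side: start the multipart upload and pre-sign an UploadPartCopy request;
    # bucket names, keys and the byte range are placeholders
    upload_id = s3.create_multipart_upload(
        Bucket="destination-bucket", Key="destination/key")["UploadId"]

    url = s3.generate_presigned_url(
        "upload_part_copy",
        Params={
            "Bucket": "destination-bucket",
            "Key": "destination/key",
            "UploadId": upload_id,
            "PartNumber": 1,
            "CopySource": "source-bucket/source/key",
            "CopySourceRange": "bytes=0-104857599",
        },
        ExpiresIn=3600)

    # Client side: a plain HTTP PUT with no body, but the signed copy headers must be
    # sent with the exact values that were signed
    response = requests.put(url, headers={
        "x-amz-copy-source": "source-bucket/source/key",
        "x-amz-copy-source-range": "bytes=0-104857599",
    })
    response.raise_for_status()  # the ETag of the copied part is in the XML response body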

Note about point 5.a: when creating temporary credentials an inline session policy can be attached to limit access to resources; in particular, for a copy operation between two S3 buckets the permissions needed are at least s3:GetObject and s3:ListBucket on the source and s3:PutObject and s3:ListBucket on the destination (see https://aws.amazon.com/premiumsupport/knowledge-center/s3-troubleshoot-copy-between-buckets/). Additionally, multipart-related operations should be enabled on the destination bucket.
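
A sketch of what such an inline session policy could look like, using Python and boto3 (the role ARN, bucket names and prefixes are hypothetical, and the exact set of actions would need to be verified against the target buckets):

    import json
    import boto3

    SOURCE_BUCKET, SOURCE_PREFIX = "synapse-source-bucket", "123/456/"    # hypothetical
    DEST_BUCKET, DEST_PREFIX = "user-destination-bucket", "copied/"       # hypothetical

    session_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read access to the source object(s)
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": ["arn:aws:s3:::{}/{}*".format(SOURCE_BUCKET, SOURCE_PREFIX)],
            },
            {   # List access on both buckets, restricted to the relevant prefixes
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::" + SOURCE_BUCKET, "arn:aws:s3:::" + DEST_BUCKET],
                "Condition": {"StringLike": {"s3:prefix": [SOURCE_PREFIX + "*", DEST_PREFIX + "*"]}},
            },
            {   # Write access to the destination, including the multipart related operations
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"],
                "Resource": ["arn:aws:s3:::{}/{}*".format(DEST_BUCKET, DEST_PREFIX)],
            },
        ],
    }

    sts = boto3.client("sts")
    credentials = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/hypothetical-copy-role",
        RoleSessionName="s3-copy",
        Policy=json.dumps(session_policy),  # the session policy further restricts the role
    )["Credentials"]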

...