...

  1. Allowing an entity update to change the file handle id without automatically increasing the version (e.g. a soft MD5 check could be performed)

  2. Allowing an entity update to change the file handle id of a specific version of an entity (including 1.)

  3. Removing the STS limitation of not being able to associate an STS storage location with a non-empty folder: for synapse storage this should not be an issue, since a prefix is automatically assigned and the user cannot change it. For external storage locations maintaining consistency would in any case be a responsibility of the user

  4. Similar to the current https://rest-docs.synapse.org/rest/POST/externalFileHandle/s3.html API, adding a service to allow creating file handles in an S3StorageLocation where STS is enabled (this allows linking data that is stored IN the synapse S3 bucket but in an STS enabled storage location). There does not seem to be a clear use case for an STS storage location in the synapse S3 bucket at the moment (see PLFM-6227). This feature would be useful if a user cannot set up their own external S3 bucket.

  5. Provide a service to copy an existing S3 key to a new file handle:

    1. Extend the current STS service to provide credentials that allow the direct S3 copy of a specific version of an entity to a given (S3) storage location, if the file handle resides in the synapse S3 bucket. The destination storage location in this case must be owned by the user and be either a synapse STS enabled storage location or a connected S3 storage location. Checks on the bucket regions should be in place to avoid cross-region transfer costs.

    2. As an alternative to 5.a, which might be more flexible and more integrated into how Synapse works: we could extend the current multipart upload API to support a multipart upload that copies from an existing file handle. This would be very similar to how S3 implements the multipart copy, where the normal multipart upload is used but the source bucket and key are specified in the headers of the various requests (see Copying objects using the multipart upload API - Amazon Simple Storage Service). The pre-signed URLs for the parts would be specially crafted around the copy part requests, so no upload is involved. When the upload completes a new file handle is created, and we can track the source file handle in the file handle itself. The drawback (which might be seen as an advantage) of this solution is that the target bucket and keys are generated by Synapse (i.e. the file handle creation is handled by Synapse). The special pre-signed URL can be generated including additional custom headers (e.g. x-amz-copy-source and x-amz-copy-source-range) in the signature; the client would need to include those headers in order for the request to be authorized by AWS.

Note about point 5.a (which actually enables the client driven migration): when creating temporary credentials an inline session policy can be attached to limit the access to specific resources. In particular, for a copy operation between two S3 buckets the permissions needed are at least s3:GetObject and s3:ListBucket on the source and s3:PutObject and s3:ListBucket on the destination (see https://aws.amazon.com/premiumsupport/knowledge-center/s3-troubleshoot-copy-between-buckets/). Additionally, multipart related operations should be enabled on the destination bucket.
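
As an illustration, the sketch below shows how such an inline session policy could be attached when vending the temporary credentials, using the AWS SDK for Java v1. This is only a sketch: the role ARN and the bucket/prefix parameters are assumptions, not the actual Synapse configuration.

Code Block
languagejava
import com.amazonaws.auth.policy.Policy;
import com.amazonaws.auth.policy.Resource;
import com.amazonaws.auth.policy.Statement;
import com.amazonaws.auth.policy.Statement.Effect;
import com.amazonaws.auth.policy.actions.S3Actions;
import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
import com.amazonaws.services.securitytoken.model.Credentials;

public class StsCopyCredentialsExample {

	public static Credentials vendCopyCredentials(String roleArn, String sourceBucket, String sourceKey,
			String destinationBucket, String destinationPrefix) {
		// Inline session policy restricting the temporary credentials to the copy:
		// read + list on the source, write + list on the destination.
		Policy sessionPolicy = new Policy().withStatements(
				new Statement(Effect.Allow)
						.withActions(S3Actions.GetObject)
						.withResources(new Resource("arn:aws:s3:::" + sourceBucket + "/" + sourceKey)),
				new Statement(Effect.Allow)
						// S3Actions.ListObjects maps to the s3:ListBucket action.
						.withActions(S3Actions.ListObjects)
						.withResources(new Resource("arn:aws:s3:::" + sourceBucket)),
				new Statement(Effect.Allow)
						// s3:PutObject also covers the multipart upload/copy part
						// operations on the destination.
						.withActions(S3Actions.PutObject)
						.withResources(new Resource("arn:aws:s3:::" + destinationBucket + "/" + destinationPrefix + "/*")),
				new Statement(Effect.Allow)
						.withActions(S3Actions.ListObjects)
						.withResources(new Resource("arn:aws:s3:::" + destinationBucket)));

		AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

		return sts.assumeRole(new AssumeRoleRequest()
				.withRoleArn(roleArn)
				.withRoleSessionName("sts-copy")
				.withPolicy(sessionPolicy.toJson()))
			.getCredentials();
	}
}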

...

API

New

Description

PUT /entity/{id}

No

Current: The backend automatically updates the version of the entity if the file handle id provided in the body of the request differs from the file handle id associated with the (current) revision of the entity (ignoring the newVersion parameter value).

Changes:
1. Avoid automatic versioning when the file handle differs and the MD5 of the new file handle matches the MD5 of the current file handle.
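
A minimal sketch of the intended decision logic (method and parameter names are assumptions, not the actual backend code):

Code Block
languagejava
// Hypothetical sketch of the revised auto-versioning decision.
public class AutoVersionCheck {

	static boolean requiresNewVersion(String currentFileHandleId, String currentMd5,
			String newFileHandleId, String newMd5) {
		if (currentFileHandleId.equals(newFileHandleId)) {
			// Same file handle: plain metadata update, never a new version.
			return false;
		}
		// Different file handle id: create a new version only when the content
		// actually changed, i.e. the soft MD5 check fails.
		return !currentMd5.equals(newMd5);
	}
}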

PUT /entity/{id}/version/{versionNumber}/filehandle

Yes

Current: We do not have an API to update the file handle of a specific entity version.

Changes: Introduce the new API so that the file handle of a specific version of a FileEntity can be updated. In this case no “auto-versioning” is involved; the body of the request (FileHandleUpdateRequest) will need to include the current file handle id (to ensure a non-conflicting update) and the new file handle id. The MD5 of the two file handles must match for this operation to be successful.
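
For illustration, the request body could be shaped along these lines (the field names are assumptions):

Code Block
languagejava
// Hypothetical shape of the FileHandleUpdateRequest body.
public class FileHandleUpdateRequest {
	// The file handle id currently associated with the targeted version, used
	// to detect conflicting updates.
	private String oldFileHandleId;
	// The replacement file handle id; its MD5 must match the current one.
	private String newFileHandleId;
	// Getters and setters omitted.
}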

PUT /projectSettings

No

Current: When the UploadDestinationListSetting is set for a project or folder and the storage location is an STS enabled storage location, validation is performed so that the target folder/project must be empty.

Changes: Remove the validation requiring an empty folder/project.

...

Multipart Copy

Point 5.b as the chosen alternative:

API

New

Description

POST /file/multipart

No

API can be altered to accept a new type of request body MultipartUploadCopyRequest with the following property:

  • sourceFileHandleAssociation: A FileHandleAssociation object describing the association to a file handle to copy from; the user must have download access to the file through the given association (e.g. file entity or table). The source file handle must be stored in an S3 synapse storage location.

Multipart Upload with Copy

Point 6. as a potential alternative (or addition?)

API

New

Description

POST /file/multipart

No

The current request body MultipartUploadRequest can be modified to take an optional property:

  • sourceFileHandleAssociation: A FileHandleAssociation object describing the association to a file handle to copy from; the user must have download access to the file through the given association (e.g. file entity or table). The source file handle must be stored in an S3 storage location within the same region as the target (S3) storage location.

Other properties in common with the current MultipartUploadRequest:

  • fileName

  • storageLocationId

  • generatePreview

An interface MultipartRequest will need to be introduced that is implemented by both the MultipartUploadCopyRequest and the MultipartUploadRequest. Additionally, to maintain backward compatibility we need to support a default implementation of the new interface in the API: we verified that the python client does not include the concreteType property when initiating a multipart upload request, so the backend would not be able to discriminate between a MultipartUploadCopyRequest and a MultipartUploadRequest during de-serialization.
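
A sketch of what the resulting models could look like (the interface and class names come from this page; the exact fields are assumptions):

Code Block
languagejava
// Sketch of the proposed model change.
public class MultipartModels {

	// Common interface implemented by both request types.
	public interface MultipartRequest {
		String getFileName();
		Long getStorageLocationId();
		Boolean getGeneratePreview();
	}

	// Minimal stand-in for the existing FileHandleAssociation model.
	public static class FileHandleAssociation {
		public String fileHandleId;
		public String associateObjectId;
		public String associateObjectType;
	}

	// Existing request (subset of its fields). When the concreteType property
	// is missing from the body (e.g. the current python client) the backend
	// should default de-serialization to this implementation.
	public static class MultipartUploadRequest implements MultipartRequest {
		public String fileName;
		public Long storageLocationId;
		public Boolean generatePreview;

		public String getFileName() { return fileName; }
		public Long getStorageLocationId() { return storageLocationId; }
		public Boolean getGeneratePreview() { return generatePreview; }
	}

	// New request initiating a multipart copy of an existing file handle.
	public static class MultipartUploadCopyRequest implements MultipartRequest {
		public FileHandleAssociation sourceFileHandleAssociation;
		public String fileName;
		public Long storageLocationId;
		public Boolean generatePreview;

		public String getFileName() { return fileName; }
		public Long getStorageLocationId() { return storageLocationId; }
		public Boolean getGeneratePreview() { return generatePreview; }
	}
}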

A MultipartUploadCopyRequest would initiate a multipart copy request from the given source if the user is allowed to download the file. The rest of the multipart process will be the same, but the pre-signed URL for the parts will be specially crafted for a copy request, and the client would simply execute the PUT request as indicated by the backend, including the special copy headers, without any data upload (see the API change below).

Completing the copy upload would yield a new file handle that is copied from the source; a new property in the file handle would store the id of the original file handle (we could store the copy reference in a companion table or in the file handle itself as a nullable column).

POST /file/multipart/{uploadId}/presigned/url/batch

No

Currently returns a BatchPresignedUploadUrlResponse containing a list of PartPresignedUrl, each containing the part number and the pre-signed URL. We would need to extend this in the copy case to provide the correct headers that the client needs to send in the PUT request, e.g. a map of <String, String>.
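
For example, the part model could be extended along these lines (the signedHeaders property name is an assumption):

Code Block
languagejava
import java.util.Map;

// Sketch of the extended part model.
public class PartPresignedUrl {
	private Long partNumber;
	private String uploadPresignedUrl;
	// New: the headers included in the pre-signed signature. For a copy part
	// the client must send these verbatim in the PUT request; empty for a
	// regular upload part.
	private Map<String, String> signedHeaders;
	// Getters and setters omitted.
}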

...

An example in Java using the standard S3 SDK that uses pre-signed URLs to perform the copy (both direct and through multipart), tested with a source bucket in dev and a target bucket in a different account (same region) configured to allow dev read/write access:

...

  • Start the multipart upload as usual on the target bucket and key:

    Code Block
    languagejava
    // Initiate a standard multipart upload on the destination bucket and key;
    // the canned ACL grants the bucket owner full control of the final object.
    InitiateMultipartUploadResult multiPart = s3Client.initiateMultipartUpload(
    		new InitiateMultipartUploadRequest(targetBucket, objectKey)
    			.withCannedACL(CannedAccessControlList.BucketOwnerFullControl)
    );
  • For each part, generate a pre-signed URL for the upload id and part number of a PUT request that includes in the signature the x-amz-copy-source header, pointing to the source bucket/key, and the x-amz-copy-source-range header, with the byte range for the part:

    Code Block
    languagejava
    // Pre-sign a PUT against the target bucket/key for the given upload id and part.
    GeneratePresignedUrlRequest presignedCopyUrlRequest = new GeneratePresignedUrlRequest(targetBucket, objectKey, HttpMethod.PUT);
    
    presignedCopyUrlRequest.addRequestParameter("partNumber", String.valueOf(partNum));
    presignedCopyUrlRequest.addRequestParameter("uploadId", multiPart.getUploadId());
    
    // The copy source and range headers are baked into the signature; the client
    // must send them verbatim or AWS will reject the request.
    Map<String, String> headers = ImmutableMap.of(
    		"x-amz-copy-source", sourceBucket + "/" + objectKey,
    		"x-amz-copy-source-range", String.format("bytes=%s-%s", bytePosition, lastByte)
    );
    
    headers.forEach((headerKey, headerValue) -> {
    	presignedCopyUrlRequest.putCustomRequestHeader(headerKey, headerValue);
    });
    
    URL presignedCopyPartUrl = s3Client.generatePresignedUrl(presignedCopyUrlRequest);
  • The client performs a PUT request to the given URL (with no request body) including the additional signed headers (x-amz-copy-source and x-amz-copy-source-range). The client can copy those headers and values directly from the batch response API above. The response from AWS will be a CopyPartResult:

    Code Block
    languagexml
    <CopyPartResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    	<LastModified>2020-09-16T02:21:31.000Z</LastModified>
    	<ETag>&quot;2f282b84e7e608d5852449ed940bfc51&quot;</ETag>
    </CopyPartResult>
  • The ETag is used as usual to add the part to synapse: PUT /file/multipart/{uploadId}/add/{partNumber}

  • The upload is completed as usual: PUT /file/multipart/{uploadId}/complete
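
For completeness, a minimal sketch of how a client could execute one of these pre-signed copy part requests, assuming Java 11's HttpClient and that the backend returns the signed headers as a map, as proposed above:

Code Block
languagejava
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

public class CopyPartClient {

	// Executes a pre-signed copy part request: a PUT with no body, carrying the
	// signed copy headers exactly as provided by the backend.
	public static String executeCopyPart(String presignedUrl, Map<String, String> signedHeaders)
			throws Exception {
		HttpRequest.Builder request = HttpRequest.newBuilder(URI.create(presignedUrl))
				.PUT(HttpRequest.BodyPublishers.noBody());
		// e.g. x-amz-copy-source and x-amz-copy-source-range.
		signedHeaders.forEach(request::header);
		HttpResponse<String> response = HttpClient.newHttpClient()
				.send(request.build(), HttpResponse.BodyHandlers.ofString());
		// The body is the CopyPartResult XML; the ETag is then added to the
		// multipart upload via PUT /file/multipart/{uploadId}/add/{partNumber}.
		return response.body();
	}
}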

STS Copy (Previous Solution)

Point 5.a as the previously considered solution:

API

New

Description

POST /fileHandle/sts/copy

Yes

Current: The current STS api (GET /entity/{id}/sts) can be invoked on a folder or project to gain read or write access (using the permission parameter) if the storage location associated with the entity is STS enabled.

Changes: The new API will be specific to the copy operation, the body of the request (FileHandleStsCopyRequest) will include:

  • fileHandleAssociation: A FileHandleAssociation object describing the association to a file handle; the user must have download access to the file through the given association (e.g. file entity or table). The source file handle must be stored in the default S3 synapse storage.

  • targetStorageLocationId: specifies the destination for the copy. The user must own (be the creator of) the storage location with the given id, and it must be an S3 storage location linked with synapse. Given that the storage location is owned by the user, no check is performed on the STS setting of the storage location.

The operation will vend STS credentials with WRITE access to the given target storage location prefix and READ access to the unique prefix of the file handle.

If the target bucket is in a different region this operation will fail.

The response body FileHandleStsCopyResponse will include a property credentials whose value is an object that extends the StsCredentials (e.g. StsCopyCredentials) with two new properties: destinationBucket and destinationBaseKey.
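
For reference, the response credentials could have been shaped along these lines (a sketch; the actual StsCredentials model lives in the Synapse repo and is only stubbed here):

Code Block
languagejava
// Sketch of the proposed response credentials for the previous solution.
public class StsCopyCredentials extends StsCredentials {
	// Bucket of the target storage location.
	private String destinationBucket;
	// Base key (prefix) in the target bucket where the copy should be written.
	private String destinationBaseKey;
	// Getters and setters omitted.
}

// Minimal stand-in for the existing StsCredentials model.
class StsCredentials {
	String accessKeyId;
	String secretAccessKey;
	String sessionToken;
}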

Notes

  • I didn’t add an API to address PLFM-6227: we can create a custom storage location in the synapse S3 bucket with STS enabled, but we have no way to manually create a file handle pointing to it, which means that existing data cannot be migrated into it. From the comments it seems that there is no use case for this yet. Should we remove this option altogether? Should we just wait for the use case? We decided not to consider this scenario for now for lack of a clear use case.
    UPDATE: Going with the multipart copy solution this is a non-issue.

  • I assumed that there is no current need for copying an entire STS enabled storage location, i.e. the file handle id in the request is mandatory. We could extend this to work on folder and project entities with no source file handle id property, to give access to the whole storage location, but I’d wait for a use case.
    UPDATE: Going with the multipart copy solution this is a non-issue.

  • Data deletion: I initially thought we could add a parameter when updating an entity file handle to request the removal of the old file handle if the current user is its creator, but the user can already do this through the dedicated API: https://rest-docs.synapse.org/rest/DELETE/fileHandle/handleId.html, and adding a new parameter that has meaning only in specific circumstances does not bring any clear advantage.