Robust File Upload

The current multi-part file upload to Synapse involves the following four steps:

  1. Start the file upload using POST /createChunkedFileUploadToken.
  2. The client divides the file into parts (typically 5MB each) and uploads each part using an HTTPS POST to a pre-signed URL obtained from: POST /createChunkedFileUploadChunkURL.
  3. After all of the parts are uploaded, a job is started to "complete" the file upload using: POST /startCompleteUploadDaemon.
  4. The client monitors the job started in step 3 until it completes (or fails) using: GET /completeUploadDaemonStatus/daemonId.

 

The current clients (R, Python, Web, Java) all implement some level of robustness using the above four steps. For example, if a client fails to POST a part to its pre-signed URL, the client will retry that part. Also, if the job fails at step 4, the clients may restart the file upload from step 1.

The most common cause of file upload failure is that the client receives a 201 from the POST of a part, but the part cannot be found in S3. When this occurs, the "complete" upload job will fail after waiting 5 minutes for the part to appear. The client has no way to determine which part failed, so the only option is to start the file upload from the beginning. Starting over from the beginning is simply unacceptable for large file uploads.

There are three other problems with the current chunked upload API:

  1. We do not require clients to calculate the MD5 of each part, which makes it impossible to check the integrity of the final file on the server side (see PLFM-3660).
  2. There is a limit of 10K parts imposed by Amazon on multi-part upload. Since all clients use a part size of 5MB, the largest file we can upload is 50GB (see PLFM-3657).
  3. When users upload extremely large files (say > 10GB), these file uploads can block all other users from uploading files (see PLFM-3637).
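
The arithmetic behind the 50GB ceiling in item 2 can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes decimal units (1MB = 1,000,000 bytes) so that the numbers match the 5MB x 10K = 50GB figure above.

```python
MAX_PARTS = 10_000   # limit on parts per multi-part upload, as stated above
MB = 1000 ** 2       # decimal megabyte (assumption, to match the 50GB figure)
GB = 1000 ** 3


def max_file_size(part_size_bytes: int) -> int:
    """Largest file that fits in MAX_PARTS parts of the given size."""
    return part_size_bytes * MAX_PARTS


def min_part_size(file_size_bytes: int, floor_bytes: int = 5 * MB) -> int:
    """Smallest part size (never below floor_bytes) that keeps the file within MAX_PARTS."""
    needed = (file_size_bytes + MAX_PARTS - 1) // MAX_PARTS  # integer ceiling division
    return max(floor_bytes, needed)


print(max_file_size(5 * MB))    # 50000000000 bytes, i.e. 50GB
print(min_part_size(200 * GB))  # 20000000 bytes, i.e. 20MB parts for a 200GB file
```

In other words, lifting the 50GB ceiling requires letting the part size grow with the file size rather than fixing it at 5MB.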

 

We believe all of these problems can be solved by changing our multi-part file upload API.

New Multi-part Upload

With the new file upload, the server will persist the state of a file's upload as each part is successfully added to the upload. This will allow clients to resume a file upload even after critical failures such as client crashes, power outages, and network outages. In order to start a multi-part file upload, the client will be expected to provide the key pieces of information that uniquely identify a file to be uploaded: <user_id>-<md5>-<part_size>. The server will then issue an ID for each unique combination of <user_id>-<md5>-<part_size>. All upload state will be persisted and will migrate from stack to stack.
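
As a rough sketch of the identifying triple, the client could compute the file's MD5 in a streaming fashion and combine it with the user ID and part size. The helper names below are hypothetical, and how the server actually stores or compares the combination is not specified here; the string form simply mirrors the <user_id>-<md5>-<part_size> notation above.

```python
import hashlib
import io


def file_md5_hex(stream, buffer_size=1024 * 1024):
    """Hex MD5 of a binary stream, read in blocks so large files need not fit in memory."""
    digest = hashlib.md5()
    for block in iter(lambda: stream.read(buffer_size), b""):
        digest.update(block)
    return digest.hexdigest()


def upload_key(user_id, md5_hex, part_size):
    """Illustrative <user_id>-<md5>-<part_size> key (hypothetical representation)."""
    return f"{user_id}-{md5_hex}-{part_size}"


# Example with an in-memory "file"; a real client would pass open(path, "rb").
key = upload_key(123, file_md5_hex(io.BytesIO(b"hello")), 5 * 1024 * 1024)
print(key)
```

Because the key depends only on the user, the file content, and the part size, two different clients uploading the same file with the same part size would arrive at the same upload.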

When a client starts a multi-part upload, it should make no assumption about the state of the file. Instead, the server will return the state of the requested file upload, which could be anywhere between not-started (no parts uploaded) and already complete (all parts uploaded, and a file handle ID issued). If the file has already been uploaded and a file handle already issued, there will be nothing else for the client to do. If the upload is not complete, the server will return the state of each part (according to the provided part size). The client is then expected to upload only the parts that have not yet been successfully added to the multi-part upload. In this way, once a part is successfully added to the multi-part upload, that part will never need to be re-uploaded, even if the client crashes.
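
The client-side bookkeeping described above might look like the following sketch. It assumes, purely for illustration, that the per-part state arrives as a list of booleans and that part numbers are 1-based; neither detail is fixed by this document.

```python
def number_of_parts(file_size, part_size):
    """How many parts a file splits into (at least one, even for an empty file)."""
    return max(1, (file_size + part_size - 1) // part_size)


def missing_parts(parts_state):
    """1-based numbers of the parts not yet added to the multi-part upload."""
    return [n for n, added in enumerate(parts_state, start=1) if not added]


def part_byte_range(part_number, part_size, file_size):
    """Half-open byte range [start, end) covered by a 1-based part number."""
    start = (part_number - 1) * part_size
    return start, min(start + part_size, file_size)


# A 23-byte file with 5-byte parts where parts 3 and 5 never reached the server.
file_size, part_size = 23, 5
state = [True, True, False, True, False]
assert len(state) == number_of_parts(file_size, part_size)
for n in missing_parts(state):
    print(n, part_byte_range(n, part_size, file_size))
```

Only the byte ranges of the missing parts need to be read from disk and uploaded; everything else is skipped.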

Here is an example of how file upload would work:

  1. Using the web client, a user selects a large file to upload to Synapse from their local hard drive.
  2. An hour after starting the file upload the user's machine shuts down due to a low battery.
  3. After providing power to their laptop, the user decides to try uploading the same file again using the Python client.
  4. When the Python client starts the file upload, the server indicates that only a few parts are missing from the multi-part upload.
  5. The Python client uploads the few remaining parts and completes the upload only a few minutes after starting.

Here is another example:

  1. The user wishes to upload a new FileEntity with the name "foo".
  2. The user names their file "foo" and then selects a large file to upload.
  3. The file takes thirty minutes to upload, and then the FileEntity creation fails because there is already a file named "foo" in the user's chosen location.
  4. The user decides to name the file "foo2" and re-starts the file upload.
  5. When the client starts the file upload, the server indicates the file has already been uploaded and provides the ID of the file handle.
  6. A new FileEntity with the name "foo2" is then created within a second after re-starting the upload.

Multi-part Upload API

| Description | Response | URL | Request | Type |
|---|---|---|---|---|
| Start or resume a multipart upload of a file. By default this method is idempotent, so subsequent calls will simply return the current status of the file upload. If, for some reason, the client must restart a file upload from the beginning, then the following optional query parameter should be included: forceRestart=true (/file/multipart/upload?forceRestart=true) | MultipartUploadStatus | /file/multipart/upload | MultipartUploadRequest | POST |
| Used to get a batch of pre-signed URLs that should be used to upload file parts. Each part will require a unique pre-signed URL. The client is expected to PUT the contents of each part to the corresponding pre-signed URL. Each pre-signed URL will expire 15 minutes after it is issued. If a URL has expired, the client will need to request a new URL for that part. | BatchPartUploadURLResponse | /file/multipart/{uploadId}/presignedurl/batch | BatchPartUploadURLRequest | POST |
| After the contents of a part have been uploaded (PUT to a pre-signed URL), this method is used to add the part to the multipart upload. If the uploaded part can be found, and the provided MD5 matches the MD5 of the part, the part will be accepted and added to the multipart upload. | AddMultipartResponse | /file/multipart/{uploadId}/add/{partNumber}?partMD5Hex={partMD5Hex} | | PUT |
| After all of the parts have been uploaded and added successfully, this method is called to complete the upload, resulting in the creation of a new file handle. | MultipartUploadStatus | /file/multipart/{uploadId}/complete | | PUT |
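
To show how the four calls fit together, here is a minimal client sketch run against an in-memory stand-in for the server. The URL paths come from the table above; everything else is an assumption made for illustration, including the JSON field names (contentMD5Hex, fileSizeBytes, partSizeBytes, uploadId, partsState, fileHandleId, partNumbers, urls, added), the boolean-list part state, 1-based part numbers, and the FakeServer class itself.

```python
import hashlib


def upload_file(data, part_size, api, put_to_url):
    """api(method, path, body) -> dict and put_to_url(url, chunk) are injected transports."""
    status = api("POST", "/file/multipart/upload", {
        "contentMD5Hex": hashlib.md5(data).hexdigest(),  # hypothetical field names
        "fileSizeBytes": len(data),
        "partSizeBytes": part_size,
    })
    if status.get("fileHandleId"):          # already complete: nothing left to do
        return status["fileHandleId"]
    upload_id = status["uploadId"]
    missing = [n for n, done in enumerate(status["partsState"], start=1) if not done]
    if missing:
        batch = api("POST", f"/file/multipart/{upload_id}/presignedurl/batch",
                    {"partNumbers": missing})
        for n in missing:
            chunk = data[(n - 1) * part_size:n * part_size]
            put_to_url(batch["urls"][n], chunk)          # PUT the part contents
            md5 = hashlib.md5(chunk).hexdigest()
            result = api("PUT", f"/file/multipart/{upload_id}/add/{n}?partMD5Hex={md5}", None)
            if not result.get("added"):
                raise RuntimeError(f"part {n} was rejected")
    return api("PUT", f"/file/multipart/{upload_id}/complete", None)["fileHandleId"]


class FakeServer:
    """In-memory stand-in used only to exercise the sketch (not the real service)."""

    def __init__(self, total_parts, already_added=()):
        self.state = [n in already_added for n in range(1, total_parts + 1)]
        self.stored = {}      # part number -> bytes received via the pre-signed URL
        self.put_count = 0

    def api(self, method, path, body):
        if path == "/file/multipart/upload":
            return {"uploadId": "1", "partsState": list(self.state), "fileHandleId": None}
        if path.endswith("/presignedurl/batch"):
            return {"urls": {n: f"https://example.invalid/part/{n}" for n in body["partNumbers"]}}
        if "/add/" in path:
            part, md5 = path.split("/add/")[1].split("?partMD5Hex=")
            n = int(part)
            ok = n in self.stored and hashlib.md5(self.stored[n]).hexdigest() == md5
            self.state[n - 1] = ok
            return {"partNumber": n, "added": ok}
        if path.endswith("/complete"):
            if not all(self.state):
                raise ValueError("not all parts have been added")
            return {"fileHandleId": "fh-1"}
        raise ValueError(f"unknown path: {path}")

    def put(self, url, chunk):
        self.stored[int(url.rsplit("/", 1)[1])] = chunk
        self.put_count += 1


# Resume an upload where parts 1, 2 and 4 were added before the client crashed.
server = FakeServer(total_parts=5, already_added={1, 2, 4})
handle = upload_file(b"abcdefghijklmnopqrstuvw", 5, server.api, server.put)
print(handle, server.put_count)  # fh-1 2
```

A real client would also need to cope with the 15 minute URL expiry, for example by requesting pre-signed URLs in small batches shortly before using them and requesting a fresh URL when a PUT is refused.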