
S3 Locations Design v2

Problems with the current S3 design

  1. There is a delay of a couple of seconds between creating an IAM user and being able to use it with S3. New Synapse users whose first download is via the R client therefore get an exception; when they try again a few seconds later, it works.
  2. For files larger than 5GB, Multipart Upload is required. (Pending confirmation from AWS) it appears that you must have an accessId and secretKey to perform a multipart upload. We cannot let clients have the IAM credentials we create for them because, due to scaling limitations with IAM, those credentials have permission to access all objects in the S3 bucket. (Note that federated users and security tokens did not launch until August 2011.)
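The 5GB single-PUT limit is what forces multipart upload for large files. A minimal sketch of the arithmetic (the 5MB–5GB part-size bounds and 10,000-part cap are S3's documented limits; the class and method names here are illustrative, not part of any Synapse code):

```java
// Sketch: why multipart matters. A single PUT caps out at 5 GB, so anything
// larger must be split into parts (S3 allows parts of 5 MB to 5 GB,
// up to 10,000 parts per upload).
public class PartMath {
    static final long GB = 1024L * 1024 * 1024;
    static final long MB = 1024L * 1024;

    // Minimum number of parts for a file, given a chosen part size.
    static long partCount(long fileSize, long partSize) {
        return (fileSize + partSize - 1) / partSize; // ceiling division
    }

    public static void main(String[] args) {
        // A 12 GB file with 100 MB parts:
        System.out.println(partCount(12 * GB, 100 * MB)); // 123 parts
    }
}
```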

Solution

Repository Service Changes

  1. Switch from IAM users to federated users and security tokens so that we can sufficiently limit the permissions granted by those credentials
  2. Add a new request URL /<LocationableEntityType>/<id>/s3Token to get a security token for use in an upload
  3. Complete the work to collapse Layer and Location
    1. Move md5 and content type from LocationData to Layer (all LocationData instances for a Layer should hold files matching the md5)
  4. Turn on auto-versioning when the md5 changes for any locationable object
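The steps above can be sketched as follows. Everything here is hypothetical: the entity type, id, bucket name, and key layout are illustrative, and the policy is only the kind of scoped statement a federation token would carry to avoid the all-objects access problem of plain IAM users:

```java
// Sketch only: hypothetical helpers illustrating the proposed s3Token
// request URL and the kind of scoped policy a federation token would carry.
public class S3TokenSketch {

    // Builds the proposed request path: /<LocationableEntityType>/<id>/s3Token
    static String s3TokenPath(String entityType, long entityId) {
        return "/" + entityType + "/" + entityId + "/s3Token";
    }

    // Builds a minimal IAM-style policy limiting the token to one key prefix.
    // Bucket name and key layout are illustrative, not the real values.
    static String scopedPolicy(String bucket, String keyPrefix) {
        return "{\n"
             + "  \"Statement\": [{\n"
             + "    \"Effect\": \"Allow\",\n"
             + "    \"Action\": [\"s3:GetObject\", \"s3:PutObject\"],\n"
             + "    \"Resource\": \"arn:aws:s3:::" + bucket + "/" + keyPrefix + "*\"\n"
             + "  }]\n"
             + "}";
    }

    public static void main(String[] args) {
        System.out.println(s3TokenPath("layer", 42)); // /layer/42/s3Token
        System.out.println(scopedPolicy("synapse-data", "42/"));
    }
}
```

The repository service would attach a policy like this when calling STS, so the resulting temporary credentials can only touch that entity's keys.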

Client Changes

  1. Upgrade the web client to add one extra header containing the security token when performing downloads from S3 (this turns out not to be necessary; presigned URLs work for federated users)
  2. Upgrade the Java client to do multipart uploads and downloads by default
  3. Modify the R client to use the Java client for uploads/downloads
    • Note that no special client support is needed for multipart downloads because they use the HTTP Range header, which is part of the HTTP protocol and not an extension to the S3 API
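To illustrate the last point: a multipart download is just a sequence of ordinary HTTP GETs with Range headers. A minimal sketch of computing those headers (the class name and chunk size are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: multipart download needs no S3 API extension; the client just
// issues plain HTTP GETs with Range headers. This computes those headers.
public class RangeChunks {

    // Returns the Range header values covering a file of the given size,
    // split into chunks of chunkSize bytes (the last chunk may be shorter).
    static List<String> rangeHeaders(long fileSize, long chunkSize) {
        List<String> ranges = new ArrayList<>();
        for (long start = 0; start < fileSize; start += chunkSize) {
            long end = Math.min(start + chunkSize, fileSize) - 1; // inclusive
            ranges.add("bytes=" + start + "-" + end);
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. a 10-byte file in 4-byte chunks:
        // bytes=0-3, bytes=4-7, bytes=8-9
        for (String r : rangeHeaders(10, 4)) {
            System.out.println(r);
        }
    }
}
```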

Data Migration Plan

Incremental Switch

  • When loading data for locationable entities, first look in the new spot to see if the location data is there; if not, look in the old spot
  • When storing data for locationable entities, clients use the new API
  • Run a background task to walk all locationable entities, downloading and then re-uploading the data (which has the side effect of versioning each entity)
  • Pros:
    • presumably no outage needed
  • Cons:
    • confusing, more likely to make errors and corrupt data
    • we may never retire the old API

The Big Switch

  1. Copy all S3 data to a new S3 bucket so that the S3 URLs have the correct entity prefix
  2. Copy values from Location entities to Layer entities with the new S3 paths
  3. Nuke all Location entities
  4. Nuke the code for Location entities
  • Pros:
    • very safe, we can keep the old S3 bucket and database backup for a long time
  • Cons:
    • might be a long outage

Sequence Diagram for Design

Alternate Approaches

Skip Federated Users and just add a tracking ID to the presigned urls for audit purposes

I've confirmed that you can include a tracking querystring along with your signed URL. However, the querystring cannot be part of the signed message. In other words, when you sign the URL, the string to sign should include only objectkey+S3bucket+expiretime. Example URL:

http://s3.amazonaws.com/testbucket/foo.obj?Expires=X&AWSAccessKeyId=XXXXX&Signature=XXX&ID=YYYYYY

In this example, ID=YYYYYY is your unique tracking ID. The only caveat with this option is that users can remove the tracking ID and still be able to upload/download from S3. Not sure if that's going to be a concern.
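A sketch of why the tracking ID is strippable, using the S3 query-string authentication scheme of that era (v2 signing). Only the method, expiry, and resource path go into the string to sign, so appending &ID=... has no effect on the signature. The key, bucket, and class names below are made up for illustration:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.net.URLEncoder;
import java.util.Base64;

// Sketch of S3 query-string authentication ("v2" signing) showing that a
// tracking parameter sits outside the signature: only GET/expires/resource
// are signed, so appending &ID=... changes nothing about validity.
public class PresignSketch {

    static String sign(String secretKey, String bucket, String key, long expires)
            throws Exception {
        // String to sign for S3 query-string auth: method, (empty) md5 and
        // content-type, expiry, then the canonical resource.
        String stringToSign = "GET\n\n\n" + expires + "\n/" + bucket + "/" + key;
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
        byte[] sig = mac.doFinal(stringToSign.getBytes("UTF-8"));
        return URLEncoder.encode(Base64.getEncoder().encodeToString(sig), "UTF-8");
    }

    static String presignedUrl(String accessKeyId, String secretKey,
                               String bucket, String key, long expires,
                               String trackingId) throws Exception {
        String url = "https://s3.amazonaws.com/" + bucket + "/" + key
                   + "?Expires=" + expires
                   + "&AWSAccessKeyId=" + accessKeyId
                   + "&Signature=" + sign(secretKey, bucket, key, expires);
        // The tracking id rides along unsigned; stripping it leaves a valid URL.
        return trackingId == null ? url : url + "&ID=" + trackingId;
    }

    public static void main(String[] args) throws Exception {
        String with = presignedUrl("AKIDEXAMPLE", "secret", "testbucket",
                                   "foo.obj", 1234567890L, "YYYYYY");
        String without = presignedUrl("AKIDEXAMPLE", "secret", "testbucket",
                                      "foo.obj", 1234567890L, null);
        // Identical up to the appended ID: the signature does not cover it.
        System.out.println(with.startsWith(without)); // true
    }
}
```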


We did not go with this because users can strip off the tracking tag and still use the URL, so it doesn't meet our bar for auditing, per Mike.

Use a different library than the AWS Java SDK

http://jets3t.s3.amazonaws.com/index.html
http://jets3t.s3.amazonaws.com/toolkit/code-samples.html

We did not pull this in because we already have the AWS Java SDK as a dependency and it is meeting our needs right now. If we later support other cloud storage vendors, we might switch to this library since it works with several.