...

  • Accelerate upload of small files: Uploading a file to Synapse involves the overhead of several web requests. Synapse is optimized for transferring large files, where this overhead is negligible, but uploading many small (~1 KB) files is slow. A leading use case is the upload of measurements from mobile devices, collected by Bridge. A solution to the problem is for the client (Bridge) to upload directly to the S3 bucket. (TODO: Why does this require STS access? Couldn’t we simply provision a private bucket and give the user access to the whole thing? That is, do we need the fine-grained authorization that the STS mechanism provides?)

  • To allow using existing workflow tools: Some popular tools are built to access S3 directly. A leading example is Apache Spark, a cluster computing solution built on Hadoop. It has an Amazon S3 connector that lets it work with S3 directly. (I believe the code is here.) Working with Synapse instead would require writing and maintaining a Synapse-specific connector; the significant effort involved argues for simply letting such tools access the underlying bucket. Groups needing to access Synapse data through Spark include Sage Sys Bio (Tess Thayer) and the BMGF. (TODO: Why does this require STS access? Couldn’t we simply provision a private bucket and give the user access to the whole thing? That is, do we need the fine-grained authorization that the STS mechanism provides?)
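As a rough sketch of the direct-upload idea above (the bucket name, role ARN, and function names here are illustrative assumptions, not Synapse's actual API), a client such as Bridge could exchange its identity for temporary STS credentials scoped to a per-user key prefix, then write to the bucket without going through the Synapse web API:

```python
def scoped_key(user_id: str, filename: str) -> str:
    """Build a per-user key prefix; an STS session policy could restrict
    access to exactly this prefix (the fine-grained case in the TODO)."""
    return f"uploads/{user_id}/{filename}"

def upload_direct(user_id: str, filename: str, body: bytes,
                  bucket: str = "example-synapse-upload-bucket",
                  role_arn: str = "arn:aws:iam::123456789012:role/ExampleUploadRole") -> None:
    # boto3 imported lazily so the sketch is readable without AWS deps installed
    import boto3

    # Obtain short-lived credentials by assuming a role (standard STS call)
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"bridge-{user_id}",
    )["Credentials"]

    # Use the temporary credentials to put the object directly into S3,
    # avoiding the per-file web-request overhead of the Synapse upload path
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    s3.put_object(Bucket=bucket, Key=scoped_key(user_id, filename), Body=body)
```

In the coarse-grained alternative raised in the TODOs, the `assume_role` step would be unnecessary: the user would hold long-lived credentials for the whole bucket, trading the per-prefix scoping for simplicity.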


Potential Solutions

Separate S3 Access from Synapse

...