...

Data in Synapse is normally accessed via the Synapse client, but Synapse also allows direct access to select data in Amazon S3. The original, motivating use case is to support access to data uploaded by the Bridge Exporter. The details are given here:

Synapse S3 Tokens (in Google Drive)

More recently, some edge cases have been noted: it's possible to configure nested folders in Synapse such that STS access is less restrictive than Synapse access. The details are here: PLFM-6431. This raises the question of where the feature is used, and whether it would be better served by another mechanism.

The use of STS token folders is incompatible with the normal workflow of:

...

  • Security fails to align between the two mechanisms.

  • Can't audit or collect accurate usage statistics.

Use cases

Why direct S3 access?

  • Accelerate upload of small files: Uploading a file to Synapse incurs the overhead of a few web requests. Synapse is optimized for transferring large files, for which the overhead is negligible, so uploading many small (say, ~1 KB) files is slow. A leading use case is the upload of measurements from mobile devices, collected by Bridge. A solution to the problem is for the client (Bridge) to upload directly to the S3 bucket.

  • To allow using existing workflow tools: Some popular tools are built to access S3 directly. A leading example is Apache Spark, a cluster computing solution built on Hadoop. It has an AmazonS3 connector allowing it to work with S3 directly. (I believe the code is here.) To work with Synapse would require writing and maintaining a Synapse-specific connector. The significant effort to do this argues for simply letting the tool access the underlying bucket.

  • To move large quantities of data

...

Larson: BMGF will use STS when analyzing data that can’t be exported from the Amazon Cloud. (TODO: Find out more)

Tess: Apache Spark connects to a file system or other data stores, using specific connectors.

Here is the documentation for the use of the AmazonS3 connector. I believe the code is here. To use Spark with Synapse (without the STS token feature) would require a Hadoop Synapse connector. Note that Spark uses data partitioning, nominally based on file path hierarchy, so that may have to be reflected in the connector.
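The partitioning note can be illustrated with a small sketch of what a connector would have to preserve. It assumes Hive-style name=value directory names, which Spark's partition discovery recognizes. The object keys and function names are hypothetical and are not part of any Synapse or Hadoop API.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_values(key):
    """Extract Hive-style partition columns (name=value directories) from an object key."""
    values = {}
    # Every path component except the last (the file name) may be a partition directory.
    for part in PurePosixPath(key).parts[:-1]:
        name, sep, value = part.partition("=")
        if sep and name:
            values[name] = value
    return values


def group_by_partition(keys):
    """Group object keys by their partition values."""
    groups = defaultdict(list)
    for key in keys:
        # Sort the items so the grouping key does not depend on directory order.
        groups[tuple(sorted(partition_values(key).items()))].append(key)
    return dict(groups)


# Hypothetical object keys under an STS-enabled folder's bucket prefix.
keys = [
    "study/year=2020/month=01/part-0000.parquet",
    "study/year=2020/month=01/part-0001.parquet",
    "study/year=2020/month=02/part-0000.parquet",
]
groups = group_by_partition(keys)
```

With direct S3 access the path hierarchy is visible to Spark as-is. A Synapse connector would have to reconstruct an equivalent hierarchy from the Synapse folder structure.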

Potential Solutions

Separate S3 Access from Synapse

...