
Direct (STS Token) S3 access in Synapse

The Need

 

Data in Synapse is normally accessed via the Synapse client, but Synapse also allows access to data directly in Amazon S3. The original motivating use case is to support access to data uploaded by the Bridge Exporter. The details are given here:

Synapse S3 Tokens (in Google Drive)

More recently, some edge cases have been noted: it is possible to configure nested folders in Synapse such that STS access is less restrictive than Synapse access. The details are here. This raises the question of where the feature is used and whether it would be better served by another mechanism.

 

The use of STS token folders is incompatible with the normal workflow of:

  • uploading sensitive data to a private folder;

  • crafting access restrictions on selected parts of the data;

  • publishing the results.

Downsides include:

  • Security fails to align between the two mechanisms.

  • Individual file downloads can't be audited, and accurate usage statistics can't be collected.

 

Use cases

Why direct S3 access?

  • Accelerate upload of small files: Uploading a file to Synapse includes some overhead in terms of a few web requests. Synapse is optimized for transferring large files, for which the overhead is negligible. Uploading many small (say, ~1 KB) files is slow. A leading use case is the upload of measurements from mobile devices, collected by Bridge. A solution to the problem is for the client (Bridge) to upload directly to the S3 bucket.

  • To allow using existing workflow tools: Some popular tools are built to access S3 directly. A leading example is Apache Spark, a cluster computing solution built on Hadoop. It has an Amazon S3 connector allowing it to work with S3 directly. (I believe the code is here.) Working with Synapse would require writing and maintaining a Synapse-specific connector. The significant effort this would take argues for simply letting the tool access the underlying bucket. Groups needing to access Synapse data through Spark include Sage Sys Bio (Tess Thayer) and the BMGF.
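
To make the Spark case concrete, here is a minimal sketch of how temporary credentials might be handed to Spark. It assumes the Hadoop S3A connector and assumes the credentials arrive as a dictionary using the field names of the AWS STS Credentials structure (AccessKeyId, SecretAccessKey, SessionToken). The function name s3a_conf_from_sts is hypothetical and is not part of Synapse or Spark.

```python
from typing import Dict


def s3a_conf_from_sts(creds: Dict[str, str]) -> Dict[str, str]:
    """Map temporary AWS credentials to Spark properties for the S3A connector.

    Assumption: `creds` uses the AWS STS field names
    AccessKeyId, SecretAccessKey and SessionToken.
    """
    required = {"AccessKeyId", "SecretAccessKey", "SessionToken"}
    missing = required - creds.keys()
    if missing:
        raise ValueError(f"missing credential fields: {sorted(missing)}")
    return {
        # Tell S3A to expect session (temporary) credentials.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": creds["AccessKeyId"],
        "spark.hadoop.fs.s3a.secret.key": creds["SecretAccessKey"],
        "spark.hadoop.fs.s3a.session.token": creds["SessionToken"],
    }
```

Each key/value pair could then be applied to a Spark session builder before reading s3a:// paths. Because STS tokens expire, a long-running job would need some way to refresh these properties, which this sketch does not address.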

 

The need to access S3 does not in and of itself require S3 support through Synapse. The additional requirement is at least one of the following: fine-grained authorization (providing access to just some of the data in a bucket), granting access through the Sage governance process, or granting access in the context of a project or to a team of users simultaneously.



 

Potential Solutions

Separate S3 Access from Synapse

In this approach, Synapse would merely track user permissions, but have no representation of the object being accessed. The user info returned by Synapse could include governance-related information as well as permissions. A separate application (acting as an OIDC client) would authenticate a user through Synapse, retrieve user permissions, and contain the logic that creates STS tokens for approved users. Note that such an application could gate access to a variety of AWS resources (EC2 instances, clusters, batch compute services), providing the appropriate STS token for the service of interest.
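
As a rough illustration of the token-creating logic such an application might contain, the sketch below builds an IAM session policy restricting a token to one prefix of a bucket. The function name build_session_policy, the bucket and prefix arguments, and the read_only flag are all hypothetical; how the application would derive them from the permissions returned by Synapse is left open.

```python
import json


def build_session_policy(bucket: str, prefix: str, read_only: bool = True) -> str:
    """Return a JSON IAM policy limiting access to `prefix` within `bucket`.

    Hypothetical helper: the inputs are assumed to come from the
    permissions the application retrieved for an approved user.
    """
    prefix = prefix.strip("/")
    if not prefix:
        raise ValueError("a non-empty prefix is required")
    object_actions = ["s3:GetObject"]
    if not read_only:
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only for keys under the prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object-level actions are allowed only under the prefix.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }
    return json.dumps(policy)
```

The resulting string could be supplied as the session policy when the application requests temporary credentials from AWS STS, so that the token carries no more access than the user was approved for.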

Synapse S3 Bucket Entity

In this approach Synapse would have an entity representing the S3 bucket, or a folder within, but would not represent the contained objects. The entity could have an access control list and governance settings (access requirements) and Synapse would contain the logic by which STS tokens are generated.
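
One way to picture this approach is the sketch below, in which a single entity carries the access control list and access requirements, and one check decides whether a token may be issued. The BucketEntity class, its fields, and the token_scope function are hypothetical illustrations, not existing Synapse types.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class BucketEntity:
    """Hypothetical entity standing in for a bucket or a folder within it."""
    bucket: str
    prefix: str
    # Principal id -> permissions held, e.g. {"read"} or {"read", "write"}.
    acl: Dict[str, Set[str]] = field(default_factory=dict)
    # Ids of access requirements a principal must have satisfied.
    access_requirements: Set[str] = field(default_factory=set)


def token_scope(entity: BucketEntity, principal: str,
                approvals: Set[str], permission: str) -> Optional[str]:
    """Return the S3 location a token may cover, or None if access is denied."""
    if permission not in entity.acl.get(principal, set()):
        return None  # the ACL does not grant the requested permission
    if not entity.access_requirements <= approvals:
        return None  # at least one access requirement is unmet
    return f"s3://{entity.bucket}/{entity.prefix.strip('/')}/"
```

Because the contained objects are not represented, the decision depends only on the entity itself, so there is no separate per-file permission that could disagree with the token's scope.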

Maintain STS Folder Feature, With Limited Flexibility

 

Fix All Known Edge Cases