The Need
Data in Synapse is normally accessed via the Synapse client, but Synapse also allows select data to be accessed directly in Amazon S3. The original, motivating use case is to support access to data uploaded by the Bridge Exporter. The details are given here:
Synapse S3 Tokens (in Google Drive)
More recently, some edge cases have been noted: it’s possible to configure nested folders in Synapse such that STS access is less restrictive than Synapse access. The details are here: PLFM-6431. This raises the question of where the feature is used, and whether it would be better served by another mechanism.
The use of STS token folders is incompatible with the normal workflow of:
...
Security fails to align between the two mechanisms.
Can't audit file downloads, add to our individual file download (usage) counts, or collect accurate usage statistics.
Use cases
Why direct S3 access?
To allow using existing workflow tools
To move large quantities of data
What tools?
TODO: Ask Kelsey what tools they will use in PsychEncode. What specific workflows?
Will gave the example of bucket-to-bucket copy, but STS doesn't provide that, since the token only provides access to the bucket to which the STS folder is linked.
Larson: BMGF will use STS when analyzing data that can’t be exported from the Amazon Cloud. (TODO: Find out more)
Tess: Apache Spark connects to a file system or other data stores, using specific connectors.
...
Accelerate upload of small files: Uploading a file to Synapse includes some overhead in terms of a few web requests. Synapse is optimized for transferring large files, for which the overhead is negligible. Uploading many small (say, ~1 KB) files is slow. A leading use case is the upload of measurements from mobile devices, collected by Bridge. A solution to the problem is for the client (Bridge) to upload directly to the S3 bucket.
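To make the overhead concrete, here is a back-of-the-envelope sketch. The overhead and bandwidth figures are illustrative assumptions, not measurements:

```python
def upload_time_s(n_files, file_size_bytes, per_file_overhead_s, bandwidth_bytes_s):
    """Estimate total upload time: each file pays a fixed per-request
    overhead plus its transfer time at the available bandwidth."""
    transfer_s = file_size_bytes / bandwidth_bytes_s
    return n_files * (per_file_overhead_s + transfer_s)

# Assumed figures: 0.5 s of request overhead per file, 10 MB/s of bandwidth.
# 10,000 x 1 KB files: overhead dominates (~5,000 s of overhead vs ~1 s of transfer).
small = upload_time_s(10_000, 1_000, 0.5, 10_000_000)
# One 10 GB file: overhead is negligible (~0.5 s vs ~1,000 s of transfer).
large = upload_time_s(1, 10_000_000_000, 0.5, 10_000_000)
```

Under these assumptions the many-small-files case spends over 99% of its time on per-request overhead, which is what direct S3 upload avoids.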
To allow using existing workflow tools: Some popular tools are built to access S3 directly. A leading example is Apache Spark, a cluster-computing solution built on Hadoop. It has an Amazon S3 connector allowing it to work with S3 directly. (I believe the code is here.)
...
To work with Synapse instead would require writing and maintaining a Synapse-specific connector. The significant effort to do this argues for simply letting the tool access the underlying bucket. Groups needing to access Synapse data through Spark include Sage Sys Bio (Tess Thayer) and the BMGF.
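For concreteness, Spark's S3 support (the Hadoop s3a connector) can consume temporary STS credentials through standard Hadoop configuration properties. A minimal sketch, assuming the credentials arrive as a dict with the usual STS field names:

```python
def s3a_options(creds):
    """Map temporary STS credentials onto the Hadoop s3a configuration
    keys that Spark reads. The credential field names assumed here
    (accessKeyId, secretAccessKey, sessionToken) are the ones STS returns."""
    return {
        "fs.s3a.access.key": creds["accessKeyId"],
        "fs.s3a.secret.key": creds["secretAccessKey"],
        "fs.s3a.session.token": creds["sessionToken"],
        # Tells s3a that the credentials are temporary (session-token based).
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    }

# These would be applied to the SparkSession builder, e.g.
#   builder.config("spark.hadoop." + key, value) for each pair,
# after which s3a:// paths in the linked bucket become readable.
opts = s3a_options({
    "accessKeyId": "AKIA...",        # placeholder values
    "secretAccessKey": "secret",
    "sessionToken": "token",
})
```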
The need to access S3 does not, in and of itself, require S3 support through Synapse. The additional requirements are: fine-grained authorization (providing access to just some of the data in a bucket, and/or providing access to data in the context of a group of people); the need to grant access through the Sage governance process; and the need to grant access in the context of a project, or to a team of users simultaneously.
Potential Solutions
Separate S3 Access from Synapse
In this approach, Synapse would merely track user permissions, but have no representation of the object being accessed. The user info returned by Synapse could include governance related information as well as permissions. A separate application (acting as an OIDC client) would authenticate a user through Synapse, retrieve user permissions, and contain the logic that creates STS tokens for approved users. Note that such an application could gate access to a variety of AWS resources (EC2 instances, clusters, batch compute services), providing the appropriate STS token for the service of interest.
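As a sketch of what such an application might do once it has authenticated the user and retrieved their permissions from Synapse: the helper names and policy shape below are hypothetical, while `get_federation_token` is the real boto3 STS call for minting scoped temporary credentials.

```python
import json

def scoped_s3_policy(bucket, prefix, read_only=True):
    """Build an inline IAM policy limiting the token to one prefix
    within the bucket (hypothetical helper for the separate app)."""
    object_actions = ["s3:GetObject"]
    if not read_only:
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            # Listing is allowed on the bucket, but only under the prefix.
            {"Effect": "Allow", "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}",
             "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}}},
            # Object access is allowed only under the prefix.
            {"Effect": "Allow", "Action": object_actions,
             "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*"},
        ],
    })

def issue_token(user_name, bucket, prefix, read_only=True):
    """Exchange approved Synapse permissions for temporary AWS credentials."""
    import boto3  # imported here so the policy builder stays stdlib-only
    sts = boto3.client("sts")
    resp = sts.get_federation_token(
        Name=user_name,
        Policy=scoped_s3_policy(bucket, prefix, read_only),
        DurationSeconds=3600,
    )
    return resp["Credentials"]
```

The same pattern generalizes to other AWS resources: swap the inline policy for one scoped to the service of interest.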
Synapse S3 Bucket Entity
In this approach Synapse would have an entity representing the S3 bucket, or a folder within, but would not represent the contained objects. The entity could have an access control list and governance settings (access requirements) and Synapse would contain the logic by which STS tokens are generated.
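If Synapse mints the tokens, the client-side experience might look like the following sketch. The Synapse Python client's `get_sts_storage_token` is its existing API for STS-enabled storage locations, and I'm assuming something similar would apply here; the entity ID is a placeholder.

```python
def read_credentials_for(syn, entity_id):
    """Ask Synapse to mint a read-only STS token for the bucket or folder
    backing the given entity, in a form boto3 can consume directly.
    (Assumes the entity's storage location has STS enabled.)"""
    return syn.get_sts_storage_token(
        entity_id, permission="read_only", output_format="boto"
    )

# Usage sketch -- "syn123" is a placeholder entity ID:
#   import synapseclient, boto3
#   syn = synapseclient.login()
#   creds = read_credentials_for(syn, "syn123")
#   s3 = boto3.client("s3", **creds)
```

In this model the ACL and access-requirement checks happen inside Synapse before the token is issued, rather than in a separate application.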