The Need
Data in Synapse is normally accessed via the Synapse client, but Synapse also allows selected data to be accessed directly in Amazon S3 using STS tokens. The original, motivating use case was to support access to data uploaded by the Bridge Exporter. The details are given here:
Synapse S3 Tokens (in Google Drive)
More recently, some edge cases have been noted: It’s possible to configure nested folders in Synapse such that STS access is less restrictive than Synapse access. The details are here: PLFM-6431. This raises the question of where the feature is used, and whether those uses would be better served by another mechanism.
In addition, the use of STS token folders is incompatible with the normal workflow of:
uploading sensitive data to a private folder;
crafting access restrictions on selected parts of the data;
publishing the results.
Downsides include:
Security fails to align between the two mechanisms
Direct S3 access can't be audited
Direct S3 downloads can't be added to our file download (usage) counts
Use cases
Why direct S3 access?
to allow using existing workflow tools
to move large quantities of data
What tools?
TODO: Ask Kelsey what tools they will use in PsychEncode. What specific workflows?
Will gave the example of bucket-to-bucket copy, but STS doesn't provide that, since the token only grants access to the bucket to which the STS folder is linked.
Larson: BMGF will use STS when analyzing data that can’t be exported from the Amazon Cloud. (TODO: Find out more)
Tess: Apache Spark connects to a file system or other data stores, using specific connectors.
Here is the documentation for the use of the Amazon S3 connector. I believe the code is here. Using Spark with Synapse (without the STS token feature) would require a Hadoop connector for Synapse. Note that Spark uses data partitioning, nominally based on file path hierarchy, so that may have to be reflected in the connector.
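To make the Spark case concrete, here is a minimal sketch of how temporary (STS) credentials could be handed to the Hadoop S3A connector. The property names are the standard S3A settings for session credentials; the function name and the placeholder credential values are illustrative assumptions, not part of Synapse or Spark.

```python
def s3a_spark_conf(access_key_id, secret_access_key, session_token):
    """Build Spark properties that tell the Hadoop S3A connector to use
    temporary credentials (hypothetical helper, for illustration only).

    Spark copies any property prefixed with "spark.hadoop." into the
    Hadoop configuration used by its file system connectors.
    """
    return {
        # Credential provider that expects key, secret, and session token.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": access_key_id,
        "spark.hadoop.fs.s3a.secret.key": secret_access_key,
        "spark.hadoop.fs.s3a.session.token": session_token,
    }


# Placeholder values; real ones would come from the STS token for the folder.
conf = s3a_spark_conf("ASIA_EXAMPLE", "example-secret", "example-session-token")
for key in sorted(conf):
    print(key)
```

Each key/value pair could then be passed to `SparkSession.builder.config(key, value)`, after which data under the linked bucket would be read with an `s3a://` path. Since the token expires, long-running jobs would need some way to refresh these values.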
Potential Solutions
Separate S3 Access from Synapse
In this approach, Synapse would merely track user permissions, but have no representation of the object being accessed. The user info returned by Synapse could include governance-related information as well as permissions. A separate application (acting as an OIDC client) would authenticate a user through Synapse, retrieve the user's permissions, and contain the logic that creates STS tokens for approved users. Note that such an application could gate access to a variety of AWS resources (EC2 instances, clusters, batch compute services), providing the appropriate STS token for the service of interest.
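As a rough sketch of the token-creation logic such an application might contain, the code below builds the arguments for an STS AssumeRole call, scoped by a session policy to one bucket prefix. The shape of the permissions returned by Synapse (a set of strings), the role ARN, and all function names are assumptions made for illustration; only the IAM policy grammar and the AssumeRole parameter names are standard AWS.

```python
import json


def build_session_policy(bucket, prefix, read_only=True):
    """Return an IAM session policy limiting access to one bucket prefix."""
    prefix = prefix.strip("/")
    object_actions = ["s3:GetObject"]
    if not read_only:
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Listing is allowed only under the prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {   # Object access is allowed only under the prefix.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


def assume_role_args(user_id, permissions, bucket, prefix, role_arn):
    """Return AssumeRole keyword arguments, or None if the user has no access.

    `permissions` is a hypothetical set such as {"read", "write"} derived
    from the user info that Synapse returns.
    """
    if "read" not in permissions:
        return None
    policy = build_session_policy(bucket, prefix, read_only="write" not in permissions)
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"synapse-user-{user_id}",
        "Policy": json.dumps(policy),
        "DurationSeconds": 3600,
    }


args = assume_role_args(
    "12345", {"read"}, "example-bucket", "project/data",
    "arn:aws:iam::123456789012:role/example-sts-role",  # hypothetical role
)
print(args["RoleSessionName"])
```

With boto3, the resulting dictionary could be passed as `boto3.client("sts").assume_role(**args)`. Gating other AWS services would follow the same pattern with a different policy document.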
Synapse S3 Bucket Entity
In this approach, Synapse would have an entity representing the S3 bucket, or a folder within it, but would not represent the contained objects. The entity could have an access control list and governance settings (access requirements), and Synapse would contain the logic by which STS tokens are generated.
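One way to picture this entity and the check that would precede token generation is sketched below. The class, field names, and permission strings are hypothetical and do not reflect the actual Synapse data model; the point is only that both the ACL and the access requirements must be satisfied before a token is issued.

```python
from dataclasses import dataclass, field


@dataclass
class BucketEntity:
    """Hypothetical entity for a bucket, or a folder within a bucket."""
    bucket: str
    base_key: str = ""
    # Principal ID -> set of permissions granted, e.g. {"read", "write"}.
    acl: dict = field(default_factory=dict)
    # IDs of access requirements that a user must have been approved for.
    access_requirements: set = field(default_factory=set)


def may_issue_token(entity, principal_id, approved_requirements, permission="read"):
    """Return True only if the ACL grants the permission AND every
    access requirement on the entity has been approved for the user."""
    has_permission = permission in entity.acl.get(principal_id, set())
    has_approvals = entity.access_requirements <= set(approved_requirements)
    return has_permission and has_approvals


entity = BucketEntity(
    bucket="example-bucket",
    base_key="study/raw",
    acl={"alice": {"read", "write"}, "bob": {"read"}},
    access_requirements={"AR-1"},
)

print(may_issue_token(entity, "alice", {"AR-1"}, "write"))
print(may_issue_token(entity, "bob", set(), "read"))
```

Because the contained objects are not represented, the check is made once for the whole bucket or folder, which avoids the nested-folder mismatch described earlier but also means access cannot be restricted on selected parts of the data.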