Data in Synapse is normally accessed via the Synapse client, but Synapse also allows select data to be accessed directly in Amazon S3. The original, motivating use case is supporting access to data uploaded by the Bridge Exporter. The details are given here:
Synapse S3 Tokens (in Google Drive)
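As a sketch of how such a token is consumed downstream: the temporary credentials returned by an STS endpoint map directly onto the arguments a boto3 S3 client expects. The key names and token values below are illustrative assumptions, not the exact payload the Synapse API returns.

```python
# Illustrative sketch: mapping an STS-style credential dict onto boto3
# client keyword arguments. The token key names mirror the usual AWS STS
# response shape; the exact Synapse payload may differ.

def sts_token_to_boto3_kwargs(token: dict) -> dict:
    """Translate a temporary-credential dict into boto3 S3 client kwargs."""
    return {
        "aws_access_key_id": token["accessKeyId"],
        "aws_secret_access_key": token["secretAccessKey"],
        "aws_session_token": token["sessionToken"],
    }

# Hypothetical token as an STS endpoint might return it.
token = {
    "accessKeyId": "ASIA-EXAMPLE",
    "secretAccessKey": "example-secret",
    "sessionToken": "example-session",
    "expiration": "2024-01-01T00:00:00Z",
}

kwargs = sts_token_to_boto3_kwargs(token)
# A real caller would then do: boto3.client("s3", **kwargs)
print(sorted(kwargs))
```

The credentials are scoped and time-limited, so callers must be prepared to refresh them when `expiration` passes.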
More recently, some edge cases have been noted: it's possible to configure nested folders in Synapse such that STS access is less restrictive than Synapse access. The details are here: PLFM-6431. This raises the question of where the feature is used, and whether it would be better served by another mechanism.
The use of STS token folders is incompatible with the normal workflow of:
uploading sensitive data to a private folder;
crafting access restrictions on selected parts of the data;
publishing the results.
Downsides include:
Security permissions fail to align between the two mechanisms
Downloads made directly via S3 can't be audited
Direct downloads can't add to our file download (usage) counts
Use cases
Why direct S3 access?
to allow using existing workflow tools
to move large quantities of data
What tools?
TODO: Ask Kelsey what tools they will use in PsychEncode. What specific workflows?
Will gave the example of bucket-to-bucket copy, but STS doesn't support that, since the token only grants access to the bucket to which the STS folder is linked.
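The limitation can be sketched with a toy access check (this is an illustrative model, not real AWS policy evaluation): S3's CopyObject needs read access on the source bucket and write access on the destination, so a token scoped to a single bucket can only copy within that bucket.

```python
# Toy model of why a single-bucket STS token cannot perform a
# bucket-to-bucket copy: the copy operation touches BOTH buckets,
# but the token grants access to only one. (Illustrative only; not
# real IAM policy evaluation.)

def can_copy(token_bucket: str, src_bucket: str, dst_bucket: str) -> bool:
    """A copy is allowed only if the token covers both source and destination."""
    return src_bucket == token_bucket and dst_bucket == token_bucket

# Token linked to the STS folder's bucket (names hypothetical):
same_bucket = can_copy("sts-folder-bucket", "sts-folder-bucket", "sts-folder-bucket")
cross_bucket = can_copy("sts-folder-bucket", "sts-folder-bucket", "other-bucket")
print(same_bucket, cross_bucket)  # True False
```

A true cross-bucket copy would require credentials covering both buckets, which is outside what an STS token folder provides.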
Larson: BMGF will use STS when analyzing data that can’t be exported from the Amazon Cloud. (TODO: Find out more)
Tess: Apache Spark connects to a file system or other data stores using specific connectors.
Here is the documentation for the use of the Amazon S3 connector. I believe the code is here. Using Spark with Synapse (without the STS token feature) would require a Hadoop Synapse connector. Note that Spark uses data partitioning, nominally based on file path hierarchy, so that may have to be reflected in the connector.
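To make the partitioning point concrete: Spark's Hive-style partition discovery encodes partition columns as key=value directory segments in the file path, and a Synapse connector would need to preserve such a layout for partition pruning to work. The sketch below extracts partition values from a hypothetical S3A path.

```python
# Sketch of Hive-style partition discovery as Spark performs it:
# partition columns are encoded in the path as key=value directory
# segments. The bucket and dataset names here are hypothetical.

def partition_values(path: str) -> dict:
    """Extract key=value partition segments from an s3a-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

path = "s3a://example-bucket/dataset/year=2021/month=07/part-00000.parquet"
print(partition_values(path))  # {'year': '2021', 'month': '07'}
```

If a connector flattened this hierarchy, Spark would have to scan every file rather than prune partitions by path.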