Accessing the Data

Data comes in two formats: raw and parquet.
Raw Data
Raw data is the data exactly as it was sent by the app to Bridge. The raw data can be found under the “Files” tab in a folder called “Bridge Raw Data”. Additionally, this data and its metadata can be viewed under the “Tables” tab in a view named “Bridge Raw Data View”. We don’t recommend working with the data in this format, although you may find the view convenient for working with the file metadata.
Parquet Data
Parquet data is a relational version of the raw data. It can be found under the “Files” tab in a folder named “parquet”. Don’t worry if this folder appears empty; this is by design. Underneath the folder name will be text similar to the following:
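For example (the bucket and prefix below mirror the example output further down this page; yours will differ):

s3://bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet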
This tells us where the data really lives. We use Synapse to control access to this data, although the data itself is stored in S3.
Parquet Datasets
The Parquet data is stored in S3 as a Parquet dataset, a collection of files partitioned by a folder hierarchy. For an explanation of how this data is organized, please see TODO.
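As a purely illustrative sketch (the dataset and partition names below are hypothetical, not the actual Bridge layout), a partitioned Parquet dataset looks something like this:

dataset_example_v1/
    partitionfield=value1/
        part-00000.parquet
    partitionfield=value2/
        part-00000.parquet

Readers such as Arrow can treat the whole folder hierarchy as a single table, using the folder names as column values.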
Accessing the Data
To access this data, it’s necessary to authenticate with AWS so that their services can grant access to the data in S3. Synapse makes this authentication easy with STS tokens, which can be retrieved using the Python, R, or command line Synapse clients. You may need to install one of those clients; assuming you have already done so, the sample code below will allow you to access the parquet data.
Instructions are provided below for Python and R.
1. Install clients
Because of the way we will be interfacing with the data, we don’t need to install an AWS client. Instead, we will use the Synapse client to authenticate with Synapse and retrieve an STS token. We will then use an Arrow client to load the data directly from S3. To work with the parquet data as a data frame in R, we also need the dplyr dependency.
Python
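A minimal installation might look like the following, run from a shell (pandas is only needed for the to_pandas() step later):

pip install synapseclient pyarrow pandas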
R
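A minimal installation might look like the following, run from an R session. synapser is distributed through Sage Bionetworks’ R archive network (RAN) rather than CRAN, so the repos URL below is an assumption based on the synapser documentation; check there if it has changed:

# Install synapser from the Sage RAN repository; arrow and dplyr come from CRAN
install.packages("synapser", repos = c("http://ran.synapse.org", "https://cloud.r-project.org"))
install.packages(c("arrow", "dplyr"))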
2. Cache Synapse credentials

There are multiple ways to cache Synapse credentials so that it isn’t necessary to enter them each time you log in with the Synapse client. One of the simplest, which works for both the Python and R clients, is to edit the following section of the .synapseConfig file, which is written to the home directory (~) when the Synapse client is installed. Don’t forget to uncomment the [authentication] header and the login parameters.
###########################
# Login Credentials #
###########################
## Used for logging in to Synapse
## Alternatively, you can use rememberMe=True in synapseclient.login or login subcommand of the commandline client.
[authentication]
username = <username>
authtoken = <authtoken>
Complete documentation on how to configure the client can be found here.
3. Authenticate with AWS using an STS token
Python
import synapseclient
from pyarrow import fs, parquet

PARQUET_FOLDER = "syn00000000"

syn = synapseclient.login()  # Pass your credentials
                             # or configure your .synapseConfig

# Get STS credentials
token = syn.get_sts_storage_token(
    entity=PARQUET_FOLDER,
    permission="read_only",
    output_format="json")

# Pass STS credentials to Arrow filesystem interface
s3 = fs.S3FileSystem(
    access_key=token['accessKeyId'],
    secret_key=token['secretAccessKey'],
    session_token=token['sessionToken'],
    region="us-east-1")
R
# We use :: operators to make the namespace explicit
# library(synapser)
# library(arrow)

PARQUET_FOLDER <- "syn00000000"

synapser::synLogin()  # Pass your credentials
                      # or configure your .synapseConfig

# Get STS credentials
token <- synapser::synGetStsStorageToken(
  entity = PARQUET_FOLDER,
  permission = "read_only",
  output_format = "json")

# Pass STS credentials to Arrow filesystem interface
s3 <- arrow::S3FileSystem$create(
  access_key = token$accessKeyId,
  secret_key = token$secretAccessKey,
  session_token = token$sessionToken,
  region = "us-east-1")
For those who are curious, full documentation on using STS with Synapse can be found here.
4. Read parquet dataset as a data frame
Now that we have an S3 file system interface, we can begin working with the S3 bucket. An explanation of how the parquet datasets are organized within the S3 bucket can be found TODO. To view which parquet datasets are available to us:
Python
base_s3_uri = "{}/{}".format(token["bucket"], token["baseKey"])
parquet_datasets = s3.get_file_info(
    fs.FileSelector(base_s3_uri, recursive=False))
for dataset in parquet_datasets:
    print(dataset.path)
R
base_s3_uri <- paste0(token$bucket, "/", token$baseKey)
parquet_datasets <- s3$GetFileInfo(
  arrow::FileSelector$create(base_s3_uri, recursive = FALSE))
for (dataset in parquet_datasets) {
  print(dataset$path)
}
Example output:
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_audiolevelrecord_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_motionrecord_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_sharedschema_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_sharedschema_v1_userInteractions
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_sharedschema_v1_userInteractions_controlEvent
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_taskmetadata_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_taskmetadata_v1_files
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_taskresultobject_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_weatherresult_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/owner.txt
Each of the above is a parquet dataset, with the exception of owner.txt. We can read a parquet dataset into memory directly from S3 using the S3FileSystem object we created earlier.
Python
metadata = parquet.read_table(
    "{}/{}".format(base_s3_uri, "dataset_taskmetadata_v1"),
    filesystem=s3)
metadata_df = metadata.to_pandas()
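If a dataset is large, it may help to read only the columns you need rather than the whole table. The column names below are purely illustrative; inspect metadata_df.columns to see what is actually available:

# Read a subset of (hypothetical) columns to limit memory usage
subset = parquet.read_table(
    "{}/{}".format(base_s3_uri, "dataset_taskmetadata_v1"),
    filesystem=s3,
    columns=["recordId", "assessmentId"])
subset_df = subset.to_pandas()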
R
metadata <- arrow::open_dataset(
  s3$path(paste0(base_s3_uri, "/", "dataset_taskmetadata_v1")))
metadata_df <- dplyr::collect(metadata)
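Because arrow::open_dataset() is lazy, columns and rows can also be selected before anything is pulled into memory. The column name below is purely illustrative; check names(metadata_df) for the actual columns:

# Select a (hypothetical) column before collecting to limit memory usage
subset_df <- dplyr::collect(
  dplyr::select(metadata, recordId))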