Data Formats

Digital health data is exported by Bridge to Synapse Projects in two formats: raw and parquet.

Raw Data

Raw data is the data exactly as it has been sent by the app to Bridge. Bridge exports this data to Synapse as Synapse File entities. The raw data can be found in the study’s Synapse project under the “Files” tab in a folder called “Bridge Raw Data”. Additionally, raw data and its metadata can be viewed in a view under the “Tables” tab. The view will be named “Bridge Raw Data View”. We don’t recommend working directly with raw data, although you may find the view convenient for working with the metadata.

Parquet Data

Parquet data is a normalized version of the raw data. It can be found in the study’s Synapse project under the “Files” tab in a folder named “parquet”. Don’t freak out if this folder is empty! This is by design. Underneath the folder name will be text similar to the following:

This tells us where the data really is. We use Synapse to control access to this data, although all the real action happens in S3.

Parquet Datasets

The Parquet data is stored in S3 as a Parquet dataset, a collection of Parquet files partitioned by a folder hierarchy. For an explanation of how this data is organized, please see https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2644115482 .
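As a rough illustration of what "partitioned by a folder hierarchy" means, the sketch below pulls partition fields out of a file path. It assumes Hive-style `key=value` folder names, which is our assumption and is not confirmed on this page. The helper `partition_fields`, the example path, and the file name `part-0000.parquet` are hypothetical. See the linked page for the actual layout.

```python
from pathlib import PurePosixPath


def partition_fields(path: str) -> dict:
    """Return the key=value folder segments of a path as a dict."""
    fields = {}
    for segment in PurePosixPath(path).parts:
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value count as partition folders
        if sep and key and value:
            fields[key] = value
    return fields


# Hypothetical path inside a Parquet dataset (layout is an assumption)
example = ("dataset_archivemetadata_v1/"
           "assessmentid=memory-for-sequences/part-0000.parquet")
print(partition_fields(example))
```

Arrow does this parsing for us when it reads a partitioned dataset, which is why a partition field can later be used like an ordinary column.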

Accessing the Data

We assume that you already have access to the Synapse project containing your study data. If you don’t, please talk with your PI or whoever configured the study in the Bridge Study Manager. The Bridge Downstream team cannot unilaterally grant you access to any digital health data.

To access the Parquet data, it’s necessary to authenticate with AWS so that their services can grant access to the data in S3. Synapse makes this authentication easy with STS tokens. STS tokens can be retrieved using the Python, R, Java, or command line Synapse clients. Instructions are provided below for Python and R.

1. Install clients

Because of the way we will be interfacing with the data, we don’t need to install an AWS client. Instead, we will use the Synapse client to authenticate with Synapse and retrieve an STS token. We will then use an Arrow client to load the parquet data directly from S3. To work with the parquet data as a data frame in R we also need the dplyr dependency.

Python

R

2. (optional) Configure Synapse credentials

There are multiple ways to cache Synapse credentials so that you don’t need to enter them each time you log in with the Synapse client. One of the simplest, which works for both the Python and R clients, is to edit the following section of the .synapseConfig file, which is written to the home directory (~) when the Synapse client is installed. Don’t forget to uncomment the [authentication] header and the login parameters.

###########################
# Login Credentials       #
###########################

## Used for logging in to Synapse
## Alternatively, you can use rememberMe=True in synapseclient.login or login subcommand of the commandline client.
[authentication]
username = <username>
authtoken = <authtoken>

Complete documentation on how to configure the client can be found here.
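If login still prompts for credentials, a quick check is whether the [authentication] section is actually uncommented. The sketch below reads the file with Python's standard configparser module. The helper name `read_synapse_auth` is hypothetical, and we assume the file sits at ~/.synapseConfig as described above.

```python
import configparser
from pathlib import Path


def read_synapse_auth(config_path=None):
    """Return the [authentication] settings from a .synapseConfig file.

    Returns an empty dict if the file is missing or the section is
    still commented out.
    """
    path = Path(config_path) if config_path else Path.home() / ".synapseConfig"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is skipped without raising
    if not parser.has_section("authentication"):
        return {}
    return dict(parser["authentication"])
```

Calling `read_synapse_auth()` with no arguments inspects the default location. An empty result usually means the header or the parameters are still commented out.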

3. Authenticate with AWS using an STS token

The below code defines a global variable PARQUET_FOLDER. This is the Synapse ID of the parquet folder described in the “Parquet Data” section above. Change this value to match the ID of the parquet folder specific to your project.

Python

import synapseclient
from pyarrow import fs, parquet

PARQUET_FOLDER = "syn00000000"
syn = synapseclient.login()  # Pass your credentials
                             # or configure your .synapseConfig
# Get STS credentials
token = syn.get_sts_storage_token(
    entity=PARQUET_FOLDER,
    permission="read_only",
    output_format="json")
# Pass STS credentials to Arrow filesystem interface
s3 = fs.S3FileSystem(
    access_key=token['accessKeyId'],
    secret_key=token['secretAccessKey'],
    session_token=token['sessionToken'],
    region="us-east-1")

R

# Optional, we use :: operators to make the namespace explicit
library(synapser)
library(arrow)
library(dplyr)

PARQUET_FOLDER <- "syn00000000"
synapser::synLogin() # Pass your credentials
                     # or configure your .synapseConfig
# Get STS credentials
token <- synapser::synGetStsStorageToken(
    entity = PARQUET_FOLDER,
    permission = "read_only",
    output_format = "json")
# Pass STS credentials to Arrow filesystem interface
s3 <- arrow::S3FileSystem$create(
    access_key = token$accessKeyId,
    secret_key = token$secretAccessKey,
    session_token = token$sessionToken,
    region = "us-east-1")

For those who are curious, full documentation on using STS with Synapse can be found here.
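STS credentials are temporary, so in a long-running session you may need to request a new token and rebuild the file system object. One way to keep that step small is to isolate the mapping from token fields to file system arguments, as in the Python sketch below. The helper `s3_kwargs_from_token` is hypothetical. The field names are the ones used in the code above, and the default region is the one used above.

```python
REQUIRED_KEYS = ("accessKeyId", "secretAccessKey", "sessionToken")


def s3_kwargs_from_token(token, region="us-east-1"):
    """Map a Synapse STS token (JSON format) to S3FileSystem keyword arguments."""
    missing = [key for key in REQUIRED_KEYS if key not in token]
    if missing:
        # Fail early with a readable message instead of a bare KeyError later
        raise KeyError(f"STS token is missing fields: {missing}")
    return {
        "access_key": token["accessKeyId"],
        "secret_key": token["secretAccessKey"],
        "session_token": token["sessionToken"],
        "region": region,
    }
```

With the token retrieved above, this would be used as `s3 = fs.S3FileSystem(**s3_kwargs_from_token(token))`.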

4. Read parquet dataset as a data frame

Now that we have an S3 file system interface, we can begin interfacing with the S3 bucket. An explanation of how the parquet datasets are organized within the S3 bucket can be found on the page https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2644115482 . To view which parquet datasets are available to us:

List Parquet datasets

Python

base_s3_uri = "{}/{}".format(token["bucket"], token["baseKey"])
parquet_datasets = s3.get_file_info(
    fs.FileSelector(base_s3_uri, recursive=False))
for dataset in parquet_datasets:
    print(dataset.path)

R

base_s3_uri <- paste0(token$bucket, "/", token$baseKey)
parquet_datasets <- s3$GetFileInfo(
    arrow::FileSelector$create(base_s3_uri, recursive=FALSE))
for (dataset in parquet_datasets) {
    print(dataset$path)
}

Example output:

bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_audiolevelrecord_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_motionrecord_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_sharedschema_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_sharedschema_v1_userInteractions
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_sharedschema_v1_userInteractions_controlEvent
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_archivemetadata_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_archivemetadata_v1_files
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_taskresultobject_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/dataset_weatherresult_v1
bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet/owner.txt

Each of the above is a parquet dataset, with the exception of owner.txt. We can read a parquet dataset into memory directly from S3.
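To loop over the datasets programmatically, it helps to drop entries such as owner.txt from the listing first. The sketch below does this by name. It assumes dataset folders always start with `dataset_`, as in the example output, which is an assumption rather than a documented guarantee. The helper `dataset_names` is hypothetical.

```python
from pathlib import PurePosixPath


def dataset_names(paths, prefix="dataset_"):
    """Return the final path component of each entry that looks like a dataset."""
    names = [PurePosixPath(path).name for path in paths]
    return [name for name in names if name.startswith(prefix)]


# A few paths copied from the example output above
base = "bridge-downstream-parquet/bridge-downstream/app-id/study-id/parquet"
listing = [
    base + "/dataset_archivemetadata_v1",
    base + "/dataset_archivemetadata_v1_files",
    base + "/dataset_motionrecord_v1",
    base + "/owner.txt",
]
print(dataset_names(listing))
```

With the Python listing above, the input would be `[info.path for info in parquet_datasets]`.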

Serialize Parquet dataset as data frame

Python

metadata = parquet.read_table(
    "{}/{}".format(base_s3_uri, "dataset_archivemetadata_v1"),
    filesystem=s3)
metadata_df = metadata.to_pandas()

R

metadata <- arrow::open_dataset(
    s3$path(paste0(base_s3_uri, "/", "dataset_archivemetadata_v1")))
metadata_df <- dplyr::collect(metadata)

In the above example, we loaded the entire parquet dataset – all metadata across all assessments – into memory. In some cases this isn’t desirable. For example, we might only want to work with data from a specific assessment. To do so, we can take advantage of the partition field assessmentid, and only read data from a specific assessment. Assuming there is an assessment called memory-for-sequences:

Filter and serialize Parquet dataset as data frame

Python

metadata_mfs = parquet.read_table(
    "{}/{}".format(base_s3_uri, "dataset_archivemetadata_v1"),
    filters=[("assessmentid", "=", "memory-for-sequences")],
    filesystem=s3)
metadata_mfs_df = metadata_mfs.to_pandas()

R

# If you haven't loaded dplyr into your global namespace yet, do so now
# library(dplyr)

metadata_mfs <- arrow::open_dataset(
    s3$path(paste0(base_s3_uri, "/", "dataset_archivemetadata_v1")))
# Filter the dataset opened above before collecting it into memory
metadata_mfs_df <- metadata_mfs %>%
    filter(assessmentid == "memory-for-sequences") %>%
    collect()

You can filter on non-partition columns as well, but this isn’t any faster than serializing the entire parquet dataset. The only advantage is that you won’t need to store the entire dataset in memory, which can be particularly useful for motion or other time-series data.
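The same partition filter can cover more than one assessment. The Python sketch below builds the `filters` argument in the list-of-tuples form used above, switching to the `in` operator when several assessments are requested. The helper `assessment_filter` is hypothetical, and so is the second assessment name in the usage note.

```python
def assessment_filter(assessment_ids):
    """Build a pyarrow-style filter list on the assessmentid partition field."""
    ids = list(assessment_ids)
    if not ids:
        raise ValueError("At least one assessment ID is required")
    if len(ids) == 1:
        # Same form as the single-assessment example above
        return [("assessmentid", "=", ids[0])]
    return [("assessmentid", "in", ids)]


print(assessment_filter(["memory-for-sequences"]))
```

The result can be passed as `filters=assessment_filter(["memory-for-sequences", "another-assessment"])` in the `parquet.read_table` call shown above.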

Conclusion

You now understand the basics of serializing assessment data from a parquet dataset. For more information about interpreting and working with the parquet datasets, see https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2644115482 .

