
Understanding Parquet Datasets

What is Parquet data?

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Why should I care?

Assessment data collected as part of the Mobile Toolbox study is sent by the app to Bridge as a .zip archive of JSON files. This format is difficult for analysts to work with, so we normalize the JSON data and write it as one or more Parquet datasets to an S3 bucket. Parquet's columnar format makes map-reduce style operations straightforward, and software such as Apache Arrow can read specific columns or partitions of Parquet data into memory directly from cloud storage like S3. For instructions on how to access this data in S3 and serialize it as a data frame in Python or R, see Getting Started.

Parquet Datasets

Parquet datasets consist of one or more Parquet files which share the same schema and are partitioned by a directory hierarchy.

Partitions

We use the following partitions in our Parquet datasets:

  • assessmentid

  • year

  • month

  • day

assessmentid is the Bridge assessment identifier of this data. year, month, and day are derived from the uploadedOn date. The uploadedOn date is not always the same as the date the assessment was completed. To speed up serializing a Parquet dataset as a data frame by reading only the partitions that match a query, see Getting Started.
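
The partition scheme above can be sketched in plain Python. This is only an illustration: the Hive-style key=value directory layout, the unpadded month and day values, the use of the date exactly as written in the timestamp, and the assessment identifier memory-for-sequences are all assumptions, not details confirmed by this page.

```python
from datetime import datetime


def partition_path(assessmentid: str, uploaded_on: str) -> str:
    """Build a partition path from the four partition fields.

    Assumption: partitions are laid out Hive-style (key=value directories).
    """
    ts = datetime.fromisoformat(uploaded_on)
    return (
        f"assessmentid={assessmentid}/year={ts.year}"
        f"/month={ts.month}/day={ts.day}"
    )


def parse_partitions(path: str) -> dict:
    """Recover the partition fields from a key=value style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def matches(path: str, **query) -> bool:
    """True if every queried partition field has the requested value."""
    parts = parse_partitions(path)
    return all(parts.get(key) == str(value) for key, value in query.items())


# Hypothetical assessment identifier and upload timestamp.
example = partition_path("memory-for-sequences", "2022-04-08T20:16:02.412-05:00")
wanted = matches(example, year=2022, month=4)
```

A reader that understands partitions can apply the same kind of test to directory names and skip every file whose partition values do not match, which is why filtering on these four fields is cheap.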

How are Bridge Downstream Parquet datasets organized?

Our Parquet dataset names are derived from three pieces of information:

  1. The JSON schema identifier

  2. The Parquet dataset schema version

  3. The JSON hierarchy identifier

The Parquet dataset name is formatted:

dataset_{schema identifier}_{version}_{hierarchy identifier}
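
As a minimal sketch, the naming pattern can be written as a small helper. The function name dataset_name is hypothetical, and treating the hierarchy identifier as optional is an assumption based on names such as dataset_ArchiveMetadata_v1 that appear elsewhere on this page.

```python
from typing import Optional


def dataset_name(
    schema_identifier: str,
    version: str,
    hierarchy_identifier: Optional[str] = None,
) -> str:
    """Join the name components with underscores.

    The hierarchy identifier is left out when the dataset holds
    top-level (parent) data.
    """
    parts = ["dataset", schema_identifier, version]
    if hierarchy_identifier:
        parts.append(hierarchy_identifier)
    return "_".join(parts)


parent_name = dataset_name("ArchiveMetadata", "v1")
child_name = dataset_name("ArchiveMetadata", "v1", "files")
```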

1. The JSON schema identifier

Every JSON file included in the .zip archive sent by the app to Bridge conforms to a JSON Schema. Each JSON Schema has an $id property which acts as a unique identifier. We derive the JSON schema identifier from the base name of the $id property, minus any extension. For example, if a piece of JSON data conforms to the JSON Schema whose $id is https://sage-bionetworks.github.io/mobile-client-json/schemas/v2/ArchiveMetadata.json, then the schema identifier is ArchiveMetadata.
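
The derivation above might look like the following in Python. The helper name schema_identifier is hypothetical, and this sketch strips only the final extension of the base name.

```python
import posixpath
from urllib.parse import urlparse


def schema_identifier(schema_id: str) -> str:
    """Return the base name of a JSON Schema $id URL without its extension."""
    base_name = posixpath.basename(urlparse(schema_id).path)
    return posixpath.splitext(base_name)[0]


identifier = schema_identifier(
    "https://sage-bionetworks.github.io/mobile-client-json/schemas/v2/ArchiveMetadata.json"
)
```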

Determining the JSON Schema of JSON data

There are two mutually exclusive ways to determine the JSON Schema of a piece of data. Which one applies depends on the assessment the data comes from.

Method 1: Reference the metadata

For newer assessments, a URL to the JSON Schema of each file is included in the metadata.json file within the .zip archive sent by the app to Bridge. This information is propagated to the jsonSchema field of the ArchiveMetadata_v1_files Parquet dataset.
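
A rough sketch of Method 1 when working with the raw .zip archive rather than the Parquet dataset. The layout of metadata.json is an assumption here: the sketch expects a top-level files array whose entries carry filename and jsonSchema keys, and the archive contents below are made up for illustration.

```python
import io
import json
import zipfile


def schemas_from_metadata(archive: zipfile.ZipFile) -> dict:
    """Map each file name listed in metadata.json to its JSON Schema URL.

    Entries without a jsonSchema key are skipped; those files would need
    to be resolved some other way.
    """
    metadata = json.loads(archive.read("metadata.json"))
    return {
        entry["filename"]: entry["jsonSchema"]
        for entry in metadata.get("files", [])
        if "jsonSchema" in entry
    }


# Build a tiny in-memory archive to exercise the function (made-up contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "metadata.json",
        json.dumps(
            {
                "files": [
                    {
                        "filename": "taskData.json",
                        "jsonSchema": "https://example.org/schemas/TaskData.json",
                    },
                    {"filename": "legacy.json"},
                ]
            }
        ),
    )

with zipfile.ZipFile(buffer) as zf:
    schema_by_file = schemas_from_metadata(zf)
```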

Method 2: Reference archive-map.json

Older assessments do not contain a reference to their schema in their .zip archive. Instead, we must look up the schema in archive-map.json. To do so, you will need to know the app name, assessment identifier, assessment revision, and file name of the JSON file. Information on where to find each of these is below:

  • App name – this is not included in the .zip archive, but is included as an annotation on the Synapse project where the .zip archive was sourced. You will probably be able to recognize this name without knowing exactly how it is formatted.

  • Assessment identifier – this is included in every Parquet dataset because it is a partition field.

  • Assessment revision – this is included in the ArchiveMetadata Parquet dataset.

  • File name – the name of the JSON file within the .zip archive.

The above fields are sufficient to map any file in an older assessment’s .zip archive to a JSON Schema in archive-map.json. For more information on the JSON Schema used in assessments, see https://github.com/Sage-Bionetworks/mobile-client-json.

2. The Parquet dataset schema version

The Parquet dataset schema version differentiates Parquet dataset schemas within the same JSON schema identifier. In some cases, data which conforms to different JSON Schema can be written to the same Parquet dataset under the same Parquet dataset schema. The important thing to understand about this component is that Parquet datasets with the same JSON schema identifier but different Parquet dataset schema versions have different Parquet dataset schemas. For example, the Parquet datasets dataset_ArchiveMetadata_v1 and dataset_ArchiveMetadata_v2 have different schemas.

3. The JSON hierarchy identifier

The JSON hierarchy identifier explicitly spells out where in the JSON data hierarchy this data comes from. For example, if you are looking for the startDate of each step in the assessment and the JSON data looks like this:

{
  "stepHistory": [
    {
      "type": "edu.northwestern.mobiletoolbox.mfs.serialization.MfsStepResult",
      "identifier": "MFS_welcome",
      "position": 1,
      "startDate": "2022-04-08T20:16:02.412-05:00",
      "endDate": "2022-04-08T20:16:07.510-05:00"
    }
  ]
}

Then the JSON hierarchy identifier of the Parquet dataset containing this data is stepHistory, since stepHistory is an array of objects where each object has a startDate property. We refer to the dataset created by pivoting out stepHistory’s array data as the child dataset. Fields which exist at the same hierarchical level as stepHistory belong to the parent dataset. If startDate had instead been an array, rather than a string, then the Parquet dataset containing this data would have the JSON hierarchy identifier stepHistory_startDate.

A new Parquet dataset is written if and only if a property is a JSON array.
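
To make the rule concrete, here is a small sketch that walks a JSON object and lists the hierarchy identifier of every array it finds. The function name hierarchy_identifiers is hypothetical, and the sketch only follows arrays on the top-level object and on array elements; how plain nested objects affect the identifier is not covered by this page, so they are ignored here.

```python
def hierarchy_identifiers(record: dict, prefix: str = "") -> list:
    """Return the hierarchy identifier of every JSON array in `record`."""
    identifiers = []
    for key, value in record.items():
        if not isinstance(value, list):
            continue  # only array properties produce a new dataset
        identifier = f"{prefix}_{key}" if prefix else key
        identifiers.append(identifier)
        # Arrays nested inside array elements extend the identifier.
        for element in value:
            if isinstance(element, dict):
                for nested in hierarchy_identifiers(element, identifier):
                    if nested not in identifiers:
                        identifiers.append(nested)
    return identifiers


# Variant of the example in which startDate is an array instead of a string.
record = {
    "stepHistory": [
        {
            "identifier": "MFS_welcome",
            "startDate": ["2022-04-08T20:16:02.412-05:00"],
        }
    ],
    "sampleId": "Am_pRV-tsHDXZz4hoBd4PHfH",
}
identifiers = hierarchy_identifiers(record)
```

With startDate as a plain string, the same call would yield only stepHistory.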

Joining a child dataset with its parent dataset

When a child dataset is created by pivoting out JSON array data, that data can be joined with the data in the parent dataset. Every child dataset contains the fields id and index. id is the join key: it matches the value stored in the parent dataset's field named after the original array property. index indicates the position of this data within the JSON array. This is perhaps best understood through an example:

{
  "stepHistory": [
    {
      "identifier": "MFS_welcome",
      "startDate": "2022-04-08T20:16:02.412-05:00",
      "endDate": "2022-04-08T20:16:07.510-05:00"
    },
    {
      "identifier": "MFS_goodbye",
      "startDate": "2022-04-08T20:17:02.412-05:00",
      "endDate": "2022-04-08T20:17:07.510-05:00"
    }
  ],
  "sampleId": "Am_pRV-tsHDXZz4hoBd4PHfH"
}

This JSON becomes two Parquet datasets: one without a JSON hierarchy identifier (the parent) and another with the JSON hierarchy identifier stepHistory (the child).

Parent: dataset_exampledata

sampleId                  stepHistory
Am_pRV-tsHDXZz4hoBd4PHfH  1

Child: dataset_exampledata_stepHistory

id  index  identifier   startDate                      endDate
1   1      MFS_welcome  2022-04-08T20:16:02.412-05:00  2022-04-08T20:16:07.510-05:00
1   2      MFS_goodbye  2022-04-08T20:17:02.412-05:00  2022-04-08T20:17:07.510-05:00

The sampleId in the parent dataset can be associated with the data in the child dataset by joining the id field of the child dataset upon the stepHistory field of the parent dataset.
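
The pivot and the join can be reproduced with plain Python to see how the pieces fit together. This is a sketch under stated assumptions, not the pipeline's actual implementation: the pivot helper is hypothetical, the id value 1 is chosen by hand, and index starts at 1 only to mirror the tables above.

```python
import json

raw = """
{
  "stepHistory": [
    {"identifier": "MFS_welcome",
     "startDate": "2022-04-08T20:16:02.412-05:00",
     "endDate": "2022-04-08T20:16:07.510-05:00"},
    {"identifier": "MFS_goodbye",
     "startDate": "2022-04-08T20:17:02.412-05:00",
     "endDate": "2022-04-08T20:17:07.510-05:00"}
  ],
  "sampleId": "Am_pRV-tsHDXZz4hoBd4PHfH"
}
"""


def pivot(record: dict, record_id: int):
    """Split one JSON object into a parent row and per-array child rows."""
    parent = {}
    children = {}
    for key, value in record.items():
        if isinstance(value, list):
            # The parent keeps a join key in place of the array itself.
            parent[key] = record_id
            children[key] = [
                {"id": record_id, "index": position, **element}
                for position, element in enumerate(value, start=1)
            ]
        else:
            parent[key] = value
    return parent, children


parent_row, child_rows = pivot(json.loads(raw), record_id=1)

# Join: child.id == parent.stepHistory
parent_by_key = {parent_row["stepHistory"]: parent_row}
joined = [
    {"sampleId": parent_by_key[child["id"]]["sampleId"], **child}
    for child in child_rows["stepHistory"]
    if child["id"] in parent_by_key
]
```

Each row of joined carries the parent's sampleId alongside one step's fields, so every step can be attributed to its sample.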
