Understanding Parquet Datasets

What is Parquet data?

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Why should I care?

Assessment data collected as part of the Mobile Toolbox study is sent by the app to Bridge as a .zip archive of JSON files. This format is difficult for analysts to work with, so we normalize the JSON data and write the result to an S3 bucket as Parquet datasets (usually more than one). For instructions on how to access this data and load it as a data frame in Python or R, see Getting Started.

Parquet Datasets

Parquet datasets consist of one or more Parquet files which share the same schema and are partitioned by a directory hierarchy.

Partitions

We use the following partitions in our Parquet datasets:

  • assessmentid

  • year

  • month

  • day

assessmentid is the Bridge assessment identifier of the data. year, month, and day are derived from the data's uploadedOn property. Note that the uploadedOn date is not always the same as the date the assessment was completed. To speed up loading a Parquet dataset into a data frame by reading only the partitions that match a query, see Getting Started.
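As a rough illustration of how a record maps to a partition, the sketch below derives the four partition values from an assessment identifier and an uploadedOn timestamp. The Hive-style key=value directory layout, the ISO 8601 timestamp format, the function name partition_path, and the sample values are all assumptions made for this example; the actual layout in the S3 bucket may differ.

```python
from datetime import datetime


def partition_path(assessmentid: str, uploaded_on: str) -> str:
    """Build a partition directory path for one record.

    Assumes uploaded_on is an ISO 8601 timestamp and that partitions
    are laid out Hive-style (key=value). Both are assumptions.
    """
    # year, month, and day come from uploadedOn, which is not
    # necessarily the date the assessment was completed.
    uploaded = datetime.fromisoformat(uploaded_on)
    return (
        f"assessmentid={assessmentid}"
        f"/year={uploaded.year}"
        f"/month={uploaded.month}"
        f"/day={uploaded.day}"
    )


# Hypothetical assessment identifier and upload timestamp.
print(partition_path("example-assessment", "2022-03-05T14:30:00+00:00"))
```

Because the partition values are encoded in the directory hierarchy, a reader that filters on assessmentid or on an upload date can skip every directory that does not match, which is why partition-aware queries are faster.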

How are Parquet datasets organized?

Parquet dataset names are derived from three pieces of information:

  1. The schema identifier

  2. The Parquet dataset version

  3. The JSON hierarchy identifier

The name is then formatted:

dataset_{schema identifier}_{version}_{hierarchy identifier}
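The naming pattern above can be sketched as a small helper. The function name dataset_name and the sample version and hierarchy identifier values ("v1" and "steps") are hypothetical and chosen only to show the shape of the result; they are not taken from a real dataset.

```python
def dataset_name(schema_identifier: str, version: str, hierarchy_identifier: str) -> str:
    """Join the three pieces of information into a Parquet dataset name."""
    return f"dataset_{schema_identifier}_{version}_{hierarchy_identifier}"


# "v1" and "steps" are placeholder values for illustration only.
print(dataset_name("TaskMetadata", "v1", "steps"))
```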

The schema identifier

Every JSON file included in the .zip archive sent by the app to Bridge conforms to a JSON Schema. Each JSON Schema has an $id property which acts as its unique identifier. We derive the schema identifier from the base name of the $id property, minus any extension. For example, if a piece of JSON data conforms to the JSON Schema with $id https://sage-bionetworks.github.io/mobile-client-json/schemas/v1/TaskMetadata.json, then the schema identifier is TaskMetadata.
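A minimal sketch of this derivation, using only the Python standard library, might look like the following. The function name schema_identifier is hypothetical, and the sketch strips only the final extension, which is sufficient for $id values such as the one in the example above.

```python
import posixpath
from urllib.parse import urlparse


def schema_identifier(schema_id: str) -> str:
    """Return the base name of a JSON Schema $id, minus its extension."""
    # Take only the path component of the URL, ignoring scheme and host.
    path = urlparse(schema_id).path
    # The base name is the last path segment, e.g. "TaskMetadata.json".
    base_name = posixpath.basename(path)
    # Drop the final extension, e.g. ".json".
    stem, _extension = posixpath.splitext(base_name)
    return stem


print(schema_identifier(
    "https://sage-bionetworks.github.io/mobile-client-json/schemas/v1/TaskMetadata.json"
))
```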

Determining the JSON Schema of JSON data

As previously mentioned, assessment data is sent by the app to Bridge as a .zip archive of JSON data. This assessment data has an assessment identifier and an assessment version. These two pieces of information, along with the file name, are mapped to a JSON Schema in archive-map.json. For more information on the JSON Schemas used with the mobile clients, see https://github.com/Sage-Bionetworks/mobile-client-json.

The Parquet dataset version

The Parquet dataset version differentiates Parquet dataset schemas within the same schema identifier. In some cases, different JSON Schemas can be written to the same Parquet dataset. The important thing to understand about this

The JSON hierarchy identifier

Let’s take a look at an example to better illustrate how the Parquet datasets are produced and how they relate to the original JSON data.

JSON to Parquet datasets example

Suppose we are interested in
