What is Parquet data?
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Why should I care?
Assessment data collected as part of the Mobile Toolbox study is sent by the app to Bridge as a .zip archive of JSON files. This format is difficult for analysts to work with, so we normalize the JSON data and write the normalized data as (usually multiple) Parquet datasets to an S3 bucket. Because Parquet is columnar, it lends itself to map-reduce style operations. Software such as Apache Arrow can read specific columns or partitions of Parquet data into memory directly from cloud storage like S3. For instructions on how to access this data in S3 and read it into a data frame in Python or R, see Getting Started.
Parquet Datasets
Parquet datasets consist of one or more Parquet files which share the same schema and are partitioned by a directory hierarchy.
Partitions
We use the following partitions in our Parquet datasets:
assessmentid
year
month
day
assessmentid is the Bridge assessment identifier of this data. year, month, and day are derived from the uploadedOn date. The uploadedOn date is not always the same as the date the assessment was completed. To speed up reading a Parquet dataset into a data frame by loading only the partitions that match a query, see Getting Started.
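As a rough illustration of how partition values relate to an upload timestamp, the following sketch derives year, month, and day from an ISO 8601 string. The function name, the example assessment identifier, and the timestamp are hypothetical, and the sketch assumes the date components are taken as written in the timestamp (the actual pipeline may normalize time zones differently).

```python
from datetime import datetime


def partition_values(assessment_id: str, uploaded_on: str) -> dict:
    """Return the four partition values for one upload.

    Assumption: year/month/day are read directly from the timestamp
    as written, without converting to another time zone.
    """
    ts = datetime.fromisoformat(uploaded_on)
    return {
        "assessmentid": assessment_id,
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
    }


# Hypothetical identifier and timestamp, for illustration only.
example = partition_values("example-assessment", "2022-04-08T20:16:02.412-05:00")
```

Filtering on these four values when reading a dataset is what allows a reader to skip partitions that cannot match the query.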
How are Bridge Downstream Parquet datasets organized?
Our Parquet dataset names are derived from three pieces of information:
The JSON schema identifier
The Parquet dataset schema version
The JSON hierarchy identifier
The name is then formatted:
dataset_{schema identifier}_{version}_{hierarchy identifier}
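The naming pattern above can be sketched as a small helper. The function name is hypothetical. Because dataset names such as dataset_ArchiveMetadata_v1 appear on this page without a hierarchy component, the sketch assumes the hierarchy identifier is simply omitted when there is none; that behavior is an assumption, not something this page states.

```python
from typing import Optional


def dataset_name(schema_identifier: str, version: str,
                 hierarchy_identifier: Optional[str] = None) -> str:
    """Build a dataset name from its three components.

    Assumption: the hierarchy identifier is left off entirely
    when it is not provided.
    """
    parts = ["dataset", schema_identifier, version]
    if hierarchy_identifier:
        parts.append(hierarchy_identifier)
    return "_".join(parts)


top_level = dataset_name("ArchiveMetadata", "v1")
nested = dataset_name("ArchiveMetadata", "v1", "stepHistory")  # illustrative combination
```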
The JSON schema identifier
Every JSON file included in the .zip archive sent by the app to Bridge conforms to a JSON Schema. Each JSON Schema has an $id property which acts as a unique identifier of that schema. We derive the JSON schema identifier from the base name of the $id property, minus any extension. For example, if a piece of JSON data conforms to a JSON Schema which has $id https://sage-bionetworks.github.io/mobile-client-json/schemas/v2/ArchiveMetadata.json, then the schema identifier is ArchiveMetadata.
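A minimal sketch of that derivation, using only the Python standard library, might look like the following. The function name is hypothetical, and the sketch assumes that "minus any extension" means dropping the final suffix (such as .json) from the last path segment.

```python
import posixpath
from urllib.parse import urlparse


def schema_identifier(schema_id: str) -> str:
    """Derive the schema identifier from a JSON Schema $id URL."""
    path = urlparse(schema_id).path          # e.g. /mobile-client-json/schemas/v2/ArchiveMetadata.json
    base_name = posixpath.basename(path)     # e.g. ArchiveMetadata.json
    stem, _extension = posixpath.splitext(base_name)
    return stem


identifier = schema_identifier(
    "https://sage-bionetworks.github.io/mobile-client-json/schemas/v2/ArchiveMetadata.json"
)
```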
Determining the JSON Schema of JSON data
As previously mentioned, assessment data is sent by the app to Bridge as a .zip archive of JSON data. This assessment data has an assessment identifier and an assessment version. These pieces of information, along with the file name, are mapped to a JSON Schema in archive-map.json. For more information on the JSON Schema used in assessments, see https://github.com/Sage-Bionetworks/mobile-client-json.
The Parquet dataset schema version
The Parquet dataset schema version differentiates Parquet dataset schemas within the same JSON schema identifier. In some cases, data which conforms to different JSON Schema can be written to the same Parquet dataset under the same Parquet dataset schema. The important thing to understand about this component is that Parquet datasets with the same JSON schema identifier but different Parquet dataset schema versions have different Parquet dataset schemas. For example, the Parquet datasets dataset_ArchiveMetadata_v1 and dataset_ArchiveMetadata_v2 have different schemas.
The JSON hierarchy identifier
The JSON hierarchy identifier explicitly spells out where in the JSON data hierarchy this data comes from. For example, if you are looking for startDate in a Parquet dataset and the JSON data looks like this:
```json
{
  "stepHistory": [
    {
      "type": "edu.northwestern.mobiletoolbox.mfs.serialization.MfsStepResult",
      "identifier": "MFS_welcome",
      "position": 1,
      "startDate": "2022-04-08T20:16:02.412-05:00",
      "endDate": "2022-04-08T20:16:07.510-05:00"
    }
  ]
}
```
Then the JSON hierarchy identifier is stepHistory, since stepHistory is a list of objects where each object has a startDate property.
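To make the idea concrete, here is a small sketch that walks a JSON document and reports the key path under which a given property appears. The function name is hypothetical and this is not the pipeline's actual normalization code; it only illustrates how stepHistory is identified as the location of startDate in data shaped like the example above.

```python
import json


def find_hierarchy(data, target, path=()):
    """Return the distinct key paths whose objects directly contain `target`.

    Lists are traversed transparently, so every object inside a list
    shares the path of the list itself.
    """
    found = []
    if isinstance(data, dict):
        if target in data:
            found.append(path)
        for key, value in data.items():
            found.extend(find_hierarchy(value, target, path + (key,)))
    elif isinstance(data, list):
        for item in data:
            found.extend(find_hierarchy(item, target, path))
    # Drop duplicates (e.g. many list items) while keeping first-seen order.
    return list(dict.fromkeys(found))


document = json.loads("""
{
  "stepHistory": [
    {
      "identifier": "MFS_welcome",
      "position": 1,
      "startDate": "2022-04-08T20:16:02.412-05:00",
      "endDate": "2022-04-08T20:16:07.510-05:00"
    }
  ]
}
""")

paths = find_hierarchy(document, "startDate")
```

Here each returned path is a tuple of keys, so a property found at the top level of the document is reported with an empty tuple.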
JSON to Parquet datasets example
Suppose we are interested in TODO