This design doc covers:

Bridge Downstream Learnings

Example raw data for taskIdentifier “Number Match”. The file taskData.json looks like:

{
  "taskRunUUID" : <guid>,
  "schemaIdentifier" : "MTB_NumberMatch",
  "testVersion" : "1.5.0",
  "stepHistory" : [
    //...
  ],
  "locale" : "en_US",
  "endDate" : <ISO8601 Date-Time>,
  "identifier" : "Number Match",
  "type" : "mssTaskResult",
  "scores" : {
    "rawScore" : 27
  },
  "taskStatus" : [
    "completed"
  ],
  "startDate" : <ISO8601 Date-Time>,
  "taskName" : "Number Match",
  "userInteractions" : [
    //...
  ],
  "steps" : [
    //...
  ]
}

This produces Parquet files with keys of the following form: bridge-downstream-parquet/bridge-downstream/<appId>/<studyId>/parquet/<type>/assessmentid=number-match/year=<YYYY>/month=<MM>/day=<DD>/part-<5-digit num>-<guid>.c000.snappy.parquet

Note that the relevant types for this assessment are:

Each JSON file in an assessment is broken into multiple sub-tables. Each assessment is further sub-partitioned by date. Each Parquet file appears to contain only rows for a single record ID. The D&T team uses PyArrow and Pandas to aggregate the data into a single dataframe (a sketch of this approach follows below). We can either use this solution as well, or we can use something like https://github.com/apache/parquet-mr. (Note: This may or may not require us to pull in some Hadoop dependencies.)
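For example, a minimal sketch of the PyArrow approach, assuming the partition layout above (the path placeholders need to be filled in with real values, and pyarrow/pandas must be installed):

import pyarrow.dataset as ds

# Hive-style partitioning lets PyArrow discover the assessmentid/year/
# month/day columns from the directory names themselves.
dataset = ds.dataset(
    "bridge-downstream-parquet/bridge-downstream/<appId>/<studyId>/parquet/<type>",  # fill in real values
    format="parquet",
    partitioning="hive",
)

# Read every partition (all dates, all record IDs) into one DataFrame.
df = dataset.to_table().to_pandas()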

Open Questions

How frequently is JSON to Parquet triggered? As per https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2749759500/JSON+to+Parquet+Workflow#Scheduled-Trigger, the JSON to Parquet job is scheduled using a Cron job. What is the current schedule, and where is it configured? Can it be changed to trigger for each individual upload?

How do we aggregate data? We want to return one table per app per study per assessment revision that contains all rows for all participants across all days. Is this already handled in Bridge Downstream? Where is this output stored?

How is Bridge Downstream being monitored? Are there metrics? Dashboards? Alarms? How can Bridge team get access to these?

Should Bridge team have access to Bridge Downstream’s AWS accounts?

Requirements For Open Bridge

Results in tabular format in Researcher UI - We want research data available in the Researcher UI, one table per app per study per assessment revision. This table should include all rows for all participants in the study for all dates in the study. The exact format of the data varies from assessment to assessment, but should include scores if they are available.

Fast turnaround times - When the participant uploads data on their phone, we want Bridge to Export, Validate, Score (if available), and make the data available in the Researcher UI with as little delay as possible.

ARC Measures

Things we will have to do regardless of our design decisions:

Proposed Design: Replace Parquet

The schema validation works fine. However, the JSON-to-Parquet layer doesn’t quite do what we want: in particular, we would still need to aggregate the data across dates and record IDs.
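To make this concrete, here is a rough, hypothetical sketch of what replacing the Parquet layer could look like: flattening a taskData.json (using the Number Match example above) directly into one table row per record. The column choices are illustrative, not a final schema.

import json

def task_data_to_row(task_data: dict) -> dict:
    # Core columns taken from the Number Match example above.
    row = {
        "taskRunUUID": task_data["taskRunUUID"],
        "identifier": task_data["identifier"],
        "testVersion": task_data["testVersion"],
        "startDate": task_data["startDate"],
        "endDate": task_data["endDate"],
    }
    # Promote each score to its own column, e.g. rawScore -> score_rawScore.
    for name, value in task_data.get("scores", {}).items():
        row["score_" + name] = value
    return row

# One row per uploaded taskData.json; rows for all participants and all
# dates would then be appended to a single per-assessment table.
with open("taskData.json") as f:
    print(task_data_to_row(json.load(f)))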

Pros:

Cons:

Alternate Design: Keep Parquet

We keep the Bridge Downstream pipeline exactly as is, Parquet and all, and add our own code at the end to aggregate the data and send it back to Bridge.

Pros:

Cons:

Alternate Design 2: Do both

Pros:

Cons:

Surveys

Surveys are even easier than ARC measures. We already know what our survey table format looks like. (This is one of the things Exporter 2.0 actually did well. However, that survey engine is currently deprecated, as are Exporter 2.0 schemas.)

It’s not clear what JSON-to-Parquet gets us in this use case. In the best case, it converts survey JSON to a tabular format similar to what we already expect, but we would end up with a bunch of one-row files that we would need to aggregate. And we’d still need to integrate with the new Survey Builder to determine which table columns we want.

It would probably be simpler to write our own Survey-to-Table implementation, custom-built for the new Survey Builder and survey format.
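As a rough illustration only (the new Survey Builder format is not finalized, so the structure of the survey result below is assumed, not real):

def survey_to_row(survey_result: dict) -> dict:
    # An "answers" mapping of question identifier -> answer is an
    # assumed structure, NOT the real Survey Builder format.
    row = {"recordId": survey_result.get("recordId")}
    for question_id, answer in survey_result.get("answers", {}).items():
        row[question_id] = answer
    return row

# e.g. survey_to_row({"recordId": "abc", "answers": {"q1": 3, "q2": "yes"}})
# -> {"recordId": "abc", "q1": 3, "q2": "yes"}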

Additional Notes

Bridge Downstream code base: https://github.com/Sage-Bionetworks/BridgeDownstream

Bridge Downstream getting started: Getting Started

Bridge Downstream developer docs: /wiki/spaces/BD/pages/2746351624

How to read parquet data in the command line:

  1. Pre-req: pip needs to be installed on your Mac. The fastest way to install it is port install py-pip (via MacPorts), which will also install Python if necessary.

  2. pip install parquet-cli - You only need to do this once. This will put a command-line utility called parq in your bin.

  3. parq <filename> to get metadata for a parquet file.

  4. parq <filename> --head N or parq <filename> --tail N to read the actual parquet data.

  5. Alternate solution: https://github.com/devinrsmith/deephaven-parquet-viewer allows you to view parquet files in your browser (requires Docker).
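If you already have Python with pandas and pyarrow installed, you can also skip parq and read a downloaded file directly; the filename below is illustrative:

import pandas as pd

# Reads one parquet file into a DataFrame (pandas uses pyarrow or
# fastparquet under the hood).
df = pd.read_parquet("part-00000-<guid>.c000.snappy.parquet")
print(df.head())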