This design doc covers:

Bridge Downstream Learnings

Example raw data for taskIdentifier “Number Match”. The file taskData.json looks like:

{
  "taskRunUUID" : <guid>,
  "schemaIdentifier" : "MTB_NumberMatch",
  "testVersion" : "1.5.0",
  "stepHistory" : [
    //...
  ],
  "locale" : "en_US",
  "endDate" : <ISO8601 Date-Time>,
  "identifier" : "Number Match",
  "type" : "mssTaskResult",
  "scores" : {
    "rawScore" : 27
  },
  "taskStatus" : [
    "completed"
  ],
  "startDate" : <ISO8601 Date-Time>,
  "taskName" : "Number Match",
  "userInteractions" : [
    //...
  ],
  "steps" : [
    //...
  ]
}

This produces Parquet files with keys of the following form: bridge-downstream-parquet/bridge-downstream/<appId>/<studyId>/parquet/<type>/assessmentid=number-match/year=<YYYY>/month=<MM>/day=<DD>/part-<5-digit num>-<guid>.c000.snappy.parquet

Note that the relevant types for this assessment are:

Each JSON file in an assessment is broken into multiple sub-tables. Each assessment is further sub-partitioned by date. Each parquet file appears to contain only rows for a single record ID. The D&T team uses PyArrow and Pandas to aggregate the data into a single dataframe. We can either adopt that approach as well, or use something like https://github.com/apache/parquet-mr. (Note: this may or may not require us to pull in some Hadoop dependencies.)
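For reference, a minimal sketch of the PyArrow/Pandas approach. The app ID, study ID, and sub-table type in the S3 path are hypothetical placeholders; the sketch assumes the hive-style partition keys (assessmentid, year, month, day) match the key layout above:

import pyarrow.dataset as ds

# Read every Parquet fragment under one sub-table ("type") as a single
# logical dataset. The app ID, study ID, and type here are placeholders.
dataset = ds.dataset(
    "s3://bridge-downstream-parquet/bridge-downstream/my-app/my-study"
    "/parquet/taskResult",
    format="parquet",
    partitioning="hive",  # picks up assessmentid/year/month/day from the key
)

# Aggregate all days and all record IDs for one assessment into one dataframe.
df = dataset.to_table(filter=ds.field("assessmentid") == "number-match").to_pandas()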

Open Questions

How frequently is JSON to Parquet triggered? As per https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2749759500/JSON+to+Parquet+Workflow#Scheduled-Trigger, the JSON to Parquet job is scheduled using a Cron job. What is the current schedule, and where is it configured? Can it be changed to trigger for each individual upload?

How do we aggregate data? We want to return one table per app per study per assessment revision that contains all rows for all participants across all days. Is this already handled in Bridge Downstream? Where is this output stored?

How is Bridge Downstream being monitored? Are there metrics? Dashboards? Alarms? How can Bridge team get access to these?

Should Bridge team have access to Bridge Downstream’s AWS accounts?

Requirements For Open Bridge

Results in tabular format in Researcher UI - We want research data available in the Researcher UI, one table per app per study per assessment revision. This table should include all rows for all participants in the study for all dates in the study. The exact format of the data varies from assessment to assessment, but should include scores if they are available.

Fast turnaround times - When the participant uploads data on their phone, we want Bridge to Export, Validate, Score (if available), and make the data available in the Researcher UI with as little delay as possible.

ARC Measures

Clone the Bridge Downstream pipeline for Open Bridge. We have an indefinite code freeze for MTB. However, we want to set up a new Bridge Downstream pipeline for Open Bridge. This will be used to convert the raw JSON data into Parquet.

Note that the SQS triggers are currently not hooked up to anything. We’ll need to use the existing cron trigger (although we might want to consider running it more often than hourly, e.g. every 5 minutes). Bridge Downstream can also be kicked off manually by the Parquet-to-CSV Worker job (see below).

Parquet-to-CSV Worker. This worker loads the Parquet fragments into a single data frame, then exports a snapshot of the data in CSV format. It stores this CSV in S3 and writes the URL and the job status to Bridge Server, so that Bridge Server knows about it.

Note that because an assessment can result in multiple tables and multiple CSVs, this API will return a zip file of CSVs.
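A rough sketch of what this worker could look like. All names are hypothetical (including the idea that sub-table names map to S3 prefixes), and error handling and job bookkeeping are omitted:

import io
import zipfile
from typing import Dict

import boto3
import pyarrow.dataset as ds

s3 = boto3.client("s3")

def export_csv_snapshot(sub_tables: Dict[str, str], bucket: str, key: str) -> str:
    """Export each sub-table to CSV and bundle all CSVs into one zip in S3."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, path in sub_tables.items():
            # Load every Parquet fragment for this sub-table into one dataframe.
            df = ds.dataset(path, format="parquet", partitioning="hive").to_table().to_pandas()
            zf.writestr(name + ".csv", df.to_csv(index=False))
    buffer.seek(0)
    s3.upload_fileobj(buffer, bucket, key)
    # The worker would then report this URL and the job status to Bridge Server.
    return "s3://" + bucket + "/" + key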

Scoring Code. We only have 3 ARC measures, and only one of them has meaningful scoring code. For V1, we should have a hardcoded mapping from assessment ID to function calls that live inside the Worker. As part of the Parquet-to-CSV Worker, we should call this function to score the data set and return the scores as a separate file in the zip file.
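For illustration, the hardcoded mapping could be as simple as the sketch below. The assessment ID, function body, and column names are hypothetical placeholders; the real function would implement the measure’s actual scoring algorithm:

from typing import Optional

import pandas as pd

def score_number_match(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical scoring: aggregate raw rows into one score per record.
    return df.groupby("recordid", as_index=False)["rawscore"].sum()

# V1: hardcoded mapping from assessment ID to scoring function inside the Worker.
SCORING_FUNCTIONS = {
    "number-match": score_number_match,
    # ...one entry per ARC measure that has meaningful scoring code.
}

def score(assessment_id: str, df: pd.DataFrame) -> Optional[pd.DataFrame]:
    scorer = SCORING_FUNCTIONS.get(assessment_id)
    # With no scorer registered, the zip file simply omits the scores CSV.
    return scorer(df) if scorer is not None else None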

In 2024, we’ll need to do some work to make this scalable. This may involve a solution to run 3rd party scoring code in a sandboxed environment. We don’t need to solve this until 2024.

Bridge APIs to request a snapshot of the study data.

Researcher UI changes to expose the request and result API for CSV Snapshots and to download the resulting file.

Additional work items:

“Architectural Diagram” and design notes 2023-10-25

Addendum 1: Who Needs Parquet

We’ve got power users using JSON and basic users using the simple CSV summaries in the Researcher UI. However, we don’t really know who wants to use Parquet. If no one needs Parquet, then we can save a lot of work by skipping Bridge Downstream for Q4. Since we plan to re-build a lot of this in 2024 anyway, we wouldn’t be saving any work by standing up Bridge Downstream right now.

Specifically, we can cut the following Jiras from our plan:

The following Jiras would change:

We would need to add a Jira for storing the intermediate tabular results that the JSON-to-CSV Worker generates. This work would be costed at 3.

During the discussion, we also identified additional work that would need to be done. These work items would have had to be done whether or not we use Parquet, so they don’t change the cost of the project.

Pros

Cons

Surveys

We can easily build survey processing on top of the Bridge Downstream pipeline as described above for ARC measures.

Survey Schemas. We need survey schemas to validate the raw JSON data.
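As a sketch of what that validation could look like, using the Python jsonschema package (the schema, question identifiers, and limits below are hypothetical):

from jsonschema import validate

# Hypothetical schema for a two-question survey: each key is a question
# identifier, each value constrains the participant's answer.
SURVEY_SCHEMA = {
    "type": "object",
    "properties": {
        "mood": {"type": "string", "maxLength": 500},
        "sleep_hours": {"type": "number", "minimum": 0, "maximum": 24},
    },
    "required": ["mood", "sleep_hours"],
    "additionalProperties": False,
}

# Raises jsonschema.ValidationError if the raw JSON doesn't match the schema.
validate(instance={"mood": "rested", "sleep_hours": 7.5}, schema=SURVEY_SCHEMA)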

Verify Survey Result Format. We want the Survey CSV in a format where each row is one participant’s answers to a survey and each column is one survey question. For JSON-to-Parquet to generate this, we need survey results to be in a JSON file where each key is the survey question identifier and each value is the survey answer.
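For example, a survey result in this shape might look like the following (question identifiers and answers are hypothetical):

{
  "mood" : "rested",
  "sleep_hours" : 7.5
}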

iOS already supports this, but we would need to do Android work to support this.

Free text answers. We support this, but we may need to add max character limits to the answers. It’s currently unclear if Parquet has field size limits or if there are other limitations in the app or in the Researcher UI. However, a max character limit of either 250 or 500 should be sufficient for all use cases.

If this turns out to be a lot of work, we always have the option to simply disallow free text answers in V1 and postpone the feature to 2024.

Multiple choice, multiple answers. A multiple choice question where the participant can select multiple answers from the list of choices. One limitation: if we represent this as a JSON array, JSON-to-Parquet will separate the array into its own table, whereas we want it in its own column.

One possible workaround is to represent the answer as a comma-separated list instead. This works because multiple choice answers contain both the display text (which is shown to the participant) and the value (an internal value representing the answer). We can restrict the value to only contain alphanumeric characters and spaces. This means that we will never have commas in the survey answers, so we can always use a comma-separated list.
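A sketch of this encoding (the answer values are hypothetical; the regex enforces the alphanumeric-plus-spaces restriction described above):

import re

VALUE_PATTERN = re.compile(r"[A-Za-z0-9 ]+")

def encode_multi_choice(values: list) -> str:
    # Values are restricted to alphanumeric characters and spaces, so a comma
    # can never appear inside a value and the join is unambiguous.
    for value in values:
        if not VALUE_PATTERN.fullmatch(value):
            raise ValueError("illegal multiple-choice value: " + repr(value))
    return ",".join(values)

# "fatigue,joint pain" round-trips cleanly with str.split(",").
encoded = encode_multi_choice(["fatigue", "joint pain"])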

This will require work in iOS and Android, and possibly in the Researcher UI as well.

If this turns out to be a lot of work, we have the option to simply disallow multiple answers and postpone the feature to 2024.

Jira Task Breakdown

ARC Epic:

Surveys Epic:

Researcher UI

Bridge Back-End

(Server and Worker work)

(1)

(2)

(3)

(5)

(3)

(5)

(3)

(3)

(1)

2024 Improvements

(3)

Defunct

(8)

(8)

(1)

(1)

(3)

Additional Notes

Bridge Downstream code base: https://github.com/Sage-Bionetworks/BridgeDownstream

Bridge Downstream getting started: Getting Started

Bridge Downstream developer docs: /wiki/spaces/BD/pages/2746351624

Bridge Downstream setting up a new study: Setting Up a New Study

How to read parquet data on the command line:

  1. Pre-req: pip needs to be installed on your Mac. The fastest way to install it is port install py-pip (via MacPorts). This will also install Python, if necessary.

  2. pip install parquet-cli - You only need to do this once. This will put a command-line utility called parq in your bin.

  3. parq <filename> to get metadata for a parquet file.

  4. parq <filename> --head N or parq <filename> --tail N to read the actual parquet data.

  5. Alternate solution: https://github.com/devinrsmith/deephaven-parquet-viewer allows you to view parquet files in your browser. Requires Docker.

  6. Alternate solution: the PyCharm / IntelliJ IDE.

Long-term, we can look into something like https://hevodata.com/learn/parquet-to-postgresql/ for storing and managing Parquet data.