This design doc covers:
Investigation of Bridge Downstream as Bridge team prepares to take ownership.
Exploration of if/how we can generalize Bridge Downstream to include ARC measures.
Exploration of if/how we can use Bridge Downstream to process surveys.
Bridge Downstream Learnings
Example raw data for taskIdentifier “Number Match”. The file taskData.json looks like:
{ "taskRunUUID" : <guid>, "schemaIdentifier" : "MTB_NumberMatch", "testVersion" : "1.5.0", "stepHistory" : [ //... ], "locale" : "en_US", "endDate" : <ISO8601 Date-Time>, "identifier" : "Number Match", "type" : "mssTaskResult", "scores" : { "rawScore" : 27 }, "taskStatus" : [ "completed" ], "startDate" : <ISO8601 Date-Time>, "taskName" : "Number Match", "userInteractions" : [ //... ], "steps" : [ //... ] }
This produces Parquet files with the following path structure:
bridge-downstream-parquet/bridge-downstream/<appId>/<studyId>/parquet/<type>/assessmentid=number-match/year=<YYYY>/month=<MM>/day=<DD>/part-<5-digit num>-<guid>.c000.snappy.parquet
Note that the relevant types for this assessment are:
dataset_sharedschema_v1/
dataset_sharedschema_v1_stepHistory/
dataset_sharedschema_v1_steps/
dataset_sharedschema_v1_taskStatus/
dataset_sharedschema_v1_userInteractions/
dataset_sharedschema_v1_userInteractions_controlEvent/
Each JSON file in an assessment is broken into multiple sub-tables. Each assessment is further sub-partitioned by date. Each parquet file appears to contain only rows for a single record ID. D&T team uses PyArrow and Pandas to aggregate the data into a whole dataframe. We can either use this solution as well, or we can use something like https://github.com/apache/parquet-mr. (Note: This may or may not require us to pull in some Hadoop dependencies.)
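As a reference for that aggregation step, here is a minimal sketch (not the D&T team's actual script) of reading one assessment's partitioned Parquet data into a single Pandas dataframe with PyArrow; the bucket and prefix follow the layout shown above, with the <appId>/<studyId> placeholders left in.
```python
# Minimal sketch: read every partition of one dataset type for one assessment
# into a single Pandas dataframe. Bucket/prefix follow the layout above;
# <appId> and <studyId> are placeholders.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem()
prefix = ("bridge-downstream-parquet/bridge-downstream/<appId>/<studyId>/parquet/"
          "dataset_sharedschema_v1/assessmentid=number-match")

# partitioning="hive" turns the year=/month=/day= directories into columns,
# so the resulting dataframe spans all dates and record IDs.
dataset = ds.dataset(prefix, filesystem=s3, format="parquet", partitioning="hive")
df = dataset.to_table().to_pandas()
print(df.shape)
```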
Open Questions
How frequently is JSON to Parquet triggered? As per https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2749759500/JSON+to+Parquet+Workflow#Scheduled-Trigger, the JSON to Parquet job is scheduled using a Cron job. What is the current schedule, and where is it configured? Can it be changed to trigger for each individual upload?
Answer: The cron job is currently configured to run every hour. The SQS trigger is not hooked up to Bridge Downstream, so the pipeline relies on the cron trigger alone.
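For reference, a boto3 sketch of where that schedule lives and how it could be tightened; the trigger name here is a placeholder, not the real Bridge Downstream resource name.
```python
# Hypothetical sketch: inspect and change the schedule of the Glue scheduled trigger
# via boto3. The trigger name is a placeholder, not the actual resource name.
import boto3

glue = boto3.client("glue")

trigger = glue.get_trigger(Name="bridge-downstream-json-to-parquet")
print(trigger["Trigger"].get("Schedule"))  # e.g. "cron(0 * * * ? *)" for hourly

# Shorten the schedule (e.g. every 5 minutes), or drop the schedule entirely
# if we switch to triggering per upload via SQS/Lambda instead.
glue.update_trigger(
    Name="bridge-downstream-json-to-parquet",
    TriggerUpdate={"Schedule": "cron(0/5 * * * ? *)"},
)
```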
How do we aggregate data? We want to return one table per app per study per assessment revision that contains all rows for all participants across all days. Is this already handled in Bridge Downstream? Where is this output stored?
Answer: We use PyArrow and Pandas to aggregate the data into a single data frame.
How is Bridge Downstream being monitored? Are there metrics? Dashboards? Alarms? How can Bridge team get access to these?
Answer: The built-in Glue dashboards cover the Glue jobs. (We already know how to monitor Lambda; SQS and SNS are trivial.)
Should Bridge team have access to Bridge Downstream’s AWS accounts?
Requirements For Open Bridge
Results in tabular format in Researcher UI - We want research data available in the Researcher UI, one table per app per study per assessment revision. This table should include all rows for all participants in the study for all dates in the study. The exact format of the data varies from assessment to assessment, but should include scores if they are available.
MVP is to return downloadable CSVs in the Researcher UI.
Note that we might not be able to achieve one and only one table per assessment. Some assessments contain multiple data sets. For example, the Number Match assessment contains multiple rows for stepHistory, steps, and userInteractions. These rows may or may not match one-to-one, and there may or may not be any meaningful way to combine these disparate data sets into a single table. We might need to provide a set of tables to represent an assessment version.
Regardless, the tables should not be sub-partitioned on date. Each CSV should contain all rows for all participants taking that assessment in the study.
Fast turnaround times - When the participant uploads data on their phone, we want Bridge to Export, Validate, Score (if available), and make the data available in the Researcher UI with as little delay as possible.
This precludes nightly or even hourly processing, as either would add significant delay to returning the data.
How much of a delay is acceptable? Minutes? This will determine whether we can do batch processing on a short schedule (e.g. once every 5 minutes), or whether we need to trigger a job for each individual upload.
Note that given the current upload rates (peak less than 100 per day for all of MTB, and averaging much lower), there’s probably no difference between a batch job every 5 minutes and triggering per individual upload.
ARC Measures
Things we will have to do regardless of our design decisions:
Modify the Glue job to trigger off each upload instead of a cron schedule. (Or move the Python script to something that’s not Glue.)
Aggregate data across all dates and participants.
Write the aggregated data back to Bridge.
Write the post-processing status to Adherence.
Bridge API to download the aggregated data as a CSV.
Researcher UI to surface the API to download a CSV.
Note that the Bridge Downstream code hardcodes appId=mobile-toolbox. We want to un-hardcode this and either read the appId from the FileEntity annotations, or propagate the appId through each step so that we never lose it.
Also, the appId is currently being set by the app. This should instead be a top-level property in the Exporter itself.
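A rough sketch of the un-hardcoded lookup, assuming the exporter writes an appId annotation on each FileEntity (the annotation key is an assumption, not an existing contract):
```python
# Hypothetical sketch: resolve appId from the Synapse FileEntity's annotations
# instead of hardcoding appId=mobile-toolbox. Assumes an "appId" annotation is
# written by the exporter; the key name is an assumption.
import synapseclient

syn = synapseclient.Synapse()
syn.login()

def get_app_id(file_entity_id: str, default: str = "mobile-toolbox") -> str:
    annotations = syn.get_annotations(file_entity_id)  # dict-like: key -> list of values
    values = annotations.get("appId", [])
    return values[0] if values else default
```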
Proposed Design: Replace Parquet
The schema validation works fine. However, the JSON to Parquet layer doesn't quite do what we want: we still need to aggregate the data across dates and record IDs. (A sketch of landing the aggregated data in a Synapse table follows the pros and cons below.)
Pros:
It will be easier to aggregate the data if we build it in as part of JSON-to-Table instead of making it an extra step at the end of the pipeline.
Allows us to use an existing table solution, such as Synapse tables, which also allow us to easily export to CSV.
Cons:
Will need to re-write the JSON-to-Table CSV code.
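To make the Synapse-table option concrete, a minimal sketch assuming we aggregate into a dataframe (as in the earlier PyArrow sketch) and store it as a Synapse table; the project ID and table name are placeholders.
```python
# Minimal sketch, assuming "df" stands in for the aggregated per-assessment dataframe.
# The project ID (syn00000000) and table name are placeholders.
import pandas as pd
import synapseclient
from synapseclient import table

df = pd.DataFrame({"recordId": ["r1", "r2"], "rawScore": [27, 31]})  # stand-in data

syn = synapseclient.Synapse()
syn.login()

# One table per app/study/assessment revision.
t = syn.store(table.build_table("number-match-v1", "syn00000000", df))

# The whole table can then be exported to CSV for the Researcher UI download.
results = syn.tableQuery(f"SELECT * FROM {t.tableId}")
print(results.filepath)  # local path of the downloaded CSV
```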
Alternate Design: Keep Parquet
We keep the Bridge Downstream pipeline exactly as is, Parquet and all, and add our own code at the end to aggregate the data and send it back to Bridge.
Pros:
Don’t need to re-write the JSON to tabular data workflow.
Cons:
Parquet is poorly supported in Java, and may or may not require us to pull in Hadoop as a dependency.
Parquet is a file format, so appending to Parquet tables will involve a lot of file I/O.
The current implementation of Parquet doesn’t prevent table fragments with different columns from appearing in the same partition, and the fragments don’t contain the assessment ID or revision. We will need to solve this problem in Parquet.
Alternate Design 2: Do both
Keep Exporter 3 with push to Synapse
For any supported assessment or survey, build an “answers.json” file that is a flat dictionary of scores/answers plus metadata (see the sketch after this list).
Build a unit-testing setup that can use Python or R scripts (which researchers prefer) and port the scoring to Kotlin for on-device, cross-platform scoring.
Write the “answers” to the Adherence Record as a dictionary.
Write an “answers.json” file to the archive.
Add a back-end service to get the “answers.json” file into a table
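As an illustration only, a sketch of what building and writing the flat “answers.json” dictionary could look like; every field name here is an assumption, not an agreed-upon schema.
```python
# Hypothetical sketch of the "answers.json" idea: a flat dictionary of scores/answers
# plus metadata. Field names are assumptions, not an agreed-upon schema.
import json
from datetime import datetime, timezone

def build_answers(record_id: str, assessment_id: str, assessment_revision: int,
                  scores: dict) -> dict:
    answers = {
        "recordId": record_id,
        "assessmentId": assessment_id,
        "assessmentRevision": assessment_revision,
        "scoredOn": datetime.now(timezone.utc).isoformat(),
    }
    answers.update(scores)  # e.g. {"rawScore": 27} for Number Match
    return answers

# Write the file into the archive (the same dict could go on the Adherence Record).
with open("answers.json", "w") as f:
    json.dump(build_answers("rec-123", "number-match", 1, {"rawScore": 27}), f, indent=2)
```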
Pros:
Feedback to the participant (if desired)
Easier to aggregate the data if we build it in as part of JSON-to-Table
Allows us to use an existing table solution, such as Synapse tables, which also allow us to easily export to CSV
Cons:
Why did we change from Bridge 1.0/Exporter 2.0 again?
Will require robust unit testing to verify that the Kotlin ports match the original R/Python scoring.
Surveys
Surveys are even easier than ARC measures. We already know what our survey table format looks like. (This is one of the things Exporter 2.0 actually did well. However, that survey engine is currently deprecated, as are Exporter 2.0 schemas.)
It’s not clear what JSON-to-Parquet gets us in this use case. Best case scenario, it converts survey JSON to tabular format similar to what we already expect, but we would end up with a bunch of 1-row files that we would need to aggregate. And we’d still need to integrate with the new Survey Builder to determine what table columns we want.
It would probably be simpler to write our own Survey-to-Table implementation, built specifically for the new Survey Builder and survey format.
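A rough sketch of what such a Survey-to-Table step might do; the input shape here is an assumption about the new survey format, not its actual schema.
```python
# Rough sketch: flatten one survey result into a single row keyed by question
# identifier, then write all rows for a survey revision into one CSV.
# The input shape is an assumption, not the actual Survey Builder schema.
import csv

def survey_to_row(survey_result: dict) -> dict:
    row = {"recordId": survey_result["recordId"]}
    for answer in survey_result.get("answers", []):
        row[answer["questionIdentifier"]] = answer["value"]
    return row

rows = [survey_to_row(r) for r in [
    {"recordId": "rec-1", "answers": [{"questionIdentifier": "mood", "value": 3}]},
    {"recordId": "rec-2", "answers": [{"questionIdentifier": "mood", "value": 5}]},
]]

# One CSV per survey revision: all participants, all dates.
fieldnames = sorted({key for row in rows for key in row})
with open("survey.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```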
Additional Notes
Bridge Downstream code base: https://github.com/Sage-Bionetworks/BridgeDownstream
Bridge Downstream getting started: Getting Started
Bridge Downstream developer docs: /wiki/spaces/BD/pages/2746351624
How to read parquet data in the command line:
Pre-req: pip needs to be installed on your Mac. The fastest way to install it is port install py-pip. This will also install Python, if necessary.
pip install parquet-cli - You only need to do this once. This will put a command-line utility called parq in your bin.
parq <filename> to get metadata for a parquet file.
parq <filename> --head N or parq <filename> --tail N to read the actual parquet data.
Alternate solution: https://github.com/devinrsmith/deephaven-parquet-viewer allows you to view parquet files in your browser. Requires Docker.