This design doc covers:
Investigation of Bridge Downstream as Bridge team prepares to take ownership.
Exploration of if/how we can generalize Bridge Downstream to include ARC measures.
Exploration of if/how we can use Bridge Downstream to include surveys.
Note: A lot of this is preliminary and depends on discussion with Bridge Downstream team and Bridge team.
Bridge Downstream Learnings
Example raw data for taskIdentifier “Number Match”. The file taskData.json looks like:
{ "taskRunUUID" : <guid>, "schemaIdentifier" : "MTB_NumberMatch", "testVersion" : "1.5.0", "stepHistory" : [ //... ], "locale" : "en_US", "endDate" : <ISO8601 Date-Time>, "identifier" : "Number Match", "type" : "mssTaskResult", "scores" : { "rawScore" : 27 }, "taskStatus" : [ "completed" ], "startDate" : <ISO8601 Date-Time>, "taskName" : "Number Match", "userInteractions" : [ //... ], "steps" : [ //... ] }
This creates the following files in parquet in the following form: bridge-downstream-parquet/bridge-downstream/<appId>/<studyId>/parquet/<type>/assessmentid=number-match/year=<YYYY>/month=<MM>/day=<DD>/part-<5-digit num>-<guid>.c000.snappy.parquet
Note that the relevant types for this assessment are:
dataset_sharedschema_v1/
dataset_sharedschema_v1_stepHistory/
dataset_sharedschema_v1_steps/
dataset_sharedschema_v1_taskStatus/
dataset_sharedschema_v1_userInteractions/
dataset_sharedschema_v1_userInteractions_controlEvent/
We observe the following:
Each JSON file in an assessment is broken into multiple sub-tables.
Each assessment is further sub-partitioned by date.
Each parquet file appears to contain only rows for a single record ID.
The data is partitioned by assessment ID, but not by assessment version. As a result, each day=<DD> directory can contain files with different columns.
Open Questions
How frequently is JSON to Parquet triggered? As per https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2749759500/JSON+to+Parquet+Workflow#Scheduled-Trigger, the JSON to Parquet job is scheduled using a Cron job. What is the current schedule, and where is it configured? Can it be changed to trigger for each individual upload?
How do we aggregate data? We want to return one table per app per study per assessment revision that contains all rows for all participants across all days. Is this already handled in Bridge Downstream? Where is this output stored?
How is Bridge Downstream being monitored? Are there metrics? Dashboards? Alarms? How can Bridge team get access to these?
Should Bridge team have access to Bridge Downstream’s AWS accounts?
Requirements For Open Bridge
Results in tabular format in Researcher UI - We want research data available in the Researcher UI, one table per app per study per assessment revision. This table should include all rows for all participants in the study for all dates in the study. The exact format of the data varies from assessment to assessment, but should include scores if they are available.
MVP is to return downloadable CSVs in the Researcher UI.
Note that we might not be able to achieve one and only one table per assessment. Some assessments contain multiple data sets. For example, the Number Match assessment includes contains multiple rows for stepHistory, steps, and userInteractions. These rows may or may not match one-to-one, and there may or may not be any meaningful way to combine these disparate data sets into a single table. We might need to provide a set of tables to represent an assessment version.
Regardless, the tables should not be sub-partitioned on date. Each CSV should contain all rows for all participants taking that assessment in the study.
Fast turnaround times - When the participant uploads data on their phone, we want Bridge to Export, Validate, Score (if available), and make the data available in the Researcher UI with as little delay as possible.
This precludes nightly processing or hourly processing, as this would add significant delays to returning the data.
How much of a delay is acceptable? Minutes? This will determine if we can do batch processing on a short schedule (eg once per 5 minutes), or if we need to trigger a job for each individual upload.
Note that given the current upload rates (peak less than 100 per day for all of MTB, and averaging much lower), there’s probably no difference between a batch job every 5 minutes and triggering per individual upload.
ARC Measures
Proposed Design: Replace Parquet
The schema validation stuff works just fine. However, the JSON to Parquet layer doesn’t quite do what we want it to do. In particular, we still need to aggregate the data across dates and record IDs.
Pros:
It will be easier to aggregate the data if we build it in as part of JSON-to-Table instead of making it an extra step at the end of the pipeline.
Allows us to use an existing table solution, such as Synapse tables, which also allow us to easily export to CSV.
Cons:
Will need to re-write the JSON-to-Table CSV code.
Alternate Design: Keep Parquet
We keep the Bridge Downstream pipeline exactly as is, Parquet and all, and add our own code at the end to aggregate the data and send it back to Bridge.
Pros:
Don’t need to re-write the JSON to tabular data workflow.
Cons:
Parquet is poorly supported in Java, and may or may not require us to pull in Hadoop as a dependency.
Parquet is a file format, so appending to Parquet tables will involve a lot of file I/O.
The current implementation of Parquet doesn’t prevent table fragments with different columns from appearing in the same partition, and the fragments don’t contain the assessment ID or revision. We will need to solve this problem in Parquet.
Surveys
Surveys are even easier than ARC measures. We already know what our survey table format looks like. (This is one of the things Exporter 2.0 actually did well. However, that survey engine is currently deprecated, as are Exporter 2.0 schemas.)
It’s not clear what JSON-to-Parquet gets us in this use case. Best case scenario, it converts survey JSON to tabular format similar to what we already expect, but we would end up with a bunch of 1-row files that we would need to aggregate. And we’d still need to integrate with the new Survey Builder to determine what table columns we want.
It would probably be simpler to write our custom Survey-to-Table implementation, custom-built to the new Survey Builder and survey format.