This design doc covers:
Investigation of Bridge Downstream as Bridge team prepares to take ownership.
Exploration of if/how we can generalize Bridge Downstream to include ARC measures.
Exploration of if/how we can use Bridge Downstream to include surveys.
Bridge Downstream Learnings
Example raw data for taskIdentifier “Number Match”. The file taskData.json looks like:
{ "taskRunUUID" : <guid>, "schemaIdentifier" : "MTB_NumberMatch", "testVersion" : "1.5.0", "stepHistory" : [ //... ], "locale" : "en_US", "endDate" : <ISO8601 Date-Time>, "identifier" : "Number Match", "type" : "mssTaskResult", "scores" : { "rawScore" : 27 }, "taskStatus" : [ "completed" ], "startDate" : <ISO8601 Date-Time>, "taskName" : "Number Match", "userInteractions" : [ //... ], "steps" : [ //... ] }
This creates the following files in parquet in the following form: bridge-downstream-parquet/bridge-downstream/<appId>/<studyId>/parquet/<type>/assessmentid=number-match/year=<YYYY>/month=<MM>/day=<DD>/part-<5-digit num>-<guid>.c000.snappy.parquet
Note that the relevant types for this assessment are:
dataset_sharedschema_v1/
dataset_sharedschema_v1_stepHistory/
dataset_sharedschema_v1_steps/
dataset_sharedschema_v1_taskStatus/
dataset_sharedschema_v1_userInteractions/
dataset_sharedschema_v1_userInteractions_controlEvent/
Each JSON file in an assessment is broken into multiple sub-tables. Each assessment is further sub-partitioned by date. Each parquet file appears to contain only rows for a single record ID. D&T team uses PyArrow and Pandas to aggregate the data into a whole dataframe. We can either use this solution as well, or we can use something like https://github.com/apache/parquet-mr. (Note: This may or may not require us to pull in some Hadoop dependencies.)
Open Questions
How frequently is JSON to Parquet triggered? As per https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2749759500/JSON+to+Parquet+Workflow#Scheduled-Trigger, the JSON to Parquet job is scheduled using a Cron job. What is the current schedule, and where is it configured? Can it be changed to trigger for each individual upload?
Answer: Currently, the cron job is configured to run every hour. The SQS trigger is currently not hooked up to Bridge Downstream, and it relies only on the Cron trigger.
How do we aggregate data? We want to return one table per app per study per assessment revision that contains all rows for all participants across all days. Is this already handled in Bridge Downstream? Where is this output stored?
Answer: We use PyArrow and Pandas to aggregate the data into a single data frame.
How is Bridge Downstream being monitored? Are there metrics? Dashboards? Alarms? How can Bridge team get access to these?
Answer: Built-in Glue dashboards for the Glue jobs. (Lambda we already know how to monitor. SQS and SNS are trivial.)
Should Bridge team have access to Bridge Downstream’s AWS accounts?
Requirements For Open Bridge
Results in tabular format in Researcher UI - We want research data available in the Researcher UI, one table per app per study per assessment revision. This table should include all rows for all participants in the study for all dates in the study. The exact format of the data varies from assessment to assessment, but should include scores if they are available.
MVP is to return downloadable CSVs in the Researcher UI.
Note that we might not be able to achieve one and only one table per assessment. Some assessments contain multiple data sets. For example, the Number Match assessment includes contains multiple rows for stepHistory, steps, and userInteractions. These rows may or may not match one-to-one, and there may or may not be any meaningful way to combine these disparate data sets into a single table. We might need to provide a set of tables to represent an assessment version.
Regardless, the tables should not be sub-partitioned on date. Each CSV should contain all rows for all participants taking that assessment in the study.
Fast turnaround times - When the participant uploads data on their phone, we want Bridge to Export, Validate, Score (if available), and make the data available in the Researcher UI with as little delay as possible.
This precludes nightly processing or hourly processing, as this would add significant delays to returning the data.
How much of a delay is acceptable? Minutes? This will determine if we can do batch processing on a short schedule (eg once per 5 minutes), or if we need to trigger a job for each individual upload.
Note that given the current upload rates (peak less than 100 per day for all of MTB, and averaging much lower), there’s probably no difference between a batch job every 5 minutes and triggering per individual upload.
ARC Measures
Clone the Bridge Downstream pipeline for Open Bridge. We have an indefinite code freeze for MTB. However, we want to set up a new Bridge Downstream pipeline for Open Bridge. This will be used to convert the raw JSON data into Parquet.
Note that the SQS triggers are currently not hooked up to anything. We’ll need to use existing Cron trigger (although we might want to consider increasing the interval from hourly to every 5 minutes). Bridge Downstream can also be kicked off manually by the Parquet to CSV worker job (see below).
Parquet-to-CSV Worker. This worker loads the parquet fragments into a single Parquet data frame, then exports a snapshot of the Parquet data in CSV format. It stores this CSV in S3 and writes the URL and the job status to Bridge Server, so that Bridge Server knows about it.
Note that because an assessment can result in multiple tables and multiple CSVs, this API will return a zip file of CSVs.
Possible Design 1: We write this worker purely in Java, using something like https://github.com/apache/parquet-mr. Testing will need to be done to see if this does, in fact, pull in Hadoop dependencies, and if so, how much of a problem that is. We also need to verify that this Parquet library aggregates the data frame as we expect and can filter and can export to CSV.
Possible Design 2: Run PyArrow and Pandas in Java. The upside is that these libraries are what the Scoring code uses, so we know it does what we want. Will need to verify that we can run Python in a Java environment (probably yes) and whether these specific libraries will run in a Java environment (hopefully yes).
Possible Design 3: Set up a new Python Worker to run PyArrow and Pandas natively. This will require us to set up a new stack in AWS, complete with all the infrastructure and monitoring. However, running a native Python environment may be easier than running Python in Java.
Scoring Code. We only have 3 ARC measures, and only one of them has meaningful scoring code. For V1, we should have a hardcoded mapping from assessment ID to function calls that live inside the Worker. As part of the Parquet-to-CSV Worker, we should call this function to score the data set and return the scores as a separate file in the zip file.
In 2024, we’ll need to do some work to make this scalable. This may involve a solution to run 3rd party scoring code in a sandboxed environment. We don’t need to solve this until 2024.
Bridge APIs to request a snapshot of the study data.
Bridge API to request the CSV Snapshot. This is scoped to the study and takes in an optional date range and a list of assessments to include in the snapshot (left blank to include everything). This triggers the Parquet-to-CSV Worker and returns a job ID.
Bridge API for the Parquet-to-CSV Worker to mark a job ID as complete and set the URL for the result.
Bridge API to get the CSV Snapshot result. Note that if the Worker hasn’t marked the job ID as complete, this may return an incomplete status with no data URL.
These APIs are scoped to the study. We can currently fake this by using study_coordinator or study_designer roles.
Optionally, we add a new study_researcher role, which has the ability to read the research data and the participant roster, as a separate role from the study coordinators and designers. This would be an almost trivial amount of work, but would be replaced by the permissions re-work in 2024.
Researcher UI changes to expose the request and result API for CSV Snapshots and to download the resulting file.
Additional work items:
Write the post-processing status to Adherence.
Note that the Bridge Downstream code hardcodes appId=mobile-toolbox. We want to un-hardcode this and either read the appId from the FileEntity annotations, or propagate the appId through each step so that we never lose it.
Also, the appId is currently being set by the app. This should instead be a top-level property in the Exporter itself.
Surveys
We can easily build survey processing on top of the Bridge Downstream pipeline as described above for ARC measures.
Survey Schemas. We need survey schemas to validate the raw JSON data.
Possible Design 1: The Researcher UI automatically creates the survey schema.
Possible Design 2: The app includes the survey schema in-line in the survey results. This will require changes in the mobile apps as well as in Bridge Downstream to read the in-line schema instead of following a link.
Possible Design 3: A generic survey schema that simply validates that we have key-value pairs. This means that the survey results are not fully validated, but since the survey JSON format is relatively simple, we don’t lose much by not validating it. This is also the cheapest of the 3 solutions.
Verify Survey Result Format. We want the Survey CSV in a format where each row is one participant’s answers to a survey and each column is one survey question. For JSON-to-Parquet to generate this, we need survey results to be in a JSON file where each key is the survey question identifier and each value is the survey answer.
iOS already supports this, but we would need to do Android work to support this.
Free text answers. We support this, but we may need to add max character limits to the answers. It’s currently unclear if Parquet has field size limits or if there are other limitations in the app or in the Researcher UI. However, a max character limit of either 250 or 500 should be sufficient for all use cases.
If this turns out to be a lot of work, we always the option to simply disallow free text answers in V1 and postpone the feature to 2024.
Multiple choice, multiple answers. A multiple choice question where the participant can select multiple answers from the list of choices. One limitation is if we represent this as a JSON Array, JSON-to-Parquet will separate this array into its own table, whereas we want it in its own column.
One possible workaround is to represent the answer is a comma separated list instead. This can be accomplished because multiple choice answers contain both the display text (which is shown to the participant) and the value (an internal value representing the answer). We can restrict the value to only contain alphanumeric spaces and characters. This means that we will never have commas in the survey answers, so we can always use a comma-separated list.
This will require work in iOS and Android, and possibly in the Researcher UI as well.
If this turns out to be a lot of work, we have the option to simply disallow multiple answers and postpone the feature to 2024.
Jira Task Breakdown
TODO
Additional Notes
Bridge Downstream code base:https://github.com/Sage-Bionetworks/BridgeDownstream
Bridge Downstream getting started: Getting Started
Bridge Downstream developer docs: /wiki/spaces/BD/pages/2746351624
Bridge Downstream setting up a new study: Setting Up a New Study
How to read parquet data in the command line:
Pre-req: pip needs to be installed on your Mac. The fastest way to install it is
port install py-pip
. This will also install Python, if necessary.pip install parquet-cli
- You only need to do this once. This will put a command-line utility calledparq
in your bin.parq <filename>
to get metadata for a parquet file.parq <filename> --head N
orparq <filename> --tail N
to read the actual parquet data.Alternate solution: https://github.com/devinrsmith/deephaven-parquet-viewer Allows you to view parquet files in your browser. Requires Docker.
Alternate solution: PyCharm for IntelliJ IDE.
Long-term, we can look into something like https://hevodata.com/learn/parquet-to-postgresql/ for storing and managing Parquet data.