This design doc covers:
Investigation of Bridge Downstream as Bridge team prepares to take ownership.
Exploration of if/how we can generalize Bridge Downstream to include ARC measures.
Exploration of if/how we can use Bridge Downstream to include surveys.
Bridge Downstream Learnings
Example raw data for taskIdentifier “Number Match”. The file taskData.json looks like:
{ "taskRunUUID" : <guid>, "schemaIdentifier" : "MTB_NumberMatch", "testVersion" : "1.5.0", "stepHistory" : [ //... ], "locale" : "en_US", "endDate" : <ISO8601 Date-Time>, "identifier" : "Number Match", "type" : "mssTaskResult", "scores" : { "rawScore" : 27 }, "taskStatus" : [ "completed" ], "startDate" : <ISO8601 Date-Time>, "taskName" : "Number Match", "userInteractions" : [ //... ], "steps" : [ //... ] }
This creates the following files in parquet in the following form: bridge-downstream-parquet/bridge-downstream/<appId>/<studyId>/parquet/<type>/assessmentid=number-match/year=<YYYY>/month=<MM>/day=<DD>/part-<5-digit num>-<guid>.c000.snappy.parquet
Note that the relevant types for this assessment are:
dataset_sharedschema_v1/
dataset_sharedschema_v1_stepHistory/
dataset_sharedschema_v1_steps/
dataset_sharedschema_v1_taskStatus/
dataset_sharedschema_v1_userInteractions/
dataset_sharedschema_v1_userInteractions_controlEvent/
Each JSON file in an assessment is broken into multiple sub-tables. Each assessment is further sub-partitioned by date. Each parquet file appears to contain only rows for a single record ID. D&T team uses PyArrow and Pandas to aggregate the data into a whole dataframe. We can either use this solution as well, or we can use something like https://github.com/apache/parquet-mr. (Note: This may or may not require us to pull in some Hadoop dependencies.)
Open Questions
How frequently is JSON to Parquet triggered? As per https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2749759500/JSON+to+Parquet+Workflow#Scheduled-Trigger, the JSON to Parquet job is scheduled using a Cron job. What is the current schedule, and where is it configured? Can it be changed to trigger for each individual upload?
Answer: Currently, the cron job is configured to run every hour. The SQS trigger is currently not hooked up to Bridge Downstream, and it relies only on the Cron trigger.
How do we aggregate data? We want to return one table per app per study per assessment revision that contains all rows for all participants across all days. Is this already handled in Bridge Downstream? Where is this output stored?
Answer: We use PyArrow and Pandas to aggregate the data into a single data frame.
How is Bridge Downstream being monitored? Are there metrics? Dashboards? Alarms? How can Bridge team get access to these?
Answer: Built-in Glue dashboards for the Glue jobs. (Lambda we already know how to monitor. SQS and SNS are trivial.)
Should Bridge team have access to Bridge Downstream’s AWS accounts?
Requirements For Open Bridge
Results in tabular format in Researcher UI - We want research data available in the Researcher UI, one table per app per study per assessment revision. This table should include all rows for all participants in the study for all dates in the study. The exact format of the data varies from assessment to assessment, but should include scores if they are available.
MVP is to return downloadable CSVs in the Researcher UI.
Note that we might not be able to achieve one and only one table per assessment. Some assessments contain multiple data sets. For example, the Number Match assessment includes contains multiple rows for stepHistory, steps, and userInteractions. These rows may or may not match one-to-one, and there may or may not be any meaningful way to combine these disparate data sets into a single table. We might need to provide a set of tables to represent an assessment version.
Regardless, the tables should not be sub-partitioned on date. Each CSV should contain all rows for all participants taking that assessment in the study.
Fast turnaround times - When the participant uploads data on their phone, we want Bridge to Export, Validate, Score (if available), and make the data available in the Researcher UI with as little delay as possible.
This precludes nightly processing or hourly processing, as this would add significant delays to returning the data.
How much of a delay is acceptable? Minutes? This will determine if we can do batch processing on a short schedule (eg once per 5 minutes), or if we need to trigger a job for each individual upload.
Note that given the current upload rates (peak less than 100 per day for all of MTB, and averaging much lower), there’s probably no difference between a batch job every 5 minutes and triggering per individual upload.
ARC Measures
Clone the Bridge Downstream pipeline for Open Bridge. We have an indefinite code freeze for MTB. However, we want to set up a new Bridge Downstream pipeline for Open Bridge. This will be used to convert the raw JSON data into Parquet.
Note that the SQS triggers are currently not hooked up to anything. We’ll need to use existing Cron trigger (although we might want to consider increasing the interval from hourly to every 5 minutes). Bridge Downstream can also be kicked off manually by the Parquet to CSV worker job (see below).
Parquet-to-CSV Worker. This worker loads the parquet fragments into a single Parquet data frame, then exports a snapshot of the Parquet data in CSV format. It stores this CSV in S3 and writes the URL and the job status to Bridge Server, so that Bridge Server knows about it.
Note that because an assessment can result in multiple tables and multiple CSVs, this API will return a zip file of CSVs.
Possible Design 1: We write this worker purely in Java, using something like https://github.com/apache/parquet-mr. Testing will need to be done to see if this does, in fact, pull in Hadoop dependencies, and if so, how much of a problem that is. We also need to verify that this Parquet library aggregates the data frame as we expect and can filter and can export to CSV.
Possible Design 2: Run PyArrow and Pandas in Java. The upside is that these libraries are what the Scoring code uses, so we know it does what we want. Will need to verify that we can run Python in a Java environment (probably yes) and whether these specific libraries will run in a Java environment (hopefully yes).
Possible Design 3: Set up a new Python Worker to run PyArrow and Pandas natively. This will require us to set up a new stack in AWS, complete with all the infrastructure and monitoring. However, running a native Python environment may be easier than running Python in Java.
Scoring Code. We only have 3 ARC measures, and only one of them has meaningful scoring code. For V1, we should have a hardcoded mapping from assessment ID to function calls that live inside the Worker. As part of the Parquet-to-CSV Worker, we should call this function to score the data set and return the scores as a separate file in the zip file.
In 2024, we’ll need to do some work to make this scalable. This may involve a solution to run 3rd party scoring code in a sandboxed environment. We don’t need to solve this until 2024.
Bridge APIs to request a snapshot of the study data.
Bridge API to request the CSV Snapshot. This is scoped to the study and takes in an optional date range and a list of assessments to include in the snapshot (left blank to include everything). This triggers the Parquet-to-CSV Worker and returns a job ID.
Bridge API for the Parquet-to-CSV Worker to mark a job ID as complete and set the URL for the result.
Bridge API to get the CSV Snapshot result. Note that if the Worker hasn’t marked the job ID as complete, this may return an incomplete status with no data URL.
These APIs are scoped to the study. We can currently fake this by using study_coordinator or study_designer roles.
Optionally, we add a new study_researcher role, which has the ability to read the research data and the participant roster, as a separate role from the study coordinators and designers. This would be an almost trivial amount of work, but would be replaced by the permissions re-work in 2024.
Researcher UI changes to expose the request and result API for CSV Snapshots and to download the resulting file.
Additional work items:
Write the post-processing status to Adherence.
Note that the Bridge Downstream code hardcodes appId=mobile-toolbox. We want to un-hardcode this and either read the appId from the FileEntity annotations, or propagate the appId through each step so that we never lose it.
Also, the appId is currently being set by the app. This should instead be a top-level property in the Exporter itself.
Addendum 1: Who Needs Parquet
We’ve got power users using JSON and basic users using the simple CSV summaries in the Researcher UI. However, we don’t really know who wants to use Parquet. If no one needs Parquet, then we can save a lot of work by skipping Bridge Downstream for Q4. Since we plan to re-build a lot of this in 2024 anyway, we wouldn’t be saving any work by standing up Bridge Downstream right now.
Specifically, we can cut the following Jiras from our plan:
The following Jiras would change:
- DHP-1023Getting issue details... STATUS would be split into 2 different Jiras: (a) a JSON-to-Table-Row method, which would live inside the Exporter 3 Worker, costed at 3 sprint points and (b) a CSV Worker which queries all the rows and aggregates them into a CSV, costed at 5. The cost for this task would stay the same, changing from 8 points to 2 tasks costing 3 and 5 points. However, there is much less uncertainty because we don’t have to figure out how to read Parquet in Java.
- DHP-1025Getting issue details... STATUS , previously costed at 5, would be split into 2 Jiras, one for “summarizing” the CSV, costed at 3, and another Jira for re-writing the ARC scoring code, probably costed at 2 or 3.
This scoring code could either be part of the apps or it could live in the Worker.
- DHP-1028Getting issue details... STATUS would live in Bridge Worker instead of Bridge Downstream, but would stay costed at 1.
We would need to add a Jira for storing the intermediate tabular results that the JSON-to-CSV Worker generates. This work would be costed at 3.
During the discussion, we also identified additional work that would need to be done. These work items would have had to be done whether or not we use Parquet, so they don’t change the cost of the project.
Add a field to Assessment Resources (for some reason renamed to “External Resources” in the JavaSDK) to store data in-place (instead of a link). This would involve adding a new field to Assessment Resource (probably a Json Node so we can store structured data if desired). This could be used to store the JSON Schema or the list of table column names or other documentation. We would also want to add new enum values to Resource Category for SCHEMA and TABLE_COLUMNS. (SCIENCE_DOCUMENTATION and DEVELOPER_DOCUMENTATION already exist.) This would be trivial. Costed at 1 sprint point.
A validation step in JSON-to-Table-Row method, which logs a message if we fail validation. This would include validating the data against a JSON Schema (if it exists) and comparing the flat map to the list of table columns (if it exists). Costed at 2-3 sprint points.
Setting up alarms for the validation failures is so trivial that this is included in the previous work item.
CSV Worker should pull the Table Column list from Assessment Resources. If it exists, the CSV Worker should always ensure that these columns are present in the CSV, even if the values are null. Costed at 2-3 sprint points.
Pros
Reduces the estimated amount of work by 9 sprint points. (More accurately, somewhere between 1 and 9 sprint points, since DHP-1020 is in progress, but closer to 9 since DHP-1020 is very far from complete.)
Completely eliminates DHP-1020, which is the riskiest work item, and almost completely reduces the risk from DHP-1023.
Potentially frees up Dwayne’s schedule later in Q4 to help with the Android stuff or the Permissions stuff.
Bridge Downstream runs hourly. If we remove Bridge Downstream, we don’t have to worry about triggering Bridge Downstream in the CSV Worker and adding additional delay.
This trivially solves the multi-select problem. We can always just render the raw JSON array in the CSV.
That said, this is not a particularly user-friendly output, so we might want to return a comma-delimited list anyway.
Schemas are no longer necessary for the data pipeline. Since we generate the CSV on the fly, we can always just determine the columns on the fly.
Schemas might still be necessary for documentation purposes and for future data validation work.
Cons
Would have to re-write the scoring code from that one Arc measure.
This is mitigated because we were planning to re-write it from R to Python anyway, and now we’re considering re-writing it in Kotlin (mobile) or Java (Worker).
In the old design, the JSON-to-CSV Worker would generate a zip file with multiple CSVs per assessment. (Some assessments, such as Number Match, could be relationalized into up to 6 Parquet tables.) In the new design, the JSON-to-CSV Worker would generate only metadata unless the Summarize component were written for the assessment.
This can be mitigated if we have some kind of reasonable default, like a link to the raw data in Synapse.
We have to write the Summarize components for all 3 Arc measures and for surveys anyway, so this might be a non-issue.
One possibility is that the apps provide a separate answers.json which is just a flattened map of key-value pairs, which makes Summarize very easy.
Surveys
We can easily build survey processing on top of the Bridge Downstream pipeline as described above for ARC measures.
Survey Schemas. We need survey schemas to validate the raw JSON data.
Possible Design 1: The Researcher UI automatically creates the survey schema.
Possible Design 2: The app includes the survey schema in-line in the survey results. This will require changes in the mobile apps as well as in Bridge Downstream to read the in-line schema instead of following a link.
Possible Design 3: A generic survey schema that simply validates that we have key-value pairs. This means that the survey results are not fully validated, but since the survey JSON format is relatively simple, we don’t lose much by not validating it. This is also the cheapest of the 3 solutions.
Verify Survey Result Format. We want the Survey CSV in a format where each row is one participant’s answers to a survey and each column is one survey question. For JSON-to-Parquet to generate this, we need survey results to be in a JSON file where each key is the survey question identifier and each value is the survey answer.
iOS already supports this, but we would need to do Android work to support this.
Free text answers. We support this, but we may need to add max character limits to the answers. It’s currently unclear if Parquet has field size limits or if there are other limitations in the app or in the Researcher UI. However, a max character limit of either 250 or 500 should be sufficient for all use cases.
If this turns out to be a lot of work, we always the option to simply disallow free text answers in V1 and postpone the feature to 2024.
Multiple choice, multiple answers. A multiple choice question where the participant can select multiple answers from the list of choices. One limitation is if we represent this as a JSON Array, JSON-to-Parquet will separate this array into its own table, whereas we want it in its own column.
One possible workaround is to represent the answer is a comma separated list instead. This can be accomplished because multiple choice answers contain both the display text (which is shown to the participant) and the value (an internal value representing the answer). We can restrict the value to only contain alphanumeric spaces and characters. This means that we will never have commas in the survey answers, so we can always use a comma-separated list.
This will require work in iOS and Android, and possibly in the Researcher UI as well.
If this turns out to be a lot of work, we have the option to simply disallow multiple answers and postpone the feature to 2024.
Jira Task Breakdown
ARC Epic: - DHP-1017Getting issue details... STATUS
Surveys Epic: - DHP-1029Getting issue details... STATUS
Researcher UI
- DHP-1027Getting issue details... STATUS
Bridge Back-End
(Server and Worker work)
- DHP-1019Getting issue details... STATUS (1)
could be a good cross-training project for Nathaniel
- DHP-1070Getting issue details... STATUS (2)
another good cross-training project
- DHP-1071Getting issue details... STATUS (3)
- DHP-1072Getting issue details... STATUS (5)
- DHP-1025Getting issue details... STATUS (3)
This scoring code could either be part of the apps or it could live in the Worker.
- DHP-1074Getting issue details... STATUS (3)
- DHP-1073Getting issue details... STATUS (5)
- DHP-1075Getting issue details... STATUS (3)
- DHP-1026Getting issue details... STATUS (3)
Nathaniel can take this work in parallel and get some cross-training in Bridge Back-End
- DHP-1028Getting issue details... STATUS (1)
Defunct
- DHP-1022Getting issue details... STATUS
- DHP-1030Getting issue details... STATUS
- DHP-1031Getting issue details... STATUS
- DHP-1023Getting issue details... STATUS (8)
- DHP-1020Getting issue details... STATUS (8)
- DHP-1021Getting issue details... STATUS (1)
- DHP-1018Getting issue details... STATUS (1)
- DHP-1032Getting issue details... STATUS (3)
Additional Notes
Bridge Downstream code base:https://github.com/Sage-Bionetworks/BridgeDownstream
Bridge Downstream getting started: Getting Started
Bridge Downstream developer docs: /wiki/spaces/BD/pages/2746351624
Bridge Downstream setting up a new study: Setting Up a New Study
How to read parquet data in the command line:
Pre-req: pip needs to be installed on your Mac. The fastest way to install it is
port install py-pip
. This will also install Python, if necessary.pip install parquet-cli
- You only need to do this once. This will put a command-line utility calledparq
in your bin.parq <filename>
to get metadata for a parquet file.parq <filename> --head N
orparq <filename> --tail N
to read the actual parquet data.Alternate solution: https://github.com/devinrsmith/deephaven-parquet-viewer Allows you to view parquet files in your browser. Requires Docker.
Alternate solution: PyCharm for IntelliJ IDE.
Long-term, we can look into something like https://hevodata.com/learn/parquet-to-postgresql/ for storing and managing Parquet data.