...
Investigation of Bridge Downstream as Bridge team prepares to take ownership.
Exploration of if/how we can generalize Bridge Downstream to include ARC measures.
Exploration of if/how we can use Bridge Downstream to include surveys.
Note: A lot of this is preliminary and depends on discussion with Bridge Downstream team and Bridge team.
Table of Contents |
---|
Bridge Downstream Learnings
...
dataset_sharedschema_v1/
dataset_sharedschema_v1_stepHistory/
dataset_sharedschema_v1_steps/
dataset_sharedschema_v1_taskStatus/
dataset_sharedschema_v1_userInteractions/
dataset_sharedschema_v1_userInteractions_controlEvent/
We observe the following:
Each JSON file in an assessment is broken into multiple sub-tables.
...
Each assessment is further sub-partitioned by date.
...
Each parquet file appears to contain only rows for a single record ID.
...
D&T team uses PyArrow and Pandas to aggregate the data into a whole dataframe. We can either use this solution as well, or we can use something like https://github.com/apache/parquet-mr. (Note: This may or may not require us to pull in some Hadoop dependencies.)
Open Questions
How frequently is JSON to Parquet triggered? As per https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2749759500/JSON+to+Parquet+Workflow#Scheduled-Trigger, the JSON to Parquet job is scheduled using a Cron job. What is the current schedule, and where is it configured? Can it be changed to trigger for each individual upload?
Answer: Currently, the cron job is configured to run every hour. The SQS trigger is currently not hooked up to Bridge Downstream, and it relies only on the Cron trigger.
How do we aggregate data? We want to return one table per app per study per assessment revision that contains all rows for all participants across all days. Is this already handled in Bridge Downstream? Where is this output stored?
Answer: We use PyArrow and Pandas to aggregate the data into a single data frame.
How is Bridge Downstream being monitored? Are there metrics? Dashboards? Alarms? How can Bridge team get access to these?
Answer: Built-in Glue dashboards for the Glue jobs. (Lambda we already know how to monitor. SQS and SNS are trivial.)
Should Bridge team have access to Bridge Downstream’s AWS accounts?
Answer:
Jira Legacy server System JIRA serverId ba6fb084-9827-3160-8067-8ac7470f78b2 key IT-3154
Requirements For Open Bridge
...
MVP is to return downloadable CSVs in the Researcher UI.
Note that we might not be able to achieve one and only one table per assessment. Some assessments contain multiple data sets. For example, the Number Match assessment includes contains multiple rows for stepHistory, steps, and userInteractions. These rows may or may not match one-to-one, and there may or may not be any meaningful way to combine these disparate data sets into a single table. We might need to provide a set of tables to represent an assessment version.
Regardless, the tables should not be sub-partitioned on date. Each CSV should contain all rows for all participants taking that assessment in the study.
Each assessment will generate a separate CSV file. All CSV files can be downloaded in a single zip file. To see the sample data, s
Fast turnaround times - When the participant uploads data on their phone, we want Bridge to Export, Validate, Score (if available), and make the data available in the Researcher UI with as little delay as possible.
This precludes nightly processing or hourly processing, as this would add significant delays to returning the data.
How much of a delay is acceptable? Minutes? This will determine if we can do batch processing on a short schedule (eg once per 5 minutes), or if we need to trigger a job for each individual upload.
Note that given the current upload rates (peak less than 100 per day for all of MTB, and averaging much lower), there’s probably no difference between a batch job every 5 minutes and triggering per individual upload.
ARC Measures
Things we will have to do regardless of our design decisions:
...
Modify the Glue job to trigger off each upload instead of a cron schedule. (Or move the Python script to something that’s not Glue.)
...
Aggregate data across all dates and participants.
...
Write the aggregated data back to Bridge.
...
Write the post-processing status to Adherence.
...
Bridge API to download the aggregated data as a CSV.
...
Clone the Bridge Downstream pipeline for Open Bridge. We have an indefinite code freeze for MTB. However, we want to set up a new Bridge Downstream pipeline for Open Bridge. This will be used to convert the raw JSON data into Parquet.
Note that the SQS triggers are currently not hooked up to anything. We’ll need to use existing Cron trigger (although we might want to consider increasing the interval from hourly to every 5 minutes). Bridge Downstream can also be kicked off manually by the Parquet to CSV worker job (see below).
Parquet-to-CSV Worker. This worker loads the parquet fragments into a single Parquet data frame, then exports a snapshot of the Parquet data in CSV format. It stores this CSV in S3 and writes the URL and the job status to Bridge Server, so that Bridge Server knows about it.
Note that because an assessment can result in multiple tables and multiple CSVs, this API will return a zip file of CSVs.
Possible Design 1: We write this worker purely in Java, using something like https://github.com/apache/parquet-mr. Testing will need to be done to see if this does, in fact, pull in Hadoop dependencies, and if so, how much of a problem that is. We also need to verify that this Parquet library aggregates the data frame as we expect and can filter and can export to CSV.
Possible Design 2: Run PyArrow and Pandas in Java. The upside is that these libraries are what the Scoring code uses, so we know it does what we want. Will need to verify that we can run Python in a Java environment (probably yes) and whether these specific libraries will run in a Java environment (hopefully yes).
Possible Design 3: Set up a new Python Worker to run PyArrow and Pandas natively. This will require us to set up a new stack in AWS, complete with all the infrastructure and monitoring. However, running a native Python environment may be easier than running Python in Java.
Scoring Code. We only have 3 ARC measures, and only one of them has meaningful scoring code. For V1, we should have a hardcoded mapping from assessment ID to function calls that live inside the Worker. As part of the Parquet-to-CSV Worker, we should call this function to score the data set and return the scores as a separate file in the zip file.
In 2024, we’ll need to do some work to make this scalable. This may involve a solution to run 3rd party scoring code in a sandboxed environment. We don’t need to solve this until 2024.
Bridge APIs to request a snapshot of the study data.
Bridge API to request the CSV Snapshot. This is scoped to the study and takes in an optional date range and a list of assessments to include in the snapshot (left blank to include everything). This triggers the Parquet-to-CSV Worker and returns a job ID.
Bridge API for the Parquet-to-CSV Worker to mark a job ID as complete and set the URL for the result.
Bridge API to get the CSV Snapshot result. Note that if the Worker hasn’t marked the job ID as complete, this may return an incomplete status with no data URL.
These APIs are scoped to the study. We can currently fake this by using study_coordinator or study_designer roles.
Optionally, we add a new study_researcher role, which has the ability to read the research data and the participant roster, as a separate role from the study coordinators and designers. This would be an almost trivial amount of work, but would be replaced by the permissions re-work in 2024.
Researcher UI changes to expose the request and result API for CSV Snapshots and to download the resulting file.
Additional work items:
Write the post-processing status to Adherence.
Note that the Bridge Downstream code hardcodes appId=mobile-toolbox. We want to un-hardcode this and either read the appId from the FileEntity annotations, or propagate the appId through each step so that we never lose it.
Also, the appId is currently being set by the app. This should instead be a top-level property in the Exporter itself.
Proposed Design: Replace Parquet
The schema validation stuff works just fine. However, the JSON to Parquet layer doesn’t quite do what we want it to do. In particular, we still need to aggregate the data across dates and record IDs.
Pros:
It will be easier to aggregate the data if we build it in as part of JSON-to-Table instead of making it an extra step at the end of the pipeline.
Allows us to use an existing table solution, such as Synapse tables, which also allow us to easily export to CSV.
Cons:
Will need to re-write the JSON-to-Table CSV code.
Alternate Design: Keep Parquet
We keep the Bridge Downstream pipeline exactly as is, Parquet and all, and add our own code at the end to aggregate the data and send it back to Bridge.
Pros:
Don’t need to re-write the JSON to tabular data workflow.
Cons:
Parquet is poorly supported in Java, and may or may not require us to pull in Hadoop as a dependency.
Parquet is a file format, so appending to Parquet tables will involve a lot of file I/O.
The current implementation of Parquet doesn’t prevent table fragments with different columns from appearing in the same partition, and the fragments don’t contain the assessment ID or revision. We will need to solve this problem in Parquet.
Alternate Design 2: Do both
Keep Exporter 3 with push to Synapse
For any supported assessment or survey, build an “answers.json” file that is a flat dictionary of scores/answers + metadata.
Build a unit testing setup that can use python or R scripts (what researchers like) and port to Kotlin for on-device cross platform scoring
write the “answers” to the Adherence Record as a dictionary
write a “answers.json” file to the archive
Add a back-end service to get the “answers.json” file into a table
Pros:
Feedback to the participant (if desired)
Easier to aggregate the data if we build it in as part of JSON-to-Table
Allows us to use an existing table solution, such as Synapse tables, which also allow us to easily export to CSV
Cons:
Why did we change from Bridge 1.0/Exporter 2.0 again?
Will require robust testing of the unit tests for converting R/Python scoring to Kotlin
Surveys
Surveys are even easier than ARC measures. We already know what our survey table format looks like. (This is one of the things Exporter 2.0 actually did well. However, that survey engine is currently deprecated, as are Exporter 2.0 schemas.)
It’s not clear what JSON-to-Parquet gets us in this use case. Best case scenario, it converts survey JSON to tabular format similar to what we already expect, but we would end up with a bunch of 1-row files that we would need to aggregate. And we’d still need to integrate with the new Survey Builder to determine what table columns we want.
It would probably be simpler to write our custom Survey-to-Table implementation, custom-built to the new Survey Builder and survey format.
“Architectural Diagram” and design notes 2023-10-25
...
Addendum 1: Who Needs Parquet
We’ve got power users using JSON and basic users using the simple CSV summaries in the Researcher UI. However, we don’t really know who wants to use Parquet. If no one needs Parquet, then we can save a lot of work by skipping Bridge Downstream for Q4. Since we plan to re-build a lot of this in 2024 anyway, we wouldn’t be saving any work by standing up Bridge Downstream right now.
Specifically, we can cut the following Jiras from our plan:
(8)Jira Legacy server System JIRA serverId ba6fb084-9827-3160-8067-8ac7470f78b2 key DHP-1020
(1)Jira Legacy server System JIRA serverId ba6fb084-9827-3160-8067-8ac7470f78b2 key DHP-1021
(1)Jira Legacy server System JIRA serverId ba6fb084-9827-3160-8067-8ac7470f78b2 key DHP-1018
(3)Jira Legacy server System JIRA serverId ba6fb084-9827-3160-8067-8ac7470f78b2 key DHP-1032
The following Jiras would change:
would be split into 2 different Jiras: (a) a JSON-to-Table-Row method, which would live inside the Exporter 3 Worker, costed at 3 sprint points and (b) a CSV Worker which queries all the rows and aggregates them into a CSV, costed at 5. The cost for this task would stay the same, changing from 8 points to 2 tasks costing 3 and 5 points. However, there is much less uncertainty because we don’t have to figure out how to read Parquet in Java.Jira Legacy server System JIRA serverId ba6fb084-9827-3160-8067-8ac7470f78b2 key DHP-1023
, previously costed at 5, would be split into 2 Jiras, one for “summarizing” the CSV, costed at 3, and another Jira for re-writing the ARC scoring code, probably costed at 2 or 3.Jira Legacy server System JIRA serverId ba6fb084-9827-3160-8067-8ac7470f78b2 key DHP-1025 This scoring code could either be part of the apps or it could live in the Worker.
would live in Bridge Worker instead of Bridge Downstream, but would stay costed at 1.Jira Legacy server System JIRA serverId ba6fb084-9827-3160-8067-8ac7470f78b2 key DHP-1028
We would need to add a Jira for storing the intermediate tabular results that the JSON-to-CSV Worker generates. This work would be costed at 3.
During the discussion, we also identified additional work that would need to be done. These work items would have had to be done whether or not we use Parquet, so they don’t change the cost of the project.
Add a field to Assessment Resources (for some reason renamed to “External Resources” in the JavaSDK) to store data in-place (instead of a link). This would involve adding a new field to Assessment Resource (probably a Json Node so we can store structured data if desired). This could be used to store the JSON Schema or the list of table column names or other documentation. We would also want to add new enum values to Resource Category for SCHEMA and TABLE_COLUMNS. (SCIENCE_DOCUMENTATION and DEVELOPER_DOCUMENTATION already exist.) This would be trivial. Costed at 1 sprint point.
A validation step in JSON-to-Table-Row method, which logs a message if we fail validation. This would include validating the data against a JSON Schema (if it exists) and comparing the flat map to the list of table columns (if it exists). Costed at 2-3 sprint points.
Setting up alarms for the validation failures is so trivial that this is included in the previous work item.
CSV Worker should pull the Table Column list from Assessment Resources. If it exists, the CSV Worker should always ensure that these columns are present in the CSV, even if the values are null. Costed at 2-3 sprint points.
Pros
Reduces the estimated amount of work by 9 sprint points. (More accurately, somewhere between 1 and 9 sprint points, since DHP-1020 is in progress, but closer to 9 since DHP-1020 is very far from complete.)
Completely eliminates DHP-1020, which is the riskiest work item, and almost completely reduces the risk from DHP-1023.
Potentially frees up Dwayne’s schedule later in Q4 to help with the Android stuff or the Permissions stuff.
Bridge Downstream runs hourly. If we remove Bridge Downstream, we don’t have to worry about triggering Bridge Downstream in the CSV Worker and adding additional delay.
This trivially solves the multi-select problem. We can always just render the raw JSON array in the CSV.
That said, this is not a particularly user-friendly output, so we might want to return a comma-delimited list anyway.
Schemas are no longer necessary for the data pipeline. Since we generate the CSV on the fly, we can always just determine the columns on the fly.
Schemas might still be necessary for documentation purposes and for future data validation work.
Cons
Would have to re-write the scoring code from that one Arc measure.
This is mitigated because we were planning to re-write it from R to Python anyway, and now we’re considering re-writing it in Kotlin (mobile) or Java (Worker).
In the old design, the JSON-to-CSV Worker would generate a zip file with multiple CSVs per assessment. (Some assessments, such as Number Match, could be relationalized into up to 6 Parquet tables.) In the new design, the JSON-to-CSV Worker would generate only metadata unless the Summarize component were written for the assessment.
This can be mitigated if we have some kind of reasonable default, like a link to the raw data in Synapse.
We have to write the Summarize components for all 3 Arc measures and for surveys anyway, so this might be a non-issue.
One possibility is that the apps provide a separate answers.json which is just a flattened map of key-value pairs, which makes Summarize very easy.
Surveys
We can easily build survey processing on top of the Bridge Downstream pipeline as described above for ARC measures.
Survey Schemas. We need survey schemas to validate the raw JSON data.
Possible Design 1: The Researcher UI automatically creates the survey schema.
Possible Design 2: The app includes the survey schema in-line in the survey results. This will require changes in the mobile apps as well as in Bridge Downstream to read the in-line schema instead of following a link.
Possible Design 3: A generic survey schema that simply validates that we have key-value pairs. This means that the survey results are not fully validated, but since the survey JSON format is relatively simple, we don’t lose much by not validating it. This is also the cheapest of the 3 solutions.
Verify Survey Result Format. We want the Survey CSV in a format where each row is one participant’s answers to a survey and each column is one survey question. For JSON-to-Parquet to generate this, we need survey results to be in a JSON file where each key is the survey question identifier and each value is the survey answer.
iOS already supports this, but we would need to do Android work to support this.
Free text answers. We support this, but we may need to add max character limits to the answers. It’s currently unclear if Parquet has field size limits or if there are other limitations in the app or in the Researcher UI. However, a max character limit of either 250 or 500 should be sufficient for all use cases.
If this turns out to be a lot of work, we always the option to simply disallow free text answers in V1 and postpone the feature to 2024.
Multiple choice, multiple answers. A multiple choice question where the participant can select multiple answers from the list of choices. One limitation is if we represent this as a JSON Array, JSON-to-Parquet will separate this array into its own table, whereas we want it in its own column.
One possible workaround is to represent the answer is a comma separated list instead. This can be accomplished because multiple choice answers contain both the display text (which is shown to the participant) and the value (an internal value representing the answer). We can restrict the value to only contain alphanumeric spaces and characters. This means that we will never have commas in the survey answers, so we can always use a comma-separated list.
This will require work in iOS and Android, and possibly in the Researcher UI as well.
If this turns out to be a lot of work, we have the option to simply disallow multiple answers and postpone the feature to 2024.
Jira Task Breakdown
ARC Epic:
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Surveys Epic:
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Researcher UI
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Bridge Back-End
(Server and Worker work)
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
could be a good cross-training project for Nathaniel
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
another good cross-training project
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
This scoring code could either be part of the apps or it could live in the Worker.
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Nathaniel can take this work in parallel and get some cross-training in Bridge Back-End
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
2024 Improvements
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Defunct
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
Additional Notes
Bridge Downstream code base:https://github.com/Sage-Bionetworks/BridgeDownstream
...
Bridge Downstream developer docs: /wiki/spaces/BD/pages/2746351624
Bridge Downstream setting up a new study: Setting Up a New Study
How to read parquet data in the command line:
Pre-req: pip needs to be installed on your Mac. The fastest way to install it is
port install py-pip
. This will also install Python, if necessary.pip install parquet-cli
- You only need to do this once. This will put a command-line utility calledparq
in your bin.parq <filename>
to get metadata for a parquet file.parq <filename> --head N
orparq <filename> --tail N
to read the actual parquet data.Alternate solution: https://github.com/devinrsmith/deephaven-parquet-viewer Allows you to view parquet files in your browser. Requires Docker.
Alternate solution: PyCharm for IntelliJ IDE.
Long-term, we can look into something like https://hevodata.com/learn/parquet-to-postgresql/ for storing and managing Parquet data.