...
This precludes nightly processing or hourly processing, as this would add significant delays to returning the data.
How much of a delay is acceptable? Minutes? This will determine whether we can do batch processing on a short schedule (e.g., once every 5 minutes), or whether we need to trigger a job for each individual upload.
Note that given the current upload rates (peak less than 100 per day for all of MTB, and averaging much lower), there’s probably no difference between a batch job every 5 minutes and triggering per individual upload.
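A quick back-of-the-envelope check of the note above, using the quoted peak of 100 uploads per day and a 5-minute schedule. The even spread of uploads across the day is a simplifying assumption, not something measured.

```python
# Back-of-the-envelope: how many uploads land in each 5-minute batch window?
# Assumes uploads are spread evenly over the day (a simplification).
PEAK_UPLOADS_PER_DAY = 100      # upper bound quoted above
BATCH_INTERVAL_MINUTES = 5

windows_per_day = 24 * 60 // BATCH_INTERVAL_MINUTES
uploads_per_window = PEAK_UPLOADS_PER_DAY / windows_per_day

print(windows_per_day)               # 288
print(round(uploads_per_window, 2))  # 0.35
```

At well under one upload per window even at peak, most batch runs would process zero or one upload, which is why the two triggering strategies look equivalent at current volumes.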
ARC Measures
Things we will have to do regardless of our design decisions:
...
Modify the Glue job to trigger off each upload instead of a cron schedule. (Or move the Python script to something that’s not Glue.)
...
Aggregate data across all dates and participants.
...
Write the aggregated data back to Bridge.
...
Write the post-processing status to Adherence.
...
Bridge API to download the aggregated data as a CSV.
...
Clone the Bridge Downstream pipeline for Open Bridge. We have an indefinite code freeze for MTB. However, we want to set up a new Bridge Downstream pipeline for Open Bridge. This will be used to convert the raw JSON data into Parquet.
Note that the SQS triggers are currently not hooked up to anything. We’ll need to use the existing cron trigger (although we might want to consider shortening the interval from hourly to every 5 minutes). Bridge Downstream can also be kicked off manually by the Parquet-to-CSV Worker job (see below).
Parquet-to-CSV Worker. This worker loads the parquet fragments into a single Parquet data frame, then exports a snapshot of the Parquet data in CSV format. It stores this CSV in S3 and writes the URL and the job status to Bridge Server, so that Bridge Server knows about it.
Note that because an assessment can result in multiple tables and multiple CSVs, this API will return a zip file of CSVs.
Possible Design 1: We write this worker purely in Java, using something like https://github.com/apache/parquet-mr. Testing will need to be done to see if this does, in fact, pull in Hadoop dependencies, and if so, how much of a problem that is. We also need to verify that this Parquet library aggregates the data frame as we expect and can filter and can export to CSV.
Possible Design 2: Run PyArrow and Pandas in Java. The upside is that these libraries are what the Scoring code uses, so we know it does what we want. Will need to verify that we can run Python in a Java environment (probably yes) and whether these specific libraries will run in a Java environment (hopefully yes).
Possible Design 3: Set up a new Python Worker to run PyArrow and Pandas natively. This will require us to set up a new stack in AWS, complete with all the infrastructure and monitoring. However, running a native Python environment may be easier than running Python in Java.
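Whichever design is chosen, the worker's core shape is the same: merge the fragments for each table, write one CSV per table, and bundle the CSVs into a zip. The following is a minimal stdlib-only sketch of that shape. It assumes the Parquet fragments have already been loaded as lists of row dictionaries (in practice PyArrow or a Java Parquet library would do that step), and the function names and sample data are hypothetical.

```python
import csv
import io
import zipfile


def fragments_to_csv(fragments):
    """Merge row fragments for one table into a single CSV string.

    Each fragment is a list of dict rows. Columns are the union of all
    keys in first-seen order; missing values become empty cells.
    """
    columns = []
    for fragment in fragments:
        for row in fragment:
            for key in row:
                if key not in columns:
                    columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for fragment in fragments:
        writer.writerows(fragment)
    return buffer.getvalue()


def build_snapshot_zip(tables):
    """Build an in-memory zip containing one CSV per table name."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        for table_name, fragments in tables.items():
            zf.writestr(f"{table_name}.csv", fragments_to_csv(fragments))
    return archive.getvalue()


# Hypothetical sample: one table, two fragments with differing columns.
tables = {
    "trials": [
        [{"recordId": "r1", "score": 3}],
        [{"recordId": "r2", "score": 5, "device": "phone"}],
    ],
}
snapshot = build_snapshot_zip(tables)
```

The resulting bytes would be what the worker uploads to S3 before reporting the URL and job status to Bridge Server.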
Scoring Code. We only have 3 ARC measures, and only one of them has meaningful scoring code. For V1, we should have a hardcoded mapping from assessment ID to function calls that live inside the Worker. As part of the Parquet-to-CSV Worker, we should call this function to score the data set and return the scores as a separate file in the zip file.
In 2024, we’ll need to do some work to make this scalable. This may involve a solution to run 3rd party scoring code in a sandboxed environment. We don’t need to solve this until 2024.
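The hardcoded V1 mapping could be as simple as a dictionary from assessment ID to scoring function. The sketch below is illustrative only: the assessment ID, the scoring function, and the "correct" field are all hypothetical placeholders, not the actual ARC scoring code.

```python
# Hypothetical assessment ID and scoring function, for illustration only.
def score_example_measure(rows):
    """Placeholder scoring: fraction of rows marked correct."""
    if not rows:
        return {"accuracy": None}
    correct = sum(1 for row in rows if row.get("correct"))
    return {"accuracy": correct / len(rows)}


# Hardcoded mapping from assessment ID to scoring function.
SCORERS = {
    "example-measure": score_example_measure,
}


def score_assessment(assessment_id, rows):
    """Return a score dict, or None when no scoring code is registered."""
    scorer = SCORERS.get(assessment_id)
    return scorer(rows) if scorer else None
```

Returning None for unregistered assessments lets the worker skip the scores file for measures that have no meaningful scoring code.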
Bridge APIs to request a snapshot of the study data.
Bridge API to request the CSV Snapshot. This is scoped to the study and takes in an optional date range and a list of assessments to include in the snapshot (left blank to include everything). This triggers the Parquet-to-CSV Worker and returns a job ID.
Bridge API for the Parquet-to-CSV Worker to mark a job ID as complete and set the URL for the result.
Bridge API to get the CSV Snapshot result. Note that if the Worker hasn’t marked the job ID as complete, this may return an incomplete status with no data URL.
These APIs are scoped to the study. We can currently fake this by using study_coordinator or study_designer roles.
Optionally, we add a new study_researcher role, which has the ability to read the research data and the participant roster, as a separate role from the study coordinators and designers. This would be an almost trivial amount of work, but would be replaced by the permissions re-work in 2024.
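The three snapshot APIs above amount to a small job lifecycle: request, mark complete, get result. Here is a rough in-memory sketch of that lifecycle. The class, method, and field names are assumptions made for illustration and do not reflect actual Bridge Server code or endpoints.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SnapshotJob:
    study_id: str
    assessments: list = field(default_factory=list)  # empty means "everything"
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    status: str = "pending"
    result_url: Optional[str] = None


class SnapshotJobStore:
    """In-memory stand-in for whatever Bridge Server would persist."""

    def __init__(self):
        self._jobs = {}

    def request_snapshot(self, study_id, assessments=None,
                         start_date=None, end_date=None):
        """Create a job (this is where the worker would be triggered)."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = SnapshotJob(
            study_id, list(assessments or []), start_date, end_date)
        return job_id

    def mark_complete(self, job_id, result_url):
        """Called by the worker once the zip has been stored."""
        job = self._jobs[job_id]
        job.status = "complete"
        job.result_url = result_url

    def get_result(self, job_id):
        """Return the status; the URL is None until the job completes."""
        job = self._jobs[job_id]
        return {"status": job.status, "url": job.result_url}
```

A caller would poll get_result with the job ID until the status is "complete" and then download from the returned URL.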
Researcher UI changes to expose the request and result API for CSV Snapshots and to download the resulting file.
Additional work items:
Write the post-processing status to Adherence.
Note that the Bridge Downstream code hardcodes appId=mobile-toolbox. We want to un-hardcode this and either read the appId from the FileEntity annotations, or propagate the appId through each step so that we never lose it.
Also, the appId is currently being set by the app. This should instead be a top-level property in the Exporter itself.
Proposed Design: Replace Parquet
The schema validation stuff works just fine. However, the JSON to Parquet layer doesn’t quite do what we want it to do. In particular, we still need to aggregate the data across dates and record IDs.
Pros:
It will be easier to aggregate the data if we build it in as part of JSON-to-Table instead of making it an extra step at the end of the pipeline.
Allows us to use an existing table solution, such as Synapse tables, which also allow us to easily export to CSV.
Cons:
Will need to re-write the JSON-to-Table CSV code.
Alternate Design: Keep Parquet
We keep the Bridge Downstream pipeline exactly as is, Parquet and all, and add our own code at the end to aggregate the data and send it back to Bridge.
Pros:
Don’t need to re-write the JSON to tabular data workflow.
Cons:
Parquet is poorly supported in Java, and may or may not require us to pull in Hadoop as a dependency.
Parquet is a file format, so appending to Parquet tables will involve a lot of file I/O.
The current implementation of Parquet doesn’t prevent table fragments with different columns from appearing in the same partition, and the fragments don’t contain the assessment ID or revision. We will need to solve this problem in Parquet.
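One way to at least detect the mixed-column problem is to group the fragments in a partition by their column set before aggregating. This is a minimal sketch that assumes fragments are already loaded as lists of row dictionaries; the function name is hypothetical, and it does not recover the missing assessment ID or revision.

```python
from collections import defaultdict


def group_fragments_by_columns(fragments):
    """Group fragments (lists of dict rows) by their set of column names."""
    groups = defaultdict(list)
    for fragment in fragments:
        columns = frozenset(key for row in fragment for key in row)
        groups[columns].append(fragment)
    return dict(groups)


# Hypothetical partition with two incompatible fragment shapes.
partition = [
    [{"recordId": "r1", "score": 3}],
    [{"recordId": "r2", "score": 4}],
    [{"recordId": "r3", "reactionTime": 0.42}],
]
groups = group_fragments_by_columns(partition)
has_mixed_schemas = len(groups) > 1
```

More than one group in a partition signals that fragments from different assessments or revisions have been mixed and should not be aggregated blindly.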
Alternate Design 2: Do both
Keep Exporter 3 with push to Synapse
For any supported assessment or survey, build an “answers.json” file that is a flat dictionary of scores/answers + metadata.
Build a unit testing setup that can use Python or R scripts (what researchers like) and port to Kotlin for on-device cross-platform scoring.
Write the “answers” to the Adherence Record as a dictionary.
Write an “answers.json” file to the archive.
Add a back-end service to get the “answers.json” file into a table
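To make the "flat dictionary" requirement concrete, here is a minimal sketch of building the contents of “answers.json” from scores and metadata. The function name and the sample keys are hypothetical; the point is only that rejecting nested values and key collisions keeps each file mappable to a single table row.

```python
import json


def build_answers(scores, metadata):
    """Merge scores/answers and metadata into one flat dictionary.

    Raises ValueError on key collisions or nested values, so the result
    always maps cleanly onto one table row.
    """
    answers = {}
    for source in (metadata, scores):
        for key, value in source.items():
            if isinstance(value, (dict, list)):
                raise ValueError(f"nested value for {key!r}")
            if key in answers:
                raise ValueError(f"duplicate key {key!r}")
            answers[key] = value
    return answers


# Hypothetical sample values.
answers = build_answers({"score": 12}, {"assessmentId": "example"})
answers_json = json.dumps(answers, sort_keys=True)
```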
Pros:
Feedback to the participant (if desired)
Easier to aggregate the data if we build it in as part of JSON-to-Table
Allows us to use an existing table solution, such as Synapse tables, which also allow us to easily export to CSV
Cons:
Why did we change from Bridge 1.0/Exporter 2.0 again?
Will require robust testing of the unit tests for converting R/Python scoring to Kotlin
Surveys
Surveys are even easier than ARC measures. We already know what our survey table format looks like. (This is one of the things Exporter 2.0 actually did well. However, that survey engine is currently deprecated, as are Exporter 2.0 schemas.)
It’s not clear what JSON-to-Parquet gets us in this use case. Best case scenario, it converts survey JSON to tabular format similar to what we already expect, but we would end up with a bunch of 1-row files that we would need to aggregate. And we’d still need to integrate with the new Survey Builder to determine what table columns we want.
...
Surveys
We can easily build survey processing on top of the Bridge Downstream pipeline as described above for ARC measures.
Survey Schemas. We need survey schemas to validate the raw JSON data.
Possible Design 1: The Researcher UI automatically creates the survey schema.
Possible Design 2: The app includes the survey schema in-line in the survey results. This will require changes in the mobile apps as well as in Bridge Downstream to read the in-line schema instead of following a link.
Possible Design 3: A generic survey schema that simply validates that we have key-value pairs. This means that the survey results are not fully validated, but since the survey JSON format is relatively simple, we don’t lose much by not validating it. This is also the cheapest of the 3 solutions.
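For Possible Design 3, the check a generic schema performs is small enough to sketch directly. The version below is written as a plain Python function rather than a JSON Schema document, and the exact rules (non-empty string keys, scalar or null values) are an assumption about what "key-value pairs" would mean here.

```python
def is_valid_generic_survey(result):
    """True if result is a JSON object whose keys are non-empty strings
    and whose values are scalars (str, int, float, bool) or None."""
    if not isinstance(result, dict):
        return False
    for key, value in result.items():
        if not isinstance(key, str) or not key:
            return False
        if value is not None and not isinstance(value, (str, int, float, bool)):
            return False
    return True
```

Under these rules a nested object or array answer would fail validation, which is consistent with wanting one column per survey question.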
Verify Survey Result Format. We want the Survey CSV in a format where each row is one participant’s answers to a survey and each column is one survey question. For JSON-to-Parquet to generate this, we need survey results to be in a JSON file where each key is the survey question identifier and each value is the survey answer.
iOS already supports this, but we would need to do Android work to support this.
Free text answers. We support this, but we may need to add max character limits to the answers. It’s currently unclear if Parquet has field size limits or if there are other limitations in the app or in the Researcher UI. However, a max character limit of either 250 or 500 should be sufficient for all use cases.
If this turns out to be a lot of work, we always have the option to simply disallow free text answers in V1 and postpone the feature to 2024.
Multiple choice, multiple answers. A multiple choice question where the participant can select multiple answers from the list of choices. One limitation is if we represent this as a JSON Array, JSON-to-Parquet will separate this array into its own table, whereas we want it in its own column.
One possible workaround is to represent the answer as a comma-separated list instead. This works because multiple choice answers contain both the display text (which is shown to the participant) and the value (an internal value representing the answer). We can restrict the value to contain only alphanumeric characters and spaces. This means that we will never have commas in the survey answer values, so we can always use a comma-separated list.
This will require work in iOS and Android, and possibly in the Researcher UI as well.
If this turns out to be a lot of work, we have the option to simply disallow multiple answers and postpone the feature to 2024.
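The comma-separated workaround can be sketched as a pair of encode/decode helpers that enforce the character restriction on answer values. The function names are hypothetical, and the pattern assumes "alphanumeric" means ASCII letters and digits.

```python
import re

# Allowed characters for an answer value: ASCII letters, digits, and spaces.
_VALUE_PATTERN = re.compile(r"[A-Za-z0-9 ]+")


def encode_multi_select(values):
    """Join selected answer values into one comma-separated string."""
    for value in values:
        if not _VALUE_PATTERN.fullmatch(value):
            raise ValueError(f"invalid answer value: {value!r}")
    return ",".join(values)


def decode_multi_select(encoded):
    """Split a comma-separated string back into answer values."""
    return encoded.split(",") if encoded else []
```

Because values can never contain a comma, the encoding round-trips without any escaping, and the whole answer stays in a single column.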
Jira Task Breakdown
TODO
Additional Notes
Bridge Downstream code base: https://github.com/Sage-Bionetworks/BridgeDownstream
...
Bridge Downstream developer docs: /wiki/spaces/BD/pages/2746351624
Bridge Downstream setting up a new study: Setting Up a New Study
How to read parquet data in the command line:
Pre-req: pip needs to be installed on your Mac. The fastest way to install it is port install py-pip. This will also install Python, if necessary.
pip install parquet-cli - You only need to do this once. This will put a command-line utility called parq in your bin.
parq <filename> to get metadata for a parquet file.
parq <filename> --head N or parq <filename> --tail N to read the actual parquet data.
Alternate solution: https://github.com/devinrsmith/deephaven-parquet-viewer Allows you to view parquet files in your browser. Requires Docker.
Alternate solution: PyCharm for IntelliJ IDE.
Long-term, we can look into something like https://hevodata.com/learn/parquet-to-postgresql/ for storing and managing Parquet data.