...
This precludes nightly or hourly processing, as either would add significant delays before the data is returned.
How much of a delay is acceptable? Minutes? The answer determines whether we can run batch processing on a short schedule (e.g., once every 5 minutes) or need to trigger a job for each individual upload.
Note that at current upload rates (a peak of fewer than 100 uploads per day across all of MTB, and much lower on average), there is probably no practical difference between a batch job every 5 minutes and triggering per individual upload.
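If we go the per-upload route, one common pattern is an S3 event notification invoking a small Lambda that starts the Glue job for each uploaded object. A minimal sketch (the job name, argument names, and handler wiring here are hypothetical, not the actual Bridge Downstream configuration):

```python
def make_handler(glue_client, job_name):
    """Return a Lambda handler that starts one Glue job run per uploaded S3 object.

    `glue_client` is an AWS Glue client (e.g. boto3.client("glue")); it is
    injected so the handler can be exercised without AWS access.
    """
    def handler(event, context=None):
        run_ids = []
        for record in event.get("Records", []):
            # Standard S3 event notification shape: bucket name and object key.
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            resp = glue_client.start_job_run(
                JobName=job_name,
                Arguments={"--source_bucket": bucket, "--source_key": key},
            )
            run_ids.append(resp["JobRunId"])
        return {"started": run_ids}
    return handler

# In the real Lambda module, something like:
# import boto3
# handler = make_handler(boto3.client("glue"), "bridge-downstream-job")
```

Given the low upload volume noted above, per-object `start_job_run` calls would stay well under Glue's concurrency limits.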
ARC Measures
Things we will have to do regardless of our design decisions:
Modify the Glue job to trigger on each upload instead of a cron schedule. (Or move the Python script out of Glue entirely.)
Aggregate data across all dates and participants.
Write the aggregated data back to Bridge.
Write the post-processing status to Adherence.
Bridge API to download the aggregated data as a CSV.
Researcher UI to surface the API to download a CSV.
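The aggregate-and-export steps above could look something like the following sketch, assuming the per-upload output can be loaded as DataFrames (the column names `participantId`, `eventDate`, and `uploadedOn` are hypothetical placeholders, not the actual schema):

```python
import io

import pandas as pd


def aggregate_uploads(frames: list) -> pd.DataFrame:
    """Combine per-upload score frames across all dates and participants."""
    combined = pd.concat(frames, ignore_index=True)
    # If a participant re-uploaded for the same day, keep only the latest upload.
    combined = combined.sort_values("uploadedOn").drop_duplicates(
        subset=["participantId", "eventDate"], keep="last"
    )
    return combined.reset_index(drop=True)


def to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize the aggregate for a CSV-download API response."""
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    return buf.getvalue().encode("utf-8")
```

The dedup-on-latest-upload rule is one possible policy; whatever we choose, it should be applied here, in one place, so the Bridge CSV API and Adherence see the same aggregate.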
Note that the Bridge Downstream code hardcodes appId=mobile-toolbox. We want to un-hardcode this and either read the appId from the FileEntity annotations or propagate the appId through each step so that it is never lost.
Also, the appId is currently set by the app. It should instead be a top-level property of the Exporter itself.
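If we read the appId from the FileEntity annotations, the lookup could be a small helper like this (a sketch only; the annotation key `appId` and the list-valued shape are assumptions based on how Synapse annotations are typically returned):

```python
def resolve_app_id(annotations: dict, default=None) -> str:
    """Read appId from a FileEntity's annotations instead of hardcoding it.

    Annotation values are assumed to arrive as lists (as Synapse returns
    them); the first entry wins. A missing appId is an error unless an
    explicit default is supplied, so we never silently fall back to a
    hardcoded app.
    """
    values = annotations.get("appId") or []
    if values:
        return values[0]
    if default is not None:
        return default
    raise KeyError("appId annotation missing; refusing to fall back to a hardcoded value")
```

Failing loudly on a missing appId (rather than defaulting to mobile-toolbox) is the point of the exercise: it surfaces any upload path that still drops the appId.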
Proposed Design: Replace Parquet
...
It would probably be simpler to write our own custom Survey-to-Table implementation, purpose-built for the new Survey Builder and survey format.
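The core of such a Survey-to-Table step can be quite small: flatten each survey submission into one row, with one column per question identifier. A sketch, with entirely hypothetical field names (the actual new survey format may differ):

```python
def survey_to_row(survey_json: dict) -> dict:
    """Flatten one survey submission into a single flat table row.

    Hypothetical input shape: top-level metadata fields plus a
    "stepHistory" list of {"identifier": ..., "answer": ...} entries.
    """
    row = {
        "recordId": survey_json.get("recordId"),
        "participantId": survey_json.get("participantId"),
    }
    for step in survey_json.get("stepHistory", []):
        # One column per answered question, keyed by its identifier.
        row[step["identifier"]] = step.get("answer")
    return row
```

Because the new Survey Builder controls the question identifiers, column names would be stable across participants, which is exactly what the aggregation and CSV-export steps need.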
Additional Notes
Bridge Downstream code base: https://github.com/Sage-Bionetworks/BridgeDownstream
Bridge Downstream getting started: Getting Started
Bridge Downstream developer docs: /wiki/spaces/BD/pages/2746351624
How to read parquet data in the command line:
- Pre-req: pip needs to be installed on your Mac. The fastest way to install it is `port install py-pip`. This will also install Python, if necessary.
- `pip install parquet-cli` (you only need to do this once). This will put a command-line utility called `parq` in your bin.
- `parq <filename>` to get metadata for a parquet file.
- `parq <filename> --head N` or `parq <filename> --tail N` to read the actual parquet data.
- Alternate solution: https://github.com/devinrsmith/deephaven-parquet-viewer allows you to view parquet files in your browser. Requires Docker.