...

  • Investigation of Bridge Downstream as Bridge team prepares to take ownership.

  • Exploration of whether and how we can generalize Bridge Downstream to include ARC measures.

  • Exploration of whether and how we can use Bridge Downstream to include surveys.

Note: Much of this is preliminary and depends on discussions with the Bridge Downstream team and the Bridge team.

Bridge Downstream Learnings

...

  • dataset_sharedschema_v1/

  • dataset_sharedschema_v1_stepHistory/

  • dataset_sharedschema_v1_steps/

  • dataset_sharedschema_v1_taskStatus/

  • dataset_sharedschema_v1_userInteractions/

  • dataset_sharedschema_v1_userInteractions_controlEvent/

...

Each JSON file in an assessment is broken into multiple sub-tables.

...

Each assessment is further sub-partitioned by date.

...

Each parquet file appears to contain only rows for a single record ID.

...

The D&T team uses PyArrow and Pandas to aggregate the data into a single dataframe. We can either adopt the same solution or use something like https://github.com/apache/parquet-mr. (Note: parquet-mr may require us to pull in some Hadoop dependencies.)

Open Questions

How frequently is JSON to Parquet triggered? As per https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2749759500/JSON+to+Parquet+Workflow#Scheduled-Trigger, the JSON to Parquet job is scheduled using a Cron job. What is the current schedule, and where is it configured? Can it be changed to trigger for each individual upload?

  • Answer: Currently, the cron job is configured to run every hour. The SQS trigger is not currently hooked up to Bridge Downstream, so the job relies only on the cron trigger.

How do we aggregate data? We want to return one table per app per study per assessment revision that contains all rows for all participants across all days. Is this already handled in Bridge Downstream? Where is this output stored?

  • Answer: We use PyArrow and Pandas to aggregate the data into a single data frame.
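Once the data is in a single dataframe, producing one table per app per study per assessment revision could be a simple group-by. The sketch below is illustrative only: the key column names (`appid`, `studyid`, `assessmentid`, `assessmentrevision`) are assumptions and would need to be matched to the real schema.

```python
import pandas as pd

# Assumed key columns; replace with the real column names from the schema.
DEFAULT_KEYS = ("appid", "studyid", "assessmentid", "assessmentrevision")


def split_by_assessment(df, keys=DEFAULT_KEYS):
    """Split an aggregated DataFrame into one table per key combination.

    Returns a dict mapping each key tuple to a DataFrame holding all rows
    (all participants, all days) for that combination.
    """
    return {
        key: group.reset_index(drop=True)
        for key, group in df.groupby(list(keys), sort=True)
    }
```

Each value in the returned dict could then be written out as its own file, which would answer the "where is this output stored" part once a destination is agreed on.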

How is Bridge Downstream being monitored? Are there metrics? Dashboards? Alarms? How can Bridge team get access to these?

  • Answer: The Glue jobs are monitored through the built-in Glue dashboards. (We already know how to monitor Lambda, and SQS and SNS are trivial.)

Should Bridge team have access to Bridge Downstream’s AWS accounts?

  • Answer: See Jira ticket IT-3154.

Requirements For Open Bridge

...