...

MVP is to return downloadable CSVs in the Researcher UI.
Note that we might not be able to achieve one and only one table per assessment. Some assessments contain multiple data sets. For example, the Number Match assessment contains multiple rows for stepHistory, steps, and userInteractions. These rows may or may not match one-to-one, and there may or may not be any meaningful way to combine these disparate data sets into a single table. We might need to provide a set of tables to represent an assessment version.
Regardless, the tables should not be sub-partitioned on date. Each CSV should contain all rows for all participants taking that assessment in the study.
Each assessment will generate a separate CSV file. All CSV files can be downloaded in a single zip file. To see the sample data, s
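
As a rough illustration of the packaging step, the sketch below bundles one CSV per assessment into a single zip archive. The function name bundle_csvs and the "<assessment id>.csv" file naming are assumptions made for this example; the document does not specify either.

```python
import io
import zipfile


def bundle_csvs(csv_by_assessment):
    """Return the bytes of a zip archive holding one CSV per assessment.

    csv_by_assessment maps a hypothetical assessment identifier to CSV text
    that already contains all rows for all participants in the study.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sort by identifier so the archive layout is stable between runs.
        for assessment_id, csv_text in sorted(csv_by_assessment.items()):
            archive.writestr(f"{assessment_id}.csv", csv_text)
    return buffer.getvalue()
```

An assessment that needs several tables could contribute several entries (for example, one per data set) without changing this function.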

Fast turnaround times - When the participant uploads data on their phone, we want Bridge to Export, Validate, Score (if available), and make the data available in the Researcher UI with as little delay as possible.

...

Write the post-processing status to Adherence.
Note that the Bridge Downstream code hardcodes appId=mobile-toolbox. We want to un-hardcode this and either read the appId from the FileEntity annotations, or propagate the appId through each step so that we never lose it.
Also, the appId is currently being set by the app. This should instead be a top-level property in the Exporter itself.
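
A minimal sketch of the "never lose the appId" idea: read the appId from the annotations, fall back to a value propagated from the previous step, and fail loudly instead of defaulting to a hardcoded app. The annotation key "appId", the list-or-string value handling, and the function name are assumptions for illustration, not the actual Bridge Downstream code.

```python
def resolve_app_id(annotations, upstream_app_id=None):
    """Pick the appId for a record without falling back to a hardcoded value."""
    value = annotations.get("appId")  # assumed annotation key
    # Assumption: annotation values may arrive as a list or as a plain string.
    if isinstance(value, list):
        value = value[0] if value else None
    app_id = value or upstream_app_id
    if not app_id:
        # Refuse to guess; a silent default is what we want to get rid of.
        raise ValueError("appId missing from annotations and upstream context")
    return app_id
```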

“Architectural Diagram” and design notes 2023-10-25

...

Addendum 1: Who Needs Parquet

...

DHP-1023 would be split into 2 different Jiras: (a) a JSON-to-Table-Row method, which would live inside the Exporter 3 Worker, costed at 3 sprint points, and (b) a CSV Worker, which queries all the rows and aggregates them into a CSV, costed at 5. The total cost for this task would stay the same, changing from one 8-point task to 2 tasks costing 3 and 5 points. However, there is much less uncertainty because we don’t have to figure out how to read Parquet in Java.
DHP-1025, previously costed at 5, would be split into 2 Jiras: one for “summarizing” the CSV, costed at 3, and another for re-writing the ARC scoring code, probably costed at 2 or 3.
- This scoring code could either be part of the apps or it could live in the Worker.
DHP-1028 would live in Bridge Worker instead of Bridge Downstream, but would stay costed at 1.

We would need to add a Jira for storing the intermediate tabular results that the JSON-to-CSV Worker generates. This work would be costed at 3.
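
To make the JSON-to-Table-Row idea concrete, here is a minimal sketch that flattens one uploaded JSON record into a single flat row. The dotted column naming and the choice to keep arrays as JSON text are assumptions for illustration; the real method would live in the Worker and would presumably be written in Java.

```python
import json


def json_to_table_row(record, parent_key="", sep="."):
    """Flatten a nested JSON object into a flat dict of column -> value."""
    row = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested objects become dotted column names, e.g. "result.score".
            row.update(json_to_table_row(value, column, sep))
        elif isinstance(value, list):
            # Arrays stay in one cell as JSON text instead of a separate table.
            row[column] = json.dumps(value)
        else:
            row[column] = value
    return row
```

The flat rows are the intermediate tabular results that would need to be stored so the CSV Worker can query and aggregate them later.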

Pros

Reduces the estimated amount of work by 12 sprint points.
Completely eliminates DHP-1020, which is the riskiest work item, and removes almost all of the risk from DHP-1023.
Potentially frees up Dwayne’s schedule later in Q4 to help with the Android stuff or the Permissions stuff.
Bridge Downstream runs hourly. If we remove Bridge Downstream, we don’t have to worry about triggering Bridge Downstream in the CSV Worker and adding additional delay.

Cons

We would have to re-write the scoring code for that one ARC measure.
- This is mitigated because we were planning to re-write it from R to Python anyway, and now we’re considering re-writing it in Kotlin (mobile) or Java (Worker).
In the old design, the JSON-to-CSV Worker would generate a zip file with multiple CSVs per assessment. (Some assessments, such as Number Match, could be relationalized into up to 6 Parquet tables.) In the new design, the JSON-to-CSV Worker would generate only metadata unless the Summarize component were written for the assessment.
- This can be mitigated if we have some kind of reasonable default, like a link to the raw data in Synapse.
- We have to write the Summarize components for all 3 ARC measures and for surveys anyway, so this might be a non-issue.
- One possibility is that the apps provide a separate answers.json which is just a flattened map of key-value pairs, which makes Summarize very easy.

Surveys

We can easily build survey processing on top of the Bridge Downstream pipeline as described above for ARC measures.

Survey Schemas. We need survey schemas to validate the raw JSON data.

Possible Design 1: The Researcher UI automatically creates the survey schema.
Possible Design 2: The app includes the survey schema in-line in the survey results. This will require changes in the mobile apps as well as in Bridge Downstream to read the in-line schema instead of following a link.
Possible Design 3: A generic survey schema that simply validates that we have key-value pairs. This means that the survey results are not fully validated, but since the survey JSON format is relatively simple, we don’t lose much by not validating it. This is also the cheapest of the 3 solutions.
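
Possible Design 3 could be as small as the check sketched below, which only confirms that the survey result is a flat map of key-value pairs. The function name and the rule that values must be scalars (string, number, boolean, or null) are assumptions for this example; a real implementation might express the same rule as a JSON Schema.

```python
def validate_survey_answers(answers):
    """Return a list of problems; an empty list means the result is valid."""
    if not isinstance(answers, dict):
        return ["survey result must be a JSON object"]
    problems = []
    for key, value in answers.items():
        if not key:
            problems.append("empty question identifier")
        elif isinstance(value, (dict, list)):
            # Assumption: nested values are rejected to keep one cell per answer.
            problems.append(f"{key}: nested value not allowed")
    return problems
```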

Verify Survey Result Format. We want the Survey CSV in a format where each row is one participant’s answers to a survey and each column is one survey question. For JSON-to-Parquet to generate this, we need survey results to be in a JSON file where each key is the survey question identifier and each value is the survey answer.

iOS already supports this, but we would need to do Android work to support this.

Free text answers. We support this, but we may need to add max character limits to the answers. It’s currently unclear if Parquet has field size limits or if there are other limitations in the app or in the Researcher UI. However, a max character limit of either 250 or 500 should be sufficient for all use cases.

If this turns out to be a lot of work, we always have the option to simply disallow free text answers in V1 and postpone the feature to 2024.

Multiple choice, multiple answers. This is a multiple choice question where the participant can select multiple answers from the list of choices. One limitation is that if we represent the answer as a JSON array, JSON-to-Parquet will separate this array into its own table, whereas we want it in its own column.

One possible workaround is to represent the answer as a comma-separated list instead. This can be accomplished because multiple choice answers contain both the display text (which is shown to the participant) and the value (an internal value representing the answer). We can restrict the value to only contain alphanumeric characters and spaces. This means that we will never have commas in the survey answers, so we can always use a comma-separated list.

This will require work in iOS and Android, and possibly in the Researcher UI as well.

If this turns out to be a lot of work, we have the option to simply disallow multiple answers and postpone the feature to 2024.
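
The comma-separated workaround can be sketched as follows. The function names and the exact pattern (ASCII letters, digits, and spaces) are assumptions for illustration; the point is that once values are restricted this way, joining and splitting on commas is lossless.

```python
import re

# Assumed restriction: choice values contain only letters, digits, and spaces.
VALUE_PATTERN = re.compile(r"[A-Za-z0-9 ]+")


def encode_multi_select(values):
    """Join the selected choice values into one comma-separated cell."""
    for value in values:
        if not VALUE_PATTERN.fullmatch(value):
            raise ValueError(f"invalid choice value: {value!r}")
    return ",".join(values)


def decode_multi_select(cell):
    """Split a comma-separated cell back into the selected choice values."""
    return cell.split(",") if cell else []
```

Validation happens on the value, not the display text, so the text shown to the participant can still contain commas.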

Jira Task Breakdown

ARC Epic: During the discussion, we also identified additional work that would need to be done. These work items would have had to be done whether or not we use Parquet, so they don’t change the cost of the project.

Add a field to Assessment Resources (for some reason renamed to “External Resources” in the JavaSDK) to store data in-place (instead of a link). This would involve adding a new field to Assessment Resource (probably a Json Node so we can store structured data if desired). This could be used to store the JSON Schema or the list of table column names or other documentation. We would also want to add new enum values to Resource Category for SCHEMA and TABLE_COLUMNS. (SCIENCE_DOCUMENTATION and DEVELOPER_DOCUMENTATION already exist.) This would be trivial. Costed at 1 sprint point.
A validation step in the JSON-to-Table-Row method, which logs a message if we fail validation. This would include validating the data against a JSON Schema (if it exists) and comparing the flat map to the list of table columns (if it exists). Costed at 2-3 sprint points.
- Setting up alarms for the validation failures is so trivial that this is included in the previous work item.
CSV Worker should pull the Table Column list from Assessment Resources. If it exists, the CSV Worker should always ensure that these columns are present in the CSV, even if the values are null. Costed at 2-3 sprint points.
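
The column handling in the last two items might look roughly like the sketch below: compare a flat row against the declared table columns, log any mismatch, and pad the row so every declared column is present. The function names, the logger name, and the choice to keep undeclared keys after the declared columns are assumptions for this example.

```python
import logging

logger = logging.getLogger("json_to_table_row")  # hypothetical logger name


def check_columns(row, table_columns):
    """Return (missing, unexpected) column names, logging any mismatch."""
    missing = [column for column in table_columns if column not in row]
    unexpected = [key for key in row if key not in table_columns]
    if missing or unexpected:
        logger.warning("column mismatch: missing=%s unexpected=%s", missing, unexpected)
    return missing, unexpected


def align_row(row, table_columns):
    """Put declared columns first (None when absent), then any extra keys."""
    aligned = {column: row.get(column) for column in table_columns}
    for key, value in row.items():
        # Assumption: undeclared keys are kept rather than dropped.
        aligned.setdefault(key, value)
    return aligned
```

An alarm on the logged warning would cover the validation-failure alerting mentioned above.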

Pros

Reduces the estimated amount of work by 9 sprint points. (More accurately, somewhere between 1 and 9 sprint points, since DHP-1020 is in progress, but closer to 9 since DHP-1020 is very far from complete.)
Completely eliminates DHP-1020, which is the riskiest work item, and removes almost all of the risk from DHP-1023.
Potentially frees up Dwayne’s schedule later in Q4 to help with the Android stuff or the Permissions stuff.
Bridge Downstream runs hourly. If we remove Bridge Downstream, we don’t have to worry about triggering Bridge Downstream in the CSV Worker and adding additional delay.
This trivially solves the multi-select problem. We can always just render the raw JSON array in the CSV.
- That said, this is not a particularly user-friendly output, so we might want to return a comma-delimited list anyway.
Schemas are no longer necessary for the data pipeline. Since we generate the CSV on the fly, we can always just determine the columns on the fly.
- Schemas might still be necessary for documentation purposes and for future data validation work.
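
As a sketch of the last two points, the function below determines the CSV columns on the fly (the union of keys across all rows, in first-seen order) and renders multi-select arrays either as raw JSON or as a comma-delimited list. The function names and the list_style parameter are assumptions for illustration.

```python
import csv
import io
import json


def render_cell(value, list_style):
    """Render a list as raw JSON or as a comma-delimited string."""
    if isinstance(value, list):
        if list_style == "comma":
            return ",".join(str(item) for item in value)
        return json.dumps(value)
    return value


def rows_to_csv(rows, list_style="json"):
    """Build CSV text from flat rows, discovering the columns on the fly."""
    columns = list(dict.fromkeys(key for row in rows for key in row))
    out = io.StringIO()
    # restval fills cells for columns that a given row does not have.
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow({key: render_cell(value, list_style) for key, value in row.items()})
    return out.getvalue()
```

With list_style="comma", a multi-select answer such as ["red", "blue"] appears in the cell as red,blue, which is friendlier to read than the raw JSON array.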

