
  • DHP-1023 would be split into 2 different Jiras: (a) a JSON-to-Table-Row method, which would live inside the Exporter 3 Worker, costed at 3 sprint points, and (b) a CSV Worker, which queries all the rows and aggregates them into a CSV, costed at 5. The total cost would stay the same, changing from one 8-point task to 2 tasks costing 3 and 5 points. However, there is much less uncertainty because we don’t have to figure out how to read Parquet in Java. (A sketch of this split follows the list.)

  • DHP-1025, previously costed at 5, would be split into 2 Jiras: one for “summarizing” the CSV, costed at 3, and another for re-writing the ARC scoring code, probably costed at 2 or 3.

    • This scoring code could either be part of the apps or it could live in the Worker.

  • DHP-1028 would live in Bridge Worker instead of Bridge Downstream, but would stay costed at 1.
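To make the DHP-1023 split concrete, here is a minimal sketch of the JSON-to-Table-Row method that would live inside the Exporter 3 Worker. The class and method names are hypothetical, chosen for illustration; the only thing taken from the discussion is that we flatten exported JSON into a flat map of column name to value, with no Parquet involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;

/**
 * Hypothetical sketch of the JSON-to-Table-Row method (DHP-1023, part a).
 * Flattens an exported JSON document into a flat map of
 * column name -> string value, suitable for storage as a table row.
 */
public class JsonToTableRow {

    /** Flattens nested objects using dot-separated column names. */
    public static Map<String, String> flatten(JsonNode node) {
        Map<String, String> row = new LinkedHashMap<>();
        flattenInto("", node, row);
        return row;
    }

    private static void flattenInto(String prefix, JsonNode node, Map<String, String> row) {
        if (node.isObject()) {
            node.fields().forEachRemaining(entry -> flattenInto(
                    prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey(),
                    entry.getValue(), row));
        } else if (node.isArray()) {
            // Arrays (e.g. multi-select answers) are kept as raw JSON for now;
            // see the multi-select discussion under Pros.
            row.put(prefix, node.toString());
        } else {
            row.put(prefix, node.asText());
        }
    }
}
```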

We would need to add a Jira for storing the intermediate tabular results that the JSON-to-Table-Row method generates. This work would be costed at 3.
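The shape of that intermediate storage is still open, but as a sketch, each stored row could be as simple as a couple of identifiers plus the flat map produced above. The record class and its field names are assumptions for illustration, not a settled design.

```java
import java.util.Map;

/**
 * Hypothetical intermediate table row (the new storage Jira).
 * The identifiers and their granularity are assumptions; the
 * essential content is the flat column -> value map.
 */
public record TableRow(
        String studyId,   // assumed partitioning, e.g. per study
        String recordId,  // assumed unique id of the exported record
        Map<String, String> columns) {
}
```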

During the discussion, we also identified additional work that would need to be done. These work items would have had to be done whether or not we use Parquet, so they don’t change the cost of the project.

  • Add a field to Assessment Resources (for some reason renamed to “External Resources” in the JavaSDK) to store data in-place (instead of a link). This would involve adding a new field to Assessment Resource (probably a JsonNode so we can store structured data if desired). This could be used to store the JSON Schema or the list of table column names or other documentation. We would also want to add new enum values to Resource Category for SCHEMA and TABLE_COLUMNS. (SCIENCE_DOCUMENTATION and DEVELOPER_DOCUMENTATION already exist.) This would be trivial. Costed at 1 sprint point.

  • A validation step in the JSON-to-Table-Row method, which logs a message if we fail validation. This would include validating the data against a JSON Schema (if it exists) and comparing the flat map to the list of table columns (if it exists). Costed at 2-3 sprint points. (A minimal sketch follows this list.)

    • Setting up alarms for the validation failures is so trivial that this is included in the previous work item.

  • The CSV Worker should pull the Table Column list from Assessment Resources. If it exists, the CSV Worker should always ensure that these columns are present in the CSV, even if the values are null. Costed at 2-3 sprint points. (A sketch also follows this list.)
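A minimal sketch of the validation step, under the assumptions above: the expected column list comes from the new TABLE_COLUMNS Assessment Resource, and we only log the mismatch (so alarms can fire on the log message) rather than failing the export. The class and method names are hypothetical; validating against a JSON Schema would additionally run the document through a schema-validator library if a SCHEMA resource exists.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.logging.Logger;

/** Hypothetical validation step inside the JSON-to-Table-Row method. */
public class TableRowValidator {
    private static final Logger LOG = Logger.getLogger(TableRowValidator.class.getName());

    /**
     * Compares the flattened row against the expected column list
     * (the TABLE_COLUMNS resource, if present). Logs mismatches and
     * never throws, so a bad record can't break the export.
     */
    public static void validate(String recordId, Map<String, String> row,
            List<String> expectedColumns) {
        if (expectedColumns == null) {
            return; // no TABLE_COLUMNS resource; nothing to compare against
        }
        Set<String> missing = new TreeSet<>(expectedColumns);
        missing.removeAll(row.keySet());
        Set<String> unexpected = new TreeSet<>(row.keySet());
        expectedColumns.forEach(unexpected::remove);

        if (!missing.isEmpty() || !unexpected.isEmpty()) {
            LOG.warning("Validation failed for record " + recordId
                    + ": missing columns " + missing
                    + ", unexpected columns " + unexpected);
        }
    }
}
```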
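Similarly, a sketch of how the CSV Worker could guarantee the declared columns appear even when every value is null, while still picking up extra columns on the fly (this is also what makes schemas unnecessary for the pipeline, per the Pros below). Names are again illustrative assumptions, and a real implementation would escape values with a proper CSV library.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical column handling inside the CSV Worker. */
public class CsvColumnPlanner {

    /**
     * Header = declared Table Columns first (always present, even if
     * no row has a value), then any extra columns seen in the rows.
     */
    public static List<String> planHeader(List<String> declaredColumns,
            List<Map<String, String>> rows) {
        Set<String> header = new LinkedHashSet<>();
        if (declaredColumns != null) {
            header.addAll(declaredColumns);
        }
        for (Map<String, String> row : rows) {
            header.addAll(row.keySet());
        }
        return List.copyOf(header);
    }

    /** Renders one row against the header; missing values become empty cells. */
    public static String renderRow(List<String> header, Map<String, String> row) {
        // Note: no escaping here; use a CSV library for real output.
        return header.stream()
                .map(col -> row.getOrDefault(col, ""))
                .collect(Collectors.joining(","));
    }
}
```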

Pros

  • Reduces the estimated amount of work by 9 sprint points. (More accurately, somewhere between 1 and 9 sprint points, since DHP-1020 is in progress, but closer to 9 since DHP-1020 is very far from complete.)

  • Completely eliminates DHP-1020, which is the riskiest work item, and removes almost all of the risk from DHP-1023.

  • Potentially frees up Dwayne’s schedule later in Q4 to help with the Android stuff or the Permissions stuff.

  • Bridge Downstream runs hourly. If we remove it, we don’t have to worry about triggering it from the CSV Worker and adding additional delay.

  • This trivially solves the multi-select problem. We can always just render the raw JSON array in the CSV.

    • That said, this is not a particularly user-friendly output, so we might want to return a comma-delimited list anyway. (See the sketch after this list.)

  • Schemas are no longer necessary for the data pipeline. Since we generate the CSV on the fly, we can always just determine the columns on the fly.

    • Schemas might still be necessary for documentation purposes and for future data validation work.
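For the multi-select point above, a sketch of both output options, assuming Jackson (which the JavaSDK already uses for JsonNode) is on the classpath. The helper class is hypothetical.

```java
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

import com.fasterxml.jackson.databind.JsonNode;

/** Hypothetical rendering of a multi-select answer in the CSV. */
public class MultiSelectRenderer {

    /** Option (a): render the raw JSON array, e.g. ["red","blue"]. */
    public static String asRawJson(JsonNode array) {
        return array.toString();
    }

    /**
     * Option (b): the friendlier comma-delimited list, e.g. red,blue.
     * (Inside a CSV, this value would need quoting/escaping.)
     */
    public static String asCommaDelimited(JsonNode array) {
        return StreamSupport.stream(array.spliterator(), false)
                .map(JsonNode::asText)
                .collect(Collectors.joining(","));
    }
}
```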
