Bridge Data Pipeline
version | comment |
---|---|
20211108 | Added this table |
20160307 | Remove BridgeExporter |
20150811 | Initial |
Overview
TODO: I bet a diagram would be helpful here.
- A user calls the upload API to create an upload session, uploads a file, and calls upload complete to start the upload validation process. (See Bridge REST API#Pre-signedFileUpload for details.)
- The upload itself is (as described in Bridge Upload Data Format) a zip file containing files of a single activity (survey or task), encrypted with the study's public key (aka cert).
- Upload validation handlers are kicked off. See next section for more details.
- The Bridge Exporter runs every night at 10am UTC (2am PST / 3am PDT) and exports the previous day's health data records (using uploadDate) to Synapse, creating Synapse tables as appropriate. For more info, see Bridge Exporter.
Upload Validation Handlers
- Upload validation goes through the UploadValidationService, which creates an UploadValidationTask and runs it in an asynchronous worker thread. Upload validation is broken down into handlers so we can implement, configure, and test each sub-task individually. Relevant code:
- UploadValidationService: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/services/UploadValidationService.java
- UploadValidationTaskFactory: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UploadValidationTaskFactory.java
- UploadValidationTask: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UploadValidationTask.java
- Spring config for Upload Validation Handlers: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/config/BridgeSpringConfig.java#L269
- UploadValidationContext: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UploadValidationContext.java - This keeps track of state of the upload validation.
- S3DownloadHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/S3DownloadHandler.java) downloads the file from S3 to context.data. The S3 bucket is usually org-sagebridge-upload-prod and the filename is the upload ID (usually a guid).
- DecryptHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/DecryptHandler.java) decrypts the file from context.data and writes the results to context.decryptedData. This calls through to the UploadArchiveService, which calls through to Bouncy Castle.
- UnzipHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UnzipHandler.java) unzips the file from context.decryptedData and writes the results to context.unzippedDataMap. This calls through to the UploadArchiveService, which calls through to the Zipper. The Zipper uses a combination of Apache's ByteArrayOutputStream and Java's ZipInputStream to unzip the file and protect against zip bomb attacks.
- ParseJsonHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/ParseJsonHandler.java) goes through each file in the archive and attempts to parse it as JSON, using Jackson. For files that it successfully parses, it moves them from the context.unzippedDataMap to context.jsonDataMap. Files that can't be parsed as JSON are assumed to be non-JSON files and remain in context.unzippedDataMap.
- IosSchemaValidationHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/IosSchemaValidationHandler2.java) parses the data against a schema, using both context.unzippedDataMap and context.jsonDataMap and creates a prototype health data record in context.healthDataRecordBuilder and upload attachments in context.attachmentsByFieldName. This is fairly complex. See the next section for details.
- TranscribeConsentHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/TranscribeConsentHandler.java) copies the sharing scope (no sharing, sharing sparsely, sharing broadly) and external ID from participant options to the prototype health data record in context.healthDataRecordBuilder. This should probably be renamed to TranscribeParticipantOptionsHandler.
- UploadArtifactsHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UploadArtifactsHandler.java) takes the prototype record in context.healthDataRecordBuilder and uploads it to the HealthDataRecord table. It also takes the attachments in context.attachmentsByFieldName and uploads them to the health data attachments S3 bucket (generally org-sagebridge-attachment-prod), then writes their attachment IDs (which are the same as the S3 filenames) back into the health data record.
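The handler chain above can be sketched in a few lines. This is only an illustration in Python, not the Java implementation: the function names, the dictionary-based context, the size cap, and the stop-at-first-failure behavior are all assumptions. The context keys mirror the fields named in the text (decryptedData, unzippedDataMap, jsonDataMap), and only the unzip and JSON-parsing steps are shown.

```python
import io
import json
import zipfile

# Hypothetical per-entry cap; the real Zipper's limits are not stated in this document.
MAX_ENTRY_BYTES = 1_000_000


def unzip_handler(context):
    """Unzip context['decryptedData'] into context['unzippedDataMap'], capping entry size."""
    unzipped = {}
    with zipfile.ZipFile(io.BytesIO(context["decryptedData"])) as archive:
        for entry in archive.infolist():
            with archive.open(entry) as stream:
                # Read at most one byte past the cap, so an oversized entry is
                # detected without inflating all of it into memory.
                data = stream.read(MAX_ENTRY_BYTES + 1)
            if len(data) > MAX_ENTRY_BYTES:
                raise ValueError(f"zip entry {entry.filename} exceeds size cap")
            unzipped[entry.filename] = data
    context["unzippedDataMap"] = unzipped


def parse_json_handler(context):
    """Move every file that parses as JSON from unzippedDataMap to jsonDataMap."""
    json_map = {}
    for filename, data in list(context["unzippedDataMap"].items()):
        try:
            json_map[filename] = json.loads(data)
        except ValueError:
            continue  # not JSON: the file stays in unzippedDataMap
        del context["unzippedDataMap"][filename]
    context["jsonDataMap"] = json_map


def run_validation(context, handlers):
    """Run handlers in order; on the first exception, record a message and stop (assumed)."""
    for handler in handlers:
        try:
            handler(context)
        except Exception as ex:
            context["messages"].append(f"{handler.__name__}: {ex}")
            context["success"] = False
            return context
    context["success"] = True
    return context
```

Because each handler only reads and writes named fields on the shared context, a handler can be tested by building a context with just the fields it needs, which is the motivation given above for splitting validation into handlers.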
Schemas and Data Formats
See Bridge Upload Data Format for more details about schemas and data formats. This document describes how the server processes data.
Uploaded data should always include an info.json that looks like:
    {
      "taskRun" : "331270A8-2280-4731-9247-0EA7FE2C35E9",
      "appName" : "Asthma Health",
      "files" : [
        {
          "contentType" : "application\/json",
          "filename" : "NonIdentifiableDemographicsTask.json",
          "timestamp" : "2015-06-12T09:55:21-07:00"
        }
      ],
      "item" : "NonIdentifiableDemographicsTask",
      "appVersion" : "version 1.0.7, build 26-YML",
      "phoneInfo" : "iPhone 6"
    }
Upload validation (specifically the IosSchemaValidationHandler) looks at the value in the "item" field (in this case "NonIdentifiableDemographicsTask") and matches it with an upload schema. Upload validation currently assumes the latest schema revision. However, a future app update should send a "schemaRevision" field, so we can match up newer or older versions of tasks or surveys as appropriate.
The first thing IosSchemaValidationHandler does is determine whether this is survey data or non-survey data (because surveys and non-surveys have fundamentally different data formats). It does so by checking the schemaType field in the schema (which can be IOS_SURVEY or IOS_DATA).
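As a rough sketch of the lookup just described, the following Python picks the latest revision of the schema named by the "item" field and then checks its type. The function names are hypothetical, and comparing "item" against schemaId is an assumption based on the schema example and log messages in this document; the real logic lives in the Java handler.

```python
def select_schema(info_json, schemas):
    """Return the highest-revision schema whose schemaId equals info.json's "item"."""
    item = info_json.get("item")
    if item is None:
        raise ValueError('info.json is missing "item" field')
    candidates = [schema for schema in schemas if schema["schemaId"] == item]
    if not candidates:
        raise LookupError(f"Upload schema not found for schema ID {item}")
    # Upload validation currently assumes the latest revision.
    return max(candidates, key=lambda schema: schema["revision"])


def is_survey(schema):
    """True for IOS_SURVEY schemas; case is normalized because examples use lowercase."""
    return schema["schemaType"].upper() == "IOS_SURVEY"
```

If apps start sending a "schemaRevision" field, select_schema would be the place to match on an exact revision instead of taking the maximum.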
IOS_DATA
IOS_DATA can contain one or more files, which can be JSON or otherwise. The best way to illustrate how this works is to use a contrived example, taken from the "mixed data" integration tests (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/test/org/sagebionetworks/bridge/upload/IosSchemaValidationHandler2Test.java#L343).
    File: nonJsonFile.txt
    (could be a binary file, like an m4a audio file)

    File: attachment.json
    { "attachment":"This is an attachment" }

    File: inline.json
    { "string":"inline value" }

    File: field.json
    { "attachment":["mixed", "data", "attachment"], "string":"This is a string" }

context.unzippedDataMap will contain only nonJsonFile.txt. context.jsonDataMap will contain attachment.json, inline.json, and field.json (and info.json, not pictured). Because the JSON data files can have the same field names (for a real world example, see the Parkinson's Walking Activity or the Cardio 6-Minute Walk Test), we need to prepend the file names to the field names. The IosSchemaValidationHandler will create a second map of all the JSON data, called the flattenedJsonDataMap, whose key is the concatenated filename.fieldname and whose value is the JSON value. In our example, this looks like:

    (mixed syntax because the top-level map is a Java map, not a JSON node)
    "attachment.json.attachment" => "This is an attachment"
    "inline.json.string" => "inline value"
    "field.json.attachment" => ["mixed", "data", "attachment"]
    "field.json.string" => "This is a string"
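The flattening step is small enough to sketch directly. This Python version is illustrative only (the function name is hypothetical, and skipping files whose top-level value is not a JSON object is an assumption):

```python
def flatten_json_data_map(json_data_map):
    """Build a map keyed by "filename.fieldname" from a map of parsed JSON files."""
    flattened = {}
    for filename, node in json_data_map.items():
        if not isinstance(node, dict):
            continue  # assumption: only JSON objects contribute fields
        for field_name, value in node.items():
            flattened[f"{filename}.{field_name}"] = value
    return flattened


json_data_map = {
    "attachment.json": {"attachment": "This is an attachment"},
    "inline.json": {"string": "inline value"},
    "field.json": {"attachment": ["mixed", "data", "attachment"], "string": "This is a string"},
}
flattened = flatten_json_data_map(json_data_map)
```

Prefixing with the filename is what keeps attachment.json's "attachment" field and field.json's "attachment" field from colliding.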
The IosSchemaValidationHandler then walks the field definition list of the specified schema. First, it checks if the field name matches a filename, and if it does, it copies over the entire file into the field. If the field name doesn't match any file name, it then checks the concatenated field names of the flattenedJsonDataMap.
In our example, the schema looks like:
    {
      "name":"Mixed Data",
      "schemaId":"mixed-data",
      "revision":1,
      "schemaType":"ios_data",
      "fieldDefinitions": [
        { "name":"nonJsonFile.txt", "type":"attachment_blob", "required":"true" },
        { "name":"attachment.json", "type":"attachment_json_blob", "required":"true" },
        { "name":"inline.json", "type":"inline_json_blob", "required":"true" },
        { "name":"field.json.attachment", "type":"attachment_json_blob", "required":"true" },
        { "name":"field.json.string", "type":"string", "required":"true" }
      ],
      "type":"UploadSchema"
    }
Going over each field one-by-one:
- nonJsonFile.txt - The handler finds nonJsonFile.txt in context.unzippedDataMap. This is of type attachment_blob, so it copies the data in nonJsonFile.txt into context.attachmentsByFieldName. (Note that for non-JSON files, the field type should always be an attachment_* type. Currently, upload validation won't fail if it isn't, but this might lead to problems in the Bridge Exporter. TODO)
- attachment.json - Similar to the above, but the handler finds attachment.json in context.jsonDataMap instead. Since the field type is attachment_json_blob, it copies the data into context.attachmentsByFieldName.
- inline.json - Similar to the above, but since the field type is inline_json_blob, it copies the data into the health data record.
- field.json.attachment - This is not a filename, but it has an entry in the flattenedJsonDataMap, and the field type is attachment_json_blob, so the value ["mixed", "data", "attachment"] is copied into context.attachmentsByFieldName.
- field.json.string - Similar to the above, but the field type is string, so it's copied into the health data record.
NOTE: Bridge schemas have a concept of required vs optional fields. However, YML apps do not, so all fields in upload schemas currently are set to required, and if a field is missing, the only consequence is that a message is written to the upload validation context.
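The field walk described above can be sketched as follows. This is a simplified Python illustration, not the handler's actual code: the function name, the message text, and the set of attachment types (limited to the two that appear in the example schema) are assumptions.

```python
# Only the attachment types used in the example schema; the real list may be longer.
ATTACHMENT_TYPES = {"attachment_blob", "attachment_json_blob"}


def apply_schema(field_definitions, unzipped_map, json_map, flattened_map):
    """Route each schema field to the record or to attachments.

    Lookup order follows the text: whole files first (non-JSON, then JSON),
    then "filename.fieldname" keys in the flattened map.
    """
    record, attachments, messages = {}, {}, []
    for field in field_definitions:
        name = field["name"]
        if name in unzipped_map:
            value = unzipped_map[name]
        elif name in json_map:
            value = json_map[name]
        elif name in flattened_map:
            value = flattened_map[name]
        else:
            # A missing field only produces a message; validation does not fail.
            messages.append(f"Missing field {name}")
            continue
        if field["type"] in ATTACHMENT_TYPES:
            attachments[name] = value
        else:
            record[name] = value
    return record, attachments, messages
```

Run against the example schema and files, this would place nonJsonFile.txt, attachment.json, and field.json.attachment in the attachments map, and inline.json and field.json.string in the record.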
Surveys
Surveys have one file for each survey question/answer. To use a contrived example from the unit tests:
    File: foo.json
    {
      "questionType":0,
      "booleanAnswer":"foo answer",
      "startDate":"2015-04-02T03:26:57-07:00",
      "questionTypeName":"Text",
      "item":"foo",
      "endDate":"2015-04-02T03:26:59-07:00"
    }

    File: bar.json
    {
      "questionType":0,
      "numericAnswer":42,
      "unit":"lb",
      "startDate":"2015-04-02T03:26:57-07:00",
      "questionTypeName":"Integer",
      "item":"bar",
      "endDate":"2015-04-02T03:26:59-07:00"
    }

    File: baz.json
    {
      "questionType":0,
      "choiceAnswers":["survey", "blob"],
      "startDate":"2015-04-02T03:26:57-07:00",
      "questionTypeName":"MultipleChoice",
      "item":"baz",
      "endDate":"2015-04-02T03:26:59-07:00"
    }

The fields that matter are item, questionTypeName, unit, and the answer field (whose name varies). The other fields are ignored and are only included for completeness. (Note that new versions of the YML apps also include a "saveable", "userInfo", and "answers" field, none of which are used. The "answers" field would have been useful, except old apps don't use it, so we have to look for the variant answer fields anyway.)
Because the answer field can have different names depending on the question/answer type, we need a mapping between questionTypeName and the answer field name (see https://github.com/Sage-Bionetworks/BridgePF/blob/517731c877ad185a7ac586dbdbf966d0e5c73180/app/org/sagebionetworks/bridge/upload/IosSchemaValidationHandler2.java#L68 for this mapping). The handler then uses the value in the item field as the field name and the value in the answer field as the field value and creates a new map. If the survey answer has a unit field, it also creates a field whose name is the item value with "_unit" appended to it. To follow our example, we'd end up with the following map:

    (mixed syntax because the top-level map is a Java map, not a JSON node)
    "foo" => "foo answer"
    "bar" => 42
    "bar_unit" => "lb"
    "baz" => ["survey", "blob"]
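A minimal Python sketch of the survey conversion follows. The mapping here is a hypothetical two-entry subset taken from the bar and baz examples; the full mapping is in IosSchemaValidationHandler2 at the link above. The function name and the choice to skip answers that cannot be interpreted are also assumptions.

```python
# Hypothetical subset of the questionTypeName -> answer field mapping.
ANSWER_FIELD_BY_QUESTION_TYPE = {
    "Integer": "numericAnswer",
    "MultipleChoice": "choiceAnswers",
}


def build_survey_answer_map(json_data_map):
    """Collapse one-file-per-answer survey data into an item -> answer map."""
    answers = {}
    for filename, node in json_data_map.items():
        if filename == "info.json":
            continue  # metadata, not a survey answer
        answer_field = ANSWER_FIELD_BY_QUESTION_TYPE.get(node.get("questionTypeName"))
        if answer_field is None or answer_field not in node:
            continue  # assumption: skip answers we cannot interpret
        item = node["item"]
        answers[item] = node[answer_field]
        if "unit" in node:
            answers[item + "_unit"] = node["unit"]
    return answers


survey_files = {
    "bar.json": {"questionTypeName": "Integer", "numericAnswer": 42, "unit": "lb", "item": "bar"},
    "baz.json": {"questionTypeName": "MultipleChoice", "choiceAnswers": ["survey", "blob"], "item": "baz"},
}
answer_map = build_survey_answer_map(survey_files)
```

Here answer_map holds "bar", "bar_unit", and "baz", matching the corresponding entries in the map shown above.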
The handler then goes through the same code path as IOS_DATA to match it with the schema, except since there are no more files or file names, it just copies the data straight across to the health data record, or to attachments, if any of the fields are attachments. (This is strictly speaking untrue. The handler actually treats this as the new jsonDataMap (that is, it treats each question-answer key-value pair as its own JSON file). Except in practice, survey answers are never JSON structs, so field names always match up directly with filenames, which are actually survey item identifier names.)
See Upload Survey Integration Design for how this could be integrated with the survey API for future improvements.
Dynamo DB Tables
HealthDataAttachment - Keys health data attachments to health data records. A record can have many attachments, but an attachment can only live in a single record.
HealthDataRecord3 - Stores metadata and flattened data processed from uploads. The "data" field is a flat field ready to be exported directly to Synapse Tables. The data field also holds attachment IDs. Bridge knows these are attachment IDs because of the schema.
Upload2 - Stores metadata for raw uploads, including MD5, healthcode, timestamps, validation status, and validation messages. The primary key uploadId also serves as the unique S3 key for the raw upload data.
UploadDedupe - (IN DEVELOPMENT) A dedupe table that keys off of health code, schema, and timestamp.
UploadSchema - Upload schemas. See above for more details.
Misc Hacks
TODO
How to Set Up New Schemas
Long-term, the researcher UI should take care of this. Short-term, we follow this process:
- Confirm with app developers the field names and types of the data coming in. Examples will be crucial.
- Create the schema in staging. The easiest way to do this is to copy-paste the example from the previous section and update the name, ID, and fields. (Don't forget to remove revision, as this will be automatically incremented by the server.) Use the REST APIs at Bridge REST API#UploadSchemas to create this on the server.
- Ensure the app developers send data to staging. Wait a day for the Bridge Exporter to run (or force a manual run of the Exporter).
- Confirm with the research team that the data in Synapse is the data they want. Fix issues and repeat if necessary.
- Once everything looks good, create the schema in prod.
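The copy-and-edit step for a new schema can be sketched like this. The helper name is hypothetical, and the sketch only prepares the JSON payload; submitting it goes through the REST APIs referenced above, which are not shown here.

```python
import copy
import json


def new_schema_payload(example_schema, name, schema_id, field_definitions):
    """Derive a new schema payload from an existing example schema."""
    payload = copy.deepcopy(example_schema)
    payload["name"] = name
    payload["schemaId"] = schema_id
    payload["fieldDefinitions"] = field_definitions
    # The server assigns the revision, so it must not be sent.
    payload.pop("revision", None)
    return payload


example = {
    "name": "Mixed Data",
    "schemaId": "mixed-data",
    "revision": 1,
    "schemaType": "ios_data",
    "fieldDefinitions": [],
    "type": "UploadSchema",
}
payload = new_schema_payload(
    example,
    name="My New Task",          # placeholder values for illustration
    schema_id="my-new-task",
    field_definitions=[{"name": "data.json.value", "type": "string", "required": "true"}],
)
body = json.dumps(payload, indent=2)
```

Working on a deep copy keeps the example schema untouched, so the same template can be reused for the prod schema once staging looks good.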
I strongly recommend keeping the app developers and the research team in the same loop, rather than trying to play middleman between the two of them.
How to Debug Data Issues
There is currently a lot of noise in the upload stream: 800 upload validation failures per day in prod. We need to regularly scrub the logs to ensure there are no new data issues and verify the old ones are being fixed. I generally do this twice a week. (You'll also need to do these steps if there's a specific issue you're looking for.)
- When upload validation fails, it'll generate a log message that starts with "Exception thrown from upload validation handler". Using Logentries (or equivalent log tool), search for this string.
You can filter out log entries corresponding to known issues (see App Data Issues). For example, you can use the log search
"Exception thrown from upload validation handler" NOT "Log Food" NOT sevenDayFitnessAllocation NOT Journal NOT DuplicateZipEntryException
- A few issues aren't listed in known issues:
- "Upload schema not found for study parkinson, schema ID Medication Tracker" - This actually represents two issues. The first is that the original medication tracker is broken, and a new format is slated to replace it. The second is that this specific schema "Medication Tracker" always uploads blank data and should be discarded.
- "info.json is missing "item" field" - This also represents two issues. The first is that for v1.0 of the asthma app, the Air Quality Report didn't include the item field, which makes the data stream unparseable. The second is that some uploads of the old medication tracker didn't include the item field and are similarly unparseable. This may represent a legitimate issue, but given the amount of noise from the first two issues, this is difficult if not impossible to determine.
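If the logs are exported to a file rather than searched in Logentries, the same filter can be approximated offline. This Python sketch is an assumption about workflow, not an existing tool: the function name is hypothetical, and plain case-sensitive substring matching may differ from how the log tool evaluates its NOT terms.

```python
FAILURE_MARKER = "Exception thrown from upload validation handler"

# Terms for known issues, copied from the example log search above.
KNOWN_ISSUE_TERMS = (
    "Log Food",
    "sevenDayFitnessAllocation",
    "Journal",
    "DuplicateZipEntryException",
)


def unexplained_failures(log_lines, known_terms=KNOWN_ISSUE_TERMS):
    """Return validation failure lines that match none of the known-issue terms."""
    return [
        line
        for line in log_lines
        if FAILURE_MARKER in line and not any(term in line for term in known_terms)
    ]
```

Additional known issues, such as the two described in the bullets above, can be filtered by passing a longer known_terms tuple.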
Sometimes, you might find a real issue, and you need to get at the raw upload in order to investigate it. When that happens, you can use the BulkDownloadUtil (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/util/BulkDownloadUtil.java), which lives in the BridgePF project. To use this, follow these steps:
WARNING: This involves changing your local bridge.conf to point to prod. This is absolutely not safe, and we'll need to invest in a better tool that doesn't require you to do this.
Open up your ~/.sbt/bridge.conf and modify/add the following lines:
    local.upload.bucket = org-sagebridge-upload-prod
    local.upload.cms.cert.bucket = org-sagebridge-upload-cms-cert-prod
    local.upload.cms.priv.bucket = org-sagebridge-upload-cms-priv-prod
(You may want to comment out the original lines, so you can restore them quickly. If this is in staging, replace "prod" with "uat".)
- If this is in staging, in the BulkDownloadUtil, where it sets up the DDB tables, change "prod-heroku-*" to "uat-heroku-*".
In your git root, run
activator "run-main org.sagebionetworks.bridge.util.BulkDownloadUtil (one or more upload IDs, separated by spaces)"
- The upload files will be downloaded, decrypted, and unzipped into the tmp directory in your git root.
- IMPORTANT: Once you're done, change your bridge.conf back.
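One way to reduce the risk of leaving bridge.conf pointed at prod is to apply the overrides from a script that always restores the original file. The helper below is a hypothetical Python sketch, not part of BridgePF; it assumes the file uses simple "key = value" lines as in the snippet above.

```python
import contextlib
import shutil
from pathlib import Path


@contextlib.contextmanager
def temporary_conf_overrides(conf_path, overrides):
    """Apply key/value overrides to a conf file, restoring the original on exit."""
    conf_path = Path(conf_path)
    backup_path = conf_path.with_name(conf_path.name + ".bak")
    shutil.copy2(conf_path, backup_path)
    try:
        kept = []
        for line in conf_path.read_text().splitlines():
            key = line.split("=", 1)[0].strip()
            if key not in overrides:
                kept.append(line)  # unrelated settings, comments, blank lines
        kept.extend(f"{key} = {value}" for key, value in overrides.items())
        conf_path.write_text("\n".join(kept) + "\n")
        yield conf_path
    finally:
        # Runs even if the body raises, so the original file always comes back.
        backup_path.replace(conf_path)
```

A caller would wrap the BulkDownloadUtil run in a with block, passing the three bucket settings listed above as the overrides, so the restore happens even if the download fails partway through.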
Long-term, when the data stream is cleaned up, we'll want to change this info message to an error. Additionally, we want to start sending automated notifications to app developers for errors in the upload stream. For new studies, where the upload stream will presumably start clean, we want to start sending them automated notifications right away, so they'll keep the upload stream clean.