Overview
TODO: A diagram would be helpful here.
- The user calls the upload API to create an upload session, uploads the file, and calls upload complete to start the upload validation process. (See Bridge REST API for details.)
- The upload itself is (as described in Bridge Upload Data Format) a zip file containing files of a single activity (survey or task), encrypted with the study's public key (aka cert).
- Upload validation handlers are kicked off. See next section for more details.
- The Bridge Exporter runs every night at 9am UTC (2am PDT) and exports the previous day's health data records (keyed by uploadDate) to Synapse, creating Synapse tables as needed. See details in the next section as well as at Bridge Exporter Script.
Upload Validation Handlers
- Upload validation goes through the UploadValidationService, which creates an UploadValidationTask and runs it in an asynchronous worker thread. Upload validation is broken down into handlers so we can implement, configure, and test each sub-task individually. Relevant code:
- UploadValidationService: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/services/UploadValidationService.java
- UploadValidationTaskFactory: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UploadValidationTaskFactory.java
- UploadValidationTask: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UploadValidationTask.java
- Spring config for Upload Validation Handlers: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/config/BridgeSpringConfig.java#L269
- UploadValidationContext: https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UploadValidationContext.java - This keeps track of state of the upload validation.
- S3DownloadHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/S3DownloadHandler.java) downloads the file from S3 to context.data. The S3 bucket is usually org-sagebridge-upload-prod and the filename is the upload ID (usually a guid).
- DecryptHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/DecryptHandler.java) decrypts the file from context.data and writes the results to context.decryptedData. This calls through to the UploadArchiveService, which calls through to Bouncy Castle.
- UnzipHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UnzipHandler.java) unzips the file from context.decryptedData and writes the results to context.unzippedDataMap. This calls through to the UploadArchiveService, which calls through to the Zipper. The Zipper uses a combination of Apache's ByteArrayOutputStream and Java's ZipOutputStream to unzip the file and protect against zip bomb attacks.
- ParseJsonHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/ParseJsonHandler.java) goes through each file in the archive and attempts to parse it as JSON, using Jackson. For files that it successfully parses, it moves them from the context.unzippedDataMap to context.jsonDataMap. Files that can't be parsed as JSON are assumed to be non-JSON files and remain in context.unzippedDataMap.
- IosSchemaValidationHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/IosSchemaValidationHandler2.java) parses the data against a schema, using both context.unzippedDataMap and context.jsonDataMap and creates a prototype health data record in context.healthDataRecordBuilder and upload attachments in context.attachmentsByFieldName. This is fairly complex. See the next section for details.
- TranscribeConsentHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/TranscribeConsentHandler.java) copies the sharing scope (no sharing, sharing sparsely, sharing broadly) and external ID from participant options to the prototype health data record in context.healthDataRecordBuilder. This should probably be renamed to TranscribeParticipantOptionsHandler.
- UploadArtifactsHandler (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/upload/UploadArtifactsHandler.java) takes the prototype record in context.healthDataRecordBuilder and uploads it to the HealthDataRecord table. It also takes the attachments in context.attachmentsByFieldName and uploads them to the health data attachments S3 bucket (generally org-sagebridge-attachment-prod), then writes their attachment IDs (which is the same as the S3 filename) back into the health data record.
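The handler-chain pattern above can be sketched roughly like this. This is an illustrative sketch only; the real classes are the ones linked above, and the context carries more state than shown here.

```java
import java.util.List;

// Minimal sketch of the handler-chain pattern: each handler reads and writes
// shared state on a context object, and the task runs them in order.
public class HandlerChainSketch {
    static class Context {
        byte[] data;            // raw S3 download
        byte[] decryptedData;   // set by the decrypt step
        StringBuilder messages = new StringBuilder();
    }

    interface Handler {
        void handle(Context ctx) throws Exception;
    }

    // Runs each handler in order; on failure, records a message and stops,
    // which loosely mirrors how UploadValidationTask reports handler errors.
    static void runTask(Context ctx, List<Handler> handlers) {
        for (Handler h : handlers) {
            try {
                h.handle(ctx);
            } catch (Exception e) {
                ctx.messages.append("Exception thrown from upload validation handler: ")
                            .append(e.getMessage());
                break;
            }
        }
    }
}
```

The payoff of this structure, as noted above, is that each sub-task can be implemented, configured, and tested individually.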
Schemas and Data Formats
See Bridge Upload Data Format for more details about schemas and data formats. This document describes how the server processes data.
Uploaded data should always include an info.json that looks like:
{
  "taskRun" : "331270A8-2280-4731-9247-0EA7FE2C35E9",
  "appName" : "Asthma Health",
  "files" : [
    {
      "contentType" : "application\/json",
      "filename" : "NonIdentifiableDemographicsTask.json",
      "timestamp" : "2015-06-12T09:55:21-07:00"
    }
  ],
  "item" : "NonIdentifiableDemographicsTask",
  "appVersion" : "version 1.0.7, build 26-YML",
  "phoneInfo" : "iPhone 6"
}
Upload validation (specifically the IosSchemaValidationHandler) looks at the value in the "item" field (in this case "NonIdentifiableDemographicsTask") and matches it with an upload schema. Upload validation currently assumes the latest schema revision. However, a future app update should send a "schemaRevision" field, so we can match up newer or older versions of tasks or surveys as appropriate.
The first thing IosSchemaValidationHandler does is determine whether this is survey data or non-survey data (because surveys and non-surveys have fundamentally different data formats). It does so by checking the schemaType field in the schema (which can be IOS_SURVEY or IOS_DATA).
IOS_DATA
IOS_DATA can contain one or more files, which can be JSON or otherwise. The best way to illustrate how this works is to use a contrived example, taken from the "mixed data" integration tests (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/test/org/sagebionetworks/bridge/upload/IosSchemaValidationHandler2Test.java#L343).
File: nonJsonFile.txt (could be a binary file, like an m4a audio file)

File: attachment.json
{ "attachment":"This is an attachment" }

File: inline.json
{ "string":"inline value" }

File: field.json
{
  "attachment":["mixed", "data", "attachment"],
  "string":"This is a string"
}
context.unzippedDataMap will contain only nonJsonFile.txt. context.jsonDataMap will contain attachment.json, inline.json, and field.json (and info.json, not pictured). Because the JSON data files can have the same field names (for a real world example, see the Parkinson's Walking Activity or the Cardio 6-Minute Walk Test), we need to prepend the file names to the field names. The IosSchemaValidationHandler will create a second map of all the JSON data, called the flattenedJsonDataMap, whose key is the concatenated filename.fieldname and whose value is the JSON value. In our example, this looks like:
(mixed syntax because the top-level map is a Java map, not a JSON node)
"attachment.json.attachment" => "This is an attachment"
"inline.json.string" => "inline value"
"field.json.attachment" => ["mixed", "data", "attachment"]
"field.json.string" => "This is a string"
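The flattening step can be sketched like this. This is a sketch only: plain Java maps stand in for the Jackson JsonNodes the real handler works on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds the flattenedJsonDataMap: keys are "filename.fieldname",
// values are the corresponding JSON field values.
public class FlattenSketch {
    static Map<String, Object> flatten(Map<String, Map<String, Object>> jsonDataMap) {
        Map<String, Object> flattened = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Object>> file : jsonDataMap.entrySet()) {
            for (Map.Entry<String, Object> field : file.getValue().entrySet()) {
                // e.g. "inline.json" + "." + "string" -> "inline.json.string"
                flattened.put(file.getKey() + "." + field.getKey(), field.getValue());
            }
        }
        return flattened;
    }
}
```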
The IosSchemaValidationHandler then walks the field definition list of the specified schema. First, it checks if the field name matches a filename, and if it does, it copies over the entire file into the field. If the field name doesn't match any file name, it then checks the concatenated field names of the flattenedJsonDataMap.
In our example, the schema looks like:
{
  "name":"Mixed Data",
  "schemaId":"mixed-data",
  "revision":1,
  "schemaType":"ios_data",
  "fieldDefinitions":[
    { "name":"nonJsonFile.txt", "type":"attachment_blob", "required":"true" },
    { "name":"attachment.json", "type":"attachment_json_blob", "required":"true" },
    { "name":"inline.json", "type":"inline_json_blob", "required":"true" },
    { "name":"field.json.attachment", "type":"attachment_json_blob", "required":"true" },
    { "name":"field.json.string", "type":"string", "required":"true" }
  ],
  "type":"UploadSchema"
}
Going over each field one-by-one:
- nonJsonFile.txt - The handler finds nonJsonFile.txt in context.unzippedDataMap. This is of type attachment_blob, so it copies the data in nonJsonFile.txt into context.attachmentsByFieldName. (Note that for non-JSON files, the field type should always be an attachment_* type. Currently, upload validation won't fail if it isn't, but this might lead to problems in the Bridge Exporter. TODO)
- attachment.json - Similar to the above, but the handler finds attachment.json in context.jsonDataMap instead. Since the field type is attachment_json_blob, it copies the data into context.attachmentsByFieldName.
- inline.json - Similar to the above, but since the field type is inline_json_blob, it copies the data into the health data record.
- field.json.attachment - This is not a filename, but it has an entry in the flattenedJsonDataMap, and the field type is an attachment_json_blob so the value "This is an attachment" is copied into context.attachmentsByFieldName.
- field.json.string - Similar to the above, but the field type is string, so it's copied into the health data record.
NOTE: Bridge schemas have a concept of required vs optional fields. However, YML apps do not, so all fields in upload schemas currently are set to required, and if a field is missing, the only consequence is that a message is written to the upload validation context.
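The filename-then-flattened-map matching walk described above can be sketched roughly as follows. This is a simplification: the real handler distinguishes unzippedDataMap from jsonDataMap and validates field types, which this sketch collapses into plain maps of Objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// For each schema field: try a whole-file match first, then the
// "filename.fieldname" flattened map. attachment_* fields are routed to
// attachmentsByFieldName; everything else goes inline into the record.
public class FieldMatchSketch {
    final Map<String, Object> record = new LinkedHashMap<>();
    final Map<String, Object> attachmentsByFieldName = new LinkedHashMap<>();

    void matchField(String fieldName, String fieldType,
                    Map<String, Object> fileDataMap,          // files by filename
                    Map<String, Object> flattenedJsonDataMap) {
        Object value = fileDataMap.containsKey(fieldName)
                ? fileDataMap.get(fieldName)
                : flattenedJsonDataMap.get(fieldName);
        if (value == null) {
            return; // missing field: real code just writes a validation message
        }
        if (fieldType.startsWith("attachment_")) {
            attachmentsByFieldName.put(fieldName, value);
        } else {
            record.put(fieldName, value);
        }
    }
}
```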
Surveys
Surveys have one file for each survey question/answer. To use a contrived example from the unit tests:
File: foo.json
{
  "questionType":0,
  "booleanAnswer":"foo answer",
  "startDate":"2015-04-02T03:26:57-07:00",
  "questionTypeName":"Text",
  "item":"foo",
  "endDate":"2015-04-02T03:26:59-07:00"
}

File: bar.json
{
  "questionType":0,
  "numericAnswer":42,
  "unit":"lb",
  "startDate":"2015-04-02T03:26:57-07:00",
  "questionTypeName":"Integer",
  "item":"bar",
  "endDate":"2015-04-02T03:26:59-07:00"
}

File: baz.json
{
  "questionType":0,
  "choiceAnswers":["survey", "blob"],
  "startDate":"2015-04-02T03:26:57-07:00",
  "questionTypeName":"MultipleChoice",
  "item":"baz",
  "endDate":"2015-04-02T03:26:59-07:00"
}
The fields that matter are item, questionTypeName, unit, and the answer field (whose name varies). The other fields are ignored and are only included for completeness. (Note that newer versions of the YML apps also include "saveable", "userInfo", and "answers" fields, none of which are used. The "answers" field would have been useful, except old apps don't send it, so we have to look for the variant answer fields anyway.)
Because the answer field can have different names depending on the question/answer type, we need a mapping that maps between questionTypeName and the answer field name (see https://github.com/Sage-Bionetworks/BridgePF/blob/517731c877ad185a7ac586dbdbf966d0e5c73180/app/org/sagebionetworks/bridge/upload/IosSchemaValidationHandler2.java#L68 for this mapping). It then uses the value in the item field as the field name and the value in the answer field as the field value and creates a new map. If the survey answer has a unit field, it also creates a field which is the item field with "_unit" appended to it. To follow our example, we'd end up with the following map:
(mixed syntax because the top-level map is a Java map, not a JSON node)
"foo" => "foo answer"
"bar" => 42
"bar_unit" => "lb"
"baz" => ["survey", "blob"]
The handler then goes through the same code path as IOS_DATA to match it with the schema, except since there are no more files or file names, it just copies the data straight across to the health data record, or to attachments, if any of the fields are attachments. (Strictly speaking, that's a simplification. The handler actually treats this map as the new jsonDataMap; that is, it treats each question-answer key-value pair as its own JSON file. In practice, though, survey answers are never JSON structs, so field names always match up directly with the "filenames", which are really survey item identifiers.)
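The answer-extraction step can be sketched as follows. The questionTypeName-to-answer-field mapping shown here is a small illustrative subset and the "textAnswer" name is an assumption; the real mapping lives in IosSchemaValidationHandler2 (linked above).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// questionTypeName picks the answer field name, "item" becomes the key,
// and a unit (if present) adds an "<item>_unit" entry.
public class SurveyAnswerSketch {
    static final Map<String, String> ANSWER_FIELD_BY_TYPE = Map.of(
            "Text", "textAnswer",          // assumed name for illustration
            "Integer", "numericAnswer",
            "MultipleChoice", "choiceAnswers");

    // answerFile stands in for one parsed survey answer JSON file.
    static Map<String, Object> extract(Map<String, Object> answerFile) {
        Map<String, Object> result = new LinkedHashMap<>();
        String item = (String) answerFile.get("item");
        String answerField = ANSWER_FIELD_BY_TYPE.get(
                (String) answerFile.get("questionTypeName"));
        if (answerField != null && answerFile.containsKey(answerField)) {
            result.put(item, answerFile.get(answerField));
        }
        if (answerFile.containsKey("unit")) {
            result.put(item + "_unit", answerFile.get("unit"));
        }
        return result;
    }
}
```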
See /wiki/spaces/BRIDGE/pages/75071519 for how this could be integrated with the survey API for future improvements.
Misc Hacks
TODO
Bridge Exporter
See Bridge Exporter Script for more details.
Components:
- EC2 host, which the Bridge Exporter lives on
- git workspace, into which code updates must be manually pulled and built
- Maven, which is both our build system and our launcher
- cron job, which runs the Bridge Exporter nightly at 9am UTC (2am PDT)
Future Improvements
- deployment (AWS CodeDeploy)
- Spring
- proper launcher
- configuration should be moved from JSON file to EC2 user data, to a cloud service, or some combination of the above
- IAM integration instead of ~/.aws/credentials (also, we need to audit these credentials and permissions)
- proper scheduling system - cron is neither redundant nor cloud-friendly
- proper logging solution - using a logger system instead of System.out.println, automatically copy logs to S3 or another persistent store
Updating the Bridge Exporter
- Make sure your update is in the github repository (https://github.com/Sage-Bionetworks/Bridge-Exporter).
- Today, the Bridge Exporter is in Dwayne's personal fork. We should migrate this to be in the Sage root repository.
- Log into the Bridge Exporter host (see Bridge Exporter Script). Pull your update from github.
- Run "mvn clean aspectj:compile". The clean is to make sure we clean out any outdated artifacts. The aspectj:compile compiles with AspectJ weaving, which is needed for jcabi-aspects (http://aspects.jcabi.com/).
Validating Bridge Exporter Runs
- Log into the Bridge Exporter host (see Bridge Exporter Script).
- Go to the ~/logs directory. The logs should be Bridge-Exporter-YYYY-MM-DD.log, where YYYY-MM-DD refers to yesterday's date.
- Near the end, the logs should say something like "Bridge Exporter done: 2015-06-19T02:43:24.332-07:00". This signifies that the Bridge Exporter ran to completion.
- grep the logs for errors. Errors from FileChunkUploadWorker are fine; any other errors are a problem. See the next section for how to run redrives.
- Upload the logs to S3 bucket org-sagebridge-exporter-logs. (You may need to download the file to your local machine before uploading it to S3.)
Redriving the Bridge Exporter
There are three kinds of redrives: You can re-upload TSV files, you can redrive tables, or you can redrive individual records.
If the Bridge Exporter logs say there's an error uploading a TSV to a table, or an error importing a TSV file handle to a table, you'll need to re-upload the TSV file. You'll be able to tell when this happens because the Bridge Exporter logs will contain lines that look like
[re-upload-tsv] file /tmp/cardiovascular-satisfied-v1.6348225244877621202.tsv syn3469956

OR

[re-upload-tsv] filehandle 2439961 syn3474928
To re-upload a TSV:
- grep out all of the "[re-upload-tsv]" lines into a separate file. This will serve as the input file for our re-upload.
- Ensure all the TSV files referenced in the re-upload-tsv entries exist and are valid.
- Run "mvn exec:java -f $BRIDGE_EXPORTER -Dexec.mainClass=org.sagebionetworks.bridge.scripts.BridgeExporterTsvReuploader -Dexec.args=(re-upload-tsv file) > ~/logs/(meaningful log name).log 2>&1 &". Note that we're redirecting stderr into stdout (which also goes to the log file). Also note the & at the end, which kicks off the re-upload as a background task.
If you need to redrive individual tables (which might happen because there are many failures from a single table or the table wasn't created properly and needs to be deleted and recreated):
Create a file that looks like

[
  { "studyId":"asthma", "schemaId":"AsthmaWeeklyPrompt", "rev":1 }
]

filling in the studyId, schemaId, and rev as appropriate. (The list can contain more than one entry if you need to redrive multiple tables.)
- Upload this file to the Bridge Exporter host that you're running the redrive from (most likely bridge-exporter-redrive). Conventionally, this file will live in the ~/redrive directory.
- Run "mvn exec:java -f $BRIDGE_EXPORTER -Dexec.mainClass=org.sagebionetworks.bridge.exporter.BridgeExporter -Dexec.args="redrive-tables (table list file path) YYYY-MM-DD" > ~/logs/(meaningful log name).log 2>&1 &". YYYY-MM-DD is the upload date that you want to redrive (same as the Bridge Exporter's date, generally yesterday's date). Note that we're redirecting stderr into stdout (which also goes to the log file). Also note the & at the end, which kicks off the redrive as a background task.
If you need to redrive individual records (which generally happens if there's a small number of errors):
- Gather up all of the failed record IDs and store them in a file, one record ID per line. Note that you can pull record IDs from multiple tables and multiple upload dates.
- Upload this file to the Bridge Exporter host that you're running the redrive from (most likely bridge-exporter-redrive). Conventionally, this file will live in the ~/redrive directory.
- Run "mvn exec:java -f $BRIDGE_EXPORTER -Dexec.mainClass=org.sagebionetworks.bridge.exporter.BridgeExporter -Dexec.args="redrive-records (record list file path)" > ~/logs/(meaningful log name).log 2>&1 &". Note that redrive-records doesn't need an upload date. Note that we're redirecting stderr into stdout (which also goes to the log file). Also note the & at the end, which kicks off the redrive as a background task.
IMPORTANT NOTE: In some cases, closing your terminal will terminate all tasks spawned by your SSH session. For safety, it's best to log out of the Bridge Exporter host immediately after kicking off the re-upload/redrive and then log back in.
Implementation Details
On startup, the Bridge Exporter (https://github.com/DwayneJengSage/Bridge-Exporter-1/blob/develop/src/main/java/org/sagebionetworks/bridge/exporter/BridgeExporter.java) does the following initialization (TODO: Spring-ify this, where appropriate):
- For each schema, if the corresponding Synapse table hasn't been created, the Bridge Exporter
- creates that table
- sets ACLs on that table, giving the Bridge Exporter account ACCESS_TYPE_ALL and the data access team (configured and managed by the research team) ACCESS_TYPE_READ. The idea is that the Bridge Exporter needs to create and write to these tables, but once exported, the data should be immutable.
- adds the mapping to DDB table prod-exporter-SynapseTables (which is how we know whether the table has been created or not)
- For each study, the Bridge Exporter creates an appVersion table, if such a table doesn't already exist.
- This table is used to get a broad overview of all upload records in a study. It can also be used to get a table of uploads per day, uploads per version, unique healthCodes, or a combination of the above (depending on your SQL skills).
- The study to appVersion table mapping is found in DDB table prod-exporter-SynapseMetaTables.
- For each schema and each appVersion table, the Bridge Exporter also creates a TSV file on disk with the headers corresponding to the schema's field names.
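The table-initialization step above can be sketched like this. This is a sketch only: a plain map stands in for the prod-exporter-SynapseTables DDB mapping, and Synapse table creation (with its ACLs) is stubbed out behind an interface.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Create-if-absent logic: look up the schema's Synapse table in the
// mapping; if it's missing, create the table and record the mapping.
public class TableInitSketch {
    // schemaKey (e.g. "asthma-AsthmaWeeklyPrompt-v1") -> Synapse table ID
    final Map<String, String> synapseTablesDdb = new ConcurrentHashMap<>();

    interface SynapseStub {
        String createTableWithAcls(String schemaKey); // returns the new table ID
    }

    String getOrCreateTable(String schemaKey, SynapseStub synapse) {
        // If we've already recorded a table for this schema, reuse it;
        // otherwise create it and add the mapping.
        return synapseTablesDdb.computeIfAbsent(schemaKey, synapse::createTableWithAcls);
    }
}
```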
The Bridge Exporter then does the following:
- The Bridge Exporter queries DynamoDB in a tight loop for all records with the given uploadDate (generally, yesterday's date). The idea is that the nightly export run, which happens in the early morning, pulls all records uploaded the day before to generate a daily incremental export. This part of the code is very quick (limited by the DDB throttling rate), so we can do it single-threaded.
- The Bridge Exporter filters records based on international filtering logic (not currently turned on) and filters out "no_sharing" records.
- The Bridge Exporter then creates an ExportTask with the given study ID, schema ID, and schema revision (collectively called a schemaKey) as well as the DDB record and passes it off to the ExportWorkerManager.
- (deprecated) There used to be a step where if the schema ID was "ios_survey", the ExportWorkerManager would convert the raw survey response into an exportable health data record. However, this now happens during upload validation (see previous section), so this step is no longer necessary.
- The ExportWorkerManager (https://github.com/DwayneJengSage/Bridge-Exporter-1/blob/develop/src/main/java/org/sagebionetworks/bridge/worker/ExportWorkerManager.java) contains
- a pool of worker threads
- a queue of outstanding tasks
- an export handler for each table, as well as an export handler for the "appVersion" tables.
- The ExportWorkerManager determines which schema, and therefore which export handler and appVersion handler, to use, and creates the outstanding tasks for the worker threads to process asynchronously.
- Each export handler (regardless of whether it's a health data export handler or an app version handler) will process that record, turn it into the corresponding table row, and write that row into the TSV file.
- Once the end of the DDB query is reached, each export handler is signaled with a call to endOfStream(). If the export handler processed at least one record (that is, if the TSV is not empty), it uploads the TSV to Synapse as a file handle, then imports the file handle into the table.
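The per-handler TSV lifecycle can be sketched as follows. This is illustrative only: real handlers build rows from health data records, write the TSV to disk, and upload through the Synapse client, all of which is stubbed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Buffer rows as TSV lines; at endOfStream(), "upload" only if at least
// one row was written (an empty TSV is never uploaded).
public class ExportHandlerSketch {
    final List<String> tsvLines = new ArrayList<>();
    boolean uploaded = false;

    void handle(Map<String, String> record, List<String> columns) {
        StringBuilder row = new StringBuilder();
        for (String col : columns) {
            if (row.length() > 0) {
                row.append('\t');
            }
            row.append(record.getOrDefault(col, ""));
        }
        tsvLines.add(row.toString());
    }

    void endOfStream() {
        if (!tsvLines.isEmpty()) {
            uploaded = true; // real code uploads the TSV as a file handle, then imports it
        }
    }
}
```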
For more details on what the exported Synapse tables look like, see Synapse Export Documentation.
appVersion table
Every record in each study also generates a record in the appVersion table. This is so that the appVersion table can be used to query number of users and records per day and per app version. For more info, see Synapse Export Documentation.
Why is the Upload Date different?
The Upload Date in the Health Data Records table in DDB tracks when the data was uploaded to Bridge. The Upload Date in the Synapse table tracks when the data was uploaded to Synapse. Since the Bridge Exporter runs every day at 2am and pulls the previous day's data, these dates are (in normal circumstances) always different.
The two Upload Dates exist for similar but distinct use cases. The Upload Date in DDB lets the Bridge Exporter keep track of which data set it needs to export to Synapse. The Upload Date in Synapse lets researchers keep track of which data they have processed and determine "today's data drop".
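In code terms, the normal relationship between the two dates is simply a one-day offset:

```java
import java.time.LocalDate;

// The nightly export run on day N exports records whose DDB uploadDate is
// day N-1, and stamps the resulting Synapse rows with day N.
public class UploadDateSketch {
    static LocalDate synapseUploadDate(LocalDate ddbUploadDate) {
        // The exporter runs early the next morning, so in normal
        // circumstances the Synapse date is one day after the DDB date.
        return ddbUploadDate.plusDays(1);
    }
}
```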
Bridge Exporter Synapse Accounts
There are two Bridge Exporter Synapse accounts, called BridgeExporter and BridgeExporterStaging (used for prod and staging, respectively). These accounts are primarily used by the exporter to create tables and write health data to them. They are also used for ad-hoc investigations and administration. (TODO: Should we have separate accounts for this?) The credentials for these accounts can be found in belltown in the usual location.
How to Set Up New Schemas
Long-term, the researcher UI should take care of this. Short-term, follow this process:
- Confirm with app developers the field names and types of the data coming in. Examples will be crucial.
- Create the schema in staging. The easiest way to do this is to copy-paste the example from the previous section and update the name, ID, and fields. (Don't forget to remove the revision field, as this is automatically incremented by the server.) Use the REST APIs at Bridge REST API to create the schema on the server.
- Ensure the app developers send data to staging. Wait a day for the Bridge Exporter to run (or force a manual run of the Exporter).
- Confirm with the research team that the data in Synapse is the data they want. Fix issues and repeat if necessary.
- Once everything looks good, create the schema in prod.
I strongly recommend keeping the app developers and the research team in the same loop, rather than trying to play middleman between the two.
How to Debug Data Issues
There is currently a lot of noise in the upload stream: 800 upload validation failures per day in prod. We need to regularly scrub the logs to ensure there are no new data issues and to verify that old ones are being fixed. I generally do this twice a week. (You'll also need to do these steps if there's a specific issue you're looking for.)
- When upload validation fails, it'll generate a log message that starts with "Exception thrown from upload validation handler". Using Logentries (or equivalent log tool), search for this string.
You can filter out log entries corresponding to known issues (see App Data Issues). For example, you can use the log search
"Exception thrown from upload validation handler" NOT "Log Food" NOT sevenDayFitnessAllocation NOT Journal NOT DuplicateZipEntryException
- A few issues aren't listed in known issues:
- "Upload schema not found for study parkinson, schema ID Medication Tracker" - This actually represents two issues. The first is that the original medication tracker is broken, and a new format is slated to replace it. The second is that this specific schema "Medication Tracker" always uploads blank data and should be discarded.
- "info.json is missing "item" field" - This also represents two issues. The first is that for v1.0 of the asthma app, the Air Quality Report didn't include the item field, which makes the data stream unparseable. The second is that some uploads of the old medication tracker didn't include the item field and are similarly unparseable. This may represent a legitimate issue, but given the amount of noise from the first two issues, this is difficult if not impossible to determine.
Sometimes, you might find a real issue, and you need to get at the raw upload in order to investigate it. When that happens, you can use the BulkDownloadUtil (https://github.com/Sage-Bionetworks/BridgePF/blob/develop/app/org/sagebionetworks/bridge/util/BulkDownloadUtil.java), which lives in the BridgePF project. To use this, follow these steps:
WARNING: This involves changing your local bridge.conf to point to prod. This is absolutely not safe, and we'll need to invest in a better tool that doesn't require you to do this.
Open up your ~/.sbt/bridge.conf and modify/add the following lines:
local.upload.bucket = org-sagebridge-upload-prod
local.upload.cms.cert.bucket = org-sagebridge-upload-cms-cert-prod
local.upload.cms.priv.bucket = org-sagebridge-upload-cms-priv-prod
(You may want to comment out the original lines, so you can restore them quickly. If this is in staging, replace "prod" with "uat".)
- If this is in staging, in the BulkDownloadUtil, where it sets up the DDB tables, change "prod-heroku-*" to "uat-heroku-*".
In your git root, run
activator "run-main org.sagebionetworks.bridge.util.BulkDownloadUtil (one or more upload IDs, separated by spaces)"
- The upload files will be downloaded, decrypted, and unzipped into the tmp directory in your git root.
- IMPORTANT: Once you're done, change your bridge.conf back.
Long-term, when the data stream is cleaned up, we'll want to change this info message to an error. Additionally, we want to start sending automated notifications to app developers for errors in the upload stream. For new studies, where the upload stream will presumably start clean, we want to start sending them automated notifications right away, so they'll keep the upload stream clean.