...
UploadSchema - Upload schemas. See above for more details.
Misc Hacks
TODO
Bridge Exporter
See Bridge Exporter Script for more details.
Components:
- EC2 host, which the Bridge Exporter lives on
- git workspace, to which code updates must be manually pulled and built
- maven, which is both our build system and our launcher
- cron job, which runs the Bridge Exporter nightly at 9am UTC (2am PDT)
Future Improvements
- deployment (AWS CodeDeploy)
- Spring
- proper launcher
- configuration should be moved from the JSON file to EC2 user data, a cloud service, or some combination of the above
- IAM integration instead of ~/.aws/credentials (also, we need to audit these credentials and permissions)
- proper scheduling system - cron is neither redundant nor cloud-friendly
- proper logging solution - use a logging framework instead of System.out.println, and automatically copy logs to S3 or another persistent store
Updating the Bridge Exporter
- Make sure your update is in the github repository (https://github.com/Sage-Bionetworks/Bridge-Exporter).
- Today, the Bridge Exporter is in Dwayne's personal fork. We should migrate this to be in the Sage root repository.
- Log into the Bridge Exporter host (see Bridge Exporter Script). Pull your update from github.
- Run "mvn clean aspectj:compile". The clean is to make sure we clean out any outdated artifacts. The aspectj:compile is so we can compile with AspectJ weaving, so we can use jcabi-aspects (http://aspects.jcabi.com/).
Validating Bridge Exporter Runs
- Log into the Bridge Exporter host (see Bridge Exporter Script).
- Go to the ~/logs directory. The logs should be Bridge-Exporter-YYYY-MM-DD.log, where YYYY-MM-DD refers to yesterday's date.
- Near the end of the log, you should see a line like "Bridge Exporter done: 2015-06-19T02:43:24.332-07:00". This signifies that the Bridge Exporter ran to completion.
- grep the logs for errors. Errors from FileChunkUploadWorker are benign; any other errors are a problem. See the next section for how to run redrives.
- Upload the logs to the S3 bucket org-sagebridge-exporter-logs. (You may need to download the file to your local machine before uploading it to S3; see the example commands below.)
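A sketch of the validation commands, assuming the logs live in ~/logs and using an example date. The last step assumes the AWS CLI is installed and configured on the host; if it isn't, download the log to your local machine and upload it through the S3 console instead.
Code Block
# Confirm the run completed (look for the "Bridge Exporter done" line)
tail -n 5 ~/logs/Bridge-Exporter-2015-06-19.log
# Look for errors; FileChunkUploadWorker errors are benign, anything else needs a redrive
grep -i error ~/logs/Bridge-Exporter-2015-06-19.log | grep -v FileChunkUploadWorker
# Copy the log to the log bucket (assumes the AWS CLI is available)
aws s3 cp ~/logs/Bridge-Exporter-2015-06-19.log s3://org-sagebridge-exporter-logs/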
Redriving the Bridge Exporter
There are three kinds of redrives: You can re-upload TSV files, you can redrive tables, or you can redrive individual records.
If the Bridge Exporter logs say there's an error uploading a TSV to a table, or an error importing a TSV file handle to a table, you'll need to re-upload the TSV file. You'll be able to tell when this happens because the Bridge Exporter logs will contain lines that look like
Code Block
[re-upload-tsv] file /tmp/cardiovascular-satisfied-v1.6348225244877621202.tsv syn3469956
OR
[re-upload-tsv] filehandle 2439961 syn3474928
To re-upload a TSV:
- grep out all of the "[re-upload-tsv]" lines into a separate file. This will serve as the input file for our re-upload.
- Ensure all the TSV files referenced in the re-upload-tsv entries exist and are valid.
- Run "mvn exec:java -f $BRIDGE_EXPORTER -Dexec.mainClass=org.sagebionetworks.bridge.scripts.BridgeExporterTsvReuploader -Dexec.args=(re-upload-tsv file) > ~/logs/(meaningful log name).log 2>&1 &". Note that we're piping stderr into stdout (which also goes to a file). Also note the & at the end, which kicks off the re-upload as a background task.
If you need to redrive individual tables (which might happen because there are many failures from a single table or the table wasn't created properly and needs to be deleted and recreated):
- Create a file that looks like the following, filling in studyId, schemaId, and rev as appropriate. (The list can contain more than one entry if you need to redrive multiple tables.)
Code Block
[ { "studyId":"asthma", "schemaId":"AsthmaWeeklyPrompt", "rev":1 } ]
- Upload this file to the Bridge Exporter host that you're running the redrive from (most likely bridge-exporter-redrive). Conventionally, this file will live in the ~/redrive directory.
- Run "mvn exec:java -f $BRIDGE_EXPORTER -Dexec.mainClass=org.sagebionetworks.bridge.exporter.BridgeExporter -Dexec.args="redrive-tables (table list file path)YYYY-MM-DD" > ~/logs/(meaningful log name).log 2>&1 &". YYYY-MM-DD is the upload date that you want to redrive (same as Bridge-Exporter's date, generally yesterday's date). Note that we're piping stderr into stdout (which also goes to a file). Also note the & at the end, which kicks off the re-upload as a background task.
If you need to redrive individual records (which generally happens if there's a small number of errors):
- Gather up all of the failed record IDs and store them in a file, one record ID per line. Note that you can pull record IDs from multiple tables and multiple upload dates.
- Upload this file to the Bridge Exporter host that you're running the redrive from (most likely bridge-exporter-redrive). Conventionally, this file will live in the ~/redrive directory.
- Run "mvn exec:java -f $BRIDGE_EXPORTER -Dexec.mainClass=org.sagebionetworks.bridge.exporter.BridgeExporter -Dexec.args="redrive-records (record list file path)" > ~/logs/(meaningful log name).log 2>&1 &". Note that redrive-records doesn't need an upload date. Note that we're piping stderr into stdout (which also goes to a file). Also note the & at the end, which kicks off the re-upload as a background task.
IMPORTANT NOTE: In some cases, terminating your terminal will terminate all tasks spawned by your SSH session. For safety, it's best to log out of the Bridge Exporter host immediately after kicking off the re-upload/redrive and log back in.
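An optional alternative (not part of the original runbook, so treat it as a suggestion): launching the job under nohup detaches it from the terminal, so it keeps running even if the SSH session drops.
Code Block
# nohup detaches the job from the terminal so a dropped SSH session won't kill it
nohup mvn exec:java -f $BRIDGE_EXPORTER \
    -Dexec.mainClass=org.sagebionetworks.bridge.exporter.BridgeExporter \
    -Dexec.args="redrive-records $HOME/redrive/redrive-records.txt" \
    > ~/logs/redrive-records.log 2>&1 &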
Implementation Details
On startup, the Bridge Exporter (https://github.com/DwayneJengSage/Bridge-Exporter-1/blob/develop/src/main/java/org/sagebionetworks/bridge/exporter/BridgeExporter.java) does the following initialization (TODO: Spring-ify this, where appropriate):
- For each schema, if the corresponding Synapse table hasn't been created, the Bridge Exporter
- creates that table
- sets ACLs on that table, giving the Bridge Exporter account ACCESS_TYPE_ALL and the data access team (configured and managed by the research team) ACCESS_TYPE_READ. The idea is that the Bridge Exporter needs to create and write to these tables, but once exported, the data should be immutable.
- adds the mapping to DDB table prod-exporter-SynapseTables (which is how we know whether the table has been created or not). IMPORTANT: If you're deleting a table from Synapse, be sure to update this mapping, or the Bridge Exporter will get confused. (See the example AWS CLI commands after this list.)
- For each study, the Bridge Exporter creates an appVersion table, if such a table doesn't already exist.
- This table is used to get a broad overview of all upload records in a study. It can also be used to get a table of uploads per day, uploads per version, unique healthCodes, or a combination of the above (depending on your SQL skills).
- The study to appVersion table mapping is found in DDB table prod-exporter-SynapseMetaTables.
- For each schema and each appVersion table, the Bridge Exporter also creates a TSV file on disk with the headers corresponding to the schema's field names.
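A hedged sketch of inspecting and cleaning up the prod-exporter-SynapseTables mapping after deleting a Synapse table. The key attribute name ("schemaKey") and its value format are assumptions for illustration only; check the table's actual key schema in the DynamoDB console before running anything like this.
Code Block
# Look up the mapping for a schema (key attribute name and format are assumptions)
aws dynamodb get-item \
    --table-name prod-exporter-SynapseTables \
    --key '{"schemaKey": {"S": "asthma-AsthmaWeeklyPrompt-v1"}}'
# If the Synapse table was deleted, remove the stale mapping so the exporter recreates the table
aws dynamodb delete-item \
    --table-name prod-exporter-SynapseTables \
    --key '{"schemaKey": {"S": "asthma-AsthmaWeeklyPrompt-v1"}}'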
The Bridge Exporter then does the following:
- The Bridge Exporter queries DynamoDB in a tight loop for all records with the given uploadDate (generally yesterday's date). The idea is that the nightly export run, which happens in the early morning, pulls all records uploaded the day before to generate a daily incremental upload. This part of the code is very quick (limited by the DDB throttling rate), so we can do it single-threaded.
- The Bridge Exporter filters records based on international filtering logic (not currently turned on) and filters out "no_sharing" records.
- The Bridge Exporter then creates an ExportTask with the given study ID, schema ID, and schema revision (collectively called a schemaKey) as well as the DDB record and passes it off to the ExportWorkerManager.
- (deprecated) There used to be a step where if the schema ID was "ios_survey", the ExportWorkerManager would convert the raw survey response into an exportable health data record. However, this now happens during upload validation (see previous section), so this step is no longer necessary.
- The ExportWorkerManager (https://github.com/DwayneJengSage/Bridge-Exporter-1/blob/develop/src/main/java/org/sagebionetworks/bridge/worker/ExportWorkerManager.java) contains
- a pool of worker threads
- a queue of outstanding tasks
- an export handler for each table, as well as an export handler for the "appVersion" tables.
- The ExportWorkerManager determines the record's schema, and therefore which export handler and appVersion handler to use, then creates the outstanding tasks for the worker threads to process asynchronously.
- Each export handler (regardless of whether it's a health data export handler or an app version handler) will process that record, turn it into the corresponding table row, and write that row into the TSV file.
- Once the end of the DDB query is reached, each export handler is signaled with a call to endOfStream(). If the export handler processed at least one record (that is, if the TSV is not empty), it will upload the TSV to Synapse as a file handle, then import the file handle into the table.
For more details on what the exported Synapse tables look like, see Synapse Export Documentation.
appVersion table
Every record in each study also generates a record in the appVersion table. This is so that the appVersion table can be used to query number of users and records per day and per app version. For more info, see Synapse Export Documentation.
Why is the Upload Date different?
The Upload Date in the Health Data Records table in DDB tracks when the data was uploaded to Bridge. The Upload Date in the Synapse table tracks when the data was uploaded to Synapse. Since the Bridge Exporter runs every day at 2am and pulls the previous day's data, these dates are (in normal circumstances) always different.
The two Upload Dates serve similar but distinct use cases. The Upload Date in DDB lets the Bridge Exporter keep track of which data set it needs to export to Synapse. The Upload Date in Synapse lets researchers keep track of which data they have processed and determine "today's data drop".
Bridge Exporter Synapse Accounts
There are two Bridge Exporter Synapse accounts, called BridgeExporter and BridgeExporterStaging (used for prod and staging, respectively). These accounts are primarily used by the exporter to create tables and write health data to those tables. They are also used for ad-hoc investigations and administration. (TODO: Should we have separate accounts for this?) The credentials for these accounts can be found in belltown in the usual location.
How to Set Up New Schemas
...
Long-term, when the data stream is cleaned up, we'll want to change this info message to an error. Additionally, we want to start sending automated notifications to app developers for errors in the upload stream. For new studies, where the upload stream will presumably start clean, we want to start sending them automated notifications right away, so they'll keep the upload stream clean.