...

(util) BridgeExporterUtil - Static helper functions for Bridge-EX. Contains methods to build a schema key from a health data record DDB item and methods to extract values from DDB items and JSON objects and sanitize them (including stripping HTML, stripping newlines and tabs, and truncating strings to fit maximum length restrictions).
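The sanitizing steps described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual BridgeExporterUtil code: the function name `sanitize_string` is hypothetical, the regex-based tag removal is a simplification, and replacing newlines and tabs with a single space (rather than deleting them outright) is an assumption.

```python
import re


def sanitize_string(value, max_length):
    """Hypothetical helper: strip HTML, strip newlines/tabs, truncate."""
    if value is None:
        return None
    # Crude tag removal for illustration; a real implementation would
    # likely use a proper HTML parser.
    without_html = re.sub(r"<[^>]*>", "", value)
    # Assumption: runs of newlines and tabs collapse into one space.
    without_breaks = re.sub(r"[\r\n\t]+", " ", without_html)
    # Truncate to fit the maximum length restriction.
    return without_breaks[:max_length]
```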

DynamoDB Tables

...

Exporter-Scheduler-Config - Contains configuration for Bridge-EX-Scheduler to call Bridge-EX. Lambda is unable to pass any parameters into Bridge-EX-Scheduler other than the function name, so we key off the function name and use this table to look up the Scheduler config.

  • schedulerName (hash key) - Matches the Lambda function name. Used to distinguish between devo, staging, and prod.
  • sqsQueueUrl - SQS queue to write requests to
  • timeZone - Currently configured to America/Los_Angeles (equivalent to Pacific Time) for all envs. In the future, if we need to launch Bridge-EX stacks in other regions, this may have other values.
  • requestOverrideJson - Optional. Request template that the Bridge-EX-Scheduler uses and fills in "date" with yesterday's date. Generally used for specialized stacks with special parameters or for testing. Example:

{
  "studyWhitelist":["api", "breastcancer", "parkinson"],
  "sharingMode":"PUBLIC_ONLY"
}
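As a rough illustration of how these config fields fit together, the sketch below builds a request from timeZone and requestOverrideJson. It is a Python sketch under stated assumptions, not the Scheduler's actual Java code: the function name `build_request` is hypothetical, and the ISO `YYYY-MM-DD` date format is assumed from the dates that appear in the Bridge-EX logs.

```python
import json
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def build_request(time_zone, request_override_json=None, now=None):
    """Hypothetical sketch: fill "date" with yesterday in the configured zone."""
    tz = ZoneInfo(time_zone)
    # "now" is injectable so the behaviour can be checked deterministically.
    local_now = now.astimezone(tz) if now is not None else datetime.now(tz)
    # Start from the override template if one is configured, else from empty.
    request = json.loads(request_override_json) if request_override_json else {}
    yesterday = local_now.date() - timedelta(days=1)
    request["date"] = yesterday.isoformat()  # assumed format: YYYY-MM-DD
    return request


# 2016-03-17 05:00 UTC is still the evening of 2016-03-16 in Pacific Time,
# so "yesterday" is computed from the local date, not the UTC date.
example_now = datetime(2016, 3, 17, 5, 0, tzinfo=timezone.utc)
example_request = build_request(
    "America/Los_Angeles", '{"sharingMode": "PUBLIC_ONLY"}', now=example_now
)
print(json.dumps(example_request))
```

Computing the date in the configured time zone matters because a scheduler running shortly after midnight UTC would otherwise pick the wrong day.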

(ddbPrefix)SynapseMetaTables - Bridge-EX automatically writes to this table to keep track of meta tables (specifically appVersion tables and status tables). The key is the table name, generally of the form "parkinson-appVersion" or "parkinson-status", and it maps to the Synapse table ID. Bridge-EX uses this table to remember if it's already created a table, and if so, where to find that table.

(ddbPrefix)SynapseTables - Similar to SynapseMetaTables, Bridge-EX automatically writes to this table to keep track of tables, in this case, health data tables. The key is the schema name, flattened into the form "parkinson-TappingActivity-v6", which also maps to Synapse table IDs.
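The lookup pattern behind both tables can be sketched as below. This is a minimal Python illustration, not the Bridge-EX implementation: a plain dict stands in for the DynamoDB table, and the names `flatten_schema_key`, `get_or_create_table_id`, and the `create_table` callback are all hypothetical.

```python
def flatten_schema_key(study_id, schema_id, revision):
    """Flatten a schema key into the form "parkinson-TappingActivity-v6"."""
    return f"{study_id}-{schema_id}-v{revision}"


def get_or_create_table_id(table_map, key, create_table):
    """Return the remembered Synapse table ID, creating the table only once.

    table_map stands in for the DynamoDB table (key -> Synapse table ID);
    create_table is a hypothetical callback that creates the Synapse table
    and returns its ID.
    """
    table_id = table_map.get(key)
    if table_id is None:
        table_id = create_table(key)
        table_map[key] = table_id  # remember it for subsequent runs
    return table_id
```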

Operations

Deployment

Bridge-EX

  1. Bridge-EX changes are committed to our GitHub repository (generally via pull requests): https://github.com/Sage-Bionetworks/Bridge-Exporter
  2. Travis (https://travis-ci.org/Sage-Bionetworks/Bridge-Exporter) automatically builds the latest commit and deploys it to AWS Elastic Beanstalk according to the Travis configuration (https://github.com/Sage-Bionetworks/Bridge-Exporter/blob/develop/.travis.yml)
  3. AWS Elastic Beanstalk automatically deploys the Bridge-EX code to the AWS-managed EC2 cluster (currently configured to be a "cluster" of one machine), and then automatically starts the service.
  4. To test, go to the SQS console and generate a sample request into the appropriate SQS queue.
  5. To deploy to staging (or prod), merge the code in GitHub from the develop branch to the uat branch (or from the uat branch to the prod branch). Using a local repository cloned from the root fork, run the following commands:
    1. git checkout develop
    2. git pull
    3. git checkout uat
    4. git merge --ff-only develop
    5. git push

Bridge-EX-Scheduler

  1. Make the Bridge-EX-Scheduler changes in your local repository and commit them to the root repository on GitHub (generally via pull request): https://github.com/Sage-Bionetworks/Bridge-EX-Scheduler
  2. In your local repository, run "mvn verify", then upload target/Bridge-EX-Scheduler-2.0.jar to AWS Lambda using the AWS Lambda console.
    1. Unfortunately, Travis doesn't support automated deployments of Java to AWS Lambda, so we have to do it manually.
  3. To test, click the "Test" button in AWS Lambda.

Troubleshooting

Logs

Logs can be found at https://logentries.com/. Credentials to the root Logentries account can be found at belltown:/work/platform/PasswordsAndCredentials/passwords.txt. Alternatively, get someone with account admin access to add your user account to Logentries.

If, for some reason, the logs aren't showing up in Logentries, file a support ticket with Logentries. The alternative is to go to the AWS Elastic Beanstalk console, go to the environment you need logs for, go to Logs, and click on Request Logs. This will allow you to access the logs in your browser (if you choose Last 100 Lines) or download the logs to disk (if you choose Full Logs). The log file you're looking for is catalina.out.

If this doesn't work, you can try SSHing directly into the host.

  1. To find the hostname, find the host tagged with the appropriate name (for example, Bridge-EX-Prod), select it, and note the Public DNS in the description (for example, ec2-52-91-223-70.compute-1.amazonaws.com).
  2. Download the security PEM from belltown:/work/platform/PasswordsAndCredentials/Bridge-EX-Prod.pem (or equivalent for another env).
  3. (This is optional, but makes things easier.) Set up your ~/.ssh/config with the following (replacing HostName and IdentityFile as needed). The host can be anything you want. User must be ec2-user.
    host BridgeEX2-Prod
         HostName ec2-52-91-223-70.compute-1.amazonaws.com
         User ec2-user
         IdentityFile ~/Bridge-EX-Prod.pem
  4. SSH into the host. You may need to be in the Fred Hutch intranet or log into the Fred Hutch VPN.
  5. Logs can be found at /var/log/tomcat8/catalina.out

Metrics to Look For

When scrubbing the Bridge-EX logs, the key metrics to look for are:

  • number of ERRORs - Lots of errors generally means something is wrong at the systemic level. A single error generally means a record failed to upload to Synapse and is worth redriving or repairing. See Redrives for more info.
    • The only ERROR worth ignoring is "Unable to parse sharing options for hash[healthCode]=-691460808, sharing scope value=null". However, if there are a lot of these, this generally indicates a systemic error.
    • On the flip side, exceptions and warnings generally aren't a problem. They generally are things like "#createFileHandleWithRetry(): attempt #1 of 5 failed", which indicates a Synapse call failed and was retried. That said, be sure to look at exceptions and warnings in case there are other problems.
  • A log line that looks like "Finished processing request in 835 seconds, date=2016-03-16, tag=[scheduler=Bridge-EX-Scheduler-prod;date=2016-03-16]". This indicates that Bridge-EX completed successfully and how long it took. If this line is missing, it indicates that Bridge-EX never completed. If the request time is significantly higher than usual, this indicates a systemic problem.
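The manual checks above could be approximated with a small script like the following. This is a hedged Python sketch rather than an existing tool: the function name `summarize_log` is hypothetical, and it assumes that each log line carries its level as the literal token "ERROR" and that the completion line matches the example format quoted above.

```python
import re

# Matches the completion line and captures the duration and request date.
FINISHED_PATTERN = re.compile(
    r"Finished processing request in (\d+) seconds, date=(\d{4}-\d{2}-\d{2})"
)
# The one ERROR message that is normally safe to ignore in small numbers.
IGNORABLE_ERROR = "Unable to parse sharing options"


def summarize_log(lines):
    """Count ERROR lines and collect (date, seconds) for completed requests."""
    error_count = 0
    ignorable_count = 0
    finished = []
    for line in lines:
        if "ERROR" in line:
            if IGNORABLE_ERROR in line:
                ignorable_count += 1
            else:
                error_count += 1
        match = FINISHED_PATTERN.search(line)
        if match:
            finished.append((match.group(2), int(match.group(1))))
    return {
        "errors": error_count,
        "ignorable_errors": ignorable_count,
        "finished": finished,
    }
```

An empty "finished" list for a given day would correspond to the case where Bridge-EX never completed.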

Below are other issues that are worth looking at, but are too cumbersome to look at manually. Rather, these are things we need to build an automated monitoring and alarm system for:

  • accepted[ALL_QUALIFIED_RESEARCHERS], accepted[SPONSORS_AND_PARTNERS], excluded[NO_SHARING] - If there's a big shift in these numbers, it may indicate a bug in the Sharing Settings in Bridge, or possibly a major change in the app.
  • parkinson-appVersion.lineCount (and similar for other studies) - These indicate the total number of entries exported to Synapse for a particular study. If this number shifts (up or down) by a lot, it may indicate a problem in the app or in Bridge.
  • parkinson-TappingActivity-v6.lineCount (and similar for other studies and schemas) - Similarly, if any particular table sees large shifts, that could be a problem.
  • *.errorCount - If this appears at all, that means there's an error. This generally doesn't suggest a systemic issue (unless the error count is high, in which case our logscan alarms would go off), but rather indicates that we need to redrive some records.
  • numTotal - The total number of records Bridge-EX saw today across all studies and schemas, including records that were excluded or filtered out. Similarly, if this number shifts by a lot, it could be a problem.
  • uniqueHealthCodes[parkinson] (and similar for other studies) - This represents the number of active users. If this drops suddenly, it indicates data loss somewhere in the Bridge pipeline. If the number rises suddenly, it may not be an issue, but it's worth understanding the cause behind it.
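One possible shape for the automated check is to compare each day's metrics against the previous day's and flag large relative changes. The Python sketch below is an assumption-laden illustration, not an existing component: the function name `find_large_shifts` is hypothetical, the metrics are assumed to be already parsed into dicts, and the 50% threshold is an arbitrary placeholder, since this page does not define how big a "big shift" is.

```python
def find_large_shifts(previous, current, threshold=0.5):
    """Flag metrics whose relative change from previous to current is large.

    previous and current map metric names (for example "numTotal") to counts.
    threshold is a hypothetical cutoff: 0.5 means a change of 50% or more.
    """
    flagged = {}
    for name, old_value in previous.items():
        new_value = current.get(name, 0)
        if old_value == 0:
            # No baseline to compare against; flag any non-zero appearance.
            if new_value != 0:
                flagged[name] = (old_value, new_value)
            continue
        relative_change = abs(new_value - old_value) / old_value
        if relative_change >= threshold:
            flagged[name] = (old_value, new_value)
    return flagged
```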

Currently, we manually scrub our logs about once a week. We want to move this to an automated monitoring and alarming system. This may involve pumping the logs to CloudWatch (or another system) or writing a custom solution. It may involve sending the metrics in a different format so our monitoring solution doesn't need to parse raw logs.

Monitoring and Alarms

We have logscan alarms in Logentries for 10+ ERRORs in an hour or for 100+ WARNs in an hour. These alarms send an email to bridgeit@sagebase.org.
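For reference, the alarm rule can be expressed as a short Python sketch. This only illustrates the thresholds and is not how Logentries evaluates them: the function name `hours_in_alarm` is hypothetical, events are assumed to be already parsed into (timestamp, level) pairs, and counting per clock hour instead of over a rolling 60-minute window is a simplifying assumption.

```python
from collections import Counter
from datetime import datetime

# Thresholds from the alarm description: 10+ ERRORs or 100+ WARNs in an hour.
THRESHOLDS = {"ERROR": 10, "WARN": 100}


def hours_in_alarm(events):
    """Return sorted (hour, level) pairs whose count meets the threshold.

    events is an iterable of (datetime, level) pairs.
    """
    counts = Counter()
    for timestamp, level in events:
        if level in THRESHOLDS:
            # Bucket by clock hour (assumption; not a rolling window).
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            counts[(hour, level)] += 1
    return sorted(
        (hour, level)
        for (hour, level), count in counts.items()
        if count >= THRESHOLDS[level]
    )
```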

Other than this

Redrives

Limitations

Legacy Hacks

...