This is the template for the agenda for the weekly Dev-Ops meeting. These are the things we should be looking at every week to make sure our services and data pipeline are healthy. (NOTE: The weekly agenda is still being developed, and information on this page is subject to change.)

Bridge Server

NOTE: Jenkins may be failing to email us. Monitor the integ tests manually. If Jenkins frequently fails to send email, we might need to fix it or replace it.

Alarms

We use https://www.hubspot.com/ as our Dev-Ops ticketing system. Alarms notify us by sending an email to bridgeops@sagebase.org, which automatically creates a ticket in HubSpot. Click on Service → Tickets, then All Open Tickets and look at all open unassigned tickets. If tickets can be resolved right away, add a Note and then close the ticket. If tickets require further action, either assign the ticket to yourself, or close the ticket and link a Jira.

Note: To search for notes for past tickets, click the magnifying glass icon in the top nav bar. (The Search field in the Ticket view only searches ticket IDs, names, and descriptions, which rarely contain what we want.)

Note: We chose HubSpot because we needed a way for our Dev-Ops emails to be automatically created as tickets, and we wanted a way to track the open/closed state beyond simply “This is marked as unread in Dwayne’s inbox”. We chose HubSpot specifically because it was easy to set up and has a free tier.

Note: If HubSpot isn’t working, navigate to the mailing list directly at https://groups.google.com/a/sagebase.org/g/bridgeops and look at new messages that have arrived since the last Dev-Ops meeting.

Common types of messages include (this list is not exhaustive):

Sumo Logic Dashboards

See https://www.sumologic.com/

It’s important to look at our dashboards. They inform us of unusual trends that might not individually be enough to set off an alarm, but in aggregate or over long time periods may require intervention. Some alarms indicate that some uploads and exports have failed and may need to be redriven.

A rundown of our dashboard graphs:

Example queries: Sumo Logic Examples

AWS Monitoring

Redrive Exports

Whenever an export in either 2.0 or 3.0 fails due to a bug or server error, we need to redrive it. Our process is as follows:

  1. At the start of each month, create a Jira for this month’s redrives. See for an example.

  2. Over the course of the month, whenever Ops finds an upload that needs to be redriven, add it to the Jira (ideally with the appId and studyId, so we can validate if needed).

  3. At the end of each month, create a text file with each upload ID, one ID per line, and upload it to the S3 bucket org-sagebridge-backfill-prod in the Bridge-Prod AWS account.

  4. Get written approval from the Director of Bridge (currently Erin Mounts). For tracking purposes, this is best done as a comment in the Jira.

  5. Write the JSON message to the Bridge-Worker-Request-Prod SQS queue to kick off the redrive.

  6. You can verify the redrive by going to CloudWatch, opening up the worker logs (log group /aws/elasticbeanstalk/bridgeworker-prod/var/log/tomcat8/catalina.out), and searching for the upload ID(s). You can also verify by going to the Synapse projects for the corresponding studies, looking in Files → Raw Data → today’s date, and searching for the file that starts with your upload ID(s).
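
Steps 3 and 6 above can be sketched in Python as follows. This is a minimal illustration, not our actual tooling: the function names are hypothetical, the upload IDs are assumed to be copied out of the monthly Jira as plain strings, and the log text is assumed to have been copied or exported from the CloudWatch log group by hand. It only prepares the one-ID-per-line file and checks which IDs never show up in the log text. It does not upload to S3 or write to the SQS queue.

```python
from pathlib import Path


def normalize_upload_ids(raw_ids):
    """Strip whitespace, drop blank entries, and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for raw in raw_ids:
        upload_id = raw.strip()
        if upload_id and upload_id not in seen:
            seen.add(upload_id)
            cleaned.append(upload_id)
    return cleaned


def write_redrive_file(raw_ids, path):
    """Write one upload ID per line to `path` and return the cleaned ID list.

    The file name and location are up to the caller (hypothetical here).
    """
    upload_ids = normalize_upload_ids(raw_ids)
    body = "".join(f"{upload_id}\n" for upload_id in upload_ids)
    Path(path).write_text(body, encoding="utf-8")
    return upload_ids


def ids_missing_from_log(upload_ids, log_text):
    """Return the upload IDs that do not appear anywhere in the given log text.

    `log_text` is assumed to be worker log output copied from CloudWatch.
    A plain substring match is used, mirroring a manual search for the ID.
    """
    return [upload_id for upload_id in upload_ids if upload_id not in log_text]
```

The file written by `write_redrive_file` is what would then be uploaded to org-sagebridge-backfill-prod. After the redrive runs, any IDs returned by `ids_missing_from_log` are the ones worth checking by hand in CloudWatch or Synapse.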

See Bridge Data Change Request Process for information on the formal redrive process.

TODO

Mobile Apps

under construction