Weekly Dev-Ops Agenda
This is the template for the weekly Dev-Ops meeting agenda. These are the things we should look at every week to make sure our services and data pipeline are healthy. (NOTE: The weekly agenda is still being developed, and information on this page is subject to change.)
- 1 Bridge Server
- 1.1 Alarms
- 1.2 Sumo Logic Dashboards
- 1.3 AWS Monitoring
- 1.4 Redrive Exports
- 2 Mobile Apps
Bridge Server
NOTE: Jenkins may be failing to email us. Manually monitor the integ tests, and if it frequently fails to email us, we may need to fix or replace it.
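If the email notifications are unreliable, the status of the latest integ test build can also be checked directly against the Jenkins JSON API. A minimal sketch in Python; the job name below is a placeholder, not the actual job name:

    # Check the result of the last completed build via the Jenkins JSON API.
    # "bridge-server-integ-tests" is a hypothetical job name; substitute the real one.
    import requests

    JENKINS = "http://build-system.sagebase.org:8081"
    JOB = "bridge-server-integ-tests"  # placeholder

    resp = requests.get(f"{JENKINS}/job/{JOB}/lastCompletedBuild/api/json")
    resp.raise_for_status()
    build = resp.json()
    # "result" is SUCCESS, UNSTABLE (test failures), or FAILURE.
    print(build["fullDisplayName"], build["result"])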
Alarms
We use https://www.hubspot.com/ as our Dev-Ops ticketing system. Alarms notify us by sending an email to bridgeops@sagebase.org, which automatically creates a ticket in HubSpot. Click on Service → Tickets, then All Open Tickets and look at all open unassigned tickets. If tickets can be resolved right away, add a Note and then close the ticket. If tickets require further action, either assign the ticket to yourself, or close the ticket and link a Jira.
Note: To search notes on past tickets, click the magnifying glass icon in the top nav bar. (The Search field in the Ticket view only searches ticket IDs, names, and descriptions, which rarely contain what we want.)
Note: We chose HubSpot because we needed a way for our Dev-Ops emails to be automatically created as tickets, and we wanted a way to track the open/closed state beyond simply “This is marked as unread in Dwayne’s inbox”. We chose HubSpot specifically because it was easy to set up and has a free tier.
Note: If Hubspot isn’t working, navigate to the mailing list directly at https://groups.google.com/a/sagebase.org/g/bridgeops and look at new messages that have arrived since the last Dev-Ops meeting.
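For quick triage outside the HubSpot UI, tickets can also be pulled through the HubSpot CRM Tickets API. A minimal sketch, assuming a private-app access token in a HUBSPOT_TOKEN environment variable (the variable name and the property list are assumptions, not part of our documented setup):

    # List recent tickets via the HubSpot CRM v3 Tickets API.
    # Filtering to open tickets is left out because the pipeline stage IDs
    # depend on our specific HubSpot configuration.
    import os
    import requests

    token = os.environ["HUBSPOT_TOKEN"]  # assumed env var name
    resp = requests.get(
        "https://api.hubapi.com/crm/v3/objects/tickets",
        headers={"Authorization": f"Bearer {token}"},
        params={"limit": 50, "properties": "subject,hs_pipeline_stage,createdate"},
    )
    resp.raise_for_status()
    for ticket in resp.json()["results"]:
        props = ticket["properties"]
        print(ticket["id"], props.get("createdate"), props.get("subject"))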
Common types of messages include (this list is not exhaustive):
bridgeserver2-prod-AWSCWRequestErrorAlarm - Errors in the Bridge logs. Common errors include:
https://sagebionetworks.jira.com/browse/BRIDGE-3330 - Struts Error
https://sagebionetworks.jira.com/browse/BRIDGE-3194 - MissingServletRequestParameterException for startTime, type, and includeDeleted
https://sagebionetworks.jira.com/browse/BRIDGE-2492 - Illegal % escape characters in URL throws exception
https://sagebionetworks.jira.com/browse/BRIDGE-3331 - Adherence report duplicate key exception
https://sagebionetworks.jira.com/browse/BRIDGE-3369 - NO_SCHEDULE enum constant error
bridgeworker-prod-AWSCWRequestErrorAlarm - Similar to above, but for the Worker.
bridgeserver2-prod-AWSCW4XXAlarm-B675MRYR0N2M - HTTP 4XXs from Bridge Server (excludes 401s, which happen all the time).
bridgeserver2-antivirus-prod-VirusScannerLambdaErrorAlarm - Errors in our antivirus scanner. For more documentation on known antivirus errors, see the Error Messages section of Virus Scanning of Files.
Jenkins build became unstable / is still unstable - Our integration tests failed. See http://build-system.sagebase.org:8081/. This could mean either that a bug was found in our code or that our dev or staging environments are unhealthy. It is important to keep our integ test pipeline reliable so that we can catch bugs before they go to production.
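To get a quick snapshot of which of these alarms are currently firing without clicking through the AWS console, something like the following can be run. A minimal sketch, assuming boto3 and credentials for the Bridge-Prod account are already configured:

    # List CloudWatch alarms that are currently in the ALARM state.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    for page in cloudwatch.get_paginator("describe_alarms").paginate(StateValue="ALARM"):
        for alarm in page["MetricAlarms"]:
            print(alarm["AlarmName"], alarm["StateUpdatedTimestamp"], alarm["StateReason"])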
Sumo Logic Dashboards
See https://www.sumologic.com/
It’s important to look at our dashboards. They inform us of unusual trends that might not individually be enough to set off an alarm, but in aggregate or over long time periods may require intervention. Some of these alarms and graphs indicate that uploads or exports have failed and may need to be redriven.
A rundown of our dashboard graphs:
Exporter 3.0 Errors (Server) - Yes, this graph is confirmed to work. You shouldn’t ever see any errors here, but if you do, it means that something in the server failed and needs to be investigated.
Worker Errors - May indicate that an export failed, so we should look at this closely. There are many errors related to timeouts to table syn26890553, which Dwayne is currently investigating. FitBit errors can currently be ignored.
Worker Retryable Errors - This is usually because Synapse isn't writable (Synapse is in a read-only state during its weekly release). This shouldn't be a problem unless it lasts a long time or happens at unexpected times.
BridgeEX numRecords per study - Despite the name, this is actually Exporter 2.0 exports per app. This is only a concern if the graph changes dramatically and unexpectedly (e.g., a line spikes up or disappears completely).
Bridge Upload Validation Errors per study - Exporter 2.0 errors in Server. We track these so we know if/when things need to be redriven, but we don’t currently have the tools to dive deep into these.
Strict Upload Validation Errors - Exporter 2.0 messages related to incomplete or missing fields in a schema. Currently, none of the apps generating these messages are configured to block exports on Strict Upload Validation, so this graph can mostly be ignored.
BridgeEX Daily Export avg runtime - Exporter 2.0 runtime in seconds. Currently averages around 2400 seconds (~40 minutes). Can be an issue if this spikes dramatically. Becomes an issue if it goes above 4 hours (14400 seconds).
BridgeEX Hourly Export avg runtime - A few apps export on an hourly basis. Averages 10-15 seconds. Not really a problem unless it spikes up to several minutes.
BridgeEX Errors - Exporter 2.0 errors. Most of these are Synapse not writable. Other errors probably require further action.
Upload Auto-Complete numUploads - This is the feature that receives notifications from our Upload bucket and automatically calls Upload Complete for redundancy. This is only a problem if the graph spikes up or plummets unexpectedly.
BridgeUDD avg runtime - Bridge User Data Download. Average runtime is about 240 seconds. This worker doesn’t get a lot of traffic, so the graph might be volatile. Probably not an issue unless the graph spikes up to 20-30 minutes or so.
BridgeServer 4XX Errors - This graph is equivalent to the 4XX alarm above. This exists mostly so that when the alarm goes off, we can look at the results in Sumo Logic quickly. Might also be an issue if the graph disappears completely.
Server Errors - Same as the Server Errors alarm above and exists for the same reason.
EX3.0 Participant Versions per App - Count of participant versions being written to app-wide Synapse projects. Mainly for informational purposes, but we should look for unexpected trends. Note that if there is no app-wide Synapse project but there are study-specific Synapse projects, the data point will not show up on this graph.
EX3.0 Participant Versions per Study - Same as above, but for studies. TODO Dwayne to make a code change to log this metric using study name instead of study ID.
EX3.0 Exports per App - Count of uploads exported to the app-wide Synapse projects.
EX3.0 Exports per Study - Count of uploads exported to the study-specific Synapse projects.
Append Participant Version Latency - Latency of the append to table Synapse APIs in milliseconds, from the async start until the async get returns a result. We have a client-side timeout of 5 minutes (300,000 milliseconds).
Append Participant Demographics Latency - Latency of the append to table call for participant demographics. Note that this feature has not yet been launched in production, so the graph is currently blank.
Antivirus Scanner Count - Number of files scanned by our antivirus scanner. Not a concern unless the line unexpectedly spikes up, plummets, or disappears.
Antivirus Scanner Latency - Antivirus scanner latency in milliseconds. TODO Determine what is normal for the antivirus scanner.
Antivirus Scanner Errors - See the Error Messages section of Virus Scanning of Files.
Antivirus Scanner Results - Result of the antivirus scanner. Currently, all results are clean. TODO What does it look like when the antivirus scanner detects a virus? Talk to IT to figure out how we can test this.
Promotional SMS (All), Promotional SMS (Failed), Transactional SMS (All), Transactional SMS (Failed) - Promotional SMS (Failed) is only for study burst reminders, and the failures are generally opt-outs. These aren't worth looking into. However, Transactional SMS is used for things like SMS sign-in, so all Transactional SMS failures need to be looked at.
Example queries: Sumo Logic Examples
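If that page is unavailable, queries can also be run ad hoc through the Sumo Logic Search Job API. A minimal sketch; the access ID/key environment variables, the API endpoint (which is deployment-specific), and the _sourceCategory in the query are placeholders, not our confirmed configuration:

    # Run an ad hoc query via the Sumo Logic Search Job API.
    import os
    import time
    import requests

    session = requests.Session()  # the Search Job API relies on session cookies
    session.auth = (os.environ["SUMO_ACCESS_ID"], os.environ["SUMO_ACCESS_KEY"])
    api = "https://api.sumologic.com/api/v1"  # adjust for our Sumo Logic deployment

    job = session.post(f"{api}/search/jobs", json={
        "query": '_sourceCategory=bridge/worker-prod "ERROR"',  # placeholder query
        "from": "2024-01-01T00:00:00",  # placeholder time range
        "to": "2024-01-02T00:00:00",
        "timeZone": "UTC",
    }).json()

    # Poll until the search job finishes, then print the first few matching messages.
    while session.get(f"{api}/search/jobs/{job['id']}").json()["state"] != "DONE GATHERING RESULTS":
        time.sleep(5)
    messages = session.get(f"{api}/search/jobs/{job['id']}/messages",
                           params={"offset": 0, "limit": 10}).json()
    for m in messages["messages"]:
        print(m["map"]["_raw"])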
AWS Monitoring
Simple Email Service → Reputation Dashboard / Automatic Suppression List - Check the bounce rate and the complaint rate. If bounces or complaints get too high, AWS can shut down our ability to send email. This shuts down any study that uses email for authentication (which is a majority of studies).
Note that the metrics are based on the last 10000 emails, and we send something like 10 per day, so this metric changes very slowly.
To see individual bounces and complaints, go to https://groups.google.com/a/sagebase.org/g/bridge-bounce-notifications. Bounces are mostly fine. Complaints need to be looked at closely.
An automatic suppression list has been enabled, preventing future emails from being sent to addresses that have returned bounces or complaints. Review the list of suppressed addresses for any that should be re-enabled.
Simple Notification Service → Text Messaging (SMS) - Scroll down to Delivery Status Logs and look at SMS delivery since the last Dev-Ops meeting. Failures due to opt-outs are generally fine. Other failures should be looked at closely, especially if they're widespread. (This check is no longer needed, because the logs now go to Sumo Logic.)
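The same bounce/complaint numbers and the suppression list can also be pulled without the console, using the AWS/SES CloudWatch reputation metrics and the SES v2 API. A minimal sketch, assuming boto3 and credentials/region for the prod account that sends our email:

    # Check SES reputation metrics and list addresses on the account suppression list.
    from datetime import datetime, timedelta, timezone
    import boto3

    now = datetime.now(timezone.utc)
    cloudwatch = boto3.client("cloudwatch")
    for metric in ("Reputation.BounceRate", "Reputation.ComplaintRate"):
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/SES", MetricName=metric,
            StartTime=now - timedelta(days=1), EndTime=now,
            Period=3600, Statistics=["Maximum"],
        )
        points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
        if points:
            print(metric, points[-1]["Maximum"])

    # Account-level suppression list (bounces and complaints); paginate via NextToken
    # if there is more than one page of results.
    sesv2 = boto3.client("sesv2")
    resp = sesv2.list_suppressed_destinations(PageSize=100)
    for dest in resp.get("SuppressedDestinationSummaries", []):
        print(dest["EmailAddress"], dest["Reason"], dest["LastUpdateTime"])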
Redrive Exports
Whenever an export in either 2.0 or 3.0 fails due to a bug or server error, we need to redrive those exports. Our process is as follows:
At the start of each month, create a Jira for this month's redrives. See DHP-990 for an example.
Over the course of the month, whenever Ops finds an upload that needs to be redriven, add it to the Jira (ideally with the appId and studyId, so we can validate if needed).
At the end of each month, create a text file with each upload ID (one ID per line) and upload it to the S3 bucket org-sagebridge-backfill-prod in the Bridge-Prod AWS account. (A boto3 sketch of these last few steps appears after this list.)
Get written approval from the Director of Bridge (currently Erin Mounts). For tracking purposes, this is best done as a comment in the Jira.
Write the JSON message to the Bridge-Worker-Request-Prod SQS queue to kick off the redrive.
You can verify the redrive by going to CloudWatch, opening up the worker logs (log group /aws/elasticbeanstalk/bridgeworker-prod/var/log/tomcat8/catalina.out), and searching for the upload ID(s). You can also verify by going to the Synapse projects for the corresponding studies, looking in Files → Raw Data → today's date, and searching for the files that start with your upload ID(s).
See Bridge Data Change Request Process for information on the formal redrive process.
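For reference, a rough boto3 sketch of the S3 upload, SQS message, and log verification steps above. The SQS message body shown here is only a placeholder; the real redrive request format is defined by the worker (see the Bridge Data Change Request Process and the worker documentation):

    # Rough outline of the redrive steps, run with credentials for the Bridge-Prod account.
    import json
    import boto3

    # 1. Upload the text file of upload IDs (one per line) to the backfill bucket.
    s3 = boto3.client("s3")
    s3.upload_file("upload-ids.txt", "org-sagebridge-backfill-prod", "upload-ids.txt")

    # 2. Send the redrive request to the worker queue.
    #    The message body is a PLACEHOLDER; use the actual redrive request schema.
    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(QueueName="Bridge-Worker-Request-Prod")["QueueUrl"]
    message = {"placeholder": "see the worker docs for the real redrive request format"}
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(message))

    # 3. Spot-check the worker logs in CloudWatch for one of the upload IDs.
    logs = boto3.client("logs")
    events = logs.filter_log_events(
        logGroupName="/aws/elasticbeanstalk/bridgeworker-prod/var/log/tomcat8/catalina.out",
        filterPattern='"YOUR-UPLOAD-ID"',  # substitute a real upload ID
    )
    for event in events["events"]:
        print(event["message"])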
TODO - DHP-315
Mobile Apps
under construction