Elevated rate of HTTP 500s on load balancer should trigger an alarm

Description

didn't immediately trigger alarms and our statuspage.io lambda didn't catch the elevated error rate. Creating an alarm for this case would notify us of an issue sooner, and possibly point us towards a cause or fix for future issues.

Environment

None

Activity

Kevin Boske
April 9, 2021, 10:09 PM

STACK REVIEW: Does not block deployment for 352

Marco Marasca
April 7, 2021, 5:59 PM

I think we will have to monitor the alarms and tweak them over time:

  • For the 5xx from the targets, we should see them going off when we switch to read-only mode during a release

  • For the 5xx from the LB, it should probably never go off (in the last release we had an incident that we didn’t notice and that would have triggered this alarm)

  • For the target response time, we saw it going off on staging, but it will probably need to be tweaked to use more evaluation periods (e.g. 2 out of 3 instead of 1) if it bothers us (e.g. on staging there were very few requests)

  • The rejected connections and target connections should never go off
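
To illustrate the "2 out of 3" idea from the response time item above, here is a minimal Python sketch of M-out-of-N alarm evaluation. It is a simplification: it ignores missing data and other details of how CloudWatch actually evaluates alarms, and the function name and sample values are hypothetical.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return, per period, whether an M-out-of-N alarm would be alarming.

    An alarm fires when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are at or above `threshold`.
    """
    states = []
    for i in range(len(datapoints)):
        # Sliding window over the most recent evaluation periods.
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for value in window if value >= threshold)
        states.append(breaching >= datapoints_to_alarm)
    return states


# Hypothetical average response times in seconds: one isolated spike,
# then a slowdown that lasts two periods.
samples = [0.2, 1.4, 0.3, 0.2, 1.2, 1.5]

one_of_one = alarm_states(samples, 1.0, 1, 1)
two_of_three = alarm_states(samples, 1.0, 3, 2)
```

With these sample values the 1-out-of-1 setting alarms on the isolated spike, while the 2-out-of-3 setting only alarms once the slowdown persists.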

We can set up a dedicated stack to validate all this, but we would need help putting the stack into the various alarm conditions (e.g. take all the instances down, take the DB down, etc.).

Bruce Hoff
April 7, 2021, 5:50 PM

and , can you decide what, if anything, needs to be done to 'validate' this issue?

Marco Marasca
April 2, 2021, 11:15 PM

We added alarms from the ALB metrics for both the portal and the repo (configurable in the stack builder):

  • On 5xx originating from the LB: if sum >= 20 within a 60 sec period (Metric: HTTPCode_ELB_5XX_Count)

  • On 5xx originating from the targets: if sum >= 20 within a 60 sec period (Metric: HTTPCode_Target_5XX_Count)

  • On target response time of targets: if average >= 1 sec within a 5 min period (Metric: TargetResponseTime)

  • On rejected connections at the LB: if sum > 0 within a 60 sec period (Metric: RejectedConnectionCount)

  • On target connection errors: if sum > 0 within a 60 sec period (Metric: TargetConnectionErrorCount)
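
As a rough sketch, the thresholds listed above could be written down as CloudWatch alarm parameters like this. The helper name, the alarm naming scheme, the example load balancer value, and the single evaluation period are assumptions for illustration, not the actual stack builder code. The keys follow the parameter names of CloudWatch's PutMetricAlarm API, so each dict could be passed to boto3's `put_metric_alarm`.

```python
# Namespace used by CloudWatch for Application Load Balancer metrics.
ALB_NAMESPACE = "AWS/ApplicationELB"

# (metric, statistic, period in seconds, comparison operator, threshold)
ALARM_SPECS = [
    ("HTTPCode_ELB_5XX_Count", "Sum", 60, "GreaterThanOrEqualToThreshold", 20),
    ("HTTPCode_Target_5XX_Count", "Sum", 60, "GreaterThanOrEqualToThreshold", 20),
    ("TargetResponseTime", "Average", 300, "GreaterThanOrEqualToThreshold", 1.0),
    ("RejectedConnectionCount", "Sum", 60, "GreaterThanThreshold", 0),
    ("TargetConnectionErrorCount", "Sum", 60, "GreaterThanThreshold", 0),
]


def build_alarm_params(load_balancer, env_name):
    """Build one PutMetricAlarm-style parameter dict per alarm (hypothetical helper)."""
    params = []
    for metric, statistic, period, operator, threshold in ALARM_SPECS:
        params.append({
            "AlarmName": f"{env_name}-{metric}",  # assumed naming scheme
            "Namespace": ALB_NAMESPACE,
            "MetricName": metric,
            "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
            "Statistic": statistic,
            "Period": period,
            "EvaluationPeriods": 1,  # assumption: a single evaluation period
            "Threshold": threshold,
            "ComparisonOperator": operator,
        })
    return params


# Placeholder load balancer dimension value, not a real resource.
alarms = build_alarm_params("app/example/0123456789abcdef", "repo")
```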

The values are based on the last 2 weeks of data from the stacks. The last 2 metrics didn't actually show up. The list of available metrics:

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html

Validation would be tricky, but the alarms for 5xx from the targets should go off when we switch to read-only mode during a release.

Marco Marasca
March 31, 2021, 5:51 PM

I managed to integrate alarms for the load balancer into our Elastic Beanstalk configuration by adding a special config in .ebextensions that is processed as part of the CloudFormation stack (EB automatically creates a load balancer that we do not control, and this is our injection point for CloudFormation resources). I will set a baseline for now based on stack 350 over the past 2 weeks, but it will need to be different for each of the 3 envs (repo, workers, portals).
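
For readers unfamiliar with this injection point, the following Python sketch generates what such an .ebextensions config might look like in JSON form. It is not the actual config from the stack builder. The alarm logical ID, the helper name, and the threshold are illustrative, and it assumes that the load balancer created by EB can be referenced by the logical name `AWSEBV2LoadBalancer`.

```python
import json


def alb_alarm_config(alarm_logical_id, metric_name, threshold):
    """Build an .ebextensions-style document with one CloudWatch alarm (hypothetical helper)."""
    return {
        "Resources": {
            alarm_logical_id: {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{
                        "Name": "LoadBalancer",
                        # Assumption: logical name of the ALB that EB creates.
                        "Value": {"Fn::GetAtt": ["AWSEBV2LoadBalancer", "LoadBalancerFullName"]},
                    }],
                    "Statistic": "Sum",
                    "Period": 60,
                    "EvaluationPeriods": 1,
                    "Threshold": threshold,
                    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
                },
            }
        }
    }


# Illustrative values only; real baselines would differ per environment.
config = alb_alarm_config("LoadBalancer5xxAlarm", "HTTPCode_ELB_5XX_Count", 20)
print(json.dumps(config, indent=2))
```

Generating the document per environment would make it straightforward to use a different baseline for repo, workers, and portals.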

Fixed

Assignee

Marco Marasca

Reporter

Nick Grosenbacher

Validator

Xavier Schildwachter

Priority

Major

Labels

None

Development Area

Synapse Core Infrastructure

Sprint

Fix versions

Release Version History

None

Story Points

None

Epic Link

None

Slack Channel

None