Elevated rate of HTTP 500s on load balancer should trigger an alarm
Description
didn't immediately trigger alarms and our statuspage.io lambda didn't catch the elevated error rate. Creating an alarm for this case would notify us of an issue sooner, and possibly point us towards a cause or fix for future issues.
Environment
Activity
STACK REVIEW: Does not block deployment for 352
I think we will have to monitor the alarms and tweak them over time:
For the 5xx from the targets: we should see the alarm go off when we switch to read-only mode during a release
For the 5xx from the LB: it should probably never go off (we had an incident in the past release that we didn't notice and that would have triggered this alarm)
For the target response time: we saw it go off on staging, but it will probably need to be tweaked to use more evaluation periods (e.g. 2 out of 3 instead of 1) if it keeps bothering us (on staging there were very few requests); see the sketch below
The rejected connections and target connection error alarms should never go off
We can set up a dedicated stack to validate all this, but we would need help to put the stack in the various alarm conditions (e.g. take all the instances down, take the DB down, etc.).
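A minimal sketch of that "2 out of 3" tweak on the response-time alarm, written as a plain CloudFormation resource; the logical name, the dimension value and the TreatMissingData choice are placeholders, not what is currently deployed:

# This would sit under the stack's Resources section; names and values below are illustrative.
TargetResponseTimeAlarm:                   # placeholder logical name
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/ApplicationELB
    MetricName: TargetResponseTime
    Dimensions:
      - Name: LoadBalancer
        Value: app/my-alb/0123456789abcdef # placeholder: the ALB's LoadBalancerFullName
    Statistic: Average
    Period: 300                            # 5 min window, as configured today
    Threshold: 1                           # seconds
    ComparisonOperator: GreaterThanOrEqualToThreshold
    EvaluationPeriods: 3                   # look at the last 3 windows...
    DatapointsToAlarm: 2                   # ...and alarm only when 2 of them breach
    TreatMissingData: notBreaching         # assumption: quiet envs (e.g. staging) stay OK with no data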
and , can you decide what, if anything, needs to be done to 'validate' this issue?
We added alarms based on the ALB metrics for both the portal and the repo (configurable in the stack builder); a CloudFormation sketch of the 5xx ones follows at the end of this comment:
On 5xx originating from the LB: if sum >= 20 within a 60 sec period (Metric: HTTPCode_ELB_5XX_Count)
On 5xx originating from the targets: if sum >= 20 within a 60 sec period (Metric: HTTPCode_Target_5XX_Count)
On target response time: if average >= 1 sec within a 5 min period (Metric: TargetResponseTime)
On rejected connections at the LB: if sum > 0 within a 60 sec period (Metric: RejectedConnectionCount)
On target connection errors: if sum > 0 within a 60 sec period (Metric: TargetConnectionErrorCount)
The values were based on the last 2 weeks of data from the stacks. Note that the last 2 metrics didn't actually show up in the list of available metrics.
Validation would be tricky, but the alarm on 5xx from the targets should go off when we switch to read-only mode during a release.
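For reference, a rough sketch of how the two 5xx alarms above map onto CloudFormation resources; the logical names and the dimension value are placeholders and alarm actions are omitted (the real definitions live in the stack builder):

# These would sit under the stack's Resources section; names and values below are illustrative.
LB5xxAlarm:                                # placeholder logical name
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: 5xx generated by the load balancer itself
    Namespace: AWS/ApplicationELB
    MetricName: HTTPCode_ELB_5XX_Count
    Dimensions:
      - Name: LoadBalancer
        Value: app/my-alb/0123456789abcdef # placeholder: the ALB's LoadBalancerFullName
    Statistic: Sum
    Period: 60                             # 60 sec window
    EvaluationPeriods: 1
    Threshold: 20                          # sum >= 20
    ComparisonOperator: GreaterThanOrEqualToThreshold
Target5xxAlarm:                            # placeholder logical name
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: 5xx returned by the targets (expected to fire in read-only mode)
    Namespace: AWS/ApplicationELB
    MetricName: HTTPCode_Target_5XX_Count
    Dimensions:
      - Name: LoadBalancer
        Value: app/my-alb/0123456789abcdef
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 20
    ComparisonOperator: GreaterThanOrEqualToThreshold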
I managed to integrate alarms for the load balancer into our Elastic Beanstalk configuration by adding a special config in .ebextensions that gets processed into the CloudFormation stack (EB automatically creates a load balancer that we do not control, and this is our injection point for CloudFormation resources). I will set a baseline for now based on the last 2 weeks of stack 350, but it will need to be different for each of the 3 envs (repo, workers, portals).
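A rough sketch of that injection point, assuming the file name and the alarm below are placeholders, and assuming the EB-generated Application Load Balancer has the logical ID AWSEBV2LoadBalancer (worth confirming against the generated CloudFormation template):

# .ebextensions/loadbalancer-alarms.config (placeholder file name)
# Anything under "Resources" is merged into the CloudFormation stack that EB generates,
# which is how we can attach alarms to the load balancer EB creates on our behalf.
Resources:
  LBRejectedConnectionsAlarm:              # placeholder logical name
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: The LB rejected connections because it had no capacity left
      Namespace: AWS/ApplicationELB
      MetricName: RejectedConnectionCount
      Dimensions:
        - Name: LoadBalancer
          Value:
            Fn::GetAtt: [AWSEBV2LoadBalancer, LoadBalancerFullName] # assumption: EB's logical ID for the ALB
      Statistic: Sum
      Period: 60                           # 60 sec window
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold  # sum > 0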