Synapse outage (?)

Description

I cannot access backend services; all requests are timing out.

Environment

None

Activity

Aaron Hayden
January 6, 2021, 9:41 PM
Nick Grosenbacher
January 6, 2021, 7:00 PM

Closing because the outage was resolved yesterday. If this ticket is reopened for further postmortem analysis, please remove me as validator.

Marco Marasca
January 6, 2021, 6:46 PM

Checked the POST /login and /session request counts again today to see whether there was a brute-force attempt (yesterday's results were masked by the timeouts):

Date        Status Code  Endpoint       Count
2021-01-06  201          POST /login     5077
2021-01-06  401          POST /login       55
2021-01-06  201          POST /session   4849
2021-01-06  401          POST /session      4
2021-01-05  201          POST /login    25932
2021-01-05  401          POST /login       81
2021-01-05  423          POST /login        3
2021-01-05  503          POST /session  16881
2021-01-05  201          POST /session   6892
2021-01-05  500          POST /session     13
2021-01-05  401          POST /session     12
2021-01-04  201          POST /login    40149
2021-01-04  401          POST /login      101
2021-01-04  423          POST /login       10
2021-01-04  201          POST /session   7288
2021-01-04  401          POST /session     24

Seems that we are back to normal. The user started receiving the authentication requests, meaning that the logins for the account were successful.
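
For reference, a minimal sketch of how counts like the table above can be pulled from the access logs, assuming a hypothetical JSON-lines log with timestamp, method, path and status fields (the actual log format and tooling may differ):

```python
import json
from collections import Counter

def count_auth_requests(log_path):
    """Count POST /login and POST /session requests per date and status code.

    Assumes a JSON-lines access log with 'timestamp' (ISO 8601), 'method',
    'path' and 'status' fields; adjust the field names to the real format.
    """
    counts = Counter()
    with open(log_path) as fh:
        for line in fh:
            entry = json.loads(line)
            if entry["method"] != "POST" or entry["path"] not in ("/login", "/session"):
                continue
            date = entry["timestamp"][:10]  # e.g. "2021-01-05"
            counts[(date, entry["status"], entry["path"])] += 1
    return counts

if __name__ == "__main__":
    for (date, status, path), n in sorted(count_auth_requests("access.log").items()):
        print(f"{date}  {status}  POST {path:<10} {n}")
```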

Marco Marasca
January 6, 2021, 4:17 AM

An additional observation: after the new set of instances was up, the old instances were still producing logs several hours later, meaning that requests were still going through them. This might also have added to the available capacity, reducing the potential for saturation by slow requests.

Marco Marasca
January 6, 2021, 2:42 AM

The DB engine wasn't doing much work; it was mostly waiting for locks to become available and then timing out the requests (with a default lock wait timeout of 50 seconds). In fact, the DB instance itself wasn't saturated. We reached a point where several hundred transactions were all waiting for lock acquisition to time out. The backend uses a connection pool, and each instance is configured with a maximum of 40 connections in the pool. I noticed that the profiler was showing very slow responses from simple DB calls, mostly on the delete operation for the authentication receipts (which would fail after 50 seconds, leading to the high number of 503s above for POST /session). The instances then couldn't answer the health checker, probably timing out, and they all went red.
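
For future reference, a minimal diagnostic sketch for listing the transactions stuck waiting on locks, assuming the DB engine is MySQL/InnoDB (the 50-second default matches innodb_lock_wait_timeout); the host, credentials and the pymysql driver are placeholders rather than what was actually run:

```python
# Sketch only: assumes MySQL/InnoDB and read access to information_schema;
# host, user and password below are placeholders.
import pymysql

conn = pymysql.connect(host="db-host", user="readonly", password="secret",
                       database="information_schema")
try:
    with conn.cursor() as cur:
        # Transactions currently blocked on a row lock, how long they have
        # been waiting, and the statement they were executing.
        cur.execute("""
            SELECT trx_id,
                   trx_started,
                   trx_wait_started,
                   TIMESTAMPDIFF(SECOND, trx_wait_started, NOW()) AS wait_seconds,
                   trx_query
            FROM information_schema.innodb_trx
            WHERE trx_state = 'LOCK WAIT'
            ORDER BY trx_wait_started
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```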

This outage gave us some room as the number of requests dropped; then a new set of instances was started that could serve additional requests, buying us some more time. When we renamed the username, new "problematic" login requests triggering the delete of authentication receipts stopped coming in, and we could clear the authentication receipts after some time. We didn't notice other transactions waiting on locks coming in after we renamed the username back to its original value.

Resolution

Fixed

Assignee

Xavier Schildwachter

Reporter

Nick Grosenbacher

Validator

Nick Grosenbacher

Priority

Blocker

Labels

None

Development Area

Synapse Core Infrastructure

Sprint

None

Fix versions

Release Version History

None

Story Points

None

Epic Link

None

Slack Channel

None