Synapse outage (?)
Description
I cannot access backend services; all requests are timing out.
Environment
Activity
Post-incident analysis will be here: Synapse Platform Outage - Incidents - Sage Bionetworks | Wiki (jira.com)
Closing because the outage was resolved yesterday. If this ticket is reopened for further postmortem analysis, please remove me as validator.
Checked again today the counts of POST /login and /session requests to see whether there was a brute-force attempt (yesterday the results were masked by the timeouts); a rough sketch of the query is below the table:
| Date | Status Code | Endpoint | Count |
|---|---|---|---|
| 2021-01-06 | 201 | POST /login | 5077 |
| 2021-01-06 | 401 | POST /login | 55 |
| 2021-01-06 | 201 | POST /session | 4849 |
| 2021-01-06 | 401 | POST /session | 4 |
| 2021-01-05 | 201 | POST /login | 25932 |
| 2021-01-05 | 401 | POST /login | 81 |
| 2021-01-05 | 423 | POST /login | 3 |
| 2021-01-05 | 503 | POST /session | 16881 |
| 2021-01-05 | 201 | POST /session | 6892 |
| 2021-01-05 | 500 | POST /session | 13 |
| 2021-01-05 | 401 | POST /session | 12 |
| 2021-01-04 | 201 | POST /login | 40149 |
| 2021-01-04 | 401 | POST /login | 101 |
| 2021-01-04 | 423 | POST /login | 10 |
| 2021-01-04 | 201 | POST /session | 7288 |
| 2021-01-04 | 401 | POST /session | 24 |
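For reference, a minimal sketch of how these daily counts could be pulled, assuming the access logs are shipped to CloudWatch Logs; the log group name and the field names (`verb`, `path`, `status`) are placeholders, not necessarily the actual log schema:

```python
# Sketch: aggregate daily counts of POST /login and /session by status code
# from access logs, assuming they live in CloudWatch Logs. Field names and
# the log group are assumptions, not the confirmed schema.
import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields @timestamp
| filter verb = "POST" and (path like "/login" or path like "/session")
| stats count(*) as requests by bin(1d), status, path
"""

def post_auth_counts(log_group, start_ts, end_ts):
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start_ts,
        endTime=end_ts,
        queryString=QUERY,
    )["queryId"]
    # Poll until the Logs Insights query finishes, then return the rows.
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp["results"]
        time.sleep(2)
```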
It seems we are back to normal. The user started receiving the authentication requests again, meaning that the logins for the account were successful.
An additional observation: after the new set of instances was up, the old instances were still producing logs several hours later, meaning that requests were still going through them. This might also have added to the available capacity, reducing the potential for saturation by slow requests.
The DB engine wasn't doing much work; it was mostly waiting for locks to become available and then timing out the requests (with the default lock wait timeout of 50 seconds). In fact, the DB instance itself wasn't saturated. We reached a point where several hundred transactions were all waiting for lock acquisition to time out. The backend uses a connection pool, and each instance is configured with a maximum of 40 connections in the pool. I noticed that the profiler was showing very slow responses from simple DB calls, mostly on the delete operation for the authentication receipts (which would fail after 50 seconds, leading to the high number of 503s for POST /session above). The instances then couldn't answer the health checker, probably timing out, and they all went red.
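To make the lock pile-up concrete, here is a minimal sketch of how it could be inspected on the MySQL side, assuming an RDS MySQL 5.7+ instance with the sys schema available; the connection parameters are placeholders:

```python
# Sketch: list transactions currently blocked on row locks, plus the
# configured lock wait timeout. Connection parameters are placeholders.
import pymysql

conn = pymysql.connect(host="db-host", user="admin", password="...", database="sys")

with conn.cursor() as cur:
    # InnoDB's default lock wait timeout is 50 seconds, matching the observed failures.
    cur.execute("SHOW VARIABLES LIKE 'innodb_lock_wait_timeout'")
    print(cur.fetchone())

    # Each row is one transaction waiting on a lock held by another transaction.
    cur.execute(
        """
        SELECT wait_age, waiting_query, blocking_pid, blocking_query
        FROM sys.innodb_lock_waits
        ORDER BY wait_age DESC
        """
    )
    for row in cur.fetchall():
        print(row)

conn.close()
```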
The outage itself gave us some room as the number of requests dropped; a new set of instances was then started that could serve additional requests, buying us some more time. When we renamed the username, the new "problematic" login requests that triggered the delete of authentication receipts stopped coming in, and after some time we could clear the authentication receipts. After we renamed the username back to its original value, we did not notice any other transactions waiting on locks.
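For completeness, a hypothetical sketch of what clearing the receipts in small batches could look like, so each delete commits quickly and releases its row locks instead of queueing behind one long-running transaction; the table and column names (AUTHENTICATION_RECEIPT, PRINCIPAL_ID) and the principal id are assumptions, not confirmed schema:

```python
# Sketch: delete stale authentication receipts for one account in small
# batches so locks are held only briefly. Table/column names are assumptions.
import pymysql

USER_ID = 12345  # placeholder principal id of the affected account

conn = pymysql.connect(host="db-host", user="admin", password="...", database="synapse")
conn.autocommit(True)  # commit after each batch so row locks are released quickly

with conn.cursor() as cur:
    while True:
        # DELETE ... LIMIT keeps each transaction small and short-lived.
        deleted = cur.execute(
            "DELETE FROM AUTHENTICATION_RECEIPT WHERE PRINCIPAL_ID = %s LIMIT 1000",
            (USER_ID,),
        )
        if deleted == 0:
            break

conn.close()
```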