On Friday 22 October 2021, for a 16-minute period between 13:04 and 13:20 BST, the Pay platform experienced severely degraded availability of our Public API due to the failure of a Redis node used to enforce rate limiting. During this time most payments made using both our direct API integration and payment links would have failed. Paying users would likely have seen an error screen when they tried to make a payment.
Our team responded quickly to automated alerts and began investigating the issue. Service was restored by automated processes within 16 minutes.
We take platform availability very seriously and are sorry for the impact this outage had on your users and service teams.
We have conducted a post-incident review to understand what happened and to identify what we'll do to prevent a recurrence.
The publicapi microservice is our platform's API gateway and handles all API requests. It uses a Redis database (a managed AWS ElastiCache instance) to keep track of request rates for each service and to enforce rate-limiting controls.
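This post-mortem doesn't include the rate-limiting code itself, but the general pattern is a shared, windowed counter per service. The sketch below is a minimal fixed-window limiter with illustrative names; a ConcurrentHashMap stands in for Redis's INCR/EXPIRE so the example is self-contained, whereas a real gateway would keep the counters in Redis so every publicapi node shares one view of each service's request rate.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a fixed-window rate limiter. A real gateway would hold the
// counters in Redis (INCR + EXPIRE) so all nodes share one view of each
// service's request rate; here an in-memory map stands in for Redis.
class FixedWindowRateLimiter {
    private final int maxRequestsPerWindow;
    private final long windowMillis;
    // serviceId -> {window number, request count in that window}
    private final Map<String, long[]> counters = new ConcurrentHashMap<>();

    FixedWindowRateLimiter(int maxRequestsPerWindow, long windowMillis) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request is allowed, false if it should get a 429.
    boolean allow(String serviceId, long nowMillis) {
        long window = nowMillis / windowMillis;
        long[] entry = counters.compute(serviceId, (k, v) ->
            (v == null || v[0] != window)
                ? new long[]{window, 1}          // new window: reset the count
                : new long[]{window, v[1] + 1}); // same window: increment
        return entry[1] <= maxRequestsPerWindow;
    }
}
```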
The following is a timeline (in BST) of the outage:
13:04 - The Redis node used for rate limiting failed and automatic reprovisioning began
13:11 - A replacement node was back online
13:19 - publicapi successfully reconnected to Redis. The delay from 13:11 to 13:19 may have been due to high CPU utilisation on the Redis node as it served a backlog of requests.
13:19-13:20 - Some requests to our API were rate limited due to a thundering herd effect (either user retries or intermediate queues). These requests would have received HTTP 429 (Too Many Requests) responses.
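One way integrating clients can avoid contributing to a thundering herd after an outage is to back off with jitter when they receive a 429, rather than retrying immediately. A minimal sketch of that client-side pattern (the class name and the sendRequest stand-in are assumptions for illustration, not part of our API):

```java
import java.util.Random;
import java.util.function.IntSupplier;

// Sketch: a client-side retry loop that backs off with jitter on HTTP 429
// instead of retrying immediately and amplifying a thundering herd.
// sendRequest stands in for a real HTTP call and returns a status code.
class BackoffRetry {
    static int callWithRetry(IntSupplier sendRequest, int maxAttempts, long baseDelayMillis) {
        Random random = new Random();
        int status = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            status = sendRequest.getAsInt();
            if (status != 429) {
                return status; // success, or an error that retrying won't help
            }
            // Exponential backoff with full jitter: sleep 0..base * 2^attempt ms.
            long sleep = (long) (random.nextDouble() * (baseDelayMillis << attempt));
            try {
                Thread.sleep(sleep);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return status;
            }
        }
        return status; // still rate limited after maxAttempts
    }
}
```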
Due to the nature of cloud services, an underlying machine instance will sometimes suffer a hardware failure, or need to be upgraded or shut down for maintenance. The Redis node we were using was configured to use a single availability zone.
AWS also offer a managed "multi-AZ" configuration, which seamlessly synchronises writes to a replica node in a second availability zone and can fail over to it automatically in the case of a failure.
We had not used this because we believed we had a sufficient fallback mechanism in place.
We have a fallback mechanism in publicapi to handle the scenario where Redis is unavailable. Instead of using Redis as a coordinator across the publicapi cluster, each publicapi node should fall back to using a local in-memory cache.
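In outline, the intended design is: try the shared Redis-backed limiter first, and on any Redis error consult a per-node local limiter instead. The sketch below shows that shape; the interface and class names are illustrative, not the real publicapi code.

```java
// Sketch of the intended fallback: prefer the shared (Redis-backed) limiter,
// and if it fails, consult a per-node in-memory limiter instead. The
// interface and names below are illustrative, not the real publicapi code.
interface Limiter {
    boolean checkRateOf(String serviceId) throws Exception;
}

class FallbackRateLimiter {
    private final Limiter redisLimiter; // shared across the cluster
    private final Limiter localLimiter; // this node's in-memory fallback

    FallbackRateLimiter(Limiter redisLimiter, Limiter localLimiter) {
        this.redisLimiter = redisLimiter;
        this.localLimiter = localLimiter;
    }

    boolean checkRateOf(String serviceId) {
        try {
            return redisLimiter.checkRateOf(serviceId);
        } catch (Exception redisDown) {
            // Redis unavailable: fall back to this node's local view.
            try {
                return localLimiter.checkRateOf(serviceId);
            } catch (Exception localFailure) {
                return true; // fail open rather than reject all traffic
            }
        }
    }
}
```

The key property is that a Redis failure degrades rate limiting to a per-node approximation instead of taking requests down with it.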
This fallback mechanism did not work as intended due to errors in the implementation of the fallback process.
The RateLimiter, which is a singleton, is provided with singleton instances of the RedisRateLimiter and LocalRateLimiter, which it uses to enforce rate limiting. The RedisRateLimiter.checkRateOf() method is declared with the synchronized modifier. This means that only one thread can execute the checkRateOf() method at any time.
When Redis was unavailable, each execution thread in publicapi queued up, waiting in turn to make a call to Redis to check the rate limit. Because Redis was operating in a degraded mode where connections were hanging and timing out, each Redis operation waited for 2 seconds before timing out and falling back to the LocalRateLimiter. This meant that each publicapi node served one request every 2 seconds, with all other requests queued waiting to enter the synchronized method.
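The bottleneck can be reproduced in miniature: a synchronized method that blocks for the full timeout serialises all callers, so n concurrent requests take at least n × timeout to clear. A scaled-down sketch, with 50ms standing in for the 2-second timeout and all names illustrative:

```java
// Sketch of the failure mode: when checkRateOf() is synchronized and each
// call blocks for the Redis timeout, threads queue and the node serves at
// most one request per timeout period. The timeout is scaled down to 50ms.
class SerializedLimiter {
    // Stands in for a Redis call that hangs for the full timeout.
    synchronized boolean checkRateOf(String serviceId) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return true; // then falls back to the local limiter
    }

    // Runs the given number of concurrent requests and returns the total
    // wall time in ms. Because the method is synchronized, the requests
    // pass through one at a time: at least threads x 50ms in total.
    static long timeConcurrentRequests(int threads) {
        SerializedLimiter limiter = new SerializedLimiter();
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> limiter.checkRateOf("svc"));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

One way to remove this bottleneck is to avoid holding a lock across the slow Redis call at all, for example by letting each thread use its own pooled connection, so that one hanging call no longer blocks every other request on the node.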
This chart shows this effect, with a small trickle of requests being processed between 12:04 UTC and 12:18 UTC (13:04 BST-13:18 BST). There is also a spike of requests after service resumed at around 12:18 UTC (13:18 BST).
We will:
- fix the errors in the publicapi fallback mechanism so that each node correctly falls back to its local in-memory rate limiter when Redis is unavailable
- move the rate-limiting Redis instance to a multi-AZ configuration so that it can fail over automatically
~~We'll update this post-mortem to confirm when these changes have been completed [DONE].~~
Update - The two planned fixes have now been implemented. We conducted additional load testing to verify that the fallback mechanisms work as intended.
If you have any questions, please contact us.