On Friday 22 October 2021, for a 16-minute period between 13:04 and 13:20 BST, the Pay platform experienced severely degraded availability of our Public API due to the failure of a Redis node used to enforce rate limiting. During this time most payments made using both our direct API integration and payment links would have failed. Paying users would likely have seen an error screen when they tried to make a payment.
Our team responded quickly to automated alerts and began investigating the issue. Service was restored by automated processes within 16 minutes.
We take platform availability very seriously and are sorry for the impact this outage had on your users and service teams.
We have conducted a post-incident review to understand what happened and to identify what we'll do to prevent a recurrence.
The publicapi microservice is our platform's API gateway and handles all API requests. It uses a Redis database (a managed AWS ElastiCache instance) to keep track of request rates for each service and to enforce rate-limiting controls.
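This post-mortem doesn't include the rate-limiting code itself, but the general pattern is a shared, windowed counter per service. The sketch below is a minimal fixed-window limiter with illustrative names; a ConcurrentHashMap stands in for Redis's INCR/EXPIRE so the example is self-contained, whereas a real gateway would keep the counters in Redis so every publicapi node shares one view of each service's request rate.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a fixed-window rate limiter. A real gateway would hold the
// counters in Redis (INCR + EXPIRE) so all nodes share one view of each
// service's request rate; here an in-memory map stands in for Redis.
class FixedWindowRateLimiter {
    private final int maxRequestsPerWindow;
    private final long windowMillis;
    // serviceId -> {window number, request count in that window}
    private final Map<String, long[]> counters = new ConcurrentHashMap<>();

    FixedWindowRateLimiter(int maxRequestsPerWindow, long windowMillis) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request is allowed, false if it should get a 429.
    boolean allow(String serviceId, long nowMillis) {
        long window = nowMillis / windowMillis;
        long[] entry = counters.compute(serviceId, (k, v) ->
            (v == null || v[0] != window)
                ? new long[]{window, 1}          // new window: reset the count
                : new long[]{window, v[1] + 1}); // same window: increment
        return entry[1] <= maxRequestsPerWindow;
    }
}
```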
The following is a timeline (in BST) of the outage:
13:04 - The Redis node used for rate limiting failed and automatic reprovisioning began
13:11 - A replacement node was back online
13:19 - publicapi successfully reconnected to Redis. The delay from 13:11 to 13:19 may have been due to high CPU utilisation on the Redis node as it served a backlog of requests.
13:19-13:20 - Some requests to our API were rate limited due to a thundering herd effect (either user retries or intermediate queues). These requests would have received HTTP 429 (Too Many Requests) responses.
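One way integrating clients can avoid contributing to a thundering herd after an outage is to back off with jitter when they receive a 429, rather than retrying immediately. A minimal sketch of that client-side pattern (the class name and the sendRequest stand-in are assumptions for illustration, not part of our API):

```java
import java.util.Random;
import java.util.function.IntSupplier;

// Sketch: a client-side retry loop that backs off with jitter on HTTP 429
// instead of retrying immediately and amplifying a thundering herd.
// sendRequest stands in for a real HTTP call and returns a status code.
class BackoffRetry {
    static int callWithRetry(IntSupplier sendRequest, int maxAttempts, long baseDelayMillis) {
        Random random = new Random();
        int status = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            status = sendRequest.getAsInt();
            if (status != 429) {
                return status; // success, or an error that retrying won't help
            }
            // Exponential backoff with full jitter: sleep 0..base * 2^attempt ms.
            long sleep = (long) (random.nextDouble() * (baseDelayMillis << attempt));
            try {
                Thread.sleep(sleep);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return status;
            }
        }
        return status; // still rate limited after maxAttempts
    }
}
```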
Due to the nature of cloud services, an underlying machine instance will sometimes suffer a hardware failure, or need to be upgraded or shut down for maintenance. The Redis node we were using was configured to use a single availability zone.
AWS also offer a managed "multi-AZ" configuration, which seamlessly synchronises writes to a replica node in a second availability zone and can fail over to it automatically in the case of a failure.
We had not used this because we believed we had a sufficient fallback mechanism in place.
We have a fallback mechanism in publicapi to handle the scenario where Redis is unavailable. Instead of using Redis as a coordinator across the publicapi cluster, each publicapi node should fall back to using a local in-memory cache.
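In outline, the intended design is: try the shared Redis-backed limiter first, and on any Redis error consult a per-node local limiter instead. The sketch below shows that shape; the interface and class names are illustrative, not the real publicapi code.

```java
// Sketch of the intended fallback: prefer the shared (Redis-backed) limiter,
// and if it fails, consult a per-node in-memory limiter instead. The
// interface and names below are illustrative, not the real publicapi code.
interface Limiter {
    boolean checkRateOf(String serviceId) throws Exception;
}

class FallbackRateLimiter {
    private final Limiter redisLimiter; // shared across the cluster
    private final Limiter localLimiter; // this node's in-memory fallback

    FallbackRateLimiter(Limiter redisLimiter, Limiter localLimiter) {
        this.redisLimiter = redisLimiter;
        this.localLimiter = localLimiter;
    }

    boolean checkRateOf(String serviceId) {
        try {
            return redisLimiter.checkRateOf(serviceId);
        } catch (Exception redisDown) {
            // Redis unavailable: fall back to this node's local view.
            try {
                return localLimiter.checkRateOf(serviceId);
            } catch (Exception localFailure) {
                return true; // fail open rather than reject all traffic
            }
        }
    }
}
```

The key property is that a Redis failure degrades rate limiting to a per-node approximation instead of taking requests down with it.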
This fallback mechanism did not work as intended due to errors in the implementation of the fallback process.
The RateLimiter, which is a singleton, is provided with singleton instances of the RedisRateLimiter and LocalRateLimiter, which it uses to enforce rate limiting. The RedisRateLimiter.checkRateOf() method is declared with the synchronized modifier. This means that only one thread can execute the checkRateOf() method at any time.
When Redis was unavailable, each execution thread in publicapi queued up, waiting in turn to make a call to Redis to check the rate limit. Because Redis was operating in a degraded mode where connections were hanging and timing out, each Redis operation waited for 2 seconds before timing out and falling back to the LocalRateLimiter. This meant that each publicapi node served one request every 2 seconds, with all other requests queued waiting to enter the synchronized method.
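The bottleneck can be reproduced in miniature: a synchronized method that blocks for the full timeout serialises all callers, so n concurrent requests take at least n × timeout to clear. A scaled-down sketch, with 50ms standing in for the 2-second timeout and all names illustrative:

```java
// Sketch of the failure mode: when checkRateOf() is synchronized and each
// call blocks for the Redis timeout, threads queue and the node serves at
// most one request per timeout period. The timeout is scaled down to 50ms.
class SerializedLimiter {
    // Stands in for a Redis call that hangs for the full timeout.
    synchronized boolean checkRateOf(String serviceId) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return true; // then falls back to the local limiter
    }

    // Runs the given number of concurrent requests and returns the total
    // wall time in ms. Because the method is synchronized, the requests
    // pass through one at a time: at least threads x 50ms in total.
    static long timeConcurrentRequests(int threads) {
        SerializedLimiter limiter = new SerializedLimiter();
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> limiter.checkRateOf("svc"));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

One way to remove this bottleneck is to avoid holding a lock across the slow Redis call at all, for example by letting each thread use its own pooled connection, so that one hanging call no longer blocks every other request on the node.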
This chart shows this effect, with a small trickle of requests being processed between 12:04 UTC and 12:18 UTC (13:04 BST-13:18 BST). There is also a spike of requests after service resumed at around 12:18 UTC (13:18 BST).
We will:
- fix the errors in the publicapi fallback mechanism so that each node correctly falls back to its local in-memory rate limiter when Redis is unavailable
- move the rate-limiting Redis instance to a multi-AZ configuration so that it can fail over automatically
~~We'll update this post-mortem to confirm when these changes have been completed [DONE].~~
Update - The two planned fixes have now been implemented. We conducted additional load testing to verify that the fallback mechanisms work as intended.
If you have any questions, please contact us.