At 00:19 on Monday 20 July, payment service provider Worldpay started returning a high proportion of errors when processing payments. The service was degraded until 16:15, and for some of that period as many as 90% of Worldpay payments through GOV.UK Pay were failing.
This was the most significant payment service provider outage in GOV.UK Pay’s history.
GOV.UK Pay has a 30-minute response time to incidents that degrade performance to this extent. However, this target does not apply to problems caused by upstream services where the team has no ability to resolve the issue. Our alerts won’t wake up an engineer out of hours unless that engineer would be able to help resolve the problem.
We’ve identified some areas where our alerting could be improved to more quickly identify this rare type of outage.
We’re investigating how we can improve our status page to help service teams understand the cause of failures, including whether we can surface information about the availability of payment service providers on the GOV.UK status page automatically.
We’ve learned that different services have different understandings of the relationship between the service, GOV.UK Pay and the payment service provider. We’ll review our incident policy to see if there are useful interventions we can make to help services manage payment service provider outages, even when we can’t resolve the underlying issue.