Loyalty API Incident - Loyalty-Beta Status

Write-up

Loyalty API Incident

On Thursday 4th September, between 14:05 and 14:43 UTC, some of our customers experienced errors and timeouts when using loyalty features. These issues were caused by a regression in a recent deployment that led to DB connection pool exhaustion on the Loyalty API. Engineers identified the root cause, applied mitigations, and restored normal service.

What happened

A recent release (commit abc1234) introduced a code path that leaked database connections under load. As connection handles were exhausted, the Loyalty API returned elevated 5xx errors and timeouts for a subset of requests. This caused transient failures and delays for loyalty operations while background job queues temporarily built up.

This impacted a subset of our product features, including:

- Points accrual & redemptions: some transactions failed or were delayed.
- Balance lookups: customers sometimes received errors or stale data.
- Background loyalty processing: queued jobs increased and experienced retries.

Throughout this time, the rest of our core platform remained fully available, including:

- Alert ingestion
- Paging via SMS/email (note: mobile push for loyalty-related notifications may have been affected for some retries)
- Status pages (publishing, viewing and notifications)
- Declaring and responding to incidents
- Our web dashboard and Slack application

How we responded

We detected the issue automatically via CloudWatch → SNS → incident.io and the incident was created and routed to the on-call SRE team. The team acknowledged the incident and executed the Loyalty runbook.

Immediate mitigations included scaling the application ASG to add headroom, restarting unhealthy instances to reclaim leaked connections, and temporarily pausing non-critical background workers to reduce DB load. Error rates fell and the CloudWatch alarm returned to OK; we continued monitoring for an hour to ensure stability. All production services returned to normal operation once mitigations were in place.

We’ve been stable since and there were no delayed-impact events from the outage. If you experienced failed loyalty actions during this window, please retry them now. For ongoing problems, contact support with timestamps or request IDs and we’ll investigate. A full post-incident report (timeline, RCA, and action items) is available via the incident link above.
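
For readers curious about the failure mode, the sketch below shows in general terms how a leaked connection on an error path can exhaust a fixed-size pool, and how scoping the connection to a context manager avoids it. It is not the code from the affected release: the toy pool, the handler names (`lookup_balance_leaky`, `lookup_balance_safe`), and the pool size are all hypothetical and chosen only for illustration.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class ConnectionPool:
    """Toy fixed-size pool; integer handles stand in for DB connections."""

    def __init__(self, size):
        self._free = queue.Queue()
        for handle in range(size):
            self._free.put(handle)

    def acquire(self, timeout=0.01):
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhausted("no free connections") from None

    def release(self, handle):
        self._free.put(handle)

    @contextmanager
    def connection(self):
        # Hand out a connection and always return it, even on errors.
        handle = self.acquire()
        try:
            yield handle
        finally:
            self.release(handle)

    def available(self):
        return self._free.qsize()


def lookup_balance_leaky(pool, fail):
    """Hypothetical handler: the error path exits before release()."""
    conn = pool.acquire()
    if fail:
        raise ValueError("query failed")  # connection is never returned
    pool.release(conn)
    return "ok"


def lookup_balance_safe(pool, fail):
    """Same handler, with the connection scoped to a context manager."""
    with pool.connection():
        if fail:
            raise ValueError("query failed")
        return "ok"


def run(handler, requests=5, pool_size=3):
    """Send failing requests through a handler and record each outcome."""
    pool = ConnectionPool(pool_size)
    outcomes = []
    for _ in range(requests):
        try:
            outcomes.append(handler(pool, fail=True))
        except ValueError:
            outcomes.append("query error")
        except PoolExhausted:
            outcomes.append("pool exhausted")
    return outcomes, pool.available()
```

With a pool of three connections, `run(lookup_balance_leaky)` leaks one connection per failing request, so the fourth and fifth requests fail with pool exhaustion and no connections remain free. `run(lookup_balance_safe)` sees the same query errors but ends with all three connections back in the pool.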
