A webhook race condition that cost us six hours
The shape of a webhook bug we've seen more than once, what to ship to fix it, and the change we now make to every sprint template so it never happens again.
The bug doesn't look like a bug the first time you see it. The app is working. Checkouts complete. The customer gets their receipt. Then, for roughly one in every forty payments, the dashboard shows a subscription as "pending" and stays that way until someone clicks refresh. Everything else about the session is correct. The money moved. The row just didn't.
That's the shape. It's the shape of a webhook race condition, and in our experience it's the single most reliable way to lose half a day late in a sprint.
This is a composite post-mortem, drawn from the shape of a bug we've seen more than once across sprints. Names and specifics have been sanded off, but the pattern is real.
The setup
The flow is straightforward. A user completes a Stripe Checkout session. The browser is redirected to a success page. Meanwhile, Stripe fires a checkout.session.completed webhook to a route on your server, which creates or updates the corresponding subscription row in your database.
The success page does its own work. It reads the session ID from the URL, fetches the session from Stripe, reads the customer's email, and, critically, looks up the subscription record to confirm everything is provisioned.
Two concurrent requests. One writes, one reads. Each looks like a normal piece of code. Put together, they're a race.
What actually happens
Most of the time, the webhook arrives first, the database row is created, and the success page reads it. Everyone is happy.
Sometimes (not often enough to catch in dev, often enough to matter in production) the user's success-page request reaches the server before the webhook does. The code looks for a subscription row that doesn't exist yet. It has a reasonable fallback path: "if no row exists, show a pending state." The user sees pending. The webhook arrives two seconds later. The row gets created. But the user has already seen the wrong thing.
Worse, some early implementations do the following: if no subscription row exists, create one from the session data. Then the webhook fires, tries to create the same row, fails with a unique-constraint violation, and silently drops the event. Now you have a row with the wrong state, or with a state that'll never be reconciled, and two logs that look benign because each side "handled" the error.
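To make that failure mode concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names, session ID, and both function names are hypothetical stand-ins, not code from any real project; the point is only to show how a page-side insert plus a swallowed constraint error leaves the row stuck.

```python
import sqlite3

# Hypothetical schema: one row per Stripe Checkout session.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriptions ("
    " session_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL)"
)

def success_page_fallback(session_id: str) -> None:
    """The tempting fallback: no row yet, so the page creates one itself."""
    row = conn.execute(
        "SELECT 1 FROM subscriptions WHERE session_id = ?", (session_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO subscriptions (session_id, status) VALUES (?, 'pending')",
            (session_id,),
        )

def naive_webhook(session_id: str) -> bool:
    """Plain INSERT with the conflict swallowed. Returns True if a row was written."""
    try:
        conn.execute(
            "INSERT INTO subscriptions (session_id, status) VALUES (?, 'active')",
            (session_id,),
        )
        return True
    except sqlite3.IntegrityError:
        return False  # "handled", so nothing alarming reaches the logs

success_page_fallback("cs_test_123")          # the page wins the race
webhook_wrote = naive_webhook("cs_test_123")  # the webhook loses and gives up
final_status = conn.execute(
    "SELECT status FROM subscriptions WHERE session_id = ?", ("cs_test_123",)
).fetchone()[0]
```

After this runs, webhook_wrote is False and final_status is still "pending": neither side raised, and nothing will ever move the row forward.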
Why it's a sprint killer
Three reasons this specific bug chews through a sprint disproportionately.
It's non-deterministic. Dev never reproduces it because Stripe's webhooks in test mode fire almost instantly and the redirect is local. It only shows up under real network latency, which means production.
It masquerades as a UX bug. The first report is "the dashboard is slow to update after checkout." You waste an hour optimising the query. The second report is "it says pending when the card got charged." You waste another hour checking the webhook signature. By the third report you've got the real story, and you've lost three hours.
Fixing it wrong introduces new bugs. Adding a manual "refresh" from the success page that re-fetches the session and creates the row is a very natural next step. It's also the thing that creates the duplicate-row problem above. Every naïve fix is worse than the bug.
What we ship instead
The pattern that works reliably, and that we now include in every sprint template that touches payments, has three pieces.
1. The webhook is authoritative
The webhook, and only the webhook, writes subscription state to the database. The success page does not write. Ever. It reads. This removes the "both sides are trying to create the row" problem entirely. If the row doesn't exist on the success page, the page waits.
2. The success page waits, briefly, explicitly
The success page polls for the row with a short timeout, typically three seconds, and falls back to a clear "still processing" state if the row doesn't arrive in time. The UI is designed for this: a loading pattern that looks intentional, and a copy line that reads "we're finalising your subscription; this usually takes a second." If the row still hasn't appeared after the timeout, the page surfaces an honest "this is taking longer than expected. Check your email for a receipt; we've noted it and will reconcile."
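The waiting step can be sketched as a small read-only polling helper. This is an illustration under assumptions: wait_for_subscription, lookup, and the 0.25-second interval are hypothetical names and values, and the same loop could equally live in client-side code. Only the three-second default comes from the text above.

```python
import time
from typing import Callable, Optional

def wait_for_subscription(
    lookup: Callable[[], Optional[dict]],
    timeout_s: float = 3.0,     # "typically three seconds"
    interval_s: float = 0.25,   # assumed polling interval
) -> Optional[dict]:
    """Poll lookup() until it returns a row or the timeout elapses.

    Read-only by design: on timeout it returns None and never writes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        row = lookup()
        if row is not None:
            return row
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)

# Simulate a webhook whose row only becomes visible on the third lookup.
calls = {"n": 0}

def fake_lookup() -> Optional[dict]:
    calls["n"] += 1
    return {"status": "active"} if calls["n"] >= 3 else None

row = wait_for_subscription(fake_lookup, timeout_s=1.0, interval_s=0.01)
missing = wait_for_subscription(lambda: None, timeout_s=0.05, interval_s=0.01)
```

A None result is the cue to render the "taking longer than expected" message. The helper deliberately has no fallback that creates the row.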
3. The webhook handler is idempotent
Idempotent in two directions. First, using the event ID: if the same checkout.session.completed event is delivered twice, the handler recognises the duplicate and returns 200 without re-writing. Second, using the underlying object: the write is an upsert keyed on the Stripe session or subscription ID, not an insert. Stripe will re-deliver webhooks, so the handler has to be fine with that.
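Both directions fit in one short handler. The sketch below again uses sqlite3 with a hypothetical schema and function name. It assumes the request has already been signature-verified and parsed down to an event ID, a session ID, and a status, none of which is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE subscriptions (
        session_id TEXT PRIMARY KEY,
        status     TEXT NOT NULL
    );
    """
)

def handle_checkout_completed(event_id: str, session_id: str, status: str) -> str:
    """Return 'duplicate' or 'processed'; either way the caller answers 200."""
    with conn:  # one transaction: commits on success, rolls back on error
        # Direction 1: event ID. INSERT OR IGNORE changes no rows when the
        # ID is already recorded, so rowcount is 0 for a re-delivery.
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return "duplicate"
        # Direction 2: object ID. Upsert keyed on the session ID, so a
        # second event about the same session updates instead of failing.
        conn.execute(
            "INSERT INTO subscriptions (session_id, status) VALUES (?, ?) "
            "ON CONFLICT(session_id) DO UPDATE SET status = excluded.status",
            (session_id, status),
        )
    return "processed"

first = handle_checkout_completed("evt_1", "cs_test_123", "active")
again = handle_checkout_completed("evt_1", "cs_test_123", "active")  # re-delivery
other = handle_checkout_completed("evt_2", "cs_test_123", "active")  # new event, same session
rows = conn.execute("SELECT session_id, status FROM subscriptions").fetchall()
```

Recording the event ID and writing the row in the same transaction is a design choice in this sketch: if the upsert fails, the event ID is rolled back too, so a later re-delivery gets another chance rather than being skipped as a duplicate.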
The four lines that stop this from happening
Not literally four lines, but four design constraints the sprint template now encodes:
- Webhook handlers upsert on the Stripe object ID.
- Webhook handlers acknowledge the event ID before doing anything expensive.
- Client pages never write authoritative state derived from webhook payloads.
- Any redirect that follows a webhook-driven write has an explicit "waiting for confirmation" state with a timeout.
None of these are clever. Together they eliminate the class of bug.
The lesson the sprint keeps teaching us
The meta-lesson, once we saw the pattern repeat: any time two independent writes can touch the same row, the design has a race in it, and a sprint doesn't leave enough time to discover that race in production, so we have to prevent it in the template.
A 7-day sprint doesn't have slack for retrofitting idempotency on day six. A 7-day sprint that has idempotency on day two has slack for everything else. The boring work in the first few days is what makes the last few days unboring. That's not a sprint-specific insight (it's a software insight), but a short timeline makes the arithmetic of it concrete in a way that a longer one doesn't.