Debugging Failed Webhooks in Production Without Losing Your Weekend

A customer emails: "my Stripe webhook is broken." You check the Stripe dashboard — events are firing. You check your server logs — nothing. You check Sentry — nothing. The events are going somewhere, but not where you expect.

This post is a step-by-step framework for debugging webhook delivery issues in production, based on the failure modes I see most often.

TL;DR

Start with the provider dashboard — is the event firing, and what status is it getting back?
4xx = you're rejecting it (signature / middleware / wrong URL)
5xx = your handler crashed (check error monitoring at that exact timestamp)
Timeout = your handler is too slow (move heavy work async, return 200 fast)
The fix for "took 2 hours to debug" is one place that shows inbound + outbound + retries + replay — usually a webhook relay

The decision tree

Start here. It'll save you 30 minutes of guessing.

Q1: Is the provider actually sending the event?

Check the provider's dashboard (Stripe: Developers → Events; GitHub: Settings → Webhooks → Recent Deliveries; Shopify: Notifications → Webhooks → View Recent).

If the event isn't there, the trigger isn't firing. Debug the upstream state that's supposed to generate the event.
If it's there, proceed to Q2.

Q2: Is the provider getting a 2xx back?

In the same dashboard, look at the response status from your endpoint.

2xx: Provider thinks it succeeded. If your app never acted on it, the event was received and then dropped after the response was sent. Check your async queue, your DB write, your handler logic.
4xx: You're rejecting it. Most common: signature verification failure, missing header, or a middleware (WAF, rate limiter, auth layer) blocking the request.
5xx: Your endpoint crashed. Check your error monitoring — not just "the last hour" but "the exact timestamp of this delivery".
Timeout: Your endpoint is too slow. See Q4.
Connection refused / DNS: Your endpoint is unreachable. Different problem entirely — network/DNS/firewall.
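
If it helps to have the Q2 branches written down somewhere your on-call rotation can find them, here is a minimal sketch of the same mapping as a function. The name `triage` and the strings "timeout" and "unreachable" are my own conventions for this example, not anything a provider dashboard emits.

```python
def triage(status):
    """Map a delivery result from the provider dashboard to a first place to look.

    `status` is an HTTP status code (int), or one of the strings "timeout"
    or "unreachable" for deliveries that never got a response.
    """
    if status == "timeout":
        return "too slow: move heavy work async and return 200 fast (Q4)"
    if status == "unreachable":
        return "unreachable: check network, DNS, firewall"
    if not isinstance(status, int):
        return "unknown: inspect the raw delivery record"
    if 200 <= status < 300:
        return "accepted then dropped: check async queue, DB write, handler logic"
    if 400 <= status < 500:
        return "rejected: check signature, headers, middleware, URL (Q3)"
    if 500 <= status < 600:
        return "crashed: check error monitoring at the exact delivery timestamp"
    return "unknown: inspect the raw delivery record"
```

Redirects (3xx) deliberately fall through to "unknown" here, since the post does not cover them and providers differ in how they treat them.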

Q3: If 4xx, which one?

400 — bad request. Usually a header mismatch or your signature check threw before returning a clear error. Check what your handler returns when the body is malformed or the signature is missing.
401 / 403 — you're rejecting the request as unauthorized. Signature verification failure? Middleware checking a user session on an endpoint that shouldn't need one?
404 — the URL is wrong. Did you deploy a route change that moved the webhook endpoint?
413 — payload too large. Shopify's bulk operations can send 1MB+ bodies. Raise your body-parser limit.
429: rate limiter tripped. Whatever's rate-limiting your webhook route shouldn't be. Most webhook providers send from a small, fixed set of IPs, so an IP-keyed limiter may be treating all of their traffic as one attacker.
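
One way to make the 4xx cases easier to tell apart is to have the handler return a distinct status and reason for each rejection path, instead of letting the signature check throw. The sketch below assumes a plain HMAC-SHA256 signature over the raw body, sent in a hypothetical `X-Signature` header; real providers each have their own header names and signing schemes, so treat the details as placeholders.

```python
import hashlib
import hmac
import json


def check_request(headers, raw_body, secret):
    """Return (status, reason) so each rejection shows up distinctly in the dashboard.

    `headers` is a plain dict, `raw_body` and `secret` are bytes.
    "X-Signature" is a hypothetical header name used for illustration.
    """
    signature = headers.get("X-Signature")
    if signature is None:
        return 401, "missing signature header"

    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Compare as bytes so a non-ASCII header value cannot raise inside the check.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401, "signature mismatch"

    try:
        json.loads(raw_body)
    except ValueError:
        # Covers both invalid JSON and bodies that are not valid UTF-8.
        return 400, "body is not valid JSON"

    return 200, "ok"
```

Logging the reason string alongside the status gives you something to search for when the provider dashboard only shows "401".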

Q4: If timeout, what's slow?

Stripe gives you 30 seconds. GitHub gives you 10. Shopify gives you 5.

Most timeouts come from doing too much work in the handler. Common offenders:

Synchronous database writes in a pool that's saturated
A call out to another service (Stripe, Salesforce, Notion, your own API) inside the handler
Sending an email or a Slack message synchronously
Logging the whole body to stdout when the log ingester is applying backpressure

Fix pattern: the handler should verify the signature, enqueue the event, and return 200. All business logic happens downstream asynchronously.
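
A minimal sketch of that fix pattern, using only the standard library. The secret, the in-memory queue, and the `processed` list are stand-ins I made up for this example; in production the queue would normally be something durable (a database table, SQS, Redis), because an in-process queue loses events when the process restarts.

```python
import hashlib
import hmac
import queue
import threading

SECRET = b"whsec_example"  # hypothetical signing secret
events = queue.Queue()     # stand-in for a durable queue
processed = []             # stand-in for the results of real business logic


def handle_webhook(raw_body, signature):
    """Verify, enqueue, return. No business logic on the request path."""
    expected = hmac.new(SECRET, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401
    events.put(raw_body)
    return 200


def worker():
    """Drain the queue in the background, away from the provider's timeout."""
    while True:
        body = events.get()
        try:
            # Slow work goes here: DB writes, calls to other services, email.
            processed.append(body)
        finally:
            events.task_done()


threading.Thread(target=worker, daemon=True).start()
```

The request path now does one HMAC and one queue put, so its latency no longer depends on your database pool or on any third-party API.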

What's usually actually broken

After a few years of watching this, here's the 80/20:

Signature verification is re-serializing the body — see the signature verification guide
A proxy or WAF is eating the webhook before it reaches your handler — check Cloudflare / Vercel / your platform's request logs
Rate limiting or bot protection is tripping (Cloudflare's Bot Management can false-positive on legitimate webhooks)
The endpoint URL changed in a deploy and nobody updated the provider
An env var is missing on production (signing secret, DB URL) and the handler silently 500s
A background job is crashing after the webhook returns 200 — the provider thinks it's fine, but nothing is getting processed
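
The first item on that list is worth seeing concretely. The sketch below assumes a plain HMAC-SHA256 over the request body with a made-up secret; it is not any specific provider's scheme (several providers also sign a timestamp alongside the body), but the failure mode is the same wherever the signature covers the raw bytes.

```python
import hashlib
import hmac
import json

secret = b"whsec_example"  # hypothetical signing secret

# The bytes exactly as the provider sent them, including its own spacing.
raw_body = b'{"amount": 100,  "currency":"usd"}'
provider_signature = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def sign(payload):
    """HMAC-SHA256 of the payload bytes, hex encoded."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


# Wrong: parse, then re-serialize. The JSON is equivalent, but the
# whitespace changes, so the bytes (and therefore the HMAC) differ.
reserialized = json.dumps(json.loads(raw_body)).encode()
reserialized_ok = hmac.compare_digest(sign(reserialized), provider_signature)

# Right: compute the HMAC over the raw bytes, before any parsing.
raw_ok = hmac.compare_digest(sign(raw_body), provider_signature)
```

Many web frameworks parse JSON bodies before your handler runs, so the usual fix is to configure the webhook route to expose the raw request bytes and verify against those.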

Tools that actually help

Provider dashboards are good for one event at a time, bad for patterns. If you want "all failed events in the last 24 hours", you're scrolling manually.
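
If you can get delivery records out of the provider or your own logs, the pattern view is a few lines of code. This sketch assumes each record is a dict with an `at` timestamp and an integer `status`; that shape is hypothetical, so adapt it to whatever your export actually looks like.

```python
from collections import Counter
from datetime import timedelta


def failures_by_status(deliveries, now, window=timedelta(hours=24)):
    """Count non-2xx deliveries per status code within the window ending at `now`.

    Each delivery is assumed to look like {"at": datetime, "status": int}.
    """
    cutoff = now - window
    return Counter(
        d["status"]
        for d in deliveries
        if d["at"] >= cutoff and not 200 <= d["status"] < 300
    )
```

A result dominated by a single status code usually points at one cause (a bad deploy, a rotated secret), while a mix of codes suggests something flakier such as a saturated pool or a misbehaving proxy.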

ngrok + webhook.site are for local dev. They don't help you with production.

Your own logs help if you logged the right thing. If you logged "webhook received for event evt_xxx" before the signature check, you at least know it arrived. Most people log after, which means signature-failure events are invisible.
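
Here is a rough sketch of that logging order. The header names `X-Event-Id` and `X-Signature` are hypothetical, and the signature scheme is a plain HMAC-SHA256 over the body for illustration; the point is only that the "received" line is written before anything that can reject the request.

```python
import hashlib
import hmac
import logging

logger = logging.getLogger("webhooks")


def receive(raw_body, headers, secret):
    """Log arrival first, then verify, so rejected deliveries still leave a trace."""
    # "X-Event-Id" is a hypothetical header; some providers only put the
    # event ID in the body, in which case log the body length or a hash instead.
    event_id = headers.get("X-Event-Id", "unknown")
    logger.info("webhook received for event %s (%d bytes)", event_id, len(raw_body))

    signature = headers.get("X-Signature", "")  # hypothetical header name
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        logger.warning("webhook rejected for event %s: signature mismatch", event_id)
        return 401

    logger.info("webhook accepted for event %s", event_id)
    return 200
```

With this in place, a delivery that shows as 401 in the provider dashboard has two matching lines in your logs, "received" then "rejected", rather than nothing at all.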

A webhook relay like AnyHook sits in front of your handler and logs everything — headers, body, timing, response status, attempt count — so your debugging loop becomes "look at the event stream, click the failing one, inspect the raw payload, replay it."

The AnyHook debugging loop

When a user emails AnyHook support with "my webhook isn't working", the dashboard usually makes the cause obvious in under a minute:

Open the event stream, filter to status=failed
Click the failing event
See inbound headers (was the signature there?), inbound body (was it the shape you expected?), outbound response (what did your server say?), and timing breakdown (edge → queue → delivery)
If it's a one-off (a bad deploy, a flaky network), click Replay and watch it succeed
If it's systemic (every event is failing), now you know exactly what to fix — and your events are safely stored until you do

The thing that turns a 2-hour debugging session into a 2-minute one is having one place that shows everything: the request your provider actually sent, the response your server actually returned, and the timestamps between them. Without that, you're reconstructing it from three dashboards and your own logs.

Takeaway

If this is your Sunday afternoon more often than you'd like, AnyHook turns most of it into a dashboard click.