Exponential Backoff and Jitter for Webhook Retries

TL;DR

Naive exponential backoff (delay = base * 2^attempt) creates thundering herds — every client retries simultaneously
Add jitter (randomness) to spread retries out; equal jitter (base/2 + random(base/2)) is the sweet spot
Cap max delay, cap total retry duration, and separate terminal 4xx (don't retry) from 5xx, timeouts, and 429s (do retry)
Circuit-break at the destination level after ~20 consecutive failures — retries won't fix a broken endpoint
Cheap replay beats aggressive retry: if replaying is one click, you only need retries to handle transient failures

Most retry code I see in the wild looks like this:

for (let i = 0; i < maxRetries; i++) {
  try {
    return await deliver(event);
  } catch (err) {
    if (i === maxRetries - 1) throw err;  // last attempt: surface the failure
    await sleep(1000 * Math.pow(2, i));   // 1s, 2s, 4s, 8s...
  }
}

This is exponential backoff. It's better than constant-interval retry. It is also wrong in a way that shows up only in production — when hundreds of webhooks fail simultaneously and all retry at the same moment, slamming your recovering server back into the ground.

The fix is jitter. Here's why, and how to do it properly.

The thundering herd

Imagine your endpoint goes down at T+0. In the next 60 seconds, 500 Stripe webhooks arrive and all fail. Your retry queue now holds:

500 first retries scheduled around T+5min (or whatever your base delay is)
500 second retries scheduled around T+15min (the delay doubles to 10 minutes)
etc.

Your endpoint comes back online at T+4min. Congrats: one minute later, all 500 retries land within the same 60-second window, pushing your still-recovering server back over the edge. That's the thundering herd.

Deterministic backoff has this problem by design. Every client/event that failed at the same time retries at the same time.
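
A tiny sketch makes the collision visible. The helper names (backoffDelayMs, retryHistogram) are made up for illustration and the 1-second base is an assumption; the only point is that identical failure times plus a deterministic formula produce identical retry times.

```javascript
// Hypothetical helper: deterministic exponential backoff, no jitter.
function backoffDelayMs(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt;
}

// Returns a Map of retry timestamp (ms) -> number of events scheduled for it.
function retryHistogram(failedAtMs, attempt, delayFn) {
  const counts = new Map();
  for (const failedAt of failedAtMs) {
    const retryAt = failedAt + delayFn(attempt);
    counts.set(retryAt, (counts.get(retryAt) ?? 0) + 1);
  }
  return counts;
}

// 500 events that all failed at the same instant (t = 0).
const failures = new Array(500).fill(0);
const herd = retryHistogram(failures, 0, backoffDelayMs);
console.log(herd); // a single entry: 1000 -> 500
```

Swap in a delay function that includes a random term and the same 500 events fan out across many timestamps instead of one.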

Jitter

Jitter adds randomness to retry timing so events retry on a spread instead of a spike. Three common strategies:

Full jitter

const base = 1000 * Math.pow(2, attempt);  // 1s, 2s, 4s, 8s
const delayMs = Math.random() * base;       // 0 to base (named delayMs so it doesn't shadow sleep())

Each retry waits anywhere from 0 to the full backoff interval. Pro: maximal spread. Con: some retries fire nearly immediately, before the endpoint has had any time to recover.

Equal jitter

const base = 1000 * Math.pow(2, attempt);
const delayMs = base / 2 + Math.random() * (base / 2);  // base/2 to base

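Each retry waits between half and the full backoff. Pro: bounded minimum, still good spread. The AWS Architecture Blog's analysis of backoff strategies covers this variant alongside full and decorrelated jitter, although in its simulations full jitter came out slightly ahead.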

Decorrelated jitter

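let delayMs = baseDelay;
// on each retry: pick a random delay between baseDelay and 3x the previous delay
delayMs = Math.min(maxDelay, baseDelay + Math.random() * (delayMs * 3 - baseDelay));

Each retry's delay is drawn at random between the base delay and three times the previous delay, then capped. Good when you want the backoff to grow but without a fixed formula.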

For webhooks, equal jitter is the sweet spot. It keeps retries roughly exponential but defuses the herd.
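
To compare the three side by side, here is a minimal sketch that wraps each strategy in a function. The function names, the 1-second base, and the 60-second cap are assumptions for illustration; rng is injectable only so the ranges are easy to check. The decorrelated variant here draws between the base delay and three times the previous delay.

```javascript
// Illustrative helpers (names are not from any library).
// `rng` defaults to Math.random and returns a value in [0, 1).

// Exponential ceiling for a given attempt: base, 2*base, 4*base, ...
const ceilingMs = (attempt, baseMs = 1000) => baseMs * 2 ** attempt;

// Full jitter: anywhere from 0 up to the ceiling.
function fullJitter(attempt, baseMs = 1000, rng = Math.random) {
  return rng() * ceilingMs(attempt, baseMs);
}

// Equal jitter: half the ceiling is fixed, the other half is random.
function equalJitter(attempt, baseMs = 1000, rng = Math.random) {
  const half = ceilingMs(attempt, baseMs) / 2;
  return half + rng() * half;
}

// Decorrelated jitter: random between the base delay and 3x the previous delay, capped.
function decorrelatedJitter(prevMs, baseMs = 1000, capMs = 60_000, rng = Math.random) {
  return Math.min(capMs, baseMs + rng() * (prevMs * 3 - baseMs));
}

// Show the window each attempt can land in for the two stateless strategies.
for (let attempt = 0; attempt < 4; attempt++) {
  const c = ceilingMs(attempt);
  console.log(`attempt ${attempt}: full 0..${c}ms, equal ${c / 2}..${c}ms`);
}
```

The equal-jitter window never drops below half the ceiling, which is the "bounded minimum" that makes it a comfortable default for webhook delivery.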

What production-grade retry schedules look like

Past the jitter, here's what experienced teams do:

1. Cap the maximum delay. Don't let attempt 10 wait 17 minutes — set a ceiling (e.g. 60s). Past the cap, just stay there until you give up.

2. Cap total retry duration, not just attempt count. "Retry for 24 hours" is a clearer SLA than "retry 8 times" because how long 8 attempts actually take depends on the delay math.

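3. Separate transient from terminal errors. Most 4xx responses from the destination (especially 401, 403, 404) won't succeed on retry: the server is saying "this request is wrong." Don't waste retries on them. Retry on 5xx, timeouts, and connection errors, plus the few 4xx codes that signal a temporary condition, such as 408 (Request Timeout) and 429 (Too Many Requests).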

4. Surface the retry state. Your users need to know "this event is in attempt 3/5, next retry at X". If it's invisible, debugging is painful.

5. Circuit break at the destination level. If a destination has failed 20 times in a row, stop retrying that destination for a while. Something is wrong that retries won't fix — and you're just burning compute.
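
The five points above can be combined into one delivery loop. What follows is a sketch under stated assumptions, not a reference implementation: every name and threshold is hypothetical, deliver is assumed to reject with an Error that carries an optional numeric status, and the breaker's 5-minute pause is an arbitrary choice.

```javascript
// All names and thresholds below are illustrative assumptions.
const MAX_DELAY_MS = 60_000;                // item 1: ceiling for a single wait
const MAX_TOTAL_MS = 24 * 60 * 60 * 1000;   // item 2: total retry budget (24 h)
const RETRYABLE_4XX = new Set([408, 429]);  // temporary conditions despite being 4xx

// Item 3: a missing status means a timeout or connection error, which is transient.
function isRetryable(status) {
  if (status === undefined) return true;
  return status >= 500 || RETRYABLE_4XX.has(status);
}

// Equal jitter over an exponential ceiling that stops growing at MAX_DELAY_MS.
function nextDelayMs(attempt, baseMs = 1000, rng = Math.random) {
  const ceiling = Math.min(MAX_DELAY_MS, baseMs * 2 ** attempt);
  return ceiling / 2 + rng() * (ceiling / 2);
}

// Item 5: pause a destination after `threshold` consecutive failures.
class DestinationBreaker {
  constructor(threshold = 20, pauseMs = 5 * 60_000) {
    this.threshold = threshold;
    this.pauseMs = pauseMs;
    this.failures = 0;
    this.pausedUntil = 0;
  }
  canDeliver(nowMs = Date.now()) {
    return nowMs >= this.pausedUntil;
  }
  recordSuccess() {
    this.failures = 0;
  }
  recordFailure(nowMs = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.pausedUntil = nowMs + this.pauseMs;
      this.failures = 0; // start counting afresh once the pause ends
    }
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `onRetry` (item 4) lets callers surface "attempt N, next retry at X".
async function deliverWithRetry(event, deliver, breaker, onRetry = () => {}) {
  const deadline = Date.now() + MAX_TOTAL_MS;
  for (let attempt = 0; ; attempt++) {
    if (!breaker.canDeliver()) throw new Error("destination paused by circuit breaker");
    try {
      const result = await deliver(event);
      breaker.recordSuccess();
      return result;
    } catch (err) {
      breaker.recordFailure();
      if (!isRetryable(err.status)) throw err;       // terminal: retrying won't help
      const delay = nextDelayMs(attempt);
      if (Date.now() + delay > deadline) throw err;  // retry budget exhausted
      onRetry({ attempt: attempt + 1, nextRetryAt: Date.now() + delay });
      await sleep(delay);
    }
  }
}
```

One breaker instance per destination is the intended usage, so a single broken endpoint pauses its own deliveries without affecting healthy ones.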

How AnyHook handles this

AnyHook's retry policy is tuned per plan and uses QStash's managed retry infrastructure underneath:

Retry counts: 3 (Free), 5 (Pro), 10 (Scale)
Exponential backoff with jitter applied by the queue
Timeout ceiling: 60s (Free), 120s (Pro), 5 min (Scale)
Circuit breaker: after 20 consecutive failures, we pause delivery to that destination and email you. No retry storms.
4xx detection: we alert you on 4xx because retries won't help — you need to fix the request shape

Crucially, exhausted retries don't lose the event. It's persisted in your log with status=failed, and you can replay it manually from the dashboard or the API at any time within your plan's retention window (3 days Free / 30 days Pro / 90 days Scale).

The cheat code

Here's the insight that changes how you think about retries: you don't actually need aggressive retries if you have cheap replay.

If replaying is one click, "retry 3 times over an hour" is enough. You catch most transient failures automatically, and for the rare long outage, a human replays the failed batch once the server is healthy.

Compare this to retry-only systems where you need to extend the retry window to 72 hours "just in case" — which means 72 hours of wasted deliveries to an endpoint you already know is down.

Cheap replay is strictly better than aggressive retry. It's what AnyHook is built around.
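
To make the log-first idea concrete, here is a minimal in-memory sketch. It is not AnyHook's implementation or API: EventLog, replayFailed, and the status strings are hypothetical names, and a real system would write to durable storage rather than a Map.

```javascript
// Hypothetical in-memory log; a real system would use durable storage.
class EventLog {
  constructor() {
    this.entries = new Map(); // id -> { event, status }
    this.nextId = 1;
  }
  // Persist first, so the event survives even if every delivery fails.
  append(event) {
    const id = this.nextId++;
    this.entries.set(id, { event, status: "pending" });
    return id;
  }
  mark(id, status) {
    this.entries.get(id).status = status;
  }
  failedIds() {
    return [...this.entries]
      .filter(([, entry]) => entry.status === "failed")
      .map(([id]) => id);
  }
}

// Replay every failed event once; `deliver` is an assumed async function.
async function replayFailed(log, deliver) {
  for (const id of log.failedIds()) {
    try {
      await deliver(log.entries.get(id).event);
      log.mark(id, "delivered");
    } catch {
      log.mark(id, "failed"); // still recoverable on a later replay
    }
  }
}
```

Because the event is appended before any delivery attempt, a failed delivery only changes a status field; nothing is lost, and replay is a loop over the failed entries.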

Takeaway

Treat retries as a fast-path for transient failures, not a substitute for persistence + replay. The systems that stay up during bad weeks are the ones where a failed delivery is always recoverable — because the event was logged before it was ever delivered.