DLQ and recovery

The dead-letter queue (DLQ) is {chasqui:<queue>}:dlq — another Redis Stream, sibling to the main queue. Its job is to hold entries that can’t be delivered through the normal retry path so an operator can decide what to do with them.

The decision to make the DLQ a Redis Stream (not a list, not a hash, not a separate Redis instance) follows from the same reasoning as the main queue:

  • Same operational primitives. XLEN, XRANGE, XADD, XACK work the same way on the DLQ. The same chasqui inspect snapshot covers both. The same XADD MAXLEN ~ N cap bounds growth.
  • Atomic moves. Replay is a single Lua script: XDEL from the DLQ, XADD to the main stream, with the attempt counter reset. No torn state if the script aborts.
  • Cluster-correct. The DLQ uses the same {chasqui:<queue>}:<suffix> hash tag as the main stream, so atomic main↔DLQ moves stay on one slot.
  • Inspectable with the same tools. chasqui dlq peek is just XRANGE on the DLQ stream with an extra histogram pass over the reason field.

The alternative — a parallel Redis list, a separate dead-job hash, a different store entirely — would mean two operational stories to learn, two sets of tools, two places to monitor. Symmetry is the design choice.
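
Because the DLQ is a plain stream, you can poke at it with nothing beyond a Redis client. A minimal sketch with the Rust redis crate (sync client, streams feature enabled; the queue name emails is an assumption):

use redis::streams::StreamRangeReply;

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    let dlq_key = "{chasqui:emails}:dlq";

    // Same primitives as the main stream: XLEN for depth, XRANGE to peek.
    let depth: usize = redis::cmd("XLEN").arg(dlq_key).query(&mut con)?;
    println!("DLQ depth: {depth}");

    // `chasqui dlq peek` is essentially this XRANGE plus a histogram over `reason`.
    let oldest: StreamRangeReply = redis::cmd("XRANGE")
        .arg(dlq_key).arg("-").arg("+").arg("COUNT").arg(5)
        .query(&mut con)?;
    for entry in oldest.ids {
        println!("{}: fields {:?}", entry.id, entry.map.keys().collect::<Vec<_>>());
    }
    Ok(())
}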

The engine writes a reason field on every DLQ entry. Six values, three of which fire from the consumer side and three from the reader side:

Reason             | Side     | When it fires                                          | Handler ran?
retries_exhausted  | Consumer | Handler returned Err and attempt + 1 >= max_attempts   | Yes
unrecoverable      | Consumer | Handler threw UnrecoverableError                       | Yes (once)
panic              | Consumer | Handler panicked / threw uncaught                      | Yes (once)
decode_fail        | Reader   | The entry’s msgpack payload couldn’t be decoded        | No
malformed          | Reader   | The entry was missing required fields                  | No
oversize           | Reader   | The payload exceeded max_payload_bytes                 | No

Reader-side reasons (decode_fail / malformed / oversize) carry attempt: 0 because the handler never ran. Useful when triaging a backlog — a high decode_fail rate means a producer is writing in a different schema, not a handler bug.
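
As a rough triage sketch (assuming peek_dlq is async and DlqEntry exposes reason and attempt as plain fields, matching the peek output described further down; a Producer is already in scope):

use std::collections::BTreeMap;

// Tally DLQ entries by reason; attempt == 0 marks entries whose handler never ran.
let entries = producer.peek_dlq(500).await?;
let mut by_reason: BTreeMap<String, usize> = BTreeMap::new();
for e in &entries {
    *by_reason.entry(e.reason.clone()).or_default() += 1;
}
let never_ran = entries.iter().filter(|e| e.attempt == 0).count();
println!("{by_reason:?} ({never_ran} entries never reached a handler)");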

Producer::replay_dlq(limit) (Rust), Queue.replayDlq (Node), Queue.replay_dlq (Python), and chasqui dlq replay are all the same primitive: a single Lua script that, for each entry up to limit:

  1. Reads the entry from the DLQ stream.
  2. Resets attempt to 0 (so the replayed job gets a fresh retry budget).
  3. XADDs it to the main stream.
  4. XDELs it from the DLQ stream.

The script is atomic per entry. If the script aborts, no torn state — the entry is either fully replayed or fully unchanged.
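
To make the shape concrete, here is a simplified stand-in driven from Rust via redis::Script. This is not chasqui’s actual script: the attempt field name and the main stream’s key suffix are assumptions, and the real replay does more bookkeeping.

use redis::Script;

// Move up to `limit` entries from the DLQ back to the main stream, resetting `attempt`.
fn replay_sketch(con: &mut redis::Connection, limit: usize) -> redis::RedisResult<usize> {
    let script = Script::new(
        r#"
        local moved = 0
        local entries = redis.call('XRANGE', KEYS[1], '-', '+', 'COUNT', ARGV[1])
        for _, entry in ipairs(entries) do
            local id, fields = entry[1], entry[2]
            -- reset the attempt counter so the replayed job gets a fresh retry budget
            for i = 1, #fields, 2 do
                if fields[i] == 'attempt' then fields[i + 1] = '0' end
            end
            redis.call('XADD', KEYS[2], '*', unpack(fields))
            redis.call('XDEL', KEYS[1], id)
            moved = moved + 1
        end
        return moved
        "#,
    );
    script
        .key("{chasqui:emails}:dlq")     // DLQ stream
        .key("{chasqui:emails}:stream")  // main stream; the suffix here is an assumption
        .arg(limit)
        .invoke(con)
}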

Replay is the right move when:

  • You shipped a fix. The handler now handles the failure mode that was producing retries_exhausted. Replay the affected entries.
  • The downstream resource came back online. A cohort of jobs failed because Stripe / S3 / the database was down for 5 minutes. Replay them after recovery.
  • You raised the retry budget. The original attempts: 3 was too tight; you’ve changed to attempts: 10. Replay the entries that exhausted under the old budget.

Replay is the wrong move when:

  • The bug is in the producer. decode_fail / malformed / oversize entries will route back to DLQ on the first read. Fix the producer and drop the entries.
  • The handler is non-idempotent and the entry already partially completed. A retries_exhausted entry may have produced side effects on its way to DLQ (sent the email, but then crashed before acking). Replaying re-runs the handler; if the handler isn’t idempotent, you’ll send the email twice.
  • The replay would just go back to DLQ. If the underlying issue isn’t fixed, replay just thrashes. Verify your fix on a small batch (--limit 10) before mass replay.

ConsumerConfig::dlq_max_stream_len (default 100,000) caps the DLQ via XADD MAXLEN ~ N. Under a runaway error rate the DLQ may overshoot the cap temporarily (MAXLEN ~ trims approximately), but it won’t grow without bound.
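
Raising the cap is a one-line config change. A sketch, assuming ConsumerConfig implements Default and exposes plain struct fields (the real construction may go through a builder):

let config = ConsumerConfig {
    // Keep more entries around if you want a longer triage window.
    dlq_max_stream_len: 250_000,
    ..ConsumerConfig::default()
};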

If your DLQ is growing fast, that’s a signal — usually the consumer is broken (every job DLQ’ing) or the producer is broken (every entry malformed). Either way, the cap saves you from a Redis OOM while you investigate.

To see what’s in the DLQ:
chasqui dlq peek emails --limit 50

Renders:

  • A histogram by reason (so you see retries_exhausted: 12, unrecoverable: 2 at a glance).
  • The most recent entries with source_id, reason, attempt, dispatch name, and the raw payload bytes.

In code, Producer::peek_dlq(limit) returns Vec<DlqEntry> with the same fields. Use it for app-level diagnostics or scheduled health checks.
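
A scheduled check can be as small as this sketch (peek_dlq assumed async, and its error type assumed to convert into a boxed error; wire the alert into whatever you already use):

// Warn whenever the DLQ is non-empty; anything sitting there needs a human decision.
async fn dlq_health_check(producer: &Producer) -> Result<(), Box<dyn std::error::Error>> {
    let entries = producer.peek_dlq(100).await?;
    if !entries.is_empty() {
        eprintln!(
            "DLQ has {} entries (first reason: {}); run `chasqui dlq peek`",
            entries.len(),
            entries[0].reason
        );
    }
    Ok(())
}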

The replay path is per-entry atomic, but not idempotent across calls.

  • Within one replay_dlq(N) call, the entries that were in the DLQ at the snapshot moment are moved exactly once.
  • Calling replay_dlq twice — once at T=0, once at T=10 — moves whatever’s in the DLQ at T=0 and again whatever’s in the DLQ at T=10. If a replayed entry succeeded between the two calls, it’s no longer in the DLQ; the second call doesn’t see it.

The risk pattern: replay → fix is wrong → entry lands back in DLQ → replay again → repeat. Break the loop by peeking before each replay. See Replay the DLQ.
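
In code, the same guard looks roughly like this (assuming async APIs and that replay_dlq reports how many entries it moved):

// Peek first; only replay when the dominant failure is one the shipped fix addresses.
let sample = producer.peek_dlq(10).await?;
if sample.iter().all(|e| e.reason == "retries_exhausted") {
    let moved = producer.replay_dlq(10).await?;
    println!("replayed {moved} entries; watch whether they come back");
} else {
    println!("mixed reasons in the DLQ; triage before replaying");
}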

The fix-the-bug-and-requeue workflow looks like:

  1. chasqui dlq peek emails to see what’s failing and why.
  2. Decode a sample payload, reproduce the failure locally.
  3. Ship the fix.
  4. chasqui dlq replay emails --limit 10 to verify the fix on a small cohort.
  5. chasqui watch emails to check the DLQ doesn’t grow.
  6. chasqui dlq replay emails --limit 1000 to drain the rest.

For the operational guide: Replay the DLQ. For routing rules: Route to the DLQ.