DLQ and recovery
The dead-letter queue (DLQ) is `{chasqui:<queue>}:dlq` — another Redis Stream, sibling to the main queue. Its job is to hold entries that can’t be delivered through the normal retry path so an operator can decide what to do with them.
Why a stream, not a list
The decision to make the DLQ a Redis Stream (not a list, not a hash, not a separate Redis instance) follows from the same reasoning as the main queue:
- Same operational primitives. `XLEN`, `XRANGE`, `XADD`, `XACK` work the same way on the DLQ. The same `chasqui inspect` snapshot covers both. The same `XADD MAXLEN ~ N` cap bounds growth.
- Atomic moves. Replay is a single Lua script: `XACKDEL` from the DLQ, `XADD` to the main stream, with the `attempt` counter reset. No torn state if the script aborts.
- Cluster-correct. The DLQ uses the same `{chasqui:<queue>}:<suffix>` hash tag as the main stream, so atomic main↔DLQ moves stay on one slot.
- Inspectable with the same tools. `chasqui dlq peek` is just `XRANGE` on the DLQ stream with an extra histogram pass over the `reason` field.
The alternative — a parallel Redis list, a separate dead-job hash, a different store entirely — would mean two operational stories to learn, two sets of tools, two places to monitor. Symmetry is the design choice.
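To make the hash-tag point concrete, here is a minimal sketch of the key construction. The `{chasqui:<queue>}:dlq` shape comes from this page; the `:stream` suffix for the main queue is an illustrative stand-in, not a documented name.

```rust
// Sketch of the key layout. Only the `:dlq` suffix is documented above;
// `:stream` for the main queue is an assumed stand-in.
fn dlq_key(queue: &str) -> String {
    // The braces form a Redis Cluster hash tag: only `chasqui:<queue>` is
    // hashed, so both keys map to the same slot and a Lua script can move
    // entries between them atomically.
    format!("{{chasqui:{queue}}}:dlq")
}

fn main_key(queue: &str) -> String {
    format!("{{chasqui:{queue}}}:stream") // assumed suffix
}

fn main() {
    assert_eq!(dlq_key("emails"), "{chasqui:emails}:dlq");
    assert_eq!(main_key("emails"), "{chasqui:emails}:stream");
}
```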
Six reasons an entry lands in the DLQ
The engine writes a `reason` field on every DLQ entry. Six values, three of which fire from the consumer side and three from the reader side:
| Reason | Side | When it fires | Handler ran? |
|---|---|---|---|
| `retries_exhausted` | Consumer | Handler returned `Err` and `attempt + 1 >= max_attempts` | Yes |
| `unrecoverable` | Consumer | Handler threw `UnrecoverableError` | Yes (once) |
| `panic` | Consumer | Handler panicked / threw uncaught | Yes (once) |
| `decode_fail` | Reader | The entry’s msgpack payload couldn’t be decoded | No |
| `malformed` | Reader | The entry was missing required fields | No |
| `oversize` | Reader | The payload exceeded `max_payload_bytes` | No |
Reader-side reasons (`decode_fail` / `malformed` / `oversize`) carry `attempt: 0` because the handler never ran. Useful when triaging a backlog — a high `decode_fail` rate means a producer is writing in a different schema, not a handler bug.
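That split can drive triage directly. Below is a hedged sketch mapping each reason string from the table above to the side that owns the fix; the helper itself is illustrative, not part of chasqui’s API.

```rust
// Map a DLQ `reason` to the side that owns the fix. The reason strings come
// from the table above; this helper is illustrative, not a chasqui API.
fn fix_owner(reason: &str) -> &'static str {
    match reason {
        // The handler ran and failed: the fix lives in consumer code.
        "retries_exhausted" | "unrecoverable" | "panic" => "consumer",
        // The handler never ran (attempt: 0): the fix lives in the producer.
        "decode_fail" | "malformed" | "oversize" => "producer",
        _ => "unknown",
    }
}
```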
Replay
`Producer::replay_dlq(limit)` (Rust), `Queue.replayDlq` (Node), `Queue.replay_dlq` (Python), and `chasqui dlq replay` are all the same primitive: a single Lua script that, for each entry up to `limit`:
- Reads the entry from the DLQ stream.
- Resets `attempt` to 0 (so the replayed job gets a fresh retry budget).
- `XADD`s it to the main stream.
- `XDEL`s it from the DLQ stream.
The script is atomic per entry. If the script aborts, there is no torn state: the entry is either fully replayed or fully unchanged.
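For intuition, here is a sketch of such a script invoked from Rust with the `redis` crate. The three steps mirror the list above; the flat `attempt` field layout and the main stream’s `:stream` suffix are assumptions, and the sketch uses `XDEL` (per the steps above) rather than `XACKDEL`.

```rust
use redis::Script;

// Replay sketch: for each DLQ entry up to the limit, reset `attempt`,
// XADD to the main stream, XDEL from the DLQ. The field layout and the
// main stream's key suffix are assumptions, not chasqui's exact schema.
const REPLAY_LUA: &str = r#"
local moved = 0
local entries = redis.call('XRANGE', KEYS[1], '-', '+', 'COUNT', ARGV[1])
for _, entry in ipairs(entries) do
    local id, fields = entry[1], entry[2]
    -- Reset attempt to 0 so the replayed job gets a fresh retry budget.
    for i = 1, #fields, 2 do
        if fields[i] == 'attempt' then fields[i + 1] = '0' end
    end
    redis.call('XADD', KEYS[2], '*', unpack(fields)) -- back onto the main stream
    redis.call('XDEL', KEYS[1], id)                  -- gone from the DLQ
    moved = moved + 1
end
return moved
"#;

fn replay_dlq(conn: &mut redis::Connection, queue: &str, limit: usize) -> redis::RedisResult<usize> {
    Script::new(REPLAY_LUA)
        .key(format!("{{chasqui:{queue}}}:dlq"))    // documented DLQ key
        .key(format!("{{chasqui:{queue}}}:stream")) // assumed main-stream key
        .arg(limit)
        .invoke(conn)
}
```

Because Redis executes the whole script serially, no other client can ever observe an entry in both streams or in neither.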
When to replay
- You shipped a fix. The handler now handles the failure mode that was producing `retries_exhausted`. Replay the affected entries.
- The downstream resource came back online. A cohort of jobs failed because Stripe / S3 / the database was down for 5 minutes. Replay them after recovery.
- You raised the retry budget. The original `attempts: 3` was too tight; you’ve changed to `attempts: 10`. Replay the entries that exhausted under the old budget.
When NOT to replay
- The bug is in the producer. `decode_fail` / `malformed` / `oversize` entries will route back to the DLQ on the first read. Fix the producer and drop the entries.
- The handler is non-idempotent and the entry already partially completed. A `retries_exhausted` entry may have produced side effects on its way to the DLQ (sent the email, but then crashed before acking). Replaying re-runs the handler; if the handler isn’t idempotent, you’ll send the email twice. A dedupe-key sketch follows this list.
- The replay would just go back to the DLQ. If the underlying issue isn’t fixed, replay just thrashes. Verify your fix on a small batch (`--limit 10`) before mass replay.
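When you do have to replay past a side-effecting handler, the standard mitigation is a dedupe key on the job’s identity. A minimal sketch, assuming `source_id` (shown in the peek output below) uniquely identifies the job; the key scheme, TTL, and `send_email` stub are illustrative.

```rust
struct Email { to: String, body: String }

// Stand-in for the real provider call; illustrative only.
fn send_email(_e: &Email) -> redis::RedisResult<()> { Ok(()) }

// Guard the side effect with SET NX: the email goes out at most once per
// source_id, so replaying the entry is safe. Key scheme and TTL are assumed.
fn handle(conn: &mut redis::Connection, source_id: &str, email: &Email) -> redis::RedisResult<()> {
    let first_time: bool = redis::cmd("SET")
        .arg(format!("emails:sent:{source_id}"))
        .arg(1)
        .arg("NX")             // only set if the key doesn't exist yet
        .arg("EX").arg(86_400) // keep the dedupe marker for a day
        .query(conn)?;
    if first_time {
        send_email(email)?;
    }
    Ok(())
}
```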
Bounded growth
`ConsumerConfig::dlq_max_stream_len` (default 100,000) caps the DLQ via `XADD MAXLEN ~ N`. A runaway error rate may overshoot the cap temporarily but won’t grow without bound.
If your DLQ is growing fast, that’s a signal — usually the consumer is broken (every job DLQ’ing) or the producer is broken (every entry malformed). Either way, the cap saves you from a Redis OOM while you investigate.
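Adjusting the cap is a one-line config change. A sketch, assuming a struct-literal config with a `Default` impl; only the `dlq_max_stream_len` field and its default come from this page.

```rust
// Only `dlq_max_stream_len` and its 100_000 default are documented; the
// struct-literal shape and the Default impl are assumptions.
let config = ConsumerConfig {
    dlq_max_stream_len: 10_000, // tighter cap for a memory-constrained Redis
    ..ConsumerConfig::default()
};
```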
Inspecting
`chasqui dlq peek emails --limit 50` renders:
- A histogram by `reason` (so you see `retries_exhausted: 12, unrecoverable: 2` at a glance).
- The most recent entries with `source_id`, `reason`, `attempt`, the `dispatch` name, and the raw payload bytes.
In code, `Producer::peek_dlq(limit)` returns `Vec<DlqEntry>` with the same fields. Use it for app-level diagnostics or scheduled health checks.
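A scheduled check might recompute the reason histogram that `chasqui dlq peek` prints. `peek_dlq` and the `reason` field come from this page; the sample size, the alert threshold, treating `reason` as a string, and the call returning a `Result` are assumptions.

```rust
use std::collections::HashMap;

// Recompute the peek histogram in-process. `peek_dlq` and `DlqEntry.reason`
// are documented; the 500-entry sample and the alert threshold are assumed.
fn dlq_health_check(producer: &Producer) -> Result<(), Box<dyn std::error::Error>> {
    let entries = producer.peek_dlq(500)?;
    let mut by_reason: HashMap<&str, usize> = HashMap::new();
    for entry in &entries {
        *by_reason.entry(entry.reason.as_str()).or_insert(0) += 1;
    }
    // A decode_fail spike points at producer schema drift, not a handler bug.
    if by_reason.get("decode_fail").copied().unwrap_or(0) > 100 {
        eprintln!("DLQ alert: likely producer schema drift: {by_reason:?}");
    }
    Ok(())
}
```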
Replay idempotency
The replay path is per-entry atomic, but not idempotent across calls.
- Within one `replay_dlq(N)` call, the entries that were in the DLQ at the snapshot moment are moved exactly once.
- Calling `replay_dlq` twice, once at T=0 and once at T=10, moves whatever’s in the DLQ at T=0 and again whatever’s in the DLQ at T=10. If a replayed entry succeeded between the two calls, it’s no longer in the DLQ; the second call doesn’t see it.
The risk pattern: replay → fix is wrong → entry lands back in DLQ → replay again → repeat. Break the loop by peeking before each replay. See Replay the DLQ.
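In code, the loop-breaker is a gate: peek a sample first and replay only if it shows the failure mode your fix addresses. Method names come from this page; the gating predicate and the calls returning `Result` are assumptions.

```rust
// Peek before replaying: proceed only when the sample matches the failure
// mode the shipped fix addresses. The predicate is illustrative.
fn guarded_replay(producer: &Producer, limit: usize) -> Result<usize, Box<dyn std::error::Error>> {
    let sample = producer.peek_dlq(10)?;
    let fix_applies = sample.iter().all(|e| e.reason == "retries_exhausted");
    if !fix_applies {
        return Ok(0); // something else is in the DLQ; investigate first
    }
    Ok(producer.replay_dlq(limit)?)
}
```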
Operational pattern
The fix-the-bug-and-requeue workflow looks like:
- `chasqui dlq peek emails` to see what’s failing and why.
- Decode a sample payload, reproduce the failure locally.
- Ship the fix.
- `chasqui dlq replay emails --limit 10` to verify the fix on a small cohort.
- `chasqui watch emails` to check the DLQ doesn’t grow.
- `chasqui dlq replay emails --limit 1000` to drain the rest.
For the operational guide: Replay the DLQ. For routing rules: Route to the DLQ.