# Replay the DLQ
You’ve shipped a fix for whatever was failing. Now move the dead-lettered jobs back into the main stream so the worker can take another swing.
Two paths: the CLI and the shim helper. Pick the one that matches your blast radius.
## CLI: one-shot, scriptable

```sh
chasqui dlq replay emails --limit 50
```

Replays up to 50 entries (oldest first). The command prints how many were actually moved.
Behavior:

- Each entry is removed from `{chasqui:emails}:dlq` and re-XADD'd to `{chasqui:emails}` in one Lua script, so an aborted script leaves no torn state.
- The `attempt` counter is reset to zero before re-encode, so the replayed job gets a fresh retry budget. Otherwise a job that hit `RetriesExhausted` would land in the DLQ again on first dispatch.
- The dispatch `name` is preserved.
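The move-and-reset behavior can be modeled in a few lines. This is a pure-Python sketch of the semantics only; the real work happens in a single Lua script inside Redis, and the dict shape here is illustrative, not chasqui's wire format.

```python
# Illustrative model of DLQ replay semantics (NOT chasqui's actual Lua script).
# The DLQ and the main stream are plain lists here; in Redis they live under
# the same hash tag so one script can touch both atomically.

def replay_dlq(dlq: list[dict], stream: list[dict], limit: int) -> int:
    """Move up to `limit` entries from the DLQ to the stream, oldest first."""
    moved = 0
    while dlq and moved < limit:
        entry = dlq.pop(0)       # oldest first
        entry["attempt"] = 0     # fresh retry budget on replay
        stream.append(entry)     # re-enqueued under the same dispatch name
        moved += 1
    return moved

dlq = [{"name": "send_email", "attempt": 5, "payload": b"..."}]
stream: list[dict] = []
moved = replay_dlq(dlq, stream, limit=50)
# moved == 1; the replayed entry carries attempt == 0
```

Note the reset happens before the entry reaches the stream: a consumer never observes a replayed job with its old attempt count.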
## Shim helper: replay from your application

```ts
const moved = await queue.replayDlq(100);
console.log(`replayed ${moved} entries`);
```

```py
moved = await queue.replay_dlq(limit=100)
print(f"replayed {moved} entries")
```

Same script, same atomicity. Use this when replay is part of your application's runbook (a feature flag flip, a scheduled cleanup, a one-time migration).
## CLI versus shim: which to use

- CLI when an operator is at a terminal and needs to act now. Quick, no app deploy.
- Shim when replay is a code path: a feature-flagged self-heal, a scheduled cron that trims old failures, or a migration script.
## Idempotency caveats

Replay is not an idempotent operation: calling `replay_dlq(100)` twice can execute the same job twice if a replayed entry fails back into the DLQ before the second call drains it.
Two contracts:

- Within one call, replay is atomic. A single `replay_dlq(N)` invocation moves exactly the entries that were in the DLQ at the snapshot moment. Concurrent replays serialize at Redis under the queue's hash tag.
- Across calls, replay is at-least-once. If you replay, an old entry runs successfully, and you replay again before noticing, that entry's source ID has already been consumed and the replayed copy has been acked. There is no re-replay of an acked entry.
The risk is that you replay, your fix is wrong, and the job lands in the DLQ again; `replay_dlq` will then move it right back. To break that loop:

- Peek the DLQ first (`peek_dlq`).
- Assert that the new failure mode is not the one you just replayed.
- Replay only a small batch at a time during fix verification.
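From the shim, that checklist might look like the sketch below. `cautious_replay` is a hypothetical helper, and it assumes `peek_dlq` returns dict-like entries carrying a `reason` field; check your shim version's actual return shape before copying this.

```python
import asyncio

async def cautious_replay(queue, expected_reason: str, batch: int = 10) -> int:
    """Replay a small DLQ batch only if every entry failed the expected way.

    Hypothetical helper: assumes peek_dlq(limit=...) yields entries with a
    "reason" key; adjust to the shim's real return shape.
    """
    entries = await queue.peek_dlq(limit=batch)
    if any(e["reason"] != expected_reason for e in entries):
        # A new failure mode appeared: stop and investigate before replaying.
        raise RuntimeError("unexpected failure mode in DLQ; not replaying")
    return await queue.replay_dlq(limit=batch)
```

Keeping `batch` small means a wrong fix pollutes the DLQ with at most a handful of re-failed entries, not the whole backlog.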
## When NOT to replay

- The bug is in the producer. If the entry is malformed (`DlqReason::Malformed`) or fails to decode (`DlqReason::DecodeFailed`), replay won't help: the consumer will route it back to the DLQ on the first read. Fix the producer and drop the entries instead.
- The entry is structurally invalid. Same as above for `DlqReason::Oversize`: the entry is bigger than `max_payload_bytes`, so replay just sends it back to the DLQ.
- The handler is non-idempotent and the entry already partially completed. A `RetriesExhausted` entry may have produced side effects on its way to the DLQ (e.g., sent the email but then crashed). Replaying re-runs the handler, so make sure your handler is idempotent before reaching for replay.
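If a handler must be replay-safe, the usual fix is to key the side effect on a stable job identifier. A minimal sketch, with the in-memory set standing in for something durable (a Redis `SET NX`, a unique database constraint):

```python
# Sketch of an idempotent handler: dedupe the side effect on a stable job key.
# `sent` stands in for a durable store in a real deployment.

sent: set[str] = set()

def send_email(job_id: str, to: str, body: str) -> bool:
    """Return True if the email was actually sent, False if deduped."""
    if job_id in sent:       # side effect already happened on a prior attempt
        return False
    # ... actually send the email here ...
    sent.add(job_id)         # recorded after the send succeeds; a crash in
    return True              # between can still duplicate (at-least-once)

first = send_email("job-42", "a@example.com", "hi")     # sends
replayed = send_email("job-42", "a@example.com", "hi")  # no-op on replay
```

This does not make delivery exactly-once (a crash between sending and recording still duplicates), but it makes DLQ replay safe in the common case.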
## Drop instead

The CLI does not currently have a "drop without replay" command; DLQ entries expire when you trim the stream. To clear the DLQ entirely, use `redis-cli DEL '{chasqui:emails}:dlq'` (operator-only, last resort).
For the underlying mechanics: DLQ and recovery.