Diego Guerrero

Posted on Jun 24

Saga orchestrator in Python

#architecture #distributedsystems #microservices #python

Building a Saga orchestrator in Python: why existing tools weren't enough and what I learned designing one from scratch

Distributed workflows break in a specific, painful way. You charge a payment,
reserve inventory, then try to create a shipping label — and the shipping API
times out. The payment went through. The inventory is reserved. But the order
never completed.

Now what?

The naive fix is nested try/except blocks with manual cleanup calls. Every
developer has written this code. It works until the cleanup call fails, or the
process crashes between steps, or two workers process the same message
simultaneously. Then you have inconsistent state, angry customers, and no
clear path to recovery.

# The naive approach everyone writes first
async def process_order(order):
    try:
        payment = await charge_payment(order)
        try:
            reservation = await reserve_inventory(order)
            try:
                shipping = await ship_order(order)
            except Exception:
                await release_inventory(reservation)  # what if THIS fails?
                raise
        except Exception:
            await refund_payment(payment)  # what if THIS fails?
            raise
    except Exception:
        # now what?
        raise

sagakit solves this with the Saga pattern: each step in a workflow declares an
explicit compensation handler. If a later step fails, sagakit automatically
runs the compensations in reverse order — releasing inventory, refunding the
payment — with retries, idempotency guarantees, and structured logging built
in. One docker run redis is all the infrastructure you need.

Why not existing tools?

Before building sagakit, I looked at the existing options.

Temporal is the gold standard for durable workflow orchestration. It
handles long-running workflows, versioning, and failure recovery with
guarantees that sagakit doesn't offer. But it requires a dedicated cluster, a
separate worker process, and significant operational investment. For a Python
backend engineer who needs to coordinate 3-5 steps reliably, it's a freight
train for a commute.

Celery is designed for task queues — background jobs, periodic tasks,
fan-out processing. It doesn't have a native model for compensating
transactions. You can approximate it, but you end up building the compensation
logic yourself anyway, without the primitives to do it safely.

Manual try/except is what everyone writes first (see the code above). It
fails silently when the cleanup call itself fails, offers no protection against
duplicate processing when a message is redelivered, and becomes unmaintainable
past three steps.

I wanted something that required only Redis — which most Python backends
already run — made compensation logic explicit and co-located with the step it
undoes, and felt like idiomatic async Python. That's sagakit.

Three design decisions worth explaining

1. Orchestration over choreography

Sagas can be implemented two ways. In choreography, each service reacts
autonomously to events — payment-service publishes payment.charged,
inventory-service listens and reserves, and so on. No central coordinator
exists.

sagakit uses orchestration: a central executor drives each step explicitly, in
order. The entire workflow is defined in one place and readable top to bottom.

The tradeoff is real — orchestration introduces a coordinator that choreography
avoids. But for a library targeting clarity and testability, having the
workflow in one file is worth it. You can read a saga definition and
immediately understand what it does, what it compensates, and in what order.
With choreography, that understanding is reconstructed from events scattered
across services.

2. Idempotency keys don't include the attempt number

The original design used saga_id:step_name:attempt_number as the idempotency
key. It seemed logical — each attempt is a distinct event, so each gets a
distinct key.

The problem: if charge_payment fails on attempt 1 and retries as attempt 2,
the key changes. When the step passes ctx.idempotency_key to Stripe, Stripe
sees two different keys and treats them as two separate charges. The customer
gets billed twice.

The fix was simple but required changing the design mid-project: drop
attempt_number from the key. All retries of the same step share
saga_id:step_name. Stripe — and any other external system that accepts an
idempotency key — sees the same identifier across all attempts and returns the
same result without re-executing the side effect.

attempt_number still exists in SagaContext for logging and observability.
It just doesn't affect the key.

3. reject() uses XACK + XADD instead of XCLAIM

When a step fails and needs to be retried, the message must be returned to the
stream for reprocessing. The obvious Redis primitive is XCLAIM — reassign the
message to a consumer so it can retry.

The problem: XCLAIM keeps the message tied to the original consumer. If that
consumer is down, the message sits in the Pending Entries List indefinitely.
No other worker can claim it. The saga is silently stuck.

sagakit uses XACK + XADD instead. The original message is acknowledged
(removed from the PEL), and a new message with the same payload is published
to the stream. Any available consumer in the group can pick it up — the retry
is not tied to the worker that originally failed.

The re-published message carries a requeue_count attribute incremented on
each rejection, so the executor can detect pathological retry loops and route
to the DLQ after a threshold.

Seeing it in action

The happy path — all three steps complete successfully:

[info] payment.charged     amount=99.99 payment_id=pay_718836cb
[info] inventory.reserved  reservation_id=res_718836cb
[info] order.shipped       tracking_id=trk_718836cb

Status  : completed
Results :
  charge_payment    → pay_718836cb
  reserve_inventory → res_718836cb
  ship_order        → trk_718836cb

The failure path — ship_order fails, sagakit retries with exponential
backoff, exhausts attempts, then compensates in reverse order:

[info]    payment.charged     amount=99.99 payment_id=pay_6856bd2b
[info]    inventory.reserved  reservation_id=res_6856bd2b
[warning] step.retrying       attempt=1 delay=0.052s error='Shipping API down'
[warning] step.retrying       attempt=2 delay=0.170s error='Shipping API down'
[error]   step.exhausted_retries attempts=3 error='Shipping API down'
[info]    inventory.released  ← compensation
[info]    payment.refunded    ← compensation

Status    : compensated
Failed at : ship_order
Rolled back: reserve_inventory, charge_payment

Two things worth noticing in the failure output. First, the backoff is visible
in the logs — 52ms, then 170ms, with ±50% jitter applied. Second,
compensations run in strict reverse order: inventory before payment, because
that's the reverse of how they were acquired. sagakit guarantees this order
regardless of which step fails.

To reproduce: FAIL_AT_STEP=ship_order python run.py

Three things I learned building this

Compensation is not rollback

Coming from a relational database background, my first mental model of
compensation was "undo" — run the saga backwards and erase what happened.
That's not what compensation is.

A database rollback destroys state retroactively. It's as if the transaction
never happened — no trace, no intermediate state, no customer-visible effect.

Compensation is new forward-moving business logic that repairs the damage.
Refunding a payment is not the same as never charging it. The customer sees
two transactions on their bank statement. The inventory system received a
reservation and then a cancellation — it may have allocated physical space in
the interim. Other systems may have reacted to the original action before the
compensation ran.

This distinction has real consequences for how you write compensation handlers.
They cannot assume the system is in the same state it was when the forward step
ran. They must be written defensively, assuming that time has passed and other
things have happened.

Design decisions reveal their flaws during implementation

The idempotency key started as saga_id:step_name:attempt_number. It seemed
correct on paper — each attempt is a distinct event, so each gets a distinct
identifier.

The flaw only became visible when I thought through the implementation
concretely: if the key changes between retries, external systems like Stripe
see different keys and treat each attempt as a new transaction. A payment gets
charged twice.

The fix was one line of code. But it required updating the ADR, the
implementation, and the tests — and more importantly, it required catching the
assumption before it shipped.

This is why I wrote Architecture Decision Records before writing code. The ADR
for idempotency forced me to think through the key construction explicitly,
which is when the flaw surfaced. Without that document, the bug would have
lived in the code until a real payment was doubled.

Writing the decision before writing the code

I had never written an Architecture Decision Record before this project. My
previous approach was the common one: make a decision, write the code, maybe
add a comment explaining why.

The difference with ADRs is that you document not just what you decided, but
what you considered and rejected — and why. That forces a level of rigor that
commenting doesn't. You can't write "rejected because operationally heavy"
without first asking yourself: heavy compared to what? Heavy for whom?

Four ADRs later, the project has a paper trail of every major architectural
choice: why Sagas over 2PC, why Redis Streams over Kafka, how idempotency
works, what compensation guarantees are provided and which aren't. A new
contributor — or a future version of me — can read those documents and
understand not just what the system does, but why it exists in this shape.

I won't build a non-trivial system without them again.

Try it yourself

sagakit is pre-alpha — the API may change, and it hasn't been benchmarked in
production. But it works, it's tested, and the order-processing example runs
in under five minutes with a single docker compose up -d.

If you're building event-driven workflows in Python and the naive try/except
approach is starting to hurt, give it a try and let me know what breaks.

→ github.com/diegogue88/sagakit

What patterns do you use for distributed transactions in Python? I'd love to
hear what's working — and what isn't — in the comments.