
How to Handle Partial Failures in Payment Systems (Saga Pattern + Temporal)

Partial Failures in Payment Systems Are More Common Than You Think

Partial failures in payment pipelines are one of the most dangerous reliability issues in distributed systems. A payment charge succeeds, but the ledger update fails. A refund is triggered, but notifications never send. These inconsistencies lead to phantom transactions, reconciliation headaches, and costly production incidents.

In this guide, we’ll show how to safely handle partial failures in payment systems using the saga pattern and Temporal workflows — including a full production-ready Go implementation.

TL;DR

When your charge succeeds but your ledger update fails, you don’t have a bug — you have a phantom transaction and an angry merchant. This post walks through the exact saga and compensation patterns we use to make partial failures safe in production payment pipelines, including a full Go implementation and a pre-launch checklist.

Partial Failures in Payment Pipelines: The Hidden Reliability Risk

Let me paint a scene you’ve probably lived. It’s 2:47 AM. Your on-call engineer gets paged: a merchant sees a successful payment in your UI, but the funds never hit their account. The call to the payment gateway succeeded; the ledger update, the second step in the pipeline, timed out.

You now have a phantom transaction. The customer was charged. The merchant never got paid. Your support queue will be full by morning.

This is the canonical partial failure scenario in payment pipelines, and it is shockingly common. Here’s the architecture pattern that causes it, found in virtually every legacy payment system built before 2019:

// The naive (and dangerous) pattern
async function processPayment(order) {
  const charge = await stripe.charges.create({ … });    // Step 1
  await db.ledger.insert({ chargeId: charge.id, … });   // Step 2 ← can fail
  await notificationService.send({ … });                // Step 3 ← can fail
  await analyticsService.record({ … });                 // Step 4 ← can fail
}

// If step 2 crashes — you have charged the customer but recorded nothing.
// If step 3 crashes — the customer never receives a receipt.
// Retrying the whole function double-charges the customer.


The dangerous part isn’t any single failure. It’s the combination of two properties:

  • Non-idempotent operations: charging a card twice charges a customer twice.
  • No compensating logic: there is no code path to undo a successful charge if a downstream step fails.

Why Idempotency, Queues, and 2PC Don’t Fix Payment Failures

Before we get to Temporal, let’s be honest about why the standard workarounds do not fully solve this.

1. Idempotency Keys Alone Are Not Enough

Yes, Stripe and most modern gateways support idempotency keys. This prevents double charges on the gateway side. But your ledger, your notification service, and your analytics system probably do not have idempotency semantics built in. And even if they do, you still need the orchestration layer to know which steps completed, which failed, and what to do about it.
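To see what “idempotency semantics built in” means for an internal service like the ledger, here is a minimal in-memory sketch of the contract: at most one entry per payment ID, so a retried insert is a no-op. In production this would be a database unique constraint on the payment ID rather than a map; the `Ledger` type and its methods here are illustrative, not part of any library.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger enforces at-most-one entry per payment ID — the same contract
// a DB unique constraint on payment_id would give you.
type Ledger struct {
	mu      sync.Mutex
	entries map[string]int64 // paymentID -> amount in cents
}

func NewLedger() *Ledger {
	return &Ledger{entries: make(map[string]int64)}
}

// Insert is idempotent: a duplicate insert for the same payment ID and
// amount returns success without writing a second entry. A duplicate with
// a DIFFERENT amount is a real conflict and fails loudly.
func (l *Ledger) Insert(paymentID string, amount int64) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if existing, ok := l.entries[paymentID]; ok {
		if existing != amount {
			return fmt.Errorf("conflicting ledger entry for %s", paymentID)
		}
		return nil // retried insert: safe no-op
	}
	l.entries[paymentID] = amount
	return nil
}

// Count returns the number of distinct ledger entries.
func (l *Ledger) Count() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.entries)
}

func main() {
	l := NewLedger()
	_ = l.Insert("pay_123", 5000)
	_ = l.Insert("pay_123", 5000) // retry after a timeout: no double entry
	fmt.Println(l.Count())        // prints 1
}
```

With this contract in place, the orchestration layer can retry the ledger step freely; the remaining problem is knowing which steps ran, which is exactly what the saga workflow below tracks.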

2. Message Queues Without State Are Just Delayed Failures

Dropping step 2 into a Kafka topic does not make it reliable. If your consumer crashes mid-processing, you get at-least-once delivery semantics — which means you need idempotent consumers everywhere, plus dead letter queues, plus monitoring on those queues, plus runbooks for manual reconciliation. You’ve traded one complexity for three.

3. Distributed Transactions (2PC) Are a Trap

Two-phase commit across your charge service, ledger, and notification system introduces tight coupling, performance bottlenecks, and the very real possibility of all three systems being locked while you wait for coordinator recovery. In a payment context, this means latency spikes directly visible to customers.

The Right Approach: Durable Orchestration for Payment Workflows

What you actually need is not ACID transactions across distributed services. You need durable orchestration — the guarantee that each step is durably tracked, and if a worker crashes mid-execution, Temporal will retry it from where it left off. Activities are at-least-once by design, which is why idempotency is not optional — it is the contract that makes retries safe. When combined with explicit compensation logic, this is the saga pattern, and Temporal is the most mature framework for implementing it in production today.

The Saga Pattern: Theory in 60 Seconds

The saga pattern, first formalized by Hector Garcia-Molina and Kenneth Salem in 1987, decomposes a long-running transaction into a sequence of local transactions, each with a corresponding compensation transaction.

For our payment pipeline:

Step  Forward Action        Compensation Action
1     chargeCard()          refundCharge()
2     insertLedgerEntry()   deleteLedgerEntry()
3     sendReceipt()         sendFailureNotification()
4     recordAnalytics()     (best-effort, no compensation needed)

The critical insight: compensation actions are not rollbacks. They are forward-moving operations that undo the effect of a previous step. Refunding a charge is a new API call — not a database rollback.

Implementing the Saga Pattern in Temporal (Production Example)

Temporal is purpose-built for this problem. Its durable execution model means your workflow function’s state — every local variable, every completed step — is persisted to Temporal’s event history. A worker crash is transparent to the workflow. It simply resumes from where it left off.

Here is a production-grade payment saga workflow in Go. I’ll walk through it piece by piece.

Step 1: Define Your Activities

Activities are the individual steps. Each one is independently retryable and independently compensatable.

// activities/payment.go

type PaymentActivities struct {
    gateway   GatewayClient
    ledger    LedgerClient
    notifier  NotificationClient
    analytics AnalyticsClient
}

// ChargeCard is idempotent when called with the same IdempotencyKey.
// The gateway deduplicates; we record the ChargeID for compensation.
func (a *PaymentActivities) ChargeCard(ctx context.Context, req ChargeRequest) (ChargeResult, error) {
    return a.gateway.Charge(ctx, gateway.ChargeParams{
        Amount:         req.Amount,
        Currency:       req.Currency,
        IdempotencyKey: req.PaymentID, // workflow ID → idempotency key
    })
}

// RefundCharge is the compensation for ChargeCard.
func (a *PaymentActivities) RefundCharge(ctx context.Context, chargeID string) error {
    return a.gateway.Refund(ctx, chargeID)
}

// InsertLedgerEntry uses the paymentID as a unique constraint.
// Duplicate calls return success (idempotent by DB constraint).
func (a *PaymentActivities) InsertLedgerEntry(ctx context.Context, entry LedgerEntry) error {
    return a.ledger.Insert(ctx, entry)
}

func (a *PaymentActivities) DeleteLedgerEntry(ctx context.Context, paymentID string) error {
    return a.ledger.DeleteByPaymentID(ctx, paymentID)
}

Step 2: Write the Saga Workflow

This is where the magic happens. The workflow orchestrates activities and manages the compensation stack.

Compensation is driven entirely by error handling — not by defers or panic recovery. Each ExecuteActivity call returns an error. If a step fails after a previous step has already succeeded, we explicitly call the compensation activity for that previous step before returning. This is intentional: Temporal’s execution model does not support using recover() inside workflow defers the way normal Go does, and using workflow.IsReplaying() to gate business logic is explicitly discouraged in the Temporal docs. Keep compensation logic simple, explicit, and error-driven.
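Concretely, the error-driven saga can be sketched like this. It assumes the `PaymentActivities` struct and request/result types from Step 1, plus two assumed activity methods (`SendReceipt`, `RecordAnalytics`) analogous to those already shown; field names on `ChargeRequest`, `ChargeResult`, and `LedgerEntry` are illustrative, not a fixed schema.

```go
// workflows/payment.go
package workflows

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// PaymentSagaWorkflow runs the four steps and compensates explicitly
// in error branches — no defers, no panic recovery.
func PaymentSagaWorkflow(ctx workflow.Context, req ChargeRequest) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	var a *PaymentActivities // registered activity struct; methods resolved by name

	// Step 1: charge the card. Nothing to compensate if this fails.
	var charge ChargeResult
	if err := workflow.ExecuteActivity(ctx, a.ChargeCard, req).Get(ctx, &charge); err != nil {
		return err
	}

	// Step 2: record the ledger entry. On failure, refund the charge.
	entry := LedgerEntry{PaymentID: req.PaymentID, ChargeID: charge.ChargeID, Amount: req.Amount}
	if err := workflow.ExecuteActivity(ctx, a.InsertLedgerEntry, entry).Get(ctx, nil); err != nil {
		// Compensation is a NEW forward-moving API call, not a rollback.
		_ = workflow.ExecuteActivity(ctx, a.RefundCharge, charge.ChargeID).Get(ctx, nil)
		return err
	}

	// Step 3: send the receipt. On failure, unwind steps 2 then 1, in reverse order.
	if err := workflow.ExecuteActivity(ctx, a.SendReceipt, req.PaymentID).Get(ctx, nil); err != nil {
		_ = workflow.ExecuteActivity(ctx, a.DeleteLedgerEntry, req.PaymentID).Get(ctx, nil)
		_ = workflow.ExecuteActivity(ctx, a.RefundCharge, charge.ChargeID).Get(ctx, nil)
		return err
	}

	// Step 4: analytics is best-effort — log the failure and move on.
	if err := workflow.ExecuteActivity(ctx, a.RecordAnalytics, req.PaymentID).Get(ctx, nil); err != nil {
		workflow.GetLogger(ctx).Warn("analytics recording failed", "error", err)
	}
	return nil
}
```

Note the asymmetry in step 4: because analytics has no compensation in our saga table, its failure does not unwind anything. Deciding which steps are transactional and which are best-effort is a product decision, not just an engineering one.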

Step 3: Handle the Determinism Constraint

Temporal replays your workflow function to reconstruct state after a worker restart. This means your workflow code must be deterministic: the same inputs must always produce the same sequence of activity calls. Common violations to avoid:

  • Never use time.Now() in workflow code. Use workflow.Now(ctx) instead.
  • Never use rand.Intn() in workflow code. Pass random values in as activity parameters.
  • Never make direct HTTP calls or DB queries in workflow code. These belong in activities.
  • Never use goroutines directly. Use workflow.Go() for concurrency.
Pro Tip: Two Separate Traps, Two Separate Fixes

The non-determinism trap: Calling time.Now(), rand.Intn(), or any non-deterministic function directly inside a workflow function will cause replay to fail — not on deployment, but on any replay, including routine worker restarts. Temporal records the sequence of commands your workflow issued in its event history. On replay, it re-executes your workflow function and checks that the same commands are issued in the same order. A different value from time.Now() breaks that check immediately. Fix: use workflow.Now(ctx) for time, pass random values in as parameters, and keep all I/O inside activities.

The versioning trap: If you deploy new workflow code that adds, removes, or reorders activity calls, existing in-flight workflows will fail to replay because their recorded history no longer matches the new code path. This is a deployment problem, not a determinism problem. Fix: use workflow.GetVersion() to branch on a named version marker so old and new workflows each follow the correct code path.

These two problems look similar in production — both surface as replay errors — but they have completely different root causes and completely different fixes. Knowing which one you’re looking at cuts your debugging time significantly.

Migrating Legacy Payment Systems to Temporal (Strangler Pattern)

You almost certainly cannot rewrite your entire payment pipeline on Temporal in a single sprint. Here is the migration strategy we use with clients.

Phase 1: Wrap, Don’t Replace (Weeks 1–2)

Introduce Temporal at the orchestration layer while leaving existing services untouched. Your existing charge service, ledger service, and notification service become Temporal activities. No internal changes to those services yet.

// Thin activity wrappers around your existing services
func (a *PaymentActivities) ChargeCard(ctx context.Context, req ChargeRequest) (ChargeResult, error) {
    // Call your EXISTING charge service via HTTP/gRPC/whatever
    return a.existingChargeServiceClient.Charge(ctx, req)
}

Phase 2: Add Idempotency to Activities (Weeks 3–6)

Now that Temporal owns orchestration, retrofit idempotency into each activity’s underlying service. For payment gateways this usually means using their native idempotency key support. For internal services it means adding a unique payment ID constraint at the database level.

Phase 3: Add Compensation Logic (Weeks 7–10)

For each activity that can partially succeed, implement its corresponding compensation action. Start with the highest-value operations: refunds for successful charges, ledger reversals for inserted entries.

Phase 4: Decompose Long-Running Activities (Ongoing)

Any activity that takes more than a few seconds is a candidate for decomposition into a child workflow or async pattern. Payment disputes, for example, might live as long as 120 days — these should never be a single blocking activity call.
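A dispute, for instance, can be spun off as a child workflow that the payment workflow starts but does not wait on. A sketch, where `DisputeWorkflow` is a hypothetical long-lived workflow and the abandon policy lets the child outlive its parent:

```go
// Sketch: start a long-lived dispute flow as a detached child workflow so
// the parent payment workflow can complete on its own schedule.
import (
	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/workflow"
)

func startDispute(ctx workflow.Context, paymentID string) error {
	childOpts := workflow.ChildWorkflowOptions{
		WorkflowID:        "dispute-" + paymentID, // business-key-derived ID
		ParentClosePolicy: enumspb.PARENT_CLOSE_POLICY_ABANDON,
	}
	ctx = workflow.WithChildOptions(ctx, childOpts)
	// Block only until the child has STARTED, not until it completes —
	// a 120-day dispute should never hold the parent open.
	future := workflow.ExecuteChildWorkflow(ctx, DisputeWorkflow, paymentID)
	return future.GetChildWorkflowExecution().Get(ctx, nil)
}
```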

Observability: What to Monitor in Production

Temporal surfaces rich observability primitives. Here are the metrics that matter for payment workflows specifically.

Workflow-Level Metrics

  • temporal_workflow_failed — segment by workflow_type. A spike here is your first production signal for payment failures.
  • temporal_workflow_endtoend_latency — for payment workflows, P99 above 5 seconds usually indicates a stuck or heavily retrying activity.
  • temporal_workflow_active — a steadily growing count of active workflows without a corresponding increase in completions indicates worker capacity or stuck workflow issues.

Activity-Level Metrics

  • temporal_activity_execution_failed — filter by activity_type=ChargeCard and correlate spikes directly with gateway incident timelines.
  • temporal_activity_schedule_to_start_latency — high values mean your task queue workers are not picking up work fast enough; this is a worker scaling signal, not an application error.

Note on metric names: These names reflect the Temporal SDK’s default Prometheus metric exports. If your team applies custom Prometheus relabeling rules or uses a third-party observability platform, your metric names may differ. Always verify against your actual emitted metrics using temporal_ as the prefix filter before building dashboards.

Business-Level Signals (Custom Search Attributes)

Temporal allows you to add custom search attributes to workflows, queryable via the Temporal UI and your metrics system:

// When starting the workflow
opts := client.StartWorkflowOptions{
    ID:        req.PaymentID,
    TaskQueue: "payments",
    SearchAttributes: map[string]interface{}{
        "MerchantID":      req.MerchantID,
        "PaymentAmount":   req.Amount,
        "PaymentCurrency": req.Currency,
    },
}

// Now you can query in Temporal UI: MerchantID = "acme" AND PaymentAmount > 10000

The Failure Modes You Need to Know Before Go-Live

From our work on Temporal payment pipelines, these are the failure modes that catch teams off guard.

1. The Compensation Failure

What happens if your compensation action — the refund — also fails? You need a compensation failure policy. Options: retry the compensation indefinitely (usually right for refunds), alert and require manual intervention (for high-value edge cases), or accept the inconsistency and trigger a reconciliation job. All three are valid; the mistake is having no policy.
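For the “retry indefinitely” policy, a sketch of dedicated options for the refund compensation: in the Temporal Go SDK, `MaximumAttempts: 0` means unlimited attempts. The intervals and the `ChargeAlreadyRefunded` error type are illustrative assumptions, not fixed names.

```go
// Sketch: run the refund compensation with unlimited retries, capping the
// backoff interval so it keeps probing during a long gateway outage.
compOpts := workflow.ActivityOptions{
	StartToCloseTimeout: 30 * time.Second,
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2.0,
		MaximumInterval:    5 * time.Minute, // cap backoff during long outages
		MaximumAttempts:    0,               // 0 = retry indefinitely
		// Hypothetical terminal error: retrying a refund that already
		// landed would be pointless, so fail fast to the manual path.
		NonRetryableErrorTypes: []string{"ChargeAlreadyRefunded"},
	},
}
compCtx := workflow.WithActivityOptions(ctx, compOpts)
err := workflow.ExecuteActivity(compCtx, a.RefundCharge, chargeID).Get(compCtx, nil)
```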

2. The Non-Determinism Bomb

You deploy new workflow code. Existing in-flight workflows are still running the old code path. If your new code changes the sequence or signature of activity calls, replay will fail. Solution: use workflow versioning via workflow.GetVersion() to handle code evolution safely.

// Safely evolving workflow code
v := workflow.GetVersion(ctx, "add-fraud-check", workflow.DefaultVersion, 1)
if v >= 1 {
    // New code path: fraud check added
    if err := workflow.ExecuteActivity(ctx, a.FraudCheck, req).Get(ctx, nil); err != nil {
        // …
    }
}
```

3. The Retry Storm

A downstream service goes down. Every in-flight payment workflow starts retrying its stuck activity. Without a schedule-to-close timeout on activities, these workflows retry forever, consuming worker resources and hammering the recovering service when it comes back up. Always set StartToCloseTimeout and ScheduleToCloseTimeout on every activity, and use exponential backoff.
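The two timeouts play different roles, which a sketch makes concrete: `StartToCloseTimeout` bounds a single attempt, while `ScheduleToCloseTimeout` bounds the total retry budget across all attempts. Values here are illustrative.

```go
// Sketch: bound both one attempt and the total retry budget so a downed
// dependency cannot pin workflows in retry loops forever.
ao := workflow.ActivityOptions{
	StartToCloseTimeout:    10 * time.Second, // one attempt
	ScheduleToCloseTimeout: 5 * time.Minute,  // all attempts combined
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    500 * time.Millisecond,
		BackoffCoefficient: 2.0,              // exponential backoff
		MaximumInterval:    30 * time.Second, // cap the per-retry wait
	},
}
ctx = workflow.WithActivityOptions(ctx, ao)
```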

4. The Phantom Workflow

A workflow appears stuck in the Temporal UI but your business system shows it as complete. This almost always means the workflow’s final signal was processed by your application but the workflow itself never received a completion signal. Always close workflows explicitly and monitor for workflows in the “Running” state older than 2x your expected max duration.

The Pre-Launch Checklist for Payment Workflows

#   Check                                                    Why It Matters
1   Every activity has StartToCloseTimeout set               Prevents zombie activities consuming workers
2   Every activity that touches money is idempotent          Prevents double charges on retry
3   Compensation actions exist for all money-moving steps    Ensures safe rollback on partial failure
4   Workflow ID derived from business key (e.g. paymentID)   Prevents duplicate workflow starts
5   Non-retryable errors enumerated per activity             Avoids pointless retries on validation errors
6   Custom search attributes set at workflow start           Enables business-level debugging in Temporal UI
7   Alerts on temporal_workflow_failed_total                 First production signal for payment failures
8   Workflow versioning strategy documented                  Required for safe code deployment
9   Load tested at 2x expected peak TPS                      Worker capacity confirmed before launch
10  Runbook for compensation failures written                On-call engineers know what to do at 3 AM

When Should You Use the Saga Pattern in Payment Systems?

You should consider using the saga pattern if your payment workflow:

  • Calls multiple services during payment processing
  • Includes refunds or reversals
  • Requires ledger or accounting updates
  • Uses asynchronous processing
  • Has experienced phantom transactions

Common use cases include:

  • Marketplace payments
  • Subscription billing systems
  • Refund workflows
  • Wallet and credit systems
  • Multi-step checkout pipelines

Closing Thoughts

Partial failures in payment pipelines are not a Temporal problem. They’re a distributed systems problem that has existed since the first two-service architecture was connected by a network call. What Temporal does is give you a principled, battle-tested framework for handling them without reinventing the wheel.

The saga pattern, durable execution, and built-in retry semantics eliminate an entire class of 3 AM incidents. But they require you to think carefully about idempotency, compensation, and workflow versioning before your first deployment. The teams that skip that thinking are the ones who call us 90 days after launch.

Is your payments team still the bottleneck for incident visibility?

If support and operations are still routing every “did this charge go through?” or “is this payment stuck?” question through engineering — that’s a design gap, not a Temporal limitation. In payment pipelines it’s also one of the most expensive gaps to leave open.

Xgrid offers two entry points depending on where you are:

  • Temporal 90-Day Production Health Check — we audit your current workflow visibility setup, idempotency coverage, compensation logic, and runbook completeness, then give you a concrete fix list ranked by risk.
  • Temporal Reliability Partner — for payments teams that want a named Temporal expert embedded long-term to own this layer, review new workflow designs before they ship, and mentor internal engineers.

Both are fixed-scope. No open-ended retainer required to get started.
