Common Temporal Production Failure Patterns We See in Fintech Ops
3:17 AM. PagerDuty fires.
A payment platform is throwing errors on chargeback workflows that have been running fine for 11 days. Customer support is getting calls. The on-call engineer is staring at the Temporal Web UI looking at 847 workflows stuck in a state that shouldn’t be possible.
The deploy that broke it went out 14 hours ago. It passed code review. It worked fine for every workflow that started after the deployment.
The 847 that started before it? Every single one needed someone to fix it by hand.
Deploying Temporal successfully is just the first step; the real challenges often emerge much later. I've frequently been called in to help teams diagnose payment-workflow misbehavior that surfaces 60, 90, or even 180 days after what initially looked like a successful launch.
These are the eight problems I keep running into. Not hypothetical scenarios. Things I have actually seen break in real systems moving real money. Read this before the next incident does it for you.
THE 8 PROBLEMS, QUICK GUIDE
| # | Pattern | Risk | When You’ll Feel It |
|---|---|---|---|
| 1 | Workflow Versioning Failures | CRITICAL | First deploy with in-flight workflows |
| 2 | Saga Compensation Logic Gaps | HIGH | First partial failure in a multi-service flow |
| 3 | Activity Timeout Misconfiguration | HIGH | First load spike or slow external API call |
| 4 | Signal/Query Race Conditions | MEDIUM | High-concurrency payment state reads |
| 5 | Task Queue Contention | HIGH | First batch reconciliation run at scale |
| 6 | Workflow History Explosion | HIGH | Reconciliation loop crosses ~10k items |
| 7 | Non-Deterministic Code in Workflows | CRITICAL | First replay after any code change |
| 8 | PCI Boundary Violations | CRITICAL | Compliance audit / security review |
01. Workflow Versioning Failures During Payment Flow Evolution
The incident: A team adds a fraud check to their payment workflow on a Friday. Tests pass. Staging looks clean. Saturday morning, 200+ chargeback workflows that started on Wednesday are all failing. Nobody touched those workflows. But the deploy changed what Temporal sees when it replays history, and every workflow that started before the deploy is now broken.
Temporal recovers a workflow’s state by replaying its full event history from the beginning. To do that correctly, the code has to produce the exact same steps in the exact same order every single time it replays. The moment you add, remove, or reorder an activity call without using GetVersion(), that stops being true, for every workflow that’s currently running.
HOW A DEPLOY BREAKS IN-FLIGHT WORKFLOWS
What makes this nasty in payments: it doesn’t fail straight away. Workflows nearly done get through fine. It’s the long-running chargeback flows, the ones that have been going for 72 hours, that blow up. By the time you catch it, you’ve got stuck workflows and half-finished payment sequences that someone has to sort out by hand.
ROOT CAUSE Changing the steps in a workflow while old copies of it are still running breaks how Temporal replays history.
FIX Any change to the shape of a workflow, adding a step, removing one, reordering, needs a version check using GetVersion() (Go) or Workflow.getVersion() (Java/TypeScript). Never deploy workflow code changes without versioning them first.
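To make the replay rule concrete, here is a minimal plain-Go model of `GetVersion()` semantics — not the Temporal SDK itself, just a sketch of how the recorded version marker lets old histories keep taking the old code path while new runs take the new one. The change ID `add-fraud-check` and the step names are illustrative.

```go
package main

import "fmt"

// history models the version markers Temporal records per change ID.
type history map[string]int

const defaultVersion = -1 // like workflow.DefaultVersion: "code path before the change"

// getVersion mimics workflow.GetVersion: on first execution it records
// maxSupported and returns it; on replay it returns whatever was recorded,
// so in-flight workflows keep replaying the old code path.
func getVersion(h history, changeID string, maxSupported int) int {
	if v, ok := h[changeID]; ok {
		return v // replaying: honor the recorded decision
	}
	h[changeID] = maxSupported // first execution: record the new version
	return maxSupported
}

// paymentWorkflow returns the ordered steps the code would schedule.
func paymentWorkflow(h history) []string {
	steps := []string{"authorize"}
	if getVersion(h, "add-fraud-check", 1) >= 1 {
		steps = append(steps, "fraudCheck") // only for workflows started after the deploy
	}
	return append(steps, "capture", "settle")
}

func main() {
	// Started before the deploy: marker recorded as the default version.
	old := history{"add-fraud-check": defaultVersion}
	fmt.Println(paymentWorkflow(old)) // [authorize capture settle]

	// Started after the deploy: records version 1, takes the new path.
	fmt.Println(paymentWorkflow(history{})) // [authorize fraudCheck capture settle]
}
```

Without the version check, the Wednesday chargeback workflows would replay code that schedules a fraud check their histories never recorded, and Temporal fails them with a non-determinism error.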
02. Saga Compensation Logic That Silently Does Nothing
The incident: Gateway charge goes through. The ledger write fails. The rollback fires to reverse the gateway charge, but the rollback activity isn’t safe to run twice. A quick network hiccup causes a retry. The reversal runs twice. Three days later, customer support gets a double-refund complaint, and nobody on the team can explain how that happened.
Temporal is great for coordinating steps across multiple payment services. The problem I keep seeing: the rollback logic exists, but it doesn’t actually work reliably.
SAGA FLOW: HAPPY PATH vs COMPENSATION PATH
| Happy Path ✓ | Compensation Path (if broken) ✗ |
|---|---|
| Authorize Gateway → | ← Reverse Authorization |
| Debit Ledger → | ← Credit Ledger |
| Notify User → | ← Send Failure Notice |
Three things that make rollback logic fail:
| The Problem | What Happens in Production |
|---|---|
| Rollback isn’t safe to run twice | The reversal runs twice on retry. Customer support gets a double-refund ticket they have to fix by hand. |
| Rollback depends on data that’s gone | By the time the rollback runs, the data it needs has already been cleaned up or changed. The reversal does nothing, silently. |
| No retries on the rollback itself | One network hiccup and the reversal is gone. No retry policy means no second attempt, ever. |
ROOT CAUSE Rollback logic gets written quickly and without the same care as the main flow: no retry policies, no protection against running twice.
FIX Treat the rollback path exactly like the main flow: make every reversal safe to run twice, add explicit retry policies, and set up alerts if the rollback itself fails, not just the forward steps. The rollback isn’t a safety net. It’s half the feature.
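A sketch of the first two fixes — an idempotency key derived from the payment, plus retries on the reversal itself. This is a plain-Go model with a fake gateway that deduplicates by key, the way Stripe-style APIs do; all names (`gateway`, `compensate`, `pay_123`) are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// gateway models a payment provider that deduplicates refunds by
// idempotency key. It stands in for the real external service.
type gateway struct {
	refunded map[string]bool // idempotency key -> already processed
	balance  int             // cents refunded so far
}

// refund is safe to call twice: a second call with the same key is a no-op.
func (g *gateway) refund(idempotencyKey string, cents int) error {
	if g.refunded[idempotencyKey] {
		return nil // duplicate delivery: acknowledge, do not refund again
	}
	g.refunded[idempotencyKey] = true
	g.balance += cents
	return nil
}

// compensate retries the reversal, because the rollback path deserves
// a retry policy just like the forward path does.
func compensate(g *gateway, paymentID string, cents, attempts int) error {
	key := "reverse-" + paymentID // derived from the payment, stable across retries
	for i := 0; i < attempts; i++ {
		if g.refund(key, cents) == nil {
			return nil
		}
	}
	return errors.New("compensation exhausted retries")
}

func main() {
	g := &gateway{refunded: map[string]bool{}}
	compensate(g, "pay_123", 5000, 3)
	compensate(g, "pay_123", 5000, 3) // a network hiccup redelivers: still one refund
	fmt.Println(g.balance)            // 5000, not 10000
}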
03. Activity Timeout Misconfiguration Against Payment APIs
The incident: Friday evening, Stripe slows down under load. Payment activities start timing out and retrying. The team only set StartToCloseTimeout, so each of the ten retries runs for 30 seconds before giving up. One payment hangs for over five minutes. Users leave the checkout. And because the activity isn’t protected against duplicates, Stripe gets nine separate charge attempts for the same transaction.
Temporal activities have two separate timeout settings, and most teams mix them up. Setting only one of them leaves no ceiling on the combined time across all retry attempts, which means a slow API and a generous retry count can tie up a payment for far longer than intended.
TWO TIMEOUTS, WHAT EACH ONE DOES
| StartToCloseTimeout | ScheduleToCloseTimeout |
|---|---|
| How long one try can take | Hard limit across all tries combined |

| Without ScheduleToCloseTimeout | With ScheduleToCloseTimeout = 45s |
|---|---|
| Attempt 1: 30s → timeout<br>Attempt 2: 30s → timeout<br>…<br>Attempt 10: 30s → timeout<br>= 5+ minute hang, 9 charge attempts, possible duplicate charges | Attempt 1: 30s → timeout<br>Attempt 2: 15s → hard stop<br>= Payment fails quickly. The user gets a clear error. No duplicate charges. |
ROOT CAUSE If you only set StartToCloseTimeout, there’s no limit on how long all the retries together can take. A slow external API plus a generous retry count means charges can keep firing for minutes.
FIX Always set ScheduleToCloseTimeout as a hard total limit on every activity that calls a payment API. That’s what stops a slow API from turning into a flood of duplicate charge requests.
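The arithmetic in the table above can be sketched as a small simulation — each attempt runs up to `startToClose` seconds, and `scheduleToClose` (0 meaning unset) caps the total budget. This is a model of the timeline, not SDK configuration code.

```go
package main

import "fmt"

// attemptsWithin simulates the retry timeline: each attempt may run up to
// startToClose seconds; scheduleToClose (0 = unset) is a hard cap on the
// total elapsed time across all attempts combined.
func attemptsWithin(startToClose, scheduleToClose, maxAttempts int) (attempts, elapsed int) {
	for attempts < maxAttempts {
		if scheduleToClose > 0 && elapsed >= scheduleToClose {
			break // hard stop: the total budget is exhausted
		}
		run := startToClose
		if scheduleToClose > 0 && elapsed+run > scheduleToClose {
			run = scheduleToClose - elapsed // the final attempt is truncated
		}
		elapsed += run
		attempts++
	}
	return attempts, elapsed
}

func main() {
	// Only StartToCloseTimeout=30s with 10 retries: a 5-minute hang.
	a, e := attemptsWithin(30, 0, 10)
	fmt.Printf("no total cap: %d attempts, %ds\n", a, e) // 10 attempts, 300s

	// Adding ScheduleToCloseTimeout=45s: fail fast after two attempts.
	a, e = attemptsWithin(30, 45, 10)
	fmt.Printf("45s cap: %d attempts, %ds\n", a, e) // 2 attempts, 45s
}
```

The second case is the Friday-evening scenario from the incident: the 45-second ceiling turns a five-minute pile-up of charge attempts into one fast, clean failure the checkout can surface to the user.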
04. Signal and Query Race Conditions in Payment State Machines
The incident: A payments dashboard checks workflow status to decide if a refund is allowed. A capture signal comes in and Temporal holds it in a buffer. Before the workflow gets to process it, the dashboard asks for the status, and gets back ‘Pending’. A refund kicks off on a transaction the workflow has already moved past. Now two systems think they’re in charge of the same payment.
Temporal holds incoming signals in a buffer and only processes them when the workflow is ready. If someone queries the workflow status in that gap, after the signal arrived but before it was processed, they get the old status back, even though a change is already in motion.
THE GAP WHERE THINGS GO WRONG
The fix: keep an internal variable for the payment status. When a signal comes in, update that variable right away, before anything else happens. Then have your query handler read from that variable, not from something it works out on the fly.
| ✖ Wrong: work out status on demand | ✔ Right: update status immediately |
|---|---|
| // query works out status on the fly<br>QueryHandler: return computeStatus(history)<br>// signal updates state too late<br>SignalHandler(capture):<br>&nbsp;&nbsp;yield; // too late | // signal updates state right away<br>SignalHandler(capture):<br>&nbsp;&nbsp;status = CAPTURED // sync<br>// query reads the variable directly<br>QueryHandler: return status |
ROOT CAUSE
If a query determines the status on demand, it cannot account for signals that are still in the buffer waiting to be processed.
FIX Signal handlers should update an internal status variable as soon as they are triggered. Query handlers should read that variable directly. Avoid writing business logic that assumes a query result is always up to date, especially when a signal may have just arrived.
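The pattern can be sketched in a few lines of plain Go — a status variable the signal handler mutates synchronously, a query that only reads it, and a refund guard driven off the same variable. This is a model of the handler discipline, not the Temporal SDK's signal/query API; the names are illustrative.

```go
package main

import "fmt"

// paymentState holds the single internal status variable that both
// handlers share. The signal handler writes it; the query only reads it.
type paymentState struct {
	status string
}

// handleCaptureSignal is the "right" column above: update the state first,
// synchronously, before yielding back to the workflow loop.
func (p *paymentState) handleCaptureSignal() {
	p.status = "CAPTURED" // visible to any query from this point on
}

// queryStatus is a dumb read: no recomputation from history.
func (p *paymentState) queryStatus() string {
	return p.status
}

// canRefund is the business decision the dashboard was getting wrong:
// it must key off the same variable the signal handler updates.
func (p *paymentState) canRefund() bool {
	return p.status == "PENDING"
}

func main() {
	p := &paymentState{status: "PENDING"}
	fmt.Println(p.canRefund()) // true: nothing captured yet
	p.handleCaptureSignal()    // the buffered capture signal is processed...
	fmt.Println(p.queryStatus(), p.canRefund()) // CAPTURED false
}
```

The dashboard incident above disappears with this shape: the moment the capture signal is handled, the very next `queryStatus` sees `CAPTURED` and `canRefund` says no, so the refund never starts against a transaction the workflow has moved past.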
05. Task Queue Contention Between Payment and Reconciliation Workers
The incident: End-of-day settlement kicks off across 40,000 transactions. Every worker slot fills up. Live payment authorizations get stuck waiting behind the batch job. A checkout that normally takes 200ms suddenly takes 4 seconds. Conversion tanks on a Friday night. Nobody called it an incident, no errors fired. It just got slow. Until revenue ops are called.
When live payments, refunds, and nightly reconciliation all share the same workers, they fight over the same slots. You won’t see errors. You’ll just see everything slow down, right when you least want it to.
| WRONG: One shared queue | RIGHT: Separate queues by urgency |
|---|---|
| Everything goes into one task queue | payment-realtime → its own dedicated workers |
| Batch jobs compete with live payments for slots | reconciliation-batch → a separate worker pool |
| 200ms checkout → 4 seconds under batch load | Batch jobs can never slow down live payments |
| Worker settings never adjusted from defaults | Worker settings matched to what they actually do |
Separating the queues is step one. You also need to tune how workers behave. Three settings matter:
| Parameter | Why it matters |
|---|---|
| MaxConcurrentWorkflowTaskPollers | Controls how many workflow decisions get picked up at once. Too low and tasks queue up. Too high and you waste CPU. |
| MaxConcurrentActivityExecutionSize | Limits how many activities run at once per worker. Matters a lot when your activities call slow external APIs like Stripe or Adyen. |
| MaxConcurrentActivityTaskPollers | Controls how aggressively workers look for new activity tasks. Most teams miss this one. When slow API calls are the bottleneck, not task scheduling, this is the setting that makes the difference. |
ROOT CAUSE When batch jobs and live payments share workers, the batch will eventually eat all the capacity. No errors get thrown. Payments just get slow.
FIX Give live payments their own workers. Put batch jobs in a separate pool that can’t touch the live payment workers. Then set all three tuning parameters, MaxConcurrentWorkflowTaskPollers, MaxConcurrentActivityExecutionSize, and MaxConcurrentActivityTaskPollers, based on what your workflows actually do, not what the defaults happen to be.
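What `MaxConcurrentActivityExecutionSize` buys you can be shown with a stdlib-only sketch: a worker-side semaphore that caps concurrent executions no matter how many tasks the batch dumps on the queue. This is a plain-Go model of the mechanism, not Temporal worker configuration; the function name and numbers are illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWithLimit models MaxConcurrentActivityExecutionSize: a semaphore that
// caps how many activities execute at once on a worker, regardless of how
// many tasks are queued. It returns the observed peak concurrency.
func runWithLimit(tasks, limit int) int32 {
	sem := make(chan struct{}, limit) // one slot per allowed concurrent activity
	var inFlight, peak int32
	var wg sync.WaitGroup
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire an execution slot
			n := atomic.AddInt32(&inFlight, 1)
			for { // record the high-water mark of concurrent executions
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			atomic.AddInt32(&inFlight, -1)
			<-sem // release the slot
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// Thousands of queued reconciliation tasks, but only 8 ever run at once.
	fmt.Println(runWithLimit(10000, 8) <= 8) // true
}
```

The point of the separate queues in the table above is that this cap, and the poller counts, get set per worker pool: the reconciliation pool can be tuned for throughput against slow batch APIs while the `payment-realtime` pool stays sized for latency.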
06. Workflow History Explosion in Long-Running Reconciliation Flows
The incident: A nightly reconciliation job runs cleanly for two months. Then transaction volume grows past a certain point and the workflow just stops mid-run. No clear error in the app logs, just a cryptic message buried in Temporal’s storage about history size. Half the night’s settlements don’t process. The team had no idea there was a limit. Nobody told them.
Temporal keeps a full record of everything a workflow does so it can replay it later. That record has a limit, roughly 50,000 events or 50MB. When you hit it, Temporal terminates the workflow with an error. The run stops mid-execution. Most teams hit this without realising the limit existed.
Worth knowing: Temporal Cloud enforces this limit more tightly than a self-hosted setup, and the ceiling can be lower depending on how your namespace is set up. If you’re on Cloud, check what your actual limit is before assuming the defaults apply.
THE PROBLEM vs THE FIX
| The Problem: Single Loop | The Fix: Break it into batches |
|---|---|
| for tx in 10,000 txns:<br>&nbsp;&nbsp;ProcessTx(tx) # 5 events each<br># = 50,000 events<br># The workflow gets killed | process batch of 500<br>save checkpoint(lastID)<br>ContinueAsNew(checkpoint)<br># Clean slate for each batch<br># No risk of hitting the limit |
ROOT CAUSE One workflow looping over thousands of items quietly fills up its history until Temporal kills it, right in the middle of a run.
FIX Use Continue As New to save your progress and start fresh after each batch. For operations that fan out to lots of items, give each item its own child workflow with its own history. Check your batch sizes against the history limit before you go live, not after your first killed production job.
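The batching arithmetic can be checked before go-live with a small simulation — each "run" processes one batch from a checkpoint and hands the next checkpoint to a fresh run, the way Continue As New hands state to a new execution with an empty history. Plain-Go sketch, not the SDK; `eventsPerItem` is the rough per-item event count from the table above.

```go
package main

import "fmt"

const historyLimit = 50000 // rough Temporal cap on events per workflow run
const eventsPerItem = 5    // each ProcessTx adds roughly 5 events to history

// reconcileRun models one workflow run: process up to batchSize items from
// the checkpoint onward, then hand the next checkpoint to a fresh run
// (what ContinueAsNew would carry forward) with an empty history.
func reconcileRun(total, checkpoint, batchSize int) (nextCheckpoint, historyEvents int) {
	end := checkpoint + batchSize
	if end > total {
		end = total
	}
	for i := checkpoint; i < end; i++ {
		historyEvents += eventsPerItem
	}
	return end, historyEvents
}

func main() {
	total, checkpoint, runs, worstHistory := 10000, 0, 0, 0
	for checkpoint < total {
		next, events := reconcileRun(total, checkpoint, 500)
		if events > worstHistory {
			worstHistory = events
		}
		checkpoint = next // the state ContinueAsNew passes to the next run
		runs++
	}
	// 20 runs of 2,500 events each: every run stays far below the cap.
	fmt.Println(runs, worstHistory, worstHistory < historyLimit) // 20 2500 true
}
```

Running this kind of back-of-envelope check against projected transaction growth is exactly the "check your batch sizes before you go live" step: the team in the incident would have seen the single-loop design crossing 50,000 events months before the nightly job died.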
07. Non-Deterministic Code Hidden Inside Workflow Logic
The incident: A senior engineer adds a quick exchange-rate lookup inside a workflow function, just a cache read, nothing big. It works fine for six weeks. Then a library upgrade changes the cache key format. Next time Temporal replays the workflow, it gets a different result. Temporal notices something changed and fails the workflow. Forty-seven live payment flows go down at once. The engineer is confused, it was just a read. It didn’t feel like a decision.
Temporal needs workflow code to always produce the same steps in the same order when replaying the same history. Any code inside a workflow that might give a different result each time it runs, checking the time, generating random numbers, reading from a database, calling an API, will break replay and fail the workflow.
WHAT GOES IN A WORKFLOW vs AN ACTIVITY
| ✗ Never inside a Workflow | ✓ Put it in an Activity |
|---|---|
| time.Now() → idempotency key | Derive it from data passed into the workflow |
| math/rand → payment amount | Generate it outside and pass it in as a parameter |
| Direct DB read → exchange rate | FetchExchangeRate(), make it an activity |
| HTTP call → feature flag | FetchFeatureFlag(), make it an activity |
| UUID generator inside workflow | Generate it outside and pass it in at the start |
| Cache read → any config | FetchConfig(), even reads count as I/O |
ROOT CAUSE Any code that might give a different answer next time, including operations that feel read-only, like cache lookups or config fetches, can break replay if anything underneath it changes.
FIX The rule is simple: the workflow coordinates, activities do the work. Use workflow.Now() instead of time.Now(). Move every I/O call, even reads, into an activity. Getting this right before you launch is a lot cheaper than fixing it after a compliance audit flags your Temporal cluster.
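One concrete application of the rule: derive identifiers from workflow input instead of generating them inside the workflow. The sketch below hashes the payment ID and attempt number into an idempotency key, so every replay computes the identical value, where a `time.Now()` or UUID call would differ on each replay and trip the non-determinism check. Function and field names are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// idempotencyKey derives a key purely from data passed into the workflow,
// so every replay of the same history computes the same value. Contrast
// with time.Now() or a UUID generated inside the workflow body, which
// produce a different result on each replay and break determinism.
func idempotencyKey(paymentID string, attempt int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", paymentID, attempt)))
	return hex.EncodeToString(sum[:8])
}

func main() {
	first := idempotencyKey("pay_123", 1)
	replay := idempotencyKey("pay_123", 1)
	fmt.Println(first == replay) // true: same input, same key, replay-safe
}
```

The same principle covers the exchange-rate incident above: the cache read belonged in an activity, because an activity's result is recorded in history once and replay reads the recording instead of re-executing the lookup.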
08. PCI/Compliance Boundary Violations in Temporal Cluster Design
The incident: A fintech team is six months into a PCI audit. The assessor asks what data flows through the Temporal cluster. The engineering lead opens a recent workflow history in the Temporal UI and sees, right there in plain text, a billing address, the last four card digits, and a partial card number someone added for debugging three sprints back. The cluster is now in PCI scope. So is every worker. So is every engineer who can access the namespace. Fixing it takes four months.
Temporal stores a complete record of everything that passes through a workflow, every input, every output, every step. If card data flows through that record unencrypted, your whole Temporal cluster comes under PCI rules. That means the cluster, every worker, and everyone with access to it.
HOW CARD DATA CREEPS INTO THE CLUSTER
| Sprint | What was added to the workflow | PCI Risk |
|---|---|---|
| Sprint 1 | Transaction ID and amount only | None ✔ |
| Sprint 3 | Last four digits of card for fraud scoring | Low ⚠ |
| Sprint 7 | Billing address for logging context | Medium ⚠⚠ |
| Sprint 11 | Partial PAN just for debugging | CLUSTER IN SCOPE |
Four things you can do to keep card data out of the cluster:
| What to do | How it helps |
|---|---|
| Use tokens, not real card data | Store the real card data in a vault. Let Temporal carry only a token, a reference with no value on its own. Activities retrieve the real data when they need it, inside a controlled environment. The Temporal cluster never touches actual card data. |
| Encrypt what goes into the event history | Use Temporal’s DataConverter to encrypt sensitive fields before they get written to the event history. Card data never appears in plain text, even in the database where Temporal stores its history. |
| Keep payment workflows in their own namespace | Run payment workflows in a separate Temporal namespace, away from everything else. This limits the blast radius if something goes wrong, makes audits simpler, and makes it clear where your PCI boundary sits. |
| Test what’s actually being logged | Write rules to strip sensitive data from logs and actually test them. Temporal’s event history can expose data you didn’t expect. Find out what’s in there before the auditor does. |
ROOT CAUSE PCI scope creep happens one sprint at a time. Each addition looks harmless on its own. Together, they quietly bring the whole cluster, and everyone who touches it, into compliance scope. Nobody makes a formal decision to do this. It just happens.
FIX Card data should never appear as plain text in workflow inputs, activity results, or memo fields. Only pass tokens or encrypted references. Make this a code review rule, not just something people try to remember. By the time you spot it in the Temporal UI, you’re already in scope.
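A minimal sketch of the tokenization approach from the table above: the vault (the only PCI-scoped component) maps tokens to real card data, and the workflow input that lands in Temporal's event history carries only the token. This is an illustrative plain-Go model; a real vault would be a separate service, and the type and field names here are assumptions.

```go
package main

import "fmt"

// vault models the PCI-scoped service that holds real card data.
// The Temporal cluster only ever sees tokens; activities exchange a token
// for the real PAN inside the vault's controlled boundary.
type vault struct {
	byToken map[string]string // token -> PAN, stored outside Temporal
	next    int
}

func (v *vault) tokenize(pan string) string {
	v.next++
	token := fmt.Sprintf("tok_%04d", v.next) // a reference with no value on its own
	v.byToken[token] = pan
	return token
}

func (v *vault) detokenize(token string) string {
	return v.byToken[token]
}

// workflowInput is what actually enters Temporal's event history:
// a token and an amount, never the PAN itself.
type workflowInput struct {
	CardToken string
	Amount    int // cents
}

func main() {
	v := &vault{byToken: map[string]string{}}
	token := v.tokenize("4111111111111111") // happens before the workflow starts

	in := workflowInput{CardToken: token, Amount: 4200}
	fmt.Println(in.CardToken) // tok_0001: safe to appear in history and the UI
	// Only an activity, running inside the PCI boundary, resolves the token.
	fmt.Println(v.detokenize(in.CardToken) != "")
}
```

For fields that genuinely must travel through Temporal, the DataConverter approach from the table is the complement to this: a payload codec encrypts values before they are written to history, so even the stored record never contains plain text.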
What the engineer at 3:17 AM now knows
None of these eight problems are mysteries once you’ve seen them. The chargeback workflows that blew up on Saturday were missing a GetVersion() call. The double-refund happened because nobody made the rollback safe to run twice. The PCI finding was ten sprints of small decisions that each looked fine on their own.
Running Temporal in production for payments is not the same as getting it to work in staging. These problems live in the gap between what Temporal promises and what actually happens when your team is changing the code, scaling up traffic, and moving real money at the same time.
Every one of these is fixable. Teams that catch these in a planned review and teams that catch them at 3 AM are often separated by one thing: somebody went looking before the system forced the issue.

