How to Decouple Reliability from Infrastructure Ownership
Why your workflow reliability should not be hostage to whether your platform team has capacity this sprint — and how Temporal’s execution model makes that separation possible
| TL;DR
Most teams treat reliability as an infrastructure problem: better hardware, more redundancy, stricter SLAs on the cluster. That framing puts reliability in the wrong hands. When workflow durability is baked into the execution model — not bolted onto the infrastructure — your application teams own their reliability posture directly, without waiting on the platform. This post explains how Temporal’s architecture enables that separation, what it takes to implement it correctly, and the organizational failure modes that undo it. |
The Reliability Trap Most Engineering Orgs Walk Into
Here is a pattern that plays out in almost every engineering organization that scales past 50 engineers. Early on, reliability is everyone’s problem in an informal way — the team that owns the service owns its uptime. Then incidents accumulate, on-call rotations burn people out, and leadership decides to formalize. A Platform Engineering team is formed. SREs are hired. Monitoring is centralized. Runbooks are written.
This is the right instinct. But it creates a side effect that nobody plans for: reliability becomes a service that application teams request from the platform team, rather than a property that application teams own in their code.
The result looks like this in practice:
- An application team’s workflow starts failing intermittently. They open a ticket with the platform team.
- The platform team investigates the infrastructure: cluster health, database load, network latency. Everything looks fine.
- The root cause turns out to be a missing retry policy and no timeout on a third-party API call — an application-level design decision, not an infrastructure problem.
- Two weeks have passed. The incident is closed. The underlying design gap remains.
This is not a process failure. It is a structural one. When reliability lives in the infrastructure layer, the people who can actually fix most reliability problems — the application engineers who write workflow and activity code — are one escalation removed from the problem.
| The Core Distinction
Infrastructure reliability means your cluster stays up. Application reliability means your workflows complete correctly under failure conditions. These are different properties, owned by different teams, fixed by different interventions. Conflating them is the root cause of most Temporal production incidents we see. |
How Temporal’s Architecture Enables the Separation
Temporal’s durable execution model is the technical foundation that makes this separation possible. Understanding it precisely is what separates teams that successfully decouple reliability from those that simply move the same problems to a new layer.
What Temporal Actually Guarantees
Temporal guarantees that the orchestration of your workflow — the sequence of activity calls, the retry decisions, the timer firings — will survive any worker crash. It does this by persisting every command your workflow issues to an append-only event history in the Temporal cluster’s persistence layer. When a worker crashes and restarts, it replays the workflow function against the recorded event history to reconstruct the state, then continues from where execution left off.
This guarantee is infrastructure-level: it is provided by Temporal regardless of what your workflow code does. The cluster going down does not lose your workflow state. A worker pod being evicted does not drop an in-flight execution. These are properties of the platform.
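The replay mechanism is easier to reason about with a toy model in hand. The sketch below is a deliberately simplified, standard-library-only illustration of the idea, not Temporal's implementation; the `History` type and `executeActivity` helper are invented for this example. On first execution, each activity result is appended to history; after a simulated crash, replaying the same function against that history reconstructs identical state without re-running the activities.

```go
package main

import "fmt"

// History is a simplified stand-in for Temporal's append-only event history.
type History struct {
	results []string // recorded activity results, in order
	cursor  int
}

// executeActivity replays a recorded result if one exists; otherwise it
// "runs" the activity and appends the result to history.
func (h *History) executeActivity(name string, run func() string) string {
	if h.cursor < len(h.results) {
		r := h.results[h.cursor] // replay: consume the recorded result
		h.cursor++
		return r
	}
	r := run() // first execution: actually do the work
	h.results = append(h.results, r)
	h.cursor++
	return r
}

// A deterministic "workflow": same history in, same decisions out.
func orderWorkflow(h *History) string {
	payment := h.executeActivity("ChargeCard", func() string { return "charged" })
	shipment := h.executeActivity("ShipOrder", func() string { return "shipped" })
	return payment + "," + shipment
}

func main() {
	h := &History{}
	first := orderWorkflow(h) // first execution records history

	// Simulate a worker crash and restart: replay against recorded history.
	h.cursor = 0
	replayed := orderWorkflow(h)

	fmt.Println(first == replayed) // replay reconstructs identical state
}
```

This is also why determinism matters, as discussed later: replay only works if the function issues the same commands in the same order every time.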
What Temporal Does Not Guarantee
Temporal does not guarantee that your workflow will complete correctly. That guarantee is the responsibility of the application team. Specifically:
- Temporal will retry a failed activity — but only if you configured a retry policy. An activity with no retry policy and a transient network failure will fail permanently.
- Temporal will replay your workflow after a worker crash — but only if your workflow code is deterministic. A non-deterministic workflow will throw a non-determinism error on replay and get stuck.
- Temporal will durably track a long-running workflow — but if that workflow has no heartbeat on its activities, a crashed activity will not be detected until its StartToCloseTimeout expires, which could be hours if misconfigured.
- Temporal will execute your compensation logic — but only if you wrote it. A workflow with no compensation for partial failures leaves money on the table, literally.
Every item in that list is an application-layer decision. The platform team cannot fix them. The cluster SLA cannot prevent them. They require application engineers to understand Temporal’s primitives and apply them correctly in code.
| The Separation in One Sentence
Temporal makes workflow state durable. Your application code makes workflow behavior correct. These are two different layers, and owning the second one does not require owning the first. |
The Four Primitives That Application Teams Must Own
Decoupling reliability from infrastructure ownership means application teams taking direct ownership of four specific Temporal primitives. None of these require platform team involvement. All of them are code-level decisions.
1. Retry Policies
Every activity that touches an external system — a payment gateway, a third-party API, a database — needs an explicit retry policy. Not a default, not an inherited policy, an explicit one that reflects the failure characteristics of that specific dependency.
```go
// The wrong way: no retry policy, accepting Temporal defaults
ao := workflow.ActivityOptions{
	StartToCloseTimeout: 30 * time.Second,
	// No RetryPolicy specified — Temporal will use defaults:
	// unlimited retries, 1s initial interval, 2x backoff, 100s max interval.
	// This is almost never what you want for a payment activity.
}

// The right way: explicit policy matched to the dependency
ao := workflow.ActivityOptions{
	StartToCloseTimeout: 30 * time.Second,
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    500 * time.Millisecond,
		BackoffCoefficient: 2.0,
		MaximumInterval:    30 * time.Second,
		MaximumAttempts:    4,
		// Non-retryable: business errors that retrying cannot fix
		NonRetryableErrorTypes: []string{
			"ErrInsufficientFunds",
			"ErrCardExpired",
			"ErrInvalidAccountNumber",
		},
	},
}
```
The NonRetryableErrorTypes field is particularly important. Without it, Temporal will retry on every error type, including business validation errors that retrying will never fix. This burns retry budget, delays failure detection, and makes debugging harder. Application engineers who own the activity own this decision — the platform team cannot know which errors are retryable for a given business operation.
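The list matches against the error's type name, which the activity controls (in the Go SDK this is typically set with `temporal.NewApplicationError`, giving the error an explicit type string). The sketch below is a simplified, standard-library-only model of the resulting retry decision, not the server's actual implementation; the `RetryPolicy` struct and `shouldRetry` function are invented for illustration.

```go
package main

import "fmt"

// RetryPolicy mirrors the two fields that matter for this decision.
type RetryPolicy struct {
	MaximumAttempts        int
	NonRetryableErrorTypes []string
}

// shouldRetry is a simplified model of the retry decision: stop when the
// error type is classified non-retryable or attempts are exhausted.
func shouldRetry(errType string, attempt int, p RetryPolicy) bool {
	for _, t := range p.NonRetryableErrorTypes {
		if t == errType {
			return false // business error: retrying cannot fix it
		}
	}
	return attempt < p.MaximumAttempts
}

func main() {
	p := RetryPolicy{
		MaximumAttempts:        4,
		NonRetryableErrorTypes: []string{"ErrInsufficientFunds", "ErrCardExpired"},
	}
	fmt.Println(shouldRetry("ErrNetworkTimeout", 1, p))    // true: transient, retry
	fmt.Println(shouldRetry("ErrInsufficientFunds", 1, p)) // false: business error
	fmt.Println(shouldRetry("ErrNetworkTimeout", 4, p))    // false: attempts exhausted
}
```

The point of the model: the retryable/non-retryable split is pure business classification, which is exactly why only the application team can own it.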
2. Timeouts
Temporal has four distinct timeout types and they address different failure scenarios. Using them correctly is application-layer work.
| Timeout | What It Protects Against | Who Sets It |
| --- | --- | --- |
| ScheduleToStartTimeout | Worker pool starvation — task sits in queue too long | Application team (reflects SLO) |
| StartToCloseTimeout | Activity execution hung or crashed | Application team (reflects max expected duration) |
| ScheduleToCloseTimeout | Total time budget for activity including all retries | Application team (reflects business deadline) |
| HeartbeatTimeout | Long-running activity stopped reporting progress | Application team (reflects heartbeat interval) |
Every timeout in that table is set by the application team. The platform team cannot know your business SLO for a payment activity, or how long a file processing activity is expected to run. These are domain decisions embedded in code.
3. Heartbeating for Long-Running Activities
An activity that runs for more than a few seconds without heartbeating is invisible to Temporal. If the worker executing it crashes, Temporal has no way to know until the StartToCloseTimeout expires. For long-running operations — file uploads, batch processing, AI inference — this can mean minutes or hours of silent failure.
```go
// A long-running activity that heartbeats correctly
func (a *ProcessingActivities) ProcessLargeFile(
	ctx context.Context,
	req FileProcessingRequest,
) (FileProcessingResult, error) {
	records, err := loadRecords(req.FileURL)
	if err != nil {
		return FileProcessingResult{}, err
	}

	// Resume from checkpoint if this is a retry
	startFrom := 0
	if activity.HasHeartbeatDetails(ctx) {
		_ = activity.GetHeartbeatDetails(ctx, &startFrom)
	}

	for i := startFrom; i < len(records); i++ {
		if err := processRecord(records[i]); err != nil {
			return FileProcessingResult{}, err
		}
		// Heartbeat every 100 records with a progress checkpoint
		if i%100 == 0 {
			activity.RecordHeartbeat(ctx, i)
			// Heartbeating also surfaces cancellation
			if ctx.Err() != nil {
				return FileProcessingResult{}, ctx.Err()
			}
		}
	}
	return FileProcessingResult{ProcessedCount: len(records)}, nil
}
```
The heartbeat here does two things: it tells Temporal the activity is still alive (so a crashed worker is detected within HeartbeatTimeout, not StartToCloseTimeout), and it checkpoints progress so a retry can resume from where it left off rather than reprocessing from the beginning. Both are application-layer concerns. Neither requires platform intervention.
4. Workflow Determinism
This is the primitive that most teams underinvest in until they hit their first non-determinism error in production. Workflow code must be deterministic: given the same event history, the workflow function must always issue the same sequence of commands. The rules are straightforward but require active discipline:
```go
// Non-deterministic: breaks replay
func BadWorkflow(ctx workflow.Context, req Request) error {
	if time.Now().Weekday() == time.Saturday { // ✗ non-deterministic
		return workflow.ExecuteActivity(ctx, WeekendActivity, req).Get(ctx, nil)
	}
	return workflow.ExecuteActivity(ctx, WeekdayActivity, req).Get(ctx, nil)
}

// Deterministic: replay-safe
func GoodWorkflow(ctx workflow.Context, req Request) error {
	if workflow.Now(ctx).Weekday() == time.Saturday { // ✓ replay-safe
		return workflow.ExecuteActivity(ctx, WeekendActivity, req).Get(ctx, nil)
	}
	return workflow.ExecuteActivity(ctx, WeekdayActivity, req).Get(ctx, nil)
}
```
Determinism is enforced at code review time, not at runtime. The platform team cannot catch a non-determinism bug in your workflow code during infrastructure monitoring. By the time it surfaces — as a stuck workflow in production — it is already an incident. This is why application teams must internalize the determinism rules, not treat them as a platform concern.
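Some of that review-time discipline can be automated (the Go SDK ecosystem includes a `workflowcheck` static analyzer for this purpose). The sketch below is a much cruder, purely textual version, invented for illustration: it scans workflow source for a handful of calls that break determinism and suggests the replay-safe replacement. A real linter would walk the AST rather than match substrings.

```go
package main

import (
	"fmt"
	"strings"
)

// bannedInWorkflows maps non-deterministic calls to their replay-safe
// replacements. Illustrative subset, not an exhaustive rule set.
var bannedInWorkflows = map[string]string{
	"time.Now(":   "workflow.Now(ctx)",
	"time.Sleep(": "workflow.Sleep(ctx, d)",
	"rand.":       "workflow.SideEffect(ctx, ...)",
	"go func(":    "workflow.Go(ctx, ...)",
}

// lintWorkflowSource is a naive text scan over workflow source code.
func lintWorkflowSource(src string) []string {
	var findings []string
	for i, line := range strings.Split(src, "\n") {
		for banned, replacement := range bannedInWorkflows {
			if strings.Contains(line, banned) {
				findings = append(findings,
					fmt.Sprintf("line %d: %s is non-deterministic; use %s", i+1, banned, replacement))
			}
		}
	}
	return findings
}

func main() {
	src := `if time.Now().Weekday() == time.Saturday {
	return workflow.ExecuteActivity(ctx, WeekendActivity, req).Get(ctx, nil)
}`
	for _, f := range lintWorkflowSource(src) {
		fmt.Println(f)
	}
}
```

Even a check this crude shifts the failure from a stuck production workflow to a red CI build, which is the whole point of application-layer ownership.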
Building a Reliability Contract Between Teams
Decoupling does not mean the platform team disappears from the reliability conversation. It means the responsibilities are clearly partitioned so that each team owns what they can actually control. The following contract formalizes that partition.
| Platform Team Owns | Application Team Owns |
| --- | --- |
| Temporal cluster uptime and SLA | Retry policies for every activity |
| Persistence layer health and capacity | Timeout values matched to business SLOs |
| Worker infrastructure scaling | Heartbeating on long-running activities |
| Namespace provisioning and access control | Workflow determinism and versioning |
| TLS certificate management | Compensation logic for partial failures |
| Cluster-level metrics and alerting | Workflow-level metrics and alerting |
| Temporal SDK version upgrades | Non-retryable error classification |
| Disaster recovery and backup | Runbooks for workflow-specific incidents |
The right column is entirely code. None of it requires a ticket to the platform team, a cluster configuration change, or an infrastructure deployment. Application teams can audit, test, and improve every item in that column on their own sprint cycle.
Making Application-Layer Reliability Auditable
Ownership without visibility is not ownership — it is just blame assignment. For application teams to genuinely own their reliability posture, they need tooling that surfaces the state of their reliability primitives at any point in time. Here is what that looks like in practice.
Automated Workflow Design Audits
Write a linter that checks every workflow and activity registration against a minimum reliability standard. This runs in CI and blocks merges that introduce reliability regressions.
```go
// tools/temporal-lint/main.go
// Simplified example of a workflow design audit tool

type ActivityConfig struct {
	Name                   string
	HasRetryPolicy         bool
	HasStartToCloseTimeout bool
	HasHeartbeatTimeout    bool
	NonRetryableErrors     []string
}

type AuditResult struct {
	Activity ActivityConfig
	Findings []string
}

func AuditActivityOptions(opts workflow.ActivityOptions, name string) AuditResult {
	result := AuditResult{Activity: ActivityConfig{Name: name}}

	if opts.RetryPolicy == nil {
		result.Findings = append(result.Findings,
			"[CRITICAL] No retry policy set. Temporal defaults apply (unlimited retries).")
	} else if len(opts.RetryPolicy.NonRetryableErrorTypes) == 0 {
		result.Findings = append(result.Findings,
			"[WARNING] No NonRetryableErrorTypes defined. All errors will be retried.")
	}
	if opts.StartToCloseTimeout == 0 {
		result.Findings = append(result.Findings,
			"[CRITICAL] StartToCloseTimeout not set. Activity may hang indefinitely.")
	}
	if opts.HeartbeatTimeout == 0 {
		result.Findings = append(result.Findings,
			"[WARNING] HeartbeatTimeout not set. Crashed workers not detected until StartToCloseTimeout.")
	}
	return result
}
```
Workflow-Level SLO Tracking with Custom Search Attributes
Temporal’s custom search attributes let you embed business context into every workflow execution. Combined with visibility queries, this gives application teams a live view of their workflow SLO health — without routing through the platform team’s monitoring stack.
```go
// Embed SLO context at workflow start
opts := client.StartWorkflowOptions{
	ID:        req.OrderID,
	TaskQueue: "order-processing",
	SearchAttributes: map[string]interface{}{
		"TeamName":       "payments-eng",
		"ServiceTier":    "tier-1",
		"SLOTargetMs":    int64(5000), // 5s SLO for this workflow type
		"CustomerRegion": req.Region,
	},
}

// Query for SLO violations directly from Temporal visibility:
//   temporal workflow list \
//     --query 'TeamName = "payments-eng" AND ExecutionDuration > "5s"'
//
// This query is owned and run by the payments-eng team.
// No platform team involvement required.
```
Incident Ownership via Runbook Coverage
A workflow that has no runbook is a workflow whose incidents will be routed to whoever gets paged first — which is usually the platform team, because they own the alerting infrastructure. Breaking this pattern requires application teams to write and own runbooks for their workflow types before go-live.
The minimum runbook for any production workflow type covers four questions:
- How do I determine whether a stuck workflow is stuck due to an infrastructure problem or an application bug?
- What is the safe remediation path for the three most common failure modes for this workflow type?
- Under what conditions is it safe to terminate a stuck workflow and restart it?
- Who is the on-call owner for this workflow type and how do I reach them?
These questions cannot be answered by the platform team. They require domain knowledge of the workflow’s business logic. Requiring runbook coverage before a new workflow type goes to production is one of the highest-leverage reliability interventions an engineering organization can implement.
The Organizational Failure Modes That Undo This
The technical primitives are the easy part. The harder part is the organizational patterns that pull reliability ownership back into the platform team even after you have established the separation. Here are the three most common ones.
1. The Escalation Default
An application team hits a Temporal error they have not seen before. Instead of consulting the documentation or their team’s own runbook, they open a ticket with the platform team. The platform team investigates, finds it’s an application-layer issue (missing retry policy, non-deterministic code), and hands it back. This cycle repeats until the application team builds enough Temporal literacy to self-serve — or until they give up and accept that platform will always be in the loop.
The fix is not better documentation. It is requiring application teams to do a structured Temporal failure mode analysis before escalating — and tracking escalations that turned out to be application-layer issues as a team metric.
2. The Shared Namespace Trap
All workflow types across all application teams run in a single Temporal namespace. One team’s workflow starts generating a retry storm. Worker capacity is consumed. Other teams’ workflows begin experiencing ScheduleToStartTimeout failures. Everyone’s reliability is now coupled through shared infrastructure despite the logical separation.
The fix is namespace isolation. Give each team or service tier its own namespace and worker pool. The platform team provides namespaces. Application teams own everything inside their namespace. A retry storm in one team’s namespace does not affect another’s.
```bash
# Namespace-per-team isolation pattern
# Platform team provisions:
temporal operator namespace create --namespace payments-prod.acme
temporal operator namespace create --namespace operations-prod.acme
temporal operator namespace create --namespace agents-prod.acme

# Each team runs their own worker pool against their namespace:
# payments-eng team:
TEMPORAL_NAMESPACE=payments-prod.acme ./payments-worker
# operations team:
TEMPORAL_NAMESPACE=operations-prod.acme ./operations-worker

# A retry storm in payments-prod has zero impact on operations-prod.
```
3. The Metrics Silo
Temporal cluster metrics (history shard latency, frontend service errors, persistence read/write latency) live in the platform team’s observability stack. Workflow-level metrics (workflow failure rate by type, activity retry rate by activity name) are not tracked by anyone because the application teams assume the platform team’s dashboards cover it.
The result: nobody has visibility into application-layer reliability. The first signal is an incident.
The fix is a clear metric ownership boundary. Cluster metrics belong to the platform team. Workflow metrics belong to the application team. Application teams must instrument their workflows before go-live and own their own dashboards and alerts. The platform team’s dashboards tell you the cluster is healthy. The application team’s dashboards tell you your workflows are behaving correctly. Both are necessary. Neither substitutes for the other.
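What the application team's side of that boundary looks like can be shown in miniature. In production, the numbers would come from SDK-emitted metrics (the Go SDK exposes a metrics handler, and interceptors can tag outcomes per workflow type); the standard-library sketch below, with invented names throughout, only illustrates the shape of the signal: failure rate keyed by workflow type, with an alert threshold the owning team chooses.

```go
package main

import "fmt"

// typeStats tracks per-workflow-type outcomes. In production these counts
// come from SDK metrics, not manual bookkeeping; the sketch illustrates
// the ownership boundary: the signal is keyed by workflow type.
type typeStats struct {
	completed int
	failed    int
}

type workflowMetrics struct {
	byType map[string]*typeStats
}

func newWorkflowMetrics() *workflowMetrics {
	return &workflowMetrics{byType: map[string]*typeStats{}}
}

func (m *workflowMetrics) record(workflowType string, failed bool) {
	s, ok := m.byType[workflowType]
	if !ok {
		s = &typeStats{}
		m.byType[workflowType] = s
	}
	if failed {
		s.failed++
	} else {
		s.completed++
	}
}

// failureRate returns the fraction of failed executions for a workflow type.
func (m *workflowMetrics) failureRate(workflowType string) float64 {
	s, ok := m.byType[workflowType]
	if !ok || s.completed+s.failed == 0 {
		return 0
	}
	return float64(s.failed) / float64(s.completed+s.failed)
}

func main() {
	m := newWorkflowMetrics()
	for i := 0; i < 95; i++ {
		m.record("OrderWorkflow", false)
	}
	for i := 0; i < 5; i++ {
		m.record("OrderWorkflow", true)
	}
	// The alert is owned by the application team, keyed on their workflow type.
	const threshold = 0.01
	rate := m.failureRate("OrderWorkflow")
	fmt.Printf("OrderWorkflow failure rate %.2f, page payments-eng: %v\n", rate, rate > threshold)
}
```

No line of this lives in the platform team's stack, which is the test of whether the ownership boundary is real.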
Application Reliability Checklist
For any workflow type going to production, this checklist should be signed off by the application team — not the platform team.
| # | Check | Why It Matters |
| --- | --- | --- |
| 1 | Explicit retry policy set on every activity, including NonRetryableErrorTypes | Prevents unlimited retries on business validation errors |
| 2 | StartToCloseTimeout set on every activity based on measured p99 duration | Prevents hung activities blocking workers indefinitely |
| 3 | HeartbeatTimeout set on any activity expected to run longer than 10 seconds | Enables detection of crashed workers without waiting for full timeout |
| 4 | Workflow code reviewed for determinism: no time.Now(), rand, or direct I/O | Prevents non-determinism errors on replay after worker restart |
| 5 | workflow.GetVersion() used for any code changes to in-flight workflow types | Prevents replay failures for existing in-flight executions post-deploy |
| 6 | Compensation logic implemented for all money-moving or state-mutating activities | Ensures partial failures result in a clean known state, not orphaned data |
| 7 | Custom search attributes set at workflow start for team, tier, and business context | Enables self-service debugging without routing through platform team |
| 8 | Workflow-level alerts configured on temporal_workflow_failed for this workflow type | Application team is paged for their own incidents, not platform team |
| 9 | Runbook written covering top 3 failure modes and safe remediation steps | On-call engineers know what to do without escalating at 3 AM |
| 10 | Namespace isolation confirmed: workflow runs in team-owned namespace | Retry storms from other teams do not affect this workflow’s SLA |
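Item 5 on the checklist, workflow.GetVersion(), is the least intuitive entry, so here is a simplified, standard-library-only model of how it works; the `history` type and `getVersion` function are invented for illustration, not the SDK's implementation. The real API records a version marker in event history the first time a versioned branch is reached, so executions started before a deploy replay the old code path while fresh executions take the new one.

```go
package main

import "fmt"

const defaultVersion = -1 // models workflow.DefaultVersion

// history records version markers by change ID, mimicking how
// workflow.GetVersion persists its decision in event history.
type history struct {
	versions  map[string]int
	replaying bool
}

// getVersion returns the recorded version for changeID if one exists.
// On replay with no marker, the execution predates the change and gets
// defaultVersion; a fresh execution records and returns maxSupported.
func (h *history) getVersion(changeID string, maxSupported int) int {
	if v, ok := h.versions[changeID]; ok {
		return v
	}
	if h.replaying {
		return defaultVersion // old execution: marker absent from history
	}
	h.versions[changeID] = maxSupported
	return maxSupported
}

func orderWorkflow(h *history) string {
	// New code deployed behind a versioned branch:
	if h.getVersion("add-fraud-check", 1) == defaultVersion {
		return "charge" // old path for in-flight executions
	}
	return "fraud-check,charge" // new path for fresh executions
}

func main() {
	// An execution started before the deploy keeps the old behavior:
	oldExec := &history{versions: map[string]int{}, replaying: true}
	fmt.Println(orderWorkflow(oldExec)) // charge

	// A fresh execution records version 1 and takes the new path:
	newExec := &history{versions: map[string]int{}}
	fmt.Println(orderWorkflow(newExec)) // fraud-check,charge
}
```

Shipping a code change to an in-flight workflow type without this guard is the single most common way teams hit their first non-determinism incident.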
Closing Thoughts
The promise of decoupling reliability from infrastructure ownership is not that your platform team does less. It is that your application teams can move faster without creating reliability debt they cannot see or fix. When the four primitives — retry policies, timeouts, heartbeating, and determinism — are treated as first-class code artifacts owned by the teams that write the workflows, reliability stops being a queue and starts being a property.
The organizational failure modes are harder to fix than the technical ones. Namespace isolation and metric ownership boundaries require conversations across team lines. Runbook requirements require process changes to your go-live criteria. These changes have friction. But the alternative — an on-call rotation where every third page goes to the platform team for a problem that only an application engineer can actually fix — has more friction and more cost.
If your team is operating Temporal in production and the reliability ownership lines are still blurry, the checklist above is a starting point. If you want a structured review of how your current workflow designs map against these primitives, that is exactly the kind of work Xgrid’s Temporal Practice does as part of the 90-Day Production Health Check.
Is your engineering team still the bottleneck for workflow reliability?
If every “why did this workflow fail?” question gets routed through your platform team before anyone looks at application code, that’s an ownership gap, not a Temporal limitation. It’s also one of the fastest things to fix with the right primitives and the right team contract in place.
Xgrid offers two entry points depending on where you are:
- Temporal 90-Day Production Health Check — we audit your current workflow designs against the reliability primitives above, identify where ownership is unclear, and give you a concrete fix list ranked by risk.
- Temporal Reliability Partner — for teams that want a named Temporal expert embedded long-term to establish the application-layer reliability practice, run design reviews before new workflows go live, and mentor engineers on the primitives that matter.
Both are fixed-scope. No open-ended retainer required to get started.

