How to Decouple Reliability from Infrastructure Ownership

Why your workflow reliability should not be hostage to whether your platform team has capacity this sprint — and how Temporal’s execution model makes that separation possible

TL;DR

Most teams treat reliability as an infrastructure problem: better hardware, more redundancy, stricter SLAs on the cluster. That framing puts reliability in the wrong hands. When workflow durability is baked into the execution model — not bolted onto the infrastructure — your application teams own their reliability posture directly, without waiting on the platform. This post explains how Temporal’s architecture enables that separation, what it takes to implement it correctly, and the organizational failure modes that undo it.

The Reliability Trap Most Engineering Orgs Walk Into

Here is a pattern that plays out in almost every engineering organization that scales past 50 engineers. Early on, reliability is everyone’s problem in an informal way — the team that owns the service owns its uptime. Then incidents accumulate, on-call rotations burn people out, and leadership decides to formalize. A Platform Engineering team is formed. SREs are hired. Monitoring is centralized. Runbooks are written.

This is the right instinct. But it creates a side effect that nobody plans for: reliability becomes a service that application teams request from the platform team, rather than a property that application teams own in their code.

The result looks like this in practice:

  • An application team’s workflow starts failing intermittently. They open a ticket with the platform team.
  • The platform team investigates the infrastructure: cluster health, database load, network latency. Everything looks fine.
  • The root cause turns out to be a missing retry policy and no timeout on a third-party API call — an application-level design decision, not an infrastructure problem.
  • Two weeks have passed. The incident is closed. The underlying design gap remains.

This is not a process failure. It is a structural one. When reliability lives in the infrastructure layer, the people who can actually fix most reliability problems — the application engineers who write workflow and activity code — are one escalation removed from the problem.

The Core Distinction

Infrastructure reliability means your cluster stays up. Application reliability means your workflows complete correctly under failure conditions. These are different properties, owned by different teams, fixed by different interventions. Conflating them is the root cause of most Temporal production incidents we see.

How Temporal’s Architecture Enables the Separation

Temporal’s durable execution model is the technical foundation that makes this separation possible. Understanding it precisely is what separates teams that successfully decouple reliability from those that simply move the same problems to a new layer.

What Temporal Actually Guarantees

Temporal guarantees that the orchestration of your workflow — the sequence of activity calls, the retry decisions, the timer firings — will survive any worker crash. It does this by persisting every command your workflow issues to an append-only event history in the Temporal cluster’s persistence layer. When a worker crashes and restarts, it replays the workflow function against the recorded event history to reconstruct the state, then continues from where execution left off.

This guarantee is infrastructure-level: it is provided by Temporal regardless of what your workflow code does. The cluster going down does not lose your workflow state. A worker pod being evicted does not drop an in-flight execution. These are properties of the platform.

What Temporal Does Not Guarantee

Temporal does not guarantee that your workflow will complete correctly. That guarantee is the responsibility of the application team. Specifically:

  • Temporal will retry a failed activity — but only if you configured a retry policy. An activity with no retry policy and a transient network failure will fail permanently.
  • Temporal will replay your workflow after a worker crash — but only if your workflow code is deterministic. A non-deterministic workflow will throw a non-determinism error on replay and get stuck.
  • Temporal will durably track a long-running workflow — but if that workflow has no heartbeat on its activities, a crashed activity will not be detected until its StartToCloseTimeout expires, which could be hours if misconfigured.
  • Temporal will execute your compensation logic — but only if you wrote it. A workflow with no compensation for partial failures leaves money on the table, literally.

Every item in that list is an application-layer decision. The platform team cannot fix them. The cluster SLA cannot prevent them. They require application engineers to understand Temporal’s primitives and apply them correctly in code.

The Separation in One Sentence

Temporal makes workflow state durable. Your application code makes workflow behavior correct. These are two different layers, and owning the second one does not require owning the first.

The Four Primitives That Application Teams Must Own

Decoupling reliability from infrastructure ownership means application teams taking direct ownership of four specific Temporal primitives. None of these require platform team involvement. All of them are code-level decisions.

1. Retry Policies

Every activity that touches an external system — a payment gateway, a third-party API, a database — needs an explicit retry policy. Not a default, not an inherited policy, an explicit one that reflects the failure characteristics of that specific dependency.

// The wrong way: no retry policy, accepting Temporal defaults

ao := workflow.ActivityOptions{
    StartToCloseTimeout: 30 * time.Second,
    // No RetryPolicy specified — Temporal will use defaults:
    // unlimited retries, 1s initial interval, 2x backoff, 100s max interval.
    // This is almost never what you want for a payment activity.
}

// The right way: explicit policy matched to the dependency

ao := workflow.ActivityOptions{
    StartToCloseTimeout: 30 * time.Second,
    RetryPolicy: &temporal.RetryPolicy{
        InitialInterval:    500 * time.Millisecond,
        BackoffCoefficient: 2.0,
        MaximumInterval:    30 * time.Second,
        MaximumAttempts:    4,
        // Non-retryable: business errors that retrying cannot fix
        NonRetryableErrorTypes: []string{
            "ErrInsufficientFunds",
            "ErrCardExpired",
            "ErrInvalidAccountNumber",
        },
    },
}

The NonRetryableErrorTypes field is particularly important. Without it, Temporal will retry on every error type, including business validation errors that retrying will never fix. This burns retry budget, delays failure detection, and makes debugging harder. Application engineers who own the activity own this decision — the platform team cannot know which errors are retryable for a given business operation.

2. Timeouts

Temporal has four distinct timeout types and they address different failure scenarios. Using them correctly is application-layer work.

Timeout                | What It Protects Against                              | Who Sets It
-----------------------|-------------------------------------------------------|--------------------------------------------------
ScheduleToStartTimeout | Worker pool starvation — task sits in queue too long  | Application team (reflects SLO)
StartToCloseTimeout    | Activity execution hung or crashed                    | Application team (reflects max expected duration)
ScheduleToCloseTimeout | Total time budget for activity including all retries  | Application team (reflects business deadline)
HeartbeatTimeout       | Long-running activity stopped reporting progress      | Application team (reflects heartbeat interval)

Every timeout in that table is set by the application team. The platform team cannot know your business SLO for a payment activity, or how long a file processing activity is expected to run. These are domain decisions embedded in code.

3. Heartbeating for Long-Running Activities

An activity that runs for more than a few seconds without heartbeating is invisible to Temporal. If the worker executing it crashes, Temporal has no way to know until the StartToCloseTimeout expires. For long-running operations — file uploads, batch processing, AI inference — this can mean minutes or hours of silent failure.

// A long-running activity that heartbeats correctly

func (a *ProcessingActivities) ProcessLargeFile(
    ctx context.Context,
    req FileProcessingRequest,
) (FileProcessingResult, error) {
    records, err := loadRecords(req.FileURL)
    if err != nil {
        return FileProcessingResult{}, err
    }

    // Resume from checkpoint if this is a retry
    startFrom := 0
    if activity.HasHeartbeatDetails(ctx) {
        _ = activity.GetHeartbeatDetails(ctx, &startFrom)
    }

    for i := startFrom; i < len(records); i++ {
        if err := processRecord(records[i]); err != nil {
            return FileProcessingResult{}, err
        }

        // Heartbeat every 100 records with progress checkpoint
        if i%100 == 0 {
            activity.RecordHeartbeat(ctx, i)
            // Cancellation is delivered via the context after a heartbeat
            if ctx.Err() != nil {
                return FileProcessingResult{}, ctx.Err()
            }
        }
    }

    return FileProcessingResult{ProcessedCount: len(records)}, nil
}

The heartbeat here does two things: it tells Temporal the activity is still alive (so a crashed worker is detected within HeartbeatTimeout, not StartToCloseTimeout), and it checkpoints progress so a retry can resume from where it left off rather than reprocessing from the beginning. Both are application-layer concerns. Neither requires platform intervention.

4. Workflow Determinism

This is the primitive that most teams underinvest in until they hit their first non-determinism error in production. Workflow code must be deterministic: given the same event history, the workflow function must always issue the same sequence of commands. The rules are straightforward but require active discipline:

// Non-deterministic: breaks replay

func BadWorkflow(ctx workflow.Context, req Request) error {
    if time.Now().Weekday() == time.Saturday {  // ✗ non-deterministic
        return workflow.ExecuteActivity(ctx, WeekendActivity, req).Get(ctx, nil)
    }
    return workflow.ExecuteActivity(ctx, WeekdayActivity, req).Get(ctx, nil)
}

// Deterministic: replay-safe

func GoodWorkflow(ctx workflow.Context, req Request) error {
    if workflow.Now(ctx).Weekday() == time.Saturday {  // ✓ replay-safe
        return workflow.ExecuteActivity(ctx, WeekendActivity, req).Get(ctx, nil)
    }
    return workflow.ExecuteActivity(ctx, WeekdayActivity, req).Get(ctx, nil)
}

Determinism is enforced at code review time, not at runtime. The platform team cannot catch a non-determinism bug in your workflow code during infrastructure monitoring. By the time it surfaces — as a stuck workflow in production — it is already an incident. This is why application teams must internalize the determinism rules, not treat them as a platform concern.

Building a Reliability Contract Between Teams

Decoupling does not mean the platform team disappears from the reliability conversation. It means the responsibilities are clearly partitioned so that each team owns what they can actually control. The following contract formalizes that partition.

Platform Team Owns                        | Application Team Owns
------------------------------------------|------------------------------------------
Temporal cluster uptime and SLA           | Retry policies for every activity
Persistence layer health and capacity     | Timeout values matched to business SLOs
Worker infrastructure scaling             | Heartbeating on long-running activities
Namespace provisioning and access control | Workflow determinism and versioning
TLS certificate management                | Compensation logic for partial failures
Cluster-level metrics and alerting        | Workflow-level metrics and alerting
Temporal SDK version upgrades             | Non-retryable error classification
Disaster recovery and backup              | Runbooks for workflow-specific incidents

The right column is entirely code. None of it requires a ticket to the platform team, a cluster configuration change, or an infrastructure deployment. Application teams can audit, test, and improve every item in that column on their own sprint cycle.

Making Application-Layer Reliability Auditable

Ownership without visibility is not ownership — it is just blame assignment. For application teams to genuinely own their reliability posture, they need tooling that surfaces the state of their reliability primitives at any point in time. Here is what that looks like in practice.

Automated Workflow Design Audits

Write a linter that checks every workflow and activity registration against a minimum reliability standard. This runs in CI and blocks merges that introduce reliability regressions.

// tools/temporal-lint/main.go
// Simplified example of a workflow design audit tool

type ActivityConfig struct {
    Name                   string
    HasRetryPolicy         bool
    HasStartToCloseTimeout bool
    HasHeartbeatTimeout    bool
    NonRetryableErrors     []string
}

type AuditResult struct {
    Activity ActivityConfig
    Findings []string
}

func AuditActivityOptions(opts workflow.ActivityOptions, name string) AuditResult {
    result := AuditResult{Activity: ActivityConfig{Name: name}}

    if opts.RetryPolicy == nil {
        result.Findings = append(result.Findings,
            "[CRITICAL] No retry policy set. Temporal defaults apply (unlimited retries).")
    } else if len(opts.RetryPolicy.NonRetryableErrorTypes) == 0 {
        result.Findings = append(result.Findings,
            "[WARNING] No NonRetryableErrorTypes defined. All errors will be retried.")
    }

    if opts.StartToCloseTimeout == 0 {
        result.Findings = append(result.Findings,
            "[CRITICAL] StartToCloseTimeout not set. Activity may hang indefinitely.")
    }

    if opts.HeartbeatTimeout == 0 {
        result.Findings = append(result.Findings,
            "[WARNING] HeartbeatTimeout not set. Crashed workers not detected until StartToCloseTimeout.")
    }

    return result
}

Workflow-Level SLO Tracking with Custom Search Attributes

Temporal’s custom search attributes let you embed business context into every workflow execution. Combined with visibility queries, this gives application teams a live view of their workflow SLO health — without routing through the platform team’s monitoring stack.

// Embed SLO context at workflow start

opts := client.StartWorkflowOptions{
    ID:        req.OrderID,
    TaskQueue: "order-processing",
    SearchAttributes: map[string]interface{}{
        "TeamName":       "payments-eng",
        "ServiceTier":    "tier-1",
        "SLOTargetMs":    int64(5000),   // 5s SLO for this workflow type
        "CustomerRegion": req.Region,
    },
}

// Query for SLO violations directly from Temporal visibility
// (ExecutionDuration accepts duration strings in list filters):
//
// temporal workflow list \
//   --query 'TeamName = "payments-eng" AND ExecutionDuration > "5s"'
//
// This query is owned and run by the payments-eng team.
// No platform team involvement required.

Incident Ownership via Runbook Coverage

A workflow that has no runbook is a workflow whose incidents will be routed to whoever gets paged first — which is usually the platform team, because they own the alerting infrastructure. Breaking this pattern requires application teams to write and own runbooks for their workflow types before go-live.

The minimum runbook for any production workflow type covers four questions:

  • How do I determine whether a stuck workflow is stuck due to an infrastructure problem or an application bug?
  • What is the safe remediation path for the three most common failure modes for this workflow type?
  • Under what conditions is it safe to terminate a stuck workflow and restart it?
  • Who is the on-call owner for this workflow type and how do I reach them?

These questions cannot be answered by the platform team. They require domain knowledge of the workflow’s business logic. Requiring runbook coverage before a new workflow type goes to production is one of the highest-leverage reliability interventions an engineering organization can implement.

The Organizational Failure Modes That Undo This

The technical primitives are the easy part. The harder part is the organizational patterns that pull reliability ownership back into the platform team even after you have established the separation. Here are the three most common ones.

1. The Escalation Default

An application team hits a Temporal error they have not seen before. Instead of consulting the documentation or their team’s own runbook, they open a ticket with the platform team. The platform team investigates, finds it’s an application-layer issue (missing retry policy, non-deterministic code), and hands it back. This cycle repeats until the application team builds enough Temporal literacy to self-serve — or until they give up and accept that platform will always be in the loop.

The fix is not better documentation. It is requiring application teams to do a structured Temporal failure mode analysis before escalating — and tracking escalations that turned out to be application-layer issues as a team metric.

2. The Shared Namespace Trap

All workflow types across all application teams run in a single Temporal namespace. One team’s workflow starts generating a retry storm. Worker capacity is consumed. Other teams’ workflows begin experiencing ScheduleToStartTimeout failures. Everyone’s reliability is now coupled through shared infrastructure despite the logical separation.

The fix is namespace isolation. Give each team or service tier its own namespace and worker pool. The platform team provides namespaces. Application teams own everything inside their namespace. A retry storm in one team’s namespace does not affect another’s.

# Namespace-per-team isolation pattern

# Platform team provisions:
temporal operator namespace create --namespace payments-prod.acme
temporal operator namespace create --namespace operations-prod.acme
temporal operator namespace create --namespace agents-prod.acme

# Each team runs their own worker pool against their namespace:

# payments-eng team:
TEMPORAL_NAMESPACE=payments-prod.acme ./payments-worker

# operations team:
TEMPORAL_NAMESPACE=operations-prod.acme ./operations-worker

# A retry storm in payments-prod has zero impact on operations-prod.

3. The Metrics Silo

Temporal cluster metrics (history shard latency, frontend service errors, persistence read/write latency) live in the platform team’s observability stack. Workflow-level metrics (workflow failure rate by type, activity retry rate by activity name) are not tracked by anyone because the application teams assume the platform team’s dashboards cover it.

The result: nobody has visibility into application-layer reliability. The first signal is an incident.

The fix is a clear metric ownership boundary. Cluster metrics belong to the platform team. Workflow metrics belong to the application team. Application teams must instrument their workflows before go-live and own their own dashboards and alerts. The platform team’s dashboards tell you the cluster is healthy. The application team’s dashboards tell you your workflows are behaving correctly. Both are necessary. Neither substitutes for the other.

Application Reliability Checklist

For any workflow type going to production, this checklist should be signed off by the application team — not the platform team.

#  | Check                                                                          | Why It Matters
---|--------------------------------------------------------------------------------|------------------------------------------------------------------------
1  | Explicit retry policy set on every activity, including NonRetryableErrorTypes  | Prevents unlimited retries on business validation errors
2  | StartToCloseTimeout set on every activity based on measured p99 duration       | Prevents hung activities blocking workers indefinitely
3  | HeartbeatTimeout set on any activity expected to run longer than 10 seconds    | Enables detection of crashed workers without waiting for full timeout
4  | Workflow code reviewed for determinism: no time.Now(), rand, or direct I/O     | Prevents non-determinism errors on replay after worker restart
5  | workflow.GetVersion() used for any code changes to in-flight workflow types    | Prevents replay failures for existing in-flight executions post-deploy
6  | Compensation logic implemented for all money-moving or state-mutating activities | Ensures partial failures result in a clean known state, not orphaned data
7  | Custom search attributes set at workflow start for team, tier, and business context | Enables self-service debugging without routing through platform team
8  | Workflow-level alerts configured on temporal_workflow_failed for this workflow type | Application team is paged for their own incidents, not platform team
9  | Runbook written covering top 3 failure modes and safe remediation steps        | On-call engineers know what to do without escalating at 3 AM
10 | Namespace isolation confirmed: workflow runs in team-owned namespace           | Retry storms from other teams do not affect this workflow's SLA

Closing Thoughts

The promise of decoupling reliability from infrastructure ownership is not that your platform team does less. It is that your application teams can move faster without creating reliability debt they cannot see or fix. When the four primitives — retry policies, timeouts, heartbeating, and determinism — are treated as first-class code artifacts owned by the teams that write the workflows, reliability stops being a queue and starts being a property.

The organizational failure modes are harder to fix than the technical ones. Namespace isolation and metric ownership boundaries require conversations across team lines. Runbook requirements require process changes to your go-live criteria. These changes have friction. But the alternative — an on-call rotation where every third page goes to the platform team for a problem that only an application engineer can actually fix — has more friction and more cost.

If your team is operating Temporal in production and the reliability ownership lines are still blurry, the checklist above is a starting point. If you want a structured review of how your current workflow designs map against these primitives, that is exactly the kind of work Xgrid’s Temporal Practice does as part of the 90-Day Production Health Check.

Is your engineering team still the bottleneck for workflow reliability?

If every “why did this workflow fail?” question gets routed through your platform team before anyone looks at application code, that’s an ownership gap, not a Temporal limitation. It’s also one of the fastest things to fix with the right primitives and the right team contract in place.

Xgrid offers two entry points depending on where you are:

  • Temporal 90-Day Production Health Check — we audit your current workflow designs against the reliability primitives above, identify where ownership is unclear, and give you a concrete fix list ranked by risk.
  • Temporal Reliability Partner — for teams that want a named Temporal expert embedded long-term to establish the application-layer reliability practice, run design reviews before new workflows go live, and mentor engineers on the primitives that matter.

Both are fixed-scope. No open-ended retainer required to get started.
