
Migrating from On-Prem Workflows to Cloud Without Workflow Downtime

How to move long-running Temporal workflows from self-hosted infrastructure to Temporal Cloud — with zero dropped executions, no frozen state, and no maintenance window

TL;DR

Moving from a self-hosted Temporal cluster to Temporal Cloud is not a lift-and-shift. In-flight workflows cannot be exported, imported, or paused mid-execution. The only safe migration path is a traffic-shifting strategy: new workflows start on the cloud namespace while existing workflows run to completion on the on-prem cluster. This post walks through exactly how to execute that migration — including namespace setup, worker dual-registration, task queue routing, and the observability signals that tell you when it is safe to decommission on-prem.

Why This Migration Is Harder Than It Looks

On the surface, migrating Temporal from on-prem to cloud looks like a standard infrastructure move: provision a new cluster, point your workers at it, done. Teams that approach it this way usually discover the hard way that Temporal’s durability guarantees — the very thing that makes it valuable — are what make migration non-trivial.

Here is the core constraint you are working around:

The Fundamental Constraint

Temporal workflow state lives entirely inside the Temporal cluster’s persistence layer. There is no supported mechanism to export an in-flight workflow’s event history from one cluster and import it into another. In-flight workflows must run to completion on the cluster where they were started.

This means that on the day you want to cut over to Temporal Cloud, you cannot simply redirect your workers. Any workflow that started on your on-prem cluster — whether it has been running for 30 seconds or 30 days — must complete there. The migration window is therefore not a single cutover event. It is a dual-running period whose length is determined by the lifespan of your longest-running workflows.

For teams with short-lived workflows (seconds to minutes), this is manageable. For teams running payment dispute workflows, multi-day approval chains, or AI agent tasks that run for hours, this dual-running period can stretch to days or weeks. That has infrastructure cost, operational complexity, and risk implications that need to be planned for explicitly.

What Are We Actually Migrating?

Before writing a single line of migration code, map exactly what needs to move. Temporal has more moving parts than most teams account for at migration time.

1. The Temporal Cluster Itself

Self-hosted Temporal consists of multiple services: Frontend, History, Matching, and Worker services, backed by a persistence layer (typically Cassandra or PostgreSQL) and optionally Elasticsearch for visibility. Temporal Cloud replaces all of this with a managed namespace. You lose direct access to the persistence layer, gain SLA-backed uptime, and hand off all cluster operations to Temporal.

2. Workers

Your workers are your application code. They register workflow and activity implementations with a specific Temporal Frontend address and poll specific task queues. During migration, workers need to be able to connect to both the on-prem cluster and Temporal Cloud simultaneously, or you need to run separate worker pools for each.

3. Namespaces

Namespaces are the isolation boundary in Temporal. Your on-prem cluster probably has one or more namespaces (e.g. payments-prod, operations-prod). Each of these needs a corresponding namespace provisioned in Temporal Cloud. Namespace names do not need to match, but mapping them explicitly in your migration plan avoids confusion.

4. Workflow ID Conventions

Temporal Cloud and your on-prem cluster are separate systems. The same workflow ID can exist in both simultaneously without conflict. This is useful — you can use identical workflow ID schemes on both clusters — but it also means you need a clear way to determine which cluster a given workflow ID lives on when debugging incidents during the migration window.

5. Search Attributes

Custom search attributes defined on your on-prem cluster do not exist on Temporal Cloud by default. They must be explicitly provisioned on the cloud namespace before workers start routing traffic there. Missing search attributes are a common source of silent failures post-cutover.
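Provisioning can be scripted with the temporal CLI before any traffic moves. The attribute name below is illustrative, not a Temporal default; depending on your Temporal Cloud permissions you may need to create cloud-side attributes via tcld or the Cloud UI instead.

```shell
# List custom search attributes on the on-prem namespace so the schema
# can be compared against cloud (names AND types must both match).
temporal operator search-attribute list --namespace payments-prod

# Recreate each attribute on the cloud namespace with the same type.
# "MerchantId" / Keyword is an example, not a built-in attribute.
temporal operator search-attribute create \
  --namespace payments-prod.your-account-id \
  --name MerchantId \
  --type Keyword
```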

6. Schedules and Cron Workflows

Any Temporal Schedules or cron-based workflows running on the on-prem cluster need to be explicitly recreated on the cloud namespace. They will not automatically migrate, and they will not automatically stop running on-prem. If you forget this step, you end up with duplicate scheduled executions firing from both clusters simultaneously.

Migration Inventory Checklist

Before starting: list every namespace, every registered workflow type, every custom search attribute, every active Schedule, and your longest-running workflow’s expected duration. This inventory is your migration exit criteria. You are not done until every item on this list has been verified on Temporal Cloud.
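Most of this inventory can be pulled straight from the on-prem cluster with the temporal CLI. A sketch, assuming a namespace named payments-prod:

```shell
# Every namespace on the on-prem cluster
temporal operator namespace list

# In-flight workflows (and their types) on each namespace
temporal workflow list --namespace payments-prod \
  --query 'ExecutionStatus = "Running"'

# Custom search attributes that must be re-provisioned on cloud
temporal operator search-attribute list --namespace payments-prod

# Active Schedules that must be recreated (and later deleted on-prem)
temporal schedule list --namespace payments-prod
```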

The Traffic-Shifting Migration Strategy

The only downtime-free migration path is traffic shifting: you keep your on-prem cluster running for existing workflows while gradually routing new workflow starts to Temporal Cloud. Here is the full sequence.

Phase 1: Provision and Validate the Cloud Namespace (Week 1)

Before a single production workflow touches Temporal Cloud, validate the namespace thoroughly in a staging environment.

At minimum: start a synthetic workflow end-to-end through the cloud namespace from a staging worker, confirm mTLS connectivity with your client certificates, verify every custom search attribute from your inventory is provisioned and queryable, and confirm the task queue names match what your workers poll. Nothing in this phase should touch production traffic.

Phase 2: Dual-Register Workers (Week 2)

This is the core of the migration. You modify your workers to connect to both the on-prem cluster and Temporal Cloud, polling task queues on both. New workflows start on the cloud namespace. Existing workflows complete on on-prem.

The cleanest implementation uses an environment-driven connection factory:

// worker/connection.go
package worker

import (
	"crypto/tls"
	"fmt"
	"os"

	"go.temporal.io/sdk/client"
)

type ClusterTarget string

const (
	TargetOnPrem ClusterTarget = "on-prem"
	TargetCloud  ClusterTarget = "cloud"
)

func NewTemporalClient(target ClusterTarget) (client.Client, error) {
	switch target {
	case TargetOnPrem:
		return client.Dial(client.Options{
			HostPort:  os.Getenv("TEMPORAL_ONPREM_HOST"), // e.g. 10.0.1.50:7233
			Namespace: os.Getenv("TEMPORAL_ONPREM_NAMESPACE"),
		})
	case TargetCloud:
		cert, err := tls.LoadX509KeyPair(
			os.Getenv("TEMPORAL_CLOUD_CERT_PATH"),
			os.Getenv("TEMPORAL_CLOUD_KEY_PATH"),
		)
		if err != nil {
			return nil, fmt.Errorf("loading cloud TLS cert: %w", err)
		}
		return client.Dial(client.Options{
			HostPort:  os.Getenv("TEMPORAL_CLOUD_HOST"), // e.g. payments-prod.your-account-id.tmprl.cloud:7233
			Namespace: os.Getenv("TEMPORAL_CLOUD_NAMESPACE"),
			ConnectionOptions: client.ConnectionOptions{
				TLS: &tls.Config{Certificates: []tls.Certificate{cert}},
			},
		})
	default:
		return nil, fmt.Errorf("unknown cluster target: %s", target)
	}
}

Then run both worker pools in parallel, each polling the same task queue name on their respective cluster:

// cmd/worker/main.go
package main

import (
	"log"

	"go.temporal.io/sdk/worker"
)

func main() {
	onPremClient, err := NewTemporalClient(TargetOnPrem)
	if err != nil {
		log.Fatal(err)
	}
	defer onPremClient.Close()

	cloudClient, err := NewTemporalClient(TargetCloud)
	if err != nil {
		log.Fatal(err)
	}
	defer cloudClient.Close()

	// On-prem worker: drains existing in-flight workflows.
	onPremWorker := worker.New(onPremClient, "payments", worker.Options{
		MaxConcurrentWorkflowTaskExecutionSize: 100,
	})
	onPremWorker.RegisterWorkflow(PaymentSagaWorkflow)
	onPremWorker.RegisterActivity(&PaymentActivities{})

	// Cloud worker: handles all new workflow starts.
	cloudWorker := worker.New(cloudClient, "payments", worker.Options{
		MaxConcurrentWorkflowTaskExecutionSize: 100,
	})
	cloudWorker.RegisterWorkflow(PaymentSagaWorkflow)
	cloudWorker.RegisterActivity(&PaymentActivities{})

	// Start both workers concurrently.
	if err := onPremWorker.Start(); err != nil {
		log.Fatal(err)
	}
	if err := cloudWorker.Start(); err != nil {
		log.Fatal(err)
	}

	// Block until shutdown signal.
	<-worker.InterruptCh()
}

Phase 3: Route New Workflow Starts to Cloud (Week 2–3)

With both worker pools running, update your workflow starter code to direct new workflow starts to the cloud client. Existing workflows are unaffected — they continue executing on the on-prem worker pool.

// starter/payment.go
package starter

import (
	"context"

	"go.temporal.io/sdk/client"
)

type WorkflowStarter struct {
	onPremClient client.Client // kept alive for signals/queries to in-flight workflows
	cloudClient  client.Client // all new starts go here
}

func (s *WorkflowStarter) StartPayment(ctx context.Context, req PaymentRequest) error {
	opts := client.StartWorkflowOptions{
		ID:        req.PaymentID,
		TaskQueue: "payments",
	}
	// All new payment workflows start on cloud.
	_, err := s.cloudClient.ExecuteWorkflow(ctx, opts, PaymentSagaWorkflow, req)
	return err
}

// Sending a signal to an in-flight workflow still on on-prem.
func (s *WorkflowStarter) SignalOnPremWorkflow(ctx context.Context, workflowID, signal string, payload interface{}) error {
	return s.onPremClient.SignalWorkflow(ctx, workflowID, "", signal, payload)
}

Important: Keep the On-Prem Client Alive

Even after all new starts route to cloud, you still need the on-prem client for two things: sending signals or queries to workflows that are still running on-prem, and querying workflow status for in-flight workflows during support or incident response. Do not shut down the on-prem client until you have confirmed zero open workflows on the on-prem namespace.

Phase 4: Drain On-Prem Workflows

This phase is passive — you wait. The drain period is determined by the longest-running workflow type in your system. The goal is to reach zero open workflows on the on-prem namespace. Monitor this actively:

# Check open workflow count on the on-prem namespace
temporal workflow list \
  --namespace your-onprem-namespace \
  --query 'ExecutionStatus = "Running"' \
  --limit 1000

# When this returns zero results, the namespace is drained.
# Do not decommission until this has returned zero for 48 hours
# to account for any workflows that may be in a scheduled/delayed state.

For long-running workflows (days or weeks), you can accelerate the drain by identifying workflow types that are safe to signal to completion, or by running a one-time reconciliation job that closes workflows that have been stuck or orphaned.

Phase 5: Decommission On-Prem

Only after the on-prem namespace has been at zero open workflows for a minimum of 48 hours is it safe to begin decommissioning. Do this in stages:

  • Stop the on-prem worker pool first. If any workflow tasks appear in the on-prem task queue after workers stop, that is a signal that a workflow was missed.
  • Keep the on-prem Temporal Frontend running in read-only mode for 2 weeks after worker shutdown. This preserves the ability to query historical workflow data for support and compliance.
  • Archive the on-prem persistence layer (Cassandra/PostgreSQL snapshot) before final teardown. Retain for your compliance retention period.
  • Shut down and deprovision on-prem infrastructure.

Handling the Hard Cases

Long-Running Workflows That Cannot Wait

Some workflow types simply cannot be left to drain naturally — a 90-day payment dispute workflow, for example, represents real business state that you cannot leave on aging infrastructure. For these, the options are:

  • Terminate and replay: if the workflow is idempotent and its state can be reconstructed from your system of record, terminate it on on-prem and start a new instance on cloud with the same business inputs. This only works if your workflows are designed around external state, not exclusively internal Temporal state.
  • Signal-driven checkpoint: design a signal handler into the workflow that, when received, causes it to checkpoint its state to an external store (your database), complete on on-prem, and trigger a new workflow instance on cloud to resume from the checkpoint. This requires forethought in workflow design but is the cleanest pattern for complex migrations.
  • Accept the drain period: for most teams, accepting a 2–4 week dual-running period is the lowest-risk option. Infrastructure cost during this window is real but bounded.

Schedules and Cron Workflows

The risk here is duplicate execution: the on-prem Schedule fires, the cloud Schedule fires, and the same business operation runs twice. The safe sequence is:

  • Pause all on-prem Schedules before creating their cloud equivalents.
  • Create and validate the cloud Schedule equivalents in a paused state.
  • On the cutover day, unpause cloud Schedules and delete on-prem Schedules within the same deployment window.
  • Never run both simultaneously, even briefly.

# Step 1: Pause on-prem schedule
temporal schedule pause \
  --namespace your-onprem-namespace \
  --schedule-id daily-reconciliation

# Step 2: Create cloud schedule in paused state
temporal schedule create \
  --namespace payments-prod.your-account-id \
  --schedule-id daily-reconciliation \
  --workflow-type ReconciliationWorkflow \
  --task-queue payments \
  --cron-schedule '0 2 * * *' \
  --pause  # start paused

# Step 3: On cutover day, unpause cloud and delete on-prem
temporal schedule unpause \
  --namespace payments-prod.your-account-id \
  --schedule-id daily-reconciliation

temporal schedule delete \
  --namespace your-onprem-namespace \
  --schedule-id daily-reconciliation

Workflow Versioning Across the Migration

If you need to deploy new workflow code during the migration window — and you probably will — you must use `workflow.GetVersion()` consistently. New code deploying to both the on-prem and cloud worker pools simultaneously must be version-safe for both clusters, because in-flight workflows on both clusters will replay against the new code.

The rule of thumb: treat any code deployment during a migration window as if you are deploying to a cluster with the maximum possible number of in-flight workflows. Because you effectively are — you have two clusters worth of in-flight state to preserve.
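A sketch of the pattern in the Go SDK. The change ID, request type, and activities (RefundLegacy, RefundSplit) are placeholders, and this fragment assumes the go.temporal.io/sdk/workflow package:

```go
// refund.go (fragment; assumes go.temporal.io/sdk/workflow)
package payments

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// RefundWorkflow illustrates version-gating a change deployed during the
// migration window. Deployed identically to BOTH worker pools, so
// replays on either cluster take a deterministic branch.
func RefundWorkflow(ctx workflow.Context, req RefundRequest) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	v := workflow.GetVersion(ctx, "refund-split-fix", workflow.DefaultVersion, 1)
	if v == workflow.DefaultVersion {
		// Workflows on either cluster that recorded this marker before
		// the deploy replay deterministically down the old path.
		return workflow.ExecuteActivity(ctx, RefundLegacy, req).Get(ctx, nil)
	}
	// Fresh executions take the new path on both clusters.
	return workflow.ExecuteActivity(ctx, RefundSplit, req).Get(ctx, nil)
}
```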

Observability During the Migration Window

Running two clusters simultaneously means you need visibility into both simultaneously. Here is the minimum monitoring setup for a safe migration.

The Four Signals That Matter

  • Open workflow count. On-prem: trending to zero (drain confirmed). Cloud: matches the expected rate of new starts.
  • temporal_workflow_failed. On-prem: zero new failures (no new starts). Cloud: baseline established within the first 24h.
  • temporal_activity_schedule_to_start_latency. On-prem: stable or decreasing (draining load). Cloud: P99 below the on-prem baseline (cloud should be faster).
  • Worker poll errors. On-prem: zero (workers still healthy). Cloud: zero (TLS auth confirmed working).

The Drain Dashboard

Build a single dashboard that shows both clusters side by side during the migration window. The most important chart: open workflow count on on-prem over time. It should be a monotonically decreasing curve from migration start to decommission. Any uptick means workflows are still starting on on-prem — which means your traffic routing change did not fully propagate.

The Failure Modes That Catch Teams Off Guard

1. The Forgotten Starter

You updated the primary workflow starter service to route to the cloud. You forgot the batch job that also starts workflows, the admin tool, and the event-driven consumer. They are still starting workflows on on-prem. The drain curve flatlines instead of decreasing. Mitigation: before Phase 3, audit every code path that calls client.ExecuteWorkflow and map them all to a single connection factory.

2. TLS Certificate Expiry During Migration

Temporal Cloud requires mTLS for worker connections. If you are using short-lived certificates and your migration window overlaps a certificate rotation, your cloud workers will start dropping connections mid-migration. Mitigation: before starting the migration, verify certificate expiry dates and ensure they do not fall within the migration window plus a 2-week buffer.

3. Search Attribute Type Mismatch

You provisioned search attributes on the cloud namespace but used a different type than on-prem (e.g., Keyword instead of Text for a merchant ID field). Workflows start on cloud but visibility queries that rely on those attributes return no results. Support cannot find workflows by merchant ID. This is a silent failure that only surfaces when someone tries to search. Mitigation: export your on-prem search attribute schema and validate it matches cloud before routing any traffic.

4. Orphaned Signals

A webhook arrives targeting a workflow ID that exists on on-prem. Your signal routing code checks cloud first, finds nothing, and drops the signal. The on-prem workflow never receives the event it was waiting for and eventually times out. Mitigation: during the dual-running period, signal routing must check both clusters. Build a fallback: try cloud first, if the workflow is not found there, try on-prem.

// Signal routing with cluster fallback
func (s *WorkflowStarter) SignalWorkflow(
	ctx context.Context,
	workflowID string,
	signalName string,
	payload interface{},
) error {
	// Try cloud first (new workflows).
	err := s.cloudClient.SignalWorkflow(ctx, workflowID, "", signalName, payload)
	if err == nil {
		return nil
	}

	// If not found on cloud, fall back to on-prem (in-flight workflows).
	// serviceerror is go.temporal.io/api/serviceerror.
	var notFound *serviceerror.NotFound
	if errors.As(err, &notFound) {
		return s.onPremClient.SignalWorkflow(ctx, workflowID, "", signalName, payload)
	}

	return fmt.Errorf("signal failed on cloud: %w", err)
}

Pre-Migration Checklist

  1. Inventory of all namespaces, workflow types, and search attributes documented (migration scope is fully known before starting).
  2. Longest-running workflow duration identified (sets the minimum dual-running window length).
  3. All workflow starters mapped to a single connection factory (ensures no forgotten starters route to on-prem post-cutover).
  4. Cloud namespace provisioned with matching search attributes (prevents silent visibility failures on first cloud workflows).
  5. TLS certificates for cloud workers valid for 60+ days beyond migration end (prevents auth failures mid-migration).
  6. All Temporal Schedules identified, with a pause/recreate plan documented (prevents duplicate schedule executions).
  7. Signal routing updated with cluster fallback logic (prevents orphaned signals dropping in-flight workflows).
  8. Drain dashboard built showing open workflow count on on-prem (provides objective decommission exit criteria).
  9. On-prem persistence snapshot plan confirmed (ensures the compliance and audit trail is preserved).
  10. Full migration rehearsed end-to-end in staging (no surprises in production during the dual-running window).

Closing Thoughts

The teams that struggle with this migration are not the ones who underestimate the technical complexity — Temporal’s documentation covers the mechanics well. They are the ones who underestimate the operational surface area: the forgotten workflow starters, the orphaned signals, the schedule duplication, the long-running workflows nobody had mapped. The checklist above exists because every item on it has caused a real incident for a real team.

The dual-running pattern is not glamorous. Running two Temporal clusters simultaneously for two to four weeks feels like a longer migration than teams expect. But it is the only path that guarantees zero dropped executions, and for systems where workflows represent real business state — payments, approvals, agent tasks — that guarantee is non-negotiable.

If you are planning this migration and want a structured review of your workflow inventory, your drain strategy, and your rollback plan before you start, that is exactly the kind of work Xgrid’s Temporal Practice does.

Is your platform team still the bottleneck for migration confidence?

If the question “is it safe to decommission on-prem yet?” still requires an engineering deep-dive to answer — that’s a visibility and planning gap, not a Temporal limitation. It’s also one of the fastest things to fix with the right patterns in place.

Xgrid offers two entry points depending on where you are:

  • Temporal Launch Readiness Review — we audit your migration plan, workflow inventory, drain strategy, and rollback coverage before you start, and give you a concrete go / no-go scorecard.
  • Temporal Reliability Partner — for teams that want a named Temporal expert embedded through the migration window and beyond, owning the dual-running period and mentoring internal engineers on cloud-native Temporal patterns.

Both are fixed-scope. No open-ended retainer required to get started.
