Temporal Visibility and Search Attributes: How to Give Ops a Shared View of Workflow State
When Temporal moves from pilot to production, a shared view becomes necessary because ops is now expected to answer customer-impacting state questions quickly: “Is this booking paid?”, “Why is this payout stuck?”, or “Which agent run owns this escalation?”
In many teams today, this shared view does not exist yet. Visibility is fragmented: engineering has traces and service-level dashboards, support has ticket context, and ops has partial workflow signals, so each group sees a different slice of truth.
That fragmentation is where on-call fatigue, slow incident triage, and ownership arguments start.
This split — engineering has dashboards, ops has nothing — is one of the most consistent patterns we see when teams are 60–90 days past a Temporal POC. At Xgrid, we’ve helped ops, support, and platform engineering teams across payments, booking platforms, and AI agent workflows converge on a single shared view of what their orchestrator is doing. The fix isn’t a new tool. It’s a set of naming, indexing, and access decisions most teams skip in the rush to production.
This article describes how we give ops, platform, and product engineering a shared view of workflow state using Temporal’s first-class primitives: Visibility, Search Attributes, Workflow Id / Run Id, and optional read-only Queries, plus the governance choices (namespaces, identifiers, and data in indexes) that make the view safe as well as consistent.

What “shared view” means (and what it is not)
A shared view is not “everyone has Grafana.” It means:
1. Same identifiers in tickets, chat, and Temporal (namespace, workflow id, run id when needed).
2. Same discovery path to find executions: Visibility list filters / UI or CLI—not “ask the service team to grep.”
3. Same semantics for “state”: ops distinguishes platform lifecycle (Running / Failed / Timed Out…) from business phase (e.g. awaiting_fraud_review), using patterns documented in runbooks.
4. Same guardrails on what is indexed or shown in the UI (especially no PII in Workflow Ids and careful use of Search Attributes).
The authoritative progression of an execution remains the Event History; the shared ops view is how teams find that execution and interpret status without being workflow authors.
Layer 1: A small, enforced execution vocabulary
Ops cannot share a screen if everyone names things differently.

Namespace as the boundary
A Namespace scopes workflows, retention settings, and access. Common patterns:
- Environment separation: prod vs staging as separate namespaces or accounts (depending on how you host Temporal).
- Domain separation: payments vs internal tooling—only if it matches how your org splits ownership and access. Too many namespaces fragment the “shared view” unless you standardize cross-namespace runbooks.
Document which namespace applies to each major product surface and link it from your internal hub.

Workflow Id and Run Id
Per Temporal’s model, a Workflow Id is an application-level identifier (often business-meaningful), unique among open executions in that namespace. A Run Id identifies a specific run; it can change across retries and chain operations, so ops runbooks should emphasize Workflow Id for triage and teach when Run Id is required (e.g. deep support escalations, history export).
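One way to keep Workflow Ids both business-meaningful and safe is to generate them through a single helper rather than by hand in each service. A minimal Python sketch; the `payout-` prefix convention and the email check are our illustrative assumptions, not Temporal requirements:

```python
import re

# Crude stand-in for a PII check: Workflow Ids appear in plain text in the
# UI, CLI, Event History, and system logs, so raw emails must never land here.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def make_workflow_id(prefix: str, business_key: str) -> str:
    """Build a deterministic Workflow Id from an opaque internal key.

    Example convention (assumed): "<type prefix>-<business key>",
    producing ids like "payout-PO-98431".
    """
    if EMAIL_RE.search(business_key):
        raise ValueError("refusing to embed an email address in a Workflow Id")
    return f"{prefix}-{business_key}"

# make_workflow_id("payout", "PO-98431") -> "payout-PO-98431"
```

Because the id is derived from an opaque ledger/CRM key, the same string can appear verbatim in tickets, chat, and Temporal without a compliance review per incident.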
Critical ops-facing rule from Temporal’s documentation: do not put PII or secrets in Workflow Ids—they appear in plain text in the UI, CLI, Event History, and system logs. The same caution applies to other user-defined names visible in the system.

Workflow Type as the product label
Workflow Type is the stable “SKU” of logic (e.g. PayoutOrchestration). Ops dashboards, filters, and alert routing should reference Workflow Type consistently so on-call matches what engineers deploy.
Use case example
A fintech support ticket says, “Payout PO-98431 is delayed.” The runbook maps payouts to namespace payments-prod and workflow type PayoutOrchestration. Ops searches by workflow id pattern payout-PO-98431 and immediately lands on the correct execution instead of paging three teams to identify ownership.
Layer 2: Visibility and Search Attributes for “find my execution”
Visibility is Temporal’s subsystem for listing and filtering workflow executions. For ops, this is the fastest path to a shared view: finance, support, and platform can all run the same List Filter (SQL-like) in the Web UI or automation.

Custom Search Attributes
Custom Search Attributes let you index a small, controlled set of fields for filtering—examples: tenant_id, order_id, payment_intent_ref, agent_session—depending on your vertical (payments, business process, AI agent orchestration).
Design principles we use with ops teams:
- Index only what you will actually filter on in incidents; every new attribute has a retention and cardinality story.
- Never index raw PII if your policy treats the Visibility store as operational, broadly accessible, or long-retained. Prefer opaque internal IDs your CRM or ledger already uses.
- Name attributes in the product’s language so support tickets map 1:1 to filters (e.g. prefer OrderId over an opaque UUID when both are safe to index).
Example: Search Attribute schema for a payment orchestration workflow
Here’s the schema we typically recommend for fintech teams building payment flows on Temporal:
| Attribute | Type | Purpose |
| --- | --- | --- |
| TenantId | Keyword | Multi-tenant isolation; safe to index |
| OrderId | Keyword | Maps 1:1 to support ticket language |
| PaymentIntentRef | Keyword | External gateway reference for escalations |
| PaymentPhase | Keyword | Business phase: auth_pending, captured, refund_initiated |
| IsHighValue | Bool | Flag high-value transactions for priority triage |
Notice what’s absent: no customer name, no card details, no email. The rule is index the opaque internal ID your CRM or ledger already uses — never raw PII. For AI agent workflows, the equivalent schema typically uses AgentSessionId, TaskType, and ApprovalStatus as the triage-relevant attributes.
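A schema like this is easier to govern when it also exists as code that every service calls before writing attributes. A hedged Python sketch of such a registry; the validation rules and allowed phase values mirror the table above, and the function name is our own:

```python
# Illustrative schema registry mirroring the table above. The type names
# follow Temporal's search-attribute types (Keyword, Bool, ...), mapped here
# to Python types for a cheap pre-upsert check.
PAYMENT_SCHEMA = {
    "TenantId": str,
    "OrderId": str,
    "PaymentIntentRef": str,
    "PaymentPhase": str,
    "IsHighValue": bool,
}

ALLOWED_PHASES = {"auth_pending", "captured", "refund_initiated"}

def validate_attrs(attrs: dict) -> dict:
    """Reject unregistered attributes and type mismatches before indexing."""
    for name, value in attrs.items():
        if name not in PAYMENT_SCHEMA:
            raise KeyError(f"unregistered search attribute: {name}")
        if not isinstance(value, PAYMENT_SCHEMA[name]):
            raise TypeError(f"{name} expects {PAYMENT_SCHEMA[name].__name__}")
    if "PaymentPhase" in attrs and attrs["PaymentPhase"] not in ALLOWED_PHASES:
        raise ValueError(f"unknown PaymentPhase: {attrs['PaymentPhase']}")
    return attrs
```

Centralizing the check is what stops a well-meaning engineer from quietly indexing `CustomerEmail` during an incident.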
Visibility tells you which executions matter; Event History still tells you what happened. Train ops to open describe + history after the list filter finds the row.

Memo vs Search Attributes
Memo carries extra metadata on the execution but is not a substitute for indexed search. It can supplement human context on the execution detail page; it should not be the only place ops can correlate a ticket. If support needs to search by a field, it generally belongs in Search Attributes (with governance), not Memo alone.
Use case example
An operations engineer gets a “booking stuck” escalation with booking id BK-44721. Because booking_id is a Search Attribute, they run a single Visibility filter (WorkflowType = 'BookingLifecycle' AND booking_id = 'BK-44721') and find the execution in seconds. If that id lived only in logs or Memo, triage would require manual service-by-service correlation.
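Filters like this one can be generated by a tiny helper, so support pastes a key into tooling instead of hand-writing List Filter syntax. A sketch under these assumptions: WorkflowType is a built-in search attribute, and any custom name (booking_id here) has been registered in the namespace:

```python
def list_filter(**attrs: str) -> str:
    """Compose a simple equality-only Visibility List Filter.

    Covers the common triage case; richer operators (>, STARTS_WITH, ...)
    exist in the List Filter language but are out of scope for this helper.
    """
    return " AND ".join(f"{name} = '{value}'" for name, value in attrs.items())

query = list_filter(WorkflowType="BookingLifecycle", booking_id="BK-44721")
# Usable from automation or the CLI, e.g.:
#   temporal workflow list --query "<query>"
```

Generating the string in one place also gives you a single choke point for escaping and for rejecting attributes that were never registered.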
Layer 3: Queries for a readable “business phase” (optional but powerful)
Queries are synchronous, read-only requests that report workflow state. They do not append events to Event History, which makes them attractive for cheap operational readouts (“what stage are we in?”, “what was the last external reference returned?”).
Practical guidelines:
- Define small, stable query shapes ops is allowed to call from the UI or CLI (currentPhase, summaryForSupport).
- Queries run against worker code; if workers are unhealthy, queries can fail while the platform still shows the execution as Running. Ops runbooks should say what to do when Queries fail but history loads (often: worker/deployment issue, not “payment vanished”).
- Keep responses minimal—large query payloads hurt usability and blur the line with “log dumping.”
For workflows where blocking until a state is ready matters, Temporal’s message-passing docs note tradeoffs between polling Queries vs using Updates (which write history). Ops-facing tooling usually stays on Queries; product-facing APIs may choose Updates.
Use case example
In an AI-agent workflow, support needs a quick answer to “where is this run blocked?” A query like currentPhase returns awaiting_human_approval with a timestamp. Ops can give the customer a precise status without reading raw event history, while engineering still uses history for deeper debugging.
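The contract that makes this safe—reads never touch history, writes always do—can be illustrated without the Temporal SDK. A toy Python model of that contract (real workflows define query handlers via the SDK; every name below is ours):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentRunState:
    """Toy model of the query contract: queries read, they never append."""
    phase: str = "started"
    history: list = field(default_factory=list)  # stands in for Event History

    def signal_phase(self, phase: str) -> None:
        # State changes flow through recorded events, as in a real workflow.
        self.history.append(("phase_changed", phase))
        self.phase = phase

    def current_phase(self) -> dict:
        # Read-only query shape: small, stable, no history append.
        return {"phase": self.phase,
                "as_of": datetime.now(timezone.utc).isoformat()}
```

Keeping the query response to a couple of fields is deliberate: it is the ops-approved summary, not a debugging dump.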
Layer 4: Tools everyone can actually open
Shared view requires shared access:
- Web UI: default “front door” for humans; ensure SSO/roles match who may see production namespaces (Temporal Cloud and self-hosted each have their own access models).
- CLI (temporal workflow …): same filters as the UI; scriptable for bridges into your ticketing system.
- Tickets: require namespace, workflow_id, and one business key that is also a Search Attribute when possible—so the first on-call step is deterministic.
If only three people can access prod Temporal, you do not have an ops-wide shared view—you have a bottleneck.
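The ticket requirement above is cheap to enforce in intake tooling. A minimal sketch, assuming a plain dict per ticket; the field names follow the convention in this article, not any ticketing product’s schema:

```python
REQUIRED_TICKET_FIELDS = ("namespace", "workflow_id")

def triage_ready(ticket: dict):
    """Check an escalation ticket carries what on-call needs on step one.

    Returns (ready, missing_fields). "search_keys" stands in for at least
    one business key that is also an indexed Search Attribute.
    """
    missing = [f for f in REQUIRED_TICKET_FIELDS if not ticket.get(f)]
    if not ticket.get("search_keys"):
        missing.append("search_keys")
    return (not missing, missing)
```

Wiring this into the ticket template means the deterministic first step (open the right namespace, filter on the right id) never waits on a Slack thread.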
Use case example
A Sev-2 incident opens at 2 a.m. Because on-call ops has Temporal Web UI access and a ticket template that requires namespace + workflow_id, they can classify the issue and attach execution details before waking application engineers. Mean-time-to-triage drops because access is not gated through one platform owner.
Layer 5: Metrics and health—shared signals, not shared state
Workflow state lives in Event History plus (optional) query handlers; cluster health and worker health come from Temporal Cloud/Server metrics and SDK metrics.
Temporal’s guidance: for business process health (backlog, latency to complete, failure rate by workflow type), bias toward SDK metrics as application-ground-truth; use Cloud/Server metrics for the service itself. Ops benefits when alert names reference the same Workflow Type and Task Queue names engineering uses—otherwise Grafana becomes a second dialect.

Runbook pattern we reuse across clients
1. Classify: Is the issue a missing execution, a stuck execution, or a failed execution? Use Visibility filters (ExecutionStatus=, WorkflowType=, custom attributes).
2. Identify: Capture namespace + workflow id (+ run id if provided by tooling).
3. Inspect: Open Event History; identify the last decisive events (activity retries, a timer still pending, a signal being awaited).
4. Interpret business phase: If queries exist and workers are healthy, run an approved query; else infer the phase from history and the documented mapping.
5. Escalate: Engineering gets ids + screenshots or exported history, not “customer says it’s broken.”
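Step 1 of this runbook can be encoded so every on-call classifies the same way. A hedged Python sketch; the two-hour stuck threshold is an assumption to tune per workflow type, and the status strings follow Temporal’s execution statuses:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STUCK_AFTER = timedelta(hours=2)  # assumption: tune per workflow type

def classify(status: Optional[str],
             last_event_at: Optional[datetime],
             now: Optional[datetime] = None) -> str:
    """Runbook step 1: missing vs stuck vs failed, from Visibility data."""
    now = now or datetime.now(timezone.utc)
    if status is None:
        return "missing execution"  # no row matched the List Filter
    if status in ("Failed", "TimedOut", "Terminated"):
        return "failed execution"
    if status == "Running" and last_event_at and now - last_event_at > STUCK_AFTER:
        return "stuck execution"
    return "healthy / still progressing"
```

The point is not the exact threshold; it is that “stuck” has one definition per workflow type instead of one per responder.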
Anti-patterns that break the shared view
- Snowflake identifiers: every service logs a different key; Temporal never gets a join key in Search Attributes.
- PII in Workflow Id or searchable fields: compliance incident plus broken trust in the “shared” UI.
- Thirty ad-hoc Grafana panels but no documented List Filter for the same question.
- Undocumented Queries: ops randomly discovers query names in Slack; some environments lack worker versions that implement them.
Use case example
Payments sees increased “payout delay” complaints. Ops correlates an SDK metric spike in PayoutOrchestration activity latency with normal Temporal Cloud service metrics. This quickly points to a worker-side dependency regression (not Temporal platform outage), so the right team is paged first.
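The paging decision in this example can be written down so it is not tribal knowledge. A deliberately simplistic sketch; the routing labels are placeholders for your own on-call rotations:

```python
def page_first(sdk_anomaly: bool, platform_anomaly: bool) -> str:
    """Route the first page from the two signal families described above."""
    if sdk_anomaly and platform_anomaly:
        return "joint incident bridge"
    if sdk_anomaly:
        # e.g. activity latency spike with healthy service metrics:
        # likely a worker-side dependency regression.
        return "application/worker team"
    if platform_anomaly:
        return "platform/Temporal owner"
    return "no page; keep monitoring"
```

Even a table this small prevents the default failure mode: paging the platform owner for every workflow symptom.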
Closing
Giving ops a shared view of workflow state is mostly product design in operations clothing: namespaces and ids people agree on, Visibility fields that match ticket language, optional Queries that encode allowed summaries, and metrics that use the same workflow vocabulary as the code.
If your organization is past the Temporal POC and feeling “only the workflow authors understand what’s live,” this is one of the highest-return stabilizations you can ship—without changing Temporal itself, only how you expose it.
Is your team still the bottleneck for ops visibility?
If support and operations are still routing every “is this stuck?” question through engineering, that’s a design gap — not a Temporal limitation. It’s also one of the fastest things to fix with the right primitives in place.
Xgrid offers two entry points depending on where you are:
- Temporal 90-Day Production Health Check — we audit your current Visibility setup, identifier conventions, and runbook coverage, and give you a concrete fix list.
- Temporal Reliability Partner — for teams that want a named Temporal expert embedded long-term to own this layer and mentor internal engineers.
Both are fixed-scope. No open-ended retainer required to get started.