
fix(metrics): ensure consistent Prometheus label sets for persistence metrics #7893

Open
officialasishkumar wants to merge 4 commits into cadence-workflow:master from officialasishkumar:fix/metrics-shard-scope-label-consistency

Conversation

@officialasishkumar
Contributor

What changed?

Fixed inconsistent label sets across scopes emitting the same Prometheus metric name, which caused the Prometheus reporter to silently drop metrics.

Fixes #7844.

Why?

The Prometheus client_golang library requires that all descriptors registered under the same metric name share the same label-name set. When Cadence emitted the same metric from scopes with different tag sets, the second registration failed and Tally returned a noopMetric{}, silently discarding all data from that scope.

Two specific patterns were broken:

  1. persistence_requests_per_shard / persistence_latency_per_shard — In callWithDomainAndShardScope, shardOperationsMetricsScope included additionalTags (e.g., is_retry) but shardOverallMetricsScope did not. Both scopes emitted the same metric names with different label sets. Fixed by passing additionalTags to shardOverallMetricsScope.

  2. persistence_requests / persistence_latency — Three methods (CompleteHistoryTask, GetHistoryTasks, RangeCompleteHistoryTask) added a task_category tag via getCustomMetricTags in the callWithoutDomainTag fallback path, while all other ExecutionManager methods did not. This created an inconsistent label set for the shared persistence_requests and persistence_latency metrics. Fixed by removing getCustomMetricTags from the callWithoutDomainTag path in the codegen template — the operation label already identifies the method uniquely.

Note: the cache metrics inconsistencies (cache_count, cache_evict, cache_latency) mentioned in #7844 originate in the cache layer, not the persistence metered wrapper, and are left for a follow-up.

How did you test it?

go test -race ./common/persistence/wrappers/metered/...
go build ./...

All existing tests pass. The generated file execution_generated.go was regenerated from the updated template via make go-generate GEN_DIR=common/persistence.

Potential risks

  • The task_category tag is no longer emitted on persistence_requests / persistence_latency for the three history-task methods when the request has no domain. Dashboards querying these metrics by task_category on the no-domain path would need to switch to filtering by operation instead. The operation label already uniquely identifies these methods, so no information is lost.
  • The shardOverallMetricsScope now carries additionalTags (currently is_retry), slightly increasing label cardinality for the PersistenceShardRequestCountScope. This is the intended fix — all emission paths for persistence_requests_per_shard must share the same label set.

Release notes

Fixed silent metric data loss when using the Prometheus reporter. Persistence metrics (persistence_requests_per_shard, persistence_latency_per_shard, persistence_requests, persistence_latency) were emitted with inconsistent label sets across code paths, causing Prometheus to reject the second registration and drop all data from the affected scope.

Documentation Changes

N/A

Comment thread common/persistence/wrappers/templates/metered_execution.tmpl Outdated
@ribaraka
Contributor

ribaraka commented Apr 6, 2026

It would be good to add a specific test here as well. This kind of label mismatch is only caught at prometheus runtime, so a small test covering the affected persistence_requests/persistence_latency would help prevent regressions.

@officialasishkumar
Contributor Author

Added a focused metered-wrapper test that exercises the shared persistence_requests and persistence_latency metrics directly, plus the shard-scoped tags that replace the removed shared-metric breakdown.

Comment thread common/persistence/wrappers/templates/metered_execution.tmpl Outdated
@officialasishkumar officialasishkumar force-pushed the fix/metrics-shard-scope-label-consistency branch from 40cdcda to f4000f3 Compare April 6, 2026 17:35
Comment thread common/persistence/wrappers/templates/metered_execution.tmpl Outdated
… metrics

- Added additionalTags to shardOverallMetricsScope in callWithDomainAndShardScope
  so that persistence_requests_per_shard and persistence_latency_per_shard are
  emitted with the same label set (including is_retry) from both shard scopes.
  Previously shardOverallMetricsScope was missing additionalTags, causing
  Prometheus to reject the second registration and silently drop metrics.

- Removed getCustomMetricTags from callWithoutDomainTag in the execution metered
  template. Three methods (CompleteHistoryTask, GetHistoryTasks,
  RangeCompleteHistoryTask) added a task_category tag via getCustomMetricTags,
  while all other methods did not, creating inconsistent label sets for
  persistence_requests and persistence_latency. The operation label already
  identifies the method, so task_category is not needed in the shared metrics.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
@officialasishkumar officialasishkumar force-pushed the fix/metrics-shard-scope-label-consistency branch from 4fdbb6d to c689f47 Compare April 7, 2026 06:11
@officialasishkumar
Contributor Author

@ribaraka I added a focused metered-wrapper test for the per-shard paths and removed request-specific shard tags so the shared shard metrics keep one stable Prometheus label set.

@gitar-bot

gitar-bot bot commented Apr 7, 2026

CI failed: an SQLite WASM out-of-bounds memory access crash during the SQLite integration tests is an infrastructure flake; a secondary ndc integration test failure is likely related to this PR's persistence metrics label changes causing assertion mismatches.

Overview

Analyzed 4 CI logs across 2 unique templates. Found 2 distinct failure patterns: (1) a flaky SQLite WASM runtime crash unrelated to code changes, and (2) an ndc integration test failure that may be related to the PR's persistence metrics label consistency changes.

Failures

SQLite WASM Out-of-Bounds Memory Access (confidence: high)

  • Type: infrastructure
  • Affected jobs: 70195663273
  • Related to change: no
  • Root cause: The go-sqlite3 WASM binding (v0.22.0) encounters out-of-bounds memory access panics during database operations in the timer queue processor. The panic occurs in WASM memory management functions (sqlite3.wasm.dlfree, sqlite3MemFree, sqlite3_step, sqlite3_prepare) and is not caused by the PR's metrics label changes.
  • Suggested fix: (1) Update go-sqlite3 WASM to a newer, more stable version; (2) Switch to native SQLite driver for integration tests instead of WASM; (3) Use PostgreSQL backend in CI instead of SQLite; (4) Add WASM-specific retry logic. This is a known flakiness issue unrelated to metrics consistency.

NDC Integration Test Failure (confidence: medium)

  • Type: test
  • Affected jobs: 70195663308
  • Related to change: yes
  • Root cause: The ndc (multi-datacenter) integration test failed with exit code 2 after 37.5s. The PR changes Prometheus label consistency in persistence metrics wrappers (metered wrapper handling of typed nil requests, base metrics tag stability). A mismatch between expected and actual metric label sets during persistence operations (e.g., GetCurrentExecution) could cause assertion failures when the test validates metric output or scope labels.
  • Suggested fix: (1) Verify that the metered wrapper changes in the PR properly emit metrics with consistent label sets for all persistence methods; (2) Check if ndc integration tests validate Prometheus metrics or metric scope labels; (3) Ensure metered_test.go (added in PR) covers label set consistency for execution-related metrics; (4) Run ndc test locally with verbose logging to identify the specific assertion failure.

Summary

  • Change-related failures: 1 (ndc integration test failure possibly caused by inconsistent persistence metrics label sets)
  • Infrastructure/flaky failures: 1 (SQLite WASM out-of-bounds memory access, unrelated to PR)
  • Recommended action: (1) Investigate whether the ndc test validates metrics; if so, review the metered wrapper and base metrics changes to ensure label consistency; (2) Add comprehensive test coverage for persistence metric label sets in metered_test.go; (3) The SQLite WASM crash should be addressed separately as a test infrastructure issue, not a blocker for this PR.
Code Review ✅ Approved 2 resolved / 2 findings

Fixes Prometheus label set inconsistencies in persistence metrics by ensuring stable label sets across per-shard and base metrics, addressing label-set variations that could cause cardinality issues. No issues found.

✅ 2 resolved
Bug: New label-set inconsistency introduced for per-shard metrics

📄 common/persistence/wrappers/templates/metered_execution.tmpl:58 📄 common/persistence/wrappers/templates/metered_execution.tmpl:64 📄 common/persistence/wrappers/templates/metered_execution.tmpl:71
The PR fixes label inconsistency for persistence_requests/persistence_latency but introduces a new one for persistence_requests_per_shard/persistence_latency_per_shard.

In the template, shardMetricTags is built as append(getCustomMetricTags(req), metrics.IsRetryTag(...)). getCustomMetricTags returns different tags depending on the request type:

  • 3 history task methods (GetHistoryTasks, CompleteHistoryTask, RangeCompleteHistoryTask) → [task_category=X]
  • PutReplicationTaskToDLQ → [domain=X]
  • All other methods → []

These shardMetricTags are passed to both callWithShardScope and callWithDomainAndShardScope, which emit persistence_requests_per_shard and persistence_latency_per_shard. This produces at least 3 distinct label-name sets for the same Prometheus metric:

  1. {operation, shard_id, is_retry} — most methods
  2. {operation, shard_id, task_category, is_retry} — history task methods
  3. {operation, shard_id, domain, is_retry} — PutReplicationTaskToDLQ

Prometheus client_golang requires all descriptors for the same metric name to share the same label-name set, so the second and third sets will be silently dropped — the exact class of bug this PR aims to fix.

Note: the old code only passed [is_retry] as additionalTags to shard scopes (custom metric tags were only used in the callWithoutDomainTag fallback). By moving getCustomMetricTags into shardMetricTags, this inconsistency is newly introduced.

Bug: Per-shard metrics still have inconsistent label sets across methods

📄 common/persistence/wrappers/templates/metered_execution.tmpl:58 📄 common/persistence/wrappers/templates/metered_execution.tmpl:64 📄 common/persistence/wrappers/templates/metered_execution.tmpl:71 📄 common/persistence/wrappers/metered/base.go:184-185 📄 common/persistence/wrappers/metered/base.go:208-209
The PR fixes the label-set mismatch between shardOperationsMetricsScope and shardOverallMetricsScope, and removes task_category from the callWithoutDomainTag path. However, it introduces a new inconsistency on the per-shard path.

shardMetricTags is built as append(getCustomMetricTags(req), metrics.IsRetryTag(retryCount > 0)). Only 3 request types (GetHistoryTasksRequest, CompleteHistoryTaskRequest, RangeCompleteHistoryTaskRequest) implement MetricTags() and return a task_category tag. All other execution manager requests return an empty slice. Both paths feed into shardOverallMetricsScope which uses the shared PersistenceShardRequestCountScope (operation: ShardIdPersistenceRequest).

Result:

  • GetHistoryTasks emits persistence_requests_per_shard with labels {operation, shard_id, task_category, is_retry}
  • CreateWorkflowExecution emits the same metric with labels {operation, shard_id, is_retry}

Prometheus client_golang requires all descriptors registered under the same metric name to share the same label-name set. The second registration will fail and Tally will return a noopMetric{}, silently dropping data — the exact same class of bug this PR aims to fix.

The test (lines 378-399) passes because tally.NewTestScope doesn't enforce Prometheus's label-set constraint, masking the issue.

Rules ✅ All requirements met

Repository Rules

GitHub Issue Linking Requirement: The PR description contains 'Fixes #7844', a valid issue link referencing the cadence-workflow organization.
PR Description Quality Standards: The PR description contains all required sections including WHAT, WHY, testable commands, risk analysis, release notes, and documentation assessment.





Development

Successfully merging this pull request may close these issues.

Prometheus reporter silently drops metrics with same name due to inconsistent label sets across scopes
