
fix(metrics): ensure consistent Prometheus label sets for persistence metrics #7893

Open
officialasishkumar wants to merge 4 commits into cadence-workflow:master from officialasishkumar:fix/metrics-shard-scope-label-consistency

Conversation

@officialasishkumar
Contributor

What changed?

Fixed inconsistent label sets across scopes emitting the same Prometheus metric name, which caused the Prometheus reporter to silently drop metrics.

Fixes #7844.

Why?

The Prometheus client_golang library requires that all descriptors registered under the same metric name share the same label-name set. When Cadence emitted the same metric from scopes with different tag sets, the second registration failed and Tally returned a noopMetric{}, silently discarding all data from that scope.

Two specific patterns were broken:

  1. persistence_requests_per_shard / persistence_latency_per_shard — In callWithDomainAndShardScope, shardOperationsMetricsScope included additionalTags (e.g., is_retry) but shardOverallMetricsScope did not. Both scopes emitted the same metric names with different label sets. Fixed by passing additionalTags to shardOverallMetricsScope.

  2. persistence_requests / persistence_latency — Three methods (CompleteHistoryTask, GetHistoryTasks, RangeCompleteHistoryTask) added a task_category tag via getCustomMetricTags in the callWithoutDomainTag fallback path, while all other ExecutionManager methods did not. This created an inconsistent label set for the shared persistence_requests and persistence_latency metrics. Fixed by removing getCustomMetricTags from the callWithoutDomainTag path in the codegen template — the operation label already identifies the method uniquely.

Note: the cache metrics inconsistencies (cache_count, cache_evict, cache_latency) mentioned in #7844 originate in the cache layer, not the persistence metered wrapper, and are left for a follow-up.

How did you test it?

go test -race ./common/persistence/wrappers/metered/...
go build ./...

All existing tests pass. The generated file execution_generated.go was regenerated from the updated template via make go-generate GEN_DIR=common/persistence.

Potential risks

  • The task_category tag is no longer emitted on persistence_requests / persistence_latency for the three history-task methods when the request has no domain. Dashboards querying these metrics by task_category on the no-domain path would need to switch to filtering by operation instead. The operation label already uniquely identifies these methods, so no information is lost.
  • The shardOverallMetricsScope now carries additionalTags (currently is_retry), slightly increasing label cardinality for the PersistenceShardRequestCountScope. This is the intended fix — all emission paths for persistence_requests_per_shard must share the same label set.

Release notes

Fixed silent metric data loss when using the Prometheus reporter. Persistence metrics (persistence_requests_per_shard, persistence_latency_per_shard, persistence_requests, persistence_latency) were emitted with inconsistent label sets across code paths, causing Prometheus to reject the second registration and drop all data from the affected scope.

Documentation Changes

N/A

Comment thread common/persistence/wrappers/templates/metered_execution.tmpl Outdated
@ribaraka
Contributor

ribaraka commented Apr 6, 2026

It would be good to add a specific test here as well. This kind of label mismatch is only caught at prometheus runtime, so a small test covering the affected persistence_requests/persistence_latency would help prevent regressions.

@officialasishkumar
Contributor Author

Added a focused metered-wrapper test that exercises the shared persistence_requests and persistence_latency metrics directly, plus the shard-scoped tags that replace the removed shared-metric breakdown.

Comment thread common/persistence/wrappers/templates/metered_execution.tmpl Outdated
@officialasishkumar officialasishkumar force-pushed the fix/metrics-shard-scope-label-consistency branch from 40cdcda to f4000f3 Compare April 6, 2026 17:35
Comment thread common/persistence/wrappers/templates/metered_execution.tmpl Outdated
… metrics

- Added additionalTags to shardOverallMetricsScope in callWithDomainAndShardScope
  so that persistence_requests_per_shard and persistence_latency_per_shard are
  emitted with the same label set (including is_retry) from both shard scopes.
  Previously shardOverallMetricsScope was missing additionalTags, causing
  Prometheus to reject the second registration and silently drop metrics.

- Removed getCustomMetricTags from callWithoutDomainTag in the execution metered
  template. Three methods (CompleteHistoryTask, GetHistoryTasks,
  RangeCompleteHistoryTask) added a task_category tag via getCustomMetricTags,
  while all other methods did not, creating inconsistent label sets for
  persistence_requests and persistence_latency. The operation label already
  identifies the method, so task_category is not needed in the shared metrics.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
@officialasishkumar officialasishkumar force-pushed the fix/metrics-shard-scope-label-consistency branch from 4fdbb6d to c689f47 Compare April 7, 2026 06:11
@officialasishkumar
Contributor Author

@ribaraka I added a focused metered-wrapper test for the per-shard paths and removed request-specific shard tags so the shared shard metrics keep one stable Prometheus label set.

@gitar-bot

gitar-bot bot commented Apr 7, 2026

CI failed: an SQLite WASM out-of-bounds memory access crash during the SQLite integration tests is an infrastructure flake; a secondary ndc integration test failure is likely related to this PR's persistence metrics label changes causing assertion mismatches.

Overview

Analyzed 4 CI logs across 2 unique templates. Found 2 distinct failure patterns: (1) a flaky SQLite WASM runtime crash unrelated to code changes, and (2) an ndc integration test failure that may be related to the PR's persistence metrics label consistency changes.

Failures

SQLite WASM Out-of-Bounds Memory Access (confidence: high)

  • Type: infrastructure
  • Affected jobs: 70195663273
  • Related to change: no
  • Root cause: The go-sqlite3 WASM binding (v0.22.0) encounters out-of-bounds memory access panics during database operations in the timer queue processor. The panic occurs in WASM memory management functions (sqlite3.wasm.dlfree, sqlite3MemFree, sqlite3_step, sqlite3_prepare) and is not caused by the PR's metrics label changes.
  • Suggested fix: (1) Update go-sqlite3 WASM to a newer, more stable version; (2) Switch to native SQLite driver for integration tests instead of WASM; (3) Use PostgreSQL backend in CI instead of SQLite; (4) Add WASM-specific retry logic. This is a known flakiness issue unrelated to metrics consistency.

NDC Integration Test Failure (confidence: medium)

  • Type: test
  • Affected jobs: 70195663308
  • Related to change: yes
  • Root cause: The ndc (multi-datacenter) integration test failed with exit code 2 after 37.5s. The PR changes Prometheus label consistency in persistence metrics wrappers (metered wrapper handling of typed nil requests, base metrics tag stability). A mismatch between expected and actual metric label sets during persistence operations (e.g., GetCurrentExecution) could cause assertion failures when the test validates metric output or scope labels.
  • Suggested fix: (1) Verify that the metered wrapper changes in the PR properly emit metrics with consistent label sets for all persistence methods; (2) Check if ndc integration tests validate Prometheus metrics or metric scope labels; (3) Ensure metered_test.go (added in PR) covers label set consistency for execution-related metrics; (4) Run ndc test locally with verbose logging to identify the specific assertion failure.

Summary

  • Change-related failures: 1 (ndc integration test failure possibly caused by inconsistent persistence metrics label sets)
  • Infrastructure/flaky failures: 1 (SQLite WASM out-of-bounds memory access, unrelated to PR)
  • Recommended action: (1) Investigate whether the ndc test validates metrics; if so, review the metered wrapper and base metrics changes to ensure label consistency; (2) Add comprehensive test coverage for persistence metric label sets in metered_test.go; (3) The SQLite WASM crash should be addressed separately as a test infrastructure issue, not a blocker for this PR.
Code Review ✅ Approved 2 resolved / 2 findings

Fixes Prometheus label set inconsistencies in persistence metrics by ensuring stable label sets across per-shard and base metrics, addressing label-set variations that could cause cardinality issues. No issues found.

✅ 2 resolved
Bug: New label-set inconsistency introduced for per-shard metrics

📄 common/persistence/wrappers/templates/metered_execution.tmpl:58 📄 common/persistence/wrappers/templates/metered_execution.tmpl:64 📄 common/persistence/wrappers/templates/metered_execution.tmpl:71
The PR fixes label inconsistency for persistence_requests/persistence_latency but introduces a new one for persistence_requests_per_shard/persistence_latency_per_shard.

In the template, shardMetricTags is built as append(getCustomMetricTags(req), metrics.IsRetryTag(...)). getCustomMetricTags returns different tags depending on the request type:

  • 3 history task methods (GetHistoryTasks, CompleteHistoryTask, RangeCompleteHistoryTask) → [task_category=X]
  • PutReplicationTaskToDLQ → [domain=X]
  • All other methods → []

These shardMetricTags are passed to both callWithShardScope and callWithDomainAndShardScope, which emit persistence_requests_per_shard and persistence_latency_per_shard. This produces at least 3 distinct label-name sets for the same Prometheus metric:

  1. {operation, shard_id, is_retry} — most methods
  2. {operation, shard_id, task_category, is_retry} — history task methods
  3. {operation, shard_id, domain, is_retry} — PutReplicationTaskToDLQ

Prometheus client_golang requires all descriptors for the same metric name to share the same label-name set, so the second and third sets will be silently dropped — the exact class of bug this PR aims to fix.

Note: the old code only passed [is_retry] as additionalTags to shard scopes (custom metric tags were only used in the callWithoutDomainTag fallback). By moving getCustomMetricTags into shardMetricTags, this inconsistency is newly introduced.

Bug: Per-shard metrics still have inconsistent label sets across methods

📄 common/persistence/wrappers/templates/metered_execution.tmpl:58 📄 common/persistence/wrappers/templates/metered_execution.tmpl:64 📄 common/persistence/wrappers/templates/metered_execution.tmpl:71 📄 common/persistence/wrappers/metered/base.go:184-185 📄 common/persistence/wrappers/metered/base.go:208-209
The PR fixes the label-set mismatch between shardOperationsMetricsScope and shardOverallMetricsScope, and removes task_category from the callWithoutDomainTag path. However, it introduces a new inconsistency on the per-shard path.

shardMetricTags is built as append(getCustomMetricTags(req), metrics.IsRetryTag(retryCount > 0)). Only 3 request types (GetHistoryTasksRequest, CompleteHistoryTaskRequest, RangeCompleteHistoryTaskRequest) implement MetricTags() and return a task_category tag. All other execution manager requests return an empty slice. Both paths feed into shardOverallMetricsScope which uses the shared PersistenceShardRequestCountScope (operation: ShardIdPersistenceRequest).

Result:

  • GetHistoryTasks emits persistence_requests_per_shard with labels {operation, shard_id, task_category, is_retry}
  • CreateWorkflowExecution emits the same metric with labels {operation, shard_id, is_retry}

Prometheus client_golang requires all descriptors registered under the same metric name to share the same label-name set. The second registration will fail and Tally will return a noopMetric{}, silently dropping data — the exact same class of bug this PR aims to fix.

The test (lines 378-399) passes because tally.NewTestScope doesn't enforce Prometheus's label-set constraint, masking the issue.

Rules ✅ All requirements met

Repository Rules

GitHub Issue Linking Requirement: The PR description contains 'Fixes #7844', a valid issue link referencing the cadence-workflow organization.
PR Description Quality Standards: The PR description contains all required sections including WHAT, WHY, testable commands, risk analysis, release notes, and documentation assessment.





Development

Successfully merging this pull request may close these issues.

Prometheus reporter silently drops metrics with same name due to inconsistent label sets across scopes
