fix(metrics): ensure consistent Prometheus label sets for persistence metrics #7893
Conversation
It would be good to add a specific test here as well. This kind of label mismatch is only caught at Prometheus runtime, so a small test covering the affected …

Added a focused metered-wrapper test that exercises the shared …
Force-pushed 40cdcda to f4000f3
… metrics

- Added `additionalTags` to `shardOverallMetricsScope` in `callWithDomainAndShardScope` so that `persistence_requests_per_shard` and `persistence_latency_per_shard` are emitted with the same label set (including `is_retry`) from both shard scopes. Previously `shardOverallMetricsScope` was missing `additionalTags`, causing Prometheus to reject the second registration and silently drop metrics.
- Removed `getCustomMetricTags` from `callWithoutDomainTag` in the execution metered template. Three methods (`CompleteHistoryTask`, `GetHistoryTasks`, `RangeCompleteHistoryTask`) added a `task_category` tag via `getCustomMetricTags`, while all other methods did not, creating inconsistent label sets for `persistence_requests` and `persistence_latency`. The `operation` label already identifies the method, so `task_category` is not needed in the shared metrics.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
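The shape of the first fix can be sketched with stand-in types (a minimal sketch; `Tag`, `Scope`, and `buildShardScopes` are hypothetical names for illustration, not the real Cadence metrics types):

```go
package main

import "fmt"

// Tag and Scope are toy stand-ins for the metrics client types.
type Tag struct{ Key, Value string }

type Scope struct{ tags []Tag }

func newScope(tags ...Tag) Scope { return Scope{tags: tags} }

func (s Scope) labelNames() []string {
	names := make([]string, 0, len(s.tags))
	for _, t := range s.tags {
		names = append(names, t.Key)
	}
	return names
}

// buildShardScopes mirrors the shape of the fix in
// callWithDomainAndShardScope: after the change, additionalTags
// (e.g. is_retry) flow into BOTH the per-operation scope and the
// overall per-shard scope, so persistence_requests_per_shard is
// emitted with one label set from every path.
func buildShardScopes(shardID string, additionalTags ...Tag) (operations, overall Scope) {
	base := []Tag{{Key: "shard_id", Value: shardID}}
	operations = newScope(append(base, additionalTags...)...)
	// Before the fix this scope dropped additionalTags: newScope(base...)
	overall = newScope(append(base, additionalTags...)...)
	return operations, overall
}

func main() {
	ops, overall := buildShardScopes("7", Tag{Key: "is_retry", Value: "false"})
	fmt.Println(ops.labelNames())     // [shard_id is_retry]
	fmt.Println(overall.labelNames()) // [shard_id is_retry]
}
```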
Force-pushed 4fdbb6d to c689f47
@ribaraka I added a focused metered-wrapper test for the per-shard paths and removed request-specific shard tags so the shared shard metrics keep one stable Prometheus label set.
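A test in this spirit (illustrative only; `capturingScope` and the emission calls below are hypothetical stand-ins, not the actual Cadence test client) can assert that every emission path for a per-shard metric uses the same label-name set:

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// capturingScope records the label names used for each metric
// emission. It is a hypothetical stand-in for a test metrics client.
type capturingScope struct {
	emissions map[string][][]string // metric name -> label-name sets seen
}

func newCapturingScope() *capturingScope {
	return &capturingScope{emissions: map[string][][]string{}}
}

func (s *capturingScope) emit(metric string, tags map[string]string) {
	names := make([]string, 0, len(tags))
	for k := range tags {
		names = append(names, k)
	}
	sort.Strings(names)
	s.emissions[metric] = append(s.emissions[metric], names)
}

// consistent reports whether every recorded emission of the metric
// used the same label-name set.
func (s *capturingScope) consistent(metric string) bool {
	sets := s.emissions[metric]
	if len(sets) == 0 {
		return true
	}
	for _, set := range sets[1:] {
		if !reflect.DeepEqual(set, sets[0]) {
			return false
		}
	}
	return true
}

func main() {
	s := newCapturingScope()
	// Simulate the two per-shard emission paths after the fix:
	// both carry operation, shard_id, and is_retry.
	s.emit("persistence_requests_per_shard",
		map[string]string{"operation": "GetShard", "shard_id": "7", "is_retry": "false"})
	s.emit("persistence_requests_per_shard",
		map[string]string{"operation": "all", "shard_id": "7", "is_retry": "false"})
	fmt.Println(s.consistent("persistence_requests_per_shard")) // true
}
```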
CI failed: SQLite WASM out-of-bounds memory access crashes during SQLite integration tests (infrastructure flake); secondary ndc integration test failure likely related to PR metrics label changes in persistence wrapper causing assertion mismatches.

Overview
Analyzed 4 CI logs across 2 unique templates. Found 2 distinct failure patterns: (1) a flaky SQLite WASM runtime crash unrelated to code changes, and (2) an ndc integration test failure that may be related to the PR's persistence metrics label consistency changes.

Failures
SQLite WASM Out-of-Bounds Memory Access (confidence: high)
NDC Integration Test Failure (confidence: medium)
Summary
Code Review ✅ Approved (2 resolved / 2 findings)

Fixes Prometheus label set inconsistencies in persistence metrics by ensuring stable label sets across per-shard and base metrics, addressing label-set variations that could cause cardinality issues. No issues found.

✅ 2 resolved
✅ Bug (resolved): New label-set inconsistency introduced for per-shard metrics
✅ Bug (resolved): Per-shard metrics still have inconsistent label sets across methods

Rules ✅ All requirements met (Repository Rules)
1 rule not applicable.
What changed?
Fixed inconsistent label sets across scopes emitting the same Prometheus metric name, which caused the Prometheus reporter to silently drop metrics.
Fixes #7844.
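For context, the failure mode can be modeled with a small stdlib-only sketch (this mimics client_golang's label-consistency rule in spirit; it does not use the real Prometheus API, and all names below are illustrative):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// registry models (in spirit) client_golang's rule that all collectors
// registered under one metric name must share a label-name set.
type registry struct {
	labelSets map[string]string // metric name -> canonical label-name set
}

func newRegistry() *registry {
	return &registry{labelSets: map[string]string{}}
}

func canonical(labels []string) string {
	ls := append([]string(nil), labels...)
	sort.Strings(ls)
	return strings.Join(ls, ",")
}

// register fails when a metric name is re-registered with a different
// label-name set; Tally swallows the analogous error and hands back a
// noop metric, which is how data was silently dropped.
func (r *registry) register(name string, labels []string) error {
	c := canonical(labels)
	if prev, ok := r.labelSets[name]; ok && prev != c {
		return fmt.Errorf("metric %q: label set [%s] conflicts with [%s]", name, c, prev)
	}
	r.labelSets[name] = c
	return nil
}

func main() {
	r := newRegistry()
	// First emission path includes is_retry...
	err1 := r.register("persistence_requests_per_shard",
		[]string{"operation", "shard_id", "is_retry"})
	// ...the second (pre-fix) path omitted it and is rejected.
	err2 := r.register("persistence_requests_per_shard",
		[]string{"operation", "shard_id"})
	fmt.Println(err1 == nil, err2 == nil) // true false
}
```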
Why?
The Prometheus `client_golang` library requires that all descriptors registered under the same metric name share the same label-name set. When Cadence emitted the same metric from scopes with different tag sets, the second registration failed and Tally returned a `noopMetric{}`, silently discarding all data from that scope. Two specific patterns were broken:

- `persistence_requests_per_shard` / `persistence_latency_per_shard`: In `callWithDomainAndShardScope`, `shardOperationsMetricsScope` included `additionalTags` (e.g., `is_retry`) but `shardOverallMetricsScope` did not. Both scopes emitted the same metric names with different label sets. Fixed by passing `additionalTags` to `shardOverallMetricsScope`.
- `persistence_requests` / `persistence_latency`: Three methods (`CompleteHistoryTask`, `GetHistoryTasks`, `RangeCompleteHistoryTask`) added a `task_category` tag via `getCustomMetricTags` in the `callWithoutDomainTag` fallback path, while all other ExecutionManager methods did not. This created an inconsistent label set for the shared `persistence_requests` and `persistence_latency` metrics. Fixed by removing `getCustomMetricTags` from the `callWithoutDomainTag` path in the codegen template; the `operation` label already identifies the method uniquely.

Note: the cache metrics inconsistencies (`cache_count`, `cache_evict`, `cache_latency`) mentioned in #7844 originate in the cache layer, not the persistence metered wrapper, and are left for a follow-up.

How did you test it?
```
go test -race ./common/persistence/wrappers/metered/...
go build ./...
```

All existing tests pass. The generated file `execution_generated.go` was regenerated from the updated template via `make go-generate GEN_DIR=common/persistence`.

Potential risks
- The `task_category` tag is no longer emitted on `persistence_requests` / `persistence_latency` for the three history-task methods when the request has no domain. Dashboards querying these metrics by `task_category` on the no-domain path would need to switch to filtering by `operation` instead. The `operation` label already uniquely identifies these methods, so no information is lost.
- `shardOverallMetricsScope` now carries `additionalTags` (currently `is_retry`), slightly increasing label cardinality for the `PersistenceShardRequestCountScope`. This is the intended fix: all emission paths for `persistence_requests_per_shard` must share the same label set.

Release notes
Fixed silent metric data loss when using the Prometheus reporter. Persistence metrics (`persistence_requests_per_shard`, `persistence_latency_per_shard`, `persistence_requests`, `persistence_latency`) were emitted with inconsistent label sets across code paths, causing Prometheus to reject the second registration and drop all data from the affected scope.

Documentation Changes
N/A