Skip to content

feat: Add gauge replacements for non-duration replication timers (CDNC-17893)#7858

Open
zawadzkidiana wants to merge 2 commits intocadence-workflow:masterfrom
zawadzkidiana:diana/gauge-replacements-nonduration
Open

feat: Add gauge replacements for non-duration replication timers (CDNC-17893)#7858
zawadzkidiana wants to merge 2 commits intocadence-workflow:masterfrom
zawadzkidiana:diana/gauge-replacements-nonduration

Conversation

@zawadzkidiana
Copy link
Copy Markdown
Contributor

@zawadzkidiana zawadzkidiana commented Mar 30, 2026

Fixes #7843

What changed?

Add gauge metrics alongside existing timer emissions for 6 non-duration
replication metrics that currently abuse Timer type to record count/lag
values. These metrics represent point-in-time snapshot values (cache sizes,
task counts, lag counts) rather than durations, and should use Gauge type.

Metrics migrated:

Timer (old) Gauge (new) What it measures
cache_size cache_size_gauge Current replication cache item count
replication_tasks_lag replication_tasks_lag_gauge Current replication lag (task count)
replication_tasks_lag_raw replication_tasks_lag_raw_gauge Raw replication lag before hydration
replication_tasks_fetched replication_tasks_fetched_gauge Tasks fetched per batch
replication_tasks_returned replication_tasks_returned_gauge Tasks returned per batch
replication_tasks_returned_diff replication_tasks_returned_diff_gauge Diff between fetched and returned

Why?

These timers record integer counts cast to time.Duration, which is semantically wrong
and produces confusing units in dashboards. Gauges are the correct metric type for
point-in-time values. This uses the GaugeMigration framework added in #7834 to gate
emission, enabling a controlled migration.

How did you test it?

  • go build ./... - compiles cleanly
  • go test ./common/metrics/... - all pass
  • go test ./service/history/replication/... - all pass
  • Gauge emission is gated by GaugeMigrationMetrics + GaugeMigration config (default: emit)

Potential risks

None — this only adds new gauge emissions alongside existing timer/histogram emissions.
No existing metrics are modified or removed.

Is it a breaking change?

No

[Release notes]
Added gauge metric replacements for 6 non-duration replication timers (cache_size_gauge,
replication_tasks_lag_gauge, replication_tasks_lag_raw_gauge, replication_tasks_fetched_gauge,
replication_tasks_returned_gauge, replication_tasks_returned_diff_gauge). These gauges
are emitted alongside existing timers and controlled by the GaugeMigration framework.

[Documentation Changes]
N/A

@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Mar 30, 2026

Thanks for the PR! A couple of items to address:

  1. GitHub Issue Link: The PR references CDNC-17893 (Jira) but is missing a GitHub issue link. Please add one in the description using one of these formats: #123, cadence-workflow/repo#123, or https://github.com/cadence-workflow/repo/issues/123.

  2. PR Description Sections: The required [Release notes] and [Documentation Changes] sections are missing from the description. Please add them (you can mark them as N/A if not applicable), for example:

    • [Release notes]: N/A — internal metric type migration, no user-facing functionality changed.
    • [Documentation Changes]: N/A — no configuration or user-facing changes requiring doc updates.

@zawadzkidiana zawadzkidiana force-pushed the diana/gauge-replacements-nonduration branch from 3894b0b to f249e4d Compare March 31, 2026 22:53
Comment thread service/history/task/task.go Outdated
…C-17893)

Add gauge metrics alongside existing timer emissions for 6 non-duration
replication metrics that currently abuse Timer type for count/lag values:
cache_size, replication_tasks_lag, replication_tasks_lag_raw,
replication_tasks_fetched, replication_tasks_returned,
replication_tasks_returned_diff.

Each gauge is registered in GaugeMigrationMetrics for controlled emission
via the GaugeMigration framework. This enables operators to validate
gauge metrics in dashboards before disabling the legacy timers.

Signed-off-by: Diana Zawadzki <dzawa@live.de>
@zawadzkidiana zawadzkidiana force-pushed the diana/gauge-replacements-nonduration branch from f249e4d to 8cd6724 Compare March 31, 2026 23:00
Resolve conflict in config.go: keep gauge migration metric entries
with new init comment from master.
@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 8, 2026

Code Review ✅ Approved 1 resolved / 1 findings

Adds gauge replacements for non-duration replication timers, addressing ExponentialTaskQueueLatency metric emission on per-domain scope. No issues found.

✅ 1 resolved
Bug: ExponentialTaskQueueLatency emitted on per-domain scope

📄 service/history/task/task.go:395
At line 395, ExponentialTaskQueueLatency (an aggregate metric without "PerDomain" suffix) is emitted on t.scope, which is a per-domain tagged scope created via getOrCreateDomainTaggedScope(). This means the aggregate histogram will carry domain tags, producing one time-series per domain rather than a single aggregate series.

The other aggregate metrics in this method (e.g., TaskAttemptTimerPerDomain, TaskLatencyPerDomain) rely on metricRollupName for automatic rollup from per-domain to aggregate. Since histograms don't support rollup (as noted in the PR description), the fix needs to emit on an untagged scope instead.

Emitting on a domain-tagged scope won't crash anything, but it defeats the purpose of having an aggregate metric — dashboards querying task_latency_queue_ns without domain filters will see per-domain cardinality, and aggregation behavior will differ from what's expected.

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: The PR description meets all quality standards with required sections, specific test commands, clear explanation of WHY, and appropriate risk assessment.

2 rules not applicable. Show all rules by commenting gitar display:verbose.

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add GaugeMigration and CounterMigration frameworks for controlled metric emission

4 participants