feat: Add gauge replacements for non-duration replication timers (CDNC-17893) #7858
Thanks for the PR! A couple of items to address:
3894b0b to f249e4d
…C-17893)
Add gauge metrics alongside existing timer emissions for 6 non-duration replication metrics that currently abuse the Timer type for count/lag values: cache_size, replication_tasks_lag, replication_tasks_lag_raw, replication_tasks_fetched, replication_tasks_returned, replication_tasks_returned_diff. Each gauge is registered in GaugeMigrationMetrics for controlled emission via the GaugeMigration framework. This enables operators to validate gauge metrics in dashboards before disabling the legacy timers.
Signed-off-by: Diana Zawadzki <dzawa@live.de>
f249e4d to 8cd6724
Resolve conflict in config.go: keep gauge migration metric entries with new init comment from master.
Code Review ✅ Approved (1 resolved / 1 findings)
Adds gauge replacements for non-duration replication timers, addressing ExponentialTaskQueueLatency metric emission on per-domain scope. No issues found.
✅ 1 resolved
✅ Bug: ExponentialTaskQueueLatency emitted on per-domain scope
Rules ✅ All requirements met
Fixes #7843
What changed?
Add gauge metrics alongside existing timer emissions for 6 non-duration
replication metrics that currently abuse Timer type to record count/lag
values. These metrics represent point-in-time snapshot values (cache sizes,
task counts, lag counts) rather than durations, and should use Gauge type.
Metrics migrated:
- cache_size → cache_size_gauge
- replication_tasks_lag → replication_tasks_lag_gauge
- replication_tasks_lag_raw → replication_tasks_lag_raw_gauge
- replication_tasks_fetched → replication_tasks_fetched_gauge
- replication_tasks_returned → replication_tasks_returned_gauge
- replication_tasks_returned_diff → replication_tasks_returned_diff_gauge
Why?
These timers record integer counts cast to time.Duration, which is semantically wrong
and produces confusing units in dashboards. Gauges are the correct metric type for
point-in-time values. This uses the GaugeMigration framework added in #7834 to gate
emission, enabling a controlled migration.
How did you test it?
- go build ./... - compiles cleanly
- go test ./common/metrics/... - all pass
- go test ./service/history/replication/... - all pass
- GaugeMigrationMetrics + GaugeMigration config (default: emit)
Potential risks
None. This only adds new gauge emissions alongside existing timer/histogram emissions.
No existing metrics are modified or removed.
Is it a breaking change?
No
[Release notes]
Added gauge metric replacements for 6 non-duration replication timers (cache_size_gauge, replication_tasks_lag_gauge, replication_tasks_lag_raw_gauge, replication_tasks_fetched_gauge, replication_tasks_returned_gauge, replication_tasks_returned_diff_gauge). These gauges are emitted alongside existing timers and controlled by the GaugeMigration framework.
[Documentation Changes]
N/A