enhance: shard cgo active future manager with 32-core groups#49047
CLiqing wants to merge 1 commit into milvus-io:master from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: CLiqing. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing
@CLiqing Thanks for your contribution. Please submit with DCO; see the contributing guide https://github.com/milvus-io/milvus/blob/master/CONTRIBUTING.md#developer-certificate-of-origin-dco.
Invalid PR Title Format Detected Your PR submission does not adhere to our required standards. To ensure clarity and consistency, please meet the following criteria:
Please review and update your PR title to comply with these guidelines.
[ci-v2-notice] To rerun ci-v2 checks, comment with:
If you have any questions or requests, please contact @zhikunyao. |
Use a 256-slot register buffer and route future registration across `cpuNum/32` managers (minimum 1 shard) to reduce single-manager contention in the cancel/select loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: CLiqing <2208529306@qq.com>
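The shard-count formula and round-robin routing described above can be sketched as follows. This is a minimal illustration of the technique, not the actual Milvus code; `shardCount` and `router` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// shardCount mirrors the PR's cpuNum/32 grouping with a floor of one shard.
func shardCount(cpuNum int) int {
	n := cpuNum / 32
	if n < 1 {
		n = 1
	}
	return n
}

// router picks a shard via an atomic round-robin counter, as the PR
// description outlines.
type router struct {
	seq    atomic.Uint64
	shards int
}

func (r *router) next() int {
	// Add(1) returns the post-increment value, so the sequence starts at 1
	// and cycles uniformly over all shards.
	return int(r.seq.Add(1) % uint64(r.shards))
}

func main() {
	r := &router{shards: shardCount(128)} // 4 shards on a 128-core box
	counts := make([]int, r.shards)
	for i := 0; i < 4000; i++ {
		counts[r.next()]++
	}
	fmt.Println(counts) // [1000 1000 1000 1000]: uniform distribution
}
```

With this routing, each shard's select loop and register channel sees roughly 1/N of the futures, which is the contention reduction the PR targets.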
Force-pushed 03608b3 to e30d7ec
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##           master   #49047    +/-  ##
=========================================
- Coverage   78.02%   77.90%   -0.12%
=========================================
  Files        2167     2167
  Lines      356124   356265    +141
=========================================
- Hits       277868   277560    -308
- Misses      69753    70148    +395
- Partials     8503     8557     +54
```
foxspy
left a comment
Review: Shard CGO Active Future Manager
The sharding approach is sound — distributing futures across N managers (CPU/32, min 1) with atomic round-robin eliminates the single-goroutine bottleneck on the select loop and register channel. The mechanical changes are correct.
However, there is a critical bug in Prometheus metric reporting that must be fixed before merge.
🔴 Critical: ActiveFutureTotal gauge clobbered by multiple shards
File: internal/util/cgo/manager_active.go, lines 104-106 (unchanged code, but broken by this PR)

```go
metrics.ActiveFutureTotal.WithLabelValues(m.nodeID).Set(float64(activeTotal))
```

`ActiveFutureTotal` is a GaugeVec keyed only by `nodeID`. All shards share the same `nodeID`, so each shard calls `.Set()` on the exact same gauge series with its own local `activeTotal`. The last shard to call `doSelect()` wins — the gauge shows only that shard's count, not the sum.
Impact: On a 128-core machine (4 shards), the metric could report ~25% of the true active future count. This breaks production monitoring and any alerting built on this metric.
Suggested fix (pick one):
- Option A (simplest): switch from `.Set()` to `.Add(1)` on register and `.Sub(1)` on completion — this naturally aggregates across shards.
- Option B: add a shard-index label so each shard reports independently and Grafana queries use `sum()`.
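The clobbering and the Option A fix can be demonstrated with a plain atomic counter standing in for the shared Prometheus series (the `gauge` type and per-shard counts below are illustrative, not the real metrics code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gauge stands in for the single series behind
// ActiveFutureTotal.WithLabelValues(nodeID) that all shards share.
type gauge struct{ v atomic.Int64 }

func (g *gauge) Set(x int64) { g.v.Store(x) }
func (g *gauge) Add(x int64) { g.v.Add(x) }

func main() {
	perShard := []int64{30, 25, 20, 25} // local activeTotal of 4 shards

	// Broken pattern: each shard Set()s its own count on the shared
	// series; the last writer wins, so the gauge under-reports.
	var clobbered gauge
	for _, n := range perShard {
		clobbered.Set(n)
	}
	fmt.Println(clobbered.v.Load()) // 25, not the true total of 100

	// Option A: Add on register (and Sub on completion) aggregates
	// naturally across shards, so the shared series holds the sum.
	var summed gauge
	for _, n := range perShard {
		summed.Add(n) // n futures registered by this shard
	}
	fmt.Println(summed.v.Load()) // 100
}
```

Note that Option A changes the metric from a periodically reconciled snapshot to an incrementally maintained counter, so every register path must be matched by exactly one completion path.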
🟡 Medium: defaultRegisterBuf 256x increase lacks rationale
1 → 256 is a large jump. A buffer of 256 per shard means up to 1024 futures can queue (on a 4-shard system) before any caller blocks. This is probably fine but could mask overload. A brief comment explaining the sizing rationale would help future readers.
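The queuing behavior of a per-shard buffer can be shown with a plain buffered channel sized like the new `defaultRegisterBuf` (the `future` type and channel name here are stand-ins, not the PR's actual types):

```go
package main

import "fmt"

// future is a stand-in for the registered cgo future.
type future struct{}

func main() {
	// One shard's register channel with the PR's 256-slot buffer.
	register := make(chan future, 256)

	// 256 registrations queue without the select loop consuming any...
	for i := 0; i < 256; i++ {
		register <- future{}
	}
	fmt.Println(len(register)) // 256: buffer full

	// ...and the 257th caller would block until the loop drains one.
	select {
	case register <- future{}:
		fmt.Println("unreachable: buffer had room")
	default:
		fmt.Println("caller would block")
	}
}
```

This is why the reviewer notes that a multi-shard system can absorb up to shards × 256 queued futures before any caller blocks, which smooths bursts but can also delay backpressure under sustained overload.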
🟢 Verified — no issues found
- No stray references to the old `futureManager` singleton — grep confirms all references are in the three modified files. ✓
- `hardware.GetCPUNum()` uses `GOMAXPROCS(0)`, which respects cgroup limits after `automaxprocs` init. Correct for k8s. ✓
- Round-robin distribution via `registerSeq.Inc() % managerCount` is correct. `Inc()` returns the post-increment value (starts at 1), so shard 0 gets its first future when the counter wraps to a multiple of N — uniform distribution over time. ✓
- Test aggregation correctly iterates all shards and sums `ActiveCount`. ✓
- No shutdown/cleanup path existed before this PR and none is needed now (pre-existing design — managers run for process lifetime). ✓
- `maxSelectCase` is now per-shard — each shard sees fewer futures, which is strictly better for the select loop. ✓