Skip to content

enhance: shard cgo active future manager with 32-core groups#49047

Open
CLiqing wants to merge 1 commit intomilvus-io:masterfrom
CLiqing:perf/buf256-shard32-pr
Open

enhance: shard cgo active future manager with 32-core groups#49047
CLiqing wants to merge 1 commit intomilvus-io:masterfrom
CLiqing:perf/buf256-shard32-pr

Conversation

@CLiqing
Copy link
Copy Markdown
Contributor

@CLiqing CLiqing commented Apr 15, 2026

  • increase defaultRegisterBuf from 1 to 256 for active future registration
  • shard activeFutureManager by CPU groups (numShards = cpuNum/32, minimum 1 shard) and route registrations round-robin
  • update TestConcurrent to aggregate active future counts across all shards

@sre-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CLiqing
To complete the pull request process, please assign czs007 after the PR has been reviewed.
You can assign the PR to them by writing /assign @czs007 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot added the size/M Denotes a PR that changes 30-99 lines. label Apr 15, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Apr 15, 2026

@CLiqing Thanks for your contribution. Please submit with DCO, see the contributing guide https://github.com/milvus-io/milvus/blob/master/CONTRIBUTING.md#developer-certificate-of-origin-dco.

@mergify mergify bot added the needs-dco DCO is missing in this pull request. label Apr 15, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Apr 15, 2026

@CLiqing

Invalid PR Title Format Detected

Your PR submission does not adhere to our required standards. To ensure clarity and consistency, please meet the following criteria:

  1. Title Format: The PR title must begin with one of these prefixes:
  • feat: for introducing a new feature.
  • fix: for bug fixes.
  • enhance: for improvements to existing functionality.
  • test: for add tests to existing functionality.
  • doc: for modifying documentation.
  • auto: for the pull request from bot.
  • build(deps): for dependency updates from Dependabot.
  1. Description Requirement: The PR must include a non-empty description, detailing the changes and their impact.

Required Title Structure:

[Type]: [Description of the PR]

Where Type is one of feat, fix, enhance, test or doc.

Example:

enhance: improve search performance significantly 

Please review and update your PR to comply with these guidelines.

@sre-ci-robot
Copy link
Copy Markdown
Contributor

[ci-v2-notice]
Notice: New ci-v2 system is enabled for this PR.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-build-all // for ci-v2/build-all (multi-arch builds)
  • /ci-rerun-buildenv // for ci-v2/build-env (build milvus-env builder images)
  • /ci-rerun-ut-integration // for ci-v2/ut-integration, will rerun ci-v2/build
  • /ci-rerun-ut-go // for ci-v2/ut-go, will rerun ci-v2/build
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp, will rerun ci-v2/build
  • /ci-rerun-e2e-default // for ci-v2/e2e-default
  • /ci-rerun-e2e-amd // for ci-v2/e2e-amd (e2e pool dispatcher)
  • /ci-rerun-build-ut-cov // for ci-v2/build-ut-cov (build + unit tests in one pipeline)
  • /ci-rerun-gosdk // for ci-v2/go-sdk (Go SDK E2E tests, ARM)

If you have any questions or requests, please contact @zhikunyao.

Use a 256-slot register buffer and route future registration across
`cpuNum/32` managers (minimum 1 shard) to reduce single-manager
contention in the cancel/select loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: CLiqing <2208529306@qq.com>
@CLiqing CLiqing force-pushed the perf/buf256-shard32-pr branch from 03608b3 to e30d7ec Compare April 15, 2026 07:26
@mergify mergify bot added dco-passed DCO check passed. and removed needs-dco DCO is missing in this pull request. labels Apr 15, 2026
@CLiqing CLiqing changed the title enhance cgo future manager sharding with 32-core groups enhance: shard cgo active future manager with 32-core groups Apr 15, 2026
@mergify mergify bot added kind/enhancement Issues or changes related to enhancement and removed do-not-merge/invalid-pr-format labels Apr 15, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.90%. Comparing base (0795e4d) to head (e30d7ec).
⚠️ Report is 26 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #49047      +/-   ##
==========================================
- Coverage   78.02%   77.90%   -0.12%     
==========================================
  Files        2167     2167              
  Lines      356124   356265     +141     
==========================================
- Hits       277868   277560     -308     
- Misses      69753    70148     +395     
- Partials     8503     8557      +54     
Components Coverage Δ
Client 79.25% <ø> (ø)
Core 84.43% <43.85%> (-0.02%) ⬇️
Go 76.17% <57.23%> (-0.17%) ⬇️
Files with missing lines Coverage Δ
internal/util/cgo/futures.go 98.97% <100.00%> (+0.01%) ⬆️
internal/util/cgo/manager_active.go 98.48% <100.00%> (+0.20%) ⬆️

... and 63 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Apr 15, 2026
@mergify mergify bot added the ci-passed label Apr 15, 2026
Copy link
Copy Markdown
Contributor

@foxspy foxspy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Shard CGO Active Future Manager

The sharding approach is sound — distributing futures across N managers (CPU/32, min 1) with atomic round-robin eliminates the single-goroutine bottleneck on the select loop and register channel. The mechanical changes are correct.

However, there is a critical bug in Prometheus metric reporting that must be fixed before merge.


🔴 Critical: ActiveFutureTotal gauge clobbered by multiple shards

File: internal/util/cgo/manager_active.go, line 104-106 (unchanged code, but broken by this PR)

metrics.ActiveFutureTotal.WithLabelValues(m.nodeID).Set(float64(activeTotal))

ActiveFutureTotal is a GaugeVec keyed only by nodeID. All shards share the same nodeID, so each shard calls .Set() on the exact same gauge series with its own local activeTotal. The last shard to call doSelect() wins — the gauge shows only that shard's count, not the sum.

Impact: On a 128-core machine (4 shards), the metric could report ~25% of the true active future count. This breaks production monitoring and any alerting built on this metric.

Suggested fix (pick one):

  • Option A (simplest): Switch from .Set() to .Add(1) on register and .Sub(1) on completion — this naturally aggregates across shards.
  • Option B: Add a shard-index label so each shard reports independently and Grafana queries use sum().

🟡 Medium: defaultRegisterBuf 256x increase lacks rationale

1 → 256 is a large jump. A buffer of 256 per shard means up to 1024 futures can queue (on a 4-shard system) before any caller blocks. This is probably fine but could mask overload. A brief comment explaining the sizing rationale would help future readers.


🟢 Verified — no issues found

  • No stray references to old futureManager singleton — grep confirms all references are in the three modified files. ✓
  • hardware.GetCPUNum() uses GOMAXPROCS(0) which respects cgroup limits after automaxprocs init. Correct for k8s. ✓
  • Round-robin distribution via registerSeq.Inc() % managerCount is correct. Inc() returns post-increment (starts at 1), so shard 0 gets its first future when the counter wraps to a multiple of N — uniform distribution over time. ✓
  • Test aggregation correctly iterates all shards and sums ActiveCount. ✓
  • No shutdown/cleanup path existed before this PR and none is needed now (pre-existing design — managers run for process lifetime). ✓
  • maxSelectCase is now per-shard — each shard sees fewer futures, which is strictly better for the select loop. ✓

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement low-code-coverage add test-label from zhikun, diff coverage > 80% size/M Denotes a PR that changes 30-99 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants