Skip to content

enhance: decouple parquet metadata loading from GroupChunkTranslator#49026

Merged
sre-ci-robot merged 2 commits intomilvus-io:masterfrom
sparknack:translator-mem-fix
Apr 17, 2026
Merged

enhance: decouple parquet metadata loading from GroupChunkTranslator#49026
sre-ci-robot merged 2 commits intomilvus-io:masterfrom
sparknack:translator-mem-fix

Conversation

@sparknack
Copy link
Copy Markdown
Contributor

issue: #49025

Summary

  • Extract parquet metadata loading out of GroupChunkTranslator's constructor into a free LoadGroupChunkMetadata function declared in segcore/ChunkedSegmentSealedImpl.h. Returns { row_group_meta_list, parquet_stats_by_field }.
  • When ENABLE_PARQUET_STATS_SKIP_INDEX is off (default), the loader only reads the lightweight RowGroupMetadataVector (24 B per row group) and skips GetParquetMetadata() entirely — no parquet::FileMetaData is loaded or retained.
  • When on, stats are extracted during the same parallel file pass and the FileMetaData is released before the translator is constructed.
  • GroupChunkTranslator now accepts std::vector<milvus_storage::RowGroupMetadataVector>&& directly; the parquet_file_metas() / field_id_mapping() getters are removed.
  • Updated both call sites (ChunkedSegmentSealedImpl::load_column_group_data_internal, JsonKeyStats) and the translator unit test.

Test plan

  • Local build passes (milvus_segcore, milvus_index)
  • Existing GroupChunkTranslatorTest updated and compiles
  • Full CI — C++ unit tests + integration tests

🤖 Generated with Claude Code

Previously GroupChunkTranslator loaded and retained full
parquet::FileMetaData in its constructor even when the parquet
stats skip index was disabled (the default). This wasted memory
across long-lived translators and coupled the translator to
parquet-level types.

Extract metadata loading into a free LoadGroupChunkMetadata
declared in ChunkedSegmentSealedImpl.h: callers pass
field_ids_for_stats only when ENABLE_PARQUET_STATS_SKIP_INDEX is
on, otherwise the loader skips GetParquetMetadata() entirely and
returns only the lightweight RowGroupMetadataVector. The
translator now takes row_group_meta_list directly and no longer
holds parquet FileMetaData.

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Apr 14, 2026
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels Apr 14, 2026
@sre-ci-robot
Copy link
Copy Markdown
Contributor

[ci-v2-notice]
Notice: New ci-v2 system is enabled for this PR.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-build-all // for ci-v2/build-all (multi-arch builds)
  • /ci-rerun-buildenv // for ci-v2/build-env (build milvus-env builder images)
  • /ci-rerun-ut-integration // for ci-v2/ut-integration, will rerun ci-v2/build
  • /ci-rerun-ut-go // for ci-v2/ut-go, will rerun ci-v2/build
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp, will rerun ci-v2/build
  • /ci-rerun-e2e-default // for ci-v2/e2e-default
  • /ci-rerun-e2e-amd // for ci-v2/e2e-amd (e2e pool dispatcher)
  • /ci-rerun-build-ut-cov // for ci-v2/build-ut-cov (build + unit tests in one pipeline)
  • /ci-rerun-gosdk // for ci-v2/go-sdk (Go SDK E2E tests, ARM)

If you have any questions or requests, please contact @zhikunyao.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.95%. Comparing base (f3aacd1) to head (5c0df5c).
⚠️ Report is 35 commits behind head on master.

Files with missing lines Patch % Lines
...rnal/core/src/segcore/ChunkedSegmentSealedImpl.cpp 55.22% 30 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #49026      +/-   ##
==========================================
- Coverage   77.97%   77.95%   -0.03%     
==========================================
  Files        2167     2167              
  Lines      356240   356274      +34     
==========================================
- Hits       277783   277734      -49     
- Misses      69915    69984      +69     
- Partials     8542     8556      +14     
Components Coverage Δ
Client 79.25% <ø> (ø)
Core 84.44% <60.00%> (-0.01%) ⬇️
Go 76.23% <40.62%> (-0.03%) ⬇️
Files with missing lines Coverage Δ
...nternal/core/src/index/json_stats/JsonKeyStats.cpp 88.04% <100.00%> (+0.03%) ⬆️
...ternal/core/src/segcore/ChunkedSegmentSealedImpl.h 69.45% <100.00%> (+0.15%) ⬆️
...gcore/storagev2translator/GroupChunkTranslator.cpp 86.66% <100.00%> (-0.63%) ⬇️
...segcore/storagev2translator/GroupChunkTranslator.h 100.00% <ø> (ø)
...e/storagev2translator/GroupChunkTranslatorTest.cpp 99.01% <100.00%> (+<0.01%) ⬆️
...rnal/core/src/segcore/ChunkedSegmentSealedImpl.cpp 66.66% <55.22%> (+0.20%) ⬆️

... and 42 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Apr 14, 2026
@sparknack
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

@mergify mergify bot added the ci-passed label Apr 16, 2026
Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
@sre-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sparknack, zhengbuqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sparknack
Copy link
Copy Markdown
Contributor Author

/ci-rerun-ut-go

@sparknack
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

1 similar comment
@sparknack
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

@mergify mergify bot added the ci-passed label Apr 17, 2026
@aoiasd
Copy link
Copy Markdown
Contributor

aoiasd commented Apr 17, 2026

/lgtm

@sre-ci-robot sre-ci-robot merged commit f332aeb into milvus-io:master Apr 17, 2026
17 of 22 checks passed
sre-ci-robot pushed a commit that referenced this pull request Apr 17, 2026
…lator (#49027)

issue: #49025
pr: #49026

## Summary

Cherry-pick of #49026 to the 3.0 branch.

- Extract parquet metadata loading out of `GroupChunkTranslator`'s
constructor into a free `LoadGroupChunkMetadata` function declared in
`segcore/ChunkedSegmentSealedImpl.h`. Returns `{ row_group_meta_list,
parquet_stats_by_field }`.
- When `ENABLE_PARQUET_STATS_SKIP_INDEX` is off (default), the loader
only reads the lightweight `RowGroupMetadataVector` (24 B per row group)
and skips `GetParquetMetadata()` entirely — no `parquet::FileMetaData`
is loaded or retained.
- When on, stats are extracted during the same parallel file pass and
the `FileMetaData` is released before the translator is constructed.
- `GroupChunkTranslator` now accepts
`std::vector<milvus_storage::RowGroupMetadataVector>&&` directly; the
`parquet_file_metas()` / `field_id_mapping()` getters are removed.
- Updated both call sites
(`ChunkedSegmentSealedImpl::load_column_group_data_internal`,
`JsonKeyStats`) and the translator unit test.

## Test plan

- [x] Local build passes (`milvus_segcore`, `milvus_index`) on 3.0
- [x] Cherry-pick applied with auto-merge, no manual conflict resolution
needed
- [ ] Full CI — C++ unit tests + integration tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement lgtm low-code-coverage add test-label from zhikun, diff coverage > 80% size/L Denotes a PR that changes 100-499 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants