Skip to content

enhance: [ExternalTable Part8 - 1/4] cross-bucket, schemaless reader, force nullable#49061

Open
weiliu1031 wants to merge 1 commit intomilvus-io:masterfrom
weiliu1031:part8-1of4-cross-bucket
Open

enhance: [ExternalTable Part8 - 1/4] cross-bucket, schemaless reader, force nullable#49061
weiliu1031 wants to merge 1 commit intomilvus-io:masterfrom
weiliu1031:part8-1of4-cross-bucket

Conversation

@weiliu1031
Copy link
Copy Markdown
Contributor

@weiliu1031 weiliu1031 commented Apr 15, 2026

issue: #45881

Summary

Part 1/4 of the External Table Part 8 series. Foundation layer for external collections — delivers data-plane primitives that Parts 2–4 build on.

  • Cross-bucket external data source: detect same-bucket prefix; route non-matching paths through extfs.
  • Schemaless reader: open reader with nullptr arrow schema + explicit needed_columns projection instead of full pre-built arrow schema.
  • Force nullable: external user fields set to nullable=true so Arrow buffers from external sources are consumed uniformly (valid_data populated in ArrowToDataArray).
  • API cleanup: replace external_access_mode string with use_take_for_output bool.
  • Load optimization: skip redundant S3 data pull during segment load.
  • Shared helper: new pkg/util/externalspec parsing package + paramtable entries (100% test coverage).
  • Arrow codecs: enable snappy/lz4 in Arrow build (fixes #48869).

Related to #45881 (External Collection with Lakehouse Integration tracking).

Why split this way

This PR intentionally excludes the refresh pipeline (manager/checker/inspector/task, schema DDL updater, refresh task update, e2e refresh tests) — those land in Part 2/4. Each PR stays on a single theme for reviewability.

Test plan

  • `make milvus` — clean build
  • `go build -tags dynamic,test ./... && cd pkg && go build -tags dynamic,test ./...` — 0 errors
  • `go test -tags dynamic,test ./internal/datanode/external/... ./internal/storagev2/packed/...` — 66 PASS / 0 FAIL
  • `go test -tags dynamic,test ./pkg/v2/util/externalspec/...` — 15 PASS, 100% coverage
  • `golangci-lint run --build-tags=dynamic,test` on all changed packages — 0 issues
  • CI validation (ut-go / ut-cpp / integration-test / e2e)

@sre-ci-robot sre-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines. label Apr 15, 2026
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels Apr 15, 2026
@sre-ci-robot
Copy link
Copy Markdown
Contributor

[ci-v2-notice]
Notice: New ci-v2 system is enabled for this PR.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-build-all // for ci-v2/build-all (multi-arch builds)
  • /ci-rerun-buildenv // for ci-v2/build-env (build milvus-env builder images)
  • /ci-rerun-ut-integration // for ci-v2/ut-integration, will rerun ci-v2/build
  • /ci-rerun-ut-go // for ci-v2/ut-go, will rerun ci-v2/build
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp, will rerun ci-v2/build
  • /ci-rerun-e2e-default // for ci-v2/e2e-default
  • /ci-rerun-e2e-amd // for ci-v2/e2e-amd (e2e pool dispatcher)
  • /ci-rerun-build-ut-cov // for ci-v2/build-ut-cov (build + unit tests in one pipeline)
  • /ci-rerun-gosdk // for ci-v2/go-sdk (Go SDK E2E tests, ARM)

If you have any questions or requests, please contact @zhikunyao.

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Apr 15, 2026

@weiliu1031 Please associate the related issue to the body of your Pull Request. (eg. "issue: #")

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 60.38217% with 311 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.02%. Comparing base (87b04e8) to head (4749d90).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
internal/core/src/storage/Util.cpp 32.72% 148 Missing ⚠️
internal/storagev2/packed/ffi_common.go 13.33% 77 Missing and 1 partial ⚠️
...rnal/core/src/segcore/ChunkedSegmentSealedImpl.cpp 30.23% 60 Missing ⚠️
internal/core/src/storage/loon_ffi/util.cpp 81.35% 11 Missing ⚠️
internal/querynodev2/segments/segment_loader.go 11.11% 7 Missing and 1 partial ⚠️
internal/core/src/indexbuilder/index_c.cpp 0.00% 4 Missing ⚠️
internal/datanode/index/task_index.go 0.00% 2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (60.38%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #49061      +/-   ##
==========================================
- Coverage   78.11%   78.02%   -0.09%     
==========================================
  Files        2169     2170       +1     
  Lines      358449   358360      -89     
==========================================
- Hits       279990   279598     -392     
- Misses      69879    70168     +289     
- Partials     8580     8594      +14     
Components Coverage Δ
Client 78.96% <ø> (ø)
Core 84.37% <63.32%> (-0.15%) ⬇️
Go 76.38% <50.28%> (-0.07%) ⬇️
Files with missing lines Coverage Δ
internal/core/src/common/Schema.cpp 81.39% <100.00%> (+15.43%) ⬆️
internal/core/src/common/Schema.h 92.12% <ø> (-1.48%) ⬇️
...ternal/core/src/segcore/ChunkedSegmentSealedImpl.h 69.45% <100.00%> (+0.15%) ⬆️
internal/core/src/segcore/SegmentLoadInfo.h 87.96% <ø> (ø)
...re/storagev2translator/ManifestGroupTranslator.cpp 79.36% <100.00%> (+2.04%) ⬆️
internal/core/src/storage/Util.h 68.62% <ø> (-5.89%) ⬇️
internal/core/unittest/test_external_take.cpp 97.33% <100.00%> (-1.00%) ⬇️
internal/datacoord/task_index.go 78.41% <100.00%> (+0.15%) ⬆️
internal/util/segcore/segment.go 78.39% <100.00%> (+0.05%) ⬆️
pkg/util/externalspec/external_spec.go 100.00% <100.00%> (ø)
... and 9 more

... and 29 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Apr 15, 2026
@weiliu1031 weiliu1031 force-pushed the part8-1of4-cross-bucket branch 9 times, most recently from 4392585 to 5d3478a Compare April 16, 2026 07:24
@tedxu
Copy link
Copy Markdown
Contributor

tedxu commented Apr 16, 2026

/lgtm
/approve

@sre-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tedxu, weiliu1031

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot removed the dco-passed DCO check passed. label Apr 16, 2026
@weiliu1031 weiliu1031 force-pushed the part8-1of4-cross-bucket branch from 3f54299 to d05aa88 Compare April 16, 2026 15:56
@mergify mergify bot added dco-passed DCO check passed. and removed needs-dco DCO is missing in this pull request. labels Apr 16, 2026
@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-build

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-ut-go

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-ut-cpp

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-ut-integration

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-ut-go

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-ut-cpp

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-integration

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

1 similar comment
@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-ciloop

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e

@sre-ci-robot
Copy link
Copy Markdown
Contributor

✅ CI Loop Results cdb5aee

Stage Result Duration Tests
✅ Build SUCCESS 48.9min -
✅ Code-Check SUCCESS 9.9min -
✅ UT-GO SUCCESS 21.6min 931 passed
✅ UT-Integration SUCCESS 24.7min 46 passed
✅ UT-CPP-Cov SUCCESS 58.3min 7318 passed

Total: 110min | Pipeline | Artifacts

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-go-sdk

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-go-sdk

1 similar comment
@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-go-sdk

@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

1 similar comment
@weiliu1031
Copy link
Copy Markdown
Contributor Author

/ci-rerun-e2e-default

… force nullable

This is part of a 4-PR series delivering Part 8 of the external
table feature:

  1/4  cross-bucket, schemaless reader, force nullable
  2/4  control plane — DDL validation, refresh infrastructure
  3/4  Iceberg external table support
  4/4  segment load — reduce overhead, memory estimation

- Cross-bucket external data source support: detect same-bucket prefix
  and treat non-matching paths as cross-bucket, routed via extfs.
- Schemaless reader: external collections open the Reader with a
  nullptr arrow schema plus an explicit needed_columns projection,
  instead of the full pre-built arrow schema.
- Force all external table user fields to nullable=true so the Arrow
  buffers emitted by external sources can be consumed uniformly, and
  populate valid_data accordingly in ArrowToDataArray.
- Replace external_access_mode string with use_take_for_output bool.
- Skip redundant S3 data pull during segment load.
- Introduce pkg/util/externalspec shared parsing helper and paramtable
  entries for external collection behavior.
- Fix CI compilation and test issues.

This commit intentionally does NOT include the refresh pipeline work
(manager/checker/inspector/task, schema DDL updater, refresh task
update, and e2e refresh tests); those land in the follow-up commit
so each commit stays on a single theme.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
@weiliu1031 weiliu1031 force-pushed the part8-1of4-cross-bucket branch from 0d9fdcd to 4749d90 Compare April 17, 2026 09:31
@mergify mergify bot added the ci-passed label Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved area/compilation area/dependency Pull requests that update a dependency file area/internal-api area/test ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement low-code-coverage add test-label from zhikun, diff coverage > 80% sig/testing size/XXL Denotes a PR that changes 1000+ lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: External table query fails with snappy/lz4 compressed Parquet — codec not built in StorageV2

3 participants