Fix Warp band-mismatch crash on 4-band rasters with implicit alpha #41

Draft
josemaria-vilaplana wants to merge 2 commits into master from
fix/sc-547263/4-band-warp-alpha-fallback

josemaria-vilaplana commented Apr 27, 2026

Summary

When a source raster has photometric metadata that doesn't add up cleanly (e.g. SamplesPerPixel != color channels + ExtraSamples, or RGBA inputs with ambiguous alpha tagging), GDAL Warp implicitly adds an alpha mask band to the destination. create_tile_ds was allocating exactly src.RasterCount bands, so Warp raised:

RuntimeError: Destination dataset has N bands, but at least N+1 are needed

The worker process died on that exception. Depending on raquet's version, the parent process either propagated AttributeError: 'NoneType' object has no attribute '__dict__' (when create_metadata tried to read missing stats) or silently produced a parquet with the metadata row but zero data tiles — both end with no usable output.

This was hit in production by a customer (CARTO/cloud-native) importing a 4-band Byte GeoTIFF whose Photometric was Gray (1 color channel) but SamplesPerPixel was 4, with the remaining 3 bands tagged ExtraSamples=Unspecified.

Approach

Strategy: retry once with an extra band, then make the choice sticky for the rest of the raster. Detection by failure rather than upfront introspection — GDAL's alpha heuristics drift across versions, so destination band-count is the only signal that's stable.

  • create_tile_ds accepts a new extra_bands: int = 0 so callers can reserve one extra band for the alpha mask Warp may add.
  • read_raster_data_stats accepts a new band_count: int | None = None so the alpha band is excluded from the parquet output (only the original data bands are persisted).
  • The native-zoom Warp now goes through warp_source_into_tile_with_alpha_fallback, which catches RuntimeError and retries once with extra_bands=1. Its result carries a needs_extra_band flag that the caller keeps sticky across tiles in the same raster, so the retry happens at most once per raster, never per tile.
  • Same logic applied to the parallel _read_raster_worker.
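
The retry-once behaviour described in the bullets above can be sketched without GDAL. This is a minimal illustration, not the PR's code: `WarpResult`, `warp_with_alpha_fallback`, and `fake_warp` are hypothetical names, and the `warp` callable stands in for the real create_tile_ds + Warp pair.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WarpResult:
    band_count: int          # bands allocated in the destination
    needs_extra_band: bool   # sticky flag for the caller to reuse


def warp_with_alpha_fallback(
    warp: Callable[[int], None],
    src_band_count: int,
    needs_extra_band: bool = False,
) -> WarpResult:
    """Call warp(dest_band_count); on RuntimeError retry once with one more band."""
    if needs_extra_band:
        # Flag already set by an earlier tile: allocate the wider destination up front.
        warp(src_band_count + 1)
        return WarpResult(src_band_count + 1, True)
    try:
        warp(src_band_count)
        return WarpResult(src_band_count, False)
    except RuntimeError:
        # Single retry; if this call also fails, its exception propagates.
        warp(src_band_count + 1)
        return WarpResult(src_band_count + 1, True)


def fake_warp(dest_band_count: int) -> None:
    # Stand-in for GDAL Warp on a raster that needs 4 data bands + 1 alpha band.
    if dest_band_count < 5:
        raise RuntimeError(
            f"Destination dataset has {dest_band_count} bands, but at least 5 are needed"
        )


first = warp_with_alpha_fallback(fake_warp, src_band_count=4)
second = warp_with_alpha_fallback(fake_warp, 4, needs_extra_band=first.needs_extra_band)
```

Passing `first.needs_extra_band` into the next call is what keeps the choice sticky: the second tile goes straight to the wider destination and never hits the failing path.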

Cost model

  • Healthy rasters: zero overhead. The fast path is unchanged: create_tile_ds allocates RasterCount bands, Warp succeeds, done.
  • Affected rasters: exactly one extra Warp call across the whole raster (the failed first attempt on tile #1). Subsequent tiles know to allocate RasterCount + 1 from the start.
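
To make the cost model concrete, here is a small simulation that counts Warp attempts over one raster. It is a sketch under assumptions: `count_warp_calls` and its parameters are hypothetical, and the inner `warp` function only imitates the band-count check rather than calling GDAL.

```python
def count_warp_calls(tile_count: int, src_bands: int, required_bands: int) -> int:
    """Simulate one raster and return how many Warp attempts were made."""
    calls = 0
    needs_extra_band = False  # sticky: once True, stays True for the raster

    def warp(dest_bands: int) -> None:
        nonlocal calls
        calls += 1
        if dest_bands < required_bands:
            raise RuntimeError("destination has too few bands")

    for _ in range(tile_count):
        extra = 1 if needs_extra_band else 0
        try:
            warp(src_bands + extra)
        except RuntimeError:
            if needs_extra_band:
                raise  # already widened; nothing more to try
            needs_extra_band = True
            warp(src_bands + 1)
    return calls


healthy = count_warp_calls(tile_count=10, src_bands=3, required_bands=3)
affected = count_warp_calls(tile_count=10, src_bands=4, required_bands=5)
print(healthy, affected)  # prints "10 11"
```

The affected raster pays for one failed attempt on its first tile and nothing afterwards, so the overhead stays at a single call however many tiles follow.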

Validation

  • tests/test_geotiff2raquet.py::TestGeotiff2Raquet::test_4band_byte_alpha_mismatch — new regression test that builds a synthetic 4-band Byte GeoTIFF matching the customer shape (Photometric=Gray + 3x Undefined, noDataValue=255) and asserts the parquet output contains at least one data tile. Verified that this test fails on master without the fix (parquet has only the metadata row, len(table) == 1) and passes with the fix (len(table) == 19, full pyramid).
  • Synthetic GeoTIFF is built in a spawn'd subprocess so osgeo never loads in the test parent: pyarrow and osgeo collide on the 'file' filesystem registration when loaded in the same interpreter (apache/arrow#44696, "[Python][C++] ArrowKeyError: Attempted to register factory for scheme 'file' when using pip-installed GDAL").
  • Healthy 3-band RGB GeoTIFF: unchanged behaviour, single Warp call per tile (verified out-of-tree).
  • Full tests/test_geotiff2raquet.py: 17/17 pass with the fix.
  • Full suite (excluding test_earthengine.py and test_imageserver.py which need external creds): 46/47 pass. The one failure (test_cli.py::TestVersion::test_version) is pre-existing and reproduces on clean master — it's an environment quirk where importlib.metadata can't find the raquet-io distribution when running via PYTHONPATH without pip install.
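
The subprocess isolation used for the fixture can be illustrated roughly as follows. This sketch uses `subprocess.run` with a fresh interpreter as a stand-in; the actual test may use a different mechanism (for example multiprocessing with the spawn start method). `CHILD_SCRIPT` and `build_fixture_in_subprocess` are hypothetical names, and the child writes placeholder bytes instead of a real GeoTIFF so the example runs without GDAL.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Script executed in a fresh interpreter. In a real test this is where
# `from osgeo import gdal` and the GeoTIFF creation would live; here it
# writes placeholder bytes so the sketch runs without GDAL installed.
CHILD_SCRIPT = """
import sys
from pathlib import Path
Path(sys.argv[1]).write_bytes(b"placeholder for a synthetic GeoTIFF")
"""


def build_fixture_in_subprocess(path: Path) -> None:
    # check=True turns a failing child into CalledProcessError in the parent.
    subprocess.run([sys.executable, "-c", CHILD_SCRIPT, str(path)], check=True)


with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "synthetic.tif"
    build_fixture_in_subprocess(fixture)
    fixture_size = fixture.stat().st_size
```

Because the import happens only in the child, the parent interpreter is free to load pyarrow afterwards without the two libraries meeting in one process.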

Notes

  • The "broad RuntimeError catch" in warp_source_into_tile_with_alpha_fallback is intentional: any first-tile failure triggers a single retry with the wider destination, and if that also fails the second exception is propagated as-is. Worst case for an unrelated RuntimeError is one extra Warp call before the real error surfaces.
  • An alpha mask written into the extra band by Warp is silently discarded by read_raster_data_stats (it only iterates band_count bands). If a future user wants alpha-aware handling, that would be an opt-in extension on top of this patch — not affected by this change.
  • Separate concern spotted while validating: master's convert_to_raquet_files swallows worker-death and produces an empty-but-valid parquet (the symptom that the new regression test catches via len(table) > 1). Even with this PR's fix in place, the parent could benefit from propagating worker exceptions instead of producing zero-data outputs. Out of scope here, worth a follow-up.
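
The note above about the alpha band being discarded comes down to iterating only the first band_count bands. A minimal sketch of that idea, using plain lists in place of raster bands; `select_data_bands` is a hypothetical helper, not the signature of read_raster_data_stats.

```python
from typing import Optional, Sequence


def select_data_bands(
    bands: Sequence[Sequence[int]], band_count: Optional[int] = None
) -> list:
    """Return only the first band_count bands; None means keep all of them."""
    limit = len(bands) if band_count is None else band_count
    return [list(band) for band in bands[:limit]]


# Four source bands plus the alpha mask Warp appended as band 5.
tile_bands = [[10, 20], [11, 21], [12, 22], [13, 23], [255, 255]]
data_bands = select_data_bands(tile_bands, band_count=4)
```

With band_count set to the source's RasterCount, the appended mask never reaches the output, while the default of None keeps the pre-existing behaviour for callers that do not pass it.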

josemaria-vilaplana and others added 2 commits April 27, 2026 17:18
When the source raster has photometric metadata that doesn't add up cleanly
(e.g. SamplesPerPixel != color channels + ExtraSamples), GDAL Warp implicitly
adds an alpha mask band to the destination. create_tile_ds was allocating
exactly src.RasterCount bands, so Warp raised:

    RuntimeError: Destination dataset has N bands, but at least N+1 are needed

That worker crash also tripped a secondary AttributeError in create_metadata
when the parent process tried to dereference the missing stats result.

This patch:

- Adds extra_bands to create_tile_ds so callers can reserve room for the
  alpha mask GDAL Warp may add.
- Adds band_count to read_raster_data_stats so the alpha band is excluded
  from the parquet output (only the original data bands are persisted).
- Wraps the source-zoom Warp in warp_source_into_tile_with_alpha_fallback,
  which retries once with extra_bands=1 on RuntimeError. The result is a
  sticky needs_extra_band flag — set on first failure, propagated to every
  subsequent tile of the same raster — so the retry runs at most once per
  raster, never per-tile.
- Applies the same logic to the parallel _read_raster_worker.

Healthy rasters take the unchanged fast path. Affected rasters pay one extra
Warp call total, regardless of pyramid size.

Reproducer: any 4-band Byte GeoTIFF where Photometric is set to a single
color channel plus ExtraSamples set to Undefined for the remaining bands
(e.g. Gray + 3x Undefined), or any RGBA where Warp's alpha auto-detection
fires.

Tracking: sc-547263 (BBVA import support ticket).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Builds a synthetic 4-band Byte GeoTIFF with photometric / SamplesPerPixel
mismatch (1 Gray color channel + 3 Undefined ExtraSamples, noDataValue=255)
and asserts that convert_to_raquet_files writes at least one data tile to
the output parquet.

The synthetic GeoTIFF builder runs in a spawn'd subprocess so osgeo never
loads in the test parent process — pyarrow and osgeo collide on the 'file'
filesystem registration when loaded in the same interpreter
(apache/arrow#44696).

Without the create_tile_ds extra_bands fix, the worker process crashes with
`RuntimeError: Destination dataset has 4 bands, but at least 5 are needed`
and the parent silently emits a parquet with the metadata row but zero data
rows — assertion `len(table) > 1` catches that regression mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@josemaria-vilaplana (Author) commented

The CI failure here is in the Check formatting (ruff) step, not unit tests — the unit tests pass ✅. The 9 ruff errors are all in files this PR doesn't touch (raquet/cli.py and raquet/earthengine.py) and reproduce on clean origin/master without any of these changes applied. Pre-existing lint debt, not introduced by this PR.

Categories:

  • cli.py:1281: f-string with no placeholders (F541)
  • earthengine.py:15: unused from pathlib import Path (F401)
  • earthengine.py:255,301,303,369,748: forward references to "ee.Image" / "ee.batch.Task" where ee is imported lazily inside the helpers (F821)
  • earthengine.py:389,399: ee = _get_ee() assigned but unused (F841)

Happy to fix these in a separate cleanup PR if you'd like, or land this one with a CI override and clean up afterwards. Wanted to flag rather than mix the lint cleanup into a bug fix.
