ORC-2619: Fix estimateRgEndOffset slop calculation for incompressible data #2620

Open
thexiay wants to merge 1 commit into apache:main from thexiay:fix/estimate-rg-end-offset-slop

@thexiay thexiay commented May 7, 2026

What changes were proposed in this pull request?

Fix the estimateRgEndOffset slop calculation in RecordReaderUtils.java to account for the 2-byte RLEv2 DIRECT run header.

Problem

The old formula:

int stretchFactor = 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;

only considers the value payload (512 * 8 = 4096 bytes) but ignores the 2-byte RLE header. For bufferSize = 1024, this gives stretchFactor = 5, which is one block short when data is incompressible.
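One way to see the "one block short" case is to count how many fixed-size blocks a single run can touch. The sketch below is a simplified model, not the ORC reader: it assumes incompressible data is stored as full `bufferSize` blocks and that a run may begin at any byte offset inside its first block. The class and method names are hypothetical.

```java
public class SlopCoverageSketch {
    // Values taken from the PR description: 512 values of up to 8 bytes each,
    // preceded by a 2-byte RLEv2 DIRECT run header.
    static final int RUN_PAYLOAD_BYTES = 512 * 8;
    static final int RUN_HEADER_BYTES = 2;

    // Number of blocks touched by runBytes contiguous bytes that start at
    // startOffset (0-based) inside the first block.
    static int blocksTouched(int startOffset, int runBytes, int bufferSize) {
        int lastByte = startOffset + runBytes - 1;
        return lastByte / bufferSize + 1;
    }

    // Worst case over every possible starting offset within a block.
    static int worstCaseBlocks(int runBytes, int bufferSize) {
        int worst = 0;
        for (int start = 0; start < bufferSize; start++) {
            worst = Math.max(worst, blocksTouched(start, runBytes, bufferSize));
        }
        return worst;
    }

    public static void main(String[] args) {
        int bufferSize = 1024;
        // Payload only, as the old formula assumes.
        System.out.println("payload only:     "
            + worstCaseBlocks(RUN_PAYLOAD_BYTES, bufferSize));
        // Payload plus the 2-byte header, as a full DIRECT run requires.
        System.out.println("payload + header: "
            + worstCaseBlocks(RUN_PAYLOAD_BYTES + RUN_HEADER_BYTES, bufferSize));
    }
}
```

Under this model, with bufferSize = 1024 the payload alone touches at most 5 blocks, while the payload plus header can touch 6, which matches the gap described above.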

Fix

int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;

For bufferSize = 1024, this yields stretchFactor = 6, so the estimate covers enough compressed blocks to hold a full DIRECT run.
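The two formulas can be compared side by side for a few buffer sizes. This is a standalone sketch that copies only the arithmetic quoted in this description. The constant values 512 and 8 come from the text above, while the class name and `RLE_V2_DIRECT_HEADER_BYTES` are hypothetical names used for illustration.

```java
public class StretchFactorSketch {
    static final int MAX_VALUES_LENGTH = 512;        // values per run, per the description
    static final int MAX_BYTE_WIDTH = 8;             // bytes per value, per the description
    static final int RLE_V2_DIRECT_HEADER_BYTES = 2; // hypothetical name for the 2-byte header

    // Old formula: counts only the value payload.
    static int oldStretchFactor(int bufferSize) {
        return 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
    }

    // Proposed formula: payload plus the run header.
    static int newStretchFactor(int bufferSize) {
        int maxRleDirectRunSize =
            MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + RLE_V2_DIRECT_HEADER_BYTES;
        return 2 + (maxRleDirectRunSize - 1) / bufferSize;
    }

    public static void main(String[] args) {
        int[] bufferSizes = {1000, 1024, 4096, 262144};
        for (int bufferSize : bufferSizes) {
            System.out.println("bufferSize=" + bufferSize
                + " old=" + oldStretchFactor(bufferSize)
                + " new=" + newStretchFactor(bufferSize));
        }
    }
}
```

The results differ for buffer sizes such as 1024 and 4096, which divide the 4096-byte payload evenly. For 1000 or 262144 both formulas give the same value.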

How was this patch tested?

Added testTruncatedRleV2DirectRunAtEstimatedEndFails in TestInStream.java that:

  1. Creates a compressed stream with incompressible (random) data
  2. Truncates it at the old estimated end offset
  3. Verifies that reading a full RLE v2 DIRECT run fails with IllegalArgumentException: Buffer size too small

This proves the old slop estimation was insufficient.

Closes #2619

@thexiay thexiay force-pushed the fix/estimate-rg-end-offset-slop branch from a9ffd5d to c52545c Compare May 8, 2026 03:06
… data

The stretchFactor calculation in estimateRgEndOffset did not account for
the 2-byte RLEv2 DIRECT run header. This caused insufficient buffer
allocation when data is incompressible, leading to 'Buffer size too small'
errors.

Fix: Include RLE_V2_HEADER_SIZE in the worst-case payload calculation.
Add test demonstrating the issue with the old formula.
@thexiay thexiay force-pushed the fix/estimate-rg-end-offset-slop branch from c52545c to 076e787 Compare May 8, 2026 10:01

Development

Successfully merging this pull request may close these issues.

estimateRgEndOffset slop calculation is insufficient for incompressible data
