Surface pyarrow batch reading parameters and adjust defaults
Why these changes are being introduced:
While using this library to batch read transformed
records for indexing into OpenSearch, the application TIM threw
an out-of-memory error. It was observed that, as-is, batch
reading of records could hover around 1-2 GB of memory. This was
surprising, as we are careful to only yield records as they are
batch read from the parquet dataset.
It turns out that pyarrow's Dataset.to_batches() defaults to
optimistic readahead to improve IO throughput, at the cost of
memory consumption. Tuning these values down resulted in much
lower memory consumption that aligns with how our current TIMDEX
applications are resourced.
How this addresses that need:
By surfacing the Dataset.to_batches() arguments 'batch_readahead'
and 'fragment_readahead' to this library's read methods, and
setting conservative defaults, memory consumption is significantly
lower.
Side effects of this change:
* Per the defaults set, slower IO for dataset reads.
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-468