Why these changes are being introduced:
Bulk reading and writing from the TIMDEX dataset is a primary responsibility,
but occasional random access (e.g. locating a single record row) will also be
helpful, for example when inspecting the original source record for a
problematic record.
Each TIMDEX JSON record in OpenSearch will contain a "provenance" object that will
include things like run_date, run_id, and now run_record_offset. This offset
allows for quicker (time) and more efficient (data read) retrieval of a single
record given information in the TIMDEX provenance object.
How this addresses that need:
Parquet files have embedded metadata that describes what values can be found
in subsets of the file, but this is only helpful when the min/max values
in that metadata can tell query engines whether a desired record may be
present. Unfortunately, timdex_record_id values are a) not lexicographically
sortable (at least not easily), and b) not ordered during write.
By adding this offset, effectively an incrementing counter as records are
yielded for writing, we have a value that is pre-sorted and yields narrow
min/max ranges in the parquet file metadata. Query engines can utilize this to
dramatically improve random access reads. By including this offset integer
in the TIMDEX record "provenance" section we close the loop and provide
enough information in the OpenSearch record to efficiently retrieve it
from the parquet dataset.
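
The effect of a pre-sorted offset on min/max pruning can be illustrated with
a small pure-Python model. This is only a sketch under stated assumptions: the
RowGroup class, the helper functions, and the "source:NNNN" record ID format
are hypothetical stand-ins, not the dataset library's code or the parquet
format itself. It mimics how a query engine compares a filter value against
per-row-group min/max statistics.

```python
import random
from dataclasses import dataclass

OFFSET, RECORD_ID = 0, 1  # tuple positions used as "columns" in this model


@dataclass
class RowGroup:
    """Hypothetical stand-in for a parquet row group and its statistics."""

    rows: list  # list of (run_record_offset, timdex_record_id) tuples

    def stats(self, column):
        """Return the (min, max) pair for one column, like parquet metadata."""
        values = [row[column] for row in self.rows]
        return min(values), max(values)


def write_row_groups(record_ids, rows_per_group):
    """Assign an incrementing offset as records are yielded, then chunk them."""
    rows = list(enumerate(record_ids))
    return [
        RowGroup(rows[i : i + rows_per_group])
        for i in range(0, len(rows), rows_per_group)
    ]


def candidate_groups(row_groups, column, value):
    """Keep only row groups whose min/max range could contain the value."""
    candidates = []
    for group in row_groups:
        low, high = group.stats(column)
        if low <= value <= high:
            candidates.append(group)
    return candidates


# Record IDs arrive in no particular order (assumed ID format, for illustration).
random.seed(0)
record_ids = [f"source:{n:04d}" for n in range(1000)]
random.shuffle(record_ids)
row_groups = write_row_groups(record_ids, rows_per_group=100)

# Pick one record and look it up both ways.
target_offset, target_id = row_groups[4].rows[42]
by_offset = candidate_groups(row_groups, OFFSET, target_offset)
by_id = candidate_groups(row_groups, RECORD_ID, target_id)

print(f"row groups to scan by offset: {len(by_offset)} of {len(row_groups)}")
print(f"row groups to scan by record id: {len(by_id)} of {len(row_groups)}")
```

Because the offsets are written in increasing order, each row group covers a
distinct offset range and exactly one group can match. The unordered record
IDs produce wide, overlapping ranges, so the statistics rule out few groups,
if any.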
Side effects of this change:
* Dataset will now include a new column 'run_record_offset'
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-465