Rework dataset partitions to only year, month, day
Why these changes are being introduced:
* These changes simplify the partitioning schema for the TIMDEXDataset,
allowing the app to take advantage of PyArrow's memory-efficient
processes for reading and writing Parquet datasets. Furthermore, the
new partitioning schema will result in a more efficient, coherent
folder structure when writing datasets. For more details, see:
https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/4094296066/Engineering+Plan+Parquet+Datasets+for+TIMDEX+ETL#Rework-Dataset-Partitions-to-use-only-Year-%2F-Month-%2F-Day.
How this addresses that need:
* Update TIMDEX_DATASET_SCHEMA to include [year, month, day]
* Update DatasetRecord attrs to include [year, month, day] and
set [source, run_date, run_type, run_id, action] as primary columns
* Add post_init method to DatasetRecord to derive partition values
from 'run_date'
* Remove 'partition' values from DatasetRecord.to_dict
* Remove 'partition_values' mixin from TIMDEXDataset.write to reduce
complexity; the write method now uses the DatasetRecord partition
columns instead.
* Update unit tests to use new partitions and remove deprecated tests
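The derivation of partition values described above can be sketched roughly as follows. This is a minimal illustration, not the actual DatasetRecord: the class name DatasetRecordSketch, the string field types, the "YYYY-MM-DD" format for run_date, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DatasetRecordSketch:
    """Hypothetical stand-in for DatasetRecord (names and types assumed)."""

    # Primary columns named in the commit message.
    source: str
    run_date: str  # assumed to be formatted as "YYYY-MM-DD"
    run_type: str
    run_id: str
    action: str

    # Partition columns, derived after init rather than passed in.
    year: str = field(init=False)
    month: str = field(init=False)
    day: str = field(init=False)

    def __post_init__(self) -> None:
        # Parse run_date once and derive zero-padded partition values.
        parsed = datetime.strptime(self.run_date, "%Y-%m-%d")
        self.year = parsed.strftime("%Y")
        self.month = parsed.strftime("%m")
        self.day = parsed.strftime("%d")


record = DatasetRecordSketch(
    source="alma",
    run_date="2024-12-01",
    run_type="daily",
    run_id="run-abc",
    action="index",
)
print(record.year, record.month, record.day)
```

Deriving the values in one place means callers only supply run_date, so the partition columns cannot drift out of sync with it.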
Side effects of this change:
* The new partitioning schema introduces a 3-level folder structure
within TIMDEXDataset.location (i.e. the base path of the dataset)
for [year, month, day], where the leaf node will contain parquet files
for every source run.
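To illustrate the resulting layout, here is a small sketch of how a leaf directory could be computed from a run date. The helper partition_dir is hypothetical, and the hive-style "key=value" directory names are an assumption (this is the naming PyArrow produces with its hive partitioning flavor); the commit message itself does not state the exact directory names.

```python
from pathlib import PurePosixPath


def partition_dir(location: str, run_date: str) -> PurePosixPath:
    """Return the assumed leaf directory for a run date ("YYYY-MM-DD")."""
    year, month, day = run_date.split("-")
    # Three levels under the dataset base path: year, then month, then day.
    return PurePosixPath(location) / f"year={year}" / f"month={month}" / f"day={day}"


# Runs from different sources on the same date land in the same leaf
# directory, because source is no longer part of the partitioning.
alma_dir = partition_dir("dataset", "2024-12-01")
dspace_dir = partition_dir("dataset", "2024-12-01")
print(alma_dir)
```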
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432