Commit d30c9e2

Reorder run_id and action partitions
Why these changes are being introduced:

Discussions about retrieving subsets of records from the dataset suggested that the run_id partition should come before the action partition. With that ordering, a prefix of partition names and values can pinpoint a particular run when loading a dataset, even before the dataset is fully loaded. This ordering was originally proposed in the engineering plan for this library but was switched somewhere along the way, so this change moves back to the agreed-upon ordering.

How this addresses that need:
* Moves run_id before action in ordered partition columns

Side effects of this change:
* Omitting the action partition, and using everything up to and including the run_id partition, is sufficient for getting all records from a run.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-424
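The prefix approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `partition_prefix` is a hypothetical helper, and it assumes Hive-style `name=value` directory segments, as the path asserted in this commit's test suggests.

```python
# Partition order after this commit (mirrors TIMDEX_DATASET_PARTITION_COLUMNS).
PARTITION_COLUMNS = ["source", "run_date", "run_type", "run_id", "action"]


def partition_prefix(values: dict[str, str]) -> str:
    """Build a Hive-style path prefix from leading partition values.

    Hypothetical helper: it stops at the first partition column with no
    value, so trailing partitions (for example, action) may be omitted.
    """
    segments = []
    for column in PARTITION_COLUMNS:
        if column not in values:
            break
        segments.append(f"{column}={values[column]}")

    # Any value not consumed above is either unknown or skips a partition level.
    leftover = set(values) - set(PARTITION_COLUMNS[: len(segments)])
    if leftover:
        raise ValueError(f"values do not form a partition prefix: {sorted(leftover)}")
    return "/".join(segments)


run_prefix = partition_prefix(
    {
        "source": "alma",
        "run_date": "2024-12-01",
        "run_type": "daily",
        "run_id": "000-111-aaa-bbb",
    }
)
print(run_prefix)
# source=alma/run_date=2024-12-01/run_type=daily/run_id=000-111-aaa-bbb
```

With the previous ordering, the same run could not be addressed by a single prefix without also fixing an action value, because action sat between run_type and run_id.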
1 parent 7bec349 commit d30c9e2

3 files changed

Lines changed: 6 additions & 6 deletions

tests/test_dataset_write.py

Lines changed: 2 additions & 2 deletions
@@ -169,14 +169,14 @@ def test_dataset_write_schema_partitions_correctly_ordered(
             "source": "alma",
             "run_date": "2024-12-01",
             "run_type": "daily",
-            "action": "index",
             "run_id": "000-111-aaa-bbb",
+            "action": "index",
         },
     )
     file = written_files[0]
     assert (
         "/source=alma/run_date=2024-12-01/run_type=daily"
-        "/action=index/run_id=000-111-aaa-bbb" in file.path
+        "/run_id=000-111-aaa-bbb/action=index/" in file.path
     )
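The ordering check in this test can also be expressed by parsing the partition names out of a written file path. The following is a small sketch under stated assumptions: `partition_columns_in_path` is a hypothetical helper, and the `dataset/` root and `part-0.parquet` filename in the example path are invented for illustration only.

```python
def partition_columns_in_path(path: str) -> list[str]:
    """Return partition column names in the order they appear in a Hive-style path."""
    return [
        segment.split("=", 1)[0]
        for segment in path.split("/")
        if "=" in segment
    ]


# Hypothetical path; only the partition segments mirror the test in this commit.
example_path = (
    "dataset/source=alma/run_date=2024-12-01/run_type=daily"
    "/run_id=000-111-aaa-bbb/action=index/part-0.parquet"
)

columns = partition_columns_in_path(example_path)
print(columns)
# ['source', 'run_date', 'run_type', 'run_id', 'action']
```

Comparing the parsed list against the ordered partition columns checks the whole ordering at once, rather than a single substring of the path.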

timdex_dataset_api/dataset.py

Lines changed: 2 additions & 2 deletions
@@ -28,17 +28,17 @@
         pa.field("source", pa.string()),
         pa.field("run_date", pa.date32()),
         pa.field("run_type", pa.string()),
-        pa.field("action", pa.string()),
         pa.field("run_id", pa.string()),
+        pa.field("action", pa.string()),
     )
 )
 
 TIMDEX_DATASET_PARTITION_COLUMNS = [
     "source",
     "run_date",
     "run_type",
-    "action",
     "run_id",
+    "action",
 ]
 
 DEFAULT_BATCH_SIZE = 1_000

timdex_dataset_api/record.py

Lines changed: 2 additions & 2 deletions
@@ -24,8 +24,8 @@ class DatasetRecord:
     source: str | None = None
     run_date: str | datetime.datetime | None = None
     run_type: str | None = None
-    action: str | None = None
     run_id: str | None = None
+    action: str | None = None
 
     def to_dict(
         self,
@@ -46,7 +46,7 @@ def validate(self) -> None:
         # ensure all partition columns are set
         missing_partition_values = [
             field
-            for field in ["source", "run_date", "run_type", "action", "run_id"]
+            for field in ["source", "run_date", "run_type", "run_id", "action"]
             if getattr(self, field) is None
         ]
         if missing_partition_values:
