Commit 3daca7b

Rework dataset load method and apply filtering
Why these changes are needed:
* Updating the load method to support partition prefixes enables PyArrow to only load data from partitions matching the prefix and avoid scanning the entire TIMDEXDataset.dataset. The added filtering method will be used in future read operations to retrieve relevant rows from TIMDEXDataset.dataset.

How this addresses that need:
* Create TIMDEX_DATASET_FILTER_COLUMNS global variable
* Add '_get_filtered_dataset' method
* Add private helper method '_get_partition_prefixes'
* Add private helper method '_parse_date_filters'
* Update load method to support partition prefixes and filter method
* Add and update unit tests for load and filter methods

Side effects of this change:
* TIMDEXDataset.load no longer returns self and instead assigns pyarrow.dataset.Dataset to self.dataset.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-425
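The partition-prefix idea in the commit message can be illustrated with a small standalone sketch. It does not reproduce `_get_partition_prefixes` or call PyArrow. The hive-style `year=/month=/day=` layout, the function names `partition_prefix` and `select_files`, and the sample paths are all assumptions made for illustration; the real partition columns of the TIMDEX dataset are not shown in this commit excerpt.

```python
from __future__ import annotations


def partition_prefix(
    root: str,
    year: int | None = None,
    month: int | None = None,
    day: int | None = None,
) -> str:
    """Build the deepest directory prefix implied by the given date parts.

    Assumes a hypothetical layout: <root>/year=YYYY/month=MM/day=DD/...
    """
    parts = [root.rstrip("/")]
    for name, value, width in (("year", year, 4), ("month", month, 2), ("day", day, 2)):
        if value is None:
            break  # a deeper level cannot be pinned without its parent
        parts.append(f"{name}={value:0{width}d}")
    return "/".join(parts)


def select_files(paths: list[str], prefix: str) -> list[str]:
    """Keep only the files stored under the prefix directory."""
    prefix_dir = prefix.rstrip("/") + "/"
    return [path for path in paths if path.startswith(prefix_dir)]


# Hypothetical file listing for a partitioned dataset.
all_files = [
    "dataset/year=2024/month=12/day=01/a.parquet",
    "dataset/year=2024/month=11/day=30/b.parquet",
    "dataset/year=2023/month=12/day=01/c.parquet",
]
prefix = partition_prefix("dataset", year=2024, month=12)
matching = select_files(all_files, prefix)
```

Only the files under the computed prefix would need to be opened, which is the saving the commit message describes: the narrower the prefix, the fewer partitions are scanned before any row-level filtering runs.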
1 parent 6b93d88 commit 3daca7b

7 files changed

Lines changed: 617 additions & 190 deletions

File tree

Pipfile.lock

Lines changed: 57 additions & 66 deletions

tests/conftest.py

Lines changed: 17 additions & 5 deletions
@@ -23,22 +23,34 @@ def _test_env(monkeypatch):
 
 @pytest.fixture
 def local_dataset_location(tmp_path):
-    return str(tmp_path / "tests/fixtures/local_datasets/dataset")
+    return str(tmp_path / "local_dataset/")
 
 
 @pytest.fixture
 def local_dataset(local_dataset_location):
     timdex_dataset = TIMDEXDataset(local_dataset_location)
-    records = generate_sample_records_with_simulated_partitions(num_records=5_000)
-    timdex_dataset.write(records)
+    timdex_dataset.write(
+        generate_sample_records_with_simulated_partitions(num_records=5_000)
+    )
     timdex_dataset.load()
     return timdex_dataset
 
 
 @pytest.fixture
 def new_local_dataset(tmp_path) -> TIMDEXDataset:
-    location = str(tmp_path / "new_local_dataset")
-    return TIMDEXDataset(location=location)
+    return TIMDEXDataset(location=str(tmp_path / "new_local_dataset/"))
+
+
+@pytest.fixture
+def fixed_local_dataset(tmp_path) -> TIMDEXDataset:
+    """Local dataset with a fixed set of configurations.
+
+    This fixture is required to perform unit tests for TIMDEXDataset.filter
+    method.
+    """
+    timdex_dataset = TIMDEXDataset(str(tmp_path / "fixed_local_dataset/"))
+    timdex_dataset.write(generate_sample_records(num_records=5_000, run_id="abc123"))
+    return timdex_dataset
 
 
 @pytest.fixture
