Commit 3daca7b
Rework dataset load method and apply filtering
Why these changes are needed:
* Updating the load method to support partition prefixes enables
PyArrow to load data only from partitions matching the prefix,
avoiding a scan of the entire TIMDEXDataset.dataset. The added
filtering method will be used in future read operations to retrieve
relevant rows from TIMDEXDataset.dataset.
How this addresses that need:
* Create TIMDEX_DATASET_FILTER_COLUMNS global variable
* Add '_get_filtered_dataset' method
* Add private helper method '_get_partition_prefixes'
* Add private helper method '_parse_date_filters'
* Update load method to support partition prefixes and filter method
* Add and update unit tests for load and filter methods
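A minimal sketch of how date filters could be turned into a partition prefix, for illustration only. It is not the code from this commit: the function names, the hive-style `year=/month=/day=` layout, the `PARTITION_ORDER` constant, and the `YYYY-MM-DD` date format are all assumptions.

```python
from __future__ import annotations

from datetime import date, datetime

# Assumption: partitions are nested directories in this order (hive-style).
PARTITION_ORDER = ("year", "month", "day")


def parse_date_filters(run_date: str | date | None) -> dict[str, str]:
    """Hypothetical helper: split a run date into zero-padded year/month/day strings."""
    if run_date is None:
        return {}
    if isinstance(run_date, str):
        # Assumption: string dates arrive as YYYY-MM-DD.
        run_date = datetime.strptime(run_date, "%Y-%m-%d").date()
    return {
        "year": run_date.strftime("%Y"),
        "month": run_date.strftime("%m"),
        "day": run_date.strftime("%d"),
    }


def get_partition_prefix(root: str, **filters: str | None) -> str:
    """Hypothetical helper: build the longest leading partition path the filters cover.

    Stops at the first missing partition column, because a path prefix
    cannot skip a directory level.
    """
    parts = [root.rstrip("/")]
    for column in PARTITION_ORDER:
        value = filters.get(column)
        if value is None:
            break
        parts.append(f"{column}={value}")
    return "/".join(parts)
```

Under these assumptions, a prefix such as `get_partition_prefix(root, **parse_date_filters("2024-12-01"))` could be passed as the source path to `pyarrow.dataset.dataset(...)`, so that only files under that directory are discovered.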
Side effects of this change:
* TIMDEXDataset.load no longer returns self; it instead assigns a
pyarrow.dataset.Dataset to self.dataset.
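The effect on callers can be sketched with a stand-in class. `DatasetSketch` is hypothetical and uses a plain dict as a placeholder for the pyarrow.dataset.Dataset; it only illustrates the calling pattern, assuming earlier code relied on load returning the instance.

```python
class DatasetSketch:
    """Hypothetical stand-in for TIMDEXDataset, showing only the load contract."""

    def __init__(self, location: str) -> None:
        self.location = location
        self.dataset = None  # populated by load()

    def load(self) -> None:
        # Placeholder for the real pyarrow.dataset.Dataset object.
        self.dataset = {"source": self.location}
        # Intentionally no `return self`: chained calls on the result no longer work.


sketch = DatasetSketch("path/to/dataset")
result = sketch.load()  # result is None
loaded = sketch.dataset  # read the loaded dataset from the attribute instead
```

Code that chained on the return value of load would need to call load first and then read the dataset attribute.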
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-4251
parent 6b93d88, commit 3daca7b
7 files changed
Lines changed: 617 additions & 190 deletions
File tree
- tests
- timdex_dataset_api