Commit 1625332
Rework read methods to utilize metadata
Why these changes are being introduced:
This commit is a culmination of work to elevate metadata about
ETL records to the point it can be used to improve the speed
and efficiency of data queries.
While the signatures of the read methods remain mostly
the same, they now expose a 'where' clause that accepts raw SQL to
filter the results, allowing for more advanced querying beyond
the simple key/value DatasetFilters.
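To illustrate the kind of query a raw 'where' clause allows that
equality-only key/value filters cannot, here is a minimal sketch. It
is not the TDA implementation: it uses the standard library's sqlite3
as a stand-in for DuckDB so it runs without extra dependencies, and
the `records` table, the `run_date` and `timdex_record_id` columns,
and the `query_records` helper are all hypothetical names chosen for
the example.

```python
import sqlite3


def query_records(conn, where=None):
    """Return rows from a hypothetical `records` table, optionally
    filtered by a caller-supplied raw SQL condition."""
    sql = "SELECT timdex_record_id, source, run_type, run_date FROM records"
    if where:
        # Parenthesize the raw condition so any OR inside it keeps its
        # meaning if further conditions are ANDed on later.
        sql += f" WHERE ({where})"
    sql += " ORDER BY timdex_record_id"
    return conn.execute(sql).fetchall()


# sqlite3 stands in for DuckDB here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records "
    "(timdex_record_id TEXT, source TEXT, run_type TEXT, run_date TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        ("alma:1", "alma", "daily", "2025-01-05"),
        ("alma:2", "alma", "full", "2024-12-01"),
        ("libguides:1", "libguides", "daily", "2025-02-10"),
    ],
)

# A range comparison plus an IN list: not expressible as key=value pairs.
rows = query_records(
    conn,
    where="run_date >= '2025-01-01' AND source IN ('alma', 'libguides')",
)
```

Because the condition is spliced into the statement verbatim, a
parameter like this is only suitable for trusted callers.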
Additionally, and equally important, data retrieval now
comes directly from DuckDB instead of lower-level pyarrow
dataset reads. Overall complexity remains about the same, but
we have shifted focus to DuckDB table and view preparation
and SQL construction, which also pays dividends in other contexts.
It is anticipated this will set us up well for other data we may
add to the TIMDEX dataset, e.g. vector embeddings or fulltext,
which we may want to query and retrieve.
How this addresses that need:
As before, all read methods eventually call
TIMDEXDataset.read_batches_iter(), which now performs a two-part
process: first quickly querying metadata records, then using that
information to prune the heavier data retrieval.
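The two-part process described above can be sketched roughly as
follows. This is an illustration of the general pattern only, not the
code in this commit: sqlite3 stands in for DuckDB, an in-memory dict
stands in for the parquet files, and `read_batches_sketch`,
`DATA_FILES`, `files_opened`, and the `metadata` table layout are
hypothetical.

```python
import sqlite3
from collections import defaultdict

# Stand-in for heavy parquet files: filename -> full records (made-up data).
DATA_FILES = {
    "a.parquet": [
        {"id": "alma:1", "payload": "<large record 1>"},
        {"id": "alma:2", "payload": "<large record 2>"},
    ],
    "b.parquet": [
        {"id": "libguides:1", "payload": "<large record 3>"},
    ],
}

# Tracks which "files" the heavy read touched, to show the pruning.
files_opened = []


def read_batches_sketch(conn, source):
    """Yield one batch of full records per file that holds a match."""
    # Part 1: a cheap metadata query finds which files and record ids match.
    rows = conn.execute(
        "SELECT filename, timdex_record_id FROM metadata WHERE source = ?",
        (source,),
    ).fetchall()
    wanted = defaultdict(set)
    for filename, record_id in rows:
        wanted[filename].add(record_id)

    # Part 2: read only the files identified above, keeping matching rows.
    for filename in sorted(wanted):
        files_opened.append(filename)
        yield [r for r in DATA_FILES[filename] if r["id"] in wanted[filename]]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (timdex_record_id TEXT, source TEXT, filename TEXT)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("alma:1", "alma", "a.parquet"),
        ("alma:2", "alma", "a.parquet"),
        ("libguides:1", "libguides", "b.parquet"),
    ],
)

batches = list(read_batches_sketch(conn, "libguides"))
```

In this sketch only `b.parquet` is touched for the `libguides` query,
which is the saving the metadata pass is meant to provide.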
SQLAlchemy is used to model DuckDB tables and views such
that we can preserve the simpler key/value DatasetFilters, e.g.
source='libguides' or run_type='daily', which will likely
cover the majority of the public API needs, by converting those
key/value pairs into a SQL WHERE clause programmatically. This is
done without the need for complex string interpolation and
escaping.
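As a rough illustration of turning key/value filters into a WHERE
clause without interpolating values into the SQL string, consider the
sketch below. The commit itself uses SQLAlchemy for this; the example
swaps in plain bound parameters with the standard library's sqlite3
so that it is dependency-free, and `filters_to_where`,
`ALLOWED_COLUMNS`, and the `records` table are hypothetical.

```python
import sqlite3

# Only known column names may appear in the generated SQL (assumed set).
ALLOWED_COLUMNS = {"source", "run_type"}


def filters_to_where(**filters):
    """Convert key/value filters into (where_sql, params).

    Values are never placed in the SQL text; they are returned
    separately and bound by the database driver.
    """
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"Unsupported filter: {column}")
        if isinstance(value, (list, tuple)):
            # A list of values becomes an IN (...) condition.
            placeholders = ", ".join("?" for _ in value)
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(value)
        else:
            clauses.append(f"{column} = ?")
            params.append(value)
    # With no filters, fall back to a condition that matches every row.
    return " AND ".join(clauses) or "1 = 1", params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (source TEXT, run_type TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("libguides", "daily"), ("libguides", "full"), ("alma", "daily")],
)

where_sql, params = filters_to_where(source="libguides", run_type="daily")
matches = conn.execute(
    f"SELECT source, run_type FROM records WHERE {where_sql}", params
).fetchall()
```

A value containing a quote character, such as `source="o'brien"`,
passes through unchanged because it is bound rather than spliced in.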
The overall input and output signatures are largely the same, but
the underlying approach to querying the ETL parquet records now
utilizes DuckDB much more heavily, while also providing a SQL
'escape hatch' if the keyword filters don't suffice.
Side effects of this change:
* None! Transmog and TIM can call TDA in the same way as before.
The underlying approach is different, but the signatures are
mostly the same.
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-5291
parent 472726c commit 1625332
7 files changed
Lines changed: 379 additions & 139 deletions