Commit 1625332

Rework read methods to utilize metadata
Why these changes are being introduced:

This commit is the culmination of work to elevate metadata about ETL records to the point where it can be used to improve the speed and efficiency of data queries. While the signature of the read methods remains mostly the same, it exposes a 'where' clause that accepts raw SQL to filter the results, allowing for more advanced querying beyond the simple key/value DatasetFilters.

Additionally, and equally important, data retrieval now comes directly from DuckDB instead of lower-level pyarrow dataset reads. Overall complexity remains about the same, but the focus has shifted to DuckDB table and view preparation and SQL construction, which also pays dividends in other contexts. It is anticipated this will set us up well for other data we may add to the TIMDEX dataset, e.g. vector embeddings or fulltext, which we may want to query and retrieve.

How this addresses that need:

As before, all read methods eventually call TIMDEXDataset.read_batches_iter(), which now performs a two-part process: first quickly querying metadata records, then using that information to prune the heavier data retrieved.

SQLAlchemy is used to model DuckDB tables and views so that we can preserve the simpler key/value DatasetFilters, e.g. source='libguides' or run_type='daily', which will likely cover the majority of public API needs. Those key/value pairs are converted into a SQL WHERE clause programmatically, without the need for complex string interpolation and escaping.

The overall input and output signatures are largely the same, but the underlying approach to querying the ETL parquet records now uses DuckDB much more heavily, while also providing a SQL 'escape hatch' if the keyword filters don't suffice.

Side effects of this change:
* None! Transmog and TIM can call TDA in the same way as before. The underlying approach is different, but the signatures are mostly the same.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
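A minimal sketch of the filter-to-WHERE idea described in the commit message, under stated assumptions: the commit itself uses SQLAlchemy against DuckDB, whereas this sketch swaps in the standard-library `sqlite3` module and plain bound parameters so that it runs without extra dependencies. The `metadata` table, its columns, and the `build_where` and `query_metadata` helpers are hypothetical names for illustration, not part of TDA.

```python
import sqlite3
from typing import Optional

# Hypothetical set of filterable metadata columns (an assumption for this sketch).
ALLOWED_COLUMNS = {"source", "run_type"}


def build_where(filters: dict, where: Optional[str] = None) -> tuple:
    """Turn key/value filters plus an optional raw SQL clause into (sql, params)."""
    clauses, params = [], []
    for column, value in filters.items():
        # Column names cannot be bound as parameters, so check them against an allowlist.
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown filter column: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    if where:
        # Raw SQL "escape hatch": passed through as written, so only for trusted callers.
        clauses.append(f"({where})")
    return (" AND ".join(clauses) or "1 = 1"), params


def query_metadata(conn, where=None, **filters):
    """Return matching record ids from the (hypothetical) metadata table."""
    clause, params = build_where(filters, where)
    sql = f"SELECT record_id FROM metadata WHERE {clause} ORDER BY record_id"
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (record_id TEXT, source TEXT, run_type TEXT, run_date TEXT)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        ("a1", "libguides", "daily", "2025-01-01"),
        ("a2", "libguides", "full", "2025-01-02"),
        ("b1", "alma", "daily", "2025-01-03"),
    ],
)

# Simple key/value filters, then the raw SQL escape hatch.
daily_libguides = query_metadata(conn, source="libguides", run_type="daily")
recent = query_metadata(conn, where="run_date >= '2025-01-02'")
```

In a two-part read like the one described above, the ids returned by the lightweight metadata query would then be used to limit which heavier record data is fetched.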
1 parent 472726c commit 1625332

7 files changed

Lines changed: 379 additions & 139 deletions

Pipfile

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@ boto3 = "*"
 duckdb = "*"
 pandas = "*"
 pyarrow = "*"
+sqlalchemy = "*"
+duckdb-engine = "*"

 [dev-packages]
 black = "*"

Pipfile.lock

Lines changed: 91 additions & 1 deletion

README.md

Lines changed: 3 additions & 0 deletions
@@ -59,6 +59,7 @@ TDA_BATCH_READ_AHEAD=# Number of batches to optimistically read ahead when batch
 TDA_FRAGMENT_READ_AHEAD=# Number of fragments to optimistically read ahead when batch reaching from a dataset; pyarrow default is 4
 TDA_DUCKDB_MEMORY_LIMIT=# Memory limit for DuckDB connection
 TDA_DUCKDB_THREADS=# Thread limit for DuckDB connection
+TDA_DUCKDB_JOIN_BATCH_SIZE=# Batch size for metadata + data joins, 100k default and recommended
 ```

 ## Local S3 via MinIO
@@ -98,6 +99,8 @@ WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3

 ### Reading Data

+See [docs/reading.md](docs/reading.md) for an in-depth guide and Mermaid diagram.
+
 First, import the library:
 ```python
 from timdex_dataset_api import TIMDEXDataset
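
The new `TDA_DUCKDB_JOIN_BATCH_SIZE` setting documented in the README hunk above could be consumed roughly as follows. This is a sketch, not the library's implementation: `get_join_batch_size` and `iter_batches` are hypothetical helpers, and only the environment variable name and the 100k default come from the README text.

```python
import os
from itertools import islice
from typing import Iterable, Iterator, List

# Default taken from the README text ("100k default and recommended").
DEFAULT_JOIN_BATCH_SIZE = 100_000


def get_join_batch_size() -> int:
    """Read the batch size from the environment, falling back to the default."""
    raw = os.getenv("TDA_DUCKDB_JOIN_BATCH_SIZE")
    # An unset or empty variable falls back to the default.
    return int(raw) if raw else DEFAULT_JOIN_BATCH_SIZE


def iter_batches(items: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch
```

Chunking the metadata results this way would keep each metadata + data join bounded in size, which is presumably why the setting exists.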

timdex_dataset_api/config.py

Lines changed: 10 additions & 0 deletions
@@ -1,5 +1,8 @@
 import logging
 import os
+import warnings
+
+from duckdb_engine import DuckDBEngineWarning


 def configure_logger(
@@ -28,6 +31,13 @@ def configure_logger(
     for warning_logger_name in warning_only_loggers.split(","):
         logging.getLogger(warning_logger_name).setLevel(logging.WARNING)

+    # suppress a SQLAlchemy duckdb_engine warning
+    warnings.filterwarnings(
+        "ignore",
+        category=DuckDBEngineWarning,
+        message=r".*doesn't yet support reflection on indices.*",
+    )
+
     return logger

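
The warning filter added in `config.py` can be illustrated in isolation. The sketch below assumes a stand-in `EngineWarning` class in place of `duckdb_engine.DuckDBEngineWarning` so that it runs without that package, and the warning texts are illustrative. It shows that only warnings whose category and message both match the filter are suppressed; because `warnings.filterwarnings` matches the `message` regex against the start of the warning text, the leading `.*` in the pattern is what allows a match anywhere in the message.

```python
import warnings


class EngineWarning(Warning):
    """Hypothetical stand-in for duckdb_engine.DuckDBEngineWarning."""


def emit_and_capture() -> list:
    """Emit two warnings and return the messages that were not suppressed."""
    with warnings.catch_warnings(record=True) as caught:
        # Record everything by default...
        warnings.simplefilter("always")
        # ...except the one specific, known-noisy message (checked first,
        # because each new filter is inserted at the front of the list).
        warnings.filterwarnings(
            "ignore",
            category=EngineWarning,
            message=r".*doesn't yet support reflection on indices.*",
        )
        warnings.warn(
            "duckdb-engine doesn't yet support reflection on indices",
            EngineWarning,
        )
        warnings.warn("some other engine warning", EngineWarning)
    return [str(w.message) for w in caught]


remaining = emit_and_capture()
```

Scoping the filter to one category and one message pattern keeps any other warnings from the same package visible.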
