
Commit 3448d12

Read methods documentation
1 parent 7ac193f commit 3448d12

2 files changed: 182 additions & 2 deletions


README.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -99,8 +99,6 @@ WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3
 
 ### Reading Data
 
-See [docs/reading.md](docs/reading.md) for an in-depth guide and Mermaid diagram.
-
 First, import the library:
 ```python
 from timdex_dataset_api import TIMDEXDataset
@@ -150,6 +148,8 @@ run_df = timdex_dataset.read_dataframe(
 )
 ```
 
+See [docs/reading.md](docs/reading.md) for more information.
+
 ### Writing Data
 
 At this time, the only application that writes to the ETL parquet dataset is Transmogrifier.
````

docs/reading.md

Lines changed: 180 additions & 0 deletions (new file)
# Reading data from TIMDEXDataset

This guide explains how `TIMDEXDataset` read methods work and how to use them effectively.

- `TIMDEXDataset` and `TIMDEXDatasetMetadata` both maintain an in-memory DuckDB context. You can issue DuckDB SQL against the views/tables they create.
- Read methods use a two-step query flow for performance:
  1) a metadata query determines which Parquet files and row offsets are relevant
  2) a data query reads just those rows and returns the requested columns
- Prefer simple key/value `DatasetFilters` for most use cases; add a `where=` SQL predicate when you need more advanced logic (e.g., ranges, `BETWEEN`, `>`, `<`, `IN`).

## Available read methods

- `read_batches_iter(...)`: yields `pyarrow.RecordBatch`
- `read_dicts_iter(...)`: yields a Python `dict` per row
- `read_dataframe(...)`: returns a pandas `DataFrame`
- `read_dataframes_iter(...)`: yields pandas `DataFrame` batches
- `read_transformed_records_iter(...)`: yields `transformed_record` dictionaries only

All accept the same `DatasetFilters` and the optional `where=` SQL predicate.
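
For example, the same filter keyword arguments can be passed to any of them. A minimal sketch, assuming a `td = TIMDEXDataset(...)` instance like the one in the quick start below:

```python
# hypothetical filter values for illustration
filters = {"source": "alma", "run_date": "2025-06-01"}

df = td.read_dataframe(**filters)          # pandas DataFrame
rows = td.read_dicts_iter(**filters)       # iterator of dicts
batches = td.read_batches_iter(**filters)  # iterator of pyarrow.RecordBatch
```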

## Filters vs. `where=`

- `DatasetFilters` are key/value arguments on read methods. They are validated, translated into SQL, and cover most queries.
  - Examples: `source="alma"`, `run_date="2024-12-01"`, `run_type="daily"`, `action="index"`
- `where=` is an optional raw SQL WHERE predicate string, combined with `DatasetFilters` using `AND`. Use it for:
  - date/time ranges (`BETWEEN`, `>`, `<`)
  - set membership (`IN (...)`)
  - complex boolean logic (`AND`/`OR` grouping)

Important: `where=` must be only a WHERE predicate (no `SELECT`/`FROM`/`;`). The library splices it into the generated SQL.
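
A quick sketch of what is and is not accepted, assuming a `td` instance as in the quick start below (the exact failure mode for an invalid predicate is up to the library):

```python
# OK: a bare predicate, optionally combined with DatasetFilters
df = td.read_dataframe(source="alma", where="run_date >= '2024-12-01'")

# NOT OK: a full statement or a trailing semicolon
# df = td.read_dataframe(source="alma", where="SELECT * FROM records;")
```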

## How reading works (two-step process)

1) Metadata query
   - Runs against `TIMDEXDatasetMetadata` views (e.g., `metadata.records`, `metadata.current_records`)
   - Produces a small result set of identifiers: `filename`, row group/offsets, and primary keys
   - Greatly reduces how much data must be scanned

2) Data query
   - Uses DuckDB to read only the relevant Parquet fragments identified by the metadata results
   - Joins on the metadata identifiers to return exactly the rows requested
   - Returns batches, dicts, or a `DataFrame` depending on the method

This pattern keeps reads fast and memory-efficient even for large datasets.

The following diagram shows the flow for a query like:

```python
for record_dict in td.read_dicts_iter(
    table="records",
    source="dspace",
    run_date="2025-09-01",
    run_id="abc123",
):
    ...  # process record
```

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant TD as TIMDEXDataset
    participant TDM as TIMDEXDatasetMetadata
    participant D as DuckDB Context
    participant P as Parquet files

    U->>TD: Perform query
    Note left of TD: read_dicts_iter(<br>table="records",<br>source="dspace",<br>run_date="2025-09-01",<br>run_id="abc123")
    TD->>TDM: build_meta_query(table, filters, where=None)
    Note right of TDM: (Metadata Query)<br><br>SELECT r.timdex_record_id, r.run_id, r.filename, r.run_record_offset<br>FROM metadata.records r<br>WHERE r.source = 'dspace'<br>AND r.run_date = '2025-09-01'<br>AND r.run_id = 'abc123'<br>ORDER BY r.filename, r.run_record_offset

    TDM->>D: Execute metadata query
    D-->>TD: lightweight result set (file + offsets)

    TD->>D: Build and run data query using metadata
    Note right of D: (Data query)<br><br>SELECT <COLUMNS><br>FROM read_parquet(P.files) d<br>JOIN meta m<br>USING (timdex_record_id, run_id, run_record_offset)<br>WHERE d.source = 'dspace' AND d.run_id = 'abc123'

    D-->>TD: batches of rows
    TD-->>U: iterator of dicts (one dict per row)
```
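
Spelled out from the diagram notes, the two generated queries look roughly like this (illustrative SQL; `<COLUMNS>` and `<files>` are placeholders, and the exact statements the library emits may differ):

```sql
-- 1) metadata query: find the relevant Parquet files and row offsets
SELECT r.timdex_record_id, r.run_id, r.filename, r.run_record_offset
FROM metadata.records r
WHERE r.source = 'dspace'
  AND r.run_date = '2025-09-01'
  AND r.run_id = 'abc123'
ORDER BY r.filename, r.run_record_offset;

-- 2) data query: read only those fragments and join back the identifiers
SELECT <COLUMNS>
FROM read_parquet(<files>) d
JOIN meta m USING (timdex_record_id, run_id, run_record_offset)
WHERE d.source = 'dspace' AND d.run_id = 'abc123';
```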

## Quick start examples

```python
from timdex_dataset_api import TIMDEXDataset

td = TIMDEXDataset("s3://my-bucket/timdex-dataset")  # example instance

# 1) Get a single record as a dict
first = next(td.read_dicts_iter())

# 2) Read batches with simple filters
for batch in td.read_batches_iter(source="alma", run_date="2025-06-01", run_id="abc123"):
    ...  # process each pyarrow.RecordBatch

# 3) DataFrame of one run
df = td.read_dataframe(source="dspace", run_date="2025-06-01", run_id="def456")

# 4) Only transformed records (used by the indexer)
for rec in td.read_transformed_records_iter(source="aspace", run_type="daily"):
    ...  # rec is a dict of the transformed_record
```

## `where=` examples

These examples show advanced filtering that complements `DatasetFilters`.

```python
# date range with BETWEEN
where = "run_date BETWEEN '2024-12-01' AND '2024-12-31'"
df = td.read_dataframe(source="alma", where=where)

# greater-than on a timestamp (if present in the columns)
where = "run_timestamp > '2024-12-01T10:00:00Z'"
df = td.read_dataframe(source="aspace", run_type="daily", where=where)

# combine set membership and action
where = "run_id IN ('run-1', 'run-3', 'run-5') AND action = 'index'"
df = td.read_dataframe(source="alma", where=where)

# combine filters (ANDed) with where=
where = "run_type = 'daily' AND action = 'index'"
df = td.read_dataframe(source="libguides", where=where)
```

Validation tips:

- Use only a predicate (no `SELECT`/`FROM`, no trailing semicolon).
- Column names must exist in the target table/view (e.g., `records` or `current_records`).
- `DatasetFilters` and `where=` are ANDed together; if the combination yields zero rows, you'll get an empty result (see the sketch below).
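
Because filters and `where=` compose with `AND`, the following two calls should be equivalent (a sketch, reusing the `td` instance from the quick start):

```python
# simple key/value filters...
df_a = td.read_dataframe(source="alma", run_type="daily")

# ...behave like the equivalent where= predicate
df_b = td.read_dataframe(source="alma", where="run_type = 'daily'")
```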

## Choosing a table

By default, read methods query the `records` view (all versions). To get only the latest version per `timdex_record_id`, target the `current_records` view:

```python
# ALL records in the 'libguides' source
all_libguides_df = td.read_dataframe(table="records", source="libguides")

# latest unique records across the dataset
current_df = td.read_dataframe(table="current_records")

# current records for a source and specific run
current_df = td.read_dataframe(table="current_records", source="alma", run_id="run-5")
```

## DuckDB context

- `TIMDEXDataset` exposes a DuckDB connection used for data queries against Parquet.
- `TIMDEXDatasetMetadata` exposes a DuckDB connection used for metadata queries and provides views:
  - `metadata.records`: all record versions with run metadata
  - `metadata.current_records`: latest record per `timdex_record_id`
  - `metadata.append_deltas`: incremental write tracking

You can execute raw DuckDB SQL for inspection and debugging:

```python
# access metadata connection
conn = td.metadata.conn  # DuckDB connection

# peek at view schemas
print(conn.sql("DESCRIBE metadata.records").to_df())
print(conn.sql("DESCRIBE metadata.current_records").to_df())

# ad-hoc query (read-only)
debug_df = conn.sql("""
    SELECT source, action, COUNT(*) AS n
    FROM metadata.records
    WHERE run_date = '2024-12-01'
    GROUP BY 1, 2
    ORDER BY n DESC
""").to_df()
```

## Performance notes

- Batch iterators (`read_batches_iter()` / `read_dataframes_iter()`) stream results to keep memory use bounded.
- `read_dataframe()` loads ALL matching rows into memory; this is fine for small or well-filtered sets, but it can easily overwhelm memory for large result sets.
- Tuning via env vars (advanced): `TDA_READ_BATCH_SIZE`, `TDA_DUCKDB_THREADS`, `TDA_DUCKDB_MEMORY_LIMIT` (see the sketch below).
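
A minimal sketch of setting these variables, assuming they are read when the dataset and its connections are created; the values shown are purely illustrative:

```python
import os

# illustrative values; tune for your workload
os.environ["TDA_READ_BATCH_SIZE"] = "1000"
os.environ["TDA_DUCKDB_THREADS"] = "4"
os.environ["TDA_DUCKDB_MEMORY_LIMIT"] = "4GB"

# set env vars before instantiating the dataset
from timdex_dataset_api import TIMDEXDataset

td = TIMDEXDataset("s3://my-bucket/timdex-dataset")
```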

## Troubleshooting

- Empty results? Check that filters and `where=` don't over-constrain your query.
- Syntax errors? Ensure `where=` is a valid predicate and references existing columns.
- Large scans? Use the `_iter()` read methods so results stream in batches (see the sketch below).
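
For example, a large aggregation can stream through `read_dataframes_iter()` rather than materializing one big `DataFrame`. A sketch, reusing the `td` instance from the quick start:

```python
# accumulate per-source counts batch by batch, keeping memory bounded
counts: dict[str, int] = {}
for df in td.read_dataframes_iter(run_type="daily"):
    for source, n in df["source"].value_counts().items():
        counts[source] = counts.get(source, 0) + int(n)
```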
