
Commit 4429a08 (1 parent: 7238c73)

Update agents file after working through the first tpc-h query using only the text description

1 file changed: python/datafusion/AGENTS.md (51 additions, 1 deletion)
@@ -218,6 +218,7 @@ df.distinct_on( # keep first row per group (like DISTINCT ON in Postgres
 DataFrames are lazy until you collect.
 
 ```python
+df.show()                    # print formatted table to stdout
 batches = df.collect()       # list[pa.RecordBatch]
 table = df.to_arrow_table()  # pa.Table
 pandas_df = df.to_pandas()   # pd.DataFrame
@@ -249,8 +250,12 @@ col("column_name") # reference a column
 lit(42)                # integer literal
 lit("hello")           # string literal
 lit(3.14)              # float literal
+lit(pa.scalar(value))  # PyArrow scalar (preserves Arrow type)
 ```
 
+`lit()` accepts PyArrow scalars directly -- prefer this over converting Arrow
+data to Python and back when working with values extracted from query results.
+
 ### Arithmetic
 
 ```python
@@ -261,6 +266,26 @@ col("a") / lit(2) # division
 col("a") % lit(3)  # modulo
 ```
 
+### Date Arithmetic
+
+`Date32` columns require `Interval` types for arithmetic, not `Duration`. Use
+PyArrow's `month_day_nano_interval` type, which takes a `(months, days, nanos)`
+tuple:
+
+```python
+import pyarrow as pa
+
+# Subtract 90 days from a date column
+col("ship_date") - lit(pa.scalar((0, 90, 0), type=pa.month_day_nano_interval()))
+
+# Subtract 3 months
+col("ship_date") - lit(pa.scalar((3, 0, 0), type=pa.month_day_nano_interval()))
+```
+
+**Important**: `lit(datetime.timedelta(days=90))` creates a `Duration(µs)`
+literal, which is **not** compatible with `Date32` arithmetic. Always use
+`pa.month_day_nano_interval()` for date operations.
+
 ### Comparisons
 
 ```python
@@ -414,8 +439,8 @@ result = (
     .aggregate([col("region")], [F.sum(col("sales")).alias("total")])
     .sort(col("total").sort(ascending=False))
     .limit(10)
-    .collect()
 )
+result.show()
 ```
 
 ### Using Variables as CTEs
@@ -429,6 +454,31 @@ by_region = base.aggregate([col("region")], [F.sum(col("amount")).alias("total")
 top_regions = by_region.filter(col("total") > lit(10000))
 ```
 
+### Reusing Expressions as Variables
+
+Just like DataFrames, expressions (`Expr`) can be stored in variables and used
+anywhere an `Expr` is expected. This is useful for building up complex
+expressions or reusing a computed value across multiple operations:
+
+```python
+# Build an expression and reuse it
+disc_price = col("price") * (lit(1) - col("discount"))
+df = df.select(
+    col("id"),
+    disc_price.alias("disc_price"),
+    (disc_price * (lit(1) + col("tax"))).alias("total"),
+)
+
+# Use a collected scalar as an expression
+max_val = result_batch[0].column("max_price")[0]  # PyArrow scalar
+cutoff = lit(max_val) - lit(pa.scalar((0, 90, 0), type=pa.month_day_nano_interval()))
+df = df.filter(col("ship_date") <= cutoff)  # cutoff is already an Expr
+```
+
+**Important**: Do not wrap an `Expr` in `lit()`. `lit()` is for converting
+Python/PyArrow values into expressions. If a value is already an `Expr`, use it
+directly.
+
 ### Window Functions for Scalar Subqueries
 
 Where SQL uses a correlated scalar subquery, the idiomatic DataFrame approach