Commit c620a80 (parent 4429a08)

Add feedback from working through each of the TPC-H queries

1 file changed: python/datafusion/AGENTS.md (132 additions, 36 deletions)
@@ -83,9 +83,8 @@ Every method returns a **new** DataFrame (immutable/lazy). Chain them fluently.
 ### Projection
 
 ```python
-df.select("a", "b")                                    # by column name
-df.select(col("a"), col("b"))                          # by Expr
-df.select(col("a"), (col("b") + 1).alias("b_plus_1"))  # with expression
+df.select("a", "b")                                    # preferred: plain names as strings
+df.select(col("a"), (col("b") + 1).alias("b_plus_1"))  # use col()/Expr only when you need an expression
 
 df.with_column("new_col", col("a") + lit(10))   # add one column
 df.with_columns(
@@ -97,38 +96,61 @@ df.drop("unwanted_col")
 df.with_column_renamed("old_name", "new_name")
 ```
 
+When a column is referenced by name alone, pass the name as a string rather
+than wrapping it in `col()`. Reach for `col()` only when the projection needs
+arithmetic, aliasing, casting, or another expression operation.
+
+**Case sensitivity**: both `select("Name")` and `col("Name")` lowercase the
+identifier. For a column whose real name has uppercase letters, embed double
+quotes inside the string: `select('"MyCol"')` or `col('"MyCol"')`. Without the
+inner quotes the lookup will fail with `No field named mycol`.
+
 ### Filtering
 
 ```python
-df.filter(col("a") > lit(10))
-df.filter(col("a") > lit(10), col("b") == lit("x"))  # multiple = AND
-df.filter("a > 10")                                  # SQL expression string
+df.filter(col("a") > 10)
+df.filter(col("a") > 10, col("b") == "x")  # multiple = AND
+df.filter("a > 10")                        # SQL expression string
 ```
 
+Raw Python values on the right-hand side of a comparison are auto-wrapped
+into literals by the `Expr` operators, so prefer `col("a") > 10` over
+`col("a") > lit(10)`. See the Comparisons section and pitfall #2 for the
+full rule.
+
 ### Aggregation
 
 ```python
 # GROUP BY a, compute sum(b) and count(*)
-df.aggregate([col("a")], [F.sum(col("b")), F.count(col("a"))])
+df.aggregate(["a"], [F.sum(col("b")), F.count(col("a"))])
 
 # HAVING equivalent: use the filter keyword on the aggregate function
 df.aggregate(
-    [col("region")],
+    ["region"],
     [F.sum(col("sales"), filter=col("sales") > lit(1000)).alias("large_sales")],
 )
 ```
 
+As with `select()`, group keys can be passed as plain name strings. Reach for
+`col(...)` only when the grouping expression needs arithmetic, aliasing,
+casting, or another expression operation.
+
 Most aggregate functions accept an optional `filter` keyword argument. When
 provided, only rows where the filter expression is true contribute to the
 aggregate.
 
 ### Sorting
 
 ```python
-df.sort(col("a").sort(ascending=True, nulls_first=False))
+df.sort(col("a"))                          # ascending (default)
 df.sort(col("a").sort(ascending=False))    # descending
+df.sort(col("a").sort(nulls_first=False))  # override null placement
 ```
 
+A plain expression passed to `sort()` is already treated as ascending. Only
+reach for `col(...).sort(...)` when you need to override a default (descending
+order or null placement). Writing `col("a").sort(ascending=True)` is redundant.
+
 ### Joining
 
 ```python
@@ -151,6 +173,14 @@ df1.join(df2, on="key", how="anti")
 
 Join types: `"inner"`, `"left"`, `"right"`, `"full"`, `"semi"`, `"anti"`.
 
+Inner is the default `how`. Prefer `df1.join(df2, on="key")` over
+`df1.join(df2, on="key", how="inner")` — drop `how=` unless you need a
+non-inner join type.
+
+When the two sides' join columns have different native names, use
+`left_on=`/`right_on=` with the original names rather than aliasing one side
+to match the other — see pitfall #7.
+
 ### Window Functions
 
 ```python
@@ -220,6 +250,7 @@ DataFrames are lazy until you collect.
 ```python
 df.show()                     # print formatted table to stdout
 batches = df.collect()        # list[pa.RecordBatch]
+arr = df.collect_column("col_name")  # pa.Array | pa.ChunkedArray (single column)
 table = df.to_arrow_table()   # pa.Table
 pandas_df = df.to_pandas()    # pd.DataFrame
 polars_df = df.to_polars()    # pl.DataFrame
@@ -289,25 +320,30 @@ literal, which is **not** compatible with `Date32` arithmetic. Always use
 ### Comparisons
 
 ```python
-col("a") > lit(10)
-col("a") >= lit(10)
-col("a") < lit(10)
-col("a") <= lit(10)
-col("a") == lit("x")
-col("a") != lit("x")
+col("a") > 10
+col("a") >= 10
+col("a") < 10
+col("a") <= 10
+col("a") == "x"
+col("a") != "x"
 col("a") == None   # same as col("a").is_null()
 col("a") != None   # same as col("a").is_not_null()
 ```
 
+Comparison operators auto-wrap the right-hand Python value into a literal,
+so writing `col("a") > lit(10)` is redundant. Drop the `lit()` in
+comparisons. Reach for `lit()` only when auto-wrapping does not apply — see
+pitfall #2.
+
 ### Boolean Logic
 
 **Important**: Python's `and`, `or`, `not` keywords do NOT work with Expr
 objects. You must use the bitwise operators:
 
 ```python
-(col("a") > lit(1)) & (col("b") < lit(10))   # AND
-(col("a") > lit(1)) | (col("b") < lit(10))   # OR
-~(col("a") > lit(1))                         # NOT
+(col("a") > 1) & (col("b") < 10)   # AND
+(col("a") > 1) | (col("b") < 10)   # OR
+~(col("a") > 1)                    # NOT
 ```
 
 Always wrap each comparison in parentheses when combining with `&`, `|`, `~`
@@ -379,7 +415,7 @@ col("array_col")[1:3]   # array slice (0-indexed)
 | `SELECT a, b + 1 AS c` | `df.select(col("a"), (col("b") + lit(1)).alias("c"))` |
 | `SELECT *, a + 1 AS c` | `df.with_column("c", col("a") + lit(1))` |
 | `WHERE a > 10` | `df.filter(col("a") > lit(10))` |
-| `GROUP BY a` with `SUM(b)` | `df.aggregate([col("a")], [F.sum(col("b"))])` |
+| `GROUP BY a` with `SUM(b)` | `df.aggregate(["a"], [F.sum(col("b"))])` |
 | `SUM(b) FILTER (WHERE b > 100)` | `F.sum(col("b"), filter=col("b") > lit(100))` |
 | `ORDER BY a DESC` | `df.sort(col("a").sort(ascending=False))` |
 | `LIMIT 10 OFFSET 5` | `df.limit(10, offset=5)` |
@@ -407,26 +443,70 @@ col("array_col")[1:3]   # array slice (0-indexed)
 1. **Boolean operators**: Use `&`, `|`, `~` -- not Python's `and`, `or`, `not`.
    Always parenthesize: `(col("a") > lit(1)) & (col("b") < lit(2))`.
 
-2. **Wrapping scalars with `lit()`**: Comparisons with raw Python values work
-   (e.g., `col("a") > 10`) because the Expr operators auto-wrap them, but
-   standalone scalars in function calls need `lit()`:
-   `F.coalesce(col("a"), lit(0))`, not `F.coalesce(col("a"), 0)`.
-
-3. **Column name quoting**: Column names are normalized to lowercase by default.
-   To reference a column with uppercase letters, use double quotes inside the
-   string: `col('"MyColumn"')`.
+2. **Wrapping scalars with `lit()`**: Prefer raw Python values on the
+   right-hand side of comparisons — `col("a") > 10`, `col("name") == "Alice"`
+   — because the Expr comparison operators auto-wrap them. Writing
+   `col("a") > lit(10)` is redundant. Reserve `lit()` for places where
+   auto-wrapping does *not* apply:
+   - standalone scalars passed into function calls:
+     `F.coalesce(col("a"), lit(0))`, not `F.coalesce(col("a"), 0)`
+   - arithmetic between two literals with no column involved:
+     `lit(1) - col("discount")` is fine, but `lit(1) - lit(2)` needs both
+   - values that must carry a specific Arrow type, via `lit(pa.scalar(...))`
+   - `.when(...)`, `.otherwise(...)`, `F.nullif(...)`, `.between(...)`,
+     `F.in_list(...)` and similar method/function arguments
+
+3. **Column name quoting**: Column names are normalized to lowercase by default
+   in both `select("...")` and `col("...")`. To reference a column with
+   uppercase letters, use double quotes inside the string:
+   `select('"MyColumn"')` or `col('"MyColumn"')`.
 
 4. **DataFrames are immutable**: Every method returns a **new** DataFrame. You
    must capture the return value:
    ```python
-   df = df.filter(col("a") > lit(1))   # correct
-   df.filter(col("a") > lit(1))        # WRONG -- result is discarded
+   df = df.filter(col("a") > 1)   # correct
+   df.filter(col("a") > 1)        # WRONG -- result is discarded
    ```
 
 5. **Window frame defaults**: When using `order_by` in a window, the default
    frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. For a full
    partition frame, set `window_frame=WindowFrame("rows", None, None)`.
 
+6. **Arithmetic on aggregates belongs in a later `select`, not inside
+   `aggregate`**: Each item in the aggregate list must be a single aggregate
+   call (optionally aliased). Combining aggregates with arithmetic inside
+   `aggregate(...)` fails with `Internal error: Invalid aggregate expression`.
+   Alias the aggregates, then compute the combination downstream:
+   ```python
+   # WRONG -- arithmetic wraps two aggregates
+   df.aggregate([], [(lit(100) * F.sum(col("a")) / F.sum(col("b"))).alias("ratio")])
+
+   # CORRECT -- aggregate first, then combine
+   (df.aggregate([], [F.sum(col("a")).alias("num"), F.sum(col("b")).alias("den")])
+      .select((lit(100) * col("num") / col("den")).alias("ratio")))
+   ```
+
+7. **Don't alias a join column to match the other side**: When equi-joining
+   with `on="key"`, renaming the join column on one side via `.alias("key")`
+   in a fresh projection creates a schema where one side's `key` is
+   qualified (`?table?.key`) and the other is unqualified. The join then
+   fails with `Schema contains qualified field name ... and unqualified
+   field name ... which would be ambiguous`. Use `left_on=`/`right_on=` with
+   the native names, or use `join_on(...)` with an explicit equality.
+   ```python
+   # WRONG -- alias on one side produces ambiguous schema after join
+   failed = orders.select(col("o_orderkey").alias("l_orderkey"))
+   li.join(failed, on="l_orderkey")   # ambiguous l_orderkey error
+
+   # CORRECT -- keep native names, use left_on/right_on
+   failed = orders.select("o_orderkey")
+   li.join(failed, left_on="l_orderkey", right_on="o_orderkey")
+
+   # ALSO CORRECT -- explicit predicate via join_on
+   # (note: join_on keeps both key columns in the output, unlike on="key")
+   li.join_on(failed, col("l_orderkey") == col("o_orderkey"))
+   ```
+
 ## Idiomatic Patterns
 
 ### Fluent Chaining
@@ -436,7 +516,7 @@ result = (
     ctx.read_parquet("data.parquet")
     .filter(col("year") >= lit(2020))
     .select(col("region"), col("sales"))
-    .aggregate([col("region")], [F.sum(col("sales")).alias("total")])
+    .aggregate(["region"], [F.sum(col("sales")).alias("total")])
     .sort(col("total").sort(ascending=False))
     .limit(10)
 )
@@ -450,7 +530,7 @@ variables:
 
 ```python
 base = ctx.read_parquet("orders.parquet").filter(col("status") == lit("shipped"))
-by_region = base.aggregate([col("region")], [F.sum(col("amount")).alias("total")])
+by_region = base.aggregate(["region"], [F.sum(col("amount")).alias("total")])
 top_regions = by_region.filter(col("total") > lit(10000))
 ```
 
@@ -470,9 +550,9 @@ df = df.select(
 )
 
 # Use a collected scalar as an expression
-max_val = result_batch[0].column("max_price")[0]  # PyArrow scalar
+max_val = result_df.collect_column("max_price")[0]  # PyArrow scalar
 cutoff = lit(max_val) - lit(pa.scalar((0, 90, 0), type=pa.month_day_nano_interval()))
-df = df.filter(col("ship_date") <= cutoff)  # cutoff is already an Expr
+df = df.filter(col("ship_date") <= cutoff)   # cutoff is already an Expr
 ```
 
 **Important**: Do not wrap an `Expr` in `lit()`. `lit()` is for converting
@@ -529,9 +609,25 @@ The `functions` module (imported as `F`) provides 290+ functions. Key categories
 `ntile`, `lag`, `lead`, `first_value`, `last_value`, `nth_value`
 
 **String**: `length`, `lower`, `upper`, `trim`, `ltrim`, `rtrim`, `lpad`,
-`rpad`, `starts_with`, `ends_with`, `contains`, `substr`, `replace`, `reverse`,
-`repeat`, `split_part`, `concat`, `concat_ws`, `initcap`, `ascii`, `chr`,
-`left`, `right`, `strpos`, `translate`, `overlay`, `levenshtein`
+`rpad`, `starts_with`, `ends_with`, `contains`, `substr`, `substring`,
+`replace`, `reverse`, `repeat`, `split_part`, `concat`, `concat_ws`,
+`initcap`, `ascii`, `chr`, `left`, `right`, `strpos`, `translate`, `overlay`,
+`levenshtein`
+
+`F.substr(str, start)` takes **only two arguments** and returns the tail of
+the string from `start` onward — passing a third length argument raises
+`TypeError: substr() takes 2 positional arguments but 3 were given`. For the
+SQL-style 3-arg form (`SUBSTRING(str FROM start FOR length)`), use
+`F.substring(col("s"), lit(start), lit(length))`. For a fixed-length prefix,
+`F.left(col("s"), lit(n))` is cleanest.
+
+```python
+# WRONG — substr does not accept a length argument
+F.substr(col("c_phone"), lit(1), lit(2))
+# CORRECT
+F.substring(col("c_phone"), lit(1), lit(2))   # explicit length
+F.left(col("c_phone"), lit(2))                # prefix shortcut
+```
 
 **Math**: `abs`, `ceil`, `floor`, `round`, `trunc`, `sqrt`, `cbrt`, `exp`,
 `ln`, `log`, `log2`, `log10`, `pow`, `signum`, `pi`, `random`, `factorial`,
