@@ -83,9 +83,8 @@ Every method returns a **new** DataFrame (immutable/lazy). Chain them fluently.
 ### Projection
 
 ```python
-df.select("a", "b")                                    # by column name
-df.select(col("a"), col("b"))                          # by Expr
-df.select(col("a"), (col("b") + 1).alias("b_plus_1"))  # with expression
+df.select("a", "b")                                    # preferred: plain names as strings
+df.select(col("a"), (col("b") + 1).alias("b_plus_1"))  # use col()/Expr only when you need an expression
 
 df.with_column("new_col", col("a") + lit(10))          # add one column
 df.with_columns(
@@ -97,38 +96,61 @@ df.drop("unwanted_col")
 df.with_column_renamed("old_name", "new_name")
 ```
 
+When a column is referenced by name alone, pass the name as a string rather
+than wrapping it in `col()`. Reach for `col()` only when the projection needs
+arithmetic, aliasing, casting, or another expression operation.
+
+**Case sensitivity**: both `select("Name")` and `col("Name")` lowercase the
+identifier. For a column whose real name has uppercase letters, embed double
+quotes inside the string: `select('"MyCol"')` or `col('"MyCol"')`. Without the
+inner quotes the lookup will fail with `No field named mycol`.
+
 ### Filtering
 
 ```python
-df.filter(col("a") > lit(10))
-df.filter(col("a") > lit(10), col("b") == lit("x"))  # multiple = AND
-df.filter("a > 10")                                  # SQL expression string
+df.filter(col("a") > 10)
+df.filter(col("a") > 10, col("b") == "x")  # multiple = AND
+df.filter("a > 10")                        # SQL expression string
 ```
 
+Raw Python values on the right-hand side of a comparison are auto-wrapped
+into literals by the `Expr` operators, so prefer `col("a") > 10` over
+`col("a") > lit(10)`. See the Comparisons section and pitfall #2 for the
+full rule.
+
 ### Aggregation
 
 ```python
 # GROUP BY a, compute sum(b) and count(*)
-df.aggregate([col("a")], [F.sum(col("b")), F.count(col("a"))])
+df.aggregate(["a"], [F.sum(col("b")), F.count(col("a"))])
 
 # HAVING equivalent: use the filter keyword on the aggregate function
 df.aggregate(
-    [col("region")],
+    ["region"],
     [F.sum(col("sales"), filter=col("sales") > lit(1000)).alias("large_sales")],
 )
 ```
 
+As with `select()`, group keys can be passed as plain name strings. Reach for
+`col(...)` only when the grouping expression needs arithmetic, aliasing,
+casting, or another expression operation.
+
 Most aggregate functions accept an optional `filter` keyword argument. When
 provided, only rows where the filter expression is true contribute to the
 aggregate.
 
 ### Sorting
 
 ```python
-df.sort(col("a").sort(ascending=True, nulls_first=False))
+df.sort(col("a"))                          # ascending (default)
 df.sort(col("a").sort(ascending=False))    # descending
+df.sort(col("a").sort(nulls_first=False))  # override null placement
 ```
 
+A plain expression passed to `sort()` is already treated as ascending. Only
+reach for `col(...).sort(...)` when you need to override a default (descending
+order or null placement). Writing `col("a").sort(ascending=True)` is redundant.
+
 ### Joining
 
 ```python
@@ -151,6 +173,14 @@ df1.join(df2, on="key", how="anti")
 
 Join types: `"inner"`, `"left"`, `"right"`, `"full"`, `"semi"`, `"anti"`.
 
+Inner is the default `how`. Prefer `df1.join(df2, on="key")` over
+`df1.join(df2, on="key", how="inner")` — drop `how=` unless you need a
+non-inner join type.
+
+When the two sides' join columns have different native names, use
+`left_on=`/`right_on=` with the original names rather than aliasing one side
+to match the other — see pitfall #7.
+
 ### Window Functions
 
 ```python
@@ -220,6 +250,7 @@ DataFrames are lazy until you collect.
 ```python
 df.show()                            # print formatted table to stdout
 batches = df.collect()               # list[pa.RecordBatch]
+arr = df.collect_column("col_name")  # pa.Array | pa.ChunkedArray (single column)
 table = df.to_arrow_table()          # pa.Table
 pandas_df = df.to_pandas()           # pd.DataFrame
 polars_df = df.to_polars()           # pl.DataFrame
@@ -289,25 +320,30 @@ literal, which is **not** compatible with `Date32` arithmetic. Always use
 ### Comparisons
 
 ```python
-col("a") > lit(10)
-col("a") >= lit(10)
-col("a") < lit(10)
-col("a") <= lit(10)
-col("a") == lit("x")
-col("a") != lit("x")
+col("a") > 10
+col("a") >= 10
+col("a") < 10
+col("a") <= 10
+col("a") == "x"
+col("a") != "x"
 col("a") == None   # same as col("a").is_null()
 col("a") != None   # same as col("a").is_not_null()
 ```
 
+Comparison operators auto-wrap the right-hand Python value into a literal,
+so writing `col("a") > lit(10)` is redundant. Drop the `lit()` in
+comparisons. Reach for `lit()` only when auto-wrapping does not apply — see
+pitfall #2.
+
 ### Boolean Logic
 
 **Important**: Python's `and`, `or`, `not` keywords do NOT work with Expr
 objects. You must use the bitwise operators:
 
 ```python
-(col("a") > lit(1)) & (col("b") < lit(10))  # AND
-(col("a") > lit(1)) | (col("b") < lit(10))  # OR
-~(col("a") > lit(1))                        # NOT
+(col("a") > 1) & (col("b") < 10)  # AND
+(col("a") > 1) | (col("b") < 10)  # OR
+~(col("a") > 1)                   # NOT
 ```
 
 Always wrap each comparison in parentheses when combining with `&`, `|`, `~`
@@ -379,7 +415,7 @@ col("array_col")[1:3] # array slice (0-indexed)
 | `SELECT a, b + 1 AS c` | `df.select(col("a"), (col("b") + lit(1)).alias("c"))` |
 | `SELECT *, a + 1 AS c` | `df.with_column("c", col("a") + lit(1))` |
 | `WHERE a > 10` | `df.filter(col("a") > lit(10))` |
-| `GROUP BY a` with `SUM(b)` | `df.aggregate([col("a")], [F.sum(col("b"))])` |
+| `GROUP BY a` with `SUM(b)` | `df.aggregate(["a"], [F.sum(col("b"))])` |
 | `SUM(b) FILTER (WHERE b > 100)` | `F.sum(col("b"), filter=col("b") > lit(100))` |
 | `ORDER BY a DESC` | `df.sort(col("a").sort(ascending=False))` |
 | `LIMIT 10 OFFSET 5` | `df.limit(10, offset=5)` |
@@ -407,26 +443,70 @@ col("array_col")[1:3] # array slice (0-indexed)
 1. **Boolean operators**: Use `&`, `|`, `~` -- not Python's `and`, `or`, `not`.
    Always parenthesize: `(col("a") > lit(1)) & (col("b") < lit(2))`.
 
-2. **Wrapping scalars with `lit()`**: Comparisons with raw Python values work
-   (e.g., `col("a") > 10`) because the Expr operators auto-wrap them, but
-   standalone scalars in function calls need `lit()`:
-   `F.coalesce(col("a"), lit(0))`, not `F.coalesce(col("a"), 0)`.
-
-3. **Column name quoting**: Column names are normalized to lowercase by default.
-   To reference a column with uppercase letters, use double quotes inside the
-   string: `col('"MyColumn"')`.
+2. **Wrapping scalars with `lit()`**: Prefer raw Python values on the
+   right-hand side of comparisons — `col("a") > 10`, `col("name") == "Alice"`
+   — because the Expr comparison operators auto-wrap them. Writing
+   `col("a") > lit(10)` is redundant. Reserve `lit()` for places where
+   auto-wrapping does *not* apply:
+   - standalone scalars passed into function calls:
+     `F.coalesce(col("a"), lit(0))`, not `F.coalesce(col("a"), 0)`
+   - arithmetic where the scalar sits to the left of the operator
+     (`lit(1) - col("discount")`), or where every operand is a scalar
+     (`lit(1) - lit(2)`)
+   - values that must carry a specific Arrow type, via `lit(pa.scalar(...))`
+   - `.when(...)`, `.otherwise(...)`, `F.nullif(...)`, `.between(...)`,
+     `F.in_list(...)` and similar method/function arguments
+
+3. **Column name quoting**: Column names are normalized to lowercase by default
+   in both `select("...")` and `col("...")`. To reference a column with
+   uppercase letters, use double quotes inside the string:
+   `select('"MyColumn"')` or `col('"MyColumn"')`.
 
 4. **DataFrames are immutable**: Every method returns a **new** DataFrame. You
    must capture the return value:
    ```python
-   df = df.filter(col("a") > lit(1))  # correct
-   df.filter(col("a") > lit(1))       # WRONG -- result is discarded
+   df = df.filter(col("a") > 1)  # correct
+   df.filter(col("a") > 1)       # WRONG -- result is discarded
    ```
 
 5. **Window frame defaults**: When using `order_by` in a window, the default
    frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. For a full
    partition frame, set `window_frame=WindowFrame("rows", None, None)`.
 
+6. **Arithmetic on aggregates belongs in a later `select`, not inside
+   `aggregate`**: Each item in the aggregate list must be a single aggregate
+   call (optionally aliased). Combining aggregates with arithmetic inside
+   `aggregate(...)` fails with `Internal error: Invalid aggregate expression`.
+   Alias the aggregates, then compute the combination downstream:
+   ```python
+   # WRONG -- arithmetic wraps two aggregates
+   df.aggregate([], [(lit(100) * F.sum(col("a")) / F.sum(col("b"))).alias("ratio")])
+
+   # CORRECT -- aggregate first, then combine
+   (df.aggregate([], [F.sum(col("a")).alias("num"), F.sum(col("b")).alias("den")])
+      .select((lit(100) * col("num") / col("den")).alias("ratio")))
+   ```
+
+7. **Don't alias a join column to match the other side**: When equi-joining
+   with `on="key"`, renaming the join column on one side via `.alias("key")`
+   in a fresh projection creates a schema where one side's `key` is
+   qualified (`?table?.key`) and the other is unqualified. The join then
+   fails with `Schema contains qualified field name ... and unqualified
+   field name ... which would be ambiguous`. Use `left_on=`/`right_on=` with
+   the native names, or use `join_on(...)` with an explicit equality.
+   ```python
+   # WRONG -- alias on one side produces ambiguous schema after join
+   failed = orders.select(col("o_orderkey").alias("l_orderkey"))
+   li.join(failed, on="l_orderkey")  # ambiguous l_orderkey error
+
+   # CORRECT -- keep native names, use left_on/right_on
+   failed = orders.select("o_orderkey")
+   li.join(failed, left_on="l_orderkey", right_on="o_orderkey")
+
+   # ALSO CORRECT -- explicit predicate via join_on
+   # (note: join_on keeps both key columns in the output, unlike on="key")
+   li.join_on(failed, col("l_orderkey") == col("o_orderkey"))
+   ```
+
 ## Idiomatic Patterns
 
 ### Fluent Chaining
@@ -436,7 +516,7 @@ result = (
     ctx.read_parquet("data.parquet")
     .filter(col("year") >= lit(2020))
     .select(col("region"), col("sales"))
-    .aggregate([col("region")], [F.sum(col("sales")).alias("total")])
+    .aggregate(["region"], [F.sum(col("sales")).alias("total")])
     .sort(col("total").sort(ascending=False))
     .limit(10)
 )
@@ -450,7 +530,7 @@ variables:
 
 ```python
 base = ctx.read_parquet("orders.parquet").filter(col("status") == lit("shipped"))
-by_region = base.aggregate([col("region")], [F.sum(col("amount")).alias("total")])
+by_region = base.aggregate(["region"], [F.sum(col("amount")).alias("total")])
 top_regions = by_region.filter(col("total") > lit(10000))
 ```
 
@@ -470,9 +550,9 @@ df = df.select(
 )
 
 # Use a collected scalar as an expression
-max_val = result_batch[0].column("max_price")[0]    # PyArrow scalar
+max_val = result_df.collect_column("max_price")[0]  # PyArrow scalar
 cutoff = lit(max_val) - lit(pa.scalar((0, 90, 0), type=pa.month_day_nano_interval()))
-df = df.filter(col("ship_date") <= cutoff)  # cutoff is already an Expr
+df = df.filter(col("ship_date") <= cutoff)          # cutoff is already an Expr
 ```
 
 **Important**: Do not wrap an `Expr` in `lit()`. `lit()` is for converting
@@ -529,9 +609,25 @@ The `functions` module (imported as `F`) provides 290+ functions. Key categories
 `ntile`, `lag`, `lead`, `first_value`, `last_value`, `nth_value`
 
 **String**: `length`, `lower`, `upper`, `trim`, `ltrim`, `rtrim`, `lpad`,
-`rpad`, `starts_with`, `ends_with`, `contains`, `substr`, `replace`, `reverse`,
-`repeat`, `split_part`, `concat`, `concat_ws`, `initcap`, `ascii`, `chr`,
-`left`, `right`, `strpos`, `translate`, `overlay`, `levenshtein`
+`rpad`, `starts_with`, `ends_with`, `contains`, `substr`, `substring`,
+`replace`, `reverse`, `repeat`, `split_part`, `concat`, `concat_ws`,
+`initcap`, `ascii`, `chr`, `left`, `right`, `strpos`, `translate`, `overlay`,
+`levenshtein`
+
+`F.substr(str, start)` takes **only two arguments** and returns the tail of
+the string from `start` onward — passing a third length argument raises
+`TypeError: substr() takes 2 positional arguments but 3 were given`. For the
+SQL-style 3-arg form (`SUBSTRING(str FROM start FOR length)`), use
+`F.substring(col("s"), lit(start), lit(length))`. For a fixed-length prefix,
+`F.left(col("s"), lit(n))` is cleanest.
+
+```python
+# WRONG — substr does not accept a length argument
+F.substr(col("c_phone"), lit(1), lit(2))
+# CORRECT
+F.substring(col("c_phone"), lit(1), lit(2))  # explicit length
+F.left(col("c_phone"), lit(2))               # prefix shortcut
+```
 
 **Math**: `abs`, `ceil`, `floor`, `round`, `trunc`, `sqrt`, `cbrt`, `exp`,
 `ln`, `log`, `log2`, `log10`, `pow`, `signum`, `pi`, `random`, `factorial`,