DataFrames are lazy until you collect.

```python
df.show()                    # print formatted table to stdout
batches = df.collect()       # list[pa.RecordBatch]
table = df.to_arrow_table()  # pa.Table
pandas_df = df.to_pandas()   # pd.DataFrame
```

```python
col("column_name")     # reference a column
lit(42)                # integer literal
lit("hello")           # string literal
lit(3.14)              # float literal
lit(pa.scalar(value))  # PyArrow scalar (preserves Arrow type)
```

`lit()` accepts PyArrow scalars directly -- prefer this over converting Arrow
data to Python and back when working with values extracted from query results.

### Arithmetic

```python
col("a") / lit(2)  # division
col("a") % lit(3)  # modulo
```

### Date Arithmetic

`Date32` columns require `Interval` types for arithmetic, not `Duration`. Use
PyArrow's `month_day_nano_interval` type, which takes a `(months, days, nanos)`
tuple:

```python
import pyarrow as pa

# Subtract 90 days from a date column
col("ship_date") - lit(pa.scalar((0, 90, 0), type=pa.month_day_nano_interval()))

# Subtract 3 months
col("ship_date") - lit(pa.scalar((3, 0, 0), type=pa.month_day_nano_interval()))
```

**Important**: `lit(datetime.timedelta(days=90))` creates a `Duration(µs)`
literal, which is **not** compatible with `Date32` arithmetic. Always use
`pa.month_day_nano_interval()` for date operations.

### Comparisons

```python
col("a") == lit(5)   # equal
col("a") != lit(5)   # not equal
col("a") < lit(5)    # less than
col("a") <= lit(5)   # less than or equal
col("a") > lit(5)    # greater than
col("a") >= lit(5)   # greater than or equal
```

```python
result = (
    ...  # earlier steps of the pipeline elided
    .aggregate([col("region")], [F.sum(col("sales")).alias("total")])
    .sort(col("total").sort(ascending=False))
    .limit(10)
)
result.show()
```

### Using Variables as CTEs

```python
by_region = base.aggregate([col("region")], [F.sum(col("amount")).alias("total")])
top_regions = by_region.filter(col("total") > lit(10000))
```

### Reusing Expressions as Variables

Just like DataFrames, expressions (`Expr`) can be stored in variables and used
anywhere an `Expr` is expected. This is useful for building up complex
expressions or reusing a computed value across multiple operations:

```python
# Build an expression and reuse it
disc_price = col("price") * (lit(1) - col("discount"))
df = df.select(
    col("id"),
    disc_price.alias("disc_price"),
    (disc_price * (lit(1) + col("tax"))).alias("total"),
)

# Use a collected scalar as an expression
max_val = result_batch[0].column("max_date")[0]  # PyArrow scalar
cutoff = lit(max_val) - lit(pa.scalar((0, 90, 0), type=pa.month_day_nano_interval()))
df = df.filter(col("ship_date") <= cutoff)  # cutoff is already an Expr
```

**Important**: Do not wrap an `Expr` in `lit()`. `lit()` is for converting
Python/PyArrow values into expressions. If a value is already an `Expr`, use it
directly.

### Window Functions for Scalar Subqueries

Where SQL uses a correlated scalar subquery, the idiomatic DataFrame approach