
Commit a3f19a9

timsaucer and claude committed
docs: enrich RST pages with demos relocated from TPC-H rewrite
Moves the illustrative patterns that #1504 removed from the TPC-H examples into the common-operations docs, where they serve as pattern-focused teaching material without cluttering the TPC-H translations:

- expressions.rst gains a "Testing membership in a list" section comparing `|`-compound filters, `in_list`, and `array_position` + `make_array`, plus a "Conditional expressions" section contrasting switched and searched `case`.
- udf-and-udfa.rst gains a "When not to use a UDF" subsection showing the compound-OR predicate that replaces a Python-side UDF for disjunctive bucket filters (the Q19 case).
- aggregations.rst gains a "Building per-group arrays" subsection covering `array_agg(filter=..., distinct=True)` with `array_length`/`array_element` for the single-value-per-group pattern (the Q21 case).
- Adds `examples/array-operations.py`, a runnable end-to-end walkthrough of the membership and array_agg patterns.

Implements PR 4e of the plan in #1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 35b7893 commit a3f19a9

5 files changed

Lines changed: 314 additions & 0 deletions


docs/source/user-guide/common-operations/aggregations.rst

Lines changed: 56 additions & 0 deletions
@@ -163,6 +163,62 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
        f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])

Building per-group arrays
^^^^^^^^^^^^^^^^^^^^^^^^^

:py:func:`~datafusion.functions.array_agg` collects the values within each
group into a list. Combined with ``distinct=True`` and the ``filter``
argument, it lets you ask two questions of the same group in one pass:
"what are all the values?" and "what are the values that satisfy some
condition?".

Suppose each row records a line item with the supplier that fulfilled it and
a flag for whether that supplier met the commit date. We want to identify
orders where exactly one supplier failed, among two or more suppliers in
total:

.. ipython:: python

    from datafusion import SessionContext, col, lit, functions as f

    ctx = SessionContext()
    df = ctx.from_pydict(
        {
            "order_id": [1, 1, 1, 2, 2, 3],
            "supplier_id": [100, 101, 102, 200, 201, 300],
            "failed": [False, True, False, False, False, True],
        },
    )

    grouped = df.aggregate(
        [col("order_id")],
        [
            f.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"),
            f.array_agg(
                col("supplier_id"),
                filter=col("failed"),
                distinct=True,
            ).alias("failed_suppliers"),
        ],
    )

    grouped.filter(
        (f.array_length(col("failed_suppliers")) == lit(1))
        & (f.array_length(col("all_suppliers")) > lit(1))
    ).select(
        col("order_id"),
        f.array_element(col("failed_suppliers"), lit(1)).alias("the_one_bad_supplier"),
    )

Two aspects of the pattern are worth calling out:

- ``filter=`` on an aggregate narrows the rows contributing to *that*
  aggregate only. Filtering the DataFrame before the aggregate would have
  dropped whole groups that no longer had any rows.
- :py:func:`~datafusion.functions.array_length` tests group size without
  another aggregate pass, and :py:func:`~datafusion.functions.array_element`
  extracts a single value once you have proven the array has length one.
Grouping Sets
-------------

docs/source/user-guide/common-operations/expressions.rst

Lines changed: 92 additions & 0 deletions
@@ -146,6 +146,98 @@ This function returns a new array with the elements repeated.

In this example, the `repeated_array` column will contain `[[1, 2, 3], [1, 2, 3]]`.

Testing membership in a list
----------------------------

A common need is filtering rows where a column equals *any* of a small set of
values. DataFusion offers three forms; they differ in readability and in how
they scale:

1. A compound boolean using ``|`` across explicit equalities.
2. :py:func:`~datafusion.functions.in_list`, which accepts a list of
   expressions and tests equality against all of them in one call.
3. A trick with :py:func:`~datafusion.functions.array_position` and
   :py:func:`~datafusion.functions.make_array`, which returns the 1-based
   index of the value in a constructed array, or null if it is not present.

.. ipython:: python

    from datafusion import SessionContext, col, lit
    from datafusion import functions as f

    ctx = SessionContext()
    df = ctx.from_pydict({"shipmode": ["MAIL", "SHIP", "AIR", "TRUCK", "RAIL"]})

    # Option 1: compound boolean. Fine for two values; awkward past three.
    df.filter((col("shipmode") == lit("MAIL")) | (col("shipmode") == lit("SHIP")))

    # Option 2: in_list. Preferred for readability as the set grows.
    df.filter(f.in_list(col("shipmode"), [lit("MAIL"), lit("SHIP")]))

    # Option 3: array_position / make_array. Useful when you already have the
    # set as an array column and want "is in that array" semantics.
    df.filter(
        ~f.array_position(
            f.make_array(lit("MAIL"), lit("SHIP")), col("shipmode")
        ).is_null()
    )

Use ``in_list`` as the default. It is explicit, readable, and matches the
semantics users expect from SQL's ``IN (...)``. Reach for the
``array_position`` form only when the membership set is itself an array
column rather than a literal list.
Conditional expressions
-----------------------

DataFusion provides :py:func:`~datafusion.functions.case` for the SQL
``CASE`` expression in both its switched and searched forms, along with
:py:func:`~datafusion.functions.when` as a standalone builder for the
searched form.

**Switched CASE** (one expression compared against several literal values):

.. ipython:: python

    df = ctx.from_pydict(
        {"priority": ["1-URGENT", "2-HIGH", "3-MEDIUM", "5-LOW"]},
    )

    df.select(
        col("priority"),
        f.case(col("priority"))
        .when(lit("1-URGENT"), lit(1))
        .when(lit("2-HIGH"), lit(1))
        .otherwise(lit(0))
        .alias("is_high_priority"),
    )

**Searched CASE** (an independent boolean predicate per branch). Use this
form whenever a branch tests more than simple equality, for example
checking whether a joined column is ``NULL`` to gate a computed value:

.. ipython:: python

    df = ctx.from_pydict(
        {"volume": [10.0, 20.0, 30.0], "supplier_id": [1, None, 2]},
    )

    df.select(
        col("volume"),
        col("supplier_id"),
        f.when(col("supplier_id").is_not_null(), col("volume"))
        .otherwise(lit(0.0))
        .alias("attributed_volume"),
    )

This searched-CASE pattern is idiomatic for "attribute the measure to the
matching side of a left join, otherwise contribute zero", a shape that
appears in TPC-H Q08 and similar market-share calculations.

If a switched CASE has only two or three branches that test equality,
``in_list`` as the predicate of a single ``when`` branch, finished with
``otherwise``, is often simpler than the full ``case`` builder.
Structs
-------

docs/source/user-guide/common-operations/udf-and-udfa.rst

Lines changed: 61 additions & 0 deletions
@@ -101,6 +101,67 @@ write Rust based UDFs and to expose them to Python. There is an example in the
`DataFusion blog <https://datafusion.apache.org/blog/2024/11/19/datafusion-python-udf-comparisons/>`_
describing how to do this.

When not to use a UDF
^^^^^^^^^^^^^^^^^^^^^

A UDF is the right tool when the computation genuinely cannot be expressed
with built-in functions. It is often the *wrong* tool for a compound
predicate that happens to be easier to write in Python. The optimizer
cannot push a UDF through joins or filters, so a Python-side predicate
prevents otherwise obvious rewrites and forces a per-row Python callback.

Consider a filter that selects rows falling into one of several
brand-specific buckets, each with its own containers, quantity range, and
size range:

.. code-block:: python

    # Anti-pattern: the predicate is a plain disjunction, but hidden inside a UDF.
    import pyarrow as pa

    from datafusion import col, udf

    def is_of_interest(brand, container, quantity, size):
        result = []
        for b, c, q, s in zip(brand, container, quantity, size):
            b = b.as_py()
            if b == "Brand#12":
                result.append(c.as_py() in ("SM CASE", "SM BOX") and 1 <= q.as_py() <= 11 and 1 <= s.as_py() <= 5)
            elif b == "Brand#23":
                result.append(c.as_py() in ("MED BAG", "MED BOX") and 10 <= q.as_py() <= 20 and 1 <= s.as_py() <= 10)
            else:
                result.append(False)
        return pa.array(result)

    # Register the vectorized predicate (Arrow input/return types assumed here).
    udf_is_of_interest = udf(
        is_of_interest,
        [pa.string(), pa.string(), pa.int64(), pa.int64()],
        pa.bool_(),
        "stable",
    )

    df = df.filter(udf_is_of_interest(col("brand"), col("container"), col("quantity"), col("size")))

The native equivalent keeps the bucket definitions as plain Python data
(a dict) and builds an ``Expr`` from them. The optimizer sees a disjunction
of simple predicates it can analyze and push down:

.. code-block:: python

    from functools import reduce
    from operator import or_

    from datafusion import col, lit, functions as f

    items_of_interest = {
        "Brand#12": {"containers": ["SM CASE", "SM BOX"], "min_qty": 1, "max_size": 5},
        "Brand#23": {"containers": ["MED BAG", "MED BOX"], "min_qty": 10, "max_size": 10},
    }

    def brand_clause(brand, spec):
        return (
            (col("brand") == lit(brand))
            & f.in_list(col("container"), [lit(c) for c in spec["containers"]])
            & (col("quantity") >= lit(spec["min_qty"]))
            & (col("quantity") <= lit(spec["min_qty"] + 10))
            & (col("size") >= lit(1))
            & (col("size") <= lit(spec["max_size"]))
        )

    predicate = reduce(or_, (brand_clause(b, s) for b, s in items_of_interest.items()))
    df = df.filter(predicate)

Reach for a UDF when the per-row computation is not expressible as a tree
of built-in functions. When it *is* expressible, build the ``Expr`` tree
directly.

Aggregate Functions
-------------------

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ Here is a direct link to the file used in the examples:

- [Query a Parquet file using the DataFrame API](./dataframe-parquet.py)
- [Run a SQL query and store the results in a Pandas DataFrame](./sql-to-pandas.py)
- [Query PyArrow Data](./query-pyarrow-data.py)
- [Array operations: membership tests, array_agg patterns, array inspection](./array-operations.py)

### Running User-Defined Python Code

examples/array-operations.py

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Array operations in DataFusion Python.

Runnable reference for the idiomatic array-building and array-inspection
patterns. No external data is required -- the example constructs all inputs
through ``from_pydict``.

Topics covered:

- ``F.make_array`` to build a literal array expression.
- ``F.array_position`` and ``F.in_list`` for membership tests.
- ``F.array_length`` and ``F.array_element`` for inspecting an aggregated
  array.
- ``F.array_agg(distinct=True, filter=...)`` for building two related arrays
  per group in one pass, and filtering groups by array size afterwards.

Run with::

    python examples/array-operations.py
"""

from datafusion import SessionContext, col, lit
from datafusion import functions as F

ctx = SessionContext()


# ---------------------------------------------------------------------------
# 1. Membership tests: in_list vs. array_position / make_array
# ---------------------------------------------------------------------------

shipments = ctx.from_pydict(
    {
        "order_id": [1, 2, 3, 4, 5],
        "shipmode": ["MAIL", "SHIP", "AIR", "TRUCK", "RAIL"],
    }
)

print("\n== in_list: is shipmode one of {MAIL, SHIP}? ==")
shipments.filter(F.in_list(col("shipmode"), [lit("MAIL"), lit("SHIP")])).show()

print("\n== array_position / make_array: same question via a literal array ==")
shipments.filter(
    ~F.array_position(F.make_array(lit("MAIL"), lit("SHIP")), col("shipmode")).is_null()
).show()


# ---------------------------------------------------------------------------
# 2. array_agg with filter to inspect groups via two related arrays
# ---------------------------------------------------------------------------
#
# Input represents line items per order, each fulfilled by one supplier. The
# `failed` column marks whether the supplier met the commit date. We want to
# find orders with multiple suppliers where exactly one of them failed, and
# report that single failing supplier.

line_items = ctx.from_pydict(
    {
        "order_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
        "supplier_id": [100, 101, 102, 200, 201, 300, 301, 302, 303],
        "failed": [False, True, False, False, False, True, False, False, False],
    }
)

grouped = line_items.aggregate(
    [col("order_id")],
    [
        F.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"),
        F.array_agg(
            col("supplier_id"),
            filter=col("failed"),
            distinct=True,
        ).alias("failed_suppliers"),
    ],
)

print("\n== per-order supplier arrays ==")
grouped.sort(col("order_id").sort()).show()

print("\n== orders with >1 supplier and exactly one failure ==")
singled_out = grouped.filter(
    (F.array_length(col("failed_suppliers")) == lit(1))
    & (F.array_length(col("all_suppliers")) > lit(1))
).select(
    col("order_id"),
    F.array_element(col("failed_suppliers"), lit(1)).alias("bad_supplier"),
)
singled_out.sort(col("order_id").sort()).show()
