
Commit a3f19a9

timsaucer and claude committed
docs: enrich RST pages with demos relocated from TPC-H rewrite
Moves the illustrative patterns that #1504 removed from the TPC-H examples into the common-operations docs, where they serve as pattern-focused teaching material without cluttering the TPC-H translations:

- expressions.rst gains a "Testing membership in a list" section comparing `|`-compound filters, `in_list`, and `array_position` + `make_array`, plus a "Conditional expressions" section contrasting switched and searched `case`.
- udf-and-udfa.rst gains a "When not to use a UDF" subsection showing the compound-OR predicate that replaces a Python-side UDF for disjunctive bucket filters (the Q19 case).
- aggregations.rst gains a "Building per-group arrays" subsection covering `array_agg(filter=..., distinct=True)` with `array_length`/`array_element` for the single-value-per-group pattern (the Q21 case).
- Adds `examples/array-operations.py`, a runnable end-to-end walkthrough of the membership and array_agg patterns.

Implements PR 4e of the plan in #1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 35b7893 commit a3f19a9

5 files changed

Lines changed: 314 additions & 0 deletions


docs/source/user-guide/common-operations/aggregations.rst

Lines changed: 56 additions & 0 deletions
@@ -163,6 +163,62 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
        f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])

Building per-group arrays
^^^^^^^^^^^^^^^^^^^^^^^^^

:py:func:`~datafusion.functions.array_agg` collects the values within each
group into a list. Combined with ``distinct=True`` and the ``filter``
argument, it lets you ask two questions of the same group in one pass:
"what are all the values?" and "what are the values that satisfy some
condition?".

Suppose each row records a line item with the supplier that fulfilled it and
a flag for whether that supplier met the commit date. We want to identify
orders where exactly one supplier failed, among two or more suppliers in
total:

.. ipython:: python

    from datafusion import SessionContext, col, lit, functions as f

    ctx = SessionContext()
    df = ctx.from_pydict(
        {
            "order_id": [1, 1, 1, 2, 2, 3],
            "supplier_id": [100, 101, 102, 200, 201, 300],
            "failed": [False, True, False, False, False, True],
        },
    )

    grouped = df.aggregate(
        [col("order_id")],
        [
            f.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"),
            f.array_agg(
                col("supplier_id"),
                filter=col("failed"),
                distinct=True,
            ).alias("failed_suppliers"),
        ],
    )

    grouped.filter(
        (f.array_length(col("failed_suppliers")) == lit(1))
        & (f.array_length(col("all_suppliers")) > lit(1))
    ).select(
        col("order_id"),
        f.array_element(col("failed_suppliers"), lit(1)).alias("the_one_bad_supplier"),
    )

Two aspects of the pattern are worth calling out:

- ``filter=`` on an aggregate narrows the rows contributing to *that*
  aggregate only. Filtering the DataFrame before the aggregate would have
  dropped whole groups that no longer had any rows.
- :py:func:`~datafusion.functions.array_length` tests group size without
  another aggregate pass, and :py:func:`~datafusion.functions.array_element`
  extracts a single value once you have proven the array has length one.
Grouping Sets
-------------

docs/source/user-guide/common-operations/expressions.rst

Lines changed: 92 additions & 0 deletions
@@ -146,6 +146,98 @@ This function returns a new array with the elements repeated.

In this example, the `repeated_array` column will contain `[[1, 2, 3], [1, 2, 3]]`.

Testing membership in a list
----------------------------

A common need is filtering rows where a column equals *any* of a small set of
values. DataFusion offers three forms; they differ in readability and in how
they scale:

1. A compound boolean using ``|`` across explicit equalities.
2. :py:func:`~datafusion.functions.in_list`, which accepts a list of
   expressions and tests equality against all of them in one call.
3. A trick with :py:func:`~datafusion.functions.array_position` and
   :py:func:`~datafusion.functions.make_array`, which returns the 1-based
   index of the value in a constructed array, or null if it is not present.

.. ipython:: python

    from datafusion import SessionContext, col, lit
    from datafusion import functions as f

    ctx = SessionContext()
    df = ctx.from_pydict({"shipmode": ["MAIL", "SHIP", "AIR", "TRUCK", "RAIL"]})

    # Option 1: compound boolean. Fine for two values; awkward past three.
    df.filter((col("shipmode") == lit("MAIL")) | (col("shipmode") == lit("SHIP")))

    # Option 2: in_list. Preferred for readability as the set grows.
    df.filter(f.in_list(col("shipmode"), [lit("MAIL"), lit("SHIP")]))

    # Option 3: array_position / make_array. Useful when you already have the
    # set as an array column and want "is in that array" semantics.
    df.filter(
        ~f.array_position(
            f.make_array(lit("MAIL"), lit("SHIP")), col("shipmode")
        ).is_null()
    )

Use ``in_list`` as the default. It is explicit, readable, and matches the
semantics users expect from SQL's ``IN (...)``. Reach for the
``array_position`` form only when the membership set is itself an array
column rather than a literal list.
Conditional expressions
-----------------------

DataFusion provides :py:func:`~datafusion.functions.case` for the SQL
``CASE`` expression in both its switched and searched forms, along with
:py:func:`~datafusion.functions.when` as a standalone builder for the
searched form.

**Switched CASE** (one expression compared against several literal values):

.. ipython:: python

    df = ctx.from_pydict(
        {"priority": ["1-URGENT", "2-HIGH", "3-MEDIUM", "5-LOW"]},
    )

    df.select(
        col("priority"),
        f.case(col("priority"))
        .when(lit("1-URGENT"), lit(1))
        .when(lit("2-HIGH"), lit(1))
        .otherwise(lit(0))
        .alias("is_high_priority"),
    )

**Searched CASE** (an independent boolean predicate per branch). Use this
form whenever a branch tests more than simple equality, for example
checking whether a joined column is ``NULL`` to gate a computed value:

.. ipython:: python

    df = ctx.from_pydict(
        {"volume": [10.0, 20.0, 30.0], "supplier_id": [1, None, 2]},
    )

    df.select(
        col("volume"),
        col("supplier_id"),
        f.when(col("supplier_id").is_not_null(), col("volume"))
        .otherwise(lit(0.0))
        .alias("attributed_volume"),
    )

This searched-CASE pattern is idiomatic for "attribute the measure to the
matching side of a left join, otherwise contribute zero", a shape that
appears in TPC-H Q08 and similar market-share calculations.

If a switched CASE has only two or three branches that test equality,
``in_list`` as the predicate of a single ``when`` branch, finished with
``otherwise``, is often simpler than the full ``case`` builder.
Structs
-------

docs/source/user-guide/common-operations/udf-and-udfa.rst

Lines changed: 61 additions & 0 deletions
@@ -101,6 +101,67 @@ write Rust based UDFs and to expose them to Python. There is an example in the
`DataFusion blog <https://datafusion.apache.org/blog/2024/11/19/datafusion-python-udf-comparisons/>`_
describing how to do this.

When not to use a UDF
^^^^^^^^^^^^^^^^^^^^^

A UDF is the right tool when the computation genuinely cannot be expressed
with built-in functions. It is often the *wrong* tool for a compound
predicate that happens to be easier to write in Python. The optimizer
cannot push a UDF through joins or filters, so a Python-side predicate
prevents otherwise obvious rewrites and forces a per-row Python callback.

Consider a filter that selects rows falling into one of several
brand-specific buckets, each with its own containers, quantity range, and
size range:

.. code-block:: python

    # Anti-pattern: the predicate is a plain disjunction, but hidden inside a UDF.
    import pyarrow as pa

    from datafusion import col, udf

    def is_of_interest(brand, container, quantity, size):
        result = []
        for b, c, q, s in zip(brand, container, quantity, size):
            b = b.as_py()
            if b == "Brand#12":
                result.append(c.as_py() in ("SM CASE", "SM BOX") and 1 <= q.as_py() <= 11 and 1 <= s.as_py() <= 5)
            elif b == "Brand#23":
                result.append(c.as_py() in ("MED BAG", "MED BOX") and 10 <= q.as_py() <= 20 and 1 <= s.as_py() <= 10)
            else:
                result.append(False)
        return pa.array(result)

    # Register the vectorized predicate (Arrow input/return types assumed here).
    udf_is_of_interest = udf(
        is_of_interest,
        [pa.string(), pa.string(), pa.int64(), pa.int64()],
        pa.bool_(),
        "stable",
    )

    df = df.filter(udf_is_of_interest(col("brand"), col("container"), col("quantity"), col("size")))

The native equivalent keeps the bucket definitions as plain Python data
(a dict) and builds an ``Expr`` from them. The optimizer sees a disjunction
of simple predicates it can analyze and push down:

.. code-block:: python

    from functools import reduce
    from operator import or_

    from datafusion import col, lit, functions as f

    items_of_interest = {
        "Brand#12": {"containers": ["SM CASE", "SM BOX"], "min_qty": 1, "max_size": 5},
        "Brand#23": {"containers": ["MED BAG", "MED BOX"], "min_qty": 10, "max_size": 10},
    }

    def brand_clause(brand, spec):
        return (
            (col("brand") == lit(brand))
            & f.in_list(col("container"), [lit(c) for c in spec["containers"]])
            & (col("quantity") >= lit(spec["min_qty"]))
            & (col("quantity") <= lit(spec["min_qty"] + 10))
            & (col("size") >= lit(1))
            & (col("size") <= lit(spec["max_size"]))
        )

    predicate = reduce(or_, (brand_clause(b, s) for b, s in items_of_interest.items()))
    df = df.filter(predicate)

Reach for a UDF when the per-row computation is not expressible as a tree
of built-in functions. When it *is* expressible, build the ``Expr`` tree
directly.

Aggregate Functions
-------------------

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ Here is a direct link to the file used in the examples:

- [Query a Parquet file using the DataFrame API](./dataframe-parquet.py)
- [Run a SQL query and store the results in a Pandas DataFrame](./sql-to-pandas.py)
- [Query PyArrow Data](./query-pyarrow-data.py)
- [Array operations: membership tests, array_agg patterns, array inspection](./array-operations.py)

### Running User-Defined Python Code

examples/array-operations.py

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Array operations in DataFusion Python.

Runnable reference for the idiomatic array-building and array-inspection
patterns. No external data is required -- the example constructs all inputs
through ``from_pydict``.

Topics covered:

- ``F.make_array`` to build a literal array expression.
- ``F.array_position`` and ``F.in_list`` for membership tests.
- ``F.array_length`` and ``F.array_element`` for inspecting an aggregated
  array.
- ``F.array_agg(distinct=True, filter=...)`` for building two related arrays
  per group in one pass, and filtering groups by array size afterwards.

Run with::

    python examples/array-operations.py
"""

from datafusion import SessionContext, col, lit
from datafusion import functions as F

ctx = SessionContext()


# ---------------------------------------------------------------------------
# 1. Membership tests: in_list vs. array_position / make_array
# ---------------------------------------------------------------------------

shipments = ctx.from_pydict(
    {
        "order_id": [1, 2, 3, 4, 5],
        "shipmode": ["MAIL", "SHIP", "AIR", "TRUCK", "RAIL"],
    }
)

print("\n== in_list: is shipmode one of {MAIL, SHIP}? ==")
shipments.filter(F.in_list(col("shipmode"), [lit("MAIL"), lit("SHIP")])).show()

print("\n== array_position / make_array: same question via a literal array ==")
shipments.filter(
    ~F.array_position(F.make_array(lit("MAIL"), lit("SHIP")), col("shipmode")).is_null()
).show()


# ---------------------------------------------------------------------------
# 2. array_agg with filter to inspect groups via two related arrays
# ---------------------------------------------------------------------------
#
# Input represents line items per order, each fulfilled by one supplier. The
# `failed` column marks whether the supplier met the commit date. We want to
# find orders with multiple suppliers where exactly one of them failed, and
# report that single failing supplier.

line_items = ctx.from_pydict(
    {
        "order_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
        "supplier_id": [100, 101, 102, 200, 201, 300, 301, 302, 303],
        "failed": [False, True, False, False, False, True, False, False, False],
    }
)

grouped = line_items.aggregate(
    [col("order_id")],
    [
        F.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"),
        F.array_agg(
            col("supplier_id"),
            filter=col("failed"),
            distinct=True,
        ).alias("failed_suppliers"),
    ],
)

print("\n== per-order supplier arrays ==")
grouped.sort(col("order_id").sort()).show()

print("\n== orders with >1 supplier and exactly one failure ==")
singled_out = grouped.filter(
    (F.array_length(col("failed_suppliers")) == lit(1))
    & (F.array_length(col("all_suppliers")) > lit(1))
).select(
    col("order_id"),
    F.array_element(col("failed_suppliers"), lit(1)).alias("bad_supplier"),
)
singled_out.sort(col("order_id").sort()).show()
