Commit 7238c73

timsaucer and claude committed
Fix AGENTS.md: Arrow C Data Interface, aggregate filter, fluent example
- Clarify that DataFusion works with any Arrow C Data Interface implementation, not just PyArrow.
- Show the filter keyword argument on aggregate functions (the idiomatic HAVING equivalent) instead of the post-aggregate .filter() pattern.
- Update the SQL reference table to show FILTER (WHERE ...) syntax.
- Remove the now-incorrect "Aggregate then filter for HAVING" pitfall.
- Add .collect() to the fluent chaining example so the result is clearly materialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c5f75f5 commit 7238c73

1 file changed

Lines changed: 13 additions & 12 deletions

python/datafusion/AGENTS.md

@@ -27,8 +27,10 @@ dependencies. You create a `SessionContext`, point it at data (Parquet, CSV,
 JSON, Arrow IPC, Pandas, Polars, or raw Python dicts/lists), and run queries
 using either SQL or the DataFrame API described below.
 
-All data flows through **PyArrow** (`pyarrow.RecordBatch` / `pyarrow.Table`),
-so any library that speaks Arrow can interoperate with DataFusion.
+All data flows through **Apache Arrow**. The canonical Python implementation is
+PyArrow (`pyarrow.RecordBatch` / `pyarrow.Table`), but any library that
+conforms to the [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html)
+can interoperate with DataFusion.
 
 ## Core Abstractions
 
@@ -109,13 +111,17 @@ df.filter("a > 10") # SQL expression string
 # GROUP BY a, compute sum(b) and count(*)
 df.aggregate([col("a")], [F.sum(col("b")), F.count(col("a"))])
 
-# HAVING: filter after aggregate
+# HAVING equivalent: use the filter keyword on the aggregate function
 df.aggregate(
     [col("region")],
-    [F.sum(col("sales")).alias("total_sales")],
-).filter(col("total_sales") > lit(1000))
+    [F.sum(col("sales"), filter=col("sales") > lit(1000)).alias("large_sales")],
+)
 ```
 
+Most aggregate functions accept an optional `filter` keyword argument. When
+provided, only rows where the filter expression is true contribute to the
+aggregate.
+
 ### Sorting
 
 ```python
@@ -349,7 +355,7 @@ col("array_col")[1:3] # array slice (0-indexed)
 | `SELECT *, a + 1 AS c` | `df.with_column("c", col("a") + lit(1))` |
 | `WHERE a > 10` | `df.filter(col("a") > lit(10))` |
 | `GROUP BY a` with `SUM(b)` | `df.aggregate([col("a")], [F.sum(col("b"))])` |
-| `HAVING sum_b > 100` | `.filter(col("sum_b") > lit(100))` (after aggregate) |
+| `SUM(b) FILTER (WHERE b > 100)` | `F.sum(col("b"), filter=col("b") > lit(100))` |
 | `ORDER BY a DESC` | `df.sort(col("a").sort(ascending=False))` |
 | `LIMIT 10 OFFSET 5` | `df.limit(10, offset=5)` |
 | `DISTINCT` | `df.distinct()` |
@@ -396,12 +402,6 @@ col("array_col")[1:3] # array slice (0-indexed)
    frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. For a full
    partition frame, set `window_frame=WindowFrame("rows", None, None)`.
 
-6. **Aggregate then filter for HAVING**: There is no separate `.having()` method.
-   Use `.filter()` after `.aggregate()`:
-   ```python
-   df.aggregate([col("g")], [F.sum(col("v")).alias("s")]).filter(col("s") > lit(100))
-   ```
-
 ## Idiomatic Patterns
 
 ### Fluent Chaining
@@ -414,6 +414,7 @@ result = (
     .aggregate([col("region")], [F.sum(col("sales")).alias("total")])
     .sort(col("total").sort(ascending=False))
     .limit(10)
+    .collect()
 )
 ```
 
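A note on the `HAVING` hunks: in standard SQL, a per-aggregate `FILTER (WHERE ...)` clause and a `HAVING` clause generally give different results. `FILTER` restricts which input rows feed one aggregate and keeps every group. `HAVING` aggregates all rows and then drops whole groups, which is what the removed `.filter()` after `.aggregate()` pattern expressed. The sketch below illustrates the difference using Python's standard-library `sqlite3` rather than DataFusion. The table `t` and its rows are hypothetical sample data, and it assumes the bundled SQLite is version 3.30 or newer, since older versions do not support `FILTER` on aggregates.

```python
import sqlite3

# Hypothetical sample data: (region, sales).
ROWS = [
    ("east", 600),
    ("east", 700),
    ("north", 300),
    ("west", 1500),
    ("west", 200),
]


def run_queries():
    """Run a FILTER query and a HAVING query over the same rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (region TEXT, sales INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", ROWS)

    # FILTER: every group is kept; only rows with sales > 1000 feed the sum.
    # Groups with no qualifying rows get NULL (None in Python).
    filtered = conn.execute(
        "SELECT region, SUM(sales) FILTER (WHERE sales > 1000) AS large_sales "
        "FROM t GROUP BY region ORDER BY region"
    ).fetchall()

    # HAVING: all rows feed the sum; groups whose total is <= 1000 are dropped.
    having = conn.execute(
        "SELECT region, SUM(sales) AS total_sales "
        "FROM t GROUP BY region HAVING SUM(sales) > 1000 ORDER BY region"
    ).fetchall()

    conn.close()
    return filtered, having


filtered, having = run_queries()
print("FILTER:", filtered)
print("HAVING:", having)
```

With this data, the `FILTER` query returns all three regions, with `None` for `east` and `north`, while the `HAVING` query returns only `east` and `west` with their full totals. Readers who need group-level filtering may still want the post-aggregate `.filter()` form shown on the removed lines.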
