Commit 23b3be7 — timsaucer and claude committed

docs: add write-dataframe-code contributor skill

Adds `.ai/skills/write-dataframe-code/SKILL.md`, a contributor-facing skill
for agents working on this repo. It layers on top of the user-facing
repo-root SKILL.md with:

- a TPC-H pattern index mapping idiomatic API usages to the query file that
  demonstrates them,
- an ad-hoc plan-comparison workflow for checking DataFrame translations
  against a reference SQL query via `optimized_logical_plan()`, and
- the project-specific docstring and aggregate/window documentation
  conventions that CLAUDE.md already enforces for contributors.

Implements PR 4c of the plan in #1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent c7cdc63 · 1 file changed: 152 additions, 0 deletions

<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
---
name: write-dataframe-code
description: Contributor-facing guidance for writing idiomatic datafusion-python DataFrame code inside the repo — examples, docstrings, tests, and benchmark queries. Use when adding or reviewing Python code in this project that builds DataFrames or expressions. Composes on top of the user-facing guide at the repo-root SKILL.md.
argument-hint: [area] (e.g., "tpch", "docstrings", "plan-comparison")
---

# Writing DataFrame Code in datafusion-python

This skill is for contributors working **on** the datafusion-python project
(examples, tests, docstrings, benchmark queries). The primary reference for
**how** to write DataFrame and expression code — imports, data loading, the
DataFrame API, idiomatic patterns, common pitfalls, and the function
catalog — is the repo-root [`SKILL.md`](../../SKILL.md). Read that first.

This file layers on contributor-specific extras:

1. The TPC-H pattern index — which example to use as a template for which API.
2. The plan-comparison workflow — a diagnostic for checking a DataFrame
   translation against a reference SQL query.
3. Docstring conventions enforced by this project (already summarized in
   `CLAUDE.md`; repeated here so the rule is on-hand while writing examples).

## TPC-H pattern index

`examples/tpch/q01..q22*.py` is the largest collection of idiomatic DataFrame
code in the repo. Each query file pairs a DataFrame translation with the
canonical TPC-H reference SQL embedded in the module docstring. When adding
a new example or demo, pick the query that already exercises the pattern
rather than re-deriving it from scratch.

| Pattern | Canonical TPC-H example |
|---|---|
| Simple filter + aggregate + sort | `q01_pricing_summary_report.py` |
| Multi-table join with date-range filter | `q03_shipping_priority.py` |
| `DISTINCT` via `.select(...).distinct()` | `q04_order_priority_checking.py` |
| Multi-hop region/nation/customer join | `q05_local_supplier_volume.py` |
| `F.in_list(col, [...])` in place of CASE/array tricks | `q07_volume_shipping.py`, `q12_ship_mode_order_priority.py` |
| Searched `F.when(...).otherwise(...)` against SQL `CASE WHEN` | `q08_market_share.py` |
| Reusing computed expressions as variables | `q09_product_type_profit_measure.py` |
| Window function in place of correlated scalar subquery | `q02_minimum_cost_supplier.py`, `q11_important_stock_identification.py`, `q15_top_supplier.py`, `q17_small_quantity_order.py`, `q22_global_sales_opportunity.py` |
| `F.regexp_like(col, pattern)` for matching | `q16_part_supplier_relationship.py` |
| Compound disjunctive predicate (OR of per-brand conditions) | `q19_discounted_revenue.py` |
| Semi/anti joins for `EXISTS` / `NOT EXISTS` | `q21_suppliers_kept_orders_waiting.py` |
| `F.starts_with(...)` for prefix matching | `q20_potential_part_promotion.py` |

The queries are correctness-gated against `examples/tpch/answers_sf1/` via
`examples/tpch/_tests.py` at scale factor 1.

## Plan-comparison diagnostic workflow

When translating a SQL query to DataFrame form — TPC-H, a benchmark, or an
answer to a user question — the answer-file comparison proves *correctness*
but does not prove the translation is *equivalent at the plan level*. The
optimizer usually smooths over surface differences (filter pushdown, join
reordering, predicate simplification), so two surface-different builders that
resolve to the same optimized plan are effectively identical queries.

Use this ad-hoc diagnostic when you suspect a DataFrame translation is doing
more work than the SQL form:

```python
from datafusion import SessionContext

ctx = SessionContext()
# register the tables the SQL query expects
# ...

sql_plan = ctx.sql(reference_sql).optimized_logical_plan()
df_plan = dataframe_under_test.optimized_logical_plan()

if sql_plan == df_plan:
    print("Plans match exactly.")
else:
    print("=== SQL plan ===")
    print(sql_plan.display_indent())
    print("=== DataFrame plan ===")
    print(df_plan.display_indent())
```

- `LogicalPlan.__eq__` compares structurally.
- `LogicalPlan.display_indent()` is the readable form for eyeballing diffs.
- `DataFrame.optimized_logical_plan()` is the optimizer output — use it, not
  the unoptimized plan, because trivial differences (e.g. column order in a
  projection) will otherwise be reported as mismatches.

This is **a diagnostic, not a gate**. Answer-file comparison is the
correctness gate. A plan-level mismatch does not mean the DataFrame form is
wrong — it means the two forms are not literally the same plan, which is
sometimes fine (e.g. the DataFrame form forces a particular partitioning the
SQL form leaves to the optimizer).

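When the plans do differ, a unified diff of the two `display_indent()`
strings is often easier to scan than reading them side by side. A
stdlib-only sketch — the two plan texts below are illustrative stand-ins,
not real DataFusion output; in practice they come from the diagnostic's
`sql_plan.display_indent()` and `df_plan.display_indent()` calls:

```python
import difflib

# Illustrative stand-ins for the display_indent() strings of the two plans.
sql_text = """\
Projection: lineitem.l_returnflag
  Filter: lineitem.l_shipdate <= Date32("1998-09-02")
    TableScan: lineitem"""
df_text = """\
Projection: lineitem.l_returnflag
  TableScan: lineitem, partial_filters=[l_shipdate <= Date32("1998-09-02")]"""

# A unified diff highlights exactly which plan nodes moved or disappeared.
diff = "\n".join(
    difflib.unified_diff(
        sql_text.splitlines(),
        df_text.splitlines(),
        fromfile="sql_plan",
        tofile="df_plan",
        lineterm="",
    )
)
print(diff)
```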
## Docstring conventions

Every Python function added or modified in this project must include a
docstring with at least one doctest-verified example. Pre-commit and the
`pytest --doctest-modules` default in `pyproject.toml` will enforce that
examples actually execute.

Rules (also in `CLAUDE.md`):

- Examples must run under the doctest harness. The `conftest.py` injects
  `dfn` (the `datafusion` module), `col`, `lit`, `F` (functions), `pa`
  (pyarrow), and `np` (numpy) so you do not need to import them inside
  examples.
- Optional parameters: write a second example that passes the optional
  argument **by keyword** (`step=dfn.lit(3)`) so the reader sees which
  parameter is being demonstrated.
- Reuse input data across examples for the same function so the effect of
  each optional argument is visible against a constant baseline.
- Alias functions (one function that just wraps another — for example
  `list_sort` forwarding to `array_sort`) only need a one-line description
  and a `See Also` reference to the primary function. They do not need their
  own example.

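The optional-parameter and data-reuse rules are easiest to see in miniature.
A library-free sketch whose docstring follows the conventions above — the
`clamp` helper is hypothetical and not part of this project, but the shape
(one default example, then the same input with the optional argument passed
by keyword) is the one the rules describe:

```python
def clamp(values, lower=0, upper=None):
    """Clamp each value in ``values`` into the range [lower, upper].

    Example::

        >>> clamp([-2, 5, 9])
        [0, 5, 9]

    The same input data reused, with the optional bound passed **by
    keyword** so the reader sees which parameter is demonstrated::

        >>> clamp([-2, 5, 9], upper=6)
        [0, 5, 6]
    """
    return [max(lower, v if upper is None else min(v, upper)) for v in values]
```

Because both examples share one input list, the reader can attribute every
change in the output to the single keyword argument that was added.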
## Aggregate and window function documentation

When adding or updating an aggregate or window function, update the matching
documentation page:

- Aggregate functions → `docs/source/user-guide/common-operations/aggregations.rst`
- Window functions → `docs/source/user-guide/common-operations/windows.rst`

Add the function to the function list at the bottom of the page and, if the
function exposes a non-obvious option, add a short usage example.

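Sketched in generic reStructuredText, such an addition might look like the
fragment below. The section title, list style, and function name
(`regr_slope`) are illustrative assumptions — check the actual layout of the
target `.rst` page rather than copying this verbatim:

```rst
.. (hypothetical fragment; match the real page's existing list style)

- ``regr_slope`` — slope of the linear regression of one column against
  another.

.. code-block:: python

    df.aggregate([], [F.regr_slope(col("y"), col("x"))])
```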
## Related

- Repo-root [`SKILL.md`](../../SKILL.md) — primary DataFrame API guide
  (users and agents).
- `.ai/skills/check-upstream/` — audit upstream Apache DataFusion features
  and flag what the Python bindings do not yet expose.
- `.ai/skills/audit-skill-md/` — audit the repo-root `SKILL.md` against the
  current public Python API and flag drift.