|
| 1 | +<!--- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + or more contributor license agreements. See the NOTICE file |
| 4 | + distributed with this work for additional information |
| 5 | + regarding copyright ownership. The ASF licenses this file |
| 6 | + to you under the Apache License, Version 2.0 (the |
| 7 | + "License"); you may not use this file except in compliance |
| 8 | + with the License. You may obtain a copy of the License at |
| 9 | + |
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | + |
| 12 | + Unless required by applicable law or agreed to in writing, |
| 13 | + software distributed under the License is distributed on an |
| 14 | + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + KIND, either express or implied. See the License for the |
| 16 | + specific language governing permissions and limitations |
| 17 | + under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +--- |
| 21 | +name: write-dataframe-code |
| 22 | +description: Contributor-facing guidance for writing idiomatic datafusion-python DataFrame code inside the repo — examples, docstrings, tests, and benchmark queries. Use when adding or reviewing Python code in this project that builds DataFrames or expressions. Composes on top of the user-facing guide at the repo-root SKILL.md. |
| 23 | +argument-hint: [area] (e.g., "tpch", "docstrings", "plan-comparison") |
| 24 | +--- |
| 25 | + |
| 26 | +# Writing DataFrame Code in datafusion-python |
| 27 | + |
| 28 | +This skill is for contributors working **on** the datafusion-python project |
| 29 | +(examples, tests, docstrings, benchmark queries). The primary reference for |
| 30 | +**how** to write DataFrame and expression code — imports, data loading, the |
| 31 | +DataFrame API, idiomatic patterns, common pitfalls, and the function |
| 32 | +catalog — is the repo-root [`SKILL.md`](../../SKILL.md). Read that first. |
| 33 | + |
| 34 | +This file layers on contributor-specific extras: |
| 35 | + |
| 36 | +1. The TPC-H pattern index — which example to use as a template for which API. |
| 37 | +2. The plan-comparison workflow — a diagnostic for checking a DataFrame |
| 38 | + translation against a reference SQL query. |
| 39 | +3. Docstring conventions enforced by this project (already summarized in |
| 40 | + `CLAUDE.md`; repeated here so the rule is on-hand while writing examples). |
| 41 | + |
| 42 | +## TPC-H pattern index |
| 43 | + |
| 44 | +`examples/tpch/q01..q22*.py` is the largest collection of idiomatic DataFrame |
| 45 | +code in the repo. Each query file pairs a DataFrame translation with the |
| 46 | +canonical TPC-H reference SQL embedded in the module docstring. When adding |
| 47 | +a new example or demo, pick the query that already exercises the pattern |
| 48 | +rather than re-deriving from scratch. |
| 49 | + |
| 50 | +| Pattern | Canonical TPC-H example | |
| 51 | +|---|---| |
| 52 | +| Simple filter + aggregate + sort | `q01_pricing_summary_report.py` | |
| 53 | +| Multi-table join with date-range filter | `q03_shipping_priority.py` | |
| 54 | +| `DISTINCT` via `.select(...).distinct()` | `q04_order_priority_checking.py` | |
| 55 | +| Multi-hop region/nation/customer join | `q05_local_supplier_volume.py` | |
| 56 | +| `F.in_list(col, [...])` in place of CASE/array tricks | `q07_volume_shipping.py`, `q12_ship_mode_order_priority.py` | |
| 57 | +| Searched `F.when(...).otherwise(...)` against SQL `CASE WHEN` | `q08_market_share.py` | |
| 58 | +| Reusing computed expressions as variables | `q09_product_type_profit_measure.py` | |
| 59 | +| Window function in place of correlated scalar subquery | `q02_minimum_cost_supplier.py`, `q11_important_stock_identification.py`, `q15_top_supplier.py`, `q17_small_quantity_order.py`, `q22_global_sales_opportunity.py` | |
| 60 | +| `F.regexp_like(col, pattern)` for matching | `q16_part_supplier_relationship.py` | |
| 61 | +| Compound disjunctive predicate (OR of per-brand conditions) | `q19_discounted_revenue.py` | |
| 62 | +| Semi/anti joins for `EXISTS` / `NOT EXISTS` | `q21_suppliers_kept_orders_waiting.py` | |
| 63 | +| `F.starts_with(...)` for prefix matching | `q20_potential_part_promotion.py` | |
| 64 | + |
| 65 | +The queries are correctness-gated against `examples/tpch/answers_sf1/` via |
| 66 | +`examples/tpch/_tests.py` at scale factor 1. |
| 67 | + |
| 68 | +## Plan-comparison diagnostic workflow |
| 69 | + |
| 70 | +When translating a SQL query to DataFrame form — TPC-H, a benchmark, or an |
| 71 | +answer to a user question — the answer-file comparison proves *correctness* |
| 72 | +but does not prove the translation is *equivalent at the plan level*. The |
| 73 | +optimizer usually smooths over surface differences (filter pushdown, join |
| 74 | +reordering, predicate simplification), so two surface-different builders that |
| 75 | +resolve to the same optimized plan are effectively identical queries. |
| 76 | + |
| 77 | +Use this ad-hoc diagnostic when you suspect a DataFrame translation is doing |
| 78 | +more work than the SQL form: |
| 79 | + |
| 80 | +```python |
| 81 | +from datafusion import SessionContext |
| 82 | + |
| 83 | +ctx = SessionContext() |
| 84 | +# register the tables the SQL query expects |
| 85 | +# ... |
| 86 | + |
| 87 | +sql_plan = ctx.sql(reference_sql).optimized_logical_plan() |
| 88 | +df_plan = dataframe_under_test.optimized_logical_plan() |
| 89 | + |
| 90 | +if sql_plan == df_plan: |
| 91 | + print("Plans match exactly.") |
| 92 | +else: |
| 93 | + print("=== SQL plan ===") |
| 94 | + print(sql_plan.display_indent()) |
| 95 | + print("=== DataFrame plan ===") |
| 96 | + print(df_plan.display_indent()) |
| 97 | +``` |
| 98 | + |
| 99 | +- `LogicalPlan.__eq__` compares structurally. |
| 100 | +- `LogicalPlan.display_indent()` is the readable form for eyeballing diffs. |
| 101 | +- `DataFrame.optimized_logical_plan()` is the optimizer output — use it, not |
| 102 | + the unoptimized plan, because trivial differences (e.g. column order in a |
| 103 | + projection) will otherwise be reported as mismatches. |
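
When the two plans differ, a line-level diff of the `display_indent()` text can be easier to scan than two full printouts. The sketch below uses only the standard library's `difflib` and assumes `display_indent()` returns a plain string, as the printout above implies. The helper name `diff_plan_text` and both sample plan strings are hypothetical, made up for illustration; they are not real optimizer output.

```python
import difflib


def diff_plan_text(sql_text: str, df_text: str) -> str:
    """Return a unified diff of two indented plan strings ('' if identical)."""
    diff = difflib.unified_diff(
        sql_text.splitlines(),
        df_text.splitlines(),
        fromfile="sql_plan",
        tofile="df_plan",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical plan text for illustration only. In practice, pass
# sql_plan.display_indent() and df_plan.display_indent() instead.
sql_text = (
    "Sort: l_returnflag ASC\n"
    "  Aggregate: groupBy=[[l_returnflag]]\n"
    "    TableScan: lineitem"
)
df_text = (
    "Sort: l_returnflag ASC\n"
    "  Aggregate: groupBy=[[l_returnflag]]\n"
    "    Filter: l_quantity > 0\n"
    "      TableScan: lineitem"
)

plan_diff = diff_plan_text(sql_text, df_text)
print(plan_diff or "Plan text is identical.")
```

Lines prefixed with `+` appear only in the DataFrame plan and lines prefixed with `-` only in the SQL plan, which points directly at the extra or missing operator.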
| 104 | + |
| 105 | +This is **a diagnostic, not a gate**. Answer-file comparison is the |
| 106 | +correctness gate. A plan-level mismatch does not mean the DataFrame form is |
| 107 | +wrong — it means the two forms are not literally the same plan, which is |
| 108 | +sometimes fine (e.g. the DataFrame form forces a particular partitioning the |
| 109 | +SQL form leaves to the optimizer). |
| 110 | + |
| 111 | +## Docstring conventions |
| 112 | + |
| 113 | +Every Python function added or modified in this project must include a |
| 114 | +docstring with at least one doctest-verified example. Pre-commit and the |
| 115 | +`pytest --doctest-modules` default in `pyproject.toml` enforce that the |
| 116 | +examples actually execute. |
| 117 | + |
| 118 | +Rules (also in `CLAUDE.md`): |
| 119 | + |
| 120 | +- Examples must run under the doctest harness. `conftest.py` injects |
| 121 | + `dfn` (the `datafusion` module), `col`, `lit`, `F` (functions), `pa` |
| 122 | + (pyarrow), and `np` (numpy) so you do not need to import them inside |
| 123 | + examples. |
| 124 | +- Optional parameters: write a second example that passes the optional |
| 125 | + argument **by keyword** (`step=dfn.lit(3)`) so the reader sees which |
| 126 | + parameter is being demonstrated. |
| 127 | +- Reuse input data across examples for the same function so the effect of |
| 128 | + each optional argument is visible against a constant baseline. |
| 129 | +- Alias functions (functions that simply forward to another, for example |
| 130 | + `list_sort` forwarding to `array_sort`) only need a one-line description |
| 131 | + and a `See Also` reference to the primary function. They do not need their |
| 132 | + own example. |
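
The rules above can be sketched with a plain-Python stand-in. `take_every` and `pick_every` are hypothetical functions invented for this illustration (they are not part of datafusion-python), and the `Examples:` / `See Also:` section layout is an assumption about docstring style. The point is the shape: a baseline example, a second example that passes the optional argument by keyword against the same input, and an alias that carries only a one-line description and a reference.

```python
import doctest


def take_every(values, step=1):
    """Return every ``step``-th item of ``values``, starting with the first.

    Examples:
        >>> data = [10, 20, 30, 40, 50]
        >>> take_every(data)
        [10, 20, 30, 40, 50]

        The optional ``step`` is passed by keyword, reusing the same input:

        >>> take_every(data, step=2)
        [10, 30, 50]
    """
    return list(values[::step])


def pick_every(values, step=1):
    """Alias for :func:`take_every`.

    See Also:
        take_every
    """
    return take_every(values, step=step)


# Execute the docstring examples directly, roughly what a doctest run checks.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    take_every.__doc__, {"take_every": take_every}, "take_every", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

In the real project the injected names (`dfn`, `col`, `lit`, `F`, `pa`, `np`) would replace the explicit globals passed to the parser here.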
| 133 | + |
| 134 | +## Aggregate and window function documentation |
| 135 | + |
| 136 | +When adding or updating an aggregate or window function, update the matching |
| 137 | +site page: |
| 138 | + |
| 139 | +- Aggregate functions → `docs/source/user-guide/common-operations/aggregations.rst` |
| 140 | +- Window functions → `docs/source/user-guide/common-operations/windows.rst` |
| 141 | + |
| 142 | +Add the function to the function list at the bottom of the page and, if the |
| 143 | +function exposes a non-obvious option, add a short usage example. |
| 144 | + |
| 145 | +## Related |
| 146 | + |
| 147 | +- Repo-root [`SKILL.md`](../../SKILL.md) — primary DataFrame API guide |
| 148 | + (users and agents). |
| 149 | +- `.ai/skills/check-upstream/` — audit upstream Apache DataFusion features |
| 150 | + and flag what the Python bindings do not yet expose. |
| 151 | +- `.ai/skills/audit-skill-md/` — audit the repo-root `SKILL.md` against the |
| 152 | + current public Python API and flag drift. |