|
| 1 | +<!--- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + or more contributor license agreements. See the NOTICE file |
| 4 | + distributed with this work for additional information |
| 5 | + regarding copyright ownership. The ASF licenses this file |
| 6 | + to you under the Apache License, Version 2.0 (the |
| 7 | + "License"); you may not use this file except in compliance |
| 8 | + with the License. You may obtain a copy of the License at |
| 9 | + |
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | + |
| 12 | + Unless required by applicable law or agreed to in writing, |
| 13 | + software distributed under the License is distributed on an |
| 14 | + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + KIND, either express or implied. See the License for the |
| 16 | + specific language governing permissions and limitations |
| 17 | + under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +--- |
| 21 | +name: audit-skill-md |
| 22 | +description: Cross-reference the repo-root SKILL.md against the current public Python API (DataFrame, Expr, SessionContext, functions module) and report new APIs that need coverage and stale mentions that no longer exist. Use after upstream syncs or any PR that changes the public Python surface. |
| 23 | +argument-hint: '[area] (e.g., "functions", "dataframe", "expr", "context", "all")' |
| 24 | +--- |
| 25 | + |
| 26 | +# Audit SKILL.md Against the Python Public API |
| 27 | + |
| 28 | +This skill keeps the repo-root `SKILL.md` (the agent-facing DataFrame API |
| 29 | +guide) aligned with the actual Python surface exposed by the package. It is |
| 30 | +a **diff-only audit** — it does not auto-edit `SKILL.md`. The output is a |
| 31 | +report the user reviews and then asks the agent to act on. |
| 32 | + |
| 33 | +Run this whenever the public Python API changes — most commonly: |
| 34 | + |
| 35 | +- after an upstream DataFusion sync PR adds new functions or methods, |
| 36 | +- after a PR that adds or removes a `DataFrame`, `Expr`, or `SessionContext` |
| 37 | + method, |
| 38 | +- as a pre-release gate before cutting a new datafusion-python version. |
| 39 | + |
| 40 | +The companion skill [`check-upstream`](../check-upstream/SKILL.md) reports |
| 41 | +upstream APIs that are **not yet** exposed in the Python bindings. This skill |
| 42 | +reports APIs that **are** exposed but missing from the agent-facing guide, |
| 43 | +plus guide mentions that no longer match the public API. |
| 44 | + |
| 45 | +## Areas to Check |
| 46 | + |
| 47 | +`$ARGUMENTS` selects a subset. If empty or `all`, audit every area. |
| 48 | + |
| 49 | +### 1. Scalar / aggregate / window functions |
| 50 | + |
| 51 | +**Source of truth:** `python/datafusion/functions.py` — the `__all__` list. |
| 52 | +Only symbols in `__all__` are part of the public surface; helpers not listed |
| 53 | +there are implementation details. |
| 54 | + |
| 55 | +**Procedure:** |
| 56 | + |
| 57 | +1. Load `python/datafusion/functions.py` and extract the `__all__` list. |
| 58 | +2. Parse `SKILL.md` and collect every function reference. Patterns to look for: |
| 59 | + - Inline `F.<name>(...)`, `F.<name>` references. |
| 60 | + - Bare backticked names in the "Available Functions (Categorized)" |
| 61 | + section (`sum`, `avg`, ...). |
| 62 | +3. Cross-reference: |
| 63 | + - **In `__all__` but not mentioned in `SKILL.md`** → new API needing |
| 64 | + coverage. Flag unless it is an alias documented through a `See Also` |
| 65 | + in the primary function's docstring (see "Alias handling" below). |
| 66 | + - **Mentioned in `SKILL.md` but not in `__all__`** → stale reference; the |
| 67 | + function has likely been renamed or removed. |
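
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual implementation: it reads `__all__` with `ast` instead of importing the module, and it only recognizes the inline `F.<name>` pattern, so the bare backticked names in the categorized section would need separate handling. The helper names (`extract_all`, `function_mentions`, `cross_reference`) and the sample strings are hypothetical.

```python
import ast
import re


def extract_all(source: str) -> set[str]:
    """Return the names in a module-level ``__all__ = [...]`` assignment."""
    tree = ast.parse(source)
    for node in tree.body:
        # Only plain assignments are handled; an annotated or augmented
        # ``__all__`` would need extra cases.
        if isinstance(node, ast.Assign) and any(
            isinstance(t, ast.Name) and t.id == "__all__" for t in node.targets
        ):
            # literal_eval accepts only literals, so no module code runs.
            return set(ast.literal_eval(node.value))
    return set()


def function_mentions(markdown: str) -> set[str]:
    """Collect names referenced as ``F.<name>`` in the guide text."""
    return set(re.findall(r"\bF\.([A-Za-z_][A-Za-z0-9_]*)", markdown))


def cross_reference(
    exported: set[str], mentioned: set[str]
) -> tuple[list[str], list[str]]:
    """Return (exported but unmentioned, mentioned but not exported)."""
    return sorted(exported - mentioned), sorted(mentioned - exported)


# Hypothetical inputs standing in for functions.py and SKILL.md.
source = '__all__ = ["sum", "avg", "new_fn"]\n'
markdown = "Use `F.sum(col)` and `F.avg`, or the older `F.old_fn(x)`."

missing, stale = cross_reference(extract_all(source), function_mentions(markdown))
print("needs coverage:", missing)
print("stale:", stale)
```

In a real run, `source` would hold the text of `python/datafusion/functions.py` and `markdown` the text of the repo-root `SKILL.md`. Names in the first list still have to pass the alias check before they are flagged.
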
| 68 | + |
| 69 | +### 2. `DataFrame` methods |
| 70 | + |
| 71 | +**Source of truth:** `python/datafusion/dataframe.py` — public methods on the |
| 72 | +`DataFrame` class. A method is public if its name does not begin with an |
| 73 | +underscore. |
| 74 | + |
| 75 | +**Procedure:** |
| 76 | + |
| 77 | +1. Import `DataFrame` and collect `dir(DataFrame)`, filtering to names that |
| 78 | + do not start with `_`. |
| 79 | +2. Parse `SKILL.md` for method references — patterns: |
| 80 | + - `df.<name>(`, `.<name>(`, and backticked bare names in prose. |
| 81 | + - The method tables in "Core Abstractions" and the pitfalls/idiomatic |
| 82 | + patterns sections. |
| 83 | +3. Flag: |
| 84 | + - **Public method, no mention in `SKILL.md`** → candidate addition. |
| 85 | + Weight the flag by whether the method would change how a user writes a |
| 86 | + query (e.g. `with_column`, `join`, `aggregate` are high-value; a new |
| 87 | + `explain_analyze_format` is low-value). |
| 88 | + - **Mentioned in `SKILL.md`, no longer a public method** → stale. |
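
A rough sketch of the `dir()`-based comparison described above. `FakeFrame` is a hypothetical stand-in so the example runs without the package installed; in the repository the class would be the real `DataFrame`. The helper names are also made up for illustration, and only the `.<name>(` call pattern is scanned here.

```python
import re


def public_members(cls: type) -> set[str]:
    """Names reported by dir(cls) that do not start with an underscore."""
    return {name for name in dir(cls) if not name.startswith("_")}


def method_mentions(markdown: str) -> set[str]:
    """Names that appear as ``.<name>(`` call sites in the guide text."""
    return set(re.findall(r"\.([A-Za-z_][A-Za-z0-9_]*)\(", markdown))


class FakeFrame:
    """Hypothetical stand-in for the real DataFrame class."""

    def select(self): ...
    def join(self): ...
    def _helper(self): ...


# Hypothetical guide text standing in for SKILL.md.
guide = "Call `df.select(...)` first, then `df.show_limit(5)`."

members = public_members(FakeFrame)
mentions = method_mentions(guide)
print("candidate additions:", sorted(members - mentions))
print("possibly stale:", sorted(mentions - members))
```

Because `.<name>(` also matches calls on other objects (for example a `SessionContext` call in the same guide), the "possibly stale" list is a starting point for manual review, not a verdict.
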
| 89 | + |
| 90 | +### 3. `Expr` methods and attributes |
| 91 | + |
| 92 | +**Source of truth:** `python/datafusion/expr.py` — the `Expr` class. Also |
| 93 | +include `Window`, `WindowFrame`, and `GroupingSet` if they are re-exported |
| 94 | +from `datafusion.expr`. |
| 95 | + |
| 96 | +**Procedure:** same as for `DataFrame`. Pay particular attention to operator |
| 97 | +dunder methods mentioned in `SKILL.md` — the "Common Pitfalls" section |
| 98 | +already covers `&`, `|`, `~`, `==`, the comparison operators, and arithmetic |
| 99 | +operators on `Expr`. If a new operator is added (e.g. a new `__matmul__`), |
| 100 | +it probably warrants a pitfall or pattern note. |
| 101 | + |
| 102 | +### 4. `SessionContext` methods |
| 103 | + |
| 104 | +**Source of truth:** `python/datafusion/context.py` — the `SessionContext` |
| 105 | +class. |
| 106 | + |
| 107 | +**Procedure:** same as for `DataFrame`. The high-value methods in `SKILL.md` |
| 108 | +are the data-loading methods (`read_parquet`, `read_csv`, `read_json`, |
| 109 | +`from_pydict`, `from_pylist`, `from_pandas`) and the SQL entry points |
| 110 | +(`sql`, `register_*`, `table`). New additions in those families are |
| 111 | +worth flagging for a sentence in the data-loading section. |
| 112 | + |
| 113 | +### 5. Re-exports at package root |
| 114 | + |
| 115 | +**Source of truth:** `python/datafusion/__init__.py` — the top-level |
| 116 | +`from ... import ...` statements and `__all__`. A symbol re-exported at the |
| 117 | +package root is part of the "import" examples in `SKILL.md` even if it |
| 118 | +lives in a submodule. |
| 119 | + |
| 120 | +**Procedure:** verify that every name in the top-level `__all__` resolves to |
| 121 | +an attribute of the imported package. Flag any new re-export that is not |
| 122 | +already mentioned in the "Import Conventions" or "Core Abstractions" section. |
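
One way to check that every exported name resolves is to compare `__all__` against the module's attributes. The sketch below uses the standard-library `json` module as a stand-in so it runs anywhere; the function name `unresolved_exports` is hypothetical, and for the real audit the module would presumably be `datafusion`.

```python
import importlib
import types


def unresolved_exports(module: types.ModuleType) -> list[str]:
    """Return names listed in ``module.__all__`` that are not attributes of it."""
    exported = getattr(module, "__all__", [])
    return sorted(name for name in exported if not hasattr(module, name))


# Stand-in module; swap in the package under audit when running for real.
json_module = importlib.import_module("json")
print(unresolved_exports(json_module))
```

An empty list means every name in `__all__` resolves. Any name that is returned is either a typo in `__all__` or an import that was removed without updating the list.
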
| 123 | + |
| 124 | +## Alias handling |
| 125 | + |
| 126 | +Many functions in the `functions` module are aliases — for example |
| 127 | +`list_sort` aliases `array_sort`, and `character_length` aliases `length`. |
| 128 | +The convention in this project is that alias function docstrings carry only |
| 129 | +a one-line description and a `See Also` pointing at the primary function |
| 130 | +(see `CLAUDE.md`). Do not flag an alias as missing from `SKILL.md` as long |
| 131 | +as its primary function is already covered, unless the alias uses a name |
| 132 | +that a user would reasonably reach for first (e.g. SQL-standard names). |
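
The alias rule could be approximated by reading the `See Also` pointer out of a docstring. This sketch assumes the primary function appears as the first backticked bare name after the words "See Also"; the exact docstring layout used in the project is not shown here, so treat the pattern, the helper name `alias_target`, and the two sample functions as assumptions.

```python
import inspect
import re
from typing import Callable, Optional


def alias_target(func: Callable) -> Optional[str]:
    """Return the first backticked name after "See Also" in the docstring."""
    doc = inspect.getdoc(func) or ""
    _, marker, tail = doc.partition("See Also")
    if not marker:
        return None  # no pointer, so treat the function as a primary
    match = re.search(r"`([A-Za-z_][A-Za-z0-9_]*)`", tail)
    return match.group(1) if match else None


# Hypothetical functions mimicking the alias docstring convention.
def array_sort(values):
    """Sort the elements of an array."""


def list_sort(values):
    """Sort the elements of a list.

    See Also:
        This is an alias for `array_sort`.
    """


covered = {"array_sort"}  # names already mentioned in the guide
target = alias_target(list_sort)
print(target, target in covered)
```

An alias whose target is in `covered` would go under "Informational" in the report. The exception for names a user would reach for first (such as SQL-standard names) still needs human judgment.
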
| 133 | + |
| 134 | +## Output Format |
| 135 | + |
| 136 | +Produce a report of this shape: |
| 137 | + |
| 138 | +``` |
| 139 | +## SKILL.md Audit Report |
| 140 | + |
| 141 | +### Summary |
| 142 | +- Functions checked: N |
| 143 | +- DataFrame methods checked: N |
| 144 | +- Expr members checked: N |
| 145 | +- SessionContext methods checked: N |
| 146 | +- Package-root re-exports checked: N |
| 147 | + |
| 148 | +### New APIs needing coverage in SKILL.md |
| 149 | +- `functions.new_fn` — brief description. Suggested section: "String". |
| 150 | +- `DataFrame.with_catalog` — brief description. Suggested section: "Core Abstractions". |
| 151 | + |
| 152 | +### Stale mentions in SKILL.md |
| 153 | +- `functions.old_fn` — referenced in "Available Functions" but no longer in `__all__`. Likely renamed to `new_fn` in <upstream PR/commit>. |
| 154 | +- `DataFrame.show_limit` — referenced in a pitfall; method removed in favor of `DataFrame.show(num=...)`. |
| 155 | + |
| 156 | +### Informational |
| 157 | +- Alias `list_sort` covered transitively via `array_sort` — no action needed. |
| 158 | +``` |
| 159 | + |
| 160 | +If every area is clean, state that explicitly ("All audited areas are in |
| 161 | +sync. No action required."). An audit report that elides the summary line |
| 162 | +is harder to scan in a release checklist. |
| 163 | + |
| 164 | +## When to edit SKILL.md |
| 165 | + |
| 166 | +This skill does not auto-edit. After reporting, wait for the user to |
| 167 | +confirm which gaps are worth filling. New APIs often need a natural home |
| 168 | +chosen by a human — the categorized function list and the pitfalls section |
| 169 | +both have opinionated structure that an automated edit will not respect. |
| 170 | + |
| 171 | +## Related |
| 172 | + |
| 173 | +- Repo-root [`SKILL.md`](../../SKILL.md) — the file this skill audits. |
| 174 | +- `.ai/skills/check-upstream/` — the complementary audit against upstream |
| 175 | + Rust APIs not yet exposed in Python. |
| 176 | +- `.ai/skills/write-dataframe-code/` — how to write idiomatic DataFrame |
| 177 | + code in this repo. |