Commit 35b7893

timsaucer and claude committed
docs: add audit-skill-md skill
Adds `.ai/skills/audit-skill-md/SKILL.md`, a contributor skill that cross-references the repo-root `SKILL.md` against the current public Python API (functions module, DataFrame, Expr, SessionContext, and package-root re-exports). Reports two classes of drift:

- new APIs exposed by the Python surface that are not yet covered in the user-facing guide, and
- stale mentions in the guide that no longer exist in the public API.

The skill is diff-only: it produces a report the user reviews before any edit to `SKILL.md`. Complements `check-upstream/`, which audits in the opposite direction (upstream Rust features not yet exposed).

Implements PR 4d of the plan in #1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 23b3be7 commit 35b7893

1 file changed

Lines changed: 177 additions & 0 deletions

.ai/skills/audit-skill-md/SKILL.md

@@ -0,0 +1,177 @@
1+
<!---
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
---
21+
name: audit-skill-md
22+
description: Cross-reference the repo-root SKILL.md against the current public Python API (DataFrame, Expr, SessionContext, functions module) and report new APIs that need coverage and stale mentions that no longer exist. Use after upstream syncs or any PR that changes the public Python surface.
23+
argument-hint: [area] (e.g., "functions", "dataframe", "expr", "context", "all")
24+
---
25+
26+
# Audit SKILL.md Against the Python Public API
27+
28+
This skill keeps the repo-root `SKILL.md` (the agent-facing DataFrame API
29+
guide) aligned with the actual Python surface exposed by the package. It is
30+
a **diff-only audit** — it does not auto-edit `SKILL.md`. The output is a
31+
report the user reviews and then asks the agent to act on.
32+
33+
Run this whenever the public Python API changes — most commonly:
34+
35+
- after an upstream DataFusion sync PR adds new functions or methods,
36+
- after a PR that adds or removes a `DataFrame`, `Expr`, or `SessionContext`
37+
method,
38+
- as a pre-release gate before cutting a new datafusion-python version.
39+
40+
The companion skill [`check-upstream`](../check-upstream/SKILL.md) reports
41+
upstream APIs that are **not yet** exposed in the Python bindings. This skill
42+
reports APIs that **are** exposed but are missing or misspelled in the
43+
user-facing guide.
44+
45+
## Areas to Check
46+
47+
`$ARGUMENTS` selects a subset. If empty or `all`, audit every area.
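
One way to picture the selection rule is the minimal sketch below. The area keys follow the argument hint; the key `reexports` for area 5 and the function name `resolve_areas` are assumptions made for illustration, not names defined by this skill.

```python
# Area keys; "reexports" is a hypothetical key for area 5.
AREAS = ("functions", "dataframe", "expr", "context", "reexports")


def resolve_areas(arguments: str) -> tuple[str, ...]:
    """Map the raw argument string to the areas to audit."""
    requested = arguments.lower().split()
    # Empty input or an explicit "all" means every area.
    if not requested or "all" in requested:
        return AREAS
    unknown = [a for a in requested if a not in AREAS]
    if unknown:
        raise ValueError(f"unknown area(s): {', '.join(unknown)}")
    # Keep the canonical order regardless of how the areas were typed.
    return tuple(a for a in AREAS if a in requested)
```

Rejecting unknown areas instead of silently ignoring them is a design choice in this sketch; it keeps a typo from producing a report that looks clean because nothing was audited.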
48+
49+
### 1. Scalar / aggregate / window functions
50+
51+
**Source of truth:** `python/datafusion/functions.py` — the `__all__` list.
52+
Only symbols in `__all__` are part of the public surface; helpers not listed
53+
there are implementation details.
54+
55+
**Procedure:**
56+
57+
1. Load `python/datafusion/functions.py` and extract the `__all__` list.
58+
2. Parse `SKILL.md` and collect every function reference. Patterns to look for:
59+
- Inline `F.<name>(...)`, `F.<name>` references.
60+
- Bare backticked names in the "Available Functions (Categorized)"
61+
section (`sum`, `avg`, ...).
62+
3. Cross-reference:
63+
- **In `__all__` but not mentioned in `SKILL.md`** → new API needing
64+
coverage. Flag unless it is an alias documented through a `See Also`
65+
in the primary function's docstring (see "Alias handling" below).
66+
- **Mentioned in `SKILL.md` but not in `__all__`** → stale reference; the
67+
function has likely been renamed or removed.
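
The cross-reference above can be sketched with the standard library alone. This is a minimal illustration, not the skill's implementation: it assumes `__all__` is a plain list literal, it only matches the `F.<name>` pattern (not the bare backticked names), and the two input strings are toy stand-ins for `functions.py` and `SKILL.md`.

```python
import ast
import re


def extract_all(source: str) -> set[str]:
    """Return the names in a module-level ``__all__ = [...]`` assignment."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign) and any(
            isinstance(t, ast.Name) and t.id == "__all__" for t in node.targets
        ):
            # literal_eval accepts an AST node and never executes code.
            return set(ast.literal_eval(node.value))
    return set()


def function_mentions(guide: str) -> set[str]:
    """Collect ``F.<name>`` references from the guide text."""
    return set(re.findall(r"\bF\.([A-Za-z_]\w*)", guide))


# Toy inputs standing in for functions.py and SKILL.md.
module_source = '__all__ = ["sum", "avg", "array_sort"]\n'
guide_text = "Use `F.sum(col)` or `F.avg(col)`; `F.median(col)` also works."

public = extract_all(module_source)
mentioned = function_mentions(guide_text)
missing_from_guide = sorted(public - mentioned)  # new APIs needing coverage
stale_in_guide = sorted(mentioned - public)  # renamed or removed
```

Parsing with `ast` rather than importing the module means the audit does not need a built copy of the package, at the cost of missing an `__all__` that is assembled dynamically.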
68+
69+
### 2. `DataFrame` methods
70+
71+
**Source of truth:** `python/datafusion/dataframe.py` — public methods on the
72+
`DataFrame` class. A method is public if its name does not begin with an
73+
underscore.
74+
75+
**Procedure:**
76+
77+
1. Import `DataFrame` and collect `dir(DataFrame)`, filtering to names that
78+
do not start with `_`.
79+
2. Parse `SKILL.md` for method references — patterns:
80+
- `df.<name>(`, `.<name>(`, and backticked bare names in prose.
81+
- The method tables in "Core Abstractions" and the pitfalls/idiomatic
82+
patterns sections.
83+
3. Flag:
84+
- **Public method, no mention in `SKILL.md`** → candidate addition.
85+
Weight the flag by whether the method would change how a user writes a
86+
query (e.g. `with_column`, `join`, `aggregate` are high-value; a new
87+
`explain_analyze_format` is low-value).
88+
- **Mentioned in `SKILL.md`, no longer a public method** → stale.
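
A minimal sketch of this procedure follows. `FakeDataFrame` is a hypothetical stand-in so the snippet runs without the package installed; in a real audit the imported `DataFrame` class would take its place.

```python
import re


def public_members(cls: type) -> set[str]:
    """Names from dir(cls) that do not start with an underscore."""
    return {name for name in dir(cls) if not name.startswith("_")}


def method_mentions(guide: str) -> set[str]:
    """Names that appear as ``.<name>(`` in the guide text."""
    return set(re.findall(r"\.([A-Za-z_]\w*)\(", guide))


class FakeDataFrame:
    """Hypothetical stand-in; not the real DataFrame class."""

    def select(self): ...
    def filter(self): ...
    def _internal(self): ...


guide_text = "Call `df.select(...)` and then `df.show_limit(5)`."

public = public_members(FakeDataFrame)
mentioned = method_mentions(guide_text)
candidates = sorted(public - mentioned)  # public, not mentioned in the guide
stale = sorted(mentioned - public)  # mentioned, no longer public
```

The `.<name>(` pattern is deliberately loose, so on a real guide it also picks up calls such as `F.sum(` or `ctx.sql(`. The stale list therefore needs filtering against the other audited classes before anything is reported.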
89+
90+
### 3. `Expr` methods and attributes
91+
92+
**Source of truth:** `python/datafusion/expr.py` — the `Expr` class. Also
93+
include `Window`, `WindowFrame`, and `GroupingSet` if they are re-exported
94+
from `datafusion.expr`.
95+
96+
**Procedure:** same as for `DataFrame`. Pay particular attention to operator
97+
dunder methods mentioned in `SKILL.md` — the "Common Pitfalls" section
98+
already covers `&`, `|`, `~`, `==`, the comparison operators, and arithmetic
99+
operators on `Expr`. If a new operator is added (e.g. a new `__matmul__`),
100+
it probably warrants a pitfall or pattern note.
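
The underscore filter used for `DataFrame` drops every dunder, so operator methods need a separate pass. The sketch below shows one possible way to do that; the `OPERATOR_DUNDERS` set, the `FakeExpr` class, and the `covered` set are all illustrative assumptions, not the real `Expr` surface or the guide's actual coverage.

```python
# Operator dunders worth checking; extend the set as needed.
OPERATOR_DUNDERS = frozenset({
    "__and__", "__or__", "__invert__",
    "__eq__", "__ne__", "__lt__", "__le__", "__gt__", "__ge__",
    "__add__", "__sub__", "__mul__", "__truediv__", "__mod__",
    "__matmul__", "__getitem__",
})


def defined_operators(cls: type) -> set[str]:
    """Operator dunders defined by cls or its bases, ignoring ``object``."""
    found: set[str] = set()
    for klass in cls.__mro__:
        # object defines default comparisons; skipping it keeps only overrides.
        if klass is not object:
            found |= OPERATOR_DUNDERS.intersection(vars(klass))
    return found


class FakeExpr:
    """Hypothetical stand-in for Expr; not the real class."""

    def __and__(self, other): return self
    def __invert__(self): return self
    def __matmul__(self, other): return self


# Operators assumed to be covered by the guide already (illustrative subset).
covered = {"__and__", "__or__", "__invert__", "__eq__"}
new_operators = sorted(defined_operators(FakeExpr) - covered)
```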
101+
102+
### 4. `SessionContext` methods
103+
104+
**Source of truth:** `python/datafusion/context.py` — the `SessionContext`
105+
class.
106+
107+
**Procedure:** same as for `DataFrame`. The high-value methods in `SKILL.md`
108+
are the data-loading methods (`read_parquet`, `read_csv`, `read_json`,
109+
`from_pydict`, `from_pylist`, `from_pandas`) and the SQL entry points
110+
(`sql`, `register_*`, `table`). New additions in those families are
111+
worth flagging for a sentence in the data-loading section.
112+
113+
### 5. Re-exports at package root
114+
115+
**Source of truth:** `python/datafusion/__init__.py` — the top-level
116+
`from ... import ...` statements and `__all__`. A symbol re-exported at the
117+
package root belongs in the import examples in `SKILL.md` even if it is
118+
defined in a submodule.
119+
120+
**Procedure:** verify every name in the top-level `__all__` resolves. Flag
121+
any new re-export that is not already mentioned in the "Import Conventions"
122+
or "Core Abstractions" section.
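
Checking that every exported name resolves can be sketched as below. The module built with `types.ModuleType` is a toy stand-in; a real run would pass the result of `importlib.import_module("datafusion")` instead. The helper name `unresolved_exports` is hypothetical.

```python
import types


def unresolved_exports(module: types.ModuleType) -> list[str]:
    """Names listed in ``__all__`` that are not attributes of the module."""
    exported = getattr(module, "__all__", ())
    return sorted(name for name in exported if not hasattr(module, name))


# Toy package: "col" is exported but never defined.
pkg = types.ModuleType("fake_pkg")
pkg.SessionContext = object()
pkg.__all__ = ["SessionContext", "col"]

broken = unresolved_exports(pkg)
```

Using `hasattr` rather than inspecting `vars(module)` means names provided lazily through a module-level `__getattr__` still count as resolved.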
123+
124+
## Alias handling
125+
126+
Many functions in the `functions` module are aliases — for example
127+
`list_sort` aliases `array_sort`, and `character_length` aliases `length`.
128+
The convention in this project is that alias function docstrings carry only
129+
a one-line description and a `See Also` pointing at the primary function
130+
(see `CLAUDE.md`). Do not flag an alias as missing from `SKILL.md` as long
131+
as its primary function is already covered, unless the alias uses a name
132+
that a user would reasonably reach for first (e.g. SQL-standard names).
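
The alias rule reduces to a small decision function. This sketch assumes the alias-to-primary mapping has already been collected (for example from the `See Also` entries); how that mapping is built is left open, and treating `character_length` as a name users reach for first is only for illustration.

```python
def needs_coverage(
    name: str,
    covered: set[str],
    alias_of: dict[str, str],
    user_first: frozenset[str] = frozenset(),
) -> bool:
    """Decide whether a public function should be flagged as missing."""
    if name in covered:
        return False
    primary = alias_of.get(name)
    if primary is None:
        return True  # not an alias, so it needs its own mention
    if primary not in covered:
        return True  # neither the alias nor its primary is documented
    # Primary is covered: flag only names a user would reach for first.
    return name in user_first


# Alias pairs taken from the examples in this section.
alias_of = {"list_sort": "array_sort", "character_length": "length"}
covered = {"array_sort", "length"}
```

With these inputs, `needs_coverage("list_sort", covered, alias_of)` is `False`, while passing `user_first=frozenset({"character_length"})` makes `character_length` a flagged name even though `length` is covered.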
133+
134+
## Output Format
135+
136+
Produce a report of this shape:
137+
138+
```
139+
## SKILL.md Audit Report
140+
141+
### Summary
142+
- Functions checked: N
143+
- DataFrame methods checked: N
144+
- Expr members checked: N
145+
- SessionContext methods checked: N
146+
- Package-root re-exports checked: N
147+
148+
### New APIs needing coverage in SKILL.md
149+
- `functions.new_fn` — brief description. Suggested section: "String".
150+
- `DataFrame.with_catalog` — brief description. Suggested section: "Core Abstractions".
151+
152+
### Stale mentions in SKILL.md
153+
- `functions.old_fn` — referenced in "Available Functions" but no longer in `__all__`. Likely renamed to `new_fn` in <upstream PR/commit>.
154+
- `DataFrame.show_limit` — referenced in a pitfall; method removed in favor of `DataFrame.show(num=...)`.
155+
156+
### Informational
157+
- Alias `list_sort` covered transitively via `array_sort` — no action needed.
158+
```
159+
160+
If every area is clean, state that explicitly ("All audited areas are in
161+
sync. No action required."). An audit report that elides the summary line
162+
is harder to scan in a release checklist.
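
A minimal sketch of a renderer that always emits the summary and falls back to the explicit clean statement is shown here. The function name `render_report` and its parameters are assumptions for illustration, and the "Informational" section is omitted for brevity.

```python
def render_report(
    counts: dict[str, int], new_apis: list[str], stale: list[str]
) -> str:
    """Render the audit report; the summary is emitted in every case."""
    lines = ["## SKILL.md Audit Report", "", "### Summary"]
    lines += [f"- {label} checked: {n}" for label, n in counts.items()]
    if not new_apis and not stale:
        # Clean run: say so explicitly instead of leaving the body empty.
        lines += ["", "All audited areas are in sync. No action required."]
        return "\n".join(lines)
    if new_apis:
        lines += ["", "### New APIs needing coverage in SKILL.md"]
        lines += [f"- {item}" for item in new_apis]
    if stale:
        lines += ["", "### Stale mentions in SKILL.md"]
        lines += [f"- {item}" for item in stale]
    return "\n".join(lines)


clean = render_report({"Functions": 3, "DataFrame methods": 2}, [], [])
```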
163+
164+
## When to edit SKILL.md
165+
166+
This skill does not auto-edit. After reporting, wait for the user to
167+
confirm which gaps are worth filling. New APIs often need a natural home
168+
chosen by a human — the categorized function list and the pitfalls section
169+
both have opinionated structure that an automated edit will not respect.
170+
171+
## Related
172+
173+
- Repo-root [`SKILL.md`](../../SKILL.md) — the file this skill audits.
174+
- `.ai/skills/check-upstream/` — the complementary audit against upstream
175+
Rust APIs not yet exposed in Python.
176+
- `.ai/skills/write-dataframe-code/` — how to write idiomatic DataFrame
177+
code in this repo.
