Commit 0357716
tpch examples: rewrite queries idiomatically and embed reference SQL (#1504)
* tpch examples: add reference SQL to each query, fix Q20
- Append the canonical TPC-H reference SQL (from benchmarks/tpch/queries/)
to each q01..q22 module docstring so readers can compare the DataFrame
translation against the SQL at a glance.
- Fix Q20: `df = df.filter(col("ps_availqty") > lit(0.5) * col("total_sold"))`
was missing the assignment so the filter was dropped from the pipeline.
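DataFrame transformations return a new frame rather than mutating in place, so a bare `df.filter(...)` call is a silent no-op. A minimal plain-Python stand-in (a list of dicts with made-up values, not the real datafusion API) shows the same pitfall:

```python
# Toy rows standing in for Q20's partsupp-with-total-sold frame (values invented).
rows = [
    {"ps_availqty": 10, "total_sold": 30},  # 10 <= 0.5 * 30 -> should be dropped
    {"ps_availqty": 20, "total_sold": 30},  # 20 >  0.5 * 30 -> should be kept
]

def keep_excess(rows):
    # Analogue of df.filter(col("ps_availqty") > lit(0.5) * col("total_sold")):
    # returns a NEW list and leaves the input untouched.
    return [r for r in rows if r["ps_availqty"] > 0.5 * r["total_sold"]]

keep_excess(rows)         # the bug: result discarded, rows still has both entries
rows = keep_excess(rows)  # the fix: rebind the name to the filtered result
```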
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tpch examples: rewrite non-idiomatic queries in idiomatic DataFrame form
Rewrite the seven TPC-H example queries that did not demonstrate the
idiomatic DataFrame pattern. The remaining queries (Q02/Q11/Q15/Q17/Q22,
which use window functions in place of correlated subqueries) are already
idiomatic and are left unchanged.
- Q04: replace `.aggregate([col("l_orderkey")], [])` with
`.select("l_orderkey").distinct()`, which is the natural way to express
"reduce to one row per order" on a DataFrame.
- Q07: remove the CASE-as-filter on `n_name` and use
`F.in_list(col("n_name"), [nation_1, nation_2])` instead. Drops a
comment block that admitted the filter form was simpler.
- Q08: rewrite the switched CASE `F.case(...).when(lit(False), ...)` as a
searched `F.when(col(...).is_not_null(), ...).otherwise(...)`. That
mirrors the reference SQL's `case when ... then ... else 0 end` shape.
- Q12: replace `array_position(make_array(...), col)` with
`F.in_list(col("l_shipmode"), [...])`. Same semantics, without routing
through array construction / array search.
- Q19: remove the pyarrow UDF that re-implemented a disjunctive predicate
in Python. Build the same predicate in DataFusion by OR-combining one
`in_list` + range-filter expression per brand. Keeps the per-brand
constants in the existing `items_of_interest` dict.
- Q20: use `F.starts_with` instead of an explicit substring slice. Replace
the inner-join + `select(...).distinct()` tail with a semi join against
a precomputed set of excess-quantity suppliers so the supplier columns
are preserved without deduplication after the fact.
- Q21: replace the `array_agg` / `array_length` / `array_element` pipeline
with two semi joins. One semi join keeps orders with more than one
distinct supplier (stand-in for the reference SQL's `exists` subquery),
the other keeps orders with exactly one late supplier (stand-in for the
`not exists` subquery).
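The Q19 rewrite builds one in-list-plus-range expression per brand and OR-combines the results instead of evaluating a Python UDF per row. A plain-Python sketch of that shape (the quantity bounds shown here are assumed for illustration, derived from the stated minimums, and are not copied from the example file):

```python
# Hypothetical per-brand constants in the shape of Q19's items_of_interest dict:
# brand -> (min_quantity, max_quantity).
items_of_interest = {
    "Brand#12": (1, 11),
    "Brand#23": (10, 20),
    "Brand#34": (20, 30),
}

def matches(row):
    # OR-combine one predicate per brand, mirroring how the DataFrame version
    # ORs one Expr per brand rather than routing each row through Python.
    return any(
        row["p_brand"] == brand and lo <= row["l_quantity"] <= hi
        for brand, (lo, hi) in items_of_interest.items()
    )

assert matches({"p_brand": "Brand#23", "l_quantity": 15})
assert not matches({"p_brand": "Brand#23", "l_quantity": 25})
```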
All 22 answer-file comparisons and 22 plan-comparison diagnostics still
pass (`pytest examples/tpch/_tests.py`: 44 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tpch examples: align reference SQL constants with DataFrame queries
The reference SQL embedded in each q01..q22 module docstring was carried
over verbatim from ``benchmarks/tpch/queries/`` and uses a different set
of TPC-H substitution parameters than the DataFrame examples
(answer-file-validated at scale factor 1). Update each reference SQL to
use the substitution parameters the DataFrame uses, so both expressions
describe the same query and would produce the same results against the
same data.
Constants aligned:
- Q01: ``90 days`` cutoff (DataFrame ``DAYS_BEFORE_FINAL = 90``).
- Q02: ``p_size = 15``, ``p_type like '%BRASS'``, ``r_name = 'EUROPE'``.
- Q04: base date ``1993-07-01`` (``3 month`` interval preserved per the
"quarter of a year" wording).
- Q05: ``r_name = 'ASIA'``.
- Q06: ``l_discount between 0.06 - 0.01 and 0.06 + 0.01``.
- Q07: nations ``'FRANCE'`` / ``'GERMANY'``.
- Q08: ``r_name = 'AMERICA'``, ``p_type = 'ECONOMY ANODIZED STEEL'``,
inner-case ``nation = 'BRAZIL'``.
- Q09: ``p_name like '%green%'``.
- Q10: base date ``1993-10-01`` (``3 month`` interval preserved).
- Q11: ``n_name = 'GERMANY'``.
- Q12: ship modes ``('MAIL', 'SHIP')``, base date ``1994-01-01``.
- Q13: ``o_comment not like '%special%requests%'``.
- Q14: base date ``1995-09-01``.
- Q15: base date ``1996-01-01``.
- Q16: ``p_brand <> 'Brand#45'``, ``p_type not like 'MEDIUM POLISHED%'``,
sizes ``(49, 14, 23, 45, 19, 3, 36, 9)``.
- Q17: ``p_brand = 'Brand#23'``, ``p_container = 'MED BOX'``.
- Q18: ``sum(l_quantity) > 300``.
- Q19: brands ``Brand#12`` / ``Brand#23`` / ``Brand#34`` with the matching
minimum quantities (1, 10, 20).
- Q20: ``p_name like 'forest%'``, base date ``1994-01-01``,
``n_name = 'CANADA'``.
- Q21: ``n_name = 'SAUDI ARABIA'``.
- Q22: country codes ``('13', '31', '23', '29', '30', '18', '17')``.
Interval units (month / year) are preserved where the problem-statement
text reads "given quarter", "given year", "given month". Q01 keeps the
literal "days" unit because the TPC-H problem statement itself describes
the cutoff in days.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tpch examples: apply SKILL.md idioms across all 22 queries
Sweep every q01..q22 example for idiomatic DataFrame style as described in
the repo-root SKILL.md:
- ``col("x") == "s"`` in place of ``col("x") == lit("s")`` on comparison
right-hand sides (auto-wrap applies).
- Plain-name strings in ``select``/``aggregate``/``sort`` group/sort key
lists when the key is a bare column.
- Drop redundant ``how="inner"`` and single-element ``left_on``/``right_on``
list wrapping on equi-joins.
- Collapse chained ``.filter(a).filter(b)`` runs into ``.filter(a, b)``
and chained ``.with_column`` runs into ``.with_columns(a=..., b=...)``.
- ``df.sort_by(...)`` or plain-name ``df.sort(...)`` when no null-placement
override is needed.
- ``F.count_star()`` in place of ``F.count(col("x"))`` whenever the SQL
reads ``count(*)``.
- ``F.starts_with(col, lit(prefix))`` and ``~F.starts_with(...)`` in place
of substring-prefix equality/inequality tricks.
- ``F.in_list(col, [lit(...)])`` in place of
``~F.array_position(...).is_null()`` and in place of disjunctions of
equality comparisons.
- Searched ``F.when(cond, x).otherwise(y)`` in place of switched
``F.case(bool_expr).when(lit(True/False), x).end()`` forms.
- Semi-joins as the DataFrame form of ``EXISTS`` (Q04); anti-joins as
``NOT EXISTS`` (Q22 was already using this idiom).
- Whole-frame window aggregates as the DataFrame stand-in for a SQL
scalar subquery (Q11/Q15/Q17/Q22).
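The semi/anti-join idioms above boil down to a membership test on the join key: a semi join keeps left rows whose key appears on the right (``EXISTS``) and emits only left columns, with no duplication even when a key matches several right rows; an anti join keeps the remainder (``NOT EXISTS``). In plain-Python terms, on toy data rather than the datafusion API:

```python
# Toy (o_orderkey, o_orderpriority) rows and a set of keys found on the
# right-hand side of the join (values invented).
orders = [(1, "1-URGENT"), (2, "5-LOW"), (3, "2-HIGH")]
late_orderkeys = {1, 3}

# Semi join ~ EXISTS: keep matching left rows, left columns only.
semi = [o for o in orders if o[0] in late_orderkeys]

# Anti join ~ NOT EXISTS: keep the left rows with no match.
anti = [o for o in orders if o[0] not in late_orderkeys]
```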
Individual query fixes of note:
- Q16 — add the secondary sort keys (``p_brand``, ``p_type``, ``p_size``)
that the TPC-H spec requires but the original DataFrame omitted.
- Q22 — drop a stray ``df.show()`` mid-pipeline; replace the 0-based
substring slice with ``F.left(col("c_phone"), lit(2))``.
- Q14 — rewrite the promo/non-promo factor split as a searched CASE inside
``F.sum(...)`` so the DataFrame expression matches the reference SQL
shape exactly.
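The Q22 substring change trades a 0-based Python-style slice for ``F.left``, which reads like SQL's 1-based ``substring(c_phone from 1 for 2)``. A stdlib illustration of the extraction (the phone value is fabricated in the TPC-H layout):

```python
c_phone = "13-702-345-6691"   # made-up customer phone in TPC-H format
country_code = c_phone[:2]    # what F.left(col("c_phone"), lit(2)) computes
codes_of_interest = ("13", "31", "23", "29", "30", "18", "17")
assert country_code in codes_of_interest
```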
All 22 answer-file comparisons still pass at scale factor 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tpch examples: more idiomatic aggregate FILTER, string funcs, date handling
Additional sweep of the TPC-H DataFrame examples informed by comparing
against a fresh set of SKILL.md-only generations under
``examples/tpch/agentic_queries/``:
- Q02: ``F.ends_with(col("p_type"), lit(TYPE_OF_INTEREST))`` in place of
``F.strpos(col, lit) > 0``. The reference SQL is ``p_type like '%BRASS'``,
which is an ends_with check, not contains. ``F.strpos > 0`` returned the
correct rows on TPC-H data by coincidence but is semantically wrong.
- Q09: ``F.contains(col("p_name"), lit(part_color))`` in place of
``F.strpos(col, lit) > 0``. The SQL is ``p_name like '%green%'``.
- Q08, Q12, Q14: use the ``filter`` keyword on ``F.sum`` / ``F.count`` —
the DataFrame form of SQL ``sum(...) FILTER (WHERE ...)`` — instead of
wrapping the aggregate input in ``F.when(cond, x).otherwise(0)``. Q08
also reorganises to inner-join the supplier's nation onto the regional
sales, which removes the previous left-join + ``F.when(is_not_null, ...)``
dance.
- Q15: compute the grand maximum revenue as a separate scalar aggregate
and ``join_on(...)`` on equality, instead of the whole-frame window
``F.max`` + filter shape. Simpler plan, same result.
- Q16: ``F.regexp_like(col, pattern)`` in place of
``F.regexp_match(col, pattern).is_not_null()``.
- Q04, Q05, Q06, Q07, Q08, Q10, Q12, Q14, Q15, Q20: store both the start
and the end of the date window as plain ``datetime.date`` objects and
compare with ``lit(end_date)``, instead of carrying the start date +
``pa.month_day_nano_interval`` and adding them at query-build time.
Drops unused ``pyarrow`` imports from the files that no longer need
Arrow scalars.
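For ``sum``, the ``filter`` keyword and the ``F.when(cond, x).otherwise(0)`` wrapper agree, since excluded rows contribute 0 either way; the FILTER form skips them outright, and unlike the CASE form it also generalizes to ``count``, where an ``otherwise(0)`` row would still be counted. A stdlib sketch of the sum equivalence:

```python
# Toy (l_shipmode, value) pairs (invented data).
rows = [("MAIL", 5), ("SHIP", 3), ("AIR", 7)]
high_modes = ("MAIL", "SHIP")

# sum(...) FILTER (WHERE ...): aggregate only the qualifying rows.
filter_sum = sum(v for mode, v in rows if mode in high_modes)

# case when ... then v else 0 end: map non-qualifying rows to 0, sum everything.
case_sum = sum(v if mode in high_modes else 0 for mode, v in rows)

assert filter_sum == case_sum == 8
```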
All 22 answer-file comparisons still pass at scale factor 1.
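The date-window change amounts to precomputing both endpoints with the stdlib before building the query. A sketch with an assumed one-year window (interval lengths vary by query):

```python
from datetime import date

start_date = date(1994, 1, 1)
# Precompute the exclusive end of the window instead of carrying a
# pa.month_day_nano_interval and adding it inside the query expression.
end_date = date(start_date.year + 1, start_date.month, start_date.day)

# The DataFrame filter then compares against plain lit(start_date) / lit(end_date).
assert (start_date, end_date) == (date(1994, 1, 1), date(1995, 1, 1))
```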
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent c8bb9f7
22 files changed
Lines changed: 1196 additions & 756 deletions