A hands-on training repo for querying address points in AWS S3 using DuckDB, PyArrow, and GeoPandas. Written for new Regrid data engineers and analysts onboarding onto the data platform, but useful to anyone who wants to turn a Parquet lake into answers.
Every lesson poses a question, answers it with code, and explains why the chosen tool is the right reach. The same canonical questions appear across multiple lessons so you can feel the tradeoffs between tools side by side.
- New Regrid hires getting their bearings with the bronze data layer.
- Data engineers who know SQL but haven't touched PyArrow or GeoPandas.
- Analysts comfortable with pandas who want to query S3 without downloading.
- Anyone who needs a reference for DuckDB + `credential_chain` against S3.
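That last combination is worth a preview. A minimal sketch of the DuckDB S3 setup, assuming credentials resolve through the standard AWS chain (lesson 01 walks through the real thing):

```sql
-- Load the S3 reader (the spatial extension comes later).
INSTALL httpfs;
LOAD httpfs;

-- Let DuckDB resolve AWS credentials the same way the AWS CLI does:
-- SSO cache, ~/.aws/config, then environment variables.
CREATE SECRET aws_creds (
    TYPE s3,
    PROVIDER credential_chain,
    CHAIN 'sso;config;env'
);
```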
- Python 3.11+
- (Optional) AWS SSO access to the Regrid account for lesson 07
- ~2 GB of disk space for the local Parquet exports the lessons write to `data/local/`
```bash
# 1. Install dependencies
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .  # makes `from playbook import ...` work in notebooks

# 2. Copy the example env file and edit it if needed
cp .env.example .env

# 3. Launch Jupyter
jupyter lab
```

If you want to run the optional internal lesson, also run:

```bash
aws sso login --profile regrid
echo "PLAYBOOK_INTERNAL=1" >> .env
```

| # | Notebook | Teaches |
|---|---|---|
| 00 | setup_and_auth | Verify DuckDB, PyArrow, and (optional) SSO |
| 01 | duckdb_basics | S3 + httpfs + spatial + SQL |
| 02 | pyarrow_basics | pyarrow.dataset, projection, streaming |
| 03 | geopandas_basics | CRS handling, plotting, spatial predicates |
| 04 | partitioned_reads | Hive partition pruning in all three tools |
| 05 | spatial_joins | Points-in-polygon: DuckDB vs GeoPandas |
| 06 | tool_comparison | Same six questions, three tools, side-by-side |
| 07 | internal_bronze | Optional — run against Regrid bronze |
Every notebook has a matching `.py` script in `scripts/` that you
can run from the command line or diff as a git-friendly source of truth.
Lessons 01–06 return to the same questions so you can compare approaches:
- How many address points exist for state / country X?
- What is the bounding box of all points in city Y?
- What are the top 10 streets by point count in ZIP Z?
- What is the distribution of points by ZIP (exported to CSV)?
- Which points fall within 100m of a reference location?
- Join address points to a polygon layer (county, tract, neighborhood).
| Dimension | DuckDB | PyArrow | GeoPandas |
|---|---|---|---|
| Language | SQL | Python / Arrow compute | Pandas-style Python |
| S3 native | ✅ httpfs | ✅ S3FileSystem | via fsspec |
| Spatial functions | ✅ spatial extension | ❌ | ✅ (best in class) |
| Memory ceiling | Disk-bounded | Streamable | RAM-bounded |
| Best for | Heavy aggregations / ETL | Precise column IO | Interactive analysis |
| Plotting | ❌ | ❌ | ✅ |
| Write partitioned lake | ✅ `COPY ... PARTITION_BY` | ✅ `write_dataset` | via pyarrow |
Rule of thumb: reach for DuckDB first, PyArrow when you need surgical IO,
GeoPandas when you need to see, sketch, or sjoin.
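And "sjoin" in miniature: a points-in-polygon spatial join as lesson 05 frames it, sketched with one hand-built polygon (the real lessons use actual county layers; all names here are made up):

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy address points and a single square "county" polygon.
points = gpd.GeoDataFrame(
    {"addr": ["100 Main St", "7 Oak Ave", "1 Far Rd"]},
    geometry=[Point(0.5, 0.5), Point(0.25, 0.75), Point(5, 5)],
    crs="EPSG:4326",
)
counties = gpd.GeoDataFrame(
    {"county": ["Wayne"]},
    geometry=[box(0, 0, 1, 1)],
    crs="EPSG:4326",
)

# Points-in-polygon via spatial join; points outside every polygon drop out.
joined = gpd.sjoin(points, counties, how="inner", predicate="within")
print(joined[["addr", "county"]])
```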
- Public (default): Overture Maps addresses on `s3://overturemaps-us-west-2`. Public read, no credentials required.
- Internal (lesson 07 only): Regrid bronze parcels (`s3://regrid-data-platform-us-east-2-bronze`). Requires AWS SSO to the Regrid account.

See `data/README.md` for schemas and how to swap datasets.
The `.ipynb` files are generated from the matching `.py` scripts. If you edit
a script, rebuild the notebooks with:

```bash
python dev/build_notebooks.py
```

To add a new lesson:

- Copy an existing script in `scripts/` and rename it with the next ordinal.
- Follow the same rhythm: Question → Approach → Code → Inspect → Takeaway.
- Run `python dev/build_notebooks.py` to regenerate the matching `.ipynb`.
- Link it from the lesson index above.