A hands-on training repo for querying address points in AWS S3 using DuckDB, PyArrow, and GeoPandas. Written for new Regrid data engineers and analysts onboarding onto the data platform, but useful to anyone who wants to turn a Parquet lake into answers.
Every lesson poses a question, answers it with code, and explains why the chosen tool is the right reach. The same canonical questions appear across multiple lessons so you can feel the tradeoffs between tools side by side.
- New Regrid hires getting their bearings with the bronze data layer.
- Data engineers who know SQL but haven't touched PyArrow or GeoPandas.
- Analysts comfortable with pandas who want to query S3 without downloading.
- Anyone who needs a reference for DuckDB + `credential_chain` against S3.
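That last combination is worth a preview. A minimal sketch of the DuckDB S3 setup, assuming credentials resolve through the standard AWS chain (lesson 01 walks through the real thing):

```sql
-- Load the S3 reader (the spatial extension comes later).
INSTALL httpfs;
LOAD httpfs;

-- Let DuckDB resolve AWS credentials the same way the AWS CLI does:
-- SSO cache, ~/.aws/config, then environment variables.
CREATE SECRET aws_creds (
    TYPE s3,
    PROVIDER credential_chain,
    CHAIN 'sso;config;env'
);
```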
- Python 3.11+
- (Optional) AWS SSO access to the Regrid account for lesson 07
- ~2 GB of disk space for the local Parquet exports the lessons write to `data/local/`
```bash
# 1. Install dependencies
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .  # makes `from playbook import ...` work in notebooks

# 2. Copy the example env file and edit it if needed
cp .env.example .env

# 3. Launch Jupyter
jupyter lab
```

If you want to run the optional internal lesson, also run:

```bash
aws sso login --profile regrid
echo "PLAYBOOK_INTERNAL=1" >> .env
```

| # | Notebook | Teaches |
|---|---|---|
| 00 | setup_and_auth | Verify DuckDB, PyArrow, and (optional) SSO |
| 01 | duckdb_basics | S3 + httpfs + spatial + SQL |
| 02 | pyarrow_basics | pyarrow.dataset, projection, streaming |
| 03 | geopandas_basics | CRS handling, plotting, spatial predicates |
| 04 | partitioned_reads | Hive partition pruning in all three tools |
| 05 | spatial_joins | Points-in-polygon: DuckDB vs GeoPandas |
| 06 | tool_comparison | Same six questions, three tools, side-by-side |
| 07 | internal_bronze | Optional — run against Regrid bronze |
Every notebook has a matching `.py` script in `scripts/` that you
can run from the command line or diff as a git-friendly source of truth.
Lessons 01–06 return to the same questions so you can compare approaches:
- How many address points exist for state / country X?
- What is the bounding box of all points in city Y?
- What are the top 10 streets by point count in ZIP Z?
- What is the distribution of points by ZIP (exported to CSV)?
- Which points fall within 100m of a reference location?
- Join address points to a polygon layer (county, tract, neighborhood).
| Dimension | DuckDB | PyArrow | GeoPandas |
|---|---|---|---|
| Language | SQL | Python / Arrow compute | Pandas-style Python |
| S3 native | ✅ httpfs | ✅ S3FileSystem | via fsspec |
| Spatial functions | ✅ spatial extension | ❌ | ✅ (best in class) |
| Memory ceiling | Disk-bounded | Streamable | RAM-bounded |
| Best for | Heavy aggregations / ETL | Precise column IO | Interactive analysis |
| Plotting | ❌ | ❌ | ✅ |
| Write partitioned lake | ✅ `COPY ... PARTITION_BY` | ✅ `write_dataset` | via pyarrow |
Rule of thumb: reach for DuckDB first, PyArrow when you need surgical IO,
GeoPandas when you need to see, sketch, or sjoin.
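And "sjoin" in miniature: a points-in-polygon spatial join as lesson 05 frames it, sketched with one hand-built polygon (the real lessons use actual county layers; all names here are made up):

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy address points and a single square "county" polygon.
points = gpd.GeoDataFrame(
    {"addr": ["100 Main St", "7 Oak Ave", "1 Far Rd"]},
    geometry=[Point(0.5, 0.5), Point(0.25, 0.75), Point(5, 5)],
    crs="EPSG:4326",
)
counties = gpd.GeoDataFrame(
    {"county": ["Wayne"]},
    geometry=[box(0, 0, 1, 1)],
    crs="EPSG:4326",
)

# Points-in-polygon via spatial join; points outside every polygon drop out.
joined = gpd.sjoin(points, counties, how="inner", predicate="within")
print(joined[["addr", "county"]])
```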
- Public (default): Overture Maps addresses on `s3://overturemaps-us-west-2`. Public read, no credentials required.
- Internal (lesson 07 only): Regrid bronze parcels (`s3://regrid-data-platform-us-east-2-bronze`). Requires AWS SSO to the Regrid account.

See `data/README.md` for schemas and how to swap datasets.
The `.ipynb` files are generated from the matching `.py` scripts. If you edit
a script, rebuild the notebooks with:

```bash
python dev/build_notebooks.py
```

To add a new lesson:

- Copy an existing script in `scripts/` and rename it with the next ordinal.
- Follow the same rhythm: Question → Approach → Code → Inspect → Takeaway.
- Run `python dev/build_notebooks.py` to regenerate the matching `.ipynb`.
- Link it from the lesson index above.