Skip to content

loveland/rgdp-data-operations-playbook

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Regrid Data Operations Playbook

A hands-on training repo for querying address points in AWS S3 using DuckDB, PyArrow, and GeoPandas. Written for new Regrid data engineers and analysts onboarding onto the data platform, but useful to anyone who wants to turn a Parquet lake into answers.

Every lesson poses a question, answers it with code, and explains why the chosen tool is the right reach. The same canonical questions appear across multiple lessons so you can feel the tradeoffs between tools side by side.

Who it's for

  • New Regrid hires getting their bearings with the bronze data layer.
  • Data engineers who know SQL but haven't touched PyArrow or GeoPandas.
  • Analysts comfortable with pandas who want to query S3 without downloading.
  • Anyone who needs a reference for DuckDB + credential_chain against S3.

Prerequisites

  • Python 3.11+
  • (Optional) AWS SSO access to the Regrid account for lesson 07
  • ~2 GB of disk space for local Parquet exports the lessons write to data/local/

Setup

# 1. Install dependencies
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .    # makes `from playbook import ...` work in notebooks

# 2. Copy the example env file and edit it if needed
cp .env.example .env

# 3. Launch Jupyter
jupyter lab

If you want to run the optional internal lesson, also run:

aws sso login --profile regrid
echo "PLAYBOOK_INTERNAL=1" >> .env

Lesson index

# Notebook Teaches
00 setup_and_auth Verify DuckDB, PyArrow, and (optional) SSO
01 duckdb_basics S3 + httpfs + spatial + SQL
02 pyarrow_basics pyarrow.dataset, projection, streaming
03 geopandas_basics CRS handling, plotting, spatial predicates
04 partitioned_reads Hive partition pruning in all three tools
05 spatial_joins Points-in-polygon: DuckDB vs GeoPandas
06 tool_comparison Same six questions, three tools, side-by-side
07 internal_bronze Optional — run against Regrid bronze

Every notebook has a matching .py script in scripts/ that you can run from the command line or diff as a git-friendly source of truth.

The six canonical questions

Lessons 01–06 return to the same questions so you can compare approaches:

  1. How many address points exist for state / country X?
  2. What is the bounding box of all points in city Y?
  3. What are the top 10 streets by point count in ZIP Z?
  4. Distribution of points by ZIP, exported to CSV.
  5. Which points fall within 100m of a reference location?
  6. Join address points to a polygon layer (county, tract, neighborhood).

Tool cheat-sheet

Dimension DuckDB PyArrow GeoPandas
Language SQL Python / Arrow compute Pandas-style Python
S3 native httpfs S3FileSystem via fsspec
Spatial functions spatial extension ✅ (best in class)
Memory ceiling Disk-bounded Streamable RAM-bounded
Best for Heavy aggregations / ETL Precise column IO Interactive analysis
Plotting
Write partitioned lake COPY ... PARTITION_BY write_dataset via pyarrow

Rule of thumb: reach for DuckDB first, PyArrow when you need surgical IO, GeoPandas when you need to see, sketch, or sjoin.

Datasets

  • Public (default): Overture Maps addresses on s3://overturemaps-us-west-2. Public read, no credentials required.
  • Internal (lesson 07 only): Regrid bronze parcels (s3://regrid-data-platform-us-east-2-bronze). Requires AWS SSO to the Regrid account.

See data/README.md for schemas and how to swap datasets.

Regenerating notebooks

The .ipynb files are generated from the matching .py scripts. If you edit a script, rebuild the notebooks with:

python dev/build_notebooks.py

Contributing a new lesson

  1. Copy an existing script in scripts/ and rename it with the next ordinal.
  2. Follow the same rhythm: Question → Approach → Code → Inspect → Takeaway.
  3. Run python dev/build_notebooks.py to regenerate the matching .ipynb.
  4. Link it from the lesson index above.

About

Data platform training notebooks (duckdb, pyarrow, iceberg)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors