Data Processing Pipeline

Overview

This document outlines the complete data processing pipeline for generating jurisdiction-level EV charging infrastructure maps. The pipeline processes utility circuit line data, federal funding zones, environmental indicators, and demographic data to create priority and feasibility pixel grids.

Update Frequency: Utility circuit line data should be updated twice annually. Other datasets are updated as needed based on availability from source agencies.

Pipeline Workflow

Data Acquisition - Download utility circuit line data from each provider
Data Cleaning - Standardize columns, convert units, add utility identifiers
Concatenation - Combine all utility lines into single dataset
Pixelation - Convert utility lines to 100m x 100m pixel grid
Attribute Joining - Add demographic, environmental, and funding attributes
Output Generation - Create jurisdiction-specific priority and feasibility files

Part 1: Utility Circuit Line Processing

1.1 Pacific Gas & Electric (PG&E)

Source: PG&E GRIP Portal

Two acquisition methods are available:

Method A: Direct Download

Navigate to the GRIP portal
In the layer list, expand ICA > ICA Results
Click options menu (three dots) for "ICA, Load Capacity (kW)"
Select Export > GeoJSON

Note: This method may encounter server timeout issues with large datasets.

Method B: API Access (Recommended)

Pull data directly from the ArcGIS Feature Server:

import requests
import geopandas as gpd

base_url = "https://services2.arcgis.com/mJaJSax0KPHoCNB6/arcgis/rest/services/DRPComplianceRelProd/FeatureServer/3/query"

params = {
    "where": "1=1",
    "outFields": "*",
    "f": "geojson",
    "resultOffset": 0,
    "resultRecordCount": 1000,
}

features = []

while True:
    print(f"Fetching offset {params['resultOffset']}")
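    # A timeout keeps a stalled connection from hanging the loop indefinitely
    response = requests.get(base_url, params=params, timeout=120)
    response.raise_for_status()
    data = response.json()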

    if "features" not in data or not data["features"]:
        break

    features.extend(data["features"])
    params["resultOffset"] += params["resultRecordCount"]

pge = gpd.GeoDataFrame.from_features(features)

Data Processing:

# Retain only necessary columns (copy so the new column is not added to a slice)
pge = pge[['LoadCapacity_kW', 'geometry']].copy()

# Add utility identifier
pge['Utility'] = 'pge'

# Set CRS and save
pge = gpd.GeoDataFrame(pge, geometry='geometry')
pge.set_crs(epsg=4326, inplace=True)
pge.to_file('pge_load.geojson', driver='GeoJSON')

Note

The manual download from the website times out.

The Python script above sometimes hangs partway through the download, which makes it hard to run unattended. At the time of writing, the dataset appears to contain 1,289,568 records.
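
Because the download can stall partway, one option is to give every request a timeout and retry a failed page before giving up. The sketch below is one way to structure that, not the project's actual script: `fetch_all_pages`, `fetch_pge_page`, the retry count, and the back-off delay are illustrative names and values. It also advances the offset by the number of features actually returned, which stays correct if the server caps a page below the requested size.

```python
import time


def fetch_all_pages(fetch_page, page_size=1000, max_retries=3, backoff_seconds=5.0):
    """Collect features page by page, retrying a page that raises.

    fetch_page(offset, count) must return a dict with a "features" list.
    All names and defaults here are illustrative assumptions.
    """
    features = []
    offset = 0
    while True:
        for attempt in range(1, max_retries + 1):
            try:
                page = fetch_page(offset, page_size)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last attempt
                time.sleep(backoff_seconds * attempt)  # simple linear back-off
        batch = page.get("features") or []
        if not batch:
            return features
        features.extend(batch)
        # Advance by what was actually returned, in case the server caps page size
        offset += len(batch)


def fetch_pge_page(offset, count):
    """Hypothetical page fetcher for the Feature Server query URL used above."""
    import requests  # imported here so fetch_all_pages has no third-party dependency

    url = ("https://services2.arcgis.com/mJaJSax0KPHoCNB6/arcgis/rest/services/"
           "DRPComplianceRelProd/FeatureServer/3/query")
    params = {"where": "1=1", "outFields": "*", "f": "geojson",
              "resultOffset": offset, "resultRecordCount": count}
    response = requests.get(url, params=params, timeout=120)  # fail instead of hanging
    response.raise_for_status()
    return response.json()
```

Calling `fetch_all_pages(fetch_pge_page)` would then return the full feature list, which can be passed to `gpd.GeoDataFrame.from_features` as in the script above.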

1.2 San Diego Gas & Electric (SDG&E)

Source: SDG&E ICM API Explorer

Data Acquisition:

Access the ICM API Explorer (account creation may be required)
Navigate to Load Capacity Grids map
Download as GeoJSON or Shapefile

Data Processing:

import geopandas as gpd

# Load data
sdge = gpd.read_file("path/to/sdge.geojson")

# Verify load columns are identical
sdge['equal'] = sdge['ICAWOF_UNILOAD'] == sdge['ICAWNOF_UNILOAD']
sdge.loc[~sdge['equal']]  # Should return an empty table

# Convert MW to kW
sdge['load_kw'] = sdge['ICAWOF_UNILOAD'] * 1000

# Retain only necessary columns (copy so the new column is not added to a slice)
sdge = sdge[['load_kw', 'geometry']].copy()

# Add utility identifier
sdge['Utility'] = 'sdge'

# Set CRS and save
sdge = gpd.GeoDataFrame(sdge, geometry='geometry')
sdge.set_crs(epsg=4326, inplace=True)
sdge.to_file('sdge_load.geojson', driver='GeoJSON')

Note

No login was required to download the data. The GeoJSON download failed to run, but the Shapefile download worked. Two adjustments are needed when starting from the Shapefile:

Field names are truncated: ICAWOF_UNILOAD becomes ICAWOF_UNI and ICAWNOF_UNILOAD becomes ICAWNOF_UN, so the script must use the shortened names.
The Shapefile is in Pseudo-Mercator (EPSG:3857), so it must be reprojected with to_crs rather than labeled with set_crs.
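
For the Shapefile route described in this note, the processing script needs two changes: map the truncated field names back to the long ones, and reproject the data instead of relabeling its CRS. A minimal sketch, assuming the truncated names are exactly those listed in the note and that the CRS is read correctly from the Shapefile's .prj file; the function name `normalize_sdge_shapefile` is hypothetical.

```python
import geopandas as gpd

# Truncated Shapefile field names mapped back to the names the script expects
SHAPEFILE_FIELD_MAP = {
    "ICAWOF_UNI": "ICAWOF_UNILOAD",
    "ICAWNOF_UN": "ICAWNOF_UNILOAD",
}


def normalize_sdge_shapefile(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Restore the long field names and reproject to EPSG:4326."""
    gdf = gdf.rename(columns=SHAPEFILE_FIELD_MAP)
    # to_crs transforms the coordinates; set_crs would only relabel them
    return gdf.to_crs(epsg=4326)
```

With this in place, `sdge = normalize_sdge_shapefile(gpd.read_file("path/to/sdge.shp"))` could replace the load step, and the rest of the script would run unchanged apart from the final set_crs call, which is no longer needed.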

1.3 Los Angeles Department of Water and Power (LADWP)

Source: LADWP Power GIS Portal

Data Acquisition:

Click "Download the 34.5 KV data" link
Unzip downloaded file to extract .kmz file
Convert .kmz to .gdb using ArcGIS "KMZ to Layer" tool

Data Processing:

import geopandas as gpd
import pandas as pd
from bs4 import BeautifulSoup

# Load geodatabase
ladwp = gpd.read_file("path/to/ladwp.gdb")

# Extract popup information
def extract_popup_info(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    data = {}
    table = soup.find_all('table')[1]

    for row in table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) == 2:
            key = cols[0].get_text(strip=True)
            value = cols[1].get_text(strip=True)
            data[key] = value

    return data

popup_info_df = ladwp['PopupInfo'].apply(extract_popup_info)
popup_info_expanded = pd.json_normalize(popup_info_df)
gdf_expanded = ladwp.drop(columns=['PopupInfo']).join(popup_info_expanded)

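# Extract minimum capacity value from range and store it as a number
gdf_expanded['min_value'] = pd.to_numeric(
    gdf_expanded['CAPACITY_RANGE_KW'].str.extract(r'^\s*(\d+)', expand=False)
)

# Retain only necessary columns (copy so the new column is not added to a slice)
ladwp = gdf_expanded[['min_value', 'geometry']].copy()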

# Add utility identifier
ladwp['Utility'] = 'ladwp'

# Set CRS and save
ladwp = gpd.GeoDataFrame(ladwp, geometry='geometry')
ladwp.set_crs(epsg=4326, inplace=True)
ladwp.to_file('ladwp_load.geojson', driver='GeoJSON')
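
The regular expression above keeps only the leading digits of each capacity range. The sketch below applies the same pattern to plain strings so its behavior is easy to inspect. The sample strings are made up for illustration, since the exact format of CAPACITY_RANGE_KW is not shown in this document, and `parse_min_capacity_kw` is a hypothetical helper.

```python
import re
from typing import Optional

_LEADING_NUMBER = re.compile(r"^\s*(\d+)")


def parse_min_capacity_kw(capacity_range: Optional[str]) -> Optional[float]:
    """Return the leading integer of a range string as a float, or None."""
    if not capacity_range:
        return None
    match = _LEADING_NUMBER.match(capacity_range)
    return float(match.group(1)) if match else None


# Made-up inputs, only to show how the pattern behaves
print(parse_min_capacity_kw("500 - 1000"))    # 500.0
print(parse_min_capacity_kw("  250-499 kW"))  # 250.0
print(parse_min_capacity_kw("N/A"))           # None
```

One caveat worth checking against the real data: if the ranges use thousands separators, the pattern stops at the first comma, so "1,000 - 2,000" would be read as 1.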

Note

The 34.5 kV zip file on the website was found to be corrupted. After LADWP was contacted, the file was updated and could be downloaded.

1.4 Southern California Edison (SCE)

Source: SCE DRP Portal

Data Acquisition:

Click "ESRI API" tab
Navigate to "ICA Layer" > "ICA - Circuit Segments"
Download as GeoJSON or Shapefile
Also download "ICA - Circuit Segments, Non-3 Phase" if available

Note: SCE provides separate files for 3-phase and non-3-phase circuits. Verify whether these datasets contain unique data before concatenating. If datasets are identical, only one is needed.
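
One way to run that check is to compare the two layers feature by feature. The sketch below counts the distinct geometry and load pairs in the second layer that do not appear in the first. It assumes both files expose an `ica_overall_load` column and that identical features have byte-identical geometries; `count_new_features` is a hypothetical helper.

```python
import geopandas as gpd


def count_new_features(primary: gpd.GeoDataFrame, other: gpd.GeoDataFrame,
                       value_column: str = "ica_overall_load") -> int:
    """Count distinct (geometry, value) pairs in `other` that are absent from `primary`."""
    def keys(gdf):
        # WKB hex gives a hashable, exact representation of each geometry
        wkb = gdf.geometry.apply(lambda geom: geom.wkb_hex if geom is not None else None)
        return set(zip(wkb, gdf[value_column].astype(str)))

    return len(keys(other) - keys(primary))
```

A result of 0 when comparing the non-3-phase layer against the 3-phase layer would suggest that only one file is needed; any other result means the second layer adds features and should be concatenated.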

Data Processing:

import geopandas as gpd

# Load data
socaled = gpd.read_file("path/to/socaled.geojson")

# Convert MW to kW (column is stored as string)
socaled['load_kw'] = (socaled['ica_overall_load'].astype('float')) * 1000

# Retain only necessary columns (copy so the new column is not added to a slice)
socaled = socaled[['load_kw', 'geometry']].copy()

# Add utility identifier
socaled['Utility'] = 'socaled'

# Set CRS and save
socaled = gpd.GeoDataFrame(socaled, geometry='geometry')
socaled.set_crs(epsg=4326, inplace=True)
socaled.to_file('socaled_load.geojson', driver='GeoJSON')

1.5 Concatenate All Utility Lines

Combine all processed utility datasets into a single file:

import pandas as pd
import geopandas as gpd

# Load all utility files
pge = gpd.read_file('pge_load.geojson')
ladwp = gpd.read_file('ladwp_load.geojson')
sdge = gpd.read_file('sdge_load.geojson')
socaled = gpd.read_file('socaled_load.geojson')

# Concatenate
utility_lines = pd.concat([pge, ladwp, sdge, socaled], ignore_index=True)

# Set CRS and save
utility_lines = gpd.GeoDataFrame(utility_lines, geometry='geometry')
utility_lines.set_crs(epsg=4326, inplace=True)
utility_lines.to_file('utility_lines.geojson', driver='GeoJSON')

Output: Save utility_lines.geojson to jurisdiction_script/data/other/
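
Note that the files written in sections 1.1 to 1.4 store capacity under different column names (`LoadCapacity_kW`, `load_kw`, `min_value`), so a plain concatenation leaves each row with a value in only one of three columns. If a single capacity column is wanted downstream (an assumption; check what `create_utility_pixels.py` actually reads), the frames can be renamed before concatenating. The target name `load_kw` and the helper `harmonize_capacity` are illustrative choices.

```python
import pandas as pd

# Capacity column written by each utility's processing step above
CAPACITY_COLUMNS = {
    "pge": "LoadCapacity_kW",
    "ladwp": "min_value",
    "sdge": "load_kw",
    "socaled": "load_kw",
}


def harmonize_capacity(frames: dict, target: str = "load_kw") -> pd.DataFrame:
    """Rename each frame's capacity column to `target`, coerce to numeric, concatenate."""
    renamed = []
    for utility, frame in frames.items():
        frame = frame.rename(columns={CAPACITY_COLUMNS[utility]: target})
        # Coerce text values (for example digits extracted by regex) to numbers
        frame[target] = pd.to_numeric(frame[target], errors="coerce")
        renamed.append(frame)
    return pd.concat(renamed, ignore_index=True)
```

The function accepts a dictionary keyed by utility identifier, for example `harmonize_capacity({"pge": pge, "ladwp": ladwp, "sdge": sdge, "socaled": socaled})`. When the inputs are GeoDataFrames, the result can be wrapped and saved in the same way as `utility_lines` above.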

Part 2: Pixelation

Convert utility circuit lines into a 100m x 100m pixel grid covering areas within 75 meters of utility infrastructure.

Command:

cd jurisdiction_script
python create_utility_pixels.py \
  -i data/other/utility_lines.geojson \
  -o data/grids/utilities_pixels.json \
  -b 75

Process:

Creates 100m x 100m grid covering California (~98 million grid points)
Buffers utility lines by 75 meters
Clips grid to areas within utility buffer (~2 million pixels)
Converts point centroids to square polygons
Saves output to data/grids/utilities_pixels.json
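
The steps listed above can be illustrated on a toy scale with Shapely. This is not the code in create_utility_pixels.py, which is not reproduced in this document; it is a sketch of the same sequence (grid, buffer, clip, squares) that assumes coordinates are already in a projected CRS measured in meters. The function name, the tiny extent, and the brute-force loop are illustrative only.

```python
from shapely.geometry import LineString, Point, box


def pixelate(line, cell=100.0, buffer_m=75.0):
    """Return square cells whose centroids lie within buffer_m of the line."""
    zone = line.buffer(buffer_m)  # buffer the utility line
    minx, miny, maxx, maxy = zone.bounds
    half = cell / 2.0
    pixels = []
    # Regular grid of candidate centroids, aligned to multiples of the cell size
    x = (minx // cell) * cell
    while x <= maxx:
        y = (miny // cell) * cell
        while y <= maxy:
            # Keep only centroids that fall inside the buffer
            if zone.covers(Point(x, y)):
                # Turn the centroid into a cell-sized square polygon
                pixels.append(box(x - half, y - half, x + half, y + half))
            y += cell
        x += cell
    return pixels


line = LineString([(0, 0), (1000, 0)])  # a 1 km east-west segment, in meters
cells = pixelate(line)
print(len(cells))     # 11 centroids (x = 0, 100, ..., 1000 at y = 0) lie within 75 m
print(cells[0].area)  # 10000.0 square meters per cell
```

Real input in EPSG:4326 would first need reprojecting to a metric CRS (EPSG:3310, California Albers, is one plausible choice, though the script's actual CRS is not stated here), and a statewide run would need a vectorized approach rather than a Python loop.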

Performance Requirements:

Memory: 16-32GB RAM
Processing Time: 45-90 minutes
Output Size: ~400-500MB

Output: Save utilities_pixels.json to jurisdiction_script/data/grids/

Part 3: Configuration and Execution

3.1 Update Configuration Files

Configuration files are located in jurisdiction_script/config/ as YAML files.

Update the following paths:

Feasibility pixels: Update to reference new utilities_pixels.json
Utility lines: Update to reference new utility_lines.geojson
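
A stale path in a configuration file is easy to miss, so a quick check before running can help. The sketch below walks an already-loaded configuration (for example the result of `yaml.safe_load`) and reports every string value that looks like a data file but does not exist on disk. It makes no assumption about key names; the helper name `find_missing_paths` and the list of suffixes are illustrative.

```python
from pathlib import Path

DATA_SUFFIXES = (".json", ".geojson")


def find_missing_paths(config, base_dir="."):
    """Return data-file paths referenced anywhere in config that do not exist."""
    missing = []
    if isinstance(config, dict):
        for value in config.values():
            missing.extend(find_missing_paths(value, base_dir))
    elif isinstance(config, (list, tuple)):
        for value in config:
            missing.extend(find_missing_paths(value, base_dir))
    elif isinstance(config, str) and config.lower().endswith(DATA_SUFFIXES):
        # Relative paths are resolved against base_dir; absolute paths are used as-is
        if not (Path(base_dir) / config).exists():
            missing.append(config)
    return missing
```

An empty list means every referenced .json or .geojson file was found relative to `base_dir`.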

3.2 Run Jurisdiction Processing

Execute the main processing script:

cd jurisdiction_script
python jscript.py config_file

Replace config_file with the appropriate configuration file name (without .yaml extension).

Example:

python jscript.py alameda_berkeley

Output: Priority and feasibility JSON files will be generated in jurisdiction_script/out/

[jurisdiction]_priority.json
[jurisdiction]_feasibility.json
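
To process several jurisdictions in one go, the command above can be wrapped in a loop over the configuration files. A sketch, assuming every .yaml file in config/ is a jurisdiction configuration and that the code runs from jurisdiction_script/; `build_commands` and `run_all` are hypothetical helpers.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(config_dir="config"):
    """Build one `python jscript.py <name>` command per YAML file (name without extension)."""
    names = sorted(path.stem for path in Path(config_dir).glob("*.yaml"))
    return [[sys.executable, "jscript.py", name] for name in names]


def run_all(config_dir="config"):
    """Run each jurisdiction in turn, stopping at the first failure."""
    for command in build_commands(config_dir):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Building the command list separately from running it makes it easy to print and review the jurisdictions before starting a long batch.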

Part 4: Data Sources

Jurisdiction Boundary Files

Data Type	Source
California County Boundaries	US Census TIGER/Line
California Place Boundaries	US Census TIGER/Line Places

Electric Utility Circuit Line Load Capacity

Utility	Source
Pacific Gas & Electric (PG&E)	PG&E DRP Integration Capacity Map
Southern California Edison (SCE)	SCE DRP Portal
San Diego Gas & Electric (SDG&E)	SDG&E ICM API Explorer
Los Angeles Dept. of Water & Power (LADWP)	LADWP Power GIS Portal

Environmental Indicator Data

Data Type	Source
CalEnviroScreen 4.0	OEHHA CalEnviroScreen
EJScreen	Harvard Dataverse
CEJST	Harvard Dataverse

Census Data (American Community Survey)

Data Type	Source
Non-White Population (2021 5-yr ACS)	Census Data Portal
Disability Characteristics (2021 5-yr ACS)	Census Data Portal
Commute Time (2021 5-yr ACS)	Census Data Portal

Notes on Environmental Indicators

Current Implementation:

EJScreen and CEJST indicators use percentile rankings across US census tracts
CalEnviroScreen provides intra-state (California-only) percentile comparisons
This provides both interstate and intrastate comparisons for California

Future Considerations: When expanding to states outside California:

CalEnviroScreen is California-specific and unavailable for other states
Consider using EJScreen's intrastate tract comparison option
This would maintain both inter- and intra-state comparison capabilities using CEJST (interstate) and EJScreen (intrastate)

Troubleshooting

Common Issues:

API URL Changes: Utility provider API endpoints may change. Check source portals for updated URLs.
Memory Issues: Pixelation process requires significant RAM. Close other applications or use a machine with more memory.
Timeout Errors: When downloading large datasets, use API-based methods rather than direct downloads.

Missing Dependencies: Ensure all required Python packages are installed:

conda install -c conda-forge geopandas numpy pandas scipy matplotlib pyyaml fiona shapely beautifulsoup4

CRS Mismatches: All output files should use EPSG:4326 (WGS84). Verify CRS after loading external datasets.
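
A check along the following lines can be run after loading any external dataset. It is a sketch: `ensure_wgs84` is a hypothetical helper, and treating a missing CRS as EPSG:4326 is an assumption that is only safe when the source is known to be in WGS84.

```python
import geopandas as gpd


def ensure_wgs84(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Return gdf in EPSG:4326, labeling or reprojecting as needed."""
    if gdf.crs is None:
        # No CRS recorded: label it (assumes coordinates are already lon/lat)
        return gdf.set_crs(epsg=4326)
    if gdf.crs.to_epsg() != 4326:
        # Different CRS: transform the coordinates
        return gdf.to_crs(epsg=4326)
    return gdf
```

The distinction matters because set_crs only attaches a label, while to_crs changes the coordinate values; applying set_crs to data stored in another CRS would leave the geometries in the wrong place.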
