
# Real-Time Wikipedia Streaming Pipeline

A real-time data engineering project that ingests live Wikipedia edit events and processes them using the Databricks Lakehouse Medallion Architecture (Bronze → Silver → Gold). The pipeline captures live streaming events from the Wikipedia Recent Changes API, processes them using Spark Structured Streaming, transforms them using Delta Live Tables (DLT), and visualizes insights in a Databricks SQL Dashboard. This project demonstrates modern data engineering practices used in production environments.


## Architecture Overview

```
real-time-wikipedia-streaming-pipeline
├── src
│   ├── notebooks
│   │   └── bronze_ingestion
│   │       └── wikipedia_stream_ingestion_notebook
│   ├── DTLpinline
│   │   └── silver_transformation_dtl_pipeline
│   ├── gold
│   │   └── gold_analytics_notebook
│   └── dashboard
│       └── wikipedia_dashboard.json
├── architecture.png
└── README.md
```

## Pipeline Flow

```
Wikipedia Live Stream API
        │
        ▼
Bronze Layer (Raw Streaming Data)
        │
        ▼
Silver Layer (Cleaned & Structured Data)
        │
        ▼
Gold Layer (Aggregated Analytics Tables)
        │
        ▼
Databricks SQL Dashboard (Real-Time Insights)
```

## Project Objectives

This project demonstrates how to build a modern real-time data pipeline on the Databricks ecosystem.

Key goals:

- Ingest real-time streaming data
- Implement the Medallion Architecture
- Use Delta Live Tables (DLT)
- Apply data cleaning and transformations
- Create analytics-ready datasets
- Build real-time dashboards
- Deploy pipelines with Databricks Asset Bundles

## Data Source

Wikipedia Recent Changes stream (server-sent events):
https://stream.wikimedia.org/v2/stream/recentchange

Event fields:

- `title` – Page title
- `user` – Editor username
- `wiki` – Wiki identifier (e.g. `enwiki`)
- `timestamp` – Edit time
- `bot` – Bot indicator
- `comment` – Edit comment
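As a minimal sketch, the stream can be consumed outside Databricks with the `sseclient` and `requests` libraries listed in the prerequisites. The helper names `extract_fields` and `stream_recent_changes` are illustrative, not part of the repository:

```python
import json

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
FIELDS = ("title", "user", "wiki", "timestamp", "bot", "comment")


def extract_fields(raw_event: str) -> dict:
    """Parse one SSE data payload, keeping only the fields used downstream."""
    change = json.loads(raw_event)
    return {key: change.get(key) for key in FIELDS}


def stream_recent_changes(limit: int = 5):
    """Yield up to `limit` parsed events from the live stream."""
    from sseclient import SSEClient  # pip install sseclient requests

    yielded = 0
    for message in SSEClient(STREAM_URL):
        if not message.data:  # skip keep-alive heartbeats
            continue
        yield extract_fields(message.data)
        yielded += 1
        if yielded >= limit:
            break
```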

## Medallion Architecture

### Bronze Layer

- Stores raw streaming data
- Preserves the original JSON payload for debugging
- Example table: `bronze_wikipedia_events`
- Notebook: `src/notebooks/bronze_ingestion/`
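The Bronze layer's contract can be captured in one small helper: keep the payload exactly as received, and attach ingestion metadata. This is a local, Spark-free sketch of that idea; `to_bronze_record` is a hypothetical name, not a function in the repo:

```python
import json
from datetime import datetime, timezone


def to_bronze_record(raw_payload: str) -> dict:
    """Wrap one raw event for the Bronze layer: the original JSON string is
    preserved untouched for debugging, alongside ingestion metadata."""
    return {
        "raw_json": raw_payload,  # original payload, never modified
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "is_valid_json": _is_valid_json(raw_payload),
    }


def _is_valid_json(payload: str) -> bool:
    """Flag (rather than drop) malformed payloads so nothing is lost."""
    try:
        json.loads(payload)
        return True
    except (ValueError, TypeError):
        return False
```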

### Silver Layer

- Cleans and structures the data
- Filters valid edits; normalizes usernames, timestamps, and bot flags
- Example table: `silver_wikipedia_edits`
- Pipeline: `src/DTLpinline/silver_transformation_dtl_pipeline/`
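The cleaning rules above can be illustrated on a single event in plain Python. This sketches the logic only; the pipeline itself applies equivalent transformations with DLT/Spark, and `clean_event` is a hypothetical helper:

```python
from datetime import datetime, timezone
from typing import Optional


def clean_event(event: dict) -> Optional[dict]:
    """Silver-layer rules on one raw event: drop invalid edits and
    normalize the username, timestamp, and bot flag."""
    title = event.get("title")
    user = event.get("user")
    if not title or not user:
        return None  # filter: keep only events with a page title and an editor
    ts = event.get("timestamp")
    return {
        "title": title,
        "user": user.strip(),
        "wiki": event.get("wiki"),
        "edit_time": (
            datetime.fromtimestamp(ts, tz=timezone.utc).isoformat() if ts else None
        ),
        "is_bot": bool(event.get("bot")),  # missing/None becomes False
        "comment": event.get("comment") or "",
    }
```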

### Gold Layer

- Analytics-ready tables for dashboards
- Example metrics:
  - Top edited pages
  - Edits per language
  - Bot vs. human edits
- Notebook: `src/gold/gold_analytics_notebook/`
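To illustrate the kind of aggregate the Gold layer produces: in the pipeline this is a Spark/SQL aggregation, but the "top edited pages" metric reduces to a count per title. `top_edited_pages` is a hypothetical stand-in working on plain dicts:

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_edited_pages(edits: List[Dict], n: int = 10) -> List[Tuple[str, int]]:
    """Count edits per page title and return the n most-edited pages,
    most-edited first."""
    counts = Counter(edit["title"] for edit in edits)
    return counts.most_common(n)
```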

## Technology Stack

| Tool | Purpose |
| --- | --- |
| Databricks | Data engineering platform |
| Apache Spark | Distributed processing |
| Spark Structured Streaming | Real-time streaming |
| Delta Lake | Reliable data storage |
| Delta Live Tables | Streaming transformations |
| Databricks SQL | Data visualization |
| Databricks Asset Bundles | Deployment and version control |
| GitHub | Source control |

## Prerequisites

- Databricks workspace
- Python 3.x / PySpark
- Databricks cluster with Delta Live Tables enabled
- Required Python libraries: `sseclient`, `requests`, `pyspark`
- GitHub access to clone the repository

## How to Run the Project

1. Clone the repository:

   ```bash
   git clone https://github.com/data-engineer-yogesh/real-time-wikipedia-streaming-pipeline.git
   ```

2. Upload the notebooks and pipelines to your Databricks workspace.

3. Install the required libraries on the cluster:

   ```
   %pip install sseclient requests
   ```

4. Run the Bronze notebook to start streaming ingestion.

5. Deploy the DLT pipeline for the Silver and Gold transformations.

6. Open the Databricks SQL dashboard to visualize the analytics.

## Example Analytics Queries

Most edited pages:

```sql
SELECT title, COUNT(*) AS edits
FROM silver_wikipedia_edits
GROUP BY title
ORDER BY edits DESC;
```

Edits per language:

```sql
SELECT wiki, COUNT(*) AS edits
FROM silver_wikipedia_edits
GROUP BY wiki;
```

Bot vs. human edits:

```sql
SELECT bot, COUNT(*) AS edits
FROM silver_wikipedia_edits
GROUP BY bot;
```

## Learning Outcomes

This project demonstrates:

- Real-time streaming ingestion
- Delta Lake and Delta Live Tables pipeline design
- Medallion architecture implementation
- Data transformation best practices
- Analytics-ready dataset creation
- Dashboarding on Databricks

It serves as a portfolio project for data engineering interviews and Databricks certification preparation.


## Future Improvements

Potential enhancements:

- Add Kafka as a streaming source
- Implement data quality checks
- Add alerting on edit spikes
- Build advanced analytics models
- Integrate with BI tools

## 🪪 License

MIT License. For learning, demonstration, and portfolio purposes.