Boost Data Collector is a Django project that collects and manages data from various Boost-related sources. The project has multiple Django apps in one repository. All apps share one virtual environment, one database (PostgreSQL), and the same Django settings. Each app exposes one or more management commands (e.g. run_boost_library_tracker). Production scheduling uses Celery Beat and config/boost_collector_schedule.yaml via run_scheduled_collectors (see docs/Workflow.md).
- Python 3.11+
- Django (version in `requirements.txt`)
- PostgreSQL database access
- pandoc — required by `boost_library_docs_tracker` for HTML→Markdown conversion (`pypandoc` calls the `pandoc` binary at runtime):
  - macOS: `brew install pandoc`
  - Debian/Ubuntu: `sudo apt-get install pandoc`
  - Windows: `winget install JohnMacFarlane.Pandoc` or download from pandoc.org
- Environment variables for database URL and API keys (e.g. via `.env`)
- Clone the repository:

  ```bash
  git clone <boost-data-collector-repo-url>
  cd boost-data-collector
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  # Windows
  venv\Scripts\activate
  # Linux/macOS
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables (e.g. copy `.env.example` to `.env` and set the database URL and API credentials).
- Create and run migrations (required before any command that uses the database):

  ```bash
  python manage.py makemigrations
  python manage.py migrate
  ```

  Each project app has a `migrations/` package. If you previously saw "No changes detected" but `migrate` only listed `admin`, `auth`, `contenttypes`, and `sessions`, ensure those packages exist and run the commands again. After a successful `migrate` you should see migrations for `cppa_user_tracker`, `github_activity_tracker`, `boost_library_tracker`, `core`, and other installed apps (GitHub utilities under `core.operations.github_ops` are not Django apps and have no migrations).
If you see `relation "cppa_user_tracker_githubaccount" does not exist` (or similar), the database tables are missing — run the two commands above.
- Run a single app command or the full workflow to confirm the project works:

  ```bash
  python manage.py run_scheduled_collectors --schedule daily --group github
  ```

  For local development you can start the dev server: `python manage.py runserver`.
You can run the whole stack (Django, PostgreSQL, Redis, Celery worker and beat) in Docker. See docs/Docker.md for step-by-step instructions, including first-time setup and useful commands.
The daily workflow runs as a Celery task (see docs/Celery_test.md). You need Redis running (default: localhost:6379). Start the worker and (optionally) Beat in separate terminals:
```bash
# Worker (executes tasks)
celery -A config worker -l info

# Beat (schedules YAML-driven tasks per group / interval)
celery -A config beat -l info
```

On Windows, the project configures the worker to use the solo pool automatically; if you still see `PermissionError: [WinError 5]`, pass the pool explicitly: `celery -A config worker -l info --pool=solo`.
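The automatic Windows behavior mentioned above amounts to choosing the worker pool by platform. A minimal sketch — the function name is hypothetical, and the real selection lives in the project's Celery configuration:

```python
import sys

def default_celery_pool() -> str:
    """Pick a Celery worker pool by platform: the default prefork pool is not
    supported on Windows, so fall back to the single-threaded solo pool there."""
    return "solo" if sys.platform == "win32" else "prefork"
```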
The project uses pytest with pytest-django. Tests run against config.test_settings (SQLite in-memory by default; set DATABASE_URL to use PostgreSQL).
- Install test dependencies (once):

  ```bash
  pip install -r requirements-dev.txt
  ```

- Run the full test suite:

  ```bash
  python -m pytest
  ```

- Optional: run with coverage and enforce a minimum percentage locally:

  ```bash
  python -m pytest --tb=short --cov=. --cov-report=term-missing --cov-fail-under=90
  ```

  Coverage writes a local `.coverage` file (binary SQLite data used by coverage.py; safe to delete). It is listed in `.gitignore`.
PostgreSQL parity (recommended before merging DB-sensitive changes): GitHub Actions runs the full suite against Postgres (`DATABASE_URL` in `.github/workflows/actions.yml`; tests use `127.0.0.1` for a stable loopback connection). Locally, `pytest.ini` defaults to SQLite in-memory when `DATABASE_URL` is unset (`config.test_settings`). Run the full suite against Postgres when you touch JSONB, enums, or locks, for example:

```bash
# Linux / macOS
export DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5432/postgres
python -m pytest
```

```bat
:: Windows (Command Prompt)
set DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5432/postgres
python -m pytest
```

- Run a subset of tests (e.g. one app or one file):

  ```bash
  python -m pytest cppa_user_tracker/tests/ -v
  python -m pytest github_activity_tracker/tests/test_sync_utils.py -v
  ```

CI runs pytest with coverage (`--cov`, HTML/XML reports). To match the local coverage gate, use `--cov-fail-under=90` (see the optional coverage step above). If coverage fails locally or you need a fresh test DB schema after model changes, run once with `python -m pytest --create-db`.
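The SQLite/Postgres switch described above can be sketched as plain Python. This is illustrative only — `config.test_settings` presumably parses the URL with a helper rather than returning it raw, and the function name is made up here:

```python
import os

def select_test_database() -> dict:
    """Use PostgreSQL when DATABASE_URL is set; otherwise default to an
    in-memory SQLite database, mirroring the documented local behavior."""
    url = os.environ.get("DATABASE_URL")
    if url:
        return {"ENGINE": "django.db.backends.postgresql", "URL": url}
    return {"ENGINE": "django.db.backends.sqlite3", "NAME": ":memory:"}
```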
See docs/Development_guideline.md for when to run tests during development.
```text
boost-data-collector/
├── manage.py
├── requirements.txt
├── .env.example
├── README.md
├── config/ or <project_name>/   # Django project settings (settings.py)
├── docs/                        # Documentation (per-topic)
│   ├── README.md                # Topic index
│   ├── operations/              # Shared I/O (GitHub, Discord, etc.)
│   │   ├── README.md
│   │   └── github.md
│   ├── service_api/             # Per-app service API
│   ├── Workflow.md
│   ├── Schema.md
│   └── ...
├── workspace/                   # Raw/processed files (see docs/Workspace.md)
│   ├── github_activity_tracker/
│   ├── boost_library_tracker/
│   ├── ...
│   └── shared/
│  (Django apps)
├── cppa_user_tracker/
├── github_activity_tracker/
├── core/                        # Shared utilities (e.g. collector base types)
└── ...
```
Each Django app can expose management commands in management/commands/. All apps are in INSTALLED_APPS and use the shared database.
- Django project: One Django project with multiple Django apps; all apps share the same settings and database.
- Architecture / data flow: See docs/Architecture_data_flow.md for Mermaid diagrams (sources → collectors → PostgreSQL / workspace → Pinecone) and a per-app component map. Scheduling diagram: docs/Development_guideline.md.
- Workflow: `boost_collector_runner` runs app commands from `config/boost_collector_schedule.yaml` (via `run_scheduled_collectors` and Celery). You can also run individual `manage.py` commands by hand.
- Database: One PostgreSQL database (e.g. `boost_dashboard`); Django ORM and migrations for all apps.
- Configuration: Django settings (`settings.py`) and environment variables (e.g. via `django-environ` or `python-decouple`).
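The schedule-driven workflow can be illustrated with a small filter. The entry shape and the second command below are hypothetical — only `run_boost_library_tracker` and the `--schedule`/`--group` flags appear in this document; the real YAML keys may differ:

```python
# Hypothetical shape of config/boost_collector_schedule.yaml after parsing.
SCHEDULE = [
    {"command": "run_boost_library_tracker", "schedule": "daily", "group": "github"},
    {"command": "run_cppa_user_tracker", "schedule": "weekly", "group": "cppalliance"},
]

def select_collectors(entries, schedule, group):
    """Keep commands matching both the interval and the group, the way
    run_scheduled_collectors --schedule/--group filters the schedule."""
    return [e["command"] for e in entries
            if e["schedule"] == schedule and e["group"] == group]
```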
The project supports multiple GitHub tokens for different operations (see .env.example):
- `GITHUB_TOKEN` – Fallback when a specific token is not set.
- `GITHUB_TOKENS_SCRAPING` – Comma-separated list for API read/scraping; tokens are used in round-robin to spread rate limits.
- `GITHUB_TOKEN_WRITE` – Used for creating PRs, creating issues, commenting on issues, and `git push` (falls back to `GITHUB_TOKEN`).
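The round-robin policy for scraping tokens can be sketched with `itertools.cycle`. A hedged example of the documented fallback order, not the project's actual code (the function name is made up):

```python
import itertools
import os

def scraping_token_cycle():
    """Round-robin over GITHUB_TOKENS_SCRAPING (comma-separated), falling back
    to GITHUB_TOKEN; returns None when no token is configured."""
    raw = os.environ.get("GITHUB_TOKENS_SCRAPING") or os.environ.get("GITHUB_TOKEN", "")
    tokens = [t.strip() for t in raw.split(",") if t.strip()]
    return itertools.cycle(tokens) if tokens else None
```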
Operations (shared I/O): External integrations (GitHub, Slack/markdown helpers, etc.) live under core.operations (for example core.operations.github_ops) and are not separate Django apps. See docs/operations/ and docs/operations/github.md for GitHub usage and token mapping.
One folder, subfolders per app. For github_activity_tracker, sync uses workspace/github_activity_tracker/<owner>/<repo>/commits|issues|prs/*.json; files are processed into the DB then removed. Default root: workspace/ (configurable via WORKSPACE_DIR). See docs/Workspace.md.
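The workspace layout above can be expressed as a small path helper. The function name and the `boostorg/beast` example are hypothetical; only the directory shape and the `WORKSPACE_DIR` override come from this document:

```python
import os
from pathlib import Path

def sync_dir(app: str, owner: str, repo: str, kind: str) -> Path:
    """Build workspace/<app>/<owner>/<repo>/<kind> per the documented layout.
    kind is one of commits/issues/prs; the root honors WORKSPACE_DIR."""
    if kind not in {"commits", "issues", "prs"}:
        raise ValueError(f"unknown sync kind: {kind}")
    root = Path(os.environ.get("WORKSPACE_DIR", "workspace"))
    return root / app / owner / repo / kind
```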
Docs are organized by topic (one doc per concern: workflow, workspace, service API, etc.). See docs/README.md for the full index.
- Onboarding.md – First-day orientation for contributors (mental model, app roles, data flow).
- docs/README.md – Per-topic index and how to find app-specific info.
- Running tests – How to run the test suite (pytest, coverage).
- Celery – How to start the Celery worker and Beat.
- Celery_test.md – Testing the Celery task (run once, Beat, Redis).
- operations/ – Operations group: shared I/O (GitHub, Discord, etc.); index and per-operation docs.
- Architecture_data_flow.md – High-level data flow (collectors, DB, Pinecone).
- How_to_add_a_collector.md – Checklist for adding a new collector command.
- operations/github.md – GitHub layer (clone, push, fetch file, create PR/issue/comment) and token use.
- Deployment.md – CI/CD pipeline, GitHub secrets, server setup, and deploy script behavior.
- Workspace.md – Workspace layout and usage for file processing.
- Schema.md – Database schema and table relationships.
- Development_guideline.md – Development setup, app requirements, and step-by-step workflow.
- Contributing.md – Service layer (single place for writes) and contributor guidelines.
- Service_API.md – API reference and index for all service layer functions.
- service_api/ – Per-app service API docs (name, description, parameters, return types, validation).
The project deploys automatically over SSH after CI passes. Pushes to develop deploy to staging; pushes to main deploy to production.
See docs/Deployment.md for:
- Required environment secrets (`SSH_HOST`, `SSH_USER`, `SSH_PRIVATE_KEY`) and optional `SSH_PORT` (defaults to `22`) — set per environment (production / staging)
- GitHub Environments setup (approval gates for production)
- One-time server setup (prerequisites, `.env`, SSH key)
- Deploy script behavior and override options
- main – Default/production branch (stable, release-ready code).
- develop – Development branch (active integration and feature work).
- Feature branches: Create from `develop`; do not branch from `main` for day-to-day work.
- Pull requests: Open PRs against `develop`; merge to `main` for releases.