CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Also read AGENTS.md — hard rules for any AI agent working in this repo (e.g. no monkey-patching).

Project Overview

Helico is an AlphaFold3 clone built from scratch in PyTorch for experimentation.

Project Structure

src/helico/
  __init__.py    Package entry (exports Helico, HelicoConfig)
  model.py       All neural network modules in a single file
  data.py        Data pipeline (CCD, mmCIF, tokenizer, MSA, cropping)
  train.py       Training loop, DDP, checkpointing, inference
  bench.py       FoldBench benchmark scoring and local runner
tests/
  test_data.py   Integration tests for the data pipeline
  test_model.py  Integration tests for all model components
modal/
  ci.py                    CI tests on Modal
  bench.py                 Parallel FoldBench benchmark on Modal
  train.py                 Multi-GPU DDP training on Modal
  preprocess_on_modal.py   Raw-data download + preprocess on Modal
  sync_train_data.py       Sync Protenix v1 bioassembly data into helico-train-data Volume
  upload_processed.py      One-shot upload of a local processed/ tree into the Volume

Build & Test Commands

  • Install: uv pip install -e ".[dev]"
  • Run all tests: uv run pytest
  • Run fast tests (skip CCD/seqres): uv run pytest -k "not CCD and not Seqres"
  • Run a single test: uv run pytest tests/test_model.py::TestTriangleOps::test_tri_mul_outgoing_shape -v
  • Train (synthetic): helico-train --synthetic --n-blocks 2 --n-diffusion-token-blocks 2 --max-steps 100

Architecture

  • The model lives in src/helico/model.py using PyTorch.
  • Target GPUs: H100 / B200 only. No other architectures.
  • Always use cuEquivariance kernels directly — no PyTorch-only fallback code paths.
  • Three cuEquivariance kernels are used: triangle_multiplicative_update, triangle_attention, attention_pair_bias.
  • Prioritize simplicity and single code paths over flexibility.
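To make the kernel list above concrete, here is a stdlib-only, shape-level sketch of the contraction at the heart of triangle_multiplicative_update for outgoing edges, as described in the AlphaFold papers. It omits the layer norms, gating, and output projection that surround the contraction in the real fused kernel, and the function name `tri_mul_outgoing` is illustrative rather than the repo's actual API.

```python
def tri_mul_outgoing(a, b):
    """Core contraction of the 'outgoing edges' triangle update on an
    n-by-n pair representation: out[i][j] = sum_k a[i][k] * b[j][k].
    (Norms, gating, and projections omitted; square inputs assumed.)"""
    n = len(a)
    return [
        [sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
```

In the repo this math is never hand-rolled like this; per the rule above, the cuEquivariance kernel is always called directly.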

Testing

  • Unit tests for all non-trivial functionality.
  • Always write full integration tests — never use stubs or mocks.
  • Tests run on GPU with bfloat16 precision.

Training Data

  • Data is hosted on HuggingFace at timodonnell/helico-data and auto-downloads to ~/.cache/helico/data/ on first use.
  • Download all data: helico-download (or helico-download --subset ccd-only for just the CCD cache)
  • Override default location with HELICO_DATA_DIR env var.
  • Preprocessing from raw data: helico-preprocess all <raw-dir> <processed-dir>
  • Generate CCD cache only: helico-preprocess ccd <raw-dir> <processed-dir>
  • See LOG.md for actual paths and commands used on our machines.
  • Processing follows the Boltz2 flow.
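The override described above can be sketched as a small resolver; this is an illustrative helper assuming the default cache path named earlier, not necessarily how the package implements it.

```python
import os
from pathlib import Path

def resolve_data_dir() -> Path:
    """Return the training-data directory, honoring the HELICO_DATA_DIR
    environment variable override; falls back to the default cache path.
    (Hypothetical helper; the package's actual logic may differ.)"""
    override = os.environ.get("HELICO_DATA_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "helico" / "data"
```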

Reference Material

Key papers and repos to be familiar with: