CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Also read AGENTS.md — hard rules for any AI agent working in this repo (e.g. no monkey-patching).

Project Overview

Helico is an AlphaFold3 clone built from scratch in PyTorch for experimentation.

Project Structure

src/helico/
  __init__.py    Package entry (exports Helico, HelicoConfig)
  model.py       All neural network modules in a single file
  data.py        Data pipeline (CCD, mmCIF, tokenizer, MSA, cropping)
  train.py       Training loop, DDP, checkpointing, inference
  bench.py       FoldBench benchmark scoring and local runner
tests/
  test_data.py   Integration tests for the data pipeline
  test_model.py  Integration tests for all model components
modal/
  ci.py                    CI tests on Modal
  bench.py                 Parallel FoldBench benchmark on Modal
  train.py                 Multi-GPU DDP training on Modal
  preprocess_on_modal.py   Raw-data download + preprocess on Modal
  sync_train_data.py       Sync Protenix v1 bioassembly data into helico-train-data Volume
  upload_processed.py      One-shot upload of a local processed/ tree into the Volume

Build & Test Commands

  • Install: uv pip install -e ".[dev]"
  • Run all tests: uv run pytest
  • Run fast tests (skip CCD/seqres): uv run pytest -k "not CCD and not Seqres"
  • Run a single test: uv run pytest tests/test_model.py::TestTriangleOps::test_tri_mul_outgoing_shape -v
  • Train (synthetic): helico-train --synthetic --n-blocks 2 --n-diffusion-token-blocks 2 --max-steps 100

Architecture

  • The model lives in src/helico/model.py using PyTorch.
  • Target GPUs: H100 / B200 only. No other architectures.
  • Always use cuEquivariance kernels directly — no PyTorch-only fallback code paths.
  • Three cuEquivariance kernels are used: triangle_multiplicative_update, triangle_attention, attention_pair_bias.
  • Prioritize simplicity and single code paths over flexibility.
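To make the kernel list above concrete, here is a stdlib-only, shape-level sketch of the contraction at the heart of triangle_multiplicative_update for outgoing edges, as described in the AlphaFold papers. It omits the layer norms, gating, and output projection that surround the contraction in the real fused kernel, and the function name `tri_mul_outgoing` is illustrative rather than the repo's actual API.

```python
def tri_mul_outgoing(a, b):
    """Core contraction of the 'outgoing edges' triangle update on an
    n-by-n pair representation: out[i][j] = sum_k a[i][k] * b[j][k].
    (Norms, gating, and projections omitted; square inputs assumed.)"""
    n = len(a)
    return [
        [sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
```

In the repo this math is never hand-rolled like this; per the rule above, the cuEquivariance kernel is always called directly.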

Testing

  • Unit tests for all non-trivial functionality.
  • Always write full integration tests — never use stubs or mocks.
  • Tests run on GPU with bfloat16 precision.

Training Data

  • Data is hosted on HuggingFace at timodonnell/helico-data and auto-downloads to ~/.cache/helico/data/ on first use.
  • Download all data: helico-download (or helico-download --subset ccd-only for just the CCD cache)
  • Override default location with HELICO_DATA_DIR env var.
  • Preprocessing from raw data: helico-preprocess all <raw-dir> <processed-dir>
  • Generate CCD cache only: helico-preprocess ccd <raw-dir> <processed-dir>
  • See LOG.md for actual paths and commands used on our machines.
  • Processing follows the Boltz2 flow.
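The override described above can be sketched as a small resolver; this is an illustrative helper assuming the default cache path named earlier, not necessarily how the package implements it.

```python
import os
from pathlib import Path

def resolve_data_dir() -> Path:
    """Return the training-data directory, honoring the HELICO_DATA_DIR
    environment variable override; falls back to the default cache path.
    (Hypothetical helper; the package's actual logic may differ.)"""
    override = os.environ.get("HELICO_DATA_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "helico" / "data"
```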

Reference Material

Key papers and repos to be familiar with: