Generalize MFU/FLOPs module across recipes with log_mfu training hook#1548
Conversation
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
📝 Walkthrough

This PR introduces Model FLOPs Utilization (MFU) tracking across multiple native training recipes (LLaMA3, ESM2, CodonFM, OpenGenome2). A new architecture-independent `flops.py` module with an `MFUTracker` class provides the shared FLOPs formulas and tracking, gated by a `log_mfu` config flag.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant TrainScript as Training Script
    participant MFUTracker as MFUTracker
    participant FlopsModule as FLOPs Compute
    participant GPU as GPU

    TrainScript->>MFUTracker: __init__(config, batch_size, seq_len, ...)
    MFUTracker->>FlopsModule: from_hf_config(model_config)
    FlopsModule-->>MFUTracker: ModelFLOPsConfig
    MFUTracker->>FlopsModule: compute_flops_analytical/simplified()
    FlopsModule-->>MFUTracker: total_training_flops
    MFUTracker->>GPU: detect_gpu_peak_tflops()
    GPU-->>MFUTracker: peak_tflops_bf16

    loop Each Optimizer Step
        TrainScript->>TrainScript: measure step_time (perf_counter)
        TrainScript->>MFUTracker: compute_mfu(step_time)
        MFUTracker->>MFUTracker: mfu = (total_flops / step_time / peak_tflops)
        MFUTracker-->>TrainScript: {mfu_percent, tflops_per_gpu}
        TrainScript->>TrainScript: log metrics
    end
```
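The per-step computation inside the loop above is simple arithmetic. A minimal sketch of that calculation (function name, inputs, and the example numbers are illustrative, not the PR's exact API):

```python
# Minimal MFU sketch: fraction of a GPU's peak BF16 throughput actually achieved.
# total_step_flops, peak_tflops, and num_gpus are illustrative inputs.
def compute_mfu(total_step_flops: float, step_time_s: float,
                peak_tflops: float, num_gpus: int = 1) -> dict:
    # Achieved throughput per GPU, in TFLOPS.
    achieved_tflops_per_gpu = total_step_flops / step_time_s / num_gpus / 1e12
    # MFU is the achieved fraction of the hardware's dense BF16 peak, in percent.
    mfu_percent = 100.0 * achieved_tflops_per_gpu / peak_tflops
    return {"mfu": mfu_percent, "tflops_per_gpu": achieved_tflops_per_gpu}

# Example: 2e15 FLOPs per optimizer step, 4 s/step, one GPU with a 989 TFLOPS
# dense BF16 peak (a hypothetical figure for illustration).
info = compute_mfu(2e15, 4.0, 989.0)
```

The division order matches the sequence diagram: total FLOPs over wall time gives achieved throughput, which is then normalized by the per-GPU peak.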
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```python
if rank == 0:
    print("Counting FLOPs with HF model (meta device)...")
hf_config_meta = LlamaConfig.from_pretrained(args.config_path)
hf_config_meta._attn_implementation = "eager"
```
This is going to be the slowest possible attention; can you use FA2 or SDPA?
This was from an older version of the PR; it no longer includes the HF comparison.
- Add compare_mfu_validate.py: golden value tests comparing CP vs non-CP execution for both TE and HF models, validating loss, logits (cosine sim > 0.99), and gradients (cosine sim > 0.8), following the pattern from models/llama3/tests/test_cp_bshd.py
- Fix silent RoPE bug in HF CP: position_ids were not passed to context_parallel buffers, causing each rank to auto-generate [0..S/CP-1] instead of the correct global positions
- Switch FLOPs counting from eager to SDPA attention for consistency with the actual training implementation (identical counts on meta device)
- Add exact FLOPs output (full integers with commas) alongside abbreviated values in both single-GPU and multi-GPU scripts
- Switch bandwidth measurement from all-reduce to all-gather for a more accurate pure data-movement measurement matching CP ring attention
- Add position_ids and max_length_q/k support to measure_step_time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Replaces all-gather with explicit send/recv between rank 0 and rank 1, matching CP ring attention's actual communication pattern. Measures 6.6 GB/s unidirectional on PCIe Gen 3 x8 (vs. 3.2 GB/s all-gather, 4.0 GB/s all-reduce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Adds subcommands for using the MFU utilities outside of full training scripts:

- gpu-info: Print GPU name, detected peak TFLOPS, and the known-GPU table
- flops: Compute FLOPs from a model config (README + first-principles)
- cp-comm: Estimate CP ring attention communication volume
- bandwidth: Measure unidirectional P2P bandwidth via send/recv (torchrun)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Add architecture-independent flops.py module (source: models/esm2/flops.py) supporting Llama (GQA+SwiGLU), ESM2, CodonFM (MHA+GELU), and Evo2 (Hyena).

Key additions:
- ModelFLOPsConfig dataclass with from_hf_config() auto-detection
- Analytical, simplified (README), and Hyena FLOPs formulas
- MFUTracker class for training script integration
- Communication overhead estimation (CP ring attention, FSDP)
- CLI: gpu-info, flops, cp-comm, bandwidth

Training integration (gated by log_mfu: false in hydra config):
- 11 training scripts across 4 native_te recipes
- llama3_native_te (3), esm2_native_te (5), codonfm_native_te (1), opengenome2_llama_native_te (2)

Removes superseded Llama-specific compare_mfu*.py benchmark scripts. Moves first_principles.md to models/esm2/ alongside the source module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Fix RTX 4090 (330→165) and RTX 3090 (142→71): were using sparse values
- Remove A5000, RTX 4090, RTX 3090 (not relevant for data-center workloads)
- Add B300 and GB300 Blackwell Ultra at 2,500 TFLOPS dense BF16
- Add authoritative NVIDIA source URLs for all GPU specs as comments

Sources: B300/GB300: https://www.nvidia.com/en-us/data-center/gb300-nvl72/ (360 PFLOPS BF16 sparse / 72 GPUs / 2 = 2,500 TFLOPS dense per GPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
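The dense-per-GPU derivation cited above is worth spelling out, since marketed rack-level numbers are BF16 sparse. A worked example of the same arithmetic (variable names are illustrative):

```python
# GB300 NVL72 rack figure, per the commit message: 360 PFLOPS BF16 sparse.
rack_pflops_sparse_bf16 = 360
gpus_per_rack = 72
sparsity_factor = 2  # 2:4 structured sparsity doubles the marketed number

per_gpu_sparse_tflops = rack_pflops_sparse_bf16 * 1000 / gpus_per_rack  # 5000.0
per_gpu_dense_tflops = per_gpu_sparse_tflops / sparsity_factor          # 2500.0
```

Dividing the sparse figure by two gives the dense BF16 number that MFU should be measured against, since training kernels do not use structured sparsity.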
Remove functions that were only used by the now-deleted compare_mfu*.py standalone benchmark scripts:

- compute_flops_first_principles(), compute_flops_readme() (backward-compat wrappers)
- measure_step_time(), split_for_cp_bshd(), count_flops_with_model(), cleanup_model()
- Unused imports: gc, nullcontext, FlopCounterMode

Retained: MFUTracker, from_hf_config, compute_flops_analytical, compute_flops_simplified, compute_flops_hyena, the CLI, and all utilities needed by the training-script log_mfu integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Move flops.py source from models/esm2/ to recipes/llama3_native_te/ (it's a training utility, not a model utility)
- Add test_flops.py with 27 tests covering config auto-detection, analytical/simplified/hyena formulas, MFUTracker, and CP comm estimation
- Update check_copied_files.py: source is now llama3_native_te, copied to the esm2, codonfm, and opengenome2 recipes (both flops.py and test_flops.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Document the log_mfu=true flag and flops.py CLI utilities in all 4 native_te recipe READMEs: llama3, esm2, codonfm, opengenome2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@coderabbitai full review

✅ Actions performed: Full review triggered.
/ok to test 27255a1
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (1)
108-120: ⚠️ Potential issue | 🟠 Major

Reset `last_step_time` on non-logging steps to prevent stale MFU inputs.

`self.last_step_time` is only updated inside the logging-frequency branch, so it can retain an old value. Since training code reads `perf_logger.last_step_time` each step, MFU can be computed with stale timing.

Proposed fix:

```diff
 with torch.no_grad():
+    self.last_step_time = None
     # FSDP2's clip_grad_norm_ returns a DTensor; convert to local tensor for torchmetrics compatibility.
     if isinstance(grad_norm, DTensor):
         grad_norm = grad_norm.to_local()
@@
     if step % self.logging_frequency == 0 and step > 0:
@@
         step_time = elapsed_time / self.logging_frequency
         self.last_step_time = step_time
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/esm2_native_te/perf_logger.py` around lines 108 - 120, The perf logger currently only updates self.last_step_time inside the logging branch guarded by self.logging_frequency, causing stale values on non-logging steps; update the step handler in perf_logger.py so that when a step is not a logging step you explicitly reset self.last_step_time (e.g., set to 0.0 or None) to avoid stale MFU inputs, keeping the existing logic that updates self.previous_step_time and last_step_time when (step % self.logging_frequency == 0).
🧹 Nitpick comments (3)
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py (1)
354-359: Same minor note as `train_fsdp2_cp.py`.

The `step_start_time` reset on line 359 happens unconditionally. The same optional refactor applies if you want consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py` around lines 354 - 359, The unconditional reset of step_start_time happens even when mfu_tracker is None; update the block around mfu_tracker so step_start_time is only reset when mfu_tracker is present (i.e., move the step_start_time = time.perf_counter() inside the if mfu_tracker is not None branch, after calling mfu_tracker.compute_mfu and optional dist_config.is_main_process()/logger.info) to match the optional flow used elsewhere (see mfu_tracker, compute_mfu, dist_config.is_main_process, logger.info).

bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py (1)

24-25: Consider using a more robust import approach.

The `sys.path.insert(0, ...)` pattern works but can be fragile. Since this test file will be copied across recipes, consider whether pytest's `conftest.py` or `pyproject.toml` configuration could provide a cleaner solution.

That said, this is a common pattern in recipe-style code and works correctly here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py` around lines 24 - 25, The test currently mutates sys.path via the sys.path.insert(0, ...) call in test_flops.py which is fragile; instead move that path-setup into a shared pytest configuration so tests don't each modify sys.path: create or update pytest's conftest.py to add the parent directory to sys.path (or better, configure package discovery via pyproject.toml/paths or install the package in editable mode) and remove the sys.path.insert(0, ...) line from test_flops.py so imports are resolved by pytest configuration rather than per-test mutation.

bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py (1)

362-367: Minor inconsistency: `step_start_time` reset occurs unconditionally.

The timer reset on line 367 happens even when `mfu_tracker is None`. While this has negligible performance impact, it's slightly inconsistent; either gate the entire timing block or make it explicit that timing is always active.

Also note this script uses manual `time.perf_counter()` timing while the ESM2 scripts use `perf_logger.last_step_time`. This difference is acceptable since different recipes may have different `PerfLogger` implementations, but it's worth documenting if intentional.

Optional: Gate the timing reset with the MFU tracker check:

```diff
         if mfu_tracker is not None:
             step_time = time.perf_counter() - step_start_time
             mfu_info = mfu_tracker.compute_mfu(step_time)
             if dist_config.is_main_process():
                 logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
-        step_start_time = time.perf_counter()
+            step_start_time = time.perf_counter()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py` around lines 362 - 367, The timing reset is currently unconditional; to be consistent, only reset step_start_time when mfu tracking is active. Inside the block that checks mfu_tracker is not None (the one that calls mfu_tracker.compute_mfu(...) and logs via logger.info when dist_config.is_main_process()), move the line that sets step_start_time = time.perf_counter() into that same guarded block (or explicitly comment why timing should remain unconditional). This references the mfu_tracker variable, compute_mfu method, step_start_time variable, dist_config.is_main_process(), and logger.info to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py`:
- Around line 24-41: The tests unsafely import the local flops.py via sys.path
and a plain "from flops import ...", causing module-cache collisions across
pytest sessions; modify the test to load the local flops.py under a unique
module name (e.g., using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 222-227: The MFU log omits the per-step duration even though
step_time is computed; update the logging call inside the mfu_tracker block
(where mfu_info = mfu_tracker.compute_mfu(step_time) and inside the
dist_config.is_main_process() check) to include step_time in the logger.info
message alongside mfu_info["mfu"] and mfu_info["tflops_per_gpu"] (e.g., add a
placeholder for step_time and pass step_time as an argument) so each MFU line
reports MFU, TFLOPS/GPU, and the per-step time.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py`:
- Around line 607-647: The CLI flag args.formula is unused: either honor it or
remove it; to fix, branch on args.formula (values "simplified" or "analytical")
before calling compute_flops_simplified and compute_flops_analytical so you only
compute and print the requested method. Specifically, use args.formula to decide
whether to call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
- Around line 453-483: In measure_bus_bandwidth, reject world sizes != 2 instead
of assuming peers by updating the initial guard (world_size) to raise an error
when world_size is not exactly 2; ensure you check world_size at the top of the
function and raise a ValueError (or similar) with a clear message referencing
the peer calculation using peer = 1 - rank so callers know only 2-process runs
are supported; keep the rest of the logic unchanged once the guard ensures rank
is 0 or 1.
- Around line 104-140: from_hf_config currently computes derived fields
(head_dim, has_lm_head) before kwargs.update(overrides), which allows overrides
like num_attention_heads or vocab_size to leave derived values stale; update the
function so overrides are applied before computing derived fields (or apply
kwargs.update(overrides) and then recompute head_dim = hidden_size //
num_attention_heads and has_lm_head = vocab_size > 0 afterwards), ensuring
derived values (head_dim, has_lm_head and any other fields dependent on
overrides) reflect the final values passed to ModelFLOPsConfig; keep references
to from_hf_config, kwargs.update(overrides), head_dim, has_lm_head,
num_attention_heads, vocab_size and ModelFLOPsConfig to locate the changes.
- Around line 69-75: detect_gpu_peak_tflops currently calls
torch.cuda.get_device_name(0) unguarded and will crash on CPU-only hosts; update
detect_gpu_peak_tflops to first check torch.cuda.is_available() and only call
torch.cuda.get_device_name(0) when CUDA is available, otherwise return (None,
"unknown") so callers like _cli_gpu_info and MFUTracker(..., peak_tflops=None)
safely receive a fallback; keep the GPU_PEAK_TFLOPS_BF16 lookup logic intact
when a device name is present.
- Around line 486-507: The load_model_config function currently hardcodes
trust_remote_code=True when calling AutoConfig.from_pretrained; change it to
accept explicit keyword-only parameters (e.g., *, trust_remote_code: bool =
False, revision: Optional[str] = None) so callers must opt-in to remote code
execution and can pin a revision, then pass those parameters into
AutoConfig.from_pretrained(config_path, trust_remote_code=trust_remote_code,
revision=revision) and update any call sites accordingly.
In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py`:
- Around line 151-157: The MFUTracker is being configured with per-micro-batch
FLOPs which underestimates utilization when gradient accumulation is used;
update the MFUTracker initialization (the MFUTracker(...) call) to account for
gradient accumulation by multiplying the effective batch_size (or otherwise
scaling FLOPs) by the gradient accumulation factor (use
args.gradient_accumulation_steps or args.grad_acc_steps as present in the
training args) so the tracker uses full-step work rather than a single
micro-batch, and apply the same correction to the related MFU/config
calculations around this block.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py`:
- Around line 187-193: The MFUTracker is using args.dataset.micro_batch_size as
the FLOPs batch size which ignores gradient accumulation and underestimates MFU;
update the mfu_tracker initialization (symbol: MFUTracker / variable:
mfu_tracker) to use the effective batch size = args.dataset.micro_batch_size *
args.grad_acc_steps (or otherwise multiply by grad_acc_steps) when setting the
batch_size parameter, and apply the same change wherever MFUTracker is
constructed (also the other block around the second occurrence).
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py`:
- Around line 165-170: MFU is undercounted because MFUTracker is initialized
with a single micro-batch; multiply the batch work by gradient accumulation
steps so FLOPs/ MFU reflect a full optimizer step. Update the MFUTracker
initialization (mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.
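The bandwidth finding above (lines 453-483) calls for rejecting world sizes other than 2 before computing the peer rank. A minimal sketch of that validation; the function name and message are illustrative, not the PR's actual code:

```python
def check_bandwidth_world_size(world_size: int, rank: int) -> int:
    """Return the peer rank for a point-to-point bandwidth test.

    The peer computation `peer = 1 - rank` only makes sense for exactly
    two processes, so reject any other world size up front.
    """
    if world_size != 2:
        raise ValueError(
            f"bandwidth measurement requires exactly 2 processes, got {world_size}"
        )
    return 1 - rank
```

With the guard in place, the rest of the send/recv logic can safely assume ranks 0 and 1 are the only participants.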
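The `from_hf_config` finding above is an ordering bug: derived fields must be computed after overrides are applied, or an override of `num_attention_heads` or `vocab_size` leaves `head_dim` and `has_lm_head` stale. A small self-contained sketch of the fixed ordering, with assumed field names standing in for the PR's `ModelFLOPsConfig`:

```python
from dataclasses import dataclass


@dataclass
class ModelFLOPsConfigSketch:
    # Illustrative stand-in for the PR's ModelFLOPsConfig; field names assumed.
    hidden_size: int
    num_attention_heads: int
    vocab_size: int
    head_dim: int
    has_lm_head: bool


def from_hf_config_sketch(hf_fields: dict, **overrides) -> ModelFLOPsConfigSketch:
    kwargs = dict(hf_fields)
    # Apply overrides FIRST, then compute derived fields, so the derived
    # values always reflect the final inputs.
    kwargs.update(overrides)
    kwargs["head_dim"] = kwargs["hidden_size"] // kwargs["num_attention_heads"]
    kwargs["has_lm_head"] = kwargs["vocab_size"] > 0
    return ModelFLOPsConfigSketch(**kwargs)


cfg = from_hf_config_sketch(
    {"hidden_size": 4096, "num_attention_heads": 32, "vocab_size": 128256},
    num_attention_heads=16,  # the override changes the derived head_dim too
)
```

With the buggy order (derive, then override), `cfg.head_dim` would stay at 4096 // 32 = 128 even though the head count was overridden to 16.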
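Several of the train-script findings above make the same point: with gradient accumulation, MFU must count the work of every micro-batch in the optimizer step, not just one. A sketch of the effective-batch scaling (names are illustrative, not the PR's API):

```python
# FLOPs per optimizer step must cover all micro-batches, not just one.
# flops_per_sample, micro_batch_size, and grad_acc_steps are illustrative names.
def flops_per_optimizer_step(flops_per_sample: float,
                             micro_batch_size: int,
                             grad_acc_steps: int) -> float:
    # The effective batch for one optimizer step spans all accumulation steps.
    effective_batch = micro_batch_size * grad_acc_steps
    return flops_per_sample * effective_batch


per_micro = flops_per_optimizer_step(1.0e12, 8, 1)  # no accumulation
per_step = flops_per_optimizer_step(1.0e12, 8, 4)   # 4 accumulation steps
```

Since the step timer in these scripts spans a full optimizer step, feeding per-micro-batch FLOPs into the tracker underestimates MFU by exactly the accumulation factor.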
📒 Files selected for processing (30)

- bionemo-recipes/recipes/codonfm_native_te/README.md
- bionemo-recipes/recipes/codonfm_native_te/flops.py
- bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py
- bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
- bionemo-recipes/recipes/esm2_native_te/README.md
- bionemo-recipes/recipes/esm2_native_te/flops.py
- bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/esm2_native_te/perf_logger.py
- bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py
- bionemo-recipes/recipes/esm2_native_te/train_ddp.py
- bionemo-recipes/recipes/esm2_native_te/train_ddp_cp.py
- bionemo-recipes/recipes/esm2_native_te/train_fsdp2.py
- bionemo-recipes/recipes/esm2_native_te/train_fsdp2_cp.py
- bionemo-recipes/recipes/esm2_native_te/train_mfsdp.py
- bionemo-recipes/recipes/llama3_native_te/README.md
- bionemo-recipes/recipes/llama3_native_te/flops.py
- bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/llama3_native_te/model_configs/meta-llama/Llama-3.1-8B/config.json
- bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py
- bionemo-recipes/recipes/llama3_native_te/train_ddp.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
- bionemo-recipes/recipes/opengenome2_llama_native_te/flops.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_flops.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py
- ci/scripts/check_copied_files.py
```python
import sys
from pathlib import Path

import pytest

# Add parent directory so we can import flops
sys.path.insert(0, str(Path(__file__).parent.parent))

from flops import (
    MFUTracker,
    ModelFLOPsConfig,
    compute_flops_analytical,
    compute_flops_hyena,
    compute_flops_simplified,
    estimate_cp_comm_bytes,
    from_hf_config,
)
```
🧩 Analysis chain
🏁 Scripts executed (outputs trimmed):

```shell
# Locate the test files and the flops.py copies
find . -name "test_flops.py" -type f | head -20
find . -name "flops.py" -type f | head -20

# Search for other test files using the same sys.path pattern
rg "sys\.path\.insert.*parent" --type py | head -20

# Verify what each test_flops.py imports from "flops"
for f in ./bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py \
         ./bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py \
         ./bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py; do
  echo "=== $f ==="
  sed -n '1,45p' "$f" | tail -20
done

# Check how pytest is configured (pytest.ini, pyproject.toml, conftest.py)
find . -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | grep -E "(pytest|pyproject)" | head -10
cat ./pytest.ini
rg "\[tool\.pytest" -A 10 ./pyproject.toml
find ./bionemo-recipes/recipes -name "conftest.py" -type f
ls -la ./bionemo-recipes/recipes/conftest.py 2>/dev/null || echo "No conftest.py at recipes level"
ls -la ./bionemo-recipes/conftest.py 2>/dev/null || echo "No conftest.py at bionemo-recipes level"

# Confirm the per-recipe flops.py copies are currently identical
wc -l ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py
diff ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py | head -50
```

```python
# Simulate pytest running both test_flops.py files in the same process
import sys
from pathlib import Path

print("=== Scenario: Both test_flops.py files run in same pytest process ===\n")

# Test 1: llama3_native_te/tests/test_flops.py imports flops
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/llama3_native_te")))
import flops as flops_v1
print(f"1. Module: {flops_v1.__file__}")

# Test 2: codonfm_native_te/tests/test_flops.py runs later in the SAME process
sys.path.pop(0)
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/codonfm_native_te")))
# This would normally do: from flops import MFUTracker, ...
# But since 'flops' is already cached in sys.modules, Python returns it as-is!
import flops as flops_v2
print(f"2. Module: {flops_v2.__file__}")
print(f"   Same module? {flops_v1 is flops_v2}")
print("\n   RESULT: the second test suite gets llama3's flops, not codonfm's!")
```
Import the local flops.py under a unique module name.
This sys.path + from flops import ... pattern is unsafe when multiple recipe test files run in the same pytest process: the first imported flops module stays cached in sys.modules, so later test suites can end up testing the wrong recipe implementation. While the flops.py files are currently identical (copied from llama3_native_te), this pattern creates a fragile dependency on that equivalence. Please fix this in the source test template and re-sync the copies.
🔧 One safe pattern
-import sys
+import importlib.util
from pathlib import Path
import pytest
-# Add parent directory so we can import flops
-sys.path.insert(0, str(Path(__file__).parent.parent))
-
-from flops import (
- MFUTracker,
- ModelFLOPsConfig,
- compute_flops_analytical,
- compute_flops_hyena,
- compute_flops_simplified,
- estimate_cp_comm_bytes,
- from_hf_config,
-)
+FLOPS_PATH = Path(__file__).parent.parent / "flops.py"
+SPEC = importlib.util.spec_from_file_location(f"{__name__}_flops", FLOPS_PATH)
+assert SPEC is not None and SPEC.loader is not None
+flops = importlib.util.module_from_spec(SPEC)
+SPEC.loader.exec_module(flops)
+
+MFUTracker = flops.MFUTracker
+ModelFLOPsConfig = flops.ModelFLOPsConfig
+compute_flops_analytical = flops.compute_flops_analytical
+compute_flops_hyena = flops.compute_flops_hyena
+compute_flops_simplified = flops.compute_flops_simplified
+estimate_cp_comm_bytes = flops.estimate_cp_comm_bytes
+from_hf_config = flops.from_hf_config

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py` around lines
24 - 41, The tests unsafely import the local flops.py via sys.path and a plain
"from flops import ...", causing module-cache collisions across pytest sessions;
modify the test to load the local flops.py under a unique module name (e.g.,
using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.
```python
if mfu_tracker is not None:
    step_time = time.perf_counter() - step_start_time
    mfu_info = mfu_tracker.compute_mfu(step_time)
    if dist_config.is_main_process():
        logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
    step_start_time = time.perf_counter()
```
Include step_time in the MFU log line for complete per-step telemetry.
You already compute step_time on Line 223, but the emitted message omits it.
Proposed fix
if mfu_tracker is not None:
step_time = time.perf_counter() - step_start_time
mfu_info = mfu_tracker.compute_mfu(step_time)
if dist_config.is_main_process():
- logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
+ logger.info(
+ "Step %d MFU: %.1f%% (%.2f TFLOPS/GPU, step_time=%.3fs)",
+ step,
+ mfu_info["mfu"],
+ mfu_info["tflops_per_gpu"],
+ step_time,
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py` around lines 222 -
227, The MFU log omits the per-step duration even though step_time is computed;
update the logging call inside the mfu_tracker block (where mfu_info =
mfu_tracker.compute_mfu(step_time) and inside the dist_config.is_main_process()
check) to include step_time in the logger.info message alongside mfu_info["mfu"]
and mfu_info["tflops_per_gpu"] (e.g., add a placeholder for step_time and pass
step_time as an argument) so each MFU line reports MFU, TFLOPS/GPU, and the
per-step time.
```python
def detect_gpu_peak_tflops():
    """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
    device_name = torch.cuda.get_device_name(0)
    for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
        if gpu_key.lower() in device_name.lower():
            return tflops, device_name
    return None, device_name
```
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "flops.py" | grep -E "(llama|flops)"

Repository: NVIDIA/bionemo-framework
Length of output: 285
🏁 Script executed:
cd bionemo-recipes && find . -type f -name "flops.py" 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework
Length of output: 221
🏁 Script executed:
cat -n ./recipes/llama3_native_te/flops.py

Repository: NVIDIA/bionemo-framework
Length of output: 135
🏁 Script executed:
cat -n ./bionemo-recipes/recipes/llama3_native_te/flops.py

Repository: NVIDIA/bionemo-framework
Length of output: 32632
Guard CUDA-name lookup before touching torch.cuda.
The function calls torch.cuda.get_device_name(0) without checking whether CUDA is available, causing it to fail on CPU-only hosts instead of falling back to "unknown". _cli_gpu_info() and MFUTracker(..., peak_tflops=None) will crash in non-CUDA environments. Check torch.cuda.is_available() (per PyTorch documentation) before invoking CUDA device queries.
Minimal fix
def detect_gpu_peak_tflops():
"""Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
+ if not torch.cuda.is_available():
+ return None, "unknown"
device_name = torch.cuda.get_device_name(0)
for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
if gpu_key.lower() in device_name.lower():
return tflops, device_name
    return None, device_name

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def detect_gpu_peak_tflops():
    """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
    if not torch.cuda.is_available():
        return None, "unknown"
    device_name = torch.cuda.get_device_name(0)
    for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
        if gpu_key.lower() in device_name.lower():
            return tflops, device_name
    return None, device_name
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 69 - 75,
detect_gpu_peak_tflops currently calls torch.cuda.get_device_name(0) unguarded
and will crash on CPU-only hosts; update detect_gpu_peak_tflops to first check
torch.cuda.is_available() and only call torch.cuda.get_device_name(0) when CUDA
is available, otherwise return (None, "unknown") so callers like _cli_gpu_info
and MFUTracker(..., peak_tflops=None) safely receive a fallback; keep the
GPU_PEAK_TFLOPS_BF16 lookup logic intact when a device name is present.
```python
def from_hf_config(config_dict, **overrides):
    """Create ModelFLOPsConfig from an HF-compatible config dict.

    Auto-detects architecture:
    - GQA vs MHA: from num_key_value_heads (absent = MHA)
    - Gated MLP (3 proj) vs standard FFN (2 proj): from model_type
    - LM head: from vocab_size > 0

    Args:
        config_dict: Dict with standard HF config keys (hidden_size, num_hidden_layers, etc.).
            Works with config.json dicts, config.to_dict(), or MODEL_PRESETS dicts.
        **overrides: Explicit overrides for any field (e.g., num_mlp_projections=3).
    """
    h = config_dict["hidden_size"]
    n_heads = config_dict["num_attention_heads"]
    n_kv = config_dict.get("num_key_value_heads", n_heads)
    vocab = config_dict.get("vocab_size", 0)
    model_type = config_dict.get("model_type", "")

    # Detect gated MLP (3 projections) vs standard FFN (2 projections).
    # Llama/Mistral/Qwen use SwiGLU (gate + up + down = 3 projections).
    # ESM2/CodonFM/Geneformer/BERT use standard FFN (up + down = 2 projections).
    num_mlp_proj = 3 if model_type in GATED_MLP_MODEL_TYPES else 2

    kwargs = {
        "hidden_size": h,
        "num_hidden_layers": config_dict["num_hidden_layers"],
        "num_attention_heads": n_heads,
        "num_kv_heads": n_kv,
        "head_dim": h // n_heads,
        "intermediate_size": config_dict["intermediate_size"],
        "num_mlp_projections": num_mlp_proj,
        "vocab_size": vocab,
        "has_lm_head": vocab > 0,
    }
    kwargs.update(overrides)
    return ModelFLOPsConfig(**kwargs)
```
Recompute derived fields after applying overrides.
kwargs.update(overrides) can leave head_dim and has_lm_head inconsistent with the overridden values. For example, vocab_size=0 still keeps the precomputed has_lm_head unless the caller knows to override both, and overriding num_attention_heads leaves the old head_dim.
♻️ Suggested fix
def from_hf_config(config_dict, **overrides):
@@
- kwargs = {
- "hidden_size": h,
- "num_hidden_layers": config_dict["num_hidden_layers"],
- "num_attention_heads": n_heads,
- "num_kv_heads": n_kv,
- "head_dim": h // n_heads,
- "intermediate_size": config_dict["intermediate_size"],
- "num_mlp_projections": num_mlp_proj,
- "vocab_size": vocab,
- "has_lm_head": vocab > 0,
- }
- kwargs.update(overrides)
- return ModelFLOPsConfig(**kwargs)
+ kwargs = {
+ "hidden_size": h,
+ "num_hidden_layers": config_dict["num_hidden_layers"],
+ "num_attention_heads": n_heads,
+ "num_kv_heads": n_kv,
+ "intermediate_size": config_dict["intermediate_size"],
+ "num_mlp_projections": num_mlp_proj,
+ "vocab_size": vocab,
+ }
+ kwargs.update(overrides)
+ kwargs.setdefault("head_dim", kwargs["hidden_size"] // kwargs["num_attention_heads"])
+ kwargs.setdefault("has_lm_head", kwargs["vocab_size"] > 0)
+ return ModelFLOPsConfig(**kwargs)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 104 - 140,
from_hf_config currently computes derived fields (head_dim, has_lm_head) before
kwargs.update(overrides), which allows overrides like num_attention_heads or
vocab_size to leave derived values stale; update the function so overrides are
applied before computing derived fields (or apply kwargs.update(overrides) and
then recompute head_dim = hidden_size // num_attention_heads and has_lm_head =
vocab_size > 0 afterwards), ensuring derived values (head_dim, has_lm_head and
any other fields dependent on overrides) reflect the final values passed to
ModelFLOPsConfig; keep references to from_hf_config, kwargs.update(overrides),
head_dim, has_lm_head, num_attention_heads, vocab_size and ModelFLOPsConfig to
locate the changes.
```python
def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
    """Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
    if world_size <= 1:
        return 0.0

    rank = dist.get_rank()
    tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
    peer = 1 - rank

    for _ in range(5):
        if rank == 0:
            dist.send(tensor, dst=peer)
            dist.recv(tensor, src=peer)
        else:
            dist.recv(tensor, src=peer)
            dist.send(tensor, dst=peer)
    torch.cuda.synchronize()

    dist.barrier()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_iters):
        if rank == 0:
            dist.send(tensor, dst=peer)
        else:
            dist.recv(tensor, src=peer)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    data_bytes = tensor.nelement() * tensor.element_size()
    return num_iters * data_bytes / elapsed / 1e9
```
Reject world sizes other than 2 in the bandwidth helper.
peer = 1 - rank only works for ranks 0 and 1. If this gets launched with more than two processes, ranks >= 2 compute invalid peers and can hang the send/recv loop. Please fail fast here instead of assuming the caller followed the docstring.
🔒 Suggested guard
def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
"""Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
if world_size <= 1:
return 0.0
+ if world_size != 2:
+ raise ValueError("measure_bus_bandwidth currently supports exactly 2 ranks")
rank = dist.get_rank()
tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
- peer = 1 - rank
+ peer = 1 if rank == 0 else 0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 453 - 483, In
measure_bus_bandwidth, reject world sizes != 2 instead of assuming peers by
updating the initial guard (world_size) to raise an error when world_size is not
exactly 2; ensure you check world_size at the top of the function and raise a
ValueError (or similar) with a clear message referencing the peer calculation
using peer = 1 - rank so callers know only 2-process runs are supported; keep
the rest of the logic unchanged once the guard ensures rank is 0 or 1.
```python
def load_model_config(config_path):
    """Load model config dict from a local path or HuggingFace model ID.

    Supports:
    - Local directory: ./model_configs/lingua-1B (reads config.json inside)
    - Local file: ./model_configs/lingua-1B/config.json
    - HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
    """
    import json
    from pathlib import Path

    path = Path(config_path)
    if path.is_dir():
        path = path / "config.json"
    if path.exists():
        return json.loads(path.read_text())

    # Fall back to HuggingFace Hub
    from transformers import AutoConfig

    hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
    return hf_config.to_dict()
```
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the file
find . -type f -name "flops.py" | head -20

Repository: NVIDIA/bionemo-framework
Length of output: 285
🏁 Script executed:
# Also, let's search for the specific function
rg "def load_model_config" -A 25

Repository: NVIDIA/bionemo-framework
Length of output: 8820
🏁 Script executed:
# Check if there are any calls to load_model_config in the codebase
rg "load_model_config\(" --type py

Repository: NVIDIA/bionemo-framework
Length of output: 1391
🏁 Script executed:
# Verify the actual line numbers in the target file
sed -n '486,507p' bionemo-recipes/recipes/llama3_native_te/flops.py | cat -n

Repository: NVIDIA/bionemo-framework
Length of output: 947
🏁 Script executed:
# Check that the docstring and function signature comply with line length (119 chars)
head -n 510 bionemo-recipes/recipes/llama3_native_te/flops.py | tail -n 25 | cat -n

Repository: NVIDIA/bionemo-framework
Length of output: 1050
Remove hardcoded trust_remote_code=True and add explicit opt-in parameters.
AutoConfig.from_pretrained(config_path, trust_remote_code=True) unconditionally enables execution of arbitrary code from the Hugging Face Hub. Per Hugging Face documentation, trust_remote_code defaults to False and should only be enabled as an explicit opt-in for trusted sources. Add keyword-only parameters to make this configurable and allow revision pinning:
Suggested fix
-def load_model_config(config_path):
+def load_model_config(config_path, *, trust_remote_code=False, revision=None):
"""Load model config dict from a local path or HuggingFace model ID.
Supports:
- Local directory: ./model_configs/lingua-1B (reads config.json inside)
- Local file: ./model_configs/lingua-1B/config.json
- HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
"""
@@ -494,5 +494,7 @@ def load_model_config(config_path):
from transformers import AutoConfig
- hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
+ hf_config = AutoConfig.from_pretrained(
+ config_path,
+ trust_remote_code=trust_remote_code,
+ revision=revision,
+ )
    return hf_config.to_dict()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 486 - 507,
The load_model_config function currently hardcodes trust_remote_code=True when
calling AutoConfig.from_pretrained; change it to accept explicit keyword-only
parameters (e.g., *, trust_remote_code: bool = False, revision: Optional[str] =
None) so callers must opt-in to remote code execution and can pin a revision,
then pass those parameters into AutoConfig.from_pretrained(config_path,
trust_remote_code=trust_remote_code, revision=revision) and update any call
sites accordingly.
```python
parser = argparse.ArgumentParser()
parser.add_argument("command")
parser.add_argument("--config-path", default="./model_configs/lingua-1B")
parser.add_argument("--batch-size", type=int, default=1)
parser.add_argument("--seq-len", type=int, default=4096)
parser.add_argument("--formula", default="analytical", choices=["analytical", "simplified"])
parser.add_argument("--num-gpus", type=int, default=1, help="Number of GPUs for per-GPU FLOPs and comm estimates")
parser.add_argument("--cp-size", type=int, default=1, help="Context parallelism size for comm overhead estimate")
parser.add_argument("--p2p-bw", type=float, default=6.0, help="P2P bandwidth in GB/s for comm time estimate")
args = parser.parse_args()

cfg_dict = load_model_config(args.config_path)
config = from_hf_config(cfg_dict)
b, s = args.batch_size, args.seq_len

print(
    f"Config: H={config.hidden_size}, L={config.num_hidden_layers},"
    f" n_heads={config.num_attention_heads}, n_kv={config.num_kv_heads},"
    f" I={config.intermediate_size}, V={config.vocab_size}"
)
print(
    f"MLP: {config.num_mlp_projections} projections"
    f" ({'SwiGLU/GLU' if config.num_mlp_projections == 3 else 'standard FFN'})"
)
print(f"Batch: B={b}, S={s}, GPUs={args.num_gpus}, CP={args.cp_size}")
print()

simplified = compute_flops_simplified(b, s, config.hidden_size, config.num_hidden_layers, config.vocab_size)
analytical, breakdown, lm_head = compute_flops_analytical(config, b, s)

print(f"{'Method':<24} {'FLOPs/step':>14} {'Per-GPU':>14} {'Exact':>30}")
print("-" * 86)
for name, flops in [("Simplified (README)", simplified), ("Analytical", analytical)]:
    per_gpu = flops // max(args.num_gpus, 1)
    print(f"{name:<24} {format_flops(flops):>14} {format_flops(per_gpu):>14} {format_flops_exact(flops):>30}")

if simplified != analytical:
    diff = analytical - simplified
    print(f"\nDifference: {format_flops_exact(diff)} ({diff / simplified * 100:+.2f}%)")
else:
    print("\nFormulas agree exactly for this config.")
```
--formula is currently a no-op.
The parser accepts --formula, but this code always computes and prints both simplified and analytical FLOPs and never branches on args.formula. Either honor the flag or remove it, otherwise the CLI advertises behavior it does not implement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 607 - 647,
The CLI flag args.formula is unused: either honor it or remove it; to fix,
branch on args.formula (values "simplified" or "analytical") before calling
compute_flops_simplified and compute_flops_analytical so you only compute and
print the requested method. Specifically, use args.formula to decide whether to
call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
```python
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
```
DDP MFU is underestimated under gradient accumulation.
Line 206 measures full-step duration, but Line 153 configures FLOPs for a single micro-batch.
This biases MFU low by grad_acc_steps.
Suggested fix
mfu_tracker = MFUTracker(
config=from_hf_config(config.to_dict()),
- batch_size=args.dataset.micro_batch_size,
+ batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
seq_len=args.dataset.max_seq_length,
num_gpus=dist_config.world_size,
parallelism={"dp": dist_config.world_size},
)

Also applies to: 205-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py` around lines 151 -
157, The MFUTracker is being configured with per-micro-batch FLOPs which
underestimates utilization when gradient accumulation is used; update the
MFUTracker initialization (the MFUTracker(...) call) to account for gradient
accumulation by multiplying the effective batch_size (or otherwise scaling
FLOPs) by the gradient accumulation factor (use args.gradient_accumulation_steps
or args.grad_acc_steps as present in the training args) so the tracker uses
full-step work rather than a single micro-batch, and apply the same correction
to the related MFU/config calculations around this block.
```python
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
)
```
MFU math misses gradient accumulation scaling.
Line 244 times a full optimizer step, but Line 189 sets FLOPs batch size to one micro-batch.
With grad_acc_steps > 1, MFU is underestimated.
Suggested fix
mfu_tracker = MFUTracker(
config=from_hf_config(config.to_dict()),
- batch_size=args.dataset.micro_batch_size,
+ batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
seq_len=args.dataset.max_seq_length,
num_gpus=dist_config.world_size,
parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
)

Also applies to: 243-248
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py` around lines 187
- 193, The MFUTracker is using args.dataset.micro_batch_size as the FLOPs batch
size which ignores gradient accumulation and underestimates MFU; update the
mfu_tracker initialization (symbol: MFUTracker / variable: mfu_tracker) to use
the effective batch size = args.dataset.micro_batch_size * args.grad_acc_steps
(or otherwise multiply by grad_acc_steps) when setting the batch_size parameter,
and apply the same change wherever MFUTracker is constructed (also the other
block around the second occurrence).
```python
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
```
MFU is underreported when grad_acc_steps > 1.
Line 219 computes wall time for a full optimizer step, but Lines 167-170 initialize FLOPs with a single micro-batch.
That underestimates FLOPs/MFU by a factor of grad_acc_steps.
Suggested fix
mfu_tracker = MFUTracker(
config=from_hf_config(config.to_dict()),
- batch_size=args.dataset.micro_batch_size,
+ batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
seq_len=args.dataset.max_seq_length,
num_gpus=dist_config.world_size,
parallelism={"dp": dist_config.world_size},
)

Also applies to: 218-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py` around lines 165 -
170, MFU is undercounted because MFUTracker is initialized with a single
micro-batch; multiply the batch work by gradient accumulation steps so FLOPs/
MFU reflect a full optimizer step. Update the MFUTracker initialization
(mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.
```python
    "bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py",
],
# FLOPs / MFU module - synced across recipes
"bionemo-recipes/recipes/llama3_native_te/flops.py": [
```
do we have to copy these to each recipe? it's just to calculate flops? maybe we just put it in one and point folks at it.
This should maybe be in the models/ definition though? is this just printing out a total flops for a given model size?
flops.py has 2 usage paths:
- FLOPs counting for a given model config, plus comms bandwidth/overhead calculation, via the CLI
- use in the recipes' train scripts with config `log_mfu=true` to log MFU with each step

so it doesn't go purely with the model defs.
Addressed. flops.py is gone entirely. The FLOPs math is now a ~30-line helper (_compute_per_token_flops) inlined at the top of each recipe's existing perf_logger.py, so there's no new copied-files target. MFU metrics flow through the existing torchmetrics → WANDB path (train/mfu_pct, train/tflops_per_gpu).
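For reference, the general shape of such an inlined helper might look like the sketch below (illustrative only, not the recipe's actual `_compute_per_token_flops`; the parameter names, the MHA-only projection count, and the 3x fwd+bwd factor are assumptions):

```python
def per_token_train_flops(hidden_size, num_layers, intermediate_size, seq_len,
                          vocab_size=0, num_mlp_projections=2, fwd_bwd_factor=3):
    """Approximate training FLOPs per token for a dense transformer.

    Counts the big matmuls only: Q/K/V/O projections, attention
    scores/values (the S^2 cost expressed per token as ~4*S*H), the MLP,
    and an optional LM head. Backward pass is folded in via fwd_bwd_factor.
    """
    h = hidden_size
    attn_proj = 4 * 2 * h * h                      # Q, K, V, O (MHA; no GQA here)
    attn_scores = 2 * 2 * seq_len * h              # QK^T and attn @ V, per token
    mlp = num_mlp_projections * 2 * h * intermediate_size
    lm_head = 2 * h * vocab_size
    return fwd_bwd_factor * (num_layers * (attn_proj + attn_scores + mlp) + lm_head)
```

Multiplying this by the per-rank token count each optimizer step then yields per-step FLOPs without any extra DP/CP/grad-acc bookkeeping.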
trvachov
left a comment
Across all the added code, I think you shouldn't use batch_size in isolation as a variable name, or at least if you do make it explicit that in that function it doesn't matter whether it's micro or global batch. Otherwise, use micro_batch_size or global_batch_size, because it gets very confusing if the distributed flops math is right without this.
The comm overhead calculations here have a number of issues including assumptions about model precision and also about FSDP2 communication patterns. Adding comm overhead calculations is a reasonable idea but I think it should be a separate PR. Each PR should be one-idea -> one PR.
There is a flops.py CLI, but I ran torchrun --nproc_per_node=2 flops.py bandwidth as written in the docstring and it just hangs. This was on 8xA100 80GB PCIe, pytorch:26.02-py3 container. Does this work for you?
Regarding overall code structure --> indeed there is a lot of duplicated code, but that is also to some degree the philosophy of recipes. @pstjohn Can you help with some guidance here? Should flops.py really be in perf_logger.py to reduce duplicate files? Also I think the FLOPS metrics should likely be recorded in WANDB just like all of the other metrics we have.
```python
    16*H^2 for 2 MLP projections with I=4H.
    """
    b, s, h = batch_size, seq_len, hidden_size
    return (24 * b * s * h * h + 4 * b * s * s * h) * (3 * num_layers) + (6 * b * s * h * vocab_size)
```
Does this work with gradient accumulation?
Fixed. The old code took a static batch_size and multiplied once per optimizer step, which under-reported MFU by grad_acc_steps. The new code uses the num_unpadded_tokens counter already accumulated across all grad-acc micro-batches in PerfLogger.log_micro_step() (same counter that drives tokens_per_second_per_gpu), so per-step FLOPs scales with the full optimizer step's token volume regardless of grad_acc setting. For esm2 (no grad-acc in this recipe) the current batch's unpadded tokens are used directly.
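As a sanity check on that claim, here is a sketch of the counter-based scheme (class and method names are illustrative, not the actual `PerfLogger` internals):

```python
class TokenCountingMFU:
    """Accumulate per-rank token counts across grad-acc micro-batches,
    then convert to FLOPs/MFU once per optimizer step."""

    def __init__(self, per_token_flops, peak_tflops):
        self.per_token_flops = per_token_flops
        self.peak_tflops = peak_tflops
        self.num_unpadded_tokens = 0

    def log_micro_step(self, attention_mask_sum):
        # Called once per micro-batch, so any grad_acc_steps setting is covered.
        self.num_unpadded_tokens += attention_mask_sum

    def log_optimizer_step(self, step_time_s):
        flops_per_step = self.per_token_flops * self.num_unpadded_tokens
        tflops_per_gpu = flops_per_step / step_time_s / 1e12
        mfu_pct = 100 * tflops_per_gpu / self.peak_tflops
        self.num_unpadded_tokens = 0  # reset for the next optimizer step
        return {"tflops_per_gpu": tflops_per_gpu, "mfu_pct": mfu_pct}
```

Because the counter sums over all micro-batches between optimizer steps, per-step FLOPs automatically scales with grad_acc_steps.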
```python
    Returns:
        Dict with comm_bytes, estimated_comm_time, comm_pct.
    """
    bw = measured_bw_gbps or 6.0  # Default ~PCIe Gen3 x8
```
Why PCIe gen3? Isn't that >15 years old?
Also -- have you done the measurement on PCIe Gen3? I ask because, if this 6.0 is a "datasheet number", it might be very hard to measure especially through pytorch because of launch overhead. You might need a very specialized environment to measure max-PCIe speeds.
Removed along with the comm-overhead code. Will revisit in the follow-up comm-overhead PR with a more realistic default (or with a measurement-based approach).
A couple general thoughts:
@trvachov @pstjohn — ready for re-review. I took your feedback and substantially reworked the PR.

TL;DR

Every concern you raised is addressed, and the PR is now considerably smaller and simpler:

What changed vs. the previous revision

Deleted

Added / consolidated
The MFU formula

Simple, and the same in all 4 perf_loggers:

```
flops_per_step = _per_token_flops * num_unpadded_tokens_on_this_rank
tflops_per_gpu = flops_per_step / step_time / 1e12
mfu_pct        = tflops_per_gpu / peak_tflops * 100
```

Why this form works everywhere without DP/CP/grad-acc factoring:
So: no per-strategy factoring.

Training script change per recipe (this is the whole diff)

```diff
-perf_logger = PerfLogger(dist_config, args)
+perf_logger = PerfLogger(
+    dist_config,
+    args,
+    model_config_dict=config.to_dict() if args.get("log_mfu", False) else None,
+)
```

Plus deletion of the 10-line MFU-tracking block from each script.

Point-by-point

Individual replies are on each inline thread; consolidated here:
Verified

Usage (unchanged from the outside)

```
torchrun --nproc_per_node=2 train_fsdp2.py log_mfu=true
```

Appears in WANDB as `train/mfu_pct` and `train/tflops_per_gpu`.

Thanks for the thorough review — the PR is significantly better for it.
Reworks MFU tracking per reviewer feedback on #1548:

- Delete per-recipe flops.py, test_flops.py, and the CLI entirely
- Inline ~30-line FLOPs helper into each recipe's existing perf_logger.py
- MFU metrics (train/tflops_per_gpu, train/mfu_pct) flow through the existing torchmetrics -> WANDB path, respecting logging_frequency
- Drop comm-overhead estimation; will be a separate future PR

The new formula is per_token_flops(seq_len) * num_unpadded_tokens_on_rank. The unpadded-tokens counter (already used by tokens_per_second_per_gpu) is per-rank after DP/CP sharding and accumulated across grad-acc micro-batches, so the formula works uniformly across DDP/FSDP2/FSDP2+CP/DDP+CP/mFSDP and across BSHD and THD (sequence packing) with no per-strategy factors.

Net: -3000 / +300 lines. Training scripts lose all MFU scaffolding; the only change per script is one extra kwarg on the PerfLogger constructor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
/ok to test e6468fc
The prior implementation used `self.num_unpadded_tokens` (attention_mask.sum())
as the token count in the MFU formula, which produces an "effective-work MFU"
that under-reports vs published numbers when the batch has padding.
Switch to `self.num_tokens` (input_ids.numel()) to match the PaLM /
Megatron-LM / MosaicML llm-foundry / torchtitan convention of counting the
configured token shape the hardware actually runs matmuls over, regardless
of attention masking. For fully-packed THD (no inter-sequence padding) the
two counters are equivalent; for BSHD with padding, this restores
standard MFU semantics.
References for the convention:
- PaLM (Chowdhery et al. 2022, App. B): "tokens per second ... peak
throughput R = P / (6N + 12LHQT)" where T is the configured/padded
sequence length.
- Megatron-LM num_floating_point_operations() uses global batch *
args.seq_length (configured/padded).
- MosaicML llm-foundry and torchtitan both use configured seq_len.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
trvachov
left a comment
I had a prior comment to replace batch_size with micro_batch_size or global_batch_size to make it explicit, but it seems like this code doesn't use the concept of batch_size at all and somehow does FLOPs calculations using effective sequence length as the core input. Why do this? It would make sense to me if this simplified things, but I think it obfuscates it more. e.g. I think it still doesn't get the right FLOPs number for THD sequence packing and makes the FLOPs counting code rather tricky to understand.
For non-sequence-packed attention, the inputs are the max seq length per sequence and micro batch size * DDP ranks.
For sequence-packed attention, the inputs are the max seq length across all the sequences and the boundaries of the sequences * DDP ranks.
Then for both of the above, there are true tokens and padded tokens. I think making this explicit (e.g. MFU_padded and MFU_unpadded) in the token counting would make this very clear for the user and be a good recipe starting point.
As it stands, the first round of this PR had batch_size as a concept, which was a little confusing because it didn't distinguish between micro vs global batch size, and now we have total sequence length as a concept, but a) that is miscalculating the THD FLOPs and b) it's somewhat inconsistent with our prior variable definitions, e.g. in this README here https://github.com/NVIDIA/bionemo-framework/tree/trvachov/mixtral-recipes-repush/bionemo-recipes/recipes/llama3_native_te
What do you think?
| First-principles matmul count: Q/K/V/O projections (GQA-aware), attention | ||
| logits/values (the S^2 cost expressed per-token as 4*S*H), 2-or-3-projection | ||
| MLP (SwiGLU detected via model_type), and LM head. The returned value is | ||
| multiplied by the actual unpadded token count at log time, so it naturally |
This behavior is very strange for two reasons:
- This docstring documents behavior outside of this function, which is just a weird thing to do code-wise. Why should this function document code that lives outside of it?
- I don't think what is being said here is correct -- "returned value is multiplied by the actual unpadded token count"... I don't think this is actually happening, and even if it was, I'm not sure that's correct.
| self._per_token_flops = 0 | ||
| self._peak_tflops: float | None = None | ||
| if self._log_mfu: | ||
| self._per_token_flops = _compute_per_token_flops(model_config_dict, args.dataset.max_seq_length) |
A few issues here:
- self._per_token_flops isn't being multiplied by any sort of padded vs unpadded sequence factor, as is documented in the _compute_per_token_flops docstring.
- Even if it was doing the above, that would simply make a correction for BSHD padding. I think this would be totally incorrect for THD sequence packing, where the attention matrix size is not O(max_seq_length^2) but rather a sum of the individual cu_seqlens fragments squared.
Here's some docs that explain this https://docs.nvidia.com/nemo-framework/user-guide/25.09/sft_peft/packed_sequence.html <--- these docs are a little old but probably right -- worth checking with TE team. @jomitchellnv @pstjohn is my understanding correct that TE does a sum of little squares for THD sequence packing?
re: 2. Yes, it's true that this doesn't use the sum of the squares of each packed sequence, but that was an intentional choice, since the formula in the Llama3 native te recipe used for benchmarking doesn't either (it uses B*S^2). I considered doing that, but it would require tracking inserted into the training scripts to pass around cu_seq_lens (or its sum of squares). I assumed we'd prefer simplicity over modifying the training scripts further (from Peter's previous comments); should I add that in?
| head_dim = h // n_heads | ||
| kv_dim = n_kv * head_dim | ||
| ffn = model_config_dict["intermediate_size"] | ||
| vocab = model_config_dict.get("vocab_size", 0) |
should this be vocab_size, or padded_vocab_size?
I don't think this is a big correction, because usually the num_layers term is much bigger than this one, but it's nonetheless worth getting right.
Good catch — fixed in b979eed.
The LM-head matmul does run at padded_vocab_size (ESM-2: 33 → 64 for FP8/tensor-core-friendly shapes); logits get sliced back to vocab_size only after the compute happens (modeling_esm_te.py:585), so the FLOPs are real.
Scoped the correction to the hardware-view metric only:
- `mfu_pct` / `tflops_per_gpu` (useful work): keeps raw `vocab_size`.
- `mfu_padded_pct` / `tflops_per_gpu_padded` (hardware view): uses `padded_vocab_size` when present.
Applied uniformly across all four recipes' perf_logger.py via a use_padded_vocab=True kwarg on _compute_non_attn_per_token_flops. For llama3 / og2 / codonfm the configs don't set padded_vocab_size, so the fallback (or vocab_size) makes the two values identical and nothing changes in those recipes — only ESM-2 sees any delta, and it's small (≲0.01 pp of total FLOPs on 3B/15B, up to ~0.3 pp on the tiny sanity-check model) as you predicted.
Previous formula baked `4·S·H` attention per token at init-time using
`args.dataset.max_seq_length`, then multiplied by num_tokens at log time.
For a packed THD batch of total length S with docs L₁, L₂, …, Lₙ, this
yields `4·H·S²` attention per rank — overstating work by `S²/Σ(Lᵢ²)`
because Flash-Attention actually computes one Lᵢ×Lᵢ score matrix per
packed segment.
Split per-token FLOPs into non-attention (QKV/MLP/LM-head) and attention
coefficient, then at each logging window:
flops = non_attn_per_token · num_tokens + coeff · Σ(Lᵢ²) / cp_size
`Σ(Lᵢ²)` is accumulated across grad-acc micro-batches from
`cu_seq_lens_q` (real per-doc lengths) in log_micro_step, falling back
to `cu_seq_lens_q_padded` or `B·S²` from the batch shape for BSHD. The
algebraic identity `non_attn + coeff·S ≡ _compute_per_token_flops(cfg, S)`
is locked in by a unit test in every recipe, so BSHD and
synthetic-single-doc-THD benchmarks produce identical numbers to the
legacy formula.
Also fixes ESM-2's latent grad-acc undercount: its PerfLogger previously
had no log_micro_step and read num_tokens from only the last micro-batch
at log time, silently dividing reported FLOPs by grad_acc_steps if
grad-acc were ever enabled. Retrofitted with the llama3/og2/codonfm
accumulator pattern: all five ESM-2 training scripts (train_{fsdp2,
fsdp2_cp,ddp,ddp_cp,mfsdp}.py) now call log_micro_step inside the inner
loop and log_step(step, grad_norm, lr) after the optimizer step.
Applied identically to llama3_native_te, opengenome2_llama_native_te,
codonfm_native_te, esm2_native_te per the per-recipe isolation rule.
Six invariant unit tests land in each recipe's test_perf_logger.py
(algebraic identity, BSHD no-op, THD single-doc/multi-doc/CP division,
unpadded-preferred, padded-fallback) plus ESM-2-specific grad-acc
accumulator tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
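The two-term formula in this commit message can be sketched in a few lines (function names are hypothetical; the recipe's real helper lives in each perf_logger.py):

```python
# Illustrative reconstruction of the split FLOPs formula described above:
# non-attention work scales with the token count, attention work scales with
# the sum of squared per-document lengths taken from a cu_seqlens prefix array.

def sum_sq_lengths(cu_seq_lens):
    """Sigma(L_i^2) from a cu_seqlens-style prefix array, e.g. [0, 3, 8] -> 9 + 25."""
    return sum((b - a) ** 2 for a, b in zip(cu_seq_lens, cu_seq_lens[1:]))

def window_flops(non_attn_per_token, attn_coeff, num_tokens, cu_seq_lens, cp_size=1):
    """FLOPs for one logging window: non_attn * tokens + coeff * Sigma(L_i^2) / cp."""
    return non_attn_per_token * num_tokens + attn_coeff * sum_sq_lengths(cu_seq_lens) / cp_size
```

For a single packed doc of length S, Sigma(L_i^2) == S^2, matching the legacy per-token formula; for two docs of length S/2 the attention term halves, which is the THD correction the commit describes.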
Real-data benchmark runs (llama3 dclm + og2 metagenome at cp=8 dp=1
S=131072) surfaced a bug in _attn_work_from_batch's BSHD branch.
ContextParallelDataLoaderWrapper pre-splits the sequence along the CP
axis before the batch reaches log_micro_step, so batch["input_ids"].shape
per rank is (B, S/cp), not (B, S). The helper computed B*(S/cp)² from
the per-rank shape, and log_step divided by cp_size, yielding per-rank
attention FLOPs of B*S²/cp³ — wrong by cp².
Correct per-rank attention work is B*S²/cp (Q_rank @ K_full costs
(S/cp)*S per head summed).
Evidence on og2 BSHD cp=8 S=131072 with a full-length doc: reported
10.9% MFU / 246 TFLOPS/GPU, but the actual compute corresponds to
~60% MFU (matches the THD run with the same S on the same data).
Fix: pass cp_size into _attn_work_from_batch and multiply the BSHD
synthesis by cp_size². The helper now always returns a GLOBAL
(pre-shard) Σ(Lᵢ²), so the /cp_size in log_step consistently gives
per-rank attention regardless of whether the input is cu_seq_lens_q
(already global) or per-rank shape (needs the cp² correction).
THD path unaffected (cu_seq_lens_q was always global). BSHD without
CP unaffected (cp_size=1 makes the multiplier 1). Existing published
benchmarks in ci/benchmarks/{og2,esm2}_benchmarks.md used synthetic
cu_seq_lens=[0, S] and therefore flowed through the THD path — they
remain valid.
Added test_bshd_cp_correction to all four recipes asserting both
the BSHD+CP scaling and the THD invariance under cp_size variation.
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
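A minimal sketch of the fix (names hypothetical, assuming the commit's description): the helper always returns a GLOBAL pre-shard Sigma(L_i^2), so the BSHD synthesis from the per-rank shape needs the cp^2 factor before the caller's single division by cp_size.

```python
# Sketch of _attn_work_from_batch's two paths after the fix described above.

def attn_work_global(batch_shape=None, cu_seq_lens=None, cp_size=1):
    if cu_seq_lens is not None:
        # THD path: cu_seq_lens is already a global (pre-shard) quantity.
        return sum((b - a) ** 2 for a, b in zip(cu_seq_lens, cu_seq_lens[1:]))
    # BSHD path: the CP dataloader pre-splits, so the per-rank shape is (B, S/cp).
    b, s_per_rank = batch_shape
    return b * (s_per_rank * cp_size) ** 2  # reconstructs B * S^2 (global)

# Caller then divides by cp_size, giving per-rank B*S^2/cp.
# Without the cp^2 multiplier it would be B*(S/cp)^2/cp = B*S^2/cp^3, wrong by cp^2.
```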
The memory reporting used torch.cuda.memory_allocated() (current) at log time, so both the "max" and "mean" metrics reflected the post-step resting footprint. Under FSDP that massively undercounts the transient peak: params get all-gathered briefly during forward/backward and activations balloon before backward frees them, but by the time log_step runs they're already released.

Switch to torch.cuda.max_memory_allocated() (tracks the high-water mark of live allocations) paired with torch.cuda.reset_peak_memory_stats() so each logging window reports its own peak rather than a running max since process start. Both are host-side counter ops: no sync, no kernel launch.

The "mean" metric keeps using the post-step current value as a "resting footprint" indicator (useful to see how much memory is held between steps), while "max" now reflects what the hardware actually peaks at. These are the values that belong in the benchmark tables.

Applied identically to all four MFU-tracking recipes.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
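Why sampling the current allocation undercounts can be shown with a toy high-water-mark counter (no CUDA needed; this merely mirrors the semantics of torch.cuda.max_memory_allocated() plus reset_peak_memory_stats()):

```python
# Toy illustration of peak vs current memory tracking per logging window.

class PeakCounter:
    def __init__(self):
        self.current = 0  # analogous to torch.cuda.memory_allocated()
        self.peak = 0     # analogous to torch.cuda.max_memory_allocated()

    def alloc(self, n):
        self.current += n
        self.peak = max(self.peak, self.current)

    def free(self, n):
        self.current -= n

    def reset_peak(self):  # analogous to torch.cuda.reset_peak_memory_stats()
        self.peak = self.current

mem = PeakCounter()
mem.alloc(10)   # resident params
mem.alloc(90)   # transient: all-gathered shards + activations
mem.free(90)    # released before log_step runs
# At log time: current == 10 (resting footprint), peak == 100 (the real high-water mark).
```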
Current reporting mixes conventions: non-attention flops count every slot
(padded, PaLM convention), but attention flops come from cu_seq_lens_q
which drops both CP-zigzag padding and BSHD row padding. For BSHD with
dynamic-padded short docs this made the attention term tiny while
non-attention counted all the pad positions — an inconsistent hybrid that
is neither pure useful work nor pure hardware view.
Split into two parallel metrics:
train/mfu_pct / tflops_per_gpu
Useful-work rate. Counts only real tokens:
non_attn uses num_unpadded_tokens (attention_mask.sum())
attn uses Sigma(L_i^2) from cu_seq_lens_q (THD), or per-row
attention_mask.sum(dim=-1) (BSHD). Excludes CP zigzag
divisibility pad and BSHD row pad.
train/mfu_padded_pct / tflops_per_gpu_padded
Hardware view. Counts all slots the GPU actually burns cycles on:
non_attn uses input_ids.numel()
attn uses Sigma(L_i^2) from cu_seq_lens_q_padded (THD) or
full batch shape (BSHD). Includes both padding types.
For synthetic single-doc-per-pack THD runs (where Sigma L_i^2 == S^2 and
there is no padding of either kind), unpadded == padded and the two
metrics report identical values; published benchmarks in ci/benchmarks
remain internally consistent either way.
Added _attn_work_from_batch(include_padding=bool) kwarg. The function
still returns a pre-CP-shard global quantity; the caller divides by
cp_size in log_step. BSHD per-row real lengths come from attention_mask;
note in docstring that for BSHD+CP this is exact under evenly distributed
padding (common case) and may slightly underestimate when padding is
concentrated on one end.
Added two unit tests per recipe (test_include_padding_thd,
test_include_padding_bshd_with_attention_mask) pinning the split.
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
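The unpadded/padded split above can be sketched as one helper with an include_padding switch (a hypothetical reconstruction, not the recipe's exact code; lists stand in for tensors):

```python
# Sketch of _attn_work_from_batch with the include_padding kwarg described above.

def attn_work(cu_seq_lens=None, cu_seq_lens_padded=None,
              attention_mask=None, include_padding=False):
    def sq(cu):
        return sum((b - a) ** 2 for a, b in zip(cu, cu[1:]))
    if include_padding:
        if cu_seq_lens_padded is not None:
            return sq(cu_seq_lens_padded)          # THD hardware view
        return sum(len(row) ** 2 for row in attention_mask)   # full BSHD shape
    if cu_seq_lens is not None:
        return sq(cu_seq_lens)                     # THD useful work
    return sum(sum(row) ** 2 for row in attention_mask)       # BSHD per-row real lengths

mask = [[1, 1, 1, 0], [1, 1, 0, 0]]  # B=2, S=4, real lengths 3 and 2
assert attn_work(attention_mask=mask) == 9 + 4              # useful work
assert attn_work(attention_mask=mask, include_padding=True) == 16 + 16  # hardware view
```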
The single-term per-token FLOP function was kept around after the
two-term split (non_attn + coeff·Σ(Lᵢ²)) was introduced, solely so
old unit tests and the startup log line could still reference it.
Nothing in the runtime path was using it — log_step had already
switched to the split formula. Remove the function, the
self._per_token_flops attribute, and the test scaffolding that
existed only to pin the split against the legacy reference:
- _compute_per_token_flops(cfg, seq_len) deleted from all 4 recipes
- self._per_token_flops init + assignment + log-line mention removed
- logger.info startup format no longer includes "per-token FLOPs=%e"
- TestComputePerTokenFlops test class deleted (llama3 only)
- test_algebraic_identity deleted (its whole purpose was to pin
non_attn + coeff·S against the legacy function)
- test_bshd_no_op simplified to test_bshd_shape_synthesis — keeps
the Σ(Lᵢ²)=B·S² shape-synthesis check, drops the legacy comparison
- Docstrings no longer reference the old function
The tests that exercise actual formula correctness
(test_thd_multi_doc_uses_squared_sum, test_cp_size_divides_attention_only,
test_bshd_cp_correction, test_include_padding_*, etc.) all stay — they
verify behavior directly without going through the legacy.
Net: -336 / +40 lines. Pure dead-code removal, no runtime change.
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Reflect the modern two-pair metric layout (unpadded useful-work vs padded hardware view) and the peak-memory reporting fix. Applied identically to all four MFU-tracking recipes. Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
For configs with padded_vocab_size set (ESM-2: 33→64 for FP8/tensor-core friendliness), the LM-head matmul physically runs at padded width and the logits are sliced back afterward. Count the padded width in the hardware-view metric (mfu_padded_pct, tflops_per_gpu_padded) while continuing to count raw vocab_size in the useful-work metric (mfu_pct, tflops_per_gpu). For configs without padded_vocab_size (llama3, og2, codonfm) the two values collapse and nothing changes.

Addresses review feedback from @trvachov on PR #1548.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
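The size of the correction is easy to eyeball with a per-token LM-head count (illustrative numbers; the hidden size below is an assumption, not taken from the ESM-2 config):

```python
# Per-token forward FLOPs of the LM-head matmul: one (H x V) projection,
# counted as 2*H*V (multiply + add).

def lm_head_flops_per_token(hidden, vocab):
    return 2 * hidden * vocab

h = 2560                                    # ESM-2-3B-ish hidden size (assumed)
useful = lm_head_flops_per_token(h, 33)     # raw ESM-2 vocab_size
hardware = lm_head_flops_per_token(h, 64)   # padded_vocab_size actually run
# hardware ~2x useful for this term, but both are dwarfed by the
# num_layers * transformer-block terms, hence the <=0.3 pp delta quoted above.
```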
Description
Adds an architecture-independent FLOPs counting and MFU tracking module (flops.py) that can be used by any native_te recipe. The module auto-detects model architecture (GQA vs MHA, SwiGLU vs standard FFN) from HuggingFace config dicts and supports analytical, simplified, and Hyena (Evo2) FLOPs formulas.
MFU tracking is integrated into all 11 training scripts across 4 recipes, gated behind a `log_mfu: false` Hydra flag.
Some MFU results: https://linear.app/nvidia/issue/BIO-406/mfu-calculation-experiment#comment-50296297
Usage
During training:
CLI tool:
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
An `/ok to test` comment on the pull request will trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from code rabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
New Features
`log_mfu=true` parameter.
Documentation
`flops.py` CLI commands for performance profiling and GPU analysis.
Tests