
Generalize MFU/FLOPs module across recipes with log_mfu training hook#1548

Open
gagank1 wants to merge 22 commits into main from gkaushik/mfu_experiment

Conversation

@gagank1
Collaborator

@gagank1 gagank1 commented Apr 4, 2026

Description

Adds an architecture-independent FLOPs counting and MFU tracking module (flops.py) that can be used by any native_te recipe. The module auto-detects model architecture (GQA vs MHA, SwiGLU vs standard FFN) from HuggingFace config dicts and supports analytical, simplified, and Hyena (Evo2) FLOPs formulas.
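For intuition, the architecture auto-detection can be illustrated with a minimal sketch. Field names follow HuggingFace config conventions; the real `from_hf_config()` in flops.py may use additional heuristics, so `detect_architecture` below is a hypothetical stand-in:

```python
# Minimal sketch of architecture auto-detection from an HF config dict.
# detect_architecture is a hypothetical helper, not the flops.py API.

def detect_architecture(cfg: dict) -> dict:
    n_heads = cfg["num_attention_heads"]
    # GQA: fewer KV heads than query heads; MHA when equal (or unspecified).
    n_kv = cfg.get("num_key_value_heads", n_heads)
    is_gqa = n_kv < n_heads
    # SwiGLU (gated FFN) is signalled by a gated activation such as "silu".
    is_swiglu = cfg.get("hidden_act", "gelu") in ("silu", "swiglu")
    return {"gqa": is_gqa, "swiglu": is_swiglu}

# Llama-3.1-8B-style config: 32 query heads, 8 KV heads, SiLU-gated FFN
llama = {"num_attention_heads": 32, "num_key_value_heads": 8, "hidden_act": "silu"}
print(detect_architecture(llama))  # {'gqa': True, 'swiglu': True}
```

An ESM2-style config (GELU, no `num_key_value_heads`) would come back as MHA with a standard FFN under the same heuristic.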

MFU tracking is integrated into all 11 training scripts across 4 recipes, gated behind a log_mfu: false Hydra flag.

Some MFU results: https://linear.app/nvidia/issue/BIO-406/mfu-calculation-experiment#comment-50296297

Usage

During training:

  # Any training script — just add log_mfu=true
  torchrun --nproc_per_node=2 train_fsdp2_cp.py --config-name L2_lingua_1b log_mfu=true

Output during training:

  [__main__][INFO] - MFU: 60.1% (93.19 TFLOPS/GPU)
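The logged percentage is achieved throughput over the GPU's peak. A minimal sketch of the arithmetic, with numbers chosen to reproduce the log line above (the 155 TFLOPS/GPU peak is illustrative and not tied to a particular device; `mfu_percent` is a stand-in, not the flops.py API):

```python
def mfu_percent(flops_per_step: float, step_time_s: float,
                num_gpus: int, peak_tflops_per_gpu: float) -> tuple[float, float]:
    """Return (MFU %, achieved TFLOPS per GPU) for one optimizer step."""
    achieved = flops_per_step / step_time_s / num_gpus / 1e12
    return 100 * achieved / peak_tflops_per_gpu, achieved

# Illustrative: 2 GPUs, 1 s step, assumed 155 TFLOPS bf16 peak per GPU
mfu, tflops = mfu_percent(flops_per_step=1.8638e14, step_time_s=1.0,
                          num_gpus=2, peak_tflops_per_gpu=155.0)
print(f"MFU: {mfu:.1f}% ({tflops:.2f} TFLOPS/GPU)")  # MFU: 60.1% (93.19 TFLOPS/GPU)
```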

CLI tool:

  # Show GPU and peak bf16 TFLOPS                                                                                         
  python flops.py gpu-info                                                                                                
                                                                  
  # Compute FLOPs for any model config (auto-detects architecture)                                                        
  python flops.py flops --config-path ./model_configs/lingua-1B
  python flops.py flops --config-path nvidia/esm2_t36_3B_UR50D   # from HF Hub                                            
                                                                  
  # Estimate CP ring attention communication overhead                                                                     
  python flops.py flops --config-path ./model_configs/lingua-1B --num-gpus 2 --cp-size 2 --seq-len 16384
                                                                                                                          
  # Measure P2P GPU bandwidth (matches CP ring attention pattern)                                                         
  torchrun --nproc_per_node=2 flops.py bandwidth                                                                          
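For intuition on the CP communication estimate, ring attention roughly forwards each rank's K/V shard around the ring once per hop, per layer. The sketch below captures that first-order volume under stated assumptions (bf16, forward pass only); the actual estimate in flops.py may account for more terms:

```python
def cp_ring_kv_bytes(seq_len: int, num_kv_heads: int, head_dim: int,
                     num_layers: int, cp_size: int, dtype_bytes: int = 2) -> int:
    """Rough per-rank bytes sent for K and V shards in ring attention (forward only)."""
    shard_tokens = seq_len // cp_size
    kv_shard = 2 * shard_tokens * num_kv_heads * head_dim * dtype_bytes  # K + V
    # Each rank forwards its shard cp_size - 1 hops around the ring, per layer.
    return kv_shard * (cp_size - 1) * num_layers

# Llama-3.1-8B-like shape: 8 KV heads, head_dim 128, 32 layers, seq 16384, cp=2, bf16
print(cp_ring_kv_bytes(16384, 8, 128, 32, 2) / 1e9, "GB per rank")
```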

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks). This label can be used to enforce running all framework tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on a pull request with one of these commands:

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

  • New Features

    • Added Model FLOPs Utilization (MFU) tracking to monitor per-step compute efficiency during training via log_mfu=true parameter.
    • Added FLOPs computation utilities with CLI tools for GPU peak performance detection, bandwidth measurement, and context-parallel communication overhead estimation.
  • Documentation

    • Added "MFU Tracking" guides with usage examples for all recipe READMEs.
    • Documented flops.py CLI commands for performance profiling and GPU analysis.
  • Tests

    • Added comprehensive test coverage for FLOPs calculations and MFU tracking functionality.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 self-assigned this Apr 4, 2026
@copy-pr-bot

copy-pr-bot Bot commented Apr 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 4, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f6683868-66cd-40bf-b18c-3df559516d31

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR introduces Model FLOPs Utilization (MFU) tracking across multiple native training recipes (LLaMA3, ESM2, CodonFM, OpenGenome2). A new flops.py module provides architecture-independent FLOPs computation, peak TFLOPS detection, and communication overhead estimation. Training scripts are updated to conditionally log MFU metrics per optimizer step. Configuration and documentation are updated to support the new feature, and comprehensive test coverage is added.

Changes

Cohort / File(s) Summary
FLOPs Core Module
llama3_native_te/flops.py, esm2_native_te/flops.py, codonfm_native_te/flops.py, opengenome2_llama_native_te/flops.py
New ~714-line module providing FLOPs computation (analytical, simplified, Hyena variants), MFU tracking, GPU peak TFLOPS detection, CP ring-attention and DP communication overhead estimation, model config loading from HF/local, bandwidth measurement, and CLI commands (gpu-info, flops, cp-comm, bandwidth).
Training Script MFU Integration
llama3_native_te/train_ddp.py, llama3_native_te/train_fsdp2.py, llama3_native_te/train_fsdp2_cp.py, esm2_native_te/train_ddp.py, esm2_native_te/train_ddp_cp.py, esm2_native_te/train_fsdp2.py, esm2_native_te/train_fsdp2_cp.py, esm2_native_te/train_mfsdp.py, codonfm_native_te/train_fsdp2.py, opengenome2_llama_native_te/train_fsdp2.py, opengenome2_llama_native_te/train_fsdp2_cp.py
Adds optional MFU tracking gated by args.log_mfu: initializes MFUTracker from model config, measures per-step elapsed time via time.perf_counter(), computes and logs MFU percentage and per-GPU TFLOPS on main process each optimizer step. ~23--24 lines per file.
Configuration Defaults
llama3_native_te/hydra_config/defaults.yaml, esm2_native_te/hydra_config/defaults.yaml, codonfm_native_te/hydra_config/defaults.yaml, opengenome2_llama_native_te/hydra_config/defaults.yaml
Added log_mfu: false configuration parameter to enable MFU logging via CLI override.
FLOPs Test Coverage
llama3_native_te/tests/test_flops.py, esm2_native_te/tests/test_flops.py, codonfm_native_te/tests/test_flops.py, opengenome2_llama_native_te/tests/test_flops.py
New ~317--323 line test modules validating architecture auto-detection, analytical/simplified/Hyena FLOPs formulas, MFU calculations, communication estimation, and multi-GPU scaling with golden-value assertions.
Documentation
llama3_native_te/README.md, esm2_native_te/README.md, codonfm_native_te/README.md, opengenome2_llama_native_te/README.md
Added "MFU Tracking" section documenting log_mfu=true CLI usage, logged metrics (MFU %, TFLOPS/GPU, step time), and CLI examples for flops.py commands (gpu-info, FLOPs computation, CP communication, bandwidth measurement).
Performance Logger & Metadata
esm2_native_te/perf_logger.py
Added last_step_time instance attribute (initialized to None, updated in log_step) to track per-logged-step timing for MFU computation.
Model Configuration
llama3_native_te/model_configs/meta-llama/Llama-3.1-8B/config.json
New Llama-3.1-8B config with architecture parameters (4096 hidden, 32 layers, 32 attention heads, 8 KV heads, 14336 intermediate, 131072 max positions), rope scaling, and transformers version metadata.
CI/CD Synchronization
ci/scripts/check_copied_files.py
Added source-to-destination mappings for flops.py and tests/test_flops.py from llama3_native_te to esm2_native_te, codonfm_native_te, and opengenome2_llama_native_te for consistency verification.
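The per-step hook summarized above follows a simple pattern: time the step with `time.perf_counter()`, hand the elapsed time to the tracker, and log on the main process only when the `log_mfu` gate is on. The sketch below mirrors that flow with a stub tracker; `StubTracker` and `run_steps` are illustrative stand-ins, not the recipes' actual code:

```python
import time

class StubTracker:
    """Stand-in for MFUTracker's compute_mfu(step_time) interface."""
    def __init__(self, flops_per_step, num_gpus, peak_tflops):
        self.flops_per_step = flops_per_step
        self.num_gpus = num_gpus
        self.peak_tflops = peak_tflops

    def compute_mfu(self, step_time):
        tflops = self.flops_per_step / step_time / self.num_gpus / 1e12
        return {"mfu": 100 * tflops / self.peak_tflops, "tflops_per_gpu": tflops}

def run_steps(num_steps, mfu_tracker=None, is_main_process=True):
    logs = []
    step_start = time.perf_counter()
    for _ in range(num_steps):
        time.sleep(0.01)  # stands in for forward/backward/optimizer.step()
        if mfu_tracker is not None:           # gated behind the log_mfu flag
            step_time = time.perf_counter() - step_start
            info = mfu_tracker.compute_mfu(step_time)
            if is_main_process:               # log on rank 0 only
                logs.append(f"MFU: {info['mfu']:.1f}% "
                            f"({info['tflops_per_gpu']:.2f} TFLOPS/GPU)")
            step_start = time.perf_counter()  # reset the timer for the next step
    return logs

print(run_steps(2, StubTracker(flops_per_step=1e12, num_gpus=1, peak_tflops=100.0)))
```

With `mfu_tracker=None` (the default `log_mfu: false` path), the loop does no timing bookkeeping beyond the initial `perf_counter()` call.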

Sequence Diagram(s)

sequenceDiagram
    participant TrainScript as Training Script
    participant MFUTracker as MFUTracker
    participant FlopsModule as FLOPs Compute
    participant GPU as GPU

    TrainScript->>MFUTracker: __init__(config, batch_size, seq_len, ...)
    MFUTracker->>FlopsModule: from_hf_config(model_config)
    FlopsModule-->>MFUTracker: ModelFLOPsConfig
    MFUTracker->>FlopsModule: compute_flops_analytical/simplified()
    FlopsModule-->>MFUTracker: total_training_flops
    MFUTracker->>GPU: detect_gpu_peak_tflops()
    GPU-->>MFUTracker: peak_tflops_bf16

    loop Each Optimizer Step
        TrainScript->>TrainScript: measure step_time (perf_counter)
        TrainScript->>MFUTracker: compute_mfu(step_time)
        MFUTracker->>MFUTracker: mfu = (total_flops / step_time / peak_tflops)
        MFUTracker-->>TrainScript: {mfu_percent, tflops_per_gpu}
        TrainScript->>TrainScript: log metrics
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • jstjohn
  • jwilber
  • cspades
  • dorotat-nv
  • trvachov
  • jomitchellnv
  • savitha-eng

Poem

🐰 Hops through training with metrics so bright,
FLOPs counted with analytical might,
MFU tracking each step of the way,
GPU peaks detected, efficiency on display! 🎯⚡

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 59.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main addition of MFU/FLOPs utilities with the log_mfu training hook, which aligns with the primary objective of the changeset across multiple recipes.
Description check ✅ Passed PR description is comprehensive, providing detailed motivation, usage examples, and change summary aligned with template structure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

if rank == 0:
    print("Counting FLOPs with HF model (meta device)...")
hf_config_meta = LlamaConfig.from_pretrained(args.config_path)
hf_config_meta._attn_implementation = "eager"
Collaborator


This is going to be the slowest possible attention, can you use fa2 or sdpa?

Collaborator Author


This was the old version of the PR, no longer includes HF comparison

gagank1 and others added 11 commits April 6, 2026 17:59
- Add compare_mfu_validate.py: golden value tests comparing CP vs non-CP
  execution for both TE and HF models, validating loss, logits (cosine
  sim > 0.99), and gradients (cosine sim > 0.8) following the pattern
  from models/llama3/tests/test_cp_bshd.py
- Fix silent RoPE bug in HF CP: position_ids were not passed to
  context_parallel buffers, causing each rank to auto-generate [0..S/CP-1]
  instead of correct global positions
- Switch FLOPs counting from eager to SDPA attention for consistency with
  the actual training implementation (identical counts on meta device)
- Add exact FLOPs output (full integers with commas) alongside abbreviated
  values in both single-GPU and multi-GPU scripts
- Switch bandwidth measurement from all-reduce to all-gather for more
  accurate pure data movement measurement matching CP ring attention
- Add position_ids and max_length_q/k support to measure_step_time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Replaces all-gather with explicit send/recv between rank 0 and rank 1,
matching CP ring attention's actual communication pattern. Measures 6.6 GB/s
unidirectional on PCIe Gen 3 x8 (vs 3.2 GB/s all-gather, 4.0 GB/s all-reduce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
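The reported bandwidth is simply payload size over transfer time; a minimal sketch of the arithmetic (the 1 GiB payload and 0.163 s timing below are illustrative values chosen to land near the measured 6.6 GB/s, not actual measurements):

```python
def unidirectional_bandwidth_gbps(payload_bytes: int, elapsed_s: float) -> float:
    """Bandwidth in GB/s for a one-way send/recv of payload_bytes."""
    return payload_bytes / elapsed_s / 1e9

# Illustrative: a 1 GiB tensor sent rank 0 -> rank 1 in 0.163 s
print(round(unidirectional_bandwidth_gbps(1 << 30, 0.163), 1))  # 6.6
```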
Adds subcommands for using MFU utilities outside of full training scripts:
  - gpu-info:   Print GPU name, detected peak TFLOPS, and known GPU table
  - flops:      Compute FLOPs from a model config (README + first-principles)
  - cp-comm:    Estimate CP ring attention communication volume
  - bandwidth:  Measure unidirectional P2P bandwidth via send/recv (torchrun)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Add architecture-independent flops.py module (source: models/esm2/flops.py)
supporting Llama (GQA+SwiGLU), ESM2, CodonFM (MHA+GELU), and Evo2 (Hyena).

Key additions:
- ModelFLOPsConfig dataclass with from_hf_config() auto-detection
- Analytical, simplified (README), and Hyena FLOPs formulas
- MFUTracker class for training script integration
- Communication overhead estimation (CP ring attention, FSDP)
- CLI: gpu-info, flops, cp-comm, bandwidth

Training integration (gated by log_mfu: false in hydra config):
- 11 training scripts across 4 native_te recipes
- llama3_native_te (3), esm2_native_te (5), codonfm_native_te (1),
  opengenome2_llama_native_te (2)

Removes superseded Llama-specific compare_mfu*.py benchmark scripts.
Moves first_principles.md to models/esm2/ alongside the source module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Fix RTX 4090 (330→165) and RTX 3090 (142→71): were using sparse values
- Remove A5000, RTX 4090, RTX 3090 (not relevant for data center workloads)
- Add B300 and GB300 Blackwell Ultra at 2,500 TFLOPS dense BF16
- Add authoritative NVIDIA source URLs for all GPU specs as comments

Sources:
  B300/GB300: https://www.nvidia.com/en-us/data-center/gb300-nvl72/
  (360 PFLOPS BF16 sparse / 72 GPUs / 2 = 2,500 TFLOPS dense per GPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
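The per-GPU dense figure derived above can be sanity-checked in a couple of lines:

```python
# GB300 NVL72: 360 PFLOPS sparse BF16 rack-scale, 72 GPUs, 2:1 sparse-to-dense
# (figures from the commit message's NVIDIA source). PFLOPS -> TFLOPS: x1000.
per_gpu_dense_tflops = 360_000 / 72 / 2
print(per_gpu_dense_tflops)  # 2500.0
```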
Remove functions that were only used by the now-deleted compare_mfu*.py
standalone benchmark scripts:
- compute_flops_first_principles(), compute_flops_readme() (backward compat wrappers)
- measure_step_time(), split_for_cp_bshd(), count_flops_with_model(), cleanup_model()
- Unused imports: gc, nullcontext, FlopCounterMode

Retained: MFUTracker, from_hf_config, compute_flops_analytical,
compute_flops_simplified, compute_flops_hyena, CLI, and all utilities
needed by the training script log_mfu integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Move flops.py source from models/esm2/ to recipes/llama3_native_te/
  (it's a training utility, not a model utility)
- Add test_flops.py with 27 tests covering config auto-detection,
  analytical/simplified/hyena formulas, MFUTracker, and CP comm estimation
- Update check_copied_files.py: source is now llama3_native_te, copied to
  esm2, codonfm, opengenome2 recipes (both flops.py and test_flops.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Document the log_mfu=true flag and flops.py CLI utilities in all 4
native_te recipe READMEs: llama3, esm2, codonfm, opengenome2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 changed the title from "Recipes SOL Perf Baselines" to "Recipes MFU calculation utilities" Apr 13, 2026
@gagank1 gagank1 marked this pull request as ready for review April 13, 2026 20:35
@gagank1 gagank1 enabled auto-merge April 13, 2026 20:36
@gagank1
Collaborator Author

gagank1 commented Apr 13, 2026

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 13, 2026

✅ Actions performed

Full review triggered.

@gagank1
Collaborator Author

gagank1 commented Apr 13, 2026

/ok to test 27255a1

@gagank1 gagank1 changed the title from "Recipes MFU calculation utilities" to "Generalize MFU/FLOPs module across recipes with log_mfu training hook" Apr 13, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (1)

108-120: ⚠️ Potential issue | 🟠 Major

Reset last_step_time on non-logging steps to prevent stale MFU inputs.

self.last_step_time is only updated inside the logging-frequency branch, so it can retain an old value. Since training code reads perf_logger.last_step_time each step, MFU can be computed with stale timing.

Proposed fix
         with torch.no_grad():
+            self.last_step_time = None
             # FSDP2's clip_grad_norm_ returns a DTensor; convert to local tensor for torchmetrics compatibility.
             if isinstance(grad_norm, DTensor):
                 grad_norm = grad_norm.to_local()
@@
             if step % self.logging_frequency == 0 and step > 0:
@@
                 step_time = elapsed_time / self.logging_frequency
                 self.last_step_time = step_time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/esm2_native_te/perf_logger.py` around lines 108 -
120, The perf logger currently only updates self.last_step_time inside the
logging branch guarded by self.logging_frequency, causing stale values on
non-logging steps; update the step handler in perf_logger.py so that when a step
is not a logging step you explicitly reset self.last_step_time (e.g., set to 0.0
or None) to avoid stale MFU inputs, keeping the existing logic that updates
self.previous_step_time and last_step_time when (step % self.logging_frequency
== 0).
🧹 Nitpick comments (3)
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py (1)

354-359: Same minor note as train_fsdp2_cp.py.

The step_start_time reset on line 359 happens unconditionally. Same optional refactor applies if you want consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py` around
lines 354 - 359, The unconditional reset of step_start_time happens even when
mfu_tracker is None; update the block around mfu_tracker so step_start_time is
only reset when mfu_tracker is present (i.e., move the step_start_time =
time.perf_counter() inside the if mfu_tracker is not None branch, after calling
mfu_tracker.compute_mfu and optional dist_config.is_main_process()/logger.info)
to match the optional flow used elsewhere (see mfu_tracker, compute_mfu,
dist_config.is_main_process, logger.info).
bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py (1)

24-25: Consider using a more robust import approach.

The sys.path.insert(0, ...) pattern works but can be fragile. Since this test file will be copied across recipes, consider whether pytest's conftest.py or pyproject.toml configuration could provide a cleaner solution.

That said, this is a common pattern in recipe-style code and works correctly here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py` around lines 24
- 25, The test currently mutates sys.path via the sys.path.insert(0, ...) call
in test_flops.py which is fragile; instead move that path-setup into a shared
pytest configuration so tests don't each modify sys.path: create or update
pytest's conftest.py to add the parent directory to sys.path (or better,
configure package discovery via pyproject.toml/paths or install the package in
editable mode) and remove the sys.path.insert(0, ...) line from test_flops.py so
imports are resolved by pytest configuration rather than per-test mutation.
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py (1)

362-367: Minor inconsistency: step_start_time reset occurs unconditionally.

The timer reset on line 367 happens even when mfu_tracker is None. While this has negligible performance impact, it's slightly inconsistent—either gate the entire timing block or make it explicit that timing is always active.

Also note this script uses manual time.perf_counter() timing while ESM2 scripts use perf_logger.last_step_time. This difference is acceptable since different recipes may have different PerfLogger implementations, but it's worth documenting if intentional.

Optional: Gate timing reset with MFU tracker check
                if mfu_tracker is not None:
                    step_time = time.perf_counter() - step_start_time
                    mfu_info = mfu_tracker.compute_mfu(step_time)
                    if dist_config.is_main_process():
                        logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
-                step_start_time = time.perf_counter()
+                    step_start_time = time.perf_counter()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py` around
lines 362 - 367, The timing reset is currently unconditional; to be consistent,
only reset step_start_time when mfu tracking is active. Inside the block that
checks mfu_tracker is not None (the one that calls mfu_tracker.compute_mfu(...)
and logs via logger.info when dist_config.is_main_process()), move the line that
sets step_start_time = time.perf_counter() into that same guarded block (or
explicitly comment why timing should remain unconditional). This references the
mfu_tracker variable, compute_mfu method, step_start_time variable,
dist_config.is_main_process(), and logger.info to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py`:
- Around line 24-41: The tests unsafely import the local flops.py via sys.path
and a plain "from flops import ...", causing module-cache collisions across
pytest sessions; modify the test to load the local flops.py under a unique
module name (e.g., using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.

In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 222-227: The MFU log omits the per-step duration even though
step_time is computed; update the logging call inside the mfu_tracker block
(where mfu_info = mfu_tracker.compute_mfu(step_time) and inside the
dist_config.is_main_process() check) to include step_time in the logger.info
message alongside mfu_info["mfu"] and mfu_info["tflops_per_gpu"] (e.g., add a
placeholder for step_time and pass step_time as an argument) so each MFU line
reports MFU, TFLOPS/GPU, and the per-step time.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py`:
- Around line 607-647: The CLI flag args.formula is unused: either honor it or
remove it; to fix, branch on args.formula (values "simplified" or "analytical")
before calling compute_flops_simplified and compute_flops_analytical so you only
compute and print the requested method. Specifically, use args.formula to decide
whether to call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
- Around line 453-483: In measure_bus_bandwidth, reject world sizes != 2 instead
of assuming peers by updating the initial guard (world_size) to raise an error
when world_size is not exactly 2; ensure you check world_size at the top of the
function and raise a ValueError (or similar) with a clear message referencing
the peer calculation using peer = 1 - rank so callers know only 2-process runs
are supported; keep the rest of the logic unchanged once the guard ensures rank
is 0 or 1.
- Around line 104-140: from_hf_config currently computes derived fields
(head_dim, has_lm_head) before kwargs.update(overrides), which allows overrides
like num_attention_heads or vocab_size to leave derived values stale; update the
function so overrides are applied before computing derived fields (or apply
kwargs.update(overrides) and then recompute head_dim = hidden_size //
num_attention_heads and has_lm_head = vocab_size > 0 afterwards), ensuring
derived values (head_dim, has_lm_head and any other fields dependent on
overrides) reflect the final values passed to ModelFLOPsConfig; keep references
to from_hf_config, kwargs.update(overrides), head_dim, has_lm_head,
num_attention_heads, vocab_size and ModelFLOPsConfig to locate the changes.
- Around line 69-75: detect_gpu_peak_tflops currently calls
torch.cuda.get_device_name(0) unguarded and will crash on CPU-only hosts; update
detect_gpu_peak_tflops to first check torch.cuda.is_available() and only call
torch.cuda.get_device_name(0) when CUDA is available, otherwise return (None,
"unknown") so callers like _cli_gpu_info and MFUTracker(..., peak_tflops=None)
safely receive a fallback; keep the GPU_PEAK_TFLOPS_BF16 lookup logic intact
when a device name is present.
- Around line 486-507: The load_model_config function currently hardcodes
trust_remote_code=True when calling AutoConfig.from_pretrained; change it to
accept explicit keyword-only parameters (e.g., *, trust_remote_code: bool =
False, revision: Optional[str] = None) so callers must opt-in to remote code
execution and can pin a revision, then pass those parameters into
AutoConfig.from_pretrained(config_path, trust_remote_code=trust_remote_code,
revision=revision) and update any call sites accordingly.

In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py`:
- Around line 151-157: The MFUTracker is being configured with per-micro-batch
FLOPs which underestimates utilization when gradient accumulation is used;
update the MFUTracker initialization (the MFUTracker(...) call) to account for
gradient accumulation by multiplying the effective batch_size (or otherwise
scaling FLOPs) by the gradient accumulation factor (use
args.gradient_accumulation_steps or args.grad_acc_steps as present in the
training args) so the tracker uses full-step work rather than a single
micro-batch, and apply the same correction to the related MFU/config
calculations around this block.

In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py`:
- Around line 187-193: The MFUTracker is using args.dataset.micro_batch_size as
the FLOPs batch size which ignores gradient accumulation and underestimates MFU;
update the mfu_tracker initialization (symbol: MFUTracker / variable:
mfu_tracker) to use the effective batch size = args.dataset.micro_batch_size *
args.grad_acc_steps (or otherwise multiply by grad_acc_steps) when setting the
batch_size parameter, and apply the same change wherever MFUTracker is
constructed (also the other block around the second occurrence).

In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py`:
- Around line 165-170: MFU is undercounted because MFUTracker is initialized
with a single micro-batch; multiply the batch work by gradient accumulation
steps so FLOPs/ MFU reflect a full optimizer step. Update the MFUTracker
initialization (mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.

---

Outside diff comments:
In `@bionemo-recipes/recipes/esm2_native_te/perf_logger.py`:
- Around line 108-120: The perf logger currently only updates
self.last_step_time inside the logging branch guarded by self.logging_frequency,
causing stale values on non-logging steps; update the step handler in
perf_logger.py so that when a step is not a logging step you explicitly reset
self.last_step_time (e.g., set to 0.0 or None) to avoid stale MFU inputs,
keeping the existing logic that updates self.previous_step_time and
last_step_time when (step % self.logging_frequency == 0).

---

Nitpick comments:
In `@bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py`:
- Around line 24-25: The test currently mutates sys.path via the
sys.path.insert(0, ...) call in test_flops.py which is fragile; instead move
that path-setup into a shared pytest configuration so tests don't each modify
sys.path: create or update pytest's conftest.py to add the parent directory to
sys.path (or better, configure package discovery via pyproject.toml/paths or
install the package in editable mode) and remove the sys.path.insert(0, ...)
line from test_flops.py so imports are resolved by pytest configuration rather
than per-test mutation.
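One possible shape for that shared configuration, sketched as a `tests/conftest.py` hook (the layout is assumed from the comment, not taken from the repository's actual conftest files):

```python
# Hypothetical tests/conftest.py for a recipe: add the recipe root to
# sys.path once, so individual test files no longer need their own
# sys.path.insert calls. A sketch under the layout assumed above.
import sys
from pathlib import Path

RECIPE_ROOT = Path(__file__).resolve().parent.parent


def pytest_configure(config):
    """Make the recipe's top-level modules (e.g. flops.py) importable."""
    root = str(RECIPE_ROOT)
    if root not in sys.path:
        sys.path.insert(0, root)
```

Installing the recipe in editable mode or configuring package discovery in pyproject.toml remains the cleaner long-term option.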

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py`:
- Around line 362-367: The timing reset is currently unconditional; to be
consistent, only reset step_start_time when mfu tracking is active. Inside the
block that checks mfu_tracker is not None (the one that calls
mfu_tracker.compute_mfu(...) and logs via logger.info when
dist_config.is_main_process()), move the line that sets step_start_time =
time.perf_counter() into that same guarded block (or explicitly comment why
timing should remain unconditional). This references the mfu_tracker variable,
compute_mfu method, step_start_time variable, dist_config.is_main_process(), and
logger.info to locate the code to change.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py`:
- Around line 354-359: The unconditional reset of step_start_time happens even
when mfu_tracker is None; update the block around mfu_tracker so step_start_time
is only reset when mfu_tracker is present (i.e., move the step_start_time =
time.perf_counter() inside the if mfu_tracker is not None branch, after calling
mfu_tracker.compute_mfu and optional dist_config.is_main_process()/logger.info)
to match the optional flow used elsewhere (see mfu_tracker, compute_mfu,
dist_config.is_main_process, logger.info).
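Both nitpicks describe the same move; a minimal sketch of the guarded flow (function and argument names are illustrative, not the recipes' API):

```python
import time


def log_mfu_and_reset(mfu_tracker, step_start_time, is_main_process, log_fn):
    """Sketch of the guarded flow suggested above: the MFU computation and
    the timer reset both live inside the `mfu_tracker is not None` branch.
    Names are illustrative, not the recipe's actual API.
    """
    if mfu_tracker is None:
        # MFU tracking disabled: leave the timer untouched.
        return step_start_time
    step_time = time.perf_counter() - step_start_time
    mfu_info = mfu_tracker.compute_mfu(step_time)
    if is_main_process:
        log_fn(
            "MFU: %.1f%% (%.2f TFLOPS/GPU)"
            % (mfu_info["mfu"], mfu_info["tflops_per_gpu"])
        )
    # Reset only when tracking is active, matching the optional flow.
    return time.perf_counter()
```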

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 158596f3-a749-4ec2-88e4-a5b4f9b08cc4

📥 Commits

Reviewing files that changed from the base of the PR and between 86c8329 and 27255a1.

📒 Files selected for processing (30)
  • bionemo-recipes/recipes/codonfm_native_te/README.md
  • bionemo-recipes/recipes/codonfm_native_te/flops.py
  • bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py
  • bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/esm2_native_te/README.md
  • bionemo-recipes/recipes/esm2_native_te/flops.py
  • bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/esm2_native_te/perf_logger.py
  • bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py
  • bionemo-recipes/recipes/esm2_native_te/train_ddp.py
  • bionemo-recipes/recipes/esm2_native_te/train_ddp_cp.py
  • bionemo-recipes/recipes/esm2_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/esm2_native_te/train_fsdp2_cp.py
  • bionemo-recipes/recipes/esm2_native_te/train_mfsdp.py
  • bionemo-recipes/recipes/llama3_native_te/README.md
  • bionemo-recipes/recipes/llama3_native_te/flops.py
  • bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/llama3_native_te/model_configs/meta-llama/Llama-3.1-8B/config.json
  • bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py
  • bionemo-recipes/recipes/llama3_native_te/train_ddp.py
  • bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
  • bionemo-recipes/recipes/opengenome2_llama_native_te/flops.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_flops.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py
  • ci/scripts/check_copied_files.py

Comment on lines +24 to +41
import sys
from pathlib import Path

import pytest


# Add parent directory so we can import flops
sys.path.insert(0, str(Path(__file__).parent.parent))

from flops import (
MFUTracker,
ModelFLOPsConfig,
compute_flops_analytical,
compute_flops_hyena,
compute_flops_simplified,
estimate_cp_comm_bytes,
from_hf_config,
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate the test file and its parent structure
find . -name "test_flops.py" -type f 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 329


🏁 Script executed:

# Check if flops.py exists in the parent directories of test files
find . -name "flops.py" -type f 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 285


🏁 Script executed:

# Search for other test files using the same sys.path pattern
rg "sys\.path\.insert.*parent" --type py 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 2354


🏁 Script executed:

# Verify what each test_flops.py imports from "flops"
for f in ./bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py ./bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py ./bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py; do
  echo "=== $f ==="
  sed -n '1,45p' "$f" | tail -20
done

Repository: NVIDIA/bionemo-framework

Length of output: 1631


🏁 Script executed:

# Check how pytest is typically invoked - look for pytest config or CI configs
find . -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | grep -E "(pytest|pyproject)" | head -10

Repository: NVIDIA/bionemo-framework

Length of output: 658


🏁 Script executed:

# Check pyproject.toml for pytest configuration
head -200 ./pyproject.toml 2>/dev/null | grep -A 20 "\[tool.pytest"

Repository: NVIDIA/bionemo-framework

Length of output: 50


🏁 Script executed:

# Check pytest.ini directly
cat ./pytest.ini 2>/dev/null

Repository: NVIDIA/bionemo-framework

Length of output: 433


🏁 Script executed:

# Check main pyproject.toml for pytest config
rg "\[tool\.pytest" -A 10 ./pyproject.toml 2>/dev/null

Repository: NVIDIA/bionemo-framework

Length of output: 50


🏁 Script executed:

# View the actual imports in llama3_native_te test_flops.py more completely
head -30 ./bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py

Repository: NVIDIA/bionemo-framework

Length of output: 1062


🏁 Script executed:

# Check if there's a conftest.py that might manipulate sys.path at the recipe level
find ./bionemo-recipes/recipes -name "conftest.py" -type f

Repository: NVIDIA/bionemo-framework

Length of output: 706


🏁 Script executed:

# Check the conftest.py files for the affected recipes
for f in ./bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py ./bionemo-recipes/recipes/llama3_native_te/tests/conftest.py ./bionemo-recipes/recipes/esm2_native_te/tests/conftest.py; do
  echo "=== $f ==="
  cat "$f" 2>/dev/null || echo "Not found"
  echo ""
done

Repository: NVIDIA/bionemo-framework

Length of output: 10467


🏁 Script executed:

# Test if flops modules would actually collide - simulate what happens when importing both
python3 << 'EOF'
import sys
from pathlib import Path
import importlib

# Simulate first recipe test importing flops
recipe1_dir = "./bionemo-recipes/recipes/llama3_native_te"
sys.path.insert(0, recipe1_dir)

# This would make "flops" from llama3 available
print("sys.path after recipe1:", sys.path[:3])

# Simulate second recipe test in SAME process (as pytest would do)
recipe2_dir = "./bionemo-recipes/recipes/codonfm_native_te"
# If we just append another dir, we'd have both in sys.path
sys.path.insert(0, recipe2_dir)
print("sys.path after recipe2:", sys.path[:3])

# When we try to import flops now, Python will use the first match in sys.path
# but if flops is already in sys.modules, it uses that cached version
print("\nThis demonstrates the caching risk:")
print("- First test imports flops from llama3_native_te")
print("- Module 'flops' gets cached in sys.modules")
print("- Second test also imports 'flops'")
print("- Python finds 'flops' in sys.modules and returns cached version")
print("- Test uses wrong recipe's flops.py!")
EOF

Repository: NVIDIA/bionemo-framework

Length of output: 557


🏁 Script executed:

# Check if the flops.py files differ between recipes
echo "=== Comparing llama3 vs codonfm flops.py size ==="
wc -l ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py

Repository: NVIDIA/bionemo-framework

Length of output: 246


🏁 Script executed:

# Check if they're identical or different
diff ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py | head -50

Repository: NVIDIA/bionemo-framework

Length of output: 359


🏁 Script executed:

# Verify pytest test discovery behavior - check if tests are typically run together
find ./bionemo-recipes/recipes -name "test_*.py" -o -name "*_test.py" | wc -l

Repository: NVIDIA/bionemo-framework

Length of output: 70


🏁 Script executed:

# Check if CI configuration runs all recipe tests together
find . -name "*.yml" -o -name "*.yaml" | xargs grep -l "pytest.*recipes" 2>/dev/null | head -5

Repository: NVIDIA/bionemo-framework

Length of output: 50


🏁 Script executed:

# Check if conftest.py at a higher level might run tests across recipes
find ./bionemo-recipes/recipes -name "conftest.py" | head -5
ls -la ./bionemo-recipes/recipes/conftest.py 2>/dev/null || echo "No conftest.py at recipes level"
ls -la ./bionemo-recipes/conftest.py 2>/dev/null || echo "No conftest.py at bionemo-recipes level"

Repository: NVIDIA/bionemo-framework

Length of output: 453


🏁 Script executed:

# Verify the exact issue: show what happens when flops is imported twice with same name
python3 << 'EOF'
import sys
from pathlib import Path

# Simulate pytest running both tests in sequence in same process
print("=== Scenario: Both test_flops.py files run in same pytest process ===\n")

# Test 1: llama3_native_te/tests/test_flops.py
print("1. Test llama3_native_te imports flops:")
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/llama3_native_te")))
import flops as flops_v1
print(f"   Module: {flops_v1.__file__}")
print(f"   sys.modules['flops']: {sys.modules.get('flops', 'NOT SET')}")

# Simulate cleaning up sys.path for next test
sys.path.pop(0)
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/codonfm_native_te")))

# Test 2: codonfm_native_te/tests/test_flops.py
print("\n2. Test codonfm_native_te tries to import flops:")
# This would normally do: from flops import MFUTracker, ...
# But since 'flops' is already in sys.modules, Python returns the cached version!
import flops as flops_v2
print(f"   Module: {flops_v2.__file__}")
print(f"   Same module? {flops_v1 is flops_v2}")
print(f"\n   RESULT: Test got llama3's flops, not codonfm's!")
EOF

Repository: NVIDIA/bionemo-framework

Length of output: 431


🏁 Script executed:

# Verify if the flops.py files are actually different (not just comments)
diff -w ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py | grep -v "^[0-9]" | head -30

Repository: NVIDIA/bionemo-framework

Length of output: 350


🏁 Script executed:

# Check test execution patterns in CI/config files
find . -name "*.sh" -o -name "*.yml" -o -name "*.yaml" | xargs grep -l "test_flops" 2>/dev/null | head -3

Repository: NVIDIA/bionemo-framework

Length of output: 50


Import the local flops.py under a unique module name.

This sys.path + from flops import ... pattern is unsafe when multiple recipe test files run in the same pytest process: the first imported flops module stays cached in sys.modules, so later test suites can end up testing the wrong recipe implementation. While the flops.py files are currently identical (copied from llama3_native_te), this pattern creates a fragile dependency on that equivalence. Please fix this in the source test template and re-sync the copies.

🔧 One safe pattern
-import sys
+import importlib.util
 from pathlib import Path

 import pytest


-# Add parent directory so we can import flops
-sys.path.insert(0, str(Path(__file__).parent.parent))
-
-from flops import (
-    MFUTracker,
-    ModelFLOPsConfig,
-    compute_flops_analytical,
-    compute_flops_hyena,
-    compute_flops_simplified,
-    estimate_cp_comm_bytes,
-    from_hf_config,
-)
+FLOPS_PATH = Path(__file__).parent.parent / "flops.py"
+SPEC = importlib.util.spec_from_file_location(f"{__name__}_flops", FLOPS_PATH)
+assert SPEC is not None and SPEC.loader is not None
+flops = importlib.util.module_from_spec(SPEC)
+SPEC.loader.exec_module(flops)
+
+MFUTracker = flops.MFUTracker
+ModelFLOPsConfig = flops.ModelFLOPsConfig
+compute_flops_analytical = flops.compute_flops_analytical
+compute_flops_hyena = flops.compute_flops_hyena
+compute_flops_simplified = flops.compute_flops_simplified
+estimate_cp_comm_bytes = flops.estimate_cp_comm_bytes
+from_hf_config = flops.from_hf_config
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py` around lines
24 - 41, The tests unsafely import the local flops.py via sys.path and a plain
"from flops import ...", causing module-cache collisions across pytest sessions;
modify the test to load the local flops.py under a unique module name (e.g.,
using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.

Comment on lines +222 to +227
if mfu_tracker is not None:
step_time = time.perf_counter() - step_start_time
mfu_info = mfu_tracker.compute_mfu(step_time)
if dist_config.is_main_process():
logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
step_start_time = time.perf_counter()

⚠️ Potential issue | 🟡 Minor

Include step_time in the MFU log line for complete per-step telemetry.

You already compute step_time on Line 223, but the emitted message omits it.

Proposed fix
                     if mfu_tracker is not None:
                         step_time = time.perf_counter() - step_start_time
                         mfu_info = mfu_tracker.compute_mfu(step_time)
                         if dist_config.is_main_process():
-                            logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
+                            logger.info(
+                                "Step %d MFU: %.1f%% (%.2f TFLOPS/GPU, step_time=%.3fs)",
+                                step,
+                                mfu_info["mfu"],
+                                mfu_info["tflops_per_gpu"],
+                                step_time,
+                            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py` around lines 222 -
227, The MFU log omits the per-step duration even though step_time is computed;
update the logging call inside the mfu_tracker block (where mfu_info =
mfu_tracker.compute_mfu(step_time) and inside the dist_config.is_main_process()
check) to include step_time in the logger.info message alongside mfu_info["mfu"]
and mfu_info["tflops_per_gpu"] (e.g., add a placeholder for step_time and pass
step_time as an argument) so each MFU line reports MFU, TFLOPS/GPU, and the
per-step time.

Comment on lines +69 to +75
def detect_gpu_peak_tflops():
"""Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
device_name = torch.cuda.get_device_name(0)
for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
if gpu_key.lower() in device_name.lower():
return tflops, device_name
return None, device_name

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "flops.py" | grep -E "(llama|flops)"

Repository: NVIDIA/bionemo-framework

Length of output: 285


🏁 Script executed:

cd bionemo-recipes && find . -type f -name "flops.py" 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 221


🏁 Script executed:

cat -n ./recipes/llama3_native_te/flops.py

Repository: NVIDIA/bionemo-framework

Length of output: 135


🏁 Script executed:

cat -n ./bionemo-recipes/recipes/llama3_native_te/flops.py

Repository: NVIDIA/bionemo-framework

Length of output: 32632


Guard CUDA-name lookup before touching torch.cuda.

The function calls torch.cuda.get_device_name(0) without checking whether CUDA is available, causing it to fail on CPU-only hosts instead of falling back to "unknown". _cli_gpu_info() and MFUTracker(..., peak_tflops=None) will crash in non-CUDA environments. Check torch.cuda.is_available() (per PyTorch documentation) before invoking CUDA device queries.

Minimal fix
 def detect_gpu_peak_tflops():
     """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
+    if not torch.cuda.is_available():
+        return None, "unknown"
     device_name = torch.cuda.get_device_name(0)
     for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
         if gpu_key.lower() in device_name.lower():
             return tflops, device_name
     return None, device_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 69 - 75,
detect_gpu_peak_tflops currently calls torch.cuda.get_device_name(0) unguarded
and will crash on CPU-only hosts; update detect_gpu_peak_tflops to first check
torch.cuda.is_available() and only call torch.cuda.get_device_name(0) when CUDA
is available, otherwise return (None, "unknown") so callers like _cli_gpu_info
and MFUTracker(..., peak_tflops=None) safely receive a fallback; keep the
GPU_PEAK_TFLOPS_BF16 lookup logic intact when a device name is present.

Comment on lines +104 to +140
def from_hf_config(config_dict, **overrides):
"""Create ModelFLOPsConfig from an HF-compatible config dict.

Auto-detects architecture:
- GQA vs MHA: from num_key_value_heads (absent = MHA)
- Gated MLP (3 proj) vs standard FFN (2 proj): from model_type
- LM head: from vocab_size > 0

Args:
config_dict: Dict with standard HF config keys (hidden_size, num_hidden_layers, etc.).
Works with config.json dicts, config.to_dict(), or MODEL_PRESETS dicts.
**overrides: Explicit overrides for any field (e.g., num_mlp_projections=3).
"""
h = config_dict["hidden_size"]
n_heads = config_dict["num_attention_heads"]
n_kv = config_dict.get("num_key_value_heads", n_heads)
vocab = config_dict.get("vocab_size", 0)
model_type = config_dict.get("model_type", "")

# Detect gated MLP (3 projections) vs standard FFN (2 projections).
# Llama/Mistral/Qwen use SwiGLU (gate + up + down = 3 projections).
# ESM2/CodonFM/Geneformer/BERT use standard FFN (up + down = 2 projections).
num_mlp_proj = 3 if model_type in GATED_MLP_MODEL_TYPES else 2

kwargs = {
"hidden_size": h,
"num_hidden_layers": config_dict["num_hidden_layers"],
"num_attention_heads": n_heads,
"num_kv_heads": n_kv,
"head_dim": h // n_heads,
"intermediate_size": config_dict["intermediate_size"],
"num_mlp_projections": num_mlp_proj,
"vocab_size": vocab,
"has_lm_head": vocab > 0,
}
kwargs.update(overrides)
return ModelFLOPsConfig(**kwargs)

⚠️ Potential issue | 🟠 Major

Recompute derived fields after applying overrides.

kwargs.update(overrides) can leave head_dim and has_lm_head inconsistent with the overridden values. For example, vocab_size=0 still keeps the precomputed has_lm_head unless the caller knows to override both, and overriding num_attention_heads leaves the old head_dim.

♻️ Suggested fix
 def from_hf_config(config_dict, **overrides):
@@
-    kwargs = {
-        "hidden_size": h,
-        "num_hidden_layers": config_dict["num_hidden_layers"],
-        "num_attention_heads": n_heads,
-        "num_kv_heads": n_kv,
-        "head_dim": h // n_heads,
-        "intermediate_size": config_dict["intermediate_size"],
-        "num_mlp_projections": num_mlp_proj,
-        "vocab_size": vocab,
-        "has_lm_head": vocab > 0,
-    }
-    kwargs.update(overrides)
-    return ModelFLOPsConfig(**kwargs)
+    kwargs = {
+        "hidden_size": h,
+        "num_hidden_layers": config_dict["num_hidden_layers"],
+        "num_attention_heads": n_heads,
+        "num_kv_heads": n_kv,
+        "intermediate_size": config_dict["intermediate_size"],
+        "num_mlp_projections": num_mlp_proj,
+        "vocab_size": vocab,
+    }
+    kwargs.update(overrides)
+    kwargs.setdefault("head_dim", kwargs["hidden_size"] // kwargs["num_attention_heads"])
+    kwargs.setdefault("has_lm_head", kwargs["vocab_size"] > 0)
+    return ModelFLOPsConfig(**kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 104 - 140,
from_hf_config currently computes derived fields (head_dim, has_lm_head) before
kwargs.update(overrides), which allows overrides like num_attention_heads or
vocab_size to leave derived values stale; update the function so overrides are
applied before computing derived fields (or apply kwargs.update(overrides) and
then recompute head_dim = hidden_size // num_attention_heads and has_lm_head =
vocab_size > 0 afterwards), ensuring derived values (head_dim, has_lm_head and
any other fields dependent on overrides) reflect the final values passed to
ModelFLOPsConfig; keep references to from_hf_config, kwargs.update(overrides),
head_dim, has_lm_head, num_attention_heads, vocab_size and ModelFLOPsConfig to
locate the changes.

Comment on lines +453 to +483
def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
"""Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
if world_size <= 1:
return 0.0

rank = dist.get_rank()
tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
peer = 1 - rank

for _ in range(5):
if rank == 0:
dist.send(tensor, dst=peer)
dist.recv(tensor, src=peer)
else:
dist.recv(tensor, src=peer)
dist.send(tensor, dst=peer)
torch.cuda.synchronize()

dist.barrier()
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(num_iters):
if rank == 0:
dist.send(tensor, dst=peer)
else:
dist.recv(tensor, src=peer)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

data_bytes = tensor.nelement() * tensor.element_size()
return num_iters * data_bytes / elapsed / 1e9

⚠️ Potential issue | 🟠 Major

Reject world sizes other than 2 in the bandwidth helper.

peer = 1 - rank only works for ranks 0 and 1. If this gets launched with more than two processes, ranks >= 2 compute invalid peers and can hang the send/recv loop. Please fail fast here instead of assuming the caller followed the docstring.

🔒 Suggested guard
 def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
     """Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
     if world_size <= 1:
         return 0.0
+    if world_size != 2:
+        raise ValueError("measure_bus_bandwidth currently supports exactly 2 ranks")

     rank = dist.get_rank()
     tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
-    peer = 1 - rank
+    peer = 1 if rank == 0 else 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 453 - 483, in
measure_bus_bandwidth, reject world sizes other than 2 instead of assuming valid
peers: extend the initial world_size guard to raise a ValueError (or similar)
with a clear message when world_size is not exactly 2, noting that the peer
calculation peer = 1 - rank only supports 2-process runs; keep the rest of the
logic unchanged once the guard ensures rank is 0 or 1.

Comment on lines +486 to +507
def load_model_config(config_path):
"""Load model config dict from a local path or HuggingFace model ID.

Supports:
- Local directory: ./model_configs/lingua-1B (reads config.json inside)
- Local file: ./model_configs/lingua-1B/config.json
- HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
"""
import json
from pathlib import Path

path = Path(config_path)
if path.is_dir():
path = path / "config.json"
if path.exists():
return json.loads(path.read_text())

# Fall back to HuggingFace Hub
from transformers import AutoConfig

hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
return hf_config.to_dict()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the file
find . -type f -name "flops.py" | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 285


🏁 Script executed:

# Also, let's search for the specific function
rg "def load_model_config" -A 25

Repository: NVIDIA/bionemo-framework

Length of output: 8820


🏁 Script executed:

# Check if there are any calls to load_model_config in the codebase
rg "load_model_config\(" --type py

Repository: NVIDIA/bionemo-framework

Length of output: 1391


🏁 Script executed:

# Verify the actual line numbers in the target file
sed -n '486,507p' bionemo-recipes/recipes/llama3_native_te/flops.py | cat -n

Repository: NVIDIA/bionemo-framework

Length of output: 947


🏁 Script executed:

# Check that the docstring and function signature comply with line length (119 chars)
head -n 510 bionemo-recipes/recipes/llama3_native_te/flops.py | tail -n 25 | cat -n

Repository: NVIDIA/bionemo-framework

Length of output: 1050


Remove hardcoded trust_remote_code=True and add explicit opt-in parameters.

AutoConfig.from_pretrained(config_path, trust_remote_code=True) unconditionally enables execution of arbitrary code from the Hugging Face Hub. Per Hugging Face documentation, trust_remote_code defaults to False and should only be enabled as an explicit opt-in for trusted sources. Add keyword-only parameters to make this configurable and allow revision pinning:

Suggested fix
-def load_model_config(config_path):
+def load_model_config(config_path, *, trust_remote_code=False, revision=None):
     """Load model config dict from a local path or HuggingFace model ID.
 
     Supports:
     - Local directory: ./model_configs/lingua-1B (reads config.json inside)
     - Local file: ./model_configs/lingua-1B/config.json
     - HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
     """
@@ -494,5 +494,7 @@ def load_model_config(config_path):
     from transformers import AutoConfig
 
-    hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
+    hf_config = AutoConfig.from_pretrained(
+        config_path,
+        trust_remote_code=trust_remote_code,
+        revision=revision,
+    )
     return hf_config.to_dict()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 486 - 507,
The load_model_config function currently hardcodes trust_remote_code=True when
calling AutoConfig.from_pretrained; change it to accept explicit keyword-only
parameters (e.g., *, trust_remote_code: bool = False, revision: Optional[str] =
None) so callers must opt-in to remote code execution and can pin a revision,
then pass those parameters into AutoConfig.from_pretrained(config_path,
trust_remote_code=trust_remote_code, revision=revision) and update any call
sites accordingly.

Comment on lines +607 to +647
parser = argparse.ArgumentParser()
parser.add_argument("command")
parser.add_argument("--config-path", default="./model_configs/lingua-1B")
parser.add_argument("--batch-size", type=int, default=1)
parser.add_argument("--seq-len", type=int, default=4096)
parser.add_argument("--formula", default="analytical", choices=["analytical", "simplified"])
parser.add_argument("--num-gpus", type=int, default=1, help="Number of GPUs for per-GPU FLOPs and comm estimates")
parser.add_argument("--cp-size", type=int, default=1, help="Context parallelism size for comm overhead estimate")
parser.add_argument("--p2p-bw", type=float, default=6.0, help="P2P bandwidth in GB/s for comm time estimate")
args = parser.parse_args()

cfg_dict = load_model_config(args.config_path)
config = from_hf_config(cfg_dict)
b, s = args.batch_size, args.seq_len

print(
f"Config: H={config.hidden_size}, L={config.num_hidden_layers},"
f" n_heads={config.num_attention_heads}, n_kv={config.num_kv_heads},"
f" I={config.intermediate_size}, V={config.vocab_size}"
)
print(
f"MLP: {config.num_mlp_projections} projections"
f" ({'SwiGLU/GLU' if config.num_mlp_projections == 3 else 'standard FFN'})"
)
print(f"Batch: B={b}, S={s}, GPUs={args.num_gpus}, CP={args.cp_size}")
print()

simplified = compute_flops_simplified(b, s, config.hidden_size, config.num_hidden_layers, config.vocab_size)
analytical, breakdown, lm_head = compute_flops_analytical(config, b, s)

print(f"{'Method':<24} {'FLOPs/step':>14} {'Per-GPU':>14} {'Exact':>30}")
print("-" * 86)
for name, flops in [("Simplified (README)", simplified), ("Analytical", analytical)]:
per_gpu = flops // max(args.num_gpus, 1)
print(f"{name:<24} {format_flops(flops):>14} {format_flops(per_gpu):>14} {format_flops_exact(flops):>30}")

if simplified != analytical:
diff = analytical - simplified
print(f"\nDifference: {format_flops_exact(diff)} ({diff / simplified * 100:+.2f}%)")
else:
print("\nFormulas agree exactly for this config.")

⚠️ Potential issue | 🟡 Minor

--formula is currently a no-op.

The parser accepts --formula, but this code always computes and prints both simplified and analytical FLOPs and never branches on args.formula. Either honor the flag or remove it; otherwise the CLI advertises behavior it does not implement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 607 - 647,
The CLI flag args.formula is unused: either honor it or remove it; to fix,
branch on args.formula (values "simplified" or "analytical") before calling
compute_flops_simplified and compute_flops_analytical so you only compute and
print the requested method. Specifically, use args.formula to decide whether to
call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.

Comment on lines +151 to +157
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
Contributor

⚠️ Potential issue | 🟠 Major

DDP MFU is underestimated under gradient accumulation.

Line 206 measures full-step duration, but Line 153 configures FLOPs for a single micro-batch.
This biases MFU low by grad_acc_steps.

Suggested fix
         mfu_tracker = MFUTracker(
             config=from_hf_config(config.to_dict()),
-            batch_size=args.dataset.micro_batch_size,
+            batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
             seq_len=args.dataset.max_seq_length,
             num_gpus=dist_config.world_size,
             parallelism={"dp": dist_config.world_size},
         )

Also applies to: 205-210

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py` around lines 151 -
157, The MFUTracker is being configured with per-micro-batch FLOPs which
underestimates utilization when gradient accumulation is used; update the
MFUTracker initialization (the MFUTracker(...) call) to account for gradient
accumulation by multiplying the effective batch_size (or otherwise scaling
FLOPs) by the gradient accumulation factor (use args.gradient_accumulation_steps
or args.grad_acc_steps as present in the training args) so the tracker uses
full-step work rather than a single micro-batch, and apply the same correction
to the related MFU/config calculations around this block.

Comment on lines +187 to +193
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
)
Contributor

⚠️ Potential issue | 🟠 Major

MFU math misses gradient accumulation scaling.

Line 244 times a full optimizer step, but Line 189 sets FLOPs batch size to one micro-batch.
With grad_acc_steps > 1, MFU is underestimated.

Suggested fix
         mfu_tracker = MFUTracker(
             config=from_hf_config(config.to_dict()),
-            batch_size=args.dataset.micro_batch_size,
+            batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
             seq_len=args.dataset.max_seq_length,
             num_gpus=dist_config.world_size,
             parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
         )

Also applies to: 243-248

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py` around lines 187
- 193, The MFUTracker is using args.dataset.micro_batch_size as the FLOPs batch
size which ignores gradient accumulation and underestimates MFU; update the
mfu_tracker initialization (symbol: MFUTracker / variable: mfu_tracker) to use
the effective batch size = args.dataset.micro_batch_size * args.grad_acc_steps
(or otherwise multiply by grad_acc_steps) when setting the batch_size parameter,
and apply the same change wherever MFUTracker is constructed (also the other
block around the second occurrence).

Comment on lines +165 to +170
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
Contributor

⚠️ Potential issue | 🟠 Major

MFU is underreported when grad_acc_steps > 1.

Line 219 computes wall time for a full optimizer step, but Lines 167-170 initialize FLOPs with a single micro-batch.
That underestimates FLOPs/MFU by a factor of grad_acc_steps.

Suggested fix
         mfu_tracker = MFUTracker(
             config=from_hf_config(config.to_dict()),
-            batch_size=args.dataset.micro_batch_size,
+            batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
             seq_len=args.dataset.max_seq_length,
             num_gpus=dist_config.world_size,
             parallelism={"dp": dist_config.world_size},
         )

Also applies to: 218-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py` around lines 165 -
170, MFU is undercounted because MFUTracker is initialized with a single
micro-batch; multiply the batch work by gradient accumulation steps so FLOPs/
MFU reflect a full optimizer step. Update the MFUTracker initialization
(mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.

@gagank1 gagank1 requested a review from pstjohn April 13, 2026 21:35
Comment thread ci/scripts/check_copied_files.py Outdated
    "bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py",
],
# FLOPs / MFU module - synced across recipes
"bionemo-recipes/recipes/llama3_native_te/flops.py": [
Collaborator

do we have to copy these to each recipe? it's just to calculate flops? maybe we just put it in one and point folks at it.

This should maybe be in the models/ definition though? is this just printing out a total flops for a given model size?

Collaborator Author

flops.py has 2 usage paths:

  1. flops counting for a given model config and comms bandwidth/overhead calculation using CLI
  2. can be used in the recipes' train scripts with config log_mfu=true to log MFU with each step

so it doesn't go purely with the model defs.

Collaborator Author

Addressed. flops.py is gone entirely. The FLOPs math is now a ~30-line helper (_compute_per_token_flops) inlined at the top of each recipe's existing perf_logger.py, so there's no new copied-files target. MFU metrics flow through the existing torchmetrics → WANDB path (train/mfu_pct, train/tflops_per_gpu).

Collaborator

@trvachov trvachov left a comment

Across all the added code, I think you shouldn't use batch_size in isolation as a variable name, or at least, if you do, make it explicit that in that function it doesn't matter whether it's micro or global batch. Otherwise, use micro_batch_size or global_batch_size, because without this it gets very confusing to tell whether the distributed FLOPs math is right.

The comm overhead calculations here have a number of issues including assumptions about model precision and also about FSDP2 communication patterns. Adding comm overhead calculations is a reasonable idea but I think it should be a separate PR. Each PR should be one-idea -> one PR.

There is a flops.py CLI, but I ran torchrun --nproc_per_node=2 flops.py bandwidth as written in the docstring, and it just hangs. This was on 8xA100 80GB PCIe, pytorch:26.02-py3 container. Does this work for you?

Regarding overall code structure --> indeed there is a lot of duplicated code, but that is also to some degree the philosophy of recipes. @pstjohn Can you help with some guidance here? Should flops.py really be in perf_logger.py to reduce duplicate files? Also I think the FLOPs metrics should likely be recorded in WANDB just like all of the other metrics we have.

Comment thread bionemo-recipes/recipes/esm2_native_te/flops.py Outdated
    16*H^2 for 2 MLP projections with I=4H.
    """
    b, s, h = batch_size, seq_len, hidden_size
    return (24 * b * s * h * h + 4 * b * s * s * h) * (3 * num_layers) + (6 * b * s * h * vocab_size)
Collaborator

Does this work with gradient accumulation?

Collaborator Author

Fixed. The old code took a static batch_size and multiplied once per optimizer step, which under-reported MFU by grad_acc_steps. The new code uses the num_unpadded_tokens counter already accumulated across all grad-acc micro-batches in PerfLogger.log_micro_step() (same counter that drives tokens_per_second_per_gpu), so per-step FLOPs scales with the full optimizer step's token volume regardless of grad_acc setting. For esm2 (no grad-acc in this recipe) the current batch's unpadded tokens are used directly.

Comment thread bionemo-recipes/recipes/esm2_native_te/train_ddp.py Outdated
Comment thread bionemo-recipes/recipes/llama3_native_te/flops.py Outdated
Returns:
Dict with comm_bytes, estimated_comm_time, comm_pct.
"""
bw = measured_bw_gbps or 6.0 # Default ~PCIe Gen3 x8
Collaborator

Why PCIe gen3? Isn't that >15 years old?

Also -- have you done the measurement on PCIe Gen3? I ask because, if this 6.0 is a "datasheet number", it might be very hard to measure especially through pytorch because of launch overhead. You might need a very specialized environment to measure max-PCIe speeds.

Collaborator Author

Removed along with the comm-overhead code. Will revisit in the follow-up comm-overhead PR with a more realistic default (or with a measurement-based approach).

Comment thread bionemo-recipes/recipes/llama3_native_te/flops.py Outdated
@pstjohn
Collaborator

pstjohn commented Apr 15, 2026

A couple general thoughts:

  1. The simplicity of our training scripts is incredibly important; even as it stands today I'm not happy with how much has crept in there. These need to be accessible and readable to folks coming to the repository for the first time, and be basic scaffolds agents can build from. I really don't think a flops calculator like this needs to touch every single training script in our recipes. At most, it would go in the perf logger, but even there, I'd rather see this be something like https://huggingface.co/docs/accelerate/usage_guides/model_size_estimator where this lives as a javascript app in our docs, not necessarily something that runs alongside model training.
  2. Logging FLOPs/MFU directly is notoriously a footgun when it comes to performance, especially if we end up re-calculating model flops every iteration. Are we 100% sure we're not doing this?

@gagank1
Collaborator Author

gagank1 commented Apr 18, 2026

@trvachov @pstjohn — ready for re-review. I took your feedback and substantially reworked the PR.

TL;DR

Every concern you raised is addressed, and the PR is now considerably smaller and simpler:

  • ~3000 lines removed (flops.py × 4, test_flops.py × 4, compare_mfu*.py × 4, CLI, comm-overhead code)
  • ~300 lines added (FLOPs helper inlined in each of 4 existing perf_logger.py files)
  • Training scripts are back to their previous simplicity — just a single extra kwarg on the PerfLogger constructor

What changed vs. the previous revision

Deleted

  • flops.py in all 4 recipes — gone entirely
  • tests/test_flops.py in all 4 recipes — gone
  • The entire CLI (gpu-info, flops, cp-comm, bandwidth) — no more hanging torchrun flops.py bandwidth
  • All comm-overhead code (_estimate_comm, estimate_cp_comm_bytes, _estimate_model_params, PCIe default) — split out for a future PR
  • The MFUTracker class, ModelFLOPsConfig dataclass, and all its helpers
  • Per-script mfu_tracker = MFUTracker(...) blocks and step_start_time = time.perf_counter() scaffolding in all 11 training scripts
  • last_step_time attribute on esm2's PerfLogger
  • Entry for flops.py / test_flops.py in ci/scripts/check_copied_files.py

Added / consolidated

  • Each recipe's existing perf_logger.py gets ~30 lines: a _GPU_PEAK_TFLOPS_BF16 table, _detect_peak_tflops_bf16(), and _compute_per_token_flops(config_dict, seq_len)
  • PerfLogger.__init__ gets one new optional kwarg: model_config_dict: dict | None = None. When present (and args.log_mfu=True), the logger adds train/tflops_per_gpu and train/mfu_pct to its metrics dict
  • MFU is computed inside the step % logging_frequency == 0 branch, so it flows through the same torchmetrics → WANDB pipeline as loss/grad_norm/step_time (addresses pstjohn's "WANDB like other metrics" ask and trvachov's "logging_frequency spam" concern)

The MFU formula

Simple, and the same in all 4 perf_loggers:

flops_per_step = _per_token_flops * num_unpadded_tokens_on_this_rank
tflops_per_gpu = flops_per_step / step_time / 1e12
mfu_pct       = tflops_per_gpu / peak_tflops * 100

Why this form works everywhere without DP/CP/grad-acc factoring:

| Scenario | Why unpadded tokens on rank are correct |
| --- | --- |
| DP (DDP/FSDP2) | Each rank sees its own micro-batch; attention_mask.sum() = rank's tokens |
| CP (TE/PyTorch) | Sequence is sharded by CP; each rank's batch is already [B, S/cp] |
| Gradient accum. | num_unpadded_tokens accumulates across grad-acc micro-batches via log_micro_step (same counter tokens_per_second_per_gpu already uses) |
| BSHD (no packing) | attention_mask marks real tokens vs. pad |
| THD (packing) | attention_mask.sum() or input_ids.numel() counts packed tokens |

So: no num_gpus divide, no batch_size ambiguity (there's no batch_size arg at all anymore), no explicit cp_size / dp_size / grad_acc_steps factor.
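The step arithmetic above can be sketched in a few lines. This is a minimal illustration, not the recipe's actual perf_logger code: the function name mfu_metrics and the example numbers are made up for the sketch, and 989 TFLOPS is the dense bf16 peak of an H100 SXM.

```python
def mfu_metrics(per_token_flops: float, num_unpadded_tokens: int,
                step_time_s: float, peak_tflops: float) -> dict:
    """Per-logging-window MFU arithmetic: this rank's work over this rank's time."""
    flops_per_step = per_token_flops * num_unpadded_tokens  # tokens already per-rank
    tflops_per_gpu = flops_per_step / step_time_s / 1e12
    mfu_pct = tflops_per_gpu / peak_tflops * 100.0
    return {"train/tflops_per_gpu": tflops_per_gpu, "train/mfu_pct": mfu_pct}


# Example: ~6.3e9 FLOPs/token (roughly a 1B model), 16384 tokens on this rank,
# a 0.35 s optimizer step, on an H100 SXM (989 dense bf16 TFLOPS peak).
metrics = mfu_metrics(6.3e9, 16384, 0.35, 989.0)
```

Because the token count is already per-rank and already accumulated across grad-acc micro-batches, no dp/cp/grad-acc factors appear in the formula itself.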

Training script change per recipe (this is the whole diff)

-perf_logger = PerfLogger(dist_config, args)
+perf_logger = PerfLogger(
+    dist_config,
+    args,
+    model_config_dict=config.to_dict() if args.get("log_mfu", False) else None,
+)

Plus deletion of the 10-line mfu_tracker = MFUTracker(...) block and the 6-line if mfu_tracker is not None block inside the loop, for each of the 11 scripts.

Point-by-point

Individual replies are on each inline thread; consolidated here:

  • batch_size naming (trvachov, main review) — resolved by removing the batch_size argument entirely; the formula uses per-rank unpadded tokens from the actual batch.
  • Comm overhead should be a separate PR (trvachov) — done; removed entirely. Will follow up in a separate PR that can address the precision/FSDP2-pattern assumptions properly.
  • Bandwidth CLI hangs on 8× A100 PCIe (trvachov) — done; CLI removed entirely.
  • num_gpus division (trvachov, esm2/flops.py:326) — done; new formula doesn't need it. See math above.
  • Gradient accumulation (trvachov, flops.py:203) — done; num_unpadded_tokens is the per-rank accumulated token count across grad-acc, so FLOPs/step naturally scales with grad-acc.
  • logging_frequency spam (trvachov, esm2/train_ddp.py:206) — done; MFU fires only on logging steps.
  • FSDP2 comm-overhead concerns (trvachov, flops.py:390) — moot; comm code removed.
  • PCIe Gen3 default (trvachov, flops.py:366) — moot; comm code removed.
  • Training scripts should stay simple (pstjohn) — done; all MFU scaffolding removed from training scripts. The only per-script change is adding one kwarg.
  • FLOPs recomputation footgun (pstjohn) — per-token FLOPs is computed once in PerfLogger.__init__ and stored; at each logging step we just do per_token_flops * num_unpadded_tokens / step_time. No recomputation, no model introspection at runtime.
  • "Do we need to copy flops.py to every recipe?" (pstjohn, check_copied_files.py:203) — moot; no more copied file. Each recipe's existing perf_logger.py gets the FLOPs helper inlined.

Verified

  • Pre-commit passes (ruff, ruff format, mdformat, license check, copied-files, secrets).
  • Unit tests for _compute_per_token_flops (GQA+SwiGLU, MHA+standard-FFN, seq-len scaling, linearity, vocab=0 no-LM-head) + existing TestPerfLoggerLoss in llama3 → 11/11 pass. codonfm existing tests → 5/5 pass.
  • Smoke-tested on 2× A6000 for 5 steps per recipe/strategy: llama3 DDP + FSDP2 + FSDP2+CP, esm2 DDP + FSDP2, codonfm FSDP2, opengenome2 FSDP2 + FSDP2+CP — all emit train/mfu_pct and train/tflops_per_gpu via WANDB. Default path (no log_mfu) still works unchanged.
  • The esm2 FSDP2+CP (train_fsdp2_cp.py) hits an IndexError: list index out of range in PyTorch's DTensor redistribute during init_empty_weights — reproduces without log_mfu=true too, so it's a pre-existing issue unrelated to this PR.

Usage (unchanged from the outside)

torchrun --nproc_per_node=2 train_fsdp2.py log_mfu=true

Appears in WANDB as train/mfu_pct and train/tflops_per_gpu, alongside loss/grad_norm/step_time/etc.

Thanks for the thorough review — the PR is significantly better for it.

Reworks MFU tracking per reviewer feedback on #1548:

- Delete per-recipe flops.py, test_flops.py, and the CLI entirely
- Inline ~30-line FLOPs helper into each recipe's existing perf_logger.py
- MFU metrics (train/tflops_per_gpu, train/mfu_pct) flow through the existing
  torchmetrics -> WANDB path, respecting logging_frequency
- Drop comm-overhead estimation; will be a separate future PR

The new formula is per_token_flops(seq_len) * num_unpadded_tokens_on_rank.
The unpadded-tokens counter (already used by tokens_per_second_per_gpu) is
per-rank after DP/CP sharding and accumulated across grad-acc micro-batches,
so the formula works uniformly across DDP/FSDP2/FSDP2+CP/DDP+CP/mFSDP and
across BSHD and THD (sequence packing) with no per-strategy factors.

Net: -3000 / +300 lines. Training scripts lose all MFU scaffolding; the
only change per script is one extra kwarg on the PerfLogger constructor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Collaborator Author

gagank1 commented Apr 18, 2026

/ok to test e6468fc

gagank1 and others added 2 commits April 18, 2026 02:38
The prior implementation used `self.num_unpadded_tokens` (attention_mask.sum())
as the token count in the MFU formula, which produces an "effective-work MFU"
that under-reports vs published numbers when the batch has padding.

Switch to `self.num_tokens` (input_ids.numel()) to match the PaLM /
Megatron-LM / MosaicML llm-foundry / torchtitan convention of counting the
configured token shape the hardware actually runs matmuls over, regardless
of attention masking. For fully-packed THD (no inter-sequence padding) the
two counters are equivalent; for BSHD with padding, this restores
standard MFU semantics.

References for the convention:
  - PaLM (Chowdhery et al. 2022, App. B): "tokens per second ... peak
    throughput R = P / (6N + 12LHQT)" where T is the configured/padded
    sequence length.
  - Megatron-LM num_floating_point_operations() uses global batch *
    args.seq_length (configured/padded).
  - MosaicML llm-foundry and torchtitan both use configured seq_len.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Collaborator

@trvachov trvachov left a comment

I had a prior comment to replace batch_size with micro_batch_size or global_batch_size to make it explicit, but it seems like this code doesn't use the concept of batch_size at all and somehow does FLOPs calculations using effective sequence length as the core input. Why do this? It would make sense to me if this simplified things, but I think it obfuscates it more. e.g. I think it still doesn't get the right FLOPs number for THD sequence packing and makes the FLOPs counting code rather tricky to understand.

For non-sequence-packed attention, the inputs are the max sequence length per sequence and the micro batch size * DDP ranks.
For sequence-packed attention, the inputs are the max sequence length across all the sequences and the boundaries of the sequences * DDP ranks.

Then for both of the above, there are true tokens and padded tokens. I think making this explicit in the token counting (e.g. MFU_padded and MFU_unpadded) would make this very clear for the user and be a good recipe starting point.

As it stands, the first round of this PR had batch_size as a concept, which was a little confusing because it didn't distinguish between micro and global batch size, and now we have total sequence length as a concept, but a) that is miscalculating the THD FLOPs and b) it's somewhat inconsistent with our prior variable definitions, e.g. in this README here https://github.com/NVIDIA/bionemo-framework/tree/trvachov/mixtral-recipes-repush/bionemo-recipes/recipes/llama3_native_te

What do you think?

First-principles matmul count: Q/K/V/O projections (GQA-aware), attention
logits/values (the S^2 cost expressed per-token as 4*S*H), 2-or-3-projection
MLP (SwiGLU detected via model_type), and LM head. The returned value is
multiplied by the actual unpadded token count at log time, so it naturally
Collaborator

This behavior is very strange for two reasons:

  1. This docstring is documenting behavior outside of this function, which is just a weird thing to do code-wise. Why should this function document code that is outside of this function?
  2. I don't think what is being said here is correct -- "returned value is multiplied by the actual unpadded token count"... I don't think this is actually happening, and even if it was, I'm not sure that's correct.

self._per_token_flops = 0
self._peak_tflops: float | None = None
if self._log_mfu:
    self._per_token_flops = _compute_per_token_flops(model_config_dict, args.dataset.max_seq_length)
Collaborator

A few issues here:

  1. self._per_token_flops isn't being multiplied by any sort of padded-vs-unpadded sequence factor, as is documented in the _compute_per_token_flops docstring.
  2. Even if it was doing the above, that would simply make a correction for BSHD padding. I think this would be totally incorrect for THD sequence packing, where the attention matrix size is not O(max_seq_length^2) but rather a sum of the individual cu_seqlens fragments squared.

Here's some docs that explain this https://docs.nvidia.com/nemo-framework/user-guide/25.09/sft_peft/packed_sequence.html <--- these docs are a little old but probably right -- worth checking with TE team. @jomitchellnv @pstjohn is my understanding correct that TE does a sum of little squares for THD sequence packing?

Collaborator Author

re: 2. Yes, this is true: it doesn't use the sum of the squares of each packed sequence, but that was an intentional choice, since the formula in the Llama3 native TE recipe that was used for benchmarking doesn't either (it uses B*S^2). I considered doing that, but it would require inserting some tracking into the training scripts to pass around cu_seq_lens (or its sum of squares). I assumed we'd prefer simplicity over modifying the training scripts further (per Peter's previous comments). Should I add that in?

head_dim = h // n_heads
kv_dim = n_kv * head_dim
ffn = model_config_dict["intermediate_size"]
vocab = model_config_dict.get("vocab_size", 0)
Collaborator

should this be vocab_size, or padded_vocab_size?

I don't think this is a big correction because usually the num_layers term is much bigger than this one, but it's nonetheless worth getting right.

Collaborator Author

Good catch — fixed in b979eed.

The LM-head matmul does run at padded_vocab_size (ESM-2: 33 → 64 for FP8/tensor-core-friendly shapes); logits get sliced back to vocab_size only after the compute happens (modeling_esm_te.py:585), so the FLOPs are real.

Scoped the correction to the hardware-view metric only:

  • mfu_pct / tflops_per_gpu (useful work) — keeps raw vocab_size.
  • mfu_padded_pct / tflops_per_gpu_padded (hardware view) — uses padded_vocab_size when present.

Applied uniformly across all four recipes' perf_logger.py via a use_padded_vocab=True kwarg on _compute_non_attn_per_token_flops. For llama3 / og2 / codonfm the configs don't set padded_vocab_size, so the fallback (or vocab_size) makes the two values identical and nothing changes in those recipes — only ESM-2 sees any delta, and it's small (≲0.01 pp of total FLOPs on 3B/15B, up to ~0.3 pp on the tiny sanity-check model) as you predicted.
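As a sanity check on the split being discussed, here is a hedged sketch of a GQA-aware non-attention per-token FLOPs helper with a padded-vocab fallback. Field names follow HF config conventions, but the mlp_type key, the function name, and the toy config are illustrative, not the recipe's actual helper; the factor of 6 is the usual 2 FLOPs/MAC forward times 3 for fwd+bwd.

```python
def non_attn_per_token_flops(cfg: dict, use_padded_vocab: bool = False) -> int:
    """Fwd+bwd FLOPs per token, excluding the Sigma(L_i^2) attention term.

    6 = 2 FLOPs per multiply-accumulate (forward) * 3 (1x fwd + 2x bwd).
    """
    h = cfg["hidden_size"]
    n_heads = cfg["num_attention_heads"]
    n_kv = cfg.get("num_key_value_heads", n_heads)  # GQA when n_kv < n_heads
    kv_dim = n_kv * (h // n_heads)
    n_proj = 3 if cfg.get("mlp_type") == "swiglu" else 2  # illustrative detection only
    vocab = cfg["vocab_size"]
    if use_padded_vocab:  # hardware view: the LM-head matmul runs at the padded size
        vocab = cfg.get("padded_vocab_size") or vocab
    per_layer = 6 * (2 * h * h          # Q and O projections
                     + 2 * h * kv_dim   # K and V projections (GQA-aware)
                     + n_proj * h * cfg["intermediate_size"])
    return cfg["num_hidden_layers"] * per_layer + 6 * h * vocab


toy = {"hidden_size": 64, "num_attention_heads": 8, "num_key_value_heads": 2,
       "intermediate_size": 256, "num_hidden_layers": 2, "vocab_size": 100,
       "padded_vocab_size": 128, "mlp_type": "swiglu"}
useful = non_attn_per_token_flops(toy)                          # raw vocab_size
hardware = non_attn_per_token_flops(toy, use_padded_vocab=True)  # padded_vocab_size
```

For configs without padded_vocab_size, the `or vocab` fallback makes the two values identical, matching the behavior described above.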

gagank1 and others added 7 commits April 22, 2026 12:25
Previous formula baked `4·S·H` attention per token at init-time using
`args.dataset.max_seq_length`, then multiplied by num_tokens at log time.
For a packed THD batch of total length S with docs L₁, L₂, …, Lₙ, this
yields `4·H·S²` attention per rank — overstating work by `S²/Σ(Lᵢ²)`
because Flash-Attention actually computes one Lᵢ×Lᵢ score matrix per
packed segment.

Split per-token FLOPs into non-attention (QKV/MLP/LM-head) and attention
coefficient, then at each logging window:
  flops = non_attn_per_token · num_tokens + coeff · Σ(Lᵢ²) / cp_size

`Σ(Lᵢ²)` is accumulated across grad-acc micro-batches from
`cu_seq_lens_q` (real per-doc lengths) in log_micro_step, falling back
to `cu_seq_lens_q_padded` or `B·S²` from the batch shape for BSHD. The
algebraic identity `non_attn + coeff·S ≡ _compute_per_token_flops(cfg, S)`
is locked in by a unit test in every recipe, so BSHD and
synthetic-single-doc-THD benchmarks produce identical numbers to the
legacy formula.

Also fixes ESM-2's latent grad-acc undercount: its PerfLogger previously
had no log_micro_step and read num_tokens from only the last micro-batch
at log time, silently dividing reported FLOPs by grad_acc_steps if
grad-acc were ever enabled. Retrofitted with the llama3/og2/codonfm
accumulator pattern: all five ESM-2 training scripts (train_{fsdp2,
fsdp2_cp,ddp,ddp_cp,mfsdp}.py) now call log_micro_step inside the inner
loop and log_step(step, grad_norm, lr) after the optimizer step.

Applied identically to llama3_native_te, opengenome2_llama_native_te,
codonfm_native_te, esm2_native_te per the per-recipe isolation rule.
Six invariant unit tests land in each recipe's test_perf_logger.py
(algebraic identity, BSHD no-op, THD single-doc/multi-doc/CP division,
unpadded-preferred, padded-fallback) plus ESM-2-specific grad-acc
accumulator tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
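The Sigma(L_i^2) accumulation from cu_seq_lens_q can be sketched as follows; cu_seq_lens here is a plain Python list of cumulative document boundaries, whereas the real code reads a tensor from the batch.

```python
def sum_sq_doc_lengths(cu_seq_lens: list[int]) -> int:
    """Sum of squared packed-document lengths, Sigma(L_i^2), from cumulative boundaries."""
    return sum((end - start) ** 2 for start, end in zip(cu_seq_lens, cu_seq_lens[1:]))


# Two docs of length 3 and 5 packed into a total length S=8:
# the naive B*S^2 term would count 64, but flash attention computes one
# L_i x L_i score matrix per segment, so the true work scales with 3^2 + 5^2 = 34.
packed = sum_sq_doc_lengths([0, 3, 8])
single = sum_sq_doc_lengths([0, 8])  # one full-length doc: identical to S^2
```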
Real-data benchmark runs (llama3 dclm + og2 metagenome at cp=8 dp=1
S=131072) surfaced a bug in _attn_work_from_batch's BSHD branch.

ContextParallelDataLoaderWrapper pre-splits the sequence along the CP
axis before the batch reaches log_micro_step, so batch["input_ids"].shape
per rank is (B, S/cp), not (B, S). The helper computed B*(S/cp)² from
the per-rank shape, and log_step divided by cp_size, yielding per-rank
attention FLOPs of B*S²/cp³ — wrong by cp².

Correct per-rank attention work is B*S²/cp (Q_rank @ K_full costs
(S/cp)*S per head summed).

Evidence on og2 BSHD cp=8 S=131072 with a full-length doc: reported
10.9% MFU / 246 TFLOPS/GPU, but the actual compute corresponds to
~60% MFU (matches the THD run with the same S on the same data).

Fix: pass cp_size into _attn_work_from_batch and multiply the BSHD
synthesis by cp_size². The helper now always returns a GLOBAL
(pre-shard) Σ(Lᵢ²), so the /cp_size in log_step consistently gives
per-rank attention regardless of whether the input is cu_seq_lens_q
(already global) or per-rank shape (needs the cp² correction).

THD path unaffected (cu_seq_lens_q was always global). BSHD without
CP unaffected (cp_size=1 makes the multiplier 1). Existing published
benchmarks in ci/benchmarks/{og2,esm2}_benchmarks.md used synthetic
cu_seq_lens=[0, S] and therefore flowed through the THD path — they
remain valid.

Added test_bshd_cp_correction to all four recipes asserting both
the BSHD+CP scaling and the THD invariance under cp_size variation.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
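The cp^2 correction in this commit can be verified with a quick numeric identity check (illustrative values; the variable names are not from the recipe code):

```python
B, S, cp = 2, 131072, 8

per_rank_naive = B * (S // cp) ** 2       # computed from the per-rank (B, S/cp) shape
global_sigma = per_rank_naive * cp ** 2   # cp^2 correction recovers global B*S^2
per_rank_attn = global_sigma // cp        # log_step's /cp_size: B*S^2/cp per rank

# The buggy path divided the naive per-rank value by cp, i.e. B*S^2/cp^3,
# wrong by a factor of cp^2.
buggy = per_rank_naive // cp
```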
The memory reporting used torch.cuda.memory_allocated() (current) at log
time, so both the "max" and "mean" metrics reflected the post-step
resting footprint. Under FSDP that massively undercounts the transient
peak: params get all-gathered briefly during forward/backward and
activations balloon before backward frees them, but by the time
log_step runs they're already released.

Switch to torch.cuda.max_memory_allocated() (tracks the high-water
mark of live allocations) paired with torch.cuda.reset_peak_memory_stats()
so each logging window reports its own peak rather than a running max
since process start. Both are host-side counter ops: no sync, no
kernel launch.

The "mean" metric keeps using the post-step current value as a
"resting footprint" indicator (useful to see how much memory is held
between steps), while "max" now reflects what the hardware actually
peaks at. These are the values that belong in the benchmark tables.

Applied identically to all four MFU-tracking recipes.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Current reporting mixes conventions: non-attention flops count every slot
(padded, PaLM convention), but attention flops come from cu_seq_lens_q
which drops both CP-zigzag padding and BSHD row padding. For BSHD with
dynamic-padded short docs this made the attention term tiny while
non-attention counted all the pad positions — an inconsistent hybrid that
is neither pure useful work nor pure hardware view.

Split into two parallel metrics:

  train/mfu_pct / tflops_per_gpu
    Useful-work rate. Counts only real tokens:
      non_attn uses num_unpadded_tokens (attention_mask.sum())
      attn     uses Σ(Lᵢ²) from cu_seq_lens_q (THD), or per-row
               attention_mask.sum(dim=-1) (BSHD). Excludes CP zigzag
               divisibility pad and BSHD row pad.

  train/mfu_padded_pct / tflops_per_gpu_padded
    Hardware view. Counts all slots the GPU actually burns cycles on:
      non_attn uses input_ids.numel()
      attn     uses Σ(Lᵢ²) from cu_seq_lens_q_padded (THD) or
               full batch shape (BSHD). Includes both padding types.

For synthetic single-doc-per-pack THD runs (where Σ(Lᵢ²) == S² and
there is no padding of either kind), unpadded == padded and the two
metrics report identical values; published benchmarks in ci/benchmarks
remain internally consistent either way.

Added an include_padding kwarg to _attn_work_from_batch(). The function
still returns a pre-CP-shard global quantity; the caller divides by
cp_size in log_step. BSHD per-row real lengths come from attention_mask;
the docstring notes that for BSHD+CP this is exact when padding is evenly
distributed (the common case) and may slightly underestimate when padding
is concentrated at one end.

Added two unit tests per recipe (test_include_padding_thd,
test_include_padding_bshd_with_attention_mask) pinning the split.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
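The core of the two-metric split is one small function evaluated against two cumulative-length vectors. A minimal sketch, with an assumed name rather than the module's actual helper:

```python
def attn_work(cu_seq_lens):
    """Sum of squared document lengths, the quadratic attention term,
    from a cumulative-length vector [0, L_1, L_1 + L_2, ...]."""
    return sum((hi - lo) ** 2 for lo, hi in zip(cu_seq_lens, cu_seq_lens[1:]))


# One 5-token document padded out to an 8-slot pack:
useful = attn_work([0, 5])    # 25: useful-work view (cu_seq_lens_q)
hardware = attn_work([0, 8])  # 64: hardware view (cu_seq_lens_q_padded)

# Two real docs of length 3 and 2 in one pack:
multi = attn_work([0, 3, 5])  # 3**2 + 2**2 = 13
```

When there is no padding of either kind, the two vectors coincide and the metrics agree, which is the single-doc-per-pack case the published benchmarks exercise.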
The single-term per-token FLOP function was kept around after the
two-term split (non_attn + coeff·Σ(Lᵢ²)) was introduced, solely so
old unit tests and the startup log line could still reference it.
Nothing in the runtime path was using it — log_step had already
switched to the split formula. Remove the function, the
self._per_token_flops attribute, and the test scaffolding that
existed only to pin the split against the legacy reference:

  - _compute_per_token_flops(cfg, seq_len) deleted from all 4 recipes
  - self._per_token_flops init + assignment + log-line mention removed
  - logger.info startup format no longer includes "per-token FLOPs=%e"
  - TestComputePerTokenFlops test class deleted (llama3 only)
  - test_algebraic_identity deleted (its whole purpose was to pin
    non_attn + coeff·S against the legacy function)
  - test_bshd_no_op simplified to test_bshd_shape_synthesis — keeps
    the Σ(Lᵢ²)=B·S² shape-synthesis check, drops the legacy comparison
  - Docstrings no longer reference the old function

The tests that exercise actual formula correctness
(test_thd_multi_doc_uses_squared_sum, test_cp_size_divides_attention_only,
test_bshd_cp_correction, test_include_padding_*, etc.) all stay — they
verify behavior directly without going through the legacy function.

Net: -336 / +40 lines. Pure dead-code removal, no runtime change.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Reflect the modern two-pair metric layout (unpadded useful-work vs
padded hardware view) and the peak-memory reporting fix. Applied
identically to all four MFU-tracking recipes.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
For configs with padded_vocab_size set (ESM-2: 33→64 for FP8/tensor-core
friendliness), the LM-head matmul physically runs at padded width and the
logits are sliced back afterward. Count the padded width in the hardware-view
metric (mfu_padded_pct, tflops_per_gpu_padded) while continuing to count raw
vocab_size in the useful-work metric (mfu_pct, tflops_per_gpu). For configs
without padded_vocab_size (llama3, og2, codonfm) the two values collapse and
nothing changes.

Addresses review feedback from @trvachov on PR #1548.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
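The padded-vocab accounting reduces to counting the LM-head matmul (2 · T · H · V multiply-adds) at two widths. An illustrative sketch, not the module's actual API:

```python
def lm_head_flops(num_tokens, hidden, vocab_size, padded_vocab_size=None):
    """Return (useful, hardware) FLOPs for the LM-head matmul 2*T*H*V.

    The hardware view counts the padded width the matmul physically runs
    at before the logits are sliced back; the useful-work view counts the
    raw vocab. Without a padded vocab the two values coincide.
    """
    useful = 2 * num_tokens * hidden * vocab_size
    hardware = 2 * num_tokens * hidden * (padded_vocab_size or vocab_size)
    return useful, hardware


# ESM-2-style padding: vocab 33 padded to 64 for FP8 / tensor-core alignment
useful, hardware = lm_head_flops(1024, 2560, 33, padded_vocab_size=64)
```

For llama3/og2/codonfm-style configs, calling this with `padded_vocab_size=None` returns an identical pair, matching the "nothing changes" claim above.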