Generalize MFU/FLOPs module across recipes with log_mfu training hook#1548
Conversation
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
📝 Walkthrough

This PR introduces Model FLOPs Utilization (MFU) tracking across multiple native training recipes (LLaMA3, ESM2, CodonFM, OpenGenome2). A new architecture-independent `flops.py` module with an `MFUTracker` class provides the shared FLOPs formulas and tracking, gated by a `log_mfu` config flag.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant TrainScript as Training Script
    participant MFUTracker as MFUTracker
    participant FlopsModule as FLOPs Compute
    participant GPU as GPU

    TrainScript->>MFUTracker: __init__(config, batch_size, seq_len, ...)
    MFUTracker->>FlopsModule: from_hf_config(model_config)
    FlopsModule-->>MFUTracker: ModelFLOPsConfig
    MFUTracker->>FlopsModule: compute_flops_analytical/simplified()
    FlopsModule-->>MFUTracker: total_training_flops
    MFUTracker->>GPU: detect_gpu_peak_tflops()
    GPU-->>MFUTracker: peak_tflops_bf16

    loop Each Optimizer Step
        TrainScript->>TrainScript: measure step_time (perf_counter)
        TrainScript->>MFUTracker: compute_mfu(step_time)
        MFUTracker->>MFUTracker: mfu = (total_flops / step_time / peak_tflops)
        MFUTracker-->>TrainScript: {mfu_percent, tflops_per_gpu}
        TrainScript->>TrainScript: log metrics
    end
```
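The per-step computation inside the loop above is simple arithmetic. A minimal sketch of that calculation (function name, inputs, and the example numbers are illustrative, not the PR's exact API):

```python
# Minimal MFU sketch: fraction of a GPU's peak BF16 throughput actually achieved.
# total_step_flops, peak_tflops, and num_gpus are illustrative inputs.
def compute_mfu(total_step_flops: float, step_time_s: float,
                peak_tflops: float, num_gpus: int = 1) -> dict:
    # Achieved throughput per GPU, in TFLOPS.
    achieved_tflops_per_gpu = total_step_flops / step_time_s / num_gpus / 1e12
    # MFU is the achieved fraction of the hardware's dense BF16 peak, in percent.
    mfu_percent = 100.0 * achieved_tflops_per_gpu / peak_tflops
    return {"mfu": mfu_percent, "tflops_per_gpu": achieved_tflops_per_gpu}

# Example: 2e15 FLOPs per optimizer step, 4 s/step, one GPU with a 989 TFLOPS
# dense BF16 peak (a hypothetical figure for illustration).
info = compute_mfu(2e15, 4.0, 989.0)
```

The division order matches the sequence diagram: total FLOPs over wall time gives achieved throughput, which is then normalized by the per-GPU peak.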
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```python
if rank == 0:
    print("Counting FLOPs with HF model (meta device)...")
hf_config_meta = LlamaConfig.from_pretrained(args.config_path)
hf_config_meta._attn_implementation = "eager"
```
This is going to be the slowest possible attention; can you use FA2 or SDPA?
This was from an older version of the PR; it no longer includes the HF comparison.
- Add compare_mfu_validate.py: golden value tests comparing CP vs non-CP execution for both TE and HF models, validating loss, logits (cosine sim > 0.99), and gradients (cosine sim > 0.8), following the pattern from models/llama3/tests/test_cp_bshd.py
- Fix silent RoPE bug in HF CP: position_ids were not passed to context_parallel buffers, causing each rank to auto-generate [0..S/CP-1] instead of the correct global positions
- Switch FLOPs counting from eager to SDPA attention for consistency with the actual training implementation (identical counts on meta device)
- Add exact FLOPs output (full integers with commas) alongside abbreviated values in both single-GPU and multi-GPU scripts
- Switch bandwidth measurement from all-reduce to all-gather for a more accurate pure data-movement measurement matching CP ring attention
- Add position_ids and max_length_q/k support to measure_step_time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Replaces all-gather with explicit send/recv between rank 0 and rank 1, matching CP ring attention's actual communication pattern. Measures 6.6 GB/s unidirectional on PCIe Gen 3 x8 (vs. 3.2 GB/s all-gather, 4.0 GB/s all-reduce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Adds subcommands for using the MFU utilities outside of full training scripts:

- gpu-info: Print GPU name, detected peak TFLOPS, and the known-GPU table
- flops: Compute FLOPs from a model config (README + first-principles)
- cp-comm: Estimate CP ring attention communication volume
- bandwidth: Measure unidirectional P2P bandwidth via send/recv (torchrun)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Add architecture-independent flops.py module (source: models/esm2/flops.py) supporting Llama (GQA+SwiGLU), ESM2, CodonFM (MHA+GELU), and Evo2 (Hyena).

Key additions:
- ModelFLOPsConfig dataclass with from_hf_config() auto-detection
- Analytical, simplified (README), and Hyena FLOPs formulas
- MFUTracker class for training script integration
- Communication overhead estimation (CP ring attention, FSDP)
- CLI: gpu-info, flops, cp-comm, bandwidth

Training integration (gated by log_mfu: false in hydra config):
- 11 training scripts across 4 native_te recipes
- llama3_native_te (3), esm2_native_te (5), codonfm_native_te (1), opengenome2_llama_native_te (2)

Removes superseded Llama-specific compare_mfu*.py benchmark scripts. Moves first_principles.md to models/esm2/ alongside the source module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Fix RTX 4090 (330→165) and RTX 3090 (142→71): were using sparse values
- Remove A5000, RTX 4090, RTX 3090 (not relevant for data-center workloads)
- Add B300 and GB300 Blackwell Ultra at 2,500 TFLOPS dense BF16
- Add authoritative NVIDIA source URLs for all GPU specs as comments

Sources: B300/GB300: https://www.nvidia.com/en-us/data-center/gb300-nvl72/ (360 PFLOPS BF16 sparse / 72 GPUs / 2 = 2,500 TFLOPS dense per GPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
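The dense-per-GPU derivation cited above is worth spelling out, since marketed rack-level numbers are BF16 sparse. A worked example of the same arithmetic (variable names are illustrative):

```python
# GB300 NVL72 rack figure, per the commit message: 360 PFLOPS BF16 sparse.
rack_pflops_sparse_bf16 = 360
gpus_per_rack = 72
sparsity_factor = 2  # 2:4 structured sparsity doubles the marketed number

per_gpu_sparse_tflops = rack_pflops_sparse_bf16 * 1000 / gpus_per_rack  # 5000.0
per_gpu_dense_tflops = per_gpu_sparse_tflops / sparsity_factor          # 2500.0
```

Dividing the sparse figure by two gives the dense BF16 number that MFU should be measured against, since training kernels do not use structured sparsity.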
Remove functions that were only used by the now-deleted compare_mfu*.py standalone benchmark scripts:

- compute_flops_first_principles(), compute_flops_readme() (backward-compat wrappers)
- measure_step_time(), split_for_cp_bshd(), count_flops_with_model(), cleanup_model()
- Unused imports: gc, nullcontext, FlopCounterMode

Retained: MFUTracker, from_hf_config, compute_flops_analytical, compute_flops_simplified, compute_flops_hyena, the CLI, and all utilities needed by the training-script log_mfu integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Move flops.py source from models/esm2/ to recipes/llama3_native_te/ (it's a training utility, not a model utility)
- Add test_flops.py with 27 tests covering config auto-detection, analytical/simplified/hyena formulas, MFUTracker, and CP comm estimation
- Update check_copied_files.py: source is now llama3_native_te, copied to the esm2, codonfm, and opengenome2 recipes (both flops.py and test_flops.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Document the log_mfu=true flag and flops.py CLI utilities in all 4 native_te recipe READMEs: llama3, esm2, codonfm, opengenome2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@coderabbitai full review

✅ Actions performed: Full review triggered.
/ok to test 27255a1
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (1)
108-120: ⚠️ Potential issue | 🟠 Major

Reset `last_step_time` on non-logging steps to prevent stale MFU inputs.

`self.last_step_time` is only updated inside the logging-frequency branch, so it can retain an old value. Since training code reads `perf_logger.last_step_time` each step, MFU can be computed with stale timing.

Proposed fix:

```diff
 with torch.no_grad():
+    self.last_step_time = None
     # FSDP2's clip_grad_norm_ returns a DTensor; convert to local tensor for torchmetrics compatibility.
     if isinstance(grad_norm, DTensor):
         grad_norm = grad_norm.to_local()
@@
     if step % self.logging_frequency == 0 and step > 0:
@@
         step_time = elapsed_time / self.logging_frequency
         self.last_step_time = step_time
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/esm2_native_te/perf_logger.py` around lines 108 - 120, The perf logger currently only updates self.last_step_time inside the logging branch guarded by self.logging_frequency, causing stale values on non-logging steps; update the step handler in perf_logger.py so that when a step is not a logging step you explicitly reset self.last_step_time (e.g., set to 0.0 or None) to avoid stale MFU inputs, keeping the existing logic that updates self.previous_step_time and last_step_time when (step % self.logging_frequency == 0).
🧹 Nitpick comments (3)
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py (1)
354-359: Same minor note as `train_fsdp2_cp.py`.

The `step_start_time` reset on line 359 happens unconditionally. The same optional refactor applies if you want consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py` around lines 354 - 359, The unconditional reset of step_start_time happens even when mfu_tracker is None; update the block around mfu_tracker so step_start_time is only reset when mfu_tracker is present (i.e., move the step_start_time = time.perf_counter() inside the if mfu_tracker is not None branch, after calling mfu_tracker.compute_mfu and optional dist_config.is_main_process()/logger.info) to match the optional flow used elsewhere (see mfu_tracker, compute_mfu, dist_config.is_main_process, logger.info).

bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py (1)

24-25: Consider using a more robust import approach.

The `sys.path.insert(0, ...)` pattern works but can be fragile. Since this test file will be copied across recipes, consider whether pytest's `conftest.py` or `pyproject.toml` configuration could provide a cleaner solution.

That said, this is a common pattern in recipe-style code and works correctly here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py` around lines 24 - 25, The test currently mutates sys.path via the sys.path.insert(0, ...) call in test_flops.py which is fragile; instead move that path-setup into a shared pytest configuration so tests don't each modify sys.path: create or update pytest's conftest.py to add the parent directory to sys.path (or better, configure package discovery via pyproject.toml/paths or install the package in editable mode) and remove the sys.path.insert(0, ...) line from test_flops.py so imports are resolved by pytest configuration rather than per-test mutation.

bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py (1)

362-367: Minor inconsistency: `step_start_time` reset occurs unconditionally.

The timer reset on line 367 happens even when `mfu_tracker is None`. While this has negligible performance impact, it's slightly inconsistent; either gate the entire timing block or make it explicit that timing is always active.

Also note this script uses manual `time.perf_counter()` timing while the ESM2 scripts use `perf_logger.last_step_time`. This difference is acceptable since different recipes may have different `PerfLogger` implementations, but it's worth documenting if intentional.

Optional: Gate the timing reset with the MFU tracker check:

```diff
         if mfu_tracker is not None:
             step_time = time.perf_counter() - step_start_time
             mfu_info = mfu_tracker.compute_mfu(step_time)
             if dist_config.is_main_process():
                 logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
-        step_start_time = time.perf_counter()
+            step_start_time = time.perf_counter()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py` around lines 362 - 367, The timing reset is currently unconditional; to be consistent, only reset step_start_time when mfu tracking is active. Inside the block that checks mfu_tracker is not None (the one that calls mfu_tracker.compute_mfu(...) and logs via logger.info when dist_config.is_main_process()), move the line that sets step_start_time = time.perf_counter() into that same guarded block (or explicitly comment why timing should remain unconditional). This references the mfu_tracker variable, compute_mfu method, step_start_time variable, dist_config.is_main_process(), and logger.info to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py`:
- Around line 24-41: The tests unsafely import the local flops.py via sys.path
and a plain "from flops import ...", causing module-cache collisions across
pytest sessions; modify the test to load the local flops.py under a unique
module name (e.g., using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 222-227: The MFU log omits the per-step duration even though
step_time is computed; update the logging call inside the mfu_tracker block
(where mfu_info = mfu_tracker.compute_mfu(step_time) and inside the
dist_config.is_main_process() check) to include step_time in the logger.info
message alongside mfu_info["mfu"] and mfu_info["tflops_per_gpu"] (e.g., add a
placeholder for step_time and pass step_time as an argument) so each MFU line
reports MFU, TFLOPS/GPU, and the per-step time.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py`:
- Around line 607-647: The CLI flag args.formula is unused: either honor it or
remove it; to fix, branch on args.formula (values "simplified" or "analytical")
before calling compute_flops_simplified and compute_flops_analytical so you only
compute and print the requested method. Specifically, use args.formula to decide
whether to call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
- Around line 453-483: In measure_bus_bandwidth, reject world sizes != 2 instead
of assuming peers by updating the initial guard (world_size) to raise an error
when world_size is not exactly 2; ensure you check world_size at the top of the
function and raise a ValueError (or similar) with a clear message referencing
the peer calculation using peer = 1 - rank so callers know only 2-process runs
are supported; keep the rest of the logic unchanged once the guard ensures rank
is 0 or 1.
- Around line 104-140: from_hf_config currently computes derived fields
(head_dim, has_lm_head) before kwargs.update(overrides), which allows overrides
like num_attention_heads or vocab_size to leave derived values stale; update the
function so overrides are applied before computing derived fields (or apply
kwargs.update(overrides) and then recompute head_dim = hidden_size //
num_attention_heads and has_lm_head = vocab_size > 0 afterwards), ensuring
derived values (head_dim, has_lm_head and any other fields dependent on
overrides) reflect the final values passed to ModelFLOPsConfig; keep references
to from_hf_config, kwargs.update(overrides), head_dim, has_lm_head,
num_attention_heads, vocab_size and ModelFLOPsConfig to locate the changes.
- Around line 69-75: detect_gpu_peak_tflops currently calls
torch.cuda.get_device_name(0) unguarded and will crash on CPU-only hosts; update
detect_gpu_peak_tflops to first check torch.cuda.is_available() and only call
torch.cuda.get_device_name(0) when CUDA is available, otherwise return (None,
"unknown") so callers like _cli_gpu_info and MFUTracker(..., peak_tflops=None)
safely receive a fallback; keep the GPU_PEAK_TFLOPS_BF16 lookup logic intact
when a device name is present.
- Around line 486-507: The load_model_config function currently hardcodes
trust_remote_code=True when calling AutoConfig.from_pretrained; change it to
accept explicit keyword-only parameters (e.g., *, trust_remote_code: bool =
False, revision: Optional[str] = None) so callers must opt-in to remote code
execution and can pin a revision, then pass those parameters into
AutoConfig.from_pretrained(config_path, trust_remote_code=trust_remote_code,
revision=revision) and update any call sites accordingly.
In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py`:
- Around line 151-157: The MFUTracker is being configured with per-micro-batch
FLOPs which underestimates utilization when gradient accumulation is used;
update the MFUTracker initialization (the MFUTracker(...) call) to account for
gradient accumulation by multiplying the effective batch_size (or otherwise
scaling FLOPs) by the gradient accumulation factor (use
args.gradient_accumulation_steps or args.grad_acc_steps as present in the
training args) so the tracker uses full-step work rather than a single
micro-batch, and apply the same correction to the related MFU/config
calculations around this block.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py`:
- Around line 187-193: The MFUTracker is using args.dataset.micro_batch_size as
the FLOPs batch size which ignores gradient accumulation and underestimates MFU;
update the mfu_tracker initialization (symbol: MFUTracker / variable:
mfu_tracker) to use the effective batch size = args.dataset.micro_batch_size *
args.grad_acc_steps (or otherwise multiply by grad_acc_steps) when setting the
batch_size parameter, and apply the same change wherever MFUTracker is
constructed (also the other block around the second occurrence).
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py`:
- Around line 165-170: MFU is undercounted because MFUTracker is initialized
with a single micro-batch; multiply the batch work by gradient accumulation
steps so FLOPs/ MFU reflect a full optimizer step. Update the MFUTracker
initialization (mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.
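The bandwidth finding above (lines 453-483) calls for rejecting world sizes other than 2 before computing the peer rank. A minimal sketch of that validation; the function name and message are illustrative, not the PR's actual code:

```python
def check_bandwidth_world_size(world_size: int, rank: int) -> int:
    """Return the peer rank for a point-to-point bandwidth test.

    The peer computation `peer = 1 - rank` only makes sense for exactly
    two processes, so reject any other world size up front.
    """
    if world_size != 2:
        raise ValueError(
            f"bandwidth measurement requires exactly 2 processes, got {world_size}"
        )
    return 1 - rank
```

With the guard in place, the rest of the send/recv logic can safely assume ranks 0 and 1 are the only participants.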
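The `from_hf_config` finding above is an ordering bug: derived fields must be computed after overrides are applied, or an override of `num_attention_heads` or `vocab_size` leaves `head_dim` and `has_lm_head` stale. A small self-contained sketch of the fixed ordering, with assumed field names standing in for the PR's `ModelFLOPsConfig`:

```python
from dataclasses import dataclass


@dataclass
class ModelFLOPsConfigSketch:
    # Illustrative stand-in for the PR's ModelFLOPsConfig; field names assumed.
    hidden_size: int
    num_attention_heads: int
    vocab_size: int
    head_dim: int
    has_lm_head: bool


def from_hf_config_sketch(hf_fields: dict, **overrides) -> ModelFLOPsConfigSketch:
    kwargs = dict(hf_fields)
    # Apply overrides FIRST, then compute derived fields, so the derived
    # values always reflect the final inputs.
    kwargs.update(overrides)
    kwargs["head_dim"] = kwargs["hidden_size"] // kwargs["num_attention_heads"]
    kwargs["has_lm_head"] = kwargs["vocab_size"] > 0
    return ModelFLOPsConfigSketch(**kwargs)


cfg = from_hf_config_sketch(
    {"hidden_size": 4096, "num_attention_heads": 32, "vocab_size": 128256},
    num_attention_heads=16,  # the override changes the derived head_dim too
)
```

With the buggy order (derive, then override), `cfg.head_dim` would stay at 4096 // 32 = 128 even though the head count was overridden to 16.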
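Several of the train-script findings above make the same point: with gradient accumulation, MFU must count the work of every micro-batch in the optimizer step, not just one. A sketch of the effective-batch scaling (names are illustrative, not the PR's API):

```python
# FLOPs per optimizer step must cover all micro-batches, not just one.
# flops_per_sample, micro_batch_size, and grad_acc_steps are illustrative names.
def flops_per_optimizer_step(flops_per_sample: float,
                             micro_batch_size: int,
                             grad_acc_steps: int) -> float:
    # The effective batch for one optimizer step spans all accumulation steps.
    effective_batch = micro_batch_size * grad_acc_steps
    return flops_per_sample * effective_batch


per_micro = flops_per_optimizer_step(1.0e12, 8, 1)  # no accumulation
per_step = flops_per_optimizer_step(1.0e12, 8, 4)   # 4 accumulation steps
```

Since the step timer in these scripts spans a full optimizer step, feeding per-micro-batch FLOPs into the tracker underestimates MFU by exactly the accumulation factor.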
📒 Files selected for processing (30)

- bionemo-recipes/recipes/codonfm_native_te/README.md
- bionemo-recipes/recipes/codonfm_native_te/flops.py
- bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py
- bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
- bionemo-recipes/recipes/esm2_native_te/README.md
- bionemo-recipes/recipes/esm2_native_te/flops.py
- bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/esm2_native_te/perf_logger.py
- bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py
- bionemo-recipes/recipes/esm2_native_te/train_ddp.py
- bionemo-recipes/recipes/esm2_native_te/train_ddp_cp.py
- bionemo-recipes/recipes/esm2_native_te/train_fsdp2.py
- bionemo-recipes/recipes/esm2_native_te/train_fsdp2_cp.py
- bionemo-recipes/recipes/esm2_native_te/train_mfsdp.py
- bionemo-recipes/recipes/llama3_native_te/README.md
- bionemo-recipes/recipes/llama3_native_te/flops.py
- bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/llama3_native_te/model_configs/meta-llama/Llama-3.1-8B/config.json
- bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py
- bionemo-recipes/recipes/llama3_native_te/train_ddp.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
- bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
- bionemo-recipes/recipes/opengenome2_llama_native_te/flops.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_flops.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py
- ci/scripts/check_copied_files.py
```python
import sys
from pathlib import Path

import pytest

# Add parent directory so we can import flops
sys.path.insert(0, str(Path(__file__).parent.parent))

from flops import (
    MFUTracker,
    ModelFLOPsConfig,
    compute_flops_analytical,
    compute_flops_hyena,
    compute_flops_simplified,
    estimate_cp_comm_bytes,
    from_hf_config,
)
```
🧩 Analysis chain
🏁 Scripts executed (outputs trimmed):

```shell
# Locate the test files and the flops.py copies
find . -name "test_flops.py" -type f | head -20
find . -name "flops.py" -type f | head -20

# Search for other test files using the same sys.path pattern
rg "sys\.path\.insert.*parent" --type py | head -20

# Verify what each test_flops.py imports from "flops"
for f in ./bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py \
         ./bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py \
         ./bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py; do
  echo "=== $f ==="
  sed -n '1,45p' "$f" | tail -20
done

# Check how pytest is configured (pytest.ini, pyproject.toml, conftest.py)
find . -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | grep -E "(pytest|pyproject)" | head -10
cat ./pytest.ini
rg "\[tool\.pytest" -A 10 ./pyproject.toml
find ./bionemo-recipes/recipes -name "conftest.py" -type f
ls -la ./bionemo-recipes/recipes/conftest.py 2>/dev/null || echo "No conftest.py at recipes level"
ls -la ./bionemo-recipes/conftest.py 2>/dev/null || echo "No conftest.py at bionemo-recipes level"

# Confirm the per-recipe flops.py copies are currently identical
wc -l ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py
diff ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py | head -50
```

```python
# Simulate pytest running both test_flops.py files in the same process
import sys
from pathlib import Path

print("=== Scenario: Both test_flops.py files run in same pytest process ===\n")

# Test 1: llama3_native_te/tests/test_flops.py imports flops
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/llama3_native_te")))
import flops as flops_v1
print(f"1. Module: {flops_v1.__file__}")

# Test 2: codonfm_native_te/tests/test_flops.py runs later in the SAME process
sys.path.pop(0)
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/codonfm_native_te")))
# This would normally do: from flops import MFUTracker, ...
# But since 'flops' is already cached in sys.modules, Python returns it as-is!
import flops as flops_v2
print(f"2. Module: {flops_v2.__file__}")
print(f"   Same module? {flops_v1 is flops_v2}")
print("\n   RESULT: the second test suite gets llama3's flops, not codonfm's!")
```
Import the local flops.py under a unique module name.
This sys.path + from flops import ... pattern is unsafe when multiple recipe test files run in the same pytest process: the first imported flops module stays cached in sys.modules, so later test suites can end up testing the wrong recipe implementation. While the flops.py files are currently identical (copied from llama3_native_te), this pattern creates a fragile dependency on that equivalence. Please fix this in the source test template and re-sync the copies.
🔧 One safe pattern
-import sys
+import importlib.util
from pathlib import Path
import pytest
-# Add parent directory so we can import flops
-sys.path.insert(0, str(Path(__file__).parent.parent))
-
-from flops import (
- MFUTracker,
- ModelFLOPsConfig,
- compute_flops_analytical,
- compute_flops_hyena,
- compute_flops_simplified,
- estimate_cp_comm_bytes,
- from_hf_config,
-)
+FLOPS_PATH = Path(__file__).parent.parent / "flops.py"
+SPEC = importlib.util.spec_from_file_location(f"{__name__}_flops", FLOPS_PATH)
+assert SPEC is not None and SPEC.loader is not None
+flops = importlib.util.module_from_spec(SPEC)
+SPEC.loader.exec_module(flops)
+
+MFUTracker = flops.MFUTracker
+ModelFLOPsConfig = flops.ModelFLOPsConfig
+compute_flops_analytical = flops.compute_flops_analytical
+compute_flops_hyena = flops.compute_flops_hyena
+compute_flops_simplified = flops.compute_flops_simplified
+estimate_cp_comm_bytes = flops.estimate_cp_comm_bytes
+from_hf_config = flops.from_hf_config

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py` around lines
24 - 41, The tests unsafely import the local flops.py via sys.path and a plain
"from flops import ...", causing module-cache collisions across pytest sessions;
modify the test to load the local flops.py under a unique module name (e.g.,
using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.
```python
if mfu_tracker is not None:
    step_time = time.perf_counter() - step_start_time
    mfu_info = mfu_tracker.compute_mfu(step_time)
    if dist_config.is_main_process():
        logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
    step_start_time = time.perf_counter()
```
Include step_time in the MFU log line for complete per-step telemetry.
You already compute step_time on Line 223, but the emitted message omits it.
Proposed fix
if mfu_tracker is not None:
step_time = time.perf_counter() - step_start_time
mfu_info = mfu_tracker.compute_mfu(step_time)
if dist_config.is_main_process():
- logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
+ logger.info(
+ "Step %d MFU: %.1f%% (%.2f TFLOPS/GPU, step_time=%.3fs)",
+ step,
+ mfu_info["mfu"],
+ mfu_info["tflops_per_gpu"],
+ step_time,
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py` around lines 222 -
227, The MFU log omits the per-step duration even though step_time is computed;
update the logging call inside the mfu_tracker block (where mfu_info =
mfu_tracker.compute_mfu(step_time) and inside the dist_config.is_main_process()
check) to include step_time in the logger.info message alongside mfu_info["mfu"]
and mfu_info["tflops_per_gpu"] (e.g., add a placeholder for step_time and pass
step_time as an argument) so each MFU line reports MFU, TFLOPS/GPU, and the
per-step time.
```python
def detect_gpu_peak_tflops():
    """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
    device_name = torch.cuda.get_device_name(0)
    for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
        if gpu_key.lower() in device_name.lower():
            return tflops, device_name
    return None, device_name
```
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "flops.py" | grep -E "(llama|flops)"

Repository: NVIDIA/bionemo-framework
Length of output: 285
🏁 Script executed:
cd bionemo-recipes && find . -type f -name "flops.py" 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework
Length of output: 221
🏁 Script executed:
cat -n ./recipes/llama3_native_te/flops.py

Repository: NVIDIA/bionemo-framework
Length of output: 135
🏁 Script executed:
cat -n ./bionemo-recipes/recipes/llama3_native_te/flops.py

Repository: NVIDIA/bionemo-framework
Length of output: 32632
Guard CUDA-name lookup before touching torch.cuda.
The function calls torch.cuda.get_device_name(0) without checking whether CUDA is available, causing it to fail on CPU-only hosts instead of falling back to "unknown". _cli_gpu_info() and MFUTracker(..., peak_tflops=None) will crash in non-CUDA environments. Check torch.cuda.is_available() (per PyTorch documentation) before invoking CUDA device queries.
Minimal fix
def detect_gpu_peak_tflops():
"""Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
+ if not torch.cuda.is_available():
+ return None, "unknown"
device_name = torch.cuda.get_device_name(0)
for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
if gpu_key.lower() in device_name.lower():
return tflops, device_name
    return None, device_name

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def detect_gpu_peak_tflops():
    """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
    if not torch.cuda.is_available():
        return None, "unknown"
    device_name = torch.cuda.get_device_name(0)
    for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
        if gpu_key.lower() in device_name.lower():
            return tflops, device_name
    return None, device_name
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 69 - 75,
detect_gpu_peak_tflops currently calls torch.cuda.get_device_name(0) unguarded
and will crash on CPU-only hosts; update detect_gpu_peak_tflops to first check
torch.cuda.is_available() and only call torch.cuda.get_device_name(0) when CUDA
is available, otherwise return (None, "unknown") so callers like _cli_gpu_info
and MFUTracker(..., peak_tflops=None) safely receive a fallback; keep the
GPU_PEAK_TFLOPS_BF16 lookup logic intact when a device name is present.
```python
def from_hf_config(config_dict, **overrides):
    """Create ModelFLOPsConfig from an HF-compatible config dict.

    Auto-detects architecture:
    - GQA vs MHA: from num_key_value_heads (absent = MHA)
    - Gated MLP (3 proj) vs standard FFN (2 proj): from model_type
    - LM head: from vocab_size > 0

    Args:
        config_dict: Dict with standard HF config keys (hidden_size, num_hidden_layers, etc.).
            Works with config.json dicts, config.to_dict(), or MODEL_PRESETS dicts.
        **overrides: Explicit overrides for any field (e.g., num_mlp_projections=3).
    """
    h = config_dict["hidden_size"]
    n_heads = config_dict["num_attention_heads"]
    n_kv = config_dict.get("num_key_value_heads", n_heads)
    vocab = config_dict.get("vocab_size", 0)
    model_type = config_dict.get("model_type", "")

    # Detect gated MLP (3 projections) vs standard FFN (2 projections).
    # Llama/Mistral/Qwen use SwiGLU (gate + up + down = 3 projections).
    # ESM2/CodonFM/Geneformer/BERT use standard FFN (up + down = 2 projections).
    num_mlp_proj = 3 if model_type in GATED_MLP_MODEL_TYPES else 2

    kwargs = {
        "hidden_size": h,
        "num_hidden_layers": config_dict["num_hidden_layers"],
        "num_attention_heads": n_heads,
        "num_kv_heads": n_kv,
        "head_dim": h // n_heads,
        "intermediate_size": config_dict["intermediate_size"],
        "num_mlp_projections": num_mlp_proj,
        "vocab_size": vocab,
        "has_lm_head": vocab > 0,
    }
    kwargs.update(overrides)
    return ModelFLOPsConfig(**kwargs)
```
Recompute derived fields after applying overrides.
kwargs.update(overrides) can leave head_dim and has_lm_head inconsistent with the overridden values. For example, vocab_size=0 still keeps the precomputed has_lm_head unless the caller knows to override both, and overriding num_attention_heads leaves the old head_dim.
♻️ Suggested fix
def from_hf_config(config_dict, **overrides):
@@
- kwargs = {
- "hidden_size": h,
- "num_hidden_layers": config_dict["num_hidden_layers"],
- "num_attention_heads": n_heads,
- "num_kv_heads": n_kv,
- "head_dim": h // n_heads,
- "intermediate_size": config_dict["intermediate_size"],
- "num_mlp_projections": num_mlp_proj,
- "vocab_size": vocab,
- "has_lm_head": vocab > 0,
- }
- kwargs.update(overrides)
- return ModelFLOPsConfig(**kwargs)
+ kwargs = {
+ "hidden_size": h,
+ "num_hidden_layers": config_dict["num_hidden_layers"],
+ "num_attention_heads": n_heads,
+ "num_kv_heads": n_kv,
+ "intermediate_size": config_dict["intermediate_size"],
+ "num_mlp_projections": num_mlp_proj,
+ "vocab_size": vocab,
+ }
+ kwargs.update(overrides)
+ kwargs.setdefault("head_dim", kwargs["hidden_size"] // kwargs["num_attention_heads"])
+ kwargs.setdefault("has_lm_head", kwargs["vocab_size"] > 0)
+ return ModelFLOPsConfig(**kwargs)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 104 - 140,
from_hf_config currently computes derived fields (head_dim, has_lm_head) before
kwargs.update(overrides), which allows overrides like num_attention_heads or
vocab_size to leave derived values stale; update the function so overrides are
applied before computing derived fields (or apply kwargs.update(overrides) and
then recompute head_dim = hidden_size // num_attention_heads and has_lm_head =
vocab_size > 0 afterwards), ensuring derived values (head_dim, has_lm_head and
any other fields dependent on overrides) reflect the final values passed to
ModelFLOPsConfig; keep references to from_hf_config, kwargs.update(overrides),
head_dim, has_lm_head, num_attention_heads, vocab_size and ModelFLOPsConfig to
locate the changes.
```python
def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
    """Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
    if world_size <= 1:
        return 0.0

    rank = dist.get_rank()
    tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
    peer = 1 - rank

    for _ in range(5):
        if rank == 0:
            dist.send(tensor, dst=peer)
            dist.recv(tensor, src=peer)
        else:
            dist.recv(tensor, src=peer)
            dist.send(tensor, dst=peer)
    torch.cuda.synchronize()

    dist.barrier()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_iters):
        if rank == 0:
            dist.send(tensor, dst=peer)
        else:
            dist.recv(tensor, src=peer)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    data_bytes = tensor.nelement() * tensor.element_size()
    return num_iters * data_bytes / elapsed / 1e9
```
Reject world sizes other than 2 in the bandwidth helper.
peer = 1 - rank only works for ranks 0 and 1. If this gets launched with more than two processes, ranks >= 2 compute invalid peers and can hang the send/recv loop. Please fail fast here instead of assuming the caller followed the docstring.
🔒 Suggested guard
def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
"""Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
if world_size <= 1:
return 0.0
+ if world_size != 2:
+ raise ValueError("measure_bus_bandwidth currently supports exactly 2 ranks")
rank = dist.get_rank()
tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
- peer = 1 - rank
+ peer = 1 if rank == 0 else 0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 453 - 483, In
measure_bus_bandwidth, reject world sizes != 2 instead of assuming peers by
updating the initial guard (world_size) to raise an error when world_size is not
exactly 2; ensure you check world_size at the top of the function and raise a
ValueError (or similar) with a clear message referencing the peer calculation
using peer = 1 - rank so callers know only 2-process runs are supported; keep
the rest of the logic unchanged once the guard ensures rank is 0 or 1.
```python
def load_model_config(config_path):
    """Load model config dict from a local path or HuggingFace model ID.

    Supports:
    - Local directory: ./model_configs/lingua-1B (reads config.json inside)
    - Local file: ./model_configs/lingua-1B/config.json
    - HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
    """
    import json
    from pathlib import Path

    path = Path(config_path)
    if path.is_dir():
        path = path / "config.json"
    if path.exists():
        return json.loads(path.read_text())

    # Fall back to HuggingFace Hub
    from transformers import AutoConfig

    hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
    return hf_config.to_dict()
```
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the file
find . -type f -name "flops.py" | head -20

Repository: NVIDIA/bionemo-framework
Length of output: 285
🏁 Script executed:
# Also, let's search for the specific function
rg "def load_model_config" -A 25

Repository: NVIDIA/bionemo-framework
Length of output: 8820
🏁 Script executed:
# Check if there are any calls to load_model_config in the codebase
rg "load_model_config\(" --type py

Repository: NVIDIA/bionemo-framework
Length of output: 1391
🏁 Script executed:
# Verify the actual line numbers in the target file
sed -n '486,507p' bionemo-recipes/recipes/llama3_native_te/flops.py | cat -n

Repository: NVIDIA/bionemo-framework
Length of output: 947
🏁 Script executed:
# Check that the docstring and function signature comply with line length (119 chars)
head -n 510 bionemo-recipes/recipes/llama3_native_te/flops.py | tail -n 25 | cat -n

Repository: NVIDIA/bionemo-framework
Length of output: 1050
Remove hardcoded trust_remote_code=True and add explicit opt-in parameters.
AutoConfig.from_pretrained(config_path, trust_remote_code=True) unconditionally enables execution of arbitrary code from the Hugging Face Hub. Per Hugging Face documentation, trust_remote_code defaults to False and should only be enabled as an explicit opt-in for trusted sources. Add keyword-only parameters to make this configurable and allow revision pinning:
Suggested fix
-def load_model_config(config_path):
+def load_model_config(config_path, *, trust_remote_code=False, revision=None):
"""Load model config dict from a local path or HuggingFace model ID.
Supports:
- Local directory: ./model_configs/lingua-1B (reads config.json inside)
- Local file: ./model_configs/lingua-1B/config.json
- HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
"""
@@ -494,5 +494,7 @@ def load_model_config(config_path):
from transformers import AutoConfig
- hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
+ hf_config = AutoConfig.from_pretrained(
+ config_path,
+ trust_remote_code=trust_remote_code,
+ revision=revision,
+ )
    return hf_config.to_dict()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 486 - 507,
The load_model_config function currently hardcodes trust_remote_code=True when
calling AutoConfig.from_pretrained; change it to accept explicit keyword-only
parameters (e.g., *, trust_remote_code: bool = False, revision: Optional[str] =
None) so callers must opt-in to remote code execution and can pin a revision,
then pass those parameters into AutoConfig.from_pretrained(config_path,
trust_remote_code=trust_remote_code, revision=revision) and update any call
sites accordingly.
```python
parser = argparse.ArgumentParser()
parser.add_argument("command")
parser.add_argument("--config-path", default="./model_configs/lingua-1B")
parser.add_argument("--batch-size", type=int, default=1)
parser.add_argument("--seq-len", type=int, default=4096)
parser.add_argument("--formula", default="analytical", choices=["analytical", "simplified"])
parser.add_argument("--num-gpus", type=int, default=1, help="Number of GPUs for per-GPU FLOPs and comm estimates")
parser.add_argument("--cp-size", type=int, default=1, help="Context parallelism size for comm overhead estimate")
parser.add_argument("--p2p-bw", type=float, default=6.0, help="P2P bandwidth in GB/s for comm time estimate")
args = parser.parse_args()

cfg_dict = load_model_config(args.config_path)
config = from_hf_config(cfg_dict)
b, s = args.batch_size, args.seq_len

print(
    f"Config: H={config.hidden_size}, L={config.num_hidden_layers},"
    f" n_heads={config.num_attention_heads}, n_kv={config.num_kv_heads},"
    f" I={config.intermediate_size}, V={config.vocab_size}"
)
print(
    f"MLP: {config.num_mlp_projections} projections"
    f" ({'SwiGLU/GLU' if config.num_mlp_projections == 3 else 'standard FFN'})"
)
print(f"Batch: B={b}, S={s}, GPUs={args.num_gpus}, CP={args.cp_size}")
print()

simplified = compute_flops_simplified(b, s, config.hidden_size, config.num_hidden_layers, config.vocab_size)
analytical, breakdown, lm_head = compute_flops_analytical(config, b, s)

print(f"{'Method':<24} {'FLOPs/step':>14} {'Per-GPU':>14} {'Exact':>30}")
print("-" * 86)
for name, flops in [("Simplified (README)", simplified), ("Analytical", analytical)]:
    per_gpu = flops // max(args.num_gpus, 1)
    print(f"{name:<24} {format_flops(flops):>14} {format_flops(per_gpu):>14} {format_flops_exact(flops):>30}")

if simplified != analytical:
    diff = analytical - simplified
    print(f"\nDifference: {format_flops_exact(diff)} ({diff / simplified * 100:+.2f}%)")
else:
    print("\nFormulas agree exactly for this config.")
```
--formula is currently a no-op.
The parser accepts --formula, but this code always computes and prints both simplified and analytical FLOPs and never branches on args.formula. Either honor the flag or remove it, otherwise the CLI advertises behavior it does not implement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 607 - 647,
The CLI flag args.formula is unused: either honor it or remove it; to fix,
branch on args.formula (values "simplified" or "analytical") before calling
compute_flops_simplified and compute_flops_analytical so you only compute and
print the requested method. Specifically, use args.formula to decide whether to
call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
```python
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
```
DDP MFU is underestimated under gradient accumulation.
Line 206 measures full-step duration, but Line 153 configures FLOPs for a single micro-batch.
This biases MFU low by grad_acc_steps.
Suggested fix
mfu_tracker = MFUTracker(
config=from_hf_config(config.to_dict()),
- batch_size=args.dataset.micro_batch_size,
+ batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
seq_len=args.dataset.max_seq_length,
num_gpus=dist_config.world_size,
parallelism={"dp": dist_config.world_size},
)

Also applies to: 205-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py` around lines 151 -
157, The MFUTracker is being configured with per-micro-batch FLOPs which
underestimates utilization when gradient accumulation is used; update the
MFUTracker initialization (the MFUTracker(...) call) to account for gradient
accumulation by multiplying the effective batch_size (or otherwise scaling
FLOPs) by the gradient accumulation factor (use args.gradient_accumulation_steps
or args.grad_acc_steps as present in the training args) so the tracker uses
full-step work rather than a single micro-batch, and apply the same correction
to the related MFU/config calculations around this block.
```python
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
)
```
MFU math misses gradient accumulation scaling.
Line 244 times a full optimizer step, but Line 189 sets FLOPs batch size to one micro-batch.
With grad_acc_steps > 1, MFU is underestimated.
Suggested fix
mfu_tracker = MFUTracker(
config=from_hf_config(config.to_dict()),
- batch_size=args.dataset.micro_batch_size,
+ batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
seq_len=args.dataset.max_seq_length,
num_gpus=dist_config.world_size,
parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
)

Also applies to: 243-248
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py` around lines 187
- 193, The MFUTracker is using args.dataset.micro_batch_size as the FLOPs batch
size which ignores gradient accumulation and underestimates MFU; update the
mfu_tracker initialization (symbol: MFUTracker / variable: mfu_tracker) to use
the effective batch size = args.dataset.micro_batch_size * args.grad_acc_steps
(or otherwise multiply by grad_acc_steps) when setting the batch_size parameter,
and apply the same change wherever MFUTracker is constructed (also the other
block around the second occurrence).
```python
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
```
MFU is underreported when grad_acc_steps > 1.
Line 219 computes wall time for a full optimizer step, but Lines 167-170 initialize FLOPs with a single micro-batch.
That underestimates FLOPs/MFU by a factor of grad_acc_steps.
Suggested fix
mfu_tracker = MFUTracker(
config=from_hf_config(config.to_dict()),
- batch_size=args.dataset.micro_batch_size,
+ batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
seq_len=args.dataset.max_seq_length,
num_gpus=dist_config.world_size,
parallelism={"dp": dist_config.world_size},
)

Also applies to: 218-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py` around lines 165 -
170, MFU is undercounted because MFUTracker is initialized with a single
micro-batch; multiply the batch work by gradient accumulation steps so FLOPs/
MFU reflect a full optimizer step. Update the MFUTracker initialization
(mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.
```python
    "bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py",
],
# FLOPs / MFU module - synced across recipes
"bionemo-recipes/recipes/llama3_native_te/flops.py": [
```
do we have to copy these to each recipe? it's just to calculate flops? maybe we just put it in one and point folks at it.
This should maybe be in the models/ definition though? is this just printing out a total flops for a given model size?
flops.py has 2 usage paths:
- FLOPs counting for a given model config, plus comms bandwidth/overhead calculation, via the CLI
- use in the recipes' train scripts with config `log_mfu=true` to log MFU with each step

so it doesn't go purely with the model defs.
Addressed. flops.py is gone entirely. The FLOPs math is now a ~30-line helper (_compute_per_token_flops) inlined at the top of each recipe's existing perf_logger.py, so there's no new copied-files target. MFU metrics flow through the existing torchmetrics → WANDB path (train/mfu_pct, train/tflops_per_gpu).
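For reference, the general shape of such an inlined helper might look like the sketch below (illustrative only, not the recipe's actual `_compute_per_token_flops`; the parameter names, the MHA-only projection count, and the 3x fwd+bwd factor are assumptions):

```python
def per_token_train_flops(hidden_size, num_layers, intermediate_size, seq_len,
                          vocab_size=0, num_mlp_projections=2, fwd_bwd_factor=3):
    """Approximate training FLOPs per token for a dense transformer.

    Counts the big matmuls only: Q/K/V/O projections, attention
    scores/values (the S^2 cost expressed per token as ~4*S*H), the MLP,
    and an optional LM head. Backward pass is folded in via fwd_bwd_factor.
    """
    h = hidden_size
    attn_proj = 4 * 2 * h * h                      # Q, K, V, O (MHA; no GQA here)
    attn_scores = 2 * 2 * seq_len * h              # QK^T and attn @ V, per token
    mlp = num_mlp_projections * 2 * h * intermediate_size
    lm_head = 2 * h * vocab_size
    return fwd_bwd_factor * (num_layers * (attn_proj + attn_scores + mlp) + lm_head)
```

Multiplying this by the per-rank token count each optimizer step then yields per-step FLOPs without any extra DP/CP/grad-acc bookkeeping.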
trvachov
left a comment
Across all the added code, I think you shouldn't use batch_size in isolation as a variable name, or at least if you do make it explicit that in that function it doesn't matter whether it's micro or global batch. Otherwise, use micro_batch_size or global_batch_size, because it gets very confusing if the distributed flops math is right without this.
The comm overhead calculations here have a number of issues including assumptions about model precision and also about FSDP2 communication patterns. Adding comm overhead calculations is a reasonable idea but I think it should be a separate PR. Each PR should be one-idea -> one PR.
There is a flops.py CLI, but I ran torchrun --nproc_per_node=2 flops.py bandwidth as written in the docstring and it just hangs. This was on 8xA100 80GB PCIe, pytorch:26.02-py3 container. Does this work for you?
Regarding overall code structure --> indeed there is a lot of duplicated code, but that is also to some degree the philosophy of recipes. @pstjohn Can you help with some guidance here? Should flops.py really be in perf_logger.py to reduce duplicate files? Also I think the FLOPS metrics should likely be recorded in WANDB just like all of the other metrics we have.
```python
    16*H^2 for 2 MLP projections with I=4H.
    """
    b, s, h = batch_size, seq_len, hidden_size
    return (24 * b * s * h * h + 4 * b * s * s * h) * (3 * num_layers) + (6 * b * s * h * vocab_size)
```
Does this work with gradient accumulation?
Fixed. The old code took a static batch_size and multiplied once per optimizer step, which under-reported MFU by grad_acc_steps. The new code uses the num_unpadded_tokens counter already accumulated across all grad-acc micro-batches in PerfLogger.log_micro_step() (same counter that drives tokens_per_second_per_gpu), so per-step FLOPs scales with the full optimizer step's token volume regardless of grad_acc setting. For esm2 (no grad-acc in this recipe) the current batch's unpadded tokens are used directly.
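As a sanity check on that claim, here is a sketch of the counter-based scheme (class and method names are illustrative, not the actual `PerfLogger` internals):

```python
class TokenCountingMFU:
    """Accumulate per-rank token counts across grad-acc micro-batches,
    then convert to FLOPs/MFU once per optimizer step."""

    def __init__(self, per_token_flops, peak_tflops):
        self.per_token_flops = per_token_flops
        self.peak_tflops = peak_tflops
        self.num_unpadded_tokens = 0

    def log_micro_step(self, attention_mask_sum):
        # Called once per micro-batch, so any grad_acc_steps setting is covered.
        self.num_unpadded_tokens += attention_mask_sum

    def log_optimizer_step(self, step_time_s):
        flops_per_step = self.per_token_flops * self.num_unpadded_tokens
        tflops_per_gpu = flops_per_step / step_time_s / 1e12
        mfu_pct = 100 * tflops_per_gpu / self.peak_tflops
        self.num_unpadded_tokens = 0  # reset for the next optimizer step
        return {"tflops_per_gpu": tflops_per_gpu, "mfu_pct": mfu_pct}
```

Because the counter sums over all micro-batches between optimizer steps, per-step FLOPs automatically scales with grad_acc_steps.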
```python
    Returns:
        Dict with comm_bytes, estimated_comm_time, comm_pct.
    """
    bw = measured_bw_gbps or 6.0  # Default ~PCIe Gen3 x8
```
Why PCIe gen3? Isn't that >15 years old?
Also -- have you done the measurement on PCIe Gen3? I ask because, if this 6.0 is a "datasheet number", it might be very hard to measure especially through pytorch because of launch overhead. You might need a very specialized environment to measure max-PCIe speeds.
Removed along with the comm-overhead code. Will revisit in the follow-up comm-overhead PR with a more realistic default (or with a measurement-based approach).
A couple general thoughts:
@trvachov @pstjohn — ready for re-review. I took your feedback and substantially reworked the PR.

TL;DR

Every concern you raised is addressed, and the PR is now considerably smaller and simpler:

What changed vs. the previous revision

Deleted

Added / consolidated
The MFU formula

Simple, and the same in all 4 perf_loggers:

```
flops_per_step = _per_token_flops * num_unpadded_tokens_on_this_rank
tflops_per_gpu = flops_per_step / step_time / 1e12
mfu_pct        = tflops_per_gpu / peak_tflops * 100
```

Why this form works everywhere without DP/CP/grad-acc factoring:
So: no per-strategy factoring.

Training script change per recipe (this is the whole diff)

```diff
-perf_logger = PerfLogger(dist_config, args)
+perf_logger = PerfLogger(
+    dist_config,
+    args,
+    model_config_dict=config.to_dict() if args.get("log_mfu", False) else None,
+)
```

Plus deletion of the 10-line MFU-tracking block from each script.

Point-by-point

Individual replies are on each inline thread; consolidated here:
Verified

Usage (unchanged from the outside)

```
torchrun --nproc_per_node=2 train_fsdp2.py log_mfu=true
```

Appears in WANDB as `train/mfu_pct` and `train/tflops_per_gpu`.

Thanks for the thorough review — the PR is significantly better for it.
Reworks MFU tracking per reviewer feedback on #1548:

- Delete per-recipe flops.py, test_flops.py, and the CLI entirely
- Inline ~30-line FLOPs helper into each recipe's existing perf_logger.py
- MFU metrics (train/tflops_per_gpu, train/mfu_pct) flow through the existing torchmetrics -> WANDB path, respecting logging_frequency
- Drop comm-overhead estimation; will be a separate future PR

The new formula is per_token_flops(seq_len) * num_unpadded_tokens_on_rank. The unpadded-tokens counter (already used by tokens_per_second_per_gpu) is per-rank after DP/CP sharding and accumulated across grad-acc micro-batches, so the formula works uniformly across DDP/FSDP2/FSDP2+CP/DDP+CP/mFSDP and across BSHD and THD (sequence packing) with no per-strategy factors.

Net: -3000 / +300 lines. Training scripts lose all MFU scaffolding; the only change per script is one extra kwarg on the PerfLogger constructor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
/ok to test e6468fc
The prior implementation used `self.num_unpadded_tokens` (attention_mask.sum())
as the token count in the MFU formula, which produces an "effective-work MFU"
that under-reports vs published numbers when the batch has padding.
Switch to `self.num_tokens` (input_ids.numel()) to match the PaLM /
Megatron-LM / MosaicML llm-foundry / torchtitan convention of counting the
configured token shape the hardware actually runs matmuls over, regardless
of attention masking. For fully-packed THD (no inter-sequence padding) the
two counters are equivalent; for BSHD with padding, this restores
standard MFU semantics.
References for the convention:
- PaLM (Chowdhery et al. 2022, App. B): "tokens per second ... peak
throughput R = P / (6N + 12LHQT)" where T is the configured/padded
sequence length.
- Megatron-LM num_floating_point_operations() uses global batch *
args.seq_length (configured/padded).
- MosaicML llm-foundry and torchtitan both use configured seq_len.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
trvachov
left a comment
I had a prior comment to replace batch_size with micro_batch_size or global_batch_size to make it explicit, but it seems like this code doesn't use the concept of batch_size at all and somehow does FLOPs calculations using effective sequence length as the core input. Why do this? It would make sense to me if this simplified things, but I think it obfuscates it more. e.g. I think it still doesn't get the right FLOPs number for THD sequence packing and makes the FLOPs counting code rather tricky to understand.
For non-sequence-packed attention, the inputs are the max seq length per sequence and micro batch size * DDP ranks.
For sequence-packed attention, the inputs are the max seq length across all the sequences and the boundaries of the sequences * DDP ranks.
Then for both of the above, there are true tokens and padded tokens. I think making this explicit (e.g. MFU_padded and MFU_unpadded) in the token counting would make this very clear for the user and be a good recipe starting point.
As it stands, the first round of this PR had batch_size as a concept, which was a little confusing because it didn't distinguish between micro vs global batch size, and now we have total sequence length as a concept, but a) that is miscalculating the THD FLOPs and b) it's somewhat inconsistent with our prior variable definitions, e.g. in this README here https://github.com/NVIDIA/bionemo-framework/tree/trvachov/mixtral-recipes-repush/bionemo-recipes/recipes/llama3_native_te
What do you think?
| First-principles matmul count: Q/K/V/O projections (GQA-aware), attention | ||
| logits/values (the S^2 cost expressed per-token as 4*S*H), 2-or-3-projection | ||
| MLP (SwiGLU detected via model_type), and LM head. The returned value is | ||
| multiplied by the actual unpadded token count at log time, so it naturally |
This behavior is very strange for two reasons:
- This docstring documents behavior outside of this function, which is just a weird thing to do code-wise. Why should this function document code that lives outside of it?
- I don't think what is being said here is correct -- "returned value is multiplied by the actual unpadded token count"... I don't think this is actually happening, and even if it was, I'm not sure that's correct.
| self._per_token_flops = 0 | ||
| self._peak_tflops: float | None = None | ||
| if self._log_mfu: | ||
| self._per_token_flops = _compute_per_token_flops(model_config_dict, args.dataset.max_seq_length) |
A few issues here:
- self._per_token_flops isn't being multiplied by any sort of padded vs unpadded sequence factor, as is documented in the _compute_per_token_flops docstring.
- Even if it was doing the above, that would simply make a correction for BSHD padding. I think this would be totally incorrect for THD sequence packing, where the attention matrix size is not O(max_seq_length^2) but rather a sum of the individual cu_seqlens fragments squared.
Here's some docs that explain this https://docs.nvidia.com/nemo-framework/user-guide/25.09/sft_peft/packed_sequence.html <--- these docs are a little old but probably right -- worth checking with TE team. @jomitchellnv @pstjohn is my understanding correct that TE does a sum of little squares for THD sequence packing?
re: 2. Yes, it's true that this doesn't use the sum of the squares of each packed sequence, but that was an intentional choice, since the formula in the Llama3 native te recipe used for benchmarking doesn't either (it uses B*S^2). I considered doing that, but it would require tracking inserted into the training scripts to pass around cu_seq_lens (or its sum of squares). I assumed we'd prefer simplicity over modifying the training scripts further (from Peter's previous comments); should I add that in?
| head_dim = h // n_heads | ||
| kv_dim = n_kv * head_dim | ||
| ffn = model_config_dict["intermediate_size"] | ||
| vocab = model_config_dict.get("vocab_size", 0) |
should this be vocab_size, or padded_vocab_size?
I don't think this is a big correction, because usually the num_layers term is much bigger than this one, but it's nonetheless worth getting right.
Good catch — fixed in b979eed.
The LM-head matmul does run at padded_vocab_size (ESM-2: 33 → 64 for FP8/tensor-core-friendly shapes); logits get sliced back to vocab_size only after the compute happens (modeling_esm_te.py:585), so the FLOPs are real.
Scoped the correction to the hardware-view metric only:
- `mfu_pct` / `tflops_per_gpu` (useful work): keeps raw `vocab_size`.
- `mfu_padded_pct` / `tflops_per_gpu_padded` (hardware view): uses `padded_vocab_size` when present.
Applied uniformly across all four recipes' perf_logger.py via a use_padded_vocab=True kwarg on _compute_non_attn_per_token_flops. For llama3 / og2 / codonfm the configs don't set padded_vocab_size, so the fallback (or vocab_size) makes the two values identical and nothing changes in those recipes — only ESM-2 sees any delta, and it's small (≲0.01 pp of total FLOPs on 3B/15B, up to ~0.3 pp on the tiny sanity-check model) as you predicted.
Previous formula baked `4·S·H` attention per token at init-time using
`args.dataset.max_seq_length`, then multiplied by num_tokens at log time.
For a packed THD batch of total length S with docs L₁, L₂, …, Lₙ, this
yields `4·H·S²` attention per rank — overstating work by `S²/Σ(Lᵢ²)`
because Flash-Attention actually computes one Lᵢ×Lᵢ score matrix per
packed segment.
Split per-token FLOPs into non-attention (QKV/MLP/LM-head) and attention
coefficient, then at each logging window:
flops = non_attn_per_token · num_tokens + coeff · Σ(Lᵢ²) / cp_size
`Σ(Lᵢ²)` is accumulated across grad-acc micro-batches from
`cu_seq_lens_q` (real per-doc lengths) in log_micro_step, falling back
to `cu_seq_lens_q_padded` or `B·S²` from the batch shape for BSHD. The
algebraic identity `non_attn + coeff·S ≡ _compute_per_token_flops(cfg, S)`
is locked in by a unit test in every recipe, so BSHD and
synthetic-single-doc-THD benchmarks produce identical numbers to the
legacy formula.
Also fixes ESM-2's latent grad-acc undercount: its PerfLogger previously
had no log_micro_step and read num_tokens from only the last micro-batch
at log time, silently dividing reported FLOPs by grad_acc_steps if
grad-acc were ever enabled. Retrofitted with the llama3/og2/codonfm
accumulator pattern: all five ESM-2 training scripts (train_{fsdp2,
fsdp2_cp,ddp,ddp_cp,mfsdp}.py) now call log_micro_step inside the inner
loop and log_step(step, grad_norm, lr) after the optimizer step.
Applied identically to llama3_native_te, opengenome2_llama_native_te,
codonfm_native_te, esm2_native_te per the per-recipe isolation rule.
Six invariant unit tests land in each recipe's test_perf_logger.py
(algebraic identity, BSHD no-op, THD single-doc/multi-doc/CP division,
unpadded-preferred, padded-fallback) plus ESM-2-specific grad-acc
accumulator tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
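The two-term formula in this commit message can be sketched in a few lines (function names are hypothetical; the recipe's real helper lives in each perf_logger.py):

```python
# Illustrative reconstruction of the split FLOPs formula described above:
# non-attention work scales with the token count, attention work scales with
# the sum of squared per-document lengths taken from a cu_seqlens prefix array.

def sum_sq_lengths(cu_seq_lens):
    """Sigma(L_i^2) from a cu_seqlens-style prefix array, e.g. [0, 3, 8] -> 9 + 25."""
    return sum((b - a) ** 2 for a, b in zip(cu_seq_lens, cu_seq_lens[1:]))

def window_flops(non_attn_per_token, attn_coeff, num_tokens, cu_seq_lens, cp_size=1):
    """FLOPs for one logging window: non_attn * tokens + coeff * Sigma(L_i^2) / cp."""
    return non_attn_per_token * num_tokens + attn_coeff * sum_sq_lengths(cu_seq_lens) / cp_size
```

For a single packed doc of length S, Sigma(L_i^2) == S^2, matching the legacy per-token formula; for two docs of length S/2 the attention term halves, which is the THD correction the commit describes.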
Real-data benchmark runs (llama3 dclm + og2 metagenome at cp=8 dp=1
S=131072) surfaced a bug in _attn_work_from_batch's BSHD branch.
ContextParallelDataLoaderWrapper pre-splits the sequence along the CP
axis before the batch reaches log_micro_step, so batch["input_ids"].shape
per rank is (B, S/cp), not (B, S). The helper computed B*(S/cp)² from
the per-rank shape, and log_step divided by cp_size, yielding per-rank
attention FLOPs of B*S²/cp³ — wrong by cp².
Correct per-rank attention work is B*S²/cp (Q_rank @ K_full costs
(S/cp)*S per head summed).
Evidence on og2 BSHD cp=8 S=131072 with a full-length doc: reported
10.9% MFU / 246 TFLOPS/GPU, but the actual compute corresponds to
~60% MFU (matches the THD run with the same S on the same data).
Fix: pass cp_size into _attn_work_from_batch and multiply the BSHD
synthesis by cp_size². The helper now always returns a GLOBAL
(pre-shard) Σ(Lᵢ²), so the /cp_size in log_step consistently gives
per-rank attention regardless of whether the input is cu_seq_lens_q
(already global) or per-rank shape (needs the cp² correction).
THD path unaffected (cu_seq_lens_q was always global). BSHD without
CP unaffected (cp_size=1 makes the multiplier 1). Existing published
benchmarks in ci/benchmarks/{og2,esm2}_benchmarks.md used synthetic
cu_seq_lens=[0, S] and therefore flowed through the THD path — they
remain valid.
Added test_bshd_cp_correction to all four recipes asserting both
the BSHD+CP scaling and the THD invariance under cp_size variation.
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
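A minimal sketch of the fix (names hypothetical, assuming the commit's description): the helper always returns a GLOBAL pre-shard Sigma(L_i^2), so the BSHD synthesis from the per-rank shape needs the cp^2 factor before the caller's single division by cp_size.

```python
# Sketch of _attn_work_from_batch's two paths after the fix described above.

def attn_work_global(batch_shape=None, cu_seq_lens=None, cp_size=1):
    if cu_seq_lens is not None:
        # THD path: cu_seq_lens is already a global (pre-shard) quantity.
        return sum((b - a) ** 2 for a, b in zip(cu_seq_lens, cu_seq_lens[1:]))
    # BSHD path: the CP dataloader pre-splits, so the per-rank shape is (B, S/cp).
    b, s_per_rank = batch_shape
    return b * (s_per_rank * cp_size) ** 2  # reconstructs B * S^2 (global)

# Caller then divides by cp_size, giving per-rank B*S^2/cp.
# Without the cp^2 multiplier it would be B*(S/cp)^2/cp = B*S^2/cp^3, wrong by cp^2.
```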
The memory reporting used torch.cuda.memory_allocated() (current) at log time, so both the "max" and "mean" metrics reflected the post-step resting footprint. Under FSDP that massively undercounts the transient peak: params get all-gathered briefly during forward/backward and activations balloon before backward frees them, but by the time log_step runs they're already released.

Switch to torch.cuda.max_memory_allocated() (tracks the high-water mark of live allocations) paired with torch.cuda.reset_peak_memory_stats() so each logging window reports its own peak rather than a running max since process start. Both are host-side counter ops: no sync, no kernel launch.

The "mean" metric keeps using the post-step current value as a "resting footprint" indicator (useful to see how much memory is held between steps), while "max" now reflects what the hardware actually peaks at. These are the values that belong in the benchmark tables.

Applied identically to all four MFU-tracking recipes.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
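Why sampling the current allocation undercounts can be shown with a toy high-water-mark counter (no CUDA needed; this merely mirrors the semantics of torch.cuda.max_memory_allocated() plus reset_peak_memory_stats()):

```python
# Toy illustration of peak vs current memory tracking per logging window.

class PeakCounter:
    def __init__(self):
        self.current = 0  # analogous to torch.cuda.memory_allocated()
        self.peak = 0     # analogous to torch.cuda.max_memory_allocated()

    def alloc(self, n):
        self.current += n
        self.peak = max(self.peak, self.current)

    def free(self, n):
        self.current -= n

    def reset_peak(self):  # analogous to torch.cuda.reset_peak_memory_stats()
        self.peak = self.current

mem = PeakCounter()
mem.alloc(10)   # resident params
mem.alloc(90)   # transient: all-gathered shards + activations
mem.free(90)    # released before log_step runs
# At log time: current == 10 (resting footprint), peak == 100 (the real high-water mark).
```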
Current reporting mixes conventions: non-attention flops count every slot
(padded, PaLM convention), but attention flops come from cu_seq_lens_q
which drops both CP-zigzag padding and BSHD row padding. For BSHD with
dynamic-padded short docs this made the attention term tiny while
non-attention counted all the pad positions — an inconsistent hybrid that
is neither pure useful work nor pure hardware view.
Split into two parallel metrics:
train/mfu_pct / tflops_per_gpu
Useful-work rate. Counts only real tokens:
non_attn uses num_unpadded_tokens (attention_mask.sum())
attn uses Sigma(L_i^2) from cu_seq_lens_q (THD), or per-row
attention_mask.sum(dim=-1) (BSHD). Excludes CP zigzag
divisibility pad and BSHD row pad.
train/mfu_padded_pct / tflops_per_gpu_padded
Hardware view. Counts all slots the GPU actually burns cycles on:
non_attn uses input_ids.numel()
attn uses Sigma(L_i^2) from cu_seq_lens_q_padded (THD) or
full batch shape (BSHD). Includes both padding types.
For synthetic single-doc-per-pack THD runs (where Sigma L_i^2 == S^2 and
there is no padding of either kind), unpadded == padded and the two
metrics report identical values; published benchmarks in ci/benchmarks
remain internally consistent either way.
Added _attn_work_from_batch(include_padding=bool) kwarg. The function
still returns a pre-CP-shard global quantity; the caller divides by
cp_size in log_step. BSHD per-row real lengths come from attention_mask;
note in docstring that for BSHD+CP this is exact under evenly distributed
padding (common case) and may slightly underestimate when padding is
concentrated on one end.
Added two unit tests per recipe (test_include_padding_thd,
test_include_padding_bshd_with_attention_mask) pinning the split.
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
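The unpadded/padded split above can be sketched as one helper with an include_padding switch (a hypothetical reconstruction, not the recipe's exact code; lists stand in for tensors):

```python
# Sketch of _attn_work_from_batch with the include_padding kwarg described above.

def attn_work(cu_seq_lens=None, cu_seq_lens_padded=None,
              attention_mask=None, include_padding=False):
    def sq(cu):
        return sum((b - a) ** 2 for a, b in zip(cu, cu[1:]))
    if include_padding:
        if cu_seq_lens_padded is not None:
            return sq(cu_seq_lens_padded)          # THD hardware view
        return sum(len(row) ** 2 for row in attention_mask)   # full BSHD shape
    if cu_seq_lens is not None:
        return sq(cu_seq_lens)                     # THD useful work
    return sum(sum(row) ** 2 for row in attention_mask)       # BSHD per-row real lengths

mask = [[1, 1, 1, 0], [1, 1, 0, 0]]  # B=2, S=4, real lengths 3 and 2
assert attn_work(attention_mask=mask) == 9 + 4              # useful work
assert attn_work(attention_mask=mask, include_padding=True) == 16 + 16  # hardware view
```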
The single-term per-token FLOP function was kept around after the
two-term split (non_attn + coeff·Σ(Lᵢ²)) was introduced, solely so
old unit tests and the startup log line could still reference it.
Nothing in the runtime path was using it — log_step had already
switched to the split formula. Remove the function, the
self._per_token_flops attribute, and the test scaffolding that
existed only to pin the split against the legacy reference:
- _compute_per_token_flops(cfg, seq_len) deleted from all 4 recipes
- self._per_token_flops init + assignment + log-line mention removed
- logger.info startup format no longer includes "per-token FLOPs=%e"
- TestComputePerTokenFlops test class deleted (llama3 only)
- test_algebraic_identity deleted (its whole purpose was to pin
non_attn + coeff·S against the legacy function)
- test_bshd_no_op simplified to test_bshd_shape_synthesis — keeps
the Σ(Lᵢ²)=B·S² shape-synthesis check, drops the legacy comparison
- Docstrings no longer reference the old function
The tests that exercise actual formula correctness
(test_thd_multi_doc_uses_squared_sum, test_cp_size_divides_attention_only,
test_bshd_cp_correction, test_include_padding_*, etc.) all stay — they
verify behavior directly without going through the legacy.
Net: -336 / +40 lines. Pure dead-code removal, no runtime change.
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Reflect the modern two-pair metric layout (unpadded useful-work vs padded hardware view) and the peak-memory reporting fix. Applied identically to all four MFU-tracking recipes. Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
For configs with padded_vocab_size set (ESM-2: 33→64 for FP8/tensor-core friendliness), the LM-head matmul physically runs at padded width and the logits are sliced back afterward. Count the padded width in the hardware-view metric (mfu_padded_pct, tflops_per_gpu_padded) while continuing to count raw vocab_size in the useful-work metric (mfu_pct, tflops_per_gpu). For configs without padded_vocab_size (llama3, og2, codonfm) the two values collapse and nothing changes.

Addresses review feedback from @trvachov on PR #1548.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
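The size of the correction is easy to eyeball with a per-token LM-head count (illustrative numbers; the hidden size below is an assumption, not taken from the ESM-2 config):

```python
# Per-token forward FLOPs of the LM-head matmul: one (H x V) projection,
# counted as 2*H*V (multiply + add).

def lm_head_flops_per_token(hidden, vocab):
    return 2 * hidden * vocab

h = 2560                                    # ESM-2-3B-ish hidden size (assumed)
useful = lm_head_flops_per_token(h, 33)     # raw ESM-2 vocab_size
hardware = lm_head_flops_per_token(h, 64)   # padded_vocab_size actually run
# hardware ~2x useful for this term, but both are dwarfed by the
# num_layers * transformer-block terms, hence the <=0.3 pp delta quoted above.
```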
Description
Adds an architecture-independent FLOPs counting and MFU tracking module (flops.py) that can be used by any native_te recipe. The module auto-detects model architecture (GQA vs MHA, SwiGLU vs standard FFN) from HuggingFace config dicts and supports analytical, simplified, and Hyena (Evo2) FLOPs formulas.
MFU tracking is integrated into all 11 training scripts across 4 recipes, gated behind a `log_mfu: false` Hydra flag.
Some MFU results: https://linear.app/nvidia/issue/BIO-406/mfu-calculation-experiment#comment-50296297
Usage
During training:
CLI tool:
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
An `/ok to test` comment on the pull request will trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from code rabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
New Features
`log_mfu=true` parameter.
Documentation
`flops.py` CLI commands for performance profiling and GPU analysis.
Tests