
Commit a047ec3

savitha-eng and claude committed
Add post-launch verification and validation as downstream signal
- Mandatory post-launch self-check: agent verifies GPU count (48), grad_acc_steps (8), effective batch size, and resume step. If wrong, agent kills and restarts immediately.
- Re-enable validation at 1000-step intervals as a downstream quality signal (FP8 paper notes training loss can diverge without hurting downstream tasks). Validation is informational only — does not trigger rollbacks. Failures are caught by try/except in training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent: cabe055

1 file changed (26 additions & 1 deletion):

bionemo-recipes/recipes/opengenome2_llama_native_te/OG2_FP8_AGENT_GUIDE.md
@@ -177,7 +177,9 @@ torchrun \
   checkpoint.resume_from_checkpoint=true \ # ← FIXED (always true; auto-finds latest checkpoint)
   checkpoint.max_checkpoints=4 \
   checkpoint.save_final_model=true \
-  validation.enabled=false \ # ← FIXED (disabled for agent runs)
+  validation.enabled=true \ # ← FIXED
+  validation.eval_interval=1000 \ # ← FIXED (every 1000 steps)
+  validation.num_batches=40 \ # ← FIXED
   hydra.run.dir=$WORKSPACE_ROOT/<run_name>/hydra_outputs \
   wandb.project=$WANDB_PROJECT \ # ← FIXED
+  wandb.group=<run_name> \ # ← FIXED (computed once at session start, never changes)
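The `validation.eval_interval=1000` / `validation.num_batches=40` overrides above gate a periodic, crash-proof validation pass. A minimal sketch of that pattern (the function name, the `model(batch)` loss convention, and the defaults are assumptions mirroring the config, not the recipe's actual code):

```python
import math


# Sketch only: eval_interval-gated validation wrapped in try/except so a
# validation failure never stops training. All names here are hypothetical.
def maybe_validate(step, model, val_batches, eval_interval=1000, num_batches=40):
    """Run validation every eval_interval steps; a failure never stops training."""
    if step == 0 or step % eval_interval != 0:
        return None
    try:
        losses = []
        for i, batch in enumerate(val_batches):
            if i >= num_batches:
                break
            losses.append(model(batch))  # assumed: returns per-batch val loss
        val_loss = sum(losses) / len(losses)
        return {"val/loss": val_loss, "val/ppl": math.exp(val_loss)}
    except Exception as exc:  # e.g., a data loading issue
        print(f"Validation failed, training continues: {exc}")
        return None
```

A caller would invoke this once per optimizer step; on non-validation steps it returns immediately, and on failure it logs and returns `None` rather than raising.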
@@ -265,6 +267,29 @@ torchrun \
 - Each relaunch (after demotion/recovery) uses the next number: `1.sh`, `2.sh`, `3.sh`, etc.
 - Track the launch counter in `state.json` so you can resume correctly after a crash.
 
+### Post-Launch Verification (MANDATORY)
+
+After the **first** torchrun launch in this session, verify the training setup is correct before proceeding. Check the stdout/WandB output for:
+
+1. **GPU count**: Must show `GPU count: {NPROC_PER_NODE * NNODES}` (e.g., 48 for 6 nodes × 8 GPUs). If it shows only `NPROC_PER_NODE` (e.g., 8), multi-node is broken — kill immediately and debug.
+2. **grad_acc_steps**: Must be `$GRAD_ACC_STEPS` (e.g., 8). If it shows any other value, kill immediately and fix.
+3. **Effective batch size**: Should be `$MICRO_BATCH_SIZE × $NPROC_PER_NODE × $NNODES × $GRAD_ACC_STEPS` (e.g., 1 × 8 × 6 × 8 = 384).
+4. **Resume step**: For warm-start, must show `Starting training loop from step <LKG_STEP + 1>`.
+
+If ANY of these are wrong, kill training immediately, diagnose the issue, fix it, and relaunch. Do NOT let incorrect training continue — it wastes GPU time and produces unusable results.
+
+### Validation as Downstream Signal
+
+Validation is enabled in the training command (`validation.enabled=true`). The training script automatically runs validation every 1000 steps and logs `val/loss` and `val/ppl` to WandB. This provides a downstream-like signal: the FP8 paper (Nemotron-3 Super) notes that training loss can diverge slightly under low-precision without hurting downstream task quality.
+
+**How the agent uses validation metrics:**
+
+- At each check-in, also read `val/loss` and `val/ppl` from `wandb-history.jsonl` (if a validation step has occurred since the last check-in).
+- Log validation metrics to `history.json` alongside training metrics.
+- Validation metrics are **informational only** — they do NOT trigger rollbacks. Only training perplexity triggers rollbacks.
+- In `report.md`, include a comparison of validation perplexity between this FP8 run and the BF16 baseline. This helps determine if FP8 precision loss affects downstream quality.
+- If validation fails with an error (e.g., data loading issue), the training script already handles this with try/except — training continues uninterrupted. The agent should log the failure but NOT take any action.
+
 ### Layer Precision Control
 
 Precision is controlled via a single list:
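The arithmetic behind the new post-launch self-check can be captured in a small pure function. This is a sketch (the helper name and return shape are mine, not the recipe's) that reproduces the values the agent must match against stdout/WandB:

```python
# Hypothetical helper reproducing the post-launch verification arithmetic
# from the diff above; not part of the actual training script.
def expected_setup(nproc_per_node, nnodes, micro_batch_size, grad_acc_steps):
    """Values the agent should see after the first torchrun launch."""
    gpu_count = nproc_per_node * nnodes  # e.g., 8 GPUs/node × 6 nodes = 48
    effective_batch_size = micro_batch_size * gpu_count * grad_acc_steps
    return {"gpu_count": gpu_count, "effective_batch_size": effective_batch_size}

# The guide's example: 6 nodes × 8 GPUs, micro-batch 1, grad_acc 8
setup = expected_setup(nproc_per_node=8, nnodes=6, micro_batch_size=1, grad_acc_steps=8)
print(setup)  # {'gpu_count': 48, 'effective_batch_size': 384}
```

If the launched run reports anything other than these computed values, the guide's rule applies: kill, diagnose, relaunch.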

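The check-in read of validation metrics could be sketched as below, assuming `wandb-history.jsonl` holds one JSON object per line keyed by WandB's `_step` (the exact on-disk schema is an assumption; verify against a real run's file):

```python
import json
from pathlib import Path


def latest_val_metrics(history_path, since_step=-1):
    """Return the newest validation row logged after since_step, else None."""
    latest = None
    for line in Path(history_path).read_text().splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        # Only rows from validation steps carry val/loss.
        if "val/loss" in row and row.get("_step", 0) > since_step:
            latest = {
                "step": row.get("_step"),
                "val/loss": row["val/loss"],
                "val/ppl": row.get("val/ppl"),
            }
    return latest
```

Per the guide, the agent would call this at each check-in with the last seen step, append any result to `history.json`, and take no rollback action regardless of the values.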