Add post-launch verification and validation as downstream signal
- Mandatory post-launch self-check: agent verifies GPU count (48),
grad_acc_steps (8), effective batch size, and resume step. If wrong,
agent kills and restarts immediately.
- Re-enable validation at 1000-step intervals as a downstream quality
signal (FP8 paper notes training loss can diverge without hurting
downstream tasks). Validation is informational only — does not
trigger rollbacks. Failures are caught by try/except in training.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+wandb.group=<run_name>\ # ← FIXED (computed once at session start, never changes)
@@ -265,6 +267,29 @@ torchrun \
- Each relaunch (after demotion/recovery) uses the next number: `1.sh`, `2.sh`, `3.sh`, etc.
- Track the launch counter in `state.json` so you can resume correctly after a crash.
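The launch-counter bookkeeping above can be sketched in a few lines. This is a minimal sketch, not the actual implementation: the `launch_counter` key name inside `state.json` is illustrative, and the real schema may differ.

```python
import json
import os

def next_launch_script(state_path="state.json"):
    """Increment the launch counter persisted in state.json and return the
    next launch script name ('1.sh', '2.sh', ...). Creating the file on the
    first call makes the counter survive crashes between relaunches."""
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    counter = state.get("launch_counter", 0) + 1  # hypothetical key name
    state["launch_counter"] = counter
    with open(state_path, "w") as f:
        json.dump(state, f)
    return f"{counter}.sh"
```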
### Post-Launch Verification (MANDATORY)
After the **first** torchrun launch in this session, verify the training setup is correct before proceeding. Check the stdout/WandB output for:
1. **GPU count**: Must show `GPU count: {NPROC_PER_NODE * NNODES}` (e.g., 48 for 6 nodes × 8 GPUs). If it shows only `NPROC_PER_NODE` (e.g., 8), multi-node is broken — kill immediately and debug.
2. **grad_acc_steps**: Must be `$GRAD_ACC_STEPS` (e.g., 8). If it shows any other value, kill immediately and fix.
3. **Effective batch size**: Must match the expected value (GPU count × grad_acc_steps × per-GPU micro-batch size).
4. **Resume step**: For warm-start, must show `Starting training loop from step <LKG_STEP + 1>`.
If ANY of these are wrong, kill training immediately, diagnose the issue, fix it, and relaunch. Do NOT let incorrect training continue — it wastes GPU time and produces unusable results.
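The checklist above can be automated as a small post-launch verifier. A minimal sketch, assuming the log lines named in the checklist (`GPU count: N`, `grad_acc_steps: N`, `Starting training loop from step N`) appear verbatim in stdout — the parsing patterns would need adjusting to the actual log format:

```python
import re

def verify_launch(log_text, expected_gpus, expected_gas, resume_step=None):
    """Return a list of problems found in the training stdout.
    An empty list means the launch passed all checks; any entry means
    the run should be killed, fixed, and relaunched."""
    problems = []

    m = re.search(r"GPU count:\s*(\d+)", log_text)
    if not m or int(m.group(1)) != expected_gpus:
        found = m.group(1) if m else "missing"
        problems.append(f"GPU count is {found}, expected {expected_gpus}")

    m = re.search(r"grad_acc_steps\D*(\d+)", log_text)
    if not m or int(m.group(1)) != expected_gas:
        found = m.group(1) if m else "missing"
        problems.append(f"grad_acc_steps is {found}, expected {expected_gas}")

    # Warm-start resume check: only run when an LKG step is supplied.
    if resume_step is not None:
        m = re.search(r"Starting training loop from step (\d+)", log_text)
        if not m or int(m.group(1)) != resume_step + 1:
            problems.append(f"resume step mismatch, expected {resume_step + 1}")

    return problems
```

The verifier only reports; the kill/relaunch decision stays with the agent so a transient parsing failure never terminates a healthy run.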
### Validation as Downstream Signal
Validation is enabled in the training command (`validation.enabled=true`). The training script automatically runs validation every 1000 steps and logs `val/loss` and `val/ppl` to WandB. This provides a downstream-like signal: the FP8 paper (Nemotron-3 Super) notes that training loss can diverge slightly under low-precision without hurting downstream task quality.
**How the agent uses validation metrics:**
- At each check-in, also read `val/loss` and `val/ppl` from `wandb-history.jsonl` (if a validation step has occurred since the last check-in).
- Log validation metrics to `history.json` alongside training metrics.
- Validation metrics are **informational only** — they do NOT trigger rollbacks. Only training perplexity triggers rollbacks.
- In `report.md`, include a comparison of validation perplexity between this FP8 run and the BF16 baseline. This helps determine if FP8 precision loss affects downstream quality.
- If validation fails with an error (e.g., data loading issue), the training script already handles this with try/except — training continues uninterrupted. The agent should log the failure but NOT take any action.
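The check-in read described above can be sketched as a scan over `wandb-history.jsonl`. This assumes the file holds one JSON object per line with `_step`, `val/loss`, and `val/ppl` keys as named in the bullets — a sketch, not the agent's actual code:

```python
import json

def latest_val_metrics(history_path):
    """Return (step, val_loss, val_ppl) from the newest history row that
    contains validation keys, or None if no validation step has been
    logged yet. Rows without validation keys (training-only steps) are
    skipped, matching the informational-only role of these metrics."""
    latest = None
    with open(history_path) as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            if "val/loss" in row and "val/ppl" in row:
                latest = (row.get("_step"), row["val/loss"], row["val/ppl"])
    return latest
```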