Commit 1de83f9
Fix OOM regression in _apply() for quantized models during inference (#13372)
Skip unnecessary clone of inference-mode tensors when already inside
torch.inference_mode(), matching the existing guard in set_attr_param.
The unconditional clone introduced in 20561aa caused transient VRAM
doubling during model movement for FP8/quantized models.

1 parent 8f37471
1 file changed: 1 addition, 1 deletion

[Diff content not captured: context lines 1151–1157 shown, with the single change at line 1154 (one line removed, one line added). A hedged sketch of the guard follows.]
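Since the patched line itself was not captured, here is a minimal sketch of the guard pattern the commit message describes. The helper name `_move_param` and its surrounding structure are assumptions for illustration; only the inference-mode check mirrors the description (and the existing guard in set_attr_param it is said to match):

```python
import torch

def _move_param(tensor: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper sketching the fix, not the actual patched code.
    # Inside torch.inference_mode(), an inference tensor can be returned
    # as-is: cloning it would transiently hold two full copies of the
    # weight in VRAM, the doubling the commit message describes for
    # FP8/quantized models.
    if torch.is_inference_mode_enabled() and tensor.is_inference():
        return tensor
    # Otherwise keep the clone that 20561aa introduced.
    return tensor.clone()
```

Under this guard, moving a model's weights inside torch.inference_mode() peaks at one copy of each tensor instead of two.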