Support for fast_inference in Nemotron-3-nano-30b-a3b-bf16 #565

@droidraja

Description

Is Nemotron-3-nano-30b-a3b-bf16 currently supported for fast_inference with vLLM?

While running the following:

import torch
from unsloth import FastLanguageModel

MODEL_PATH = '/nemotron-3-nano-30b-a3b-bf16/transformers/default/1'
MAX_MODEL_LEN = 8192  # matches max_model_len in the log below
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_PATH,
    max_seq_length = MAX_MODEL_LEN, # Choose any for long context!
    load_in_4bit = False,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    trust_remote_code = True,
    unsloth_force_compile = False,
    attn_implementation = "eager",
    torch_dtype=torch.bfloat16,
    dtype=None,
    gpu_memory_utilization=0.85, # Competition mandated vLLM memory fence 
    fast_inference=True,
)

I get the following error:

Unsloth: WARNING `trust_remote_code` is True.
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2026.3.15: Fast Nemotron_H patching. Transformers: 4.57.6. vLLM: 0.18.0.
   \\   /|    NVIDIA RTX PRO 6000 Blackwell Server Edition. Num GPUs = 1. Max memory: 94.971 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.35. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
INFO 03-28 15:05:27 [vllm_utils.py:724] Unsloth: Patching vLLM v1 graph capture
Unsloth: Standby mode is enabled. Changing `gpu_memory_utilization` to 0.87875.
Unsloth: vLLM loading /kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1 with actual GPU utilization = 87.2%
Unsloth: Your GPU has CUDA compute capability 12.0 with VRAM = 94.97 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8192. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 78.23 GB. Also swap space = 6 GB.
Unsloth: Not an error, but `level` is not supported in vLLM.config.CompilationConfig. Skipping.
Unsloth: Not an error, but `use_cudagraph` is not supported in vLLM.config.CompilationConfig. Skipping.
Unsloth: Not an error, but `use_inductor` is not supported in vLLM.config.CompilationConfig. Skipping.
Unsloth: Not an error, but `swap_space` is not supported in vLLM. Skipping.
Unsloth: Not an error, but `device` is not supported in vLLM. Skipping.
INFO 03-28 15:05:27 [utils.py:233] non-default args: {'dtype': torch.bfloat16, 'max_model_len': 8192, 'enable_prefix_caching': True, 'gpu_memory_utilization': 0.87198938978831, 'max_num_batched_tokens': 8192, 'max_num_seqs': 128, 'max_logprobs': 0, 'disable_log_stats': True, 'enable_lora': True, 'max_lora_rank': 64, 'enable_chunked_prefill': True, 'compilation_config': {'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_endpoints': None, 'inductor_compile_config': {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False, 'debug': False, 'dce': True, 'memory_planning': True, 'coordinate_descent_tuning': False, 'trace.graph_diagram': False, 'compile_threads': 32, 'group_fusion': True, 'disable_progress': False, 'verbose_progress': True, 'triton.multi_kernel': 0, 'triton.use_block_ptr': True, 'triton.enable_persistent_tma_matmul': True, 'triton.autotune_at_compile_time': False, 'triton.cooperative_reductions': False, 'cuda.compile_opt_level': '-O2', 'cuda.enable_cuda_lto': True, 'combo_kernels': False, 'benchmark_combo_kernel': True, 'combo_kernel_foreach_dynamic_shapes': True, 'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}, 'enable_sleep_mode': True, 'model': 
'/kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1'}
WARNING 03-28 15:05:27 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND
WARNING 03-28 15:05:27 [arg_utils.py:1352] The global random seed is set to 0. Since VLLM_ENABLE_V1_MULTIPROCESSING is set to False, this may affect the random state of the Python process that launched vLLM.
INFO 03-28 15:05:38 [model.py:533] Resolved architecture: NemotronHForCausalLM
INFO 03-28 15:05:38 [model.py:1582] Using max model len 8192
INFO 03-28 15:05:38 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-28 15:05:38 [config.py:427] Updating mamba_ssm_cache_dtype to 'float32' for NemotronH model
WARNING 03-28 15:05:38 [config.py:372] Mamba cache mode is set to 'all' for NemotronHForCausalLM by default when prefix caching is enabled
INFO 03-28 15:05:38 [config.py:392] Warning: Prefix caching in Mamba cache 'all' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
INFO 03-28 15:05:38 [config.py:212] Setting attention block size to 2176 tokens to ensure that attention page size is >= mamba page size.
INFO 03-28 15:05:38 [config.py:243] Padding mamba page size by 4.41% to ensure that mamba page size and attention page size are exactly equal.
INFO 03-28 15:05:38 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 03-28 15:05:39 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1', speculative_config=None, tokenizer='/kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 
'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False, 'debug': False, 'dce': True, 'memory_planning': True, 'coordinate_descent_tuning': False, 'trace.graph_diagram': False, 'compile_threads': 32, 'group_fusion': True, 'disable_progress': False, 'verbose_progress': True, 'triton.multi_kernel': 0, 'triton.use_block_ptr': True, 'triton.enable_persistent_tma_matmul': True, 'triton.autotune_at_compile_time': False, 'triton.cooperative_reductions': False, 'cuda.compile_opt_level': '-O2', 'cuda.enable_cuda_lto': True, 'combo_kernels': False, 'benchmark_combo_kernel': True, 'combo_kernel_foreach_dynamic_shapes': True, 'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 256, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
INFO 03-28 15:05:39 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.19.2.2:34405 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 03-28 15:05:39 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
[W328 15:05:39.040622649 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
INFO 03-28 15:05:39 [topk_topp_sampler.py:51] Using FlashInfer for top-p & top-k sampling.
INFO 03-28 15:05:39 [gpu_model_runner.py:4481] Starting to load model /kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1...
INFO 03-28 15:05:40 [unquantized.py:186] Using TRITON backend for Unquantized MoE
INFO 03-28 15:05:40 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
INFO 03-28 15:05:40 [flash_attn.py:598] Using FlashAttention version 2
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:06<00:00,  2.02it/s]
INFO 03-28 15:05:47 [default_loader.py:384] Loading weights took 6.87 seconds
INFO 03-28 15:05:47 [utils.py:98] MoE model detected. Using fused MoE LoRA implementation.
INFO 03-28 15:05:47 [punica_selector.py:20] Using PunicaWrapperGPU.
INFO 03-28 15:05:48 [gpu_model_runner.py:4566] Model loading took 62.32 GiB memory and 7.255007 seconds
INFO 03-28 15:05:54 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/8b6c0954a7/rank_0_0/backbone for vLLM's torch.compile
INFO 03-28 15:05:54 [backends.py:1048] Dynamo bytecode transform time: 4.56 s
Unsloth: Compiling kernels: 0it [00:00, ?it/s]
INFO 03-28 15:05:56 [backends.py:371] Cache the graph of compile range (1, 8192) for later use

Unsloth: Compiling kernels: 100%|██████████| 4/4 [00:00<00:00,  8.91it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_3]                   
Unsloth: Compiling kernels: 100%|██████████| 3/3 [00:00<00:00, 15.35it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2]           
Unsloth: Compiling kernels: 100%|██████████| 3/3 [00:00<00:00, 18.56it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2]
Unsloth: Compiling kernels: 100%|██████████| 5/5 [00:00<00:00, 29.82it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_4]           
INFO 03-28 15:06:00 [backends.py:387] Compiling a graph for compile range (1, 8192) takes 5.85 s

INFO 03-28 15:06:01 [decorators.py:627] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/5a9026e3ef2e124fe23401875ad6806d543b2382a392303ea829a5b96464d9fc/rank_0_0/model
INFO 03-28 15:06:01 [monitor.py:48] torch.compile took 11.32 s in total
WARNING 03-28 15:06:01 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=1856,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition.json
INFO 03-28 15:06:04 [monitor.py:76] Initial profiling/warmup run took 3.71 s
WARNING 03-28 15:06:48 [kv_cache_utils.py:1056] Add 1 padding layers, may waste at most 4.35% KV cache memory
INFO 03-28 15:06:48 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=256
INFO 03-28 15:06:48 [gpu_model_runner.py:5607] Profiling CUDA graph memory: PIECEWISE=70 (largest=256), FULL=38 (largest=128)
WARNING 03-28 15:06:49 [utils.py:268] Using default LoRA kernel configs
INFO 03-28 15:07:34 [gpu_model_runner.py:5686] Estimated CUDA graph memory: 0.52 GiB total
INFO 03-28 15:07:35 [gpu_worker.py:456] Available KV cache memory: 14.42 GiB
INFO 03-28 15:07:35 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.8720 to 0.8774 to maintain the same effective KV cache size.
WARNING 03-28 15:07:35 [kv_cache_utils.py:1056] Add 1 padding layers, may waste at most 4.35% KV cache memory
INFO 03-28 15:07:35 [kv_cache_utils.py:1316] GPU KV cache size: 502,656 tokens
INFO 03-28 15:07:35 [kv_cache_utils.py:1321] Maximum concurrency for 8,192 tokens per request: 57.90x
2026-03-28 15:07:35,286 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-03-28 15:07:35,419 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
INFO 03-28 15:07:35 [vllm_utils.py:729] Unsloth: Running patched vLLM v1 `capture_model`.
INFO 03-28 15:07:35 [vllm_utils.py:729] Unsloth: Running patched vLLM v1 `capture_model`.
INFO 03-28 15:07:35 [vllm_utils.py:729] Unsloth: Running patched vLLM v1 `capture_model`.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 70/70 [00:12<00:00,  5.49it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 38/38 [00:03<00:00, 11.91it/s]
INFO 03-28 15:07:51 [gpu_model_runner.py:5746] Graph capturing finished in 16 secs, took -0.47 GiB
INFO 03-28 15:07:51 [vllm_utils.py:736] Unsloth: Patched vLLM v1 graph capture finished in 16 secs.
INFO 03-28 15:07:51 [vllm_utils.py:736] Unsloth: Patched vLLM v1 graph capture finished in 16 secs.
INFO 03-28 15:07:51 [vllm_utils.py:736] Unsloth: Patched vLLM v1 graph capture finished in 16 secs.

INFO 03-28 15:07:52 [gpu_worker.py:617] CUDA graph pool memory: -0.47 GiB (actual), 0.52 GiB (estimated), difference: 0.98 GiB (105696460800.0%).
INFO 03-28 15:07:53 [core.py:281] init engine (profile, create kv cache, warmup model) took 124.33 seconds
INFO 03-28 15:07:54 [llm.py:391] Supported tasks: ('generate',)
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
/tmp/ipykernel_161/2337999723.py in <cell line: 0>()
      8 os.environ['CUDA_VISIBLE_DEVICES'] = '0'
      9 
---> 10 model, tokenizer = FastLanguageModel.from_pretrained(
     11     model_name = MODEL_PATH,
     12     max_seq_length = MAX_MODEL_LEN, # Choose any for long context!

/usr/local/lib/python3.12/dist-packages/unsloth/models/loader.py in from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning, token, device_map, rope_scaling, fix_tokenizer, trust_remote_code, use_gradient_checkpointing, resize_model_vocab, revision, use_exact_model_name, offload_embedding, float32_mixed_precision, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, qat_scheme, load_in_fp8, unsloth_tiled_mlp, *args, **kwargs)
    650         #     dispatch_model = FastGraniteModel
    651         else:
--> 652             return FastModel.from_pretrained(
    653                 model_name = old_model_name,
    654                 max_seq_length = max_seq_length,

/usr/local/lib/python3.12/dist-packages/unsloth/models/loader.py in from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning, token, device_map, rope_scaling, fix_tokenizer, trust_remote_code, use_gradient_checkpointing, resize_model_vocab, revision, return_logits, fullgraph, use_exact_model_name, auto_model, whisper_language, whisper_task, unsloth_force_compile, offload_embedding, float32_mixed_precision, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, qat_scheme, load_in_fp8, unsloth_tiled_mlp, target_parameters, *args, **kwargs)
   1431             load_in_8bit_kwargs = False
   1432 
-> 1433         model, tokenizer = FastBaseModel.from_pretrained(
   1434             model_name = model_name,
   1435             max_seq_length = max_seq_length,

/usr/local/lib/python3.12/dist-packages/unsloth/models/vision.py in from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning, token, device_map, trust_remote_code, model_types, tokenizer_name, auto_model, use_gradient_checkpointing, supports_sdpa, whisper_language, whisper_task, auto_config, offload_embedding, float32_mixed_precision, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, unsloth_vllm_standby, load_in_fp8, **kwargs)
    893 
    894             # Convert to HF format
--> 895             _, quant_state_dict = get_vllm_state_dict(
    896                 llm,
    897                 config = model_config,

/usr/local/lib/python3.12/dist-packages/unsloth_zoo/vllm_utils.py in get_vllm_state_dict(llm, return_state_dict, config, is_vision_model, load_in_fp8)
    867         ctx_manager = torch.inference_mode()
    868     with ctx_manager:
--> 869         return _get_vllm_state_dict(llm, return_state_dict, config, is_vision_model)
    870 
    871 

/usr/local/lib/python3.12/dist-packages/unsloth_zoo/vllm_utils.py in _get_vllm_state_dict(llm, return_state_dict, config, is_vision_model)
   1121             get_state_dict(f"{prefix}.v_proj", 2, state_dict, kv_proj)
   1122 
-> 1123         get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
   1124 
   1125         proj = layer.mlp.gate_up_proj

UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value
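For context, the traceback points at a classic conditional-binding bug: in `_get_vllm_state_dict`, `prefix` is presumably only assigned inside a branch that assumes every layer has a standard attention block, so a hybrid layer (e.g. a Mamba or MoE layer in NemotronH) that skips that branch leaves `prefix` unbound by the time `get_state_dict(f"{prefix}.o_proj", ...)` runs. A minimal, hypothetical sketch of that failure pattern (not Unsloth's actual code; `collect_layer_weights` and `MambaLayer` are illustrative names):

```python
def collect_layer_weights(layer):
    """Sketch of the suspected bug: `prefix` is bound only when the
    layer looks like a standard attention layer."""
    state_dict = {}
    if hasattr(layer, "self_attn"):          # only attention layers bind `prefix`
        prefix = "model.layers.0.self_attn"  # illustrative name
        state_dict[f"{prefix}.q_proj"] = "..."
    # A Mamba/MoE layer never enters the branch above, so this line
    # raises: UnboundLocalError: cannot access local variable 'prefix'
    state_dict[f"{prefix}.o_proj"] = "..."
    return state_dict


class MambaLayer:
    """Stand-in for a NemotronH Mamba block: it has no `self_attn`."""
    pass


try:
    collect_layer_weights(MambaLayer())
except UnboundLocalError as e:
    print(e)  # cannot access local variable 'prefix' where it is not associated with a value
```

If that is the cause, the state-dict extraction would need to handle NemotronH's non-attention layers explicitly rather than assuming a uniform attention/MLP layer layout.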
