Support for fast_inference in Nemotron-3-nano-30b-a3b-bf16 #565

@droidraja

Description

Is Nemotron-3-nano-30b-a3b-bf16 currently supported for fast_inference with vLLM?

While running the following:

import torch
from unsloth import FastLanguageModel

MODEL_PATH = '/nemotron-3-nano-30b-a3b-bf16/transformers/default/1'
MAX_MODEL_LEN = 8192  # matches max_model_len in the log below
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_PATH,
    max_seq_length = MAX_MODEL_LEN, # Choose any for long context!
    load_in_4bit = False,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    trust_remote_code = True,
    unsloth_force_compile = False,
    attn_implementation = "eager",
    torch_dtype=torch.bfloat16,
    dtype=None,
    gpu_memory_utilization=0.85, # Competition mandated vLLM memory fence 
    fast_inference=True,
)

I get the following error:

Unsloth: WARNING `trust_remote_code` is True.
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2026.3.15: Fast Nemotron_H patching. Transformers: 4.57.6. vLLM: 0.18.0.
   \\   /|    NVIDIA RTX PRO 6000 Blackwell Server Edition. Num GPUs = 1. Max memory: 94.971 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.35. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
INFO 03-28 15:05:27 [vllm_utils.py:724] Unsloth: Patching vLLM v1 graph capture
Unsloth: Standby mode is enabled. Changing `gpu_memory_utilization` to 0.87875.
Unsloth: vLLM loading /kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1 with actual GPU utilization = 87.2%
Unsloth: Your GPU has CUDA compute capability 12.0 with VRAM = 94.97 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8192. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 78.23 GB. Also swap space = 6 GB.
Unsloth: Not an error, but `level` is not supported in vLLM.config.CompilationConfig. Skipping.
Unsloth: Not an error, but `use_cudagraph` is not supported in vLLM.config.CompilationConfig. Skipping.
Unsloth: Not an error, but `use_inductor` is not supported in vLLM.config.CompilationConfig. Skipping.
Unsloth: Not an error, but `swap_space` is not supported in vLLM. Skipping.
Unsloth: Not an error, but `device` is not supported in vLLM. Skipping.
INFO 03-28 15:05:27 [utils.py:233] non-default args: {'dtype': torch.bfloat16, 'max_model_len': 8192, 'enable_prefix_caching': True, 'gpu_memory_utilization': 0.87198938978831, 'max_num_batched_tokens': 8192, 'max_num_seqs': 128, 'max_logprobs': 0, 'disable_log_stats': True, 'enable_lora': True, 'max_lora_rank': 64, 'enable_chunked_prefill': True, 'compilation_config': {'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_endpoints': None, 'inductor_compile_config': {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False, 'debug': False, 'dce': True, 'memory_planning': True, 'coordinate_descent_tuning': False, 'trace.graph_diagram': False, 'compile_threads': 32, 'group_fusion': True, 'disable_progress': False, 'verbose_progress': True, 'triton.multi_kernel': 0, 'triton.use_block_ptr': True, 'triton.enable_persistent_tma_matmul': True, 'triton.autotune_at_compile_time': False, 'triton.cooperative_reductions': False, 'cuda.compile_opt_level': '-O2', 'cuda.enable_cuda_lto': True, 'combo_kernels': False, 'benchmark_combo_kernel': True, 'combo_kernel_foreach_dynamic_shapes': True, 'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}, 'enable_sleep_mode': True, 'model': 
'/kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1'}
WARNING 03-28 15:05:27 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND
WARNING 03-28 15:05:27 [arg_utils.py:1352] The global random seed is set to 0. Since VLLM_ENABLE_V1_MULTIPROCESSING is set to False, this may affect the random state of the Python process that launched vLLM.
INFO 03-28 15:05:38 [model.py:533] Resolved architecture: NemotronHForCausalLM
INFO 03-28 15:05:38 [model.py:1582] Using max model len 8192
INFO 03-28 15:05:38 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-28 15:05:38 [config.py:427] Updating mamba_ssm_cache_dtype to 'float32' for NemotronH model
WARNING 03-28 15:05:38 [config.py:372] Mamba cache mode is set to 'all' for NemotronHForCausalLM by default when prefix caching is enabled
INFO 03-28 15:05:38 [config.py:392] Warning: Prefix caching in Mamba cache 'all' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
INFO 03-28 15:05:38 [config.py:212] Setting attention block size to 2176 tokens to ensure that attention page size is >= mamba page size.
INFO 03-28 15:05:38 [config.py:243] Padding mamba page size by 4.41% to ensure that mamba page size and attention page size are exactly equal.
INFO 03-28 15:05:38 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 03-28 15:05:39 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1', speculative_config=None, tokenizer='/kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 
'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False, 'debug': False, 'dce': True, 'memory_planning': True, 'coordinate_descent_tuning': False, 'trace.graph_diagram': False, 'compile_threads': 32, 'group_fusion': True, 'disable_progress': False, 'verbose_progress': True, 'triton.multi_kernel': 0, 'triton.use_block_ptr': True, 'triton.enable_persistent_tma_matmul': True, 'triton.autotune_at_compile_time': False, 'triton.cooperative_reductions': False, 'cuda.compile_opt_level': '-O2', 'cuda.enable_cuda_lto': True, 'combo_kernels': False, 'benchmark_combo_kernel': True, 'combo_kernel_foreach_dynamic_shapes': True, 'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 256, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
INFO 03-28 15:05:39 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.19.2.2:34405 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 03-28 15:05:39 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
[W328 15:05:39.040622649 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
INFO 03-28 15:05:39 [topk_topp_sampler.py:51] Using FlashInfer for top-p & top-k sampling.
INFO 03-28 15:05:39 [gpu_model_runner.py:4481] Starting to load model /kaggle/input/models/metric/nemotron-3-nano-30b-a3b-bf16/transformers/default/1...
INFO 03-28 15:05:40 [unquantized.py:186] Using TRITON backend for Unquantized MoE
INFO 03-28 15:05:40 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
INFO 03-28 15:05:40 [flash_attn.py:598] Using FlashAttention version 2
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:06<00:00,  2.02it/s]
INFO 03-28 15:05:47 [default_loader.py:384] Loading weights took 6.87 seconds
INFO 03-28 15:05:47 [utils.py:98] MoE model detected. Using fused MoE LoRA implementation.
INFO 03-28 15:05:47 [punica_selector.py:20] Using PunicaWrapperGPU.
INFO 03-28 15:05:48 [gpu_model_runner.py:4566] Model loading took 62.32 GiB memory and 7.255007 seconds
INFO 03-28 15:05:54 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/8b6c0954a7/rank_0_0/backbone for vLLM's torch.compile
INFO 03-28 15:05:54 [backends.py:1048] Dynamo bytecode transform time: 4.56 s
Unsloth: Compiling kernels: 0it [00:00, ?it/s]
INFO 03-28 15:05:56 [backends.py:371] Cache the graph of compile range (1, 8192) for later use

Unsloth: Compiling kernels: 100%|██████████| 4/4 [00:00<00:00,  8.91it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_3]                   
Unsloth: Compiling kernels: 100%|██████████| 3/3 [00:00<00:00, 15.35it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2]           
Unsloth: Compiling kernels: 100%|██████████| 3/3 [00:00<00:00, 18.56it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2]
Unsloth: Compiling kernels: 100%|██████████| 5/5 [00:00<00:00, 29.82it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_4]           
INFO 03-28 15:06:00 [backends.py:387] Compiling a graph for compile range (1, 8192) takes 5.85 s

INFO 03-28 15:06:01 [decorators.py:627] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/5a9026e3ef2e124fe23401875ad6806d543b2382a392303ea829a5b96464d9fc/rank_0_0/model
INFO 03-28 15:06:01 [monitor.py:48] torch.compile took 11.32 s in total
WARNING 03-28 15:06:01 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=1856,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition.json
INFO 03-28 15:06:04 [monitor.py:76] Initial profiling/warmup run took 3.71 s
WARNING 03-28 15:06:48 [kv_cache_utils.py:1056] Add 1 padding layers, may waste at most 4.35% KV cache memory
INFO 03-28 15:06:48 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=256
INFO 03-28 15:06:48 [gpu_model_runner.py:5607] Profiling CUDA graph memory: PIECEWISE=70 (largest=256), FULL=38 (largest=128)
WARNING 03-28 15:06:49 [utils.py:268] Using default LoRA kernel configs
INFO 03-28 15:07:34 [gpu_model_runner.py:5686] Estimated CUDA graph memory: 0.52 GiB total
INFO 03-28 15:07:35 [gpu_worker.py:456] Available KV cache memory: 14.42 GiB
INFO 03-28 15:07:35 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.8720 to 0.8774 to maintain the same effective KV cache size.
WARNING 03-28 15:07:35 [kv_cache_utils.py:1056] Add 1 padding layers, may waste at most 4.35% KV cache memory
INFO 03-28 15:07:35 [kv_cache_utils.py:1316] GPU KV cache size: 502,656 tokens
INFO 03-28 15:07:35 [kv_cache_utils.py:1321] Maximum concurrency for 8,192 tokens per request: 57.90x
2026-03-28 15:07:35,286 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-03-28 15:07:35,419 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
INFO 03-28 15:07:35 [vllm_utils.py:729] Unsloth: Running patched vLLM v1 `capture_model`.
INFO 03-28 15:07:35 [vllm_utils.py:729] Unsloth: Running patched vLLM v1 `capture_model`.
INFO 03-28 15:07:35 [vllm_utils.py:729] Unsloth: Running patched vLLM v1 `capture_model`.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 70/70 [00:12<00:00,  5.49it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 38/38 [00:03<00:00, 11.91it/s]
INFO 03-28 15:07:51 [gpu_model_runner.py:5746] Graph capturing finished in 16 secs, took -0.47 GiB
INFO 03-28 15:07:51 [vllm_utils.py:736] Unsloth: Patched vLLM v1 graph capture finished in 16 secs.
INFO 03-28 15:07:51 [vllm_utils.py:736] Unsloth: Patched vLLM v1 graph capture finished in 16 secs.
INFO 03-28 15:07:51 [vllm_utils.py:736] Unsloth: Patched vLLM v1 graph capture finished in 16 secs.

INFO 03-28 15:07:52 [gpu_worker.py:617] CUDA graph pool memory: -0.47 GiB (actual), 0.52 GiB (estimated), difference: 0.98 GiB (105696460800.0%).
INFO 03-28 15:07:53 [core.py:281] init engine (profile, create kv cache, warmup model) took 124.33 seconds
INFO 03-28 15:07:54 [llm.py:391] Supported tasks: ('generate',)
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
/tmp/ipykernel_161/2337999723.py in <cell line: 0>()
      8 os.environ['CUDA_VISIBLE_DEVICES'] = '0'
      9 
---> 10 model, tokenizer = FastLanguageModel.from_pretrained(
     11     model_name = MODEL_PATH,
     12     max_seq_length = MAX_MODEL_LEN, # Choose any for long context!

/usr/local/lib/python3.12/dist-packages/unsloth/models/loader.py in from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning, token, device_map, rope_scaling, fix_tokenizer, trust_remote_code, use_gradient_checkpointing, resize_model_vocab, revision, use_exact_model_name, offload_embedding, float32_mixed_precision, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, qat_scheme, load_in_fp8, unsloth_tiled_mlp, *args, **kwargs)
    650         #     dispatch_model = FastGraniteModel
    651         else:
--> 652             return FastModel.from_pretrained(
    653                 model_name = old_model_name,
    654                 max_seq_length = max_seq_length,

/usr/local/lib/python3.12/dist-packages/unsloth/models/loader.py in from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning, token, device_map, rope_scaling, fix_tokenizer, trust_remote_code, use_gradient_checkpointing, resize_model_vocab, revision, return_logits, fullgraph, use_exact_model_name, auto_model, whisper_language, whisper_task, unsloth_force_compile, offload_embedding, float32_mixed_precision, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, qat_scheme, load_in_fp8, unsloth_tiled_mlp, target_parameters, *args, **kwargs)
   1431             load_in_8bit_kwargs = False
   1432 
-> 1433         model, tokenizer = FastBaseModel.from_pretrained(
   1434             model_name = model_name,
   1435             max_seq_length = max_seq_length,

/usr/local/lib/python3.12/dist-packages/unsloth/models/vision.py in from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning, token, device_map, trust_remote_code, model_types, tokenizer_name, auto_model, use_gradient_checkpointing, supports_sdpa, whisper_language, whisper_task, auto_config, offload_embedding, float32_mixed_precision, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, unsloth_vllm_standby, load_in_fp8, **kwargs)
    893 
    894             # Convert to HF format
--> 895             _, quant_state_dict = get_vllm_state_dict(
    896                 llm,
    897                 config = model_config,

/usr/local/lib/python3.12/dist-packages/unsloth_zoo/vllm_utils.py in get_vllm_state_dict(llm, return_state_dict, config, is_vision_model, load_in_fp8)
    867         ctx_manager = torch.inference_mode()
    868     with ctx_manager:
--> 869         return _get_vllm_state_dict(llm, return_state_dict, config, is_vision_model)
    870 
    871 

/usr/local/lib/python3.12/dist-packages/unsloth_zoo/vllm_utils.py in _get_vllm_state_dict(llm, return_state_dict, config, is_vision_model)
   1121             get_state_dict(f"{prefix}.v_proj", 2, state_dict, kv_proj)
   1122 
-> 1123         get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
   1124 
   1125         proj = layer.mlp.gate_up_proj

UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value
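For context, the traceback points at a classic conditional-binding bug: in `_get_vllm_state_dict`, `prefix` is presumably only assigned inside a branch that assumes every layer has a standard attention block, so a hybrid layer (e.g. a Mamba or MoE layer in NemotronH) that skips that branch leaves `prefix` unbound by the time `get_state_dict(f"{prefix}.o_proj", ...)` runs. A minimal, hypothetical sketch of that failure pattern (not Unsloth's actual code; `collect_layer_weights` and `MambaLayer` are illustrative names):

```python
def collect_layer_weights(layer):
    """Sketch of the suspected bug: `prefix` is bound only when the
    layer looks like a standard attention layer."""
    state_dict = {}
    if hasattr(layer, "self_attn"):          # only attention layers bind `prefix`
        prefix = "model.layers.0.self_attn"  # illustrative name
        state_dict[f"{prefix}.q_proj"] = "..."
    # A Mamba/MoE layer never enters the branch above, so this line
    # raises: UnboundLocalError: cannot access local variable 'prefix'
    state_dict[f"{prefix}.o_proj"] = "..."
    return state_dict


class MambaLayer:
    """Stand-in for a NemotronH Mamba block: it has no `self_attn`."""
    pass


try:
    collect_layer_weights(MambaLayer())
except UnboundLocalError as e:
    print(e)  # cannot access local variable 'prefix' where it is not associated with a value
```

If that is the cause, the state-dict extraction would need to handle NemotronH's non-attention layers explicitly rather than assuming a uniform attention/MLP layer layout.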
