Draft
Conversation
Comment on lines
+129
to
+132
| .AddUniformVariables({{num_workgroups}, // sequence_length = total workgroups | ||
| {num_workgroups}, // present_sequence_length = same (flat) | ||
| {0u}, // start_token = 0 | ||
| {1u}}); // n_reps = 1 |
Contributor
There was a problem hiding this comment.
Suggested change
| .AddUniformVariables({{num_workgroups}, // sequence_length = total workgroups | |
| {num_workgroups}, // present_sequence_length = same (flat) | |
| {0u}, // start_token = 0 | |
| {1u}}); // n_reps = 1 | |
| .AddUniformVariables({{num_workgroups}, // sequence_length = total workgroups | |
| {num_workgroups}, // present_sequence_length = same (flat) | |
| {0u}, // start_token = 0 | |
| {1u}}); // n_reps = 1 |
| bool enable_pix_capture{false}; // PIX capture is disabled by default | ||
| bool enable_int64{false}; // int64 ops are not enabled by default | ||
| uint32_t multi_rotary_cache_concat_offset{0}; // offset for concatenated multi rotary cache (0 = disabled) | ||
| bool turbo_quant{false}; // enable TurboQuant KV cache compression |
Contributor
There was a problem hiding this comment.
Suggested change
| bool turbo_quant{false}; // enable TurboQuant KV cache compression | |
| bool turbo_quant{false}; // enable TurboQuant KV cache compression |
|
|
||
| static const float TQ_CENTROIDS[16] = { | ||
| -0.2377f, -0.1809f, -0.1419f, -0.1104f, -0.0829f, -0.0578f, -0.0342f, -0.0113f, | ||
| 0.0113f, 0.0342f, 0.0578f, 0.0829f, 0.1104f, 0.1419f, 0.1809f, 0.2377f}; |
Contributor
There was a problem hiding this comment.
Suggested change
| 0.0113f, 0.0342f, 0.0578f, 0.0829f, 0.1104f, 0.1419f, 0.1809f, 0.2377f}; | |
| 0.0113f, 0.0342f, 0.0578f, 0.0829f, 0.1104f, 0.1419f, 0.1809f, 0.2377f}; |
|
|
||
| static const float TQ_BOUNDARIES[15] = { | ||
| -0.2093f, -0.1614f, -0.1261f, -0.0966f, -0.0704f, -0.0460f, -0.0227f, | ||
| 0.0000f, 0.0227f, 0.0460f, 0.0704f, 0.0966f, 0.1261f, 0.1614f, 0.2093f}; |
Contributor
There was a problem hiding this comment.
Suggested change
| 0.0000f, 0.0227f, 0.0460f, 0.0704f, 0.0966f, 0.1261f, 0.1614f, 0.2093f}; | |
| 0.0000f, 0.0227f, 0.0460f, 0.0704f, 0.0966f, 0.1261f, 0.1614f, 0.2093f}; |
Comment on lines
+431
to
+432
| static_cast<int64_t>(cfg.max_cache), | ||
| cache_dim}; |
Contributor
There was a problem hiding this comment.
Suggested change
| static_cast<int64_t>(cfg.max_cache), | |
| cache_dim}; | |
| static_cast<int64_t>(cfg.max_cache), | |
| cache_dim}; |
Comment on lines
+45
to
+53
| query = helper.make_tensor_value_info( | ||
| "query", TensorProto.FLOAT16, [batch_size, "seq_len", hidden_size] | ||
| ) | ||
| key = helper.make_tensor_value_info( | ||
| "key", TensorProto.FLOAT16, [batch_size, "seq_len", kv_hidden_size] | ||
| ) | ||
| value = helper.make_tensor_value_info( | ||
| "value", TensorProto.FLOAT16, [batch_size, "seq_len", kv_hidden_size] | ||
| ) |
Contributor
There was a problem hiding this comment.
Suggested change
| query = helper.make_tensor_value_info( | |
| "query", TensorProto.FLOAT16, [batch_size, "seq_len", hidden_size] | |
| ) | |
| key = helper.make_tensor_value_info( | |
| "key", TensorProto.FLOAT16, [batch_size, "seq_len", kv_hidden_size] | |
| ) | |
| value = helper.make_tensor_value_info( | |
| "value", TensorProto.FLOAT16, [batch_size, "seq_len", kv_hidden_size] | |
| ) | |
| query = helper.make_tensor_value_info("query", TensorProto.FLOAT16, [batch_size, "seq_len", hidden_size]) | |
| key = helper.make_tensor_value_info("key", TensorProto.FLOAT16, [batch_size, "seq_len", kv_hidden_size]) | |
| value = helper.make_tensor_value_info("value", TensorProto.FLOAT16, [batch_size, "seq_len", kv_hidden_size]) |
Comment on lines
+64
to
+69
| seqlens_k = helper.make_tensor_value_info( | ||
| "seqlens_k", TensorProto.INT32, [batch_size] | ||
| ) | ||
| total_sequence_length = helper.make_tensor_value_info( | ||
| "total_sequence_length", TensorProto.INT32, [1] | ||
| ) |
Contributor
There was a problem hiding this comment.
Suggested change
| seqlens_k = helper.make_tensor_value_info( | |
| "seqlens_k", TensorProto.INT32, [batch_size] | |
| ) | |
| total_sequence_length = helper.make_tensor_value_info( | |
| "total_sequence_length", TensorProto.INT32, [1] | |
| ) | |
| seqlens_k = helper.make_tensor_value_info("seqlens_k", TensorProto.INT32, [batch_size]) | |
| total_sequence_length = helper.make_tensor_value_info("total_sequence_length", TensorProto.INT32, [1]) |
Comment on lines
+72
to
+74
| output = helper.make_tensor_value_info( | ||
| "output", TensorProto.FLOAT16, [batch_size, "seq_len", hidden_size] | ||
| ) |
Contributor
There was a problem hiding this comment.
Suggested change
| output = helper.make_tensor_value_info( | |
| "output", TensorProto.FLOAT16, [batch_size, "seq_len", hidden_size] | |
| ) | |
| output = helper.make_tensor_value_info("output", TensorProto.FLOAT16, [batch_size, "seq_len", hidden_size]) |
Comment on lines
+90
to
+95
| "query", # 0 | ||
| "key", # 1 | ||
| "value", # 2 | ||
| "past_key", # 3 | ||
| "past_value", # 4 | ||
| "seqlens_k", # 5 |
Contributor
There was a problem hiding this comment.
Suggested change
| "query", # 0 | |
| "key", # 1 | |
| "value", # 2 | |
| "past_key", # 3 | |
| "past_value", # 4 | |
| "seqlens_k", # 5 | |
| "query", # 0 | |
| "key", # 1 | |
| "value", # 2 | |
| "past_key", # 3 | |
| "past_value", # 4 | |
| "seqlens_k", # 5 |
Comment on lines
+99
to
+101
| "output", # 0 | ||
| "present_key", # 1 | ||
| "present_value", # 2 |
Contributor
There was a problem hiding this comment.
Suggested change
| "output", # 0 | |
| "present_key", # 1 | |
| "present_value", # 2 | |
| "output", # 0 | |
| "present_key", # 1 | |
| "present_value", # 2 |
| @@ -0,0 +1,199 @@ | |||
| // Copyright (c) Microsoft Corporation. All rights reserved. | |||
| @@ -0,0 +1,1032 @@ | |||
| // Copyright (c) Microsoft Corporation. All rights reserved. | |||
| @@ -0,0 +1,148 @@ | |||
| # Copyright (c) Microsoft Corporation. All rights reserved. | |||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
WIP TurboQuant implementation drafted with Claude; it uses a Hadamard matrix for the rotation instead of a general rotation matrix, which deviates from the paper in this respect.
Early numbers
Without Turbo Quant
C:\onnxruntime-genai\examples\c>C:\onnxruntime-genai\examples\c\build\RelWithDebInfo\model_benchmark.exe -i "C:\models\phi4-onnx" -l 1024
Batch size: 1, prompt tokens: 1024, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 572028
avg (tokens/s): 1790.12
p50 (us): 571558
stddev (us): 1542.95
n: 5 * 1024 token(s)
Token generation:
avg (us): 10015.9
avg (tokens/s): 99.8415
p50 (us): 9654.5
stddev (us): 3461.49
n: 635 * 1 token(s)
Token sampling:
avg (us): 35.9
avg (tokens/s): 27855.2
p50 (us): 38
stddev (us): 6.61135
n: 5 * 1 token(s)
E2E generation (entire generation loop):
avg (ms): 1844.11
p50 (ms): 1845.6
stddev (ms): 4.43472
n: 5
Peak working set size (bytes): 2098737152
With Turbo Quant
C:\onnxruntime-genai\examples\c>C:\onnxruntime-genai\examples\c\build\RelWithDebInfo\model_benchmark.exe -i "C:\models\phi4-onnx" -l 1024
Batch size: 1, prompt tokens: 1024, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 589068
avg (tokens/s): 1738.34
p50 (us): 588342
stddev (us): 2264.58
n: 5 * 1024 token(s)
Token generation:
avg (us): 10817.2
avg (tokens/s): 92.4455
p50 (us): 10443
stddev (us): 3579.67
n: 635 * 1 token(s)
Token sampling:
avg (us): 37.1
avg (tokens/s): 26954.2
p50 (us): 38.8
stddev (us): 5.39954
n: 5 * 1 token(s)
E2E generation (entire generation loop):
avg (ms): 1962.92
p50 (ms): 1960.86
stddev (ms): 3.43245
n: 5
Peak working set size (bytes): 1856163840
Saves about 200 MB of working-set memory for a 1K-token prompt, but slows down both prompt processing (time to first token) and per-token generation.
The implementation passes needle-in-a-haystack and RULER-style retrieval tests, but shows clear quality degradation on simple prompts such as "Hi" or "write me a poem": the "Hi" response contained a spurious tool call, and the poem was noticeably more repetitive.
Next step: root-cause why model quality degrades.
Motivation and Context