feat(gpu): optimize BLS12-446 field arithmetic for MSM performance #3448
bbarbakadze wants to merge 1 commit into main from
Conversation
This PR should change this line to 32 by default.
@bbarbakadze Something is wrong with benchmarks: https://github.com/zama-ai/tfhe-rs/actions/runs/24038290191
Force-pushed 8cfdd1b to a5fd85a
pdroalves left a comment:
I added a batch of high-level comments. Let me know when you are done with this PR so I can do a careful review line by line.
// largest field alignment (4 bytes in 32-bit limb mode, 8 bytes in 64-bit).
// Forcing alignas(8) ensures sizeof(G1Affine)==120 in both modes, matching
// the Rust FFI bindings which are always generated from the 64-bit layout.
struct alignas(8) G1Affine {
Can you replace this magic number with a function based on LIMB_BITS_CONFIG?
So this is actually not dependent on LIMB_BITS_CONFIG; it depends on the layout Rust is using. Once 64-bit limbs are used, we need the same alignment. Still, the magic number is now replaced with sizeof.
// across all 14 limbs.
// Operand map: %0..%13 = c[0..13], %14 = carry_out,
// %15..%28 = a[0..13], %29..%42 = b[0..13].
uint32_t carry_out;
@guillermo-oyarzun do you want to double-check this PTX? It seems ok to me.
Yup, the PTX looks good!
// Operand map: %0..%13 = c[0..13], %14 = borrow_out,
// %15..%28 = a[0..13], %29..%42 = b[0..13].
uint32_t borrow_out;
asm("sub.cc.u32 %0, %15, %29;\n\t" // c[0] = a[0] - b[0], set BF
Same here, it looks good too!
#endif // LIMB_BITS_CONFIG == 64
#endif // __CUDA_ARCH__

// 32-bit dual MAD-chain Montgomery multiplication (device path)
Do you have a reference for this MAD-chain multiplication? If so, a link as a comment would help.
fp_qad_row_32(&wtemp[2 * i], &wide[2 * i + 2], &a32[i + 1], a32[i], n - i);
}

asm("mul.lo.u32 %0, %2, %3; mul.hi.u32 %1, %2, %3;"
I don't like PTX in the middle of a function like this one. Maybe you could move it to a macro and add comments explaining what it is.
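One way to follow this suggestion is to hide the inline PTX behind a named, documented macro with a host fallback; the macro name below is illustrative, not part of the PR:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: wrap the 32x32 -> 64-bit widening multiply in a macro
// so the function body stays readable. On device it emits the same PTX pair
// (mul.lo.u32 / mul.hi.u32); on host it falls back to plain 64-bit math.
#if defined(__CUDA_ARCH__)
#define MUL_WIDE_U32(lo, hi, x, y)                                             \
  asm("mul.lo.u32 %0, %2, %3; mul.hi.u32 %1, %2, %3;"                          \
      : "=r"(lo), "=r"(hi)                                                     \
      : "r"(x), "r"(y))
#else
#define MUL_WIDE_U32(lo, hi, x, y)                                             \
  do {                                                                         \
    uint64_t w_ = static_cast<uint64_t>(x) * (y);                              \
    (lo) = static_cast<uint32_t>(w_);                                          \
    (hi) = static_cast<uint32_t>(w_ >> 32);                                    \
  } while (0)
#endif
```

A host-side fallback also makes the macro unit-testable without a GPU.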
p4 = DEVICE_MODULUS.limb[4], p5 = DEVICE_MODULUS.limb[5],
p6 = DEVICE_MODULUS.limb[6];
uint64_t r0, r1, r2, r3, r4, r5, r6, mask64;
asm("sub.cc.u64 %0, %8, %15;\n\t"
This diff is full of PTX. We need to read it carefully and, if possible, move it out of the function bodies.
Force-pushed a5fd85a to 38e4101
#if defined(__CUDA_ARCH__) && LIMB_BITS_CONFIG == 64
// Device path: fully unrolled PTX with hardware carry flags
fp_mont_mul_cios_ptx(c, a, b);
#ifdef __CUDA_ARCH__
I understand that we now have 2 versions for 32- and 64-bit limbs; can we add a panic in the correct place in case someone attempts to use it with 128-bit?

@guillermo-oyarzun you mean if someone tries to set a value other than 32 or 64 for LIMB_BITS_CONFIG?

In that case maybe use an enum with two values, 32BIT and 64BIT?

Yup, an enum should work; I'm just trying to be extra safe because the code shouldn't work with 128-bit, right? We would need to emulate them somehow.

For now limbs can only be 32 or 64. I will rewrite it with an enum; that should be better than a panic.

By the way, there is already a protection implemented for this inside fp.h at line 55:
static_assert(LIMB_BITS == 32 || LIMB_BITS == 64, "LIMB_BITS_CONFIG must be 32 or 64");
So I guess it is fine to leave it as it is.
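A minimal sketch of the enum idea discussed above, with illustrative names (not the PR's actual identifiers): a scoped enum can only hold the two supported widths, and a static_assert mirrors the existing guard in fp.h:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical guard: LimbWidth can only name the two supported widths, so an
// unsupported configuration such as 128-bit limbs cannot be expressed at all.
enum class LimbWidth : uint32_t { Bits32 = 32, Bits64 = 64 };

constexpr LimbWidth kLimbWidth = LimbWidth::Bits32;

// Belt-and-braces check, mirroring the static_assert already in fp.h line 55.
static_assert(kLimbWidth == LimbWidth::Bits32 || kLimbWidth == LimbWidth::Bits64,
              "limb width must be 32 or 64");

constexpr uint32_t limb_bits() { return static_cast<uint32_t>(kLimbWidth); }
```

The compile-time failure mode is arguably friendlier than a runtime panic, which matches the conclusion of the thread.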
- Replace 64-bit CIOS Montgomery multiplication with 32-bit MAD chains
(mad.lo.cc/madc.hi.cc), exploiting native 2x throughput of 32-bit ops
on NVIDIA GPUs via even/odd accumulator separation
- Add fp_mont_sqr using a triangular MAD chain (upper triangle computed
once and doubled, diagonal added separately), saving ~40% of the
multiplications versus treating squaring as a general multiplication
- Add fp_add_lazy/fp_sub_lazy (and Fp2 variants): skip the final
conditional subtraction when the result feeds fp_mont_mul, which
accepts inputs in [0, 2p). Wired into fp2_mont_mul, fp2_mont_square,
and G1/G2 projective_point_double
- Replace all fp_mont_mul(c, a, a) squaring patterns with fp_mont_sqr
across curve.cu and fp2.cu (point addition, doubling, inversion)
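The lazy add/sub idea in the commit message can be illustrated with a toy single-limb sketch (the real fields use 7x64 or 14x32 limbs, and the modulus below is illustrative, not the BLS12-446 prime):

```cpp
#include <cassert>
#include <cstdint>

// fp_add_lazy skips the final conditional subtraction, so its result lies in
// [0, 2p) instead of [0, p). That is safe whenever the value only feeds a
// Montgomery multiplication that accepts inputs in [0, 2p), as described above.
constexpr uint64_t P = 0x1FFFFFFFFFFFFFFFull; // toy modulus

uint64_t fp_add(uint64_t a, uint64_t b) {      // strict: result in [0, p)
  uint64_t c = a + b;
  return c >= P ? c - P : c;
}

uint64_t fp_add_lazy(uint64_t a, uint64_t b) { // lazy: result in [0, 2p)
  return a + b;                                // conditional subtraction skipped
}
```

Dropping the compare-and-subtract removes a data-dependent branch per addition, which is what makes wiring this into fp2_mont_mul and point doubling profitable.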
Force-pushed 38e4101 to e716051
#define LIMB_BITS_CONFIG 32
#endif

#if LIMB_BITS_CONFIG == 64
We cannot forget to deprecate this. Once we merge this PR we should completely remove the 64-bit mode. Do you agree @bbarbakadze?
pdroalves left a comment:
Just a few minor comments and style changes.
The PR is quite good in my opinion. Is there anything else you need to do here? Otherwise it's a good moment to rebase. We can merge after these changes.
// Uses the complex-squaring identity: c0 = (a0+a1)(a0-a1), c1 = 2*a0*a1
// Only 2 Fp multiplications vs 3 for fp2_mont_mul(c, a, a).
// NOTE: All inputs and outputs are in Montgomery form (no conversions)
// NOTE: All inputs should be in Montgomery form
}

// Montgomery squaring using CIOS with triangular 32-bit MAD chains.
// See fp_mont_mul_mad32 for the algorithm reference (Koç et al., 1996).
Where can I find this reference? It would be good to add the paper name and venue with full author names.
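The complex-squaring identity quoted in the diff above can be checked with a toy sketch over a small prime (not BLS12-446): for a = a0 + a1*u with u^2 = -1, a^2 = (a0+a1)(a0-a1) + (2*a0*a1)*u, which needs 2 base-field multiplications instead of the 3 of a generic fp2 product.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t P = 101; // illustrative prime, small enough to avoid overflow

uint64_t add(uint64_t a, uint64_t b) { return (a + b) % P; }
uint64_t sub(uint64_t a, uint64_t b) { return (a + P - b) % P; }
uint64_t mul(uint64_t a, uint64_t b) { return (a * b) % P; }

// c = a^2 in Fp2 = Fp[u]/(u^2 + 1), using the complex-squaring identity.
void fp2_sqr(uint64_t c[2], const uint64_t a[2]) {
  c[0] = mul(add(a[0], a[1]), sub(a[0], a[1])); // (a0+a1)(a0-a1) = a0^2 - a1^2
  c[1] = mul(2, mul(a[0], a[1]));               // 2 * a0 * a1
}
```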
for (int j = 0; j < FP_LIMBS; j++) {
uint64_t acc =
(uint64_t)t[i + j] + (uint64_t)u * (uint64_t)p.limb[j] + carry;
t[i + j] = (UNSIGNED_LIMB)acc;
We've been slowly trying to avoid this type of cast in new C++ code. In the ZK backend that's a convention that we should be following.
Instead of
a = (type_t) b
you should be using
a = static_cast<type_t>(b)
I asked code to point out the other lines changed in this PR that need to be fixed.
● 14 C-style casts in new PR lines, across two functions:
fp_mont_reduce (32-bit path), lines 596-611:
┌──────┬──────────────────────────────────────────────────────┐
│ Line │ Cast │
├──────┼──────────────────────────────────────────────────────┤
│ 600 │ (uint64_t)t[i + j], (uint64_t)u, (uint64_t)p.limb[j] │
├──────┼──────────────────────────────────────────────────────┤
│ 601 │ (UNSIGNED_LIMB)acc │
├──────┼──────────────────────────────────────────────────────┤
│ 607 │ (uint64_t)t[idx] │
├──────┼──────────────────────────────────────────────────────┤
│ 608 │ (UNSIGNED_LIMB)acc │
└──────┴──────────────────────────────────────────────────────┘
fp_mont_mul_cios (32-bit Step 1), lines 1263-1273:
┌──────┬──────────────────────────────────────────────────────────┐
│ Line │ Cast │
├──────┼──────────────────────────────────────────────────────────┤
│ 1267 │ (uint64_t)t[j], (uint64_t)a.limb[j], (uint64_t)b.limb[i] │
├──────┼──────────────────────────────────────────────────────────┤
│ 1268 │ (UNSIGNED_LIMB)acc │
├──────┼──────────────────────────────────────────────────────────┤
│ 1271 │ (uint64_t)t[FP_LIMBS] │
├──────┼──────────────────────────────────────────────────────────┤
│ 1272 │ (UNSIGNED_LIMB)(sum64 >> LIMB_BITS) │
├──────┼──────────────────────────────────────────────────────────┤
│ 1273 │ (UNSIGNED_LIMB)sum64 │
└──────┴──────────────────────────────────────────────────────────┘
fp_mont_mul_cios (32-bit Step 2), lines 1298-1310:
┌──────┬──────────────────────────────────────────────────┐
│ Line │ Cast │
├──────┼──────────────────────────────────────────────────┤
│ 1302 │ (uint64_t)t[j], (uint64_t)m, (uint64_t)p.limb[j] │
├──────┼──────────────────────────────────────────────────┤
│ 1303 │ (UNSIGNED_LIMB)acc │
├──────┼──────────────────────────────────────────────────┤
│ 1308 │ (uint64_t)t[FP_LIMBS], (uint64_t)overflow │
├──────┼──────────────────────────────────────────────────┤
│ 1309 │ (UNSIGNED_LIMB)s64 │
├──────┼──────────────────────────────────────────────────┤
│ 1310 │ (UNSIGNED_LIMB)(s64 >> LIMB_BITS) │
└──────┴──────────────────────────────────────────────────┘
All 14 are widening (uint32_t → uint64_t) or truncating (uint64_t → uint32_t) integer casts that should use static_cast<>.
int idx = i + FP_LIMBS;
while (carry != 0 && idx <= 2 * FP_LIMBS) {
uint64_t acc = (uint64_t)t[idx] + carry;
t[idx] = (UNSIGNED_LIMB)acc;
// Add reduced lower half into upper half wide[n..2n-1]; the result lives
// in wide[n..2n-1] and is in [0, 2p).
fp_cadd_n_32(&wide[n], &wide[0], n);
FP_CARRY_32(wide[0]); // discard overflow (always 0 for p<2^446)
Do we need this line? The comment says the overflow is always 0, is that right?
Ahh, I think I get it, you are just consuming the carry, right? Maybe the comment should be replaced by "consume the carry flag so CC is clean".
PR content/description
Optimize BLS12-446 field arithmetic for MSM performance
Replace 64-bit CIOS Montgomery multiplication with 32-bit MAD chains
(mad.lo.cc/madc.hi.cc), exploiting native 2x throughput of 32-bit ops
on NVIDIA GPUs via even/odd accumulator separation
Add fp_mont_sqr using a triangular MAD chain (upper triangle computed
once and doubled, diagonal added separately), saving ~40% of the
multiplications versus treating squaring as a general multiplication
Add fp_add_lazy/fp_sub_lazy (and Fp2 variants): skip the final
conditional subtraction when the result feeds fp_mont_mul, which
accepts inputs in [0, 2p). Wired into fp2_mont_mul, fp2_mont_square,
and G1/G2 projective_point_double
Replace all fp_mont_mul(c, a, a) squaring patterns with fp_mont_sqr
across curve.cu and fp2.cu (point addition, doubling, inversion)
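The triangular squaring described above can be sketched on 4 tiny "limbs": the upper-triangle cross products a[i]*a[j] (i < j) are computed once and then doubled, and the diagonal squares are added afterwards, for n*(n+1)/2 multiplications instead of n^2. The 8-bit limbs below keep everything inside a uint64_t; the real fp_mont_sqr works on 32-bit limbs with PTX MAD chains and hardware carries.

```cpp
#include <cassert>
#include <cstdint>

constexpr int N = 4; // toy limb count; the real field has 14 32-bit limbs

uint64_t square(const uint8_t a[N]) {
  uint64_t acc = 0;
  for (int i = 0; i < N; i++)
    for (int j = i + 1; j < N; j++) // upper triangle, each product once
      acc += static_cast<uint64_t>(a[i]) * a[j] << (8 * (i + j));
  acc *= 2;                         // double the cross products
  for (int i = 0; i < N; i++)       // then add the diagonal squares
    acc += static_cast<uint64_t>(a[i]) * a[i] << (8 * (2 * i));
  return acc;
}
```

This computes 6 cross products plus 4 squares (10 multiplications) where a generic 4x4 product would need 16, which is the ~40% saving the description refers to.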
Check-list: