This repository contains the source code for the paper “Flexible Parallel Radix‑4 MDC NTT for FHE”. It provides an RTL library to implement radix‑4 multipath delay commutator (MDC) NTTs, as well as a framework to automatically generate twiddle factors, constants, test vectors, and core files for use with FuseSoC.
The hardware design is based on a radix‑4 MDC. It is flexible and supports different primes
Our work aims to provide a design‑time configurable IP block for deployment in FHE streaming accelerators. Streaming NTTs are common building blocks in FHE accelerators (see, e.g., Bertels et al., Van Beirendonck et al., and Kim et al.).
While Boolean schemes typically operate on a single polynomial ring modulus per parameter set, schemes such as CKKS usually employ the residue number system (RNS) and therefore operate on multiple ring moduli. However, choosing an auxiliary prime
for all RNS moduli
For accelerators targeting binary schemes, NTT and INTT are not merged as in the diagram below. For such designs, our core supports a Cooley–Tukey (CT) NTT and a Gentleman–Sande (GS) INTT. This combination eliminates the need for bit‑reverse reordering between NTT and INTT. The following figure shows where our hardware design can be used inside a streaming blind‑rotation accelerator for binary FHE schemes such as CGGI:
In the accompanying paper and demo applications, we focus on the STD128Q CGGI parameter set and implement a DIT‑based NTT and a DIF‑based INTT suitable for deployment in a blind‑rotation core as shown above.
For other schemes such as CKKS, our design implements a merged (I)NTT core, suitable for architectures similar to the one below:
For RNS schemes, our design allows either (i) generating one NTT for each required ring modulus
Using this technique, we focus on a parameter set with auxiliary prime
As our design is configurable at design time, it can also accelerate other cryptographic schemes that rely on the NTT, e.g., selected post‑quantum cryptography (PQC) schemes and zero‑knowledge proof (ZKP) systems.
- data: Data and scripts used for the paper submission, including HJSON configuration files describing design points and the data‑collection script.
- hw: SystemVerilog source code of the hardware IP blocks.
- sw: Software applications that drive the accelerator, verify the implementation on FPGA, and benchmark it against software.
- util: Tools and scripts for automatic generation of files (twiddle factors, constants, FuseSoC cores, etc.) and for evaluation.
Install the Python dependencies of the NTT framework:
python3 -m pip install -r python_requirements.txtThe script to compute primitive roots of unity requires SageMath. We use SageMath 9.5. On Ubuntu, install it via:
sudo apt update
sudo apt install sagemathTo run the FPGA demo applications or the Alveo U55C FPGA tests, XRT must be installed. On our deployment platform, we used XRT 2023.2 build 2.16.204 and Vitis 2023.2.
For the demo and benchmarking applications, OpenFHE must be installed. Our experiments use OpenFHE v1.4.2.
For synthesis we used Vivado/Vitis 2023.2, and for functional verification we used Questa 2024.3.
Compiled simulation libraries for XPM and UNISIM are required, as the hardware design instantiates XPM memories and LUT primitives. For information on how to compile AMD/Xilinx simulation libraries for your simulator, see UG900. In case of Questa, the respective modelsim.ini is required to reside within hw/dv.
The following Python script generates FuseSoC targets for a specific parameter set, both for verification (DV) and synthesis (IP). It also creates randomized test vectors and the required twiddle‑factor ROM images.
python3 util/ntt_r4mdc.py $N $Q $PSI $W4 $LOGQ sparse $BFUSExample:
python3 util/ntt_r4mdc.py 1024 33550337 3037 12759331 25 sparse 16The generated synthesis target is placed at:
hw/ip/ntt_r4mdc/ntt_r4mdc_$Q_$N_$BFUS.core
The generated verification target is placed at:
hw/ip/ntt_r4mdc/dv/ntt_r4mdc_$Q_$N_$BFUS_dv.core
Once generated, you can run synthesis or verification via FuseSoC.
To synthesize a generated IP target, run:
./synth_target.sh -s "aisec:ip:ntt_r4mdc_$Q_$N_$BFUS:0.1" $Q_$N_dit $Q_$N_ditTo verify a generated DV target via simulation, run:
./verify_target.sh -s "aisec:dv:ntt_r4mdc_$Q_$N_$BFUS:0.1" $Q_$N_dit $Q_$N_ditIf you use a simulator other than Questa, adapt line 21 of verify_target.sh accordingly.
| Scheme | N | Q | PSI | W4 |
|---|---|---|---|---|
| CGGI | 1024 | 33550337 | 3037 | 12759331 |
| CKKS | 4096 | 18446744069414584321 | 1532612707718625687 | 281474976710656 |
| Comparison | 1024 | 268369921 | 3260661 | 75074761 |
The data directory contains the data used in the paper, the data‑collection script, and the HJSON configuration files describing the design points.
The file data/data_tcas.hjson lists the design points. Each design point consists of cryptographic parameters (e.g., modulus and polynomial length) and a hardware configuration (parallelism, target frequency, etc.).
To collect all data points specified in data/data_tcas.hjson, run:
python3 data/run_data_collection.py data/data_tcas.hjsonInternally, the following steps (described above) are executed for each design point:
- Generate parallel radix‑4 targets.
- Synthesis.
- Verification.
- Parse and copy generated reports into data.
Running this script can take several hours, since each design point is synthesized and implemented in Vivado.
To demonstrate the applicability of our approach, the repository includes three software demos and proof‑of‑concept FPGA implementations.
-
sw/demo_parameter: Examples adapted from OpenFHE.
These implement demo applications for the target parameter sets of our case study. For CGGI, we adapt the binary FHE example to theSTD128Qparameter set. For CKKS, we adapt the CKKS arithmetic example with Composite Scaling. For more details on Composite Scaling, see OpenFHE’s documentation and Agrawal et al.. -
sw/test: Tests functionality of FPGA bitstreams by comparing FPGA results against automatically generated test vectors (generated via util/ntt_r4mdc.py).
-
sw/benchmark: Compares the runtime of our FPGA implementation on a batch of 32 polynomials against a software implementation based on OpenFHE.
To generate the bitstreams for the AMD Alveo U55C FPGA, run the commands below. If you require additional debug capability, you can append the --debug flag to these commands. This adds ILAs for usage with Chipscope on the AXI interfaces between the RTL kernel and the HBM.
./hw/top_ntt/util/gen_bitstream_top_ntt.sh -s "aisec:fpga:top_ntt_u55c_33550337_1024_16:0.1" 33550337_1024_dit u55c --freq 250./hw/top_ntt/util/gen_bitstream_top_ntt.sh -s "aisec:fpga:top_ntt_u55c_33550337_1024_16:0.1" 33550337_1024_dif u55c --freq 250./hw/top_ntt/util/gen_bitstream_top_ntt.sh -s "aisec:fpga:top_ntt_u55c_18446744069414584321_4096_8:0.1" 18446744069414584321_4096_uni_opt u55c --freq 200For the bitstream generation to succeed, ensure that the corresponding targets and constants were generated beforehand via:
python3 util/ntt_r4mdc.py $N $Q $PSI $W4 $LOGQ sparse $BFUSTo build the FPGA verification application for a dit/dif design run the command below.
g++ -g -std=c++17 -I$XILINX_XRT/include -L$XILINX_XRT/lib -o sw/test/test_top_ntt.exe sw/test/test_top_ntt.cpp -lxrt_coreutil -pthreadTo build the FPGA verification application for a uni design run the command below.
g++ -g -std=c++17 -I$XILINX_XRT/include -L$XILINX_XRT/lib -o sw/test/test_top_ntt_rns.exe sw/test/test_top_ntt_rns.cpp -lxrt_coreutil -pthreadFor the OpenFHE applications, please use the commands below.
cd sw/{benchmark,demo_parameter}/{cggi,ckks}/
mkdir build
cd build
cmake ..
make
cd ../../../../To run applications, e.g. verify the STD128Q CGGI FPGA design and benchmark it, execute the following commands.
./sw/test/test_top_ntt.exe -s "aisec:fpga:top_ntt_u55c_33550337_1024_16:0.1" 33550337_1024_{dit,dif} u55c./sw/benchmark/cggi/build/boolean -s "aisec:fpga:top_ntt_u55c_33550337_1024_16:0.1" 33550337_1024_{dit,dif} u55cTo run application, e.g. verify the CKKS Composite Scaling FPGA design and benchmark it, execute the following commands.
./sw/test/test_top_ntt_rns.exe -s "aisec:fpga:top_ntt_u55c_18446744069414584321_4096_8:0.1" 18446744069414584321_4096_uni u55c./sw/benchmark/ckks/build/test -s "aisec:fpga:top_ntt_u55c_18446744069414584321_4096_8:0.1" 18446744069414584321_4096_uni u55cOur framework supports arbitrary parameter sets. To generate a new configuration, run:
python3 util/ntt_r4mdc.py $N $Q $PSI $W4 $LOGQ sparse $BFUSTo obtain the primitive
sage -python util/primitive_nth_rooth.py $Q $NA prerequisite for using our hardware design with a new parameter set is the existence of the corresponding modular‑reduction unit in the project. To add a new reduction implementation, follow the instructions in the next section.
To provide a plug‑and‑play solution for arbitrary parameter sets without additional implementation effort, we will add a word‑level Montgomery multiplier following Hirner et al..
To add a new optimized sparse‑reduction implementation, follow these steps:
- Copy the RTL implementation of the reduction unit into hw/ip/arith/rtl.
- Copy the RTL implementation of the constant‑multiplication unit into hw/ip/arith/rtl.
- Add a new case in hw/ip/arith/rtl/sparse_reduction.sv.
- Add a new case in hw/ip/arith/rtl/const_multiplier.sv.
- Extend the file list in hw/ip/arith/arith.core.
- Add the corresponding delays in util/ntt_r4mdc.py in the
compute_delaysfunction.
- Replace the verification environment with a UVM environment.
- Integrate Montgomery reduction to support arbitrary NTT parameter sets without additional implementation effort.
- Automate the integration of additional optimized reduction implementations.
- Explore higher radices together with Fermat numbers (
$2^{2^{k}} + 1$ ) as in Kim et al.. - Extend the framework for positive wrapped convolution and ZKPs.
This code base contains AXI read and write master interfaces from Xilinx Vitis-Tutorials. Parts of the AXI slave interface were designed based on ZipCPU and Xilinx Vitis-Tutorials. The files residing within sw/benchmark and sw/demo_paramter contain code examples of OpenFHE licensed under BSD 2-Clause License. The file hw/axi/rtl/skidbuffer.sv is taken from ZipCPU.