NTT Multiplication for NTT-unfriendly Rings
New Speed Records for Saber and NTRU on Cortex-M4 and AVX2

Chi-Ming Marvin Chung¹,², Vincent Hwang¹,², Matthias J. Kannwischer³, Gregor Seiler⁴,⁵, Cheng-Jhih Shih¹,² and Bo-Yin Yang¹

¹ Academia Sinica, Taipei, Taiwan
(marvin852316497,vincentvbh7,cs861324)@gmail.com, by@crypto.tw
² National Taiwan University, Taipei, Taiwan
³ Max Planck Institute for Security and Privacy, Bochum, Germany
matthias@kannwischer.eu
⁴ IBM Research – Zurich, Rüschlikon, Switzerland
⁵ ETH Zurich, Zurich, Switzerland
gseiler@inf.ethz.ch

Abstract. In this paper, we show how multiplication for polynomial rings used in the NIST PQC finalists Saber and NTRU can be efficiently implemented using the Number-theoretic transform (NTT). We obtain superior performance compared to the previous state of the art implementations using Toom–Cook multiplication on both NIST’s primary software optimization targets AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: On the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT).

For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication resulting in a 22% speed-up of Saber encapsulation. For NTRU, the speed-up is less impressive, but still NTT-based multiplication performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook which results in a 6% speed-up for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets.

As a further illustration, we also include code for AVX2 and Cortex-M4 for the Chinese Association for Cryptologic Research competition award winner LAC (also a NIST round 2 candidate) which outperforms existing code.

Keywords: Polynomial Multiplication, NTT Multiplication, Saber, NTRU, Cortex-M4, AVX2

1 Introduction

Popular PKC primitives like RSA and elliptic curve cryptography (ECC) which are based on the hardness of factoring large integers and the discrete logarithm problem both are vulnerable to quantum computer attacks [Sho94]. Thus we often hear that Quantum Computers (QCs) may arrive soon and break all common Public-Key Cryptography (PKC) today.

∗This work was in part done while MJK was employed by Radboud University, Nijmegen, The Netherlands and visiting Academia Sinica, Taipei, Taiwan.
Hence Post-Quantum Cryptography (PQC), the study of QC-resistant PKCs. There are five major classes of PQCs today, based on multivariate quadratics (MPKCs), lattices, error-correcting codes, (supersingular) isogenies, and hash functions. Most extant PQCs have merit as PKCs beyond being post-quantum. They tend to be faster at the same designed level of security, and as such are reasonable candidates for wide deployment even without considering QC. In particular, PQC based on hard lattice problems combine good overall performance with acceptable transmission bandwidth requirements.

At the moment the U.S. National Institute of Standards and Technology (NIST) has called for candidates for the next generation post-quantum cryptography. 82 cryptosystems were submitted in 2017 and 15 are currently in the 3rd round which started in July 2020 of which 7 are considered finalists and 8 are alternate schemes. NIST plans to start standardization of some of the finalists in about 2 years from the start of the 3rd round. The perceived superiority of lattice-based crypto is reflected in the NIST post-quantum cryptography standardization process, as nearly half of the candidates were (and are) using hard lattice problems as their building blocks. Most of these are “small lattice systems” which use polynomial rings as the basic algebraic structure. The most critical algebraic step is a polynomial multiplication modulo a specified polynomial. Most of the time one multiplicand has random-looking coefficients, and the other has small coefficients.

Of these small lattice-based cryptosystems, several are designed from the ground up to depend on a specific way to multiply polynomials in an integer ring: the Number Theoretic Transform (NTT). The remaining third round candidates with this structure are Kyber, FALCON, and Dilithium [ABD+19, FHK+19, LDK+19], and there were other similar submissions in earlier rounds of the NIST competition (e.g., [PAA+19, DTGW17]).

There seems to be a common conception that schemes that were not specifically designed to benefit from NTT-based multiplication by using a NTT-friendly ring cannot be efficiently implemented using them and, hence, one has to fall back to other multiplication algorithms like Karatsuba multiplication [KO63] or Toom–Cook multiplication [Too63, Coo66]. Among the finalists, this applied to two schemes: Saber [DKRV19] and NTRU [ZCH+19]. Both use a power-of-two modulus which is inherently incompatible with straightforward NTTs. Previous implementations of Saber and NTRU use a combination of Toom-4 and Karatsuba to implement efficient polynomial arithmetic. However, as we show in this work it is still possible to use NTTs to implement their underlying polynomial arithmetic and obtain superior performance compared to the state of the art implementations both on the ARM Cortex-M4 and AVX2.

Leaving the performance aspect aside, it is also interesting to be able to implement all lattice-based schemes with NTT-based polynomial multiplication algorithms from an ease of implementation point of view. Furthermore, this way all schemes can benefit from potential future hardware support for computing NTTs. Because of these reasons we think that even a small decrease in runtime maybe acceptable when using NTT-based multiplication instead of other methods.

The Chinese Association for Cryptologic Research (CACR) also sponsored a competition similar to that of NIST between 2018–19 [CAC19]. All three First Class Award winners were small lattice-based systems. Two of them, styled Aigis-ENC and Aigis-Sign, resemble Kyber and Dilithium in their design (see [ZYF+19], where the authors detail their deviations from Kyber and Dilithium). The other, LAC [LLJ+19], has a very small prime modulus \((q = 251)\) which is not suited to NTTs, and the designers suggest a sparse multiplication technique instead. We also show below that NTTs can be used to obtain performance superior to all previous implementations.

**Contribution.** We show how NTTs can be used to obtain efficient polynomial arithmetic in finite fields modulo a power-of-two. We present new implementations of Saber, LAC, and NTRU targeting the ARM Cortex-M4 and AVX2 which are faster than any implementations
described in the literature for the majority of parameter sets. Only for \texttt{ntruhps2048509} we were unable to obtain a speed-up on AVX2. Interestingly, our two platforms require different multiplication strategies due to limitations of the available multiplication instructions.

**Code.** Our implementations of Saber, LAC, and NTRU are Open Source and are available at https://github.com/ntt-polymul/ntt-polymul.

**Structure of this Paper.** Section 2 describes Saber, LAC, and NTRU and the background of the techniques required to implement polynomial arithmetic using NTTs for each. Section 3 presents the implementation details on the Cortex-M4. Section 4 presents the implementation details for AVX2 on Skylake. In Section 5 we present the performance results for Saber, LAC, and NTRU on our target platforms.

## 2 Preliminaries

This section is organized as follows: First, we introduce the cryptographic schemes we consider in this paper: Saber (Section 2.1), NTRU (Section 2.2), and LAC (Section 2.3). Second, Section 2.4 introduces the NTT techniques that can be used to implement polynomial arithmetic for NTRU and Saber. Last, we present some of the intricacies of Cortex-M4 in Section 2.5.

### 2.1 Saber

Saber [DKRV19] is a lattice-based key encapsulation mechanism based on the Module Learning With Rounding M-LWR problem. The polynomial ring used within Saber is $\mathbb{Z}_q[x]/(x^n + 1)$ with $q = 2^{13}$ and $n = 256$ across all parameter sets. As most other lattice-based schemes, Saber constructs a CCA-secure KEM from a CPA-secure DPKE.

**Algorithm 1** Saber Key Generation

**Output:** $pk = (\text{seed}_A, b)$, $sk = (s)$

1: seed$_A \leftarrow \text{Sample}_U()$
2: $A \in \mathbb{R}^l \times l \leftarrow \text{Expand}($seed$_A$)
3: $s \in \mathbb{R}^l \leftarrow \text{Sample}_B()$
4: $b \leftarrow \text{Round}(A^T \cdot s)$

**Algorithm 2** Saber CPA Encryption

**Input:** $m, r, pk = (\text{seed}_A, b)$

**Output:** $ct = (c, b')$

1: $A \in \mathbb{R}^l \times l \leftarrow \text{Expand}($seed$_A$)
2: $s' \in \mathbb{R}^l \leftarrow \text{Sample}_B(r)$
3: $b' \leftarrow \text{Round}(As')$
4: $v' \leftarrow b'^T (s' \mod p)$
5: $c \leftarrow \text{Round}(v' - 2^{x-1}m)$

**Algorithm 3** Saber CPA Decryption

**Input:** $ct = (c, b'), sk = (s)$

**Output:** $m$

1: $v \leftarrow b'^T (s \mod p)$
2: $m \leftarrow \text{Round}(v - 2^{x-1}c \mod p)$

Algorithm 1, Algorithm 2, and Algorithm 3 depict the CPA-secure key generation, encryption, and decryption respectively. \texttt{Sample}_U refers to sampling from a uniform distribution, \texttt{Sample}_B refers to sampling from a binomial distribution. \texttt{Expand} expands a seed to a uniform matrix of polynomials. We omit the CCA variants for brevity and refer the reader to the specification for the corresponding CCA transformation. Saber’s major operation in key generation and encryption is the matrix-vector multiplication of polynomials $A^T \cdot s$ and $As'$. In decryption the most expensive operation is the inner product of $b'^T \cdot s$. 

Sample
Table 1: NTRU and Saber Parameter Sets

<table>
<thead>
<tr>
<th>name</th>
<th>( l )</th>
<th>( T = 2^r )</th>
<th>( \mu )</th>
<th>( q )</th>
<th>( n )</th>
</tr>
</thead>
<tbody>
<tr>
<td>Lightsaber</td>
<td>2</td>
<td>2^2</td>
<td>10</td>
<td>ntruhps2048509</td>
<td>2048 = 2^{11}</td>
</tr>
<tr>
<td>Saber</td>
<td>3</td>
<td>2^4</td>
<td>8</td>
<td>ntruhps2048677</td>
<td>2048 = 2^{11}</td>
</tr>
<tr>
<td>Firesaber</td>
<td>4</td>
<td>2^6</td>
<td>6</td>
<td>ntruhrss701</td>
<td>8192 = 2^{13}</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>ntruhps4096821</td>
<td>4096 = 2^{12}</td>
</tr>
</tbody>
</table>

**Parameters** The Saber submission specifies the three parameter sets Lightsaber, Saber, and Firesaber targeting the NIST security levels 1, 3, and 5 respectively. While the underlying polynomial ring remains the same for all parameter sets, the module dimension \( l \), the rounding parameter \( T \), and the secret distribution parameter \( \mu \) varies per parameter set. The parameters are summarized in Table 1a.

**CCA Transform** To achieve IND-CCA2 security, Saber is using a variant of the FO transform due to Hofheinz, Hövelmanns, and Kiltz [HHK17]. However, as the randomness \( r \) (and the corresponding \( s' \)) cannot be recovered in decryption, Saber does require re-encryption in the decapsulation algorithm. Hence, improving the encryption also improves decapsulation. For technical details on the FO transform, refer to the specification [DKRV19].

### 2.2 NTRU

The NTRU submission [ZCH+19] is based on the NTRU cryptosystem which was first proposed by Hoffstein, Pipher, and Silverman in 1998 [HPS98]. Two teams submitted an NTRU-like scheme to the NIST competition named NTRU-HRSS and NTRUEncrypt. After the first round, those teams merged their proposals giving it the new name ‘NTRU’. It operates in the three polynomial rings \( \mathbb{Z}_3[x]/\Phi_n \), \( \mathbb{Z}_q[x]/\Phi_n \), and \( \mathbb{Z}_q[x]/(\Phi_1 \cdot \Phi_n) \) with \( \Phi_1 = (x - 1) \) and \( \Phi_n = (x^{n-1} + x^{n-2} + \cdots + 1) \).

The algorithms for key generation, encryption, and decryption are shown in Algorithm 4, Algorithm 5, and Algorithm 6 respectively. For the details of Sample and Lift, see [ZCH+19].

NTRU’s main benefit is the relatively cheap encapsulation which is the fastest of the KEM finalists in the NIST competition. However, it comes with a rather costly key generation procedure as it requires polynomial inversion. In both encryption and decryption, the major arithmetic operation is polynomial multiplication.

**Algorithm 4 NTRU Key Generation**

**Output:** \( pk = (h), sk = (f, f_p, h_q) \)

1: \( f, g \leftarrow \text{Sample}() \)
2: \( f_q \leftarrow f^{-1} \mod (q, \Phi_n) \)
3: \( h \leftarrow (3 \cdot g \cdot f_q) \mod (q, \Phi_1 \cdot \Phi_n) \)
4: \( h_q \leftarrow h^{-1} \mod (q, \Phi_n) \)
5: \( f_p \leftarrow f^{-1} \mod (3, \Phi_n) \)

**Algorithm 5 NTRU CPA Encryption**

**Input:** \( m, r, pk = (h) \)

**Output:** \( c \)

1: \( m' \leftarrow \text{Lift}(m) \)
2: \( c \leftarrow (r \cdot h + m') \mod (q, \Phi_1 \cdot \Phi_n) \)

**Algorithm 6 NTRU CPA Decryption**

**Input:** \( c, sk = (f, f_p, h_q) \)

**Output:** \( r, m \) or \( \text{fail} \)

1: if \( c \neq 0 \mod (q, \Phi_1) \) return \( \text{fail} \)
2: \( a \leftarrow (c \cdot f) \mod (q, \Phi_1 \cdot \Phi_n) \)
3: \( m \leftarrow (a \cdot f_p) \mod (3, \Phi_n) \)
4: \( m' \leftarrow \text{Lift}(m) \)
5: \( r \leftarrow ((c - m') \cdot h_q) \mod (q, \Phi_n) \)
The polynomial ring used in LAC is \( \mathbb{Z}[x]/(X^n + 1) \) with \( q = 251 \) and \( n = 512 \) for LAC-128 and \( n = 1024 \) for LAC-192 and LAC-256. As most other lattice-based schemes, LAC constructs a CCA-secure KEM from a CPA-secure PKE. By implicitly rejecting invalid ciphertexts, NTRU can avoid having to re-encrypt the message in the decapsulation. Due to space limitations, we omit the details here and refer the reader to the specification for the corresponding CCA transformation.

### CCA transformation

NTRU is using a variant of the FO transform \([FO99]\) transformation to obtain a CCA-secure KEM from the CPA-secure PKE. By implicitly rejecting invalid ciphertexts, NTRU can avoid having to re-encrypt the message in the decapsulation. Due to space limitations, we omit the details here and refer the reader to the specification \([ZCH+19]\).

#### 2.3 LAC

LAC \([LLJ+19]\) is a lattice-based key encapsulation mechanism based on the Ring Learning with Errors problem. The polynomial ring used in LAC is \( \mathbb{Z}[x]/(X^n + 1) \) with \( q = 251 \) and \( n = 512 \) for LAC-128 and \( n = 1024 \) for LAC-192 and LAC-256. As most other lattice-based schemes, LAC constructs a CCA-secure KEM from a CPA-secure PKE.

Algorithm 7, Algorithm 8, and Algorithm 9 depict the CPA-secure key generation, encryption, and decryption respectively. \( \text{Sample}_B \) refers to sampling from a uniform distribution, \( \text{Sample}_B \) to sampling from a fixed-weight ternary distribution. \( \text{Expand} \) expands a seed to a uniform matrix of polynomials. \((\cdot)_t \) means to take the first \( t \) coefficients of a polynomial as a vector. We omit the CCA variants for brevity and refer the reader to the specification for the corresponding CCA transformation. LAC’s major operations are multiplications \((as, ar, br, c_1s)\).

### Parameters

The LAC submission specifies the three parameter sets LAC-128, LAC-192, and LAC-256 targeting the NIST security levels 1, 3, and 5 respectively. The parameters are summarized in Table 2.

#### CCA Transform

To achieve IND-CCA2 security, LAC is using a variant of the Fujisaki-Okamoto transform due to Hofheinz, Hövelmanns, and Kiltz \([HHK17]\), similar to Saber.

<table>
<thead>
<tr>
<th>name</th>
<th>( n )</th>
<th>( l_v )</th>
<th>( B )</th>
<th>( B' )</th>
<th>ECC</th>
</tr>
</thead>
<tbody>
<tr>
<td>LAC-128</td>
<td>512</td>
<td>511</td>
<td>( \left( \frac{1}{2}, \frac{1}{2}, \frac{1}{4} \right) )</td>
<td>( \left( \frac{1}{2}, \frac{1}{2}, \frac{1}{4} \right) )</td>
<td>( BCH(511, 256, 33) )</td>
</tr>
<tr>
<td>LAC-192</td>
<td>1024</td>
<td>511</td>
<td>( \left( \frac{1}{2}, \frac{1}{2}, \frac{1}{4} \right) )</td>
<td>( \left( \frac{1}{2}, \frac{1}{2}, \frac{1}{4} \right) )</td>
<td>( BCH(511, 256, 17) )</td>
</tr>
<tr>
<td>LAC-256</td>
<td>1024</td>
<td>1023</td>
<td>( \left( \frac{1}{2}, \frac{1}{2}, \frac{1}{4} \right) )</td>
<td>( \left( \frac{1}{2}, \frac{1}{2}, \frac{1}{4} \right) )</td>
<td>( BCH(511, 256, 33) + D2 )</td>
</tr>
</tbody>
</table>
The FFT usually means to split everything down from $x^{2^k} - 1$ to linear factors, which requires a primitive root $\zeta \in R$ of degree $2^k$ (an element such that $\zeta^{2^k-1} = -1$). The Number Theoretic Transform (NTT) means an FFT in a prime field, which must be modulo an “NTT-friendly” prime of the form $p = 2^k p' + 1$.

One might invert the FFT trick by using Gentleman–Sande butterflies, replacing all the divisions by 2 in favor of a final division by $2^k$. We may also treat this as a forward
The Twisted FFT trick” means to do the FFT trick on $x$ with the powers of 2. The “Negacyclic convolution” means to multiply modulo $2^n$ by the powers of 2. Good’s FFT trick [Goo51], where we set $x = yw$ with $y^{2^{k-1}} = -1 = w^{h-1} + w^{h-2} + \cdots + w$. Instead of the incomplete NTT, if the length of the NTT is $2^k$ and $h$ odd, we apply Good’s FFT trick [Goo51], where we set $x = yw$ with $y^{2^{k-1}} = -1 = w^{h-1} + w^{h-2} + \cdots + w$. A multiplicand $a(x) \in \mathbb{Z}_q[x]/(x^{2^k} - 1)$ with deg $a < h \cdot 2^k$ becomes $b(y, z) \in \mathbb{Z}_q[y, w]$ with $x^i = y^{i \mod 2^k} w^{i \mod h}$, with deg$_y b < 2^k$ and deg$_w b < h$. We may write $b = \cdots$
\[ \sum_{i=0}^{h-1} w^i b_i(y) \] with \( b_i(y) \in \mathbb{Z}_q[y]/(y^{2^k} - 1) \). We term “Good’s permutation” the map from the array \( a[] \) representing \( \sum_{0 \leq i < h \cdot 2^k} a_i x^i \) to \( b[] \) representing \( \sum_{i=0}^{h-1} \sum_{j=0}^{2^k-1} b_{i,j} w^i y^j \).

We follow Good’s permutation with a size-2\( k \) FFT w.r.t. \( y \) on each multiplicand, represented by \( h \) parallel size-2\( k \) NT Ts. Then we do “point” multiplication by convolving together degree-\((h - 1)\) polynomials in \( w \) modulo \( w^h - 1 \), do an inverse size-2\( k \) FFT (represented by \( h \) inverse NT Ts), and then finally undo Good’s permutation.

### 2.4.6 NTTs with Modulus not of the form \( 2^k p' + 1 \)

Suppose we have a convolution modulo \( x^n - 1 \) modulo \( q \), where \( n \ nmid (q - 1) \). We can express the polynomials with coefficients in \( [-\frac{n}{2}; \frac{n}{2}] \) and compute the convolution as a polynomial of integer coefficients. The absolute magnitude of the resulting coefficients would be at most \( nq^2/4 \). Therefore, if we find a prime \( p > nq^2/2 \) such that \( n|(p - 1) \), and compute the multiplication mod \( p \) (which we can using NT Ts of length \( n \) mod \( p \)), then the result must be correct as a polynomial with integer coefficients, and then we can recover our correct result modulo \( q \).

The procedure is quite similar if it is a different kind of convolution or another product. In the case of our applications (Saber and NTRU), one of the multiplicands is usually “small” so that we can use an even smaller prime.

### 2.4.7 Mixed-Radix NTT for Multiplications

In [CT65] Cooley and Tukey explained how to effectively compute a general FFT of a composite size \( N \). In such cases, the FFT operation can be realized by combining the results of \( N/p \) smaller FT Ts on vectors of size \( p \). These elementary FFT operations over vectors of prime size are also referred to by the shape of their diagrams as butterflies. After such subdivisions, the immediate output of the algorithm would appear in an order different from that of the input (however, like the radix-2 case, that need not concern us).

### 2.4.8 Multiple Moduli and the Explicit CRT (Divided Difference Form)

As in Section 2.4.6 suppose we have a convolution modulo \( x^n - 1 \) modulo \( q \), where \( n \ nmid (q - 1) \). A different possibility is to take various N T T-friendly primes \( p_i \) whose product \( P \) is sufficiently large (usually \( > nq^2/2 \) ). Clearly computing the multiplication mod \( P \) must return the correct product as polynomial with integer coefficients. This we can do by computing the product modulo each \( p_j \) using NT Ts. There are at least two methods to put the pieces together modulo \( P \), from which we can compute our correct results. One is via the Explicit Chinese Remainder Theorem [BS07]. The other is the following approach:

**Theorem 1.** Let \( p_i > 0 \) be odd, pairwise co-prime \( (\gcd(p_i, p_j) = 1 \) for \( 1 \leq i < j \leq s) \). An explicit solution \( u \) of \( u \equiv u_i \pmod{p_i} \), \( i = 1 \ldots s \), where \( |u| < P_i/2 \), where \( |u| < P/2 = \prod_{i=1}^{s} p_i \), is given by

\[
\begin{align*}
    y_1 &= u_1 \\
    y_2 &= y_1 + ((u_2 - y_1)m_2 \pm p_2) p_1 \\
    y_3 &= y_2 + ((u_3 - y_2)m_3 \pm p_3) p_1 p_2 \\
    & \vdots \\
    u &= y_s = y_{s-1} + ((u_s - y_{s-1})m_s \pm p_s) p_1 \cdots p_{s-1}
\end{align*}
\]

where each \( m_i := (p_1 \cdots p_{i-1})^{-1} \mod p_i \).

The theorem is also true for noncentered mod and is faster than [BS07] for small \( s \).
2.5 Cortex-M4

As selected by NIST for evaluating PQC candidates on micro-controllers, the ARM Cortex-M4 (or M4F since a floating point unit is assumed) is one of our target platforms for implementing PQC schemes. It is a RISC microcontroller which has fourteen 32-bit general purpose registers. Its instruction set has several unusual features:

**Single Cycle:** Most instructions take 1 cycle each, including \(32 \times 32 + 64 = 64\)-bit MADD. The most conspicuous exception is the first load in a sequence of loads (2 cycles).

**Barrel Shifter:** In almost all instructions one of the operands can be shifted or rotated by an arbitrary number of bits at no extra cost (and even sometimes help set carry).

**Flexible Indexing:** Registers may be shifted before/after being used as the index to load.

**SIMD:** Some arithmetic instructions that operate on 8- and 16-bit chunks of registers.

**Restricted immediates:** Not all 32-bit numbers can be used as an immediate operand.

**Optional Flag-setting:** Instructions don’t set flags by default, though most optionally do.

We mention some of the main tricks which we employ below.

**Reductions.** Since all of our approaches on Cortex-M4 are 32-bit NTTs, we need 32-bit modular reductions. We implement 32-bit signed Barrett reduction and 32-bit signed Montgomery multiplication with Cortex-M4’s powerful 1-cycle long multiplications `smull`, `smlal`, and `smmulr`. Fix an integer \(a\), a modulus \(q\), and let \(R = 2^{32}\). For Barrett reduction, we compute \(a \mod \pm q\) with \(a - \left\lfloor \frac{a}{q} \right\rfloor \cdot q \approx a - \left\lfloor \frac{a}{R} \right\rfloor \cdot q\) in 2 cycles as shown in Algorithm 10.

For Montgomery multiplication, we compute \(abR^{-1} \mod \pm q\) instead. While computing the point multiplication for NTTs using schoolbook multiplication, neither multiplicand is known beforehand, but we may cancel out the \(R^{-1} \mod \pm q\) by Montgomery-multiplying the precomputed \(\text{NTT}_q^{-1}R^2 \mod \pm q\) (i.e. an extra factor of \(R\)) at the end of \(\text{NTT}_q^{-1}\). For convenience, we denote the computations \((a,b) \mapsto abR^{-1} \mod \pm q\) by \(\text{montgomeryM}\) and \(a(64\text{-bit}) \mapsto aR^{-1} \mod q(32\text{-bit})\) by \(\text{montgomeryR}\).

### Algorithm 10: Barrett reduction

**Input:** \(c_0 = a\)

**Output:** \(c_0 = a - \left\lfloor \frac{a}{q} \right\rfloor \cdot q\)

1. `smmulr tmp, c0, \left\lfloor \frac{R}{q} \right\rfloor`
2. `mls c0, tmp, q, c0`

### Algorithm 11: Montgomery multiplication

**Input:** \((c0, c1) = (a, b)\)

**Output:** \(c0 = abR^{-1} \mod \pm q\)

1. `smull tmp0, c0, c0, c1`
2. `mul tmp1, tmp0, (-q^{-1} \mod \pm R)`
3. `smlal tmp0, c0, tmp1, q`

**Floating-point registers.** On the Cortex-M4 as specified by NIST, there are 14 general purpose registers and 32 floating-point registers. Floating-point registers not only enable us to access frequently used twiddle factors but also give us great flexibility on designing our approaches; loading twiddle factors to floating-point registers before loops for butterflies saves a general purpose register (crucial, since all our implementation on Cortex-M4 are 32-bit NTTs), we have a slightly faster implementation for `ntruhs2048509` comparing to...
Toom-4 implementation and a faster variant to use 64-bit (vs. the usual 32-bit) accumulators for Saber’s matrix-to-vector product. Details for these variants would be elaborated in the relevant sections.

3 NTTs on the Cortex-M4

On Cortex-M4, we commonly compute three layers of radix-2 NTTs at a time, Algorithm 19 illustrates the idea and is adapted from [ACC+20]

3.1 Saber

For Saber, we replace polynomial multiplications in the subroutines InnerProd and MatrixVectorMul using the negacyclic NTT trick to eliminate all Toom-4 multiplications in Saber. In the interest brevity, we only detail MatrixVectorMul (which takes most of the time) that multiplies an \( l \times l \) matrix with an \( l \times 1 \) vector, where each component is an element of \( \mathbb{Z}_q[x]/(x^{256} + 1) \). The design of Saber provides additional incentives to use NTTs because the matrix-to-vector product is turned into a matrix-to-vector point-multiplication in NTT domain. More concretely, we do not merely save the difference in cycles between Toom-4 and NTT-based degree-255 polynomial multiplications, because to compute the \( l^2 \) multiplications in MatrixVectorMul, we only need to compute \( l^2 + l \) NTTs and \( l \) NTT inverses instead of \( 2l^2 \) NTTs and \( l^2 \) inverse NTTs as normally might be expected.

Our NTT-based MatrixVectorMul therefore proceeds as follows: compute the size-256 negacyclic NTT for each component in the matrix and the vector, multiply the matrix by the vector with degree-3 schoolbooks, accumulate the result to a vector, and then compute the NTT inverses for each component.

We compute incomplete NTTs and degree-3 schoolbook as it gives the best performance for Saber. To compute \( a_0 \cdot b_0 + a_1 \cdot b_1 + a_2 \cdot b_2 \) (mod \( q' \)) using Montgomery reductions, we only need 1 smull, 2 smlal, and 1 Montgomery reduction instead of computing 3 multiplications, each followed by a Montgomery reduction, adding the results together, and then reducing modulo \( q' \) again. Furthermore, this idea also applies where each \( a_i \cdot b_i \) is a degree-3 schoolbook.

Choosing the best incomplete NTT. When using incomplete NTTs we need to choose at which point we should stop doing NTT butterfly operations and simply multiplying the polynomials using school-book multiplication. One can choose between 8 layers of NTTs, 7 layers of NTTs followed by \( 2 \times 2 \) schoolbook, 6 layers of NTTs followed by \( 4 \times 4 \) schoolbook, and 5 layers of NTTs followed by \( 8 \times 8 \) schoolbook. First we compare the behavior of incomplete NTTs. On a Cortex-M4, among the 14 general purpose registers, we need one register for loading coefficients, one register for loading the twiddle factors \( \zeta \), two registers for constants used in Montgomery multiplication, two registers as temporary storage for Montgomery multiplication in schoolbook. There are only 8 registers remained where computing 3 layers of NTTs at a time could be achieved without overhead. Computing 5 layers of NTTs would not achieve the economical use of registers, since we can often compute the 5-th layer without spilling the registers. Computing 7 layers of NTTs would involve a lot of vmovs because of the lack of registers. For Saber, we achieve the best performance when doing 6 layers of NTTs. This can be explained by comparing \( 4 \times 4 \) school-book multiplication and size-4 NTTs. For simplicity, we will focus on an \( l \)-dimensional matrix-to-vector product in which each component is a degree-3 polynomial. A \( 4 \times 4 \) school-book multiplication requires 7 smulls, 12 smlals, and 7 montgomeryRs as illustrated in Algorithm 17 in the Appendix. For accumulation, each \( l \)-dimensional row-column inner product requires \( 4l - 4 \) adds and \( 4l - 4 \) vmovs for temporary storage. Therefore, \( 4l^2 - 8l \) cycles are required for the \( 4 \times 4 \) school-book approach. To use a
size-4 NTT trick, we calculate the size-4 NTT of each component, multiply components by components with point-multiplication, accumulate to a vector, and finally, compute the size-4 NTT inverse of each component. Each size-4 NTT requires 4 \texttt{montgomeryR}s and 4 \texttt{add-sub} pairs as shown in Algorithm 16. Since only 14 registers are available, we need to \texttt{vmov 4 \cdot (2 \cdot (l - 1) + l) \cdot l} = 12\text{l}^2 - 8\text{l} cycles for storing intermediate values for accumulation. If the NTT trick is adopted, the matrix-to-vector product would require \(20\text{l}^2 + 40\text{l} \cdot \text{cycles for } \text{l}^2 + 1 \text{ NTTs and } \text{l} \text{ NTT inverses, } 12\text{l}^2 - 8\text{l} \text{ \texttt{vmovs}, and } 4\text{l}^2 - 4\text{l} \text{ \texttt{adds}, resulting in } 39\text{l}^2 + 28\text{l} \text{ cycles. We have } 41\text{l}^2 - 8\text{l} < 39\text{l}^2 + 28\text{l} \text{ for } \text{l} < 18.

\textbf{Better Accumulation For Schoolbook Multiplication.} There is an even better approach to matrix-to-vector products utilizing the commutativity of instructions. All \texttt{adds} and some \texttt{montgomeryR} can be removed at the cost of some additional \texttt{vmovs}. The following observation results in an efficient implementation that is difficult to match independent of \texttt{m}. Consider one inner product \(h = \sum_{i=0}^3 p_i \cdot q_i\) where \('\cdot'\) is multiplication \((\text{mod } z^4 - \zeta)\). Now \([z^0]h = \sum_i \left(p_i q_{i,0} + \zeta (p_{i,3} q_{i,3} + p_{i,2} q_{i,2} + p_{i,1} q_{i,1})\right)\) is its constant term\(^1\). So we can compute the 64-bit value of \([z^0]h\) and then reduce it to 32-bit with \texttt{montgomeryR}, wherein all the \texttt{adds} can be absorbed (changing some \texttt{smul} into \texttt{smlal}). Algorithm 18 is an illustration of the idea. To summarize, we save \(4\text{l}^2 - 4\text{l} \text{ \texttt{adds}}\) and \(4\text{l}^2 - 4\text{l} \text{ \texttt{montgomeryR}s}\) (each of which takes 2 cycles) at the cost of \(8\text{l}^2 - 8\text{l} \text{ \texttt{vmovs}}\) and therefore the cycle count becomes \(37\text{l}^2 - 4\text{l}\), smaller than \(39\text{l}^2 + 28\text{l}\) for all \(\text{l}\).

\textbf{Our Optimized Negacyclic NTT Trick.} Since incomplete size-256 negacyclic NTTs are computed, we choose prime \(q' = 25166081 = 196610 \cdot 128 + 1\) for \texttt{Saber} and \texttt{Firesaber}, and prime \(q' = 20972417 = 163847 \cdot 128 + 1\) for \texttt{Lightsaber}. The NTTs are computed with six layers of radix-2 NTTs (CT butterflies), where the first three layers are merged and the following three layers are merged, compute schoolbook-and-accumulate with above strategy, and then compute incomplete size-256 NTT negacyclic inverses using GS butterflies with the same 3-layer-merge.

\subsection{3.2 NTRU}

In this section, we go into implementation details for polynomial multiplication in NTRU on Cortex-M4. We are targeting the first two \texttt{poly}_Rq\_mul\_s and the first \texttt{poly}_Rq\_mul in key generation, the \texttt{poly}_Rq\_mul in encryption, and the first \texttt{poly}_Rq\_mul in decryption. While implementing polynomial multiplication for each parameter set, we optimized the code in various aspects. Some ideas work for all parameter sets, and some are only suitable for a particular one. The core ideas are simple: manipulate registers wisely, compute small convolutions with schoolbook, and change the domain only when needed. We summarize the tricks used for each parameter set in Table 3.

\textbf{Layers of NTT.} As usual, several layers of NTTs are computed at a time to avoid load-stores and use the registers economically. On Cortex-M4, since only 14 general purpose registers are available, we compute three layers of radix-2 NTTs (and two layers of radix-3 NTTs) at a time. For \texttt{ntruhps2048509}, we employ a seemingly strange alternative, computing four layers of radix-2 NTTs at a time, to set up better a foundation for polynomial multiplication. This results in a slightly faster implementation for \texttt{ntruhps2048509} compared to the Toom-4 approach.

\textbf{Tricks for commutative operations.} Recall that for computing an NTT, we must cancel the scaling factor \texttt{NTT}. We can halve the number of Montgomery-multiplications by

\footnote{Combinatorially it is customary to write \([x^i]f\) for the coefficient of \(x^i\) in \(f\).}
### Table 3: Overview of NTTs for NTRU on Cortex-M4

(a) NTT tricks for NTRU parameter sets.

<table>
<thead>
<tr>
<th>Parameter sets</th>
<th>NTT\textsubscript{q}</th>
<th>q'</th>
<th>Strategy</th>
</tr>
</thead>
<tbody>
<tr>
<td>ntruhps4096821</td>
<td>1728 = 9 \cdot 64 \cdot 3</td>
<td>3365569</td>
<td>Mixed-radix(CT+GS)</td>
</tr>
<tr>
<td>ntruhrss701</td>
<td>1536 = 512 \cdot 3</td>
<td>5747201</td>
<td>Good's(CT+CT)</td>
</tr>
<tr>
<td>ntruhps2048677</td>
<td>1536 = 512 \cdot 3</td>
<td>1389569</td>
<td>Good's(CT+CT)</td>
</tr>
<tr>
<td>ntruhps2048509</td>
<td>1024 = 256 \cdot 4</td>
<td>1043969</td>
<td>Radix-2(CT+GS)</td>
</tr>
</tbody>
</table>

(b) Layers of NTTs for each set of parameter.

<table>
<thead>
<tr>
<th>Parameter sets</th>
<th>NTT</th>
<th>baseMul</th>
<th>NTT inverse</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ntruhps4096821</td>
<td>2-layer-radix-3</td>
<td>3 \times 3</td>
<td>2 \times 3-layer-radix-2</td>
</tr>
<tr>
<td></td>
<td>+2 \times 3-layer-radix-2</td>
<td>3 \times 3</td>
<td>+2-layer-radix-3</td>
</tr>
<tr>
<td>ntruhrss701</td>
<td>3 \times 3-layer-radix-2</td>
<td>3 \times 3</td>
<td>3 \times 3-layer-radix-2</td>
</tr>
<tr>
<td>ntruhps2048677</td>
<td>2 \times 4-layer-radix-2</td>
<td>4 \times 4</td>
<td>2 \times 3-layer-radix-2</td>
</tr>
<tr>
<td>ntruhps2048509</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

\(\text{NTT}_{q}^{-1} \cdot R^2 \mod \pm q\) by first reducing modulo the polynomial modulus and then performing the multiplication. The same idea also applies to the operations of reducing the coefficient from \(\mathbb{Z}_q\) to \(\mathbb{Z}_q\) and packing two coefficients into one register. Because they commute, we pack two coefficients and then and with \((q - 1) | (q - 1)\). Algorithm 20 shows how the ideas are implemented at the final stage.

**ntruhps4096821.** Algorithm 12 depicts the NTT for ntruhps4096821. We compute incomplete mixed-radix size-1728 NTTs for each polynomial by splitting down to \(x^2_{i,j} - \zeta_{i,j}\), multiply degree-2 polynomials with schoolbook, derive incomplete mixed-radix NTT inverses, and then reduce the coefficient ring to \(\mathbb{Z}_q\). For incomplete-size-1728 NTTs, we first compute size-9 NTTs with two radix-3 NTTs for each 9-set distancing apart by 192 units. Next, for each consecutive 192 coefficients, we compute size-64 NTTs with six layers of radix-2 NTTs for each 64-set distancing apart by 3 units, leaving degree-2 polynomials. Among 9 sets of 192-coefficient, standard size-64 NTTs are computed for the first 192-coefficient and twisted size-64 NTTs are computed for the rest. The incomplete size-1728 NTT inverse is computed in the reversed manner. For the final stage, we employ all the ideas mentioned in the previous paragraph — taking quotient before Montgomery-multiplying \(R^2 \cdot \text{NTT}_{q}^{-1}\) mod \(q'\) and pack two coefficients before the and. For merging layers, the two layers of radix-3 NTTs are merged, the first three layers of radix-2 NTTs are merged, the following three layers of radix-2 NTTs are merged, and the NTT inverses are merged in the same manner.

**ntruhrss701 and ntruhps2048677.** Algorithm 13 shows the NTT used for ntruhrss701 and ntruhps2048677. We use Good’s trick for both. Our approach is almost the same as [ACC+20], with a slightly faster final stage. This is because \((\mod 2^k)\) and \((\mod (x^n - 1))\) are cheaper. We employ Good’s permutation of size \(3 \times 2^6\) for the size-1536 NTT. The algorithm goes in the following order: compute three size-512 NTTs (CT butterflies), each for 512 contiguous entries, compute \(3 \times 3\) convolutions, where coefficients are distancing apart by 512 units, invert size-512 NTTs (CT butterflies), and a final stage. This last stage consists of: inverting Good’s permutation, taking the remainder \(\mod (x^n - 1)\), Montgomery-multiplication by \(R^2 \cdot \text{NTT}_{q}^{-1}\) mod \(q'\), packing two coefficients into one register, and reducing to coefficient ring \(\mathbb{Z}_q\). We implement the iNTT using CT butterflies because we need fewer reductions to avoid overflows. As mentioned above, we do \(\mod (x^n - 1)\) first so we save half the Montgomery-multiplications by \(R^2 \cdot \text{NTT}_{q}^{-1}\) mod \(q'\).
Algorithm 12 Incomplete mixed-radix size-1728 NTT for ntruhps4096821

Representing
\[
\begin{align*}
&\text{src1}[i] \text{ with } ntt1[i/192] [(i \mod 192)/3] [(i \mod 3)] \\
&\text{src2}[i] \text{ with } ntt2[i/192] [(i \mod 192)/3] [(i \mod 3)] \\
\end{align*}
\]

1. For each \(j, k\), compute
\[
\begin{align*}
&\text{NTT}_9(\text{ntt1}[0-8] [j] [k]) \\
&\text{NTT}_9(\text{ntt2}[0-8] [j] [k])
\end{align*}
\]
2. For each \(i, k\), compute
\[
\begin{align*}
&\text{NTT}_{64:512}(\text{ntt1}[i] [0-63] [k]) \\
&\text{NTT}_{64:512}(\text{ntt2}[i] [0-63] [k])
\end{align*}
\]
3. For each \(i, j\), compute \(\text{nttout}[i] [j] = \text{ntt1}[i] [j] [0-2] \ast \text{ntt2}[i] [j] [0-2] \mod (x^3 - \zeta_{i,j})\)
4. For each \(i, k\), compute \(\text{NTT}^{-1}_{512:0} (\text{nttout}[0-8] [j] [k])\).
5. For each \(j, k\), compute \(\text{NTT}^{-1}_9 (\text{nttout}[0-8] [j] [k])\).
6. Compute \(\text{des}[0-1727] = \text{final\_stage}(\text{nttout}[0-8] [0-63] [0-2])\).

Algorithm 13 Good’s trick of size-1536 NTT for ntruhrss701 and ntruhps2048677

1. Compute
\[
\begin{align*}
&\text{ntt1}[0-2] [0-511] = (\text{NTT}^{0:3}_{512:0}\ast\text{Good}_{3\times512})(\text{src1}[0-1535]) \\
&\text{ntt2}[0-2] [0-511] = (\text{NTT}^{0:3}_{512:0}\ast\text{Good}_{3\times512})(\text{src2}[0-1535])
\end{align*}
\]
2. For each \(i\), compute
\[
\begin{align*}
&\text{NTT}_{512:3:8}(\text{ntt1}[i] [0-511]) \\
&\text{NTT}_{512:3:8}(\text{ntt2}[i] [0-511])
\end{align*}
\]
3. For each \(j\), compute \(\text{nttout}[0-2] [j] = \text{ntt1}[0-2] [j] \ast \text{ntt2}[0-2] [j] \mod (\omega^3 - 1)\).
4. For each \(i\), compute \(\text{NTT}^{-1}_{512} (\text{nttout}[i] [0-511])\).
5. Compute \(\text{des}[0-1535] = \text{final\_stage}(\text{nttout}[0-2] [0-511])\).

ntruhps2048509. We merge our NTT layers differently for ntruhps2048509 to provide a better framework for polynomial multiplication. See Algorithm 14 for the details. We do 2 sets of four-layer NTTs (CT butterflies) for incomplete size-1024 NTTs, perform each 4-coefficient (modulo a degree-3 polynomial) multiplication with schoolbook, do 2 sets of 3-layer NTT inverses (GS butterflies), and a final stage. Here GS butterflies make for an easier final stage comprising the following operations: 2 layers of NTT inverses, taking \(\mod(x^3 - 1)\), Montgomery-multiplication by \((R)^2\text{NTT}^{-1}_9 \mod q^r\), packing two coefficients into one register, and reducing to coefficient ring \(\mathbb{Z}_q\). This approach saves 1 layer of load-stores.

3.3 LAC on Cortex-M4

For LAC-128, LAC-192, and LAC-256, we focus on big-by-small polynomial multiplications where the ‘small’ polynomials have coefficients in \(\{0, \pm 1\}\).

Algorithm 14 Incomplete size-1024 NTT for ntruhps2048509

Representing
\[
\begin{align*}
&\text{src1}[i] \text{ with } ntt1[i/4] [i \mod 4] \\
&\text{src2}[i] \text{ with } ntt2[i/4] [i \mod 4]
\end{align*}
\]

1. For each \(j\), compute
\[
\begin{align*}
&\text{NTT}_{256}(\text{ntt1}[0-255] [j]) \\
&\text{NTT}_{256}(\text{ntt2}[0-255] [j])
\end{align*}
\]
2. For each \(i\), compute \(\text{nttout}[i] [0-3] = \text{ntt1}[i] [0-3] \ast \text{ntt2}[i] [0-3] \mod (x^4 - \zeta_i)\).
3. For each \(j\), compute \(\text{NTT}^{-1}_{256} (\text{nttout}[0-256] [j])\).
4. Compute \(\text{des}[0-1023] = \text{final\_stage}(\text{nttout}[0-256] [0-3])\).
Table 4: Overview of NTTs for LAC on Cortex-M4

(a) NTT tricks for LAC parameter sets.

<table>
<thead>
<tr>
<th>Parameter sets</th>
<th>$N_{TT}$</th>
<th>$q'$</th>
<th>Strategy</th>
</tr>
</thead>
<tbody>
<tr>
<td>LAC-128</td>
<td>512</td>
<td>133121</td>
<td>Complete NTT(CT+GS)</td>
</tr>
<tr>
<td>LAC-192</td>
<td>1024</td>
<td>270337</td>
<td>Incomplete NTT(CT+GS)</td>
</tr>
</tbody>
</table>

(b) Layers of NTTs for each set of parameter.

<table>
<thead>
<tr>
<th></th>
<th>NTT</th>
<th>baseMul</th>
<th>NTT inverse</th>
</tr>
</thead>
<tbody>
<tr>
<td>LAC-128</td>
<td>3 × 3-layer-radix-2</td>
<td>1 × 1</td>
<td>3 × 3-layer-radix-2</td>
</tr>
<tr>
<td>LAC-192</td>
<td>3 × 3-layer-radix-2</td>
<td>2 × 2</td>
<td>3 × 3-layer-radix-2</td>
</tr>
<tr>
<td>LAC-256</td>
<td>3 × 3-layer-radix-2</td>
<td>1 × 1</td>
<td>3 × 3-layer-radix-2</td>
</tr>
</tbody>
</table>

**NTT trick for LAC**  We employ the negacyclic NTT trick on the rings $\mathbb{Z}_q[x]/(x^{512} + 1)$, $\mathbb{Z}_q[x]/(x^{1024} + 1)$, $\mathbb{Z}_q[x]/(x^{1024} + 1)$ for LAC-128, LAC-192, and LAC-256, respectively. Our approach for LAC-128 proceeds as follows: compute the negacyclic size-512 NTTs of polynomials, do point-by-point multiplications, and finally, compute the size-512 NTT inverse. Our approach for LAC-192 and LAC-256 proceeds as the follows: derive incomplete negacyclic size-1024 NTT by three sets of 3-layer-radix-2 NTTs, compute $2 \times 2$ schoolbooks, and invert the NTT.

**On the “optimized implementation” in the LAC submission**  The original LAC “optimized” code stores small polynomials as arrays of the indices of the non-zero terms (and do secret-dependent table lookups), and they use the $C \%$ operator. These operations are not constant time, posing a security risk. We use a standard form for the array to do NTTs, and replace all $C \%$ operator using Montgomery reductions to obtain a constant-time implementation.

4 Vectorized NTT on AVX2

For fast NTT-based polynomial multiplication on current x86 processors from Intel and AMD, it is necessary to use a vectorized implementation of the NTT. These processors support the AVX2 instruction set, offering a large number of instructions that operate on 16 vector registers, each of length 256 bit. Kyber, NTTRU, and Dilithium.

4.1 Fast mulmods

A first obstacle towards fast vectorization of the NTT is the problem of efficiently multiplying many coefficients modulo a small prime $q$. The standard way to compute modular products is to first compute the double-length products over $\mathbb{Z}$, and then reduce these intermediate results modulo $q$. In a vectorized implementation, in order to achieve the highest possible throughput, one wants to pack as many coefficients as possible in a vector register. But double-length intermediate products mean it is only possible to achieve half the density compared to packing only mod-$q$ reduced integers. This effectively reduces the speed of the implementation by a factor of two. Note that this is not a problem when computing products modulo a two-power as in other polynomial multiplication implementations for Saber or NTRU that directly operate over the respective polynomial rings. There the binary arithmetic in modern CPUs automatically takes care of the modular reduction. To overcome this obstacle we use the modified Montgomery reduction algorithm from [Sci18] together with the improvement from [LS19]. Here the modular multiplications
We considered several different choices of transforms. For Saber with its NTT-friendly polynomial modulus $X^{256} + 1$, we compute the negacyclic length-256 transforms modulo $X^{256} + 1$ as we do on the Cortex-M4. For performing only a single polynomial multiplication it is usually advantageous to use an incomplete NTT but for Saber where in the matrix-vector product the vector of polynomials only needs to be transformed once and the inner products can be computed in the NTT basis, a complete NTT is preferable. In the case of ntruhps2048677 and ntruhrss701 we compute an incomplete NTT modulo $X^{1536} - 1$ where we do 9 radix-2 splittings down to factors of degree 3. Since the input polynomials have degree less than 768, the first splitting is for free. For ntruhps2048509 and ntruhps4096821 the same approach that we use on the Cortex-M4 should also gives good results on Skylake. In particular, a length-1728 NTT with two radix-3 splittings, followed by 6 radix-2 splittings, down to polynomials of degree less than 3. For LAC with its polynomial moduli $X^{512} + 1$ and $X^{1024} + 1$, we compute incomplete negacyclic length-512 and length-1024 NTTs, respectively, each with 8 layers, coming down to factors of degree 2 and 4.

We chose the prime moduli 7681 and 10753 for the NTTs of length 256, 512, 1024 and 1536. Their product is slightly longer than 26 bits, which is enough for all our applications. In the case of Saber, the absolute value of the polynomial coefficients when computing the matrix-vector product over $\mathbb{Z}$ is bounded by $2^{24}$, which is below $2^{25}$. In NTRU, the maximum absolute value is attained in ntruhrss701, where the coefficients are bounded by $2^{24.04}$ in all products of a uniform polynomial with a short polynomial. Next, as 7680

\[ \textbf{Algorithm 15} \text{ Multiplication modulo 16-bit } q \]

\textbf{Require:} $-2^{15} \leq a < 2^{15}$, $2^{-1} \leq b \leq 2^{-1}$, $b' = bq^{-1} \mod 2^{16}$

\textbf{Ensure:} $r \equiv 2^{16}ab \pmod{q}$

1: $t_1 \leftarrow \left\lfloor \frac{ab}{2^{16}} \right\rfloor$ \hspace{1cm} \triangleright \text{signed high product}$
2: $t_0 \leftarrow ab' \mod 2^{16}$ \hspace{1cm} \triangleright \text{signed low product}$
3: $t_0 \leftarrow \left\lfloor \frac{t_0}{2^{16}} \right\rfloor$ \hspace{1cm} \triangleright \text{signed high product}$
4: $r \leftarrow (t_1 - t_0) \mod 2^{16}$

are computed from separate intermediate low and high half-products. When using the AVX2 instruction set, this approach is most efficient for 16-bit primes $q$. The reason is that there is a specific high-only half-product instruction \texttt{vpmulhw} for packed 16-bit integers that does not have an equivalent instruction for packed 32-bit integers. Therefore, unlike on the Cortex-M4, we use NTTs modulo 16-bit primes $q$ on AVX2. Then we need to use a multi- modular approach and compute the polynomial products modulo two such primes so that we are able to correctly lift the results to $\mathbb{Z}$ with the help of the Chinese remainder theorem. The additional polynomial product modulo a second prime involving three NTT computations and a base product computation does not result in reduced speed, because this loss of a factor of two is completely compensated for by twice the throughput from packing 16-bit integers instead of 32-bit integers. Another benefit of 16-bit primes is that it is possible to compute the occasional product modulo 3 in NTRU more efficiently, but we haven’t used this improvement in our experiments.

We state the modular multiplication algorithm in Algorithm 15. As inputs it gets a 16-bit integer $a$, and an mod-$q$ reduced integer $b$ together with the precomputed $b' = bq^{-1} \mod 2^{16}$. The algorithm then outputs a representative modulo $q$ for the scaled product $ab2^{16} \mod q$. The second multiplicand $b$ is always a fixed constant in the NTT and hence $b$ and the corresponding element $b'$ can easily be precomputed. The scaling factor $2^{16}$ is handled as usual by precomputing $b$ and $b'$ with an additional factor of $2^{-16}$.

4.2 Choice of Transforms

We considered several different choices of transforms. For Saber with its NTT-friendly polynomial modulus $X^{256} + 1$, we compute the negacyclic length-256 transforms modulo $X^{256} + 1$ as we do on the Cortex-M4. For performing only a single polynomial multiplication it is usually advantageous to use an incomplete NTT but for Saber where in the matrix-vector product the vector of polynomials only needs to be transformed once and the inner products can be computed in the NTT basis, a complete NTT is preferable. In the case of ntruhps2048677 and ntruhrss701 we compute an incomplete NTT modulo $X^{1536} - 1$ where we do 9 radix-2 splittings down to factors of degree 3. Since the input polynomials have degree less than 768, the first splitting is for free. For ntruhps2048509 and ntruhps4096821 the same approach that we use on the Cortex-M4 should also gives good results on Skylake. In particular, a length-1728 NTT with two radix-3 splittings, followed by 6 radix-2 splittings, down to polynomials of degree less than 3. For LAC with its polynomial moduli $X^{512} + 1$ and $X^{1024} + 1$, we compute incomplete negacyclic length-512 and length-1024 NTTs, respectively, each with 8 layers, coming down to factors of degree 2 and 4.

We chose the prime moduli 7681 and 10753 for the NTTs of length 256, 512, 1024 and 1536. Their product is slightly longer than 26 bits, which is enough for all our applications. In the case of Saber, the absolute value of the polynomial coefficients when computing the matrix-vector product over $\mathbb{Z}$ is bounded by $2^{24}$, which is below $2^{25}$. In NTRU, the maximum absolute value is attained in ntruhrss701, where the coefficients are bounded by $2^{24.04}$ in all products of a uniform polynomial with a short polynomial. Next, as 7680
and 10752 are divisible by 1536 = 3 · 2^9, both of these moduli support complete transforms modulo \( X^{1536} - 1 \), which is all that we need for Saber and the NTRU parameter sets except \texttt{ntruhsps4096821}. For LAC, the coefficients are even smaller so this is no problem.

For implementing the length-1728 NTT that we need in the remaining NTRU parameter set \texttt{ntruhsps4096821}, the two 16-bit primes 3457 and 8641 are used. Their product is sufficiently large, they support complete length-17218 NTTs and they are even slightly smaller than the primes described above, which is good for modular reductions.

So, algebraically, for Saber we compute the map

\[
\mathbb{Z}_q[X]/(X^{256} + 1) \rightarrow \mathbb{Z}_q[X]/(X - \zeta_0) \times \cdots \times \mathbb{Z}_q[X]/(X - \zeta_{255})
\]

where \( \zeta_i \) denote the primitive 512-th roots of unity in \( \mathbb{Z}_q \). For \texttt{ntruhss701} and \texttt{ntruhsps2048677} we compute

\[
\mathbb{Z}_q[X]/(X^{1536} - 1) \rightarrow \mathbb{Z}_q[X]/(X^3 - \zeta_0) \times \cdots \times \mathbb{Z}_q[X]/(X^3 - \zeta_{511})
\]

where \( \zeta_i \) denote all the 512-th roots of unity.

For \texttt{ntruhsps2048609} we compute

\[
\mathbb{Z}_q[X]/(X^{1024} - 1) \rightarrow \mathbb{Z}_q[X]/(X^2 - \zeta_0) \times \cdots \times \mathbb{Z}_q[X]/(X^2 - \zeta_{511}),
\]

with \( \zeta_i \) again ranging over all 512-th roots of unity.

Then, for \texttt{ntruhsps4096821} we compute

\[
\mathbb{Z}_q[X]/(X^{1728} - 1) \rightarrow \mathbb{Z}_q[X]/(X^3 - \zeta_0) \times \cdots \times \mathbb{Z}_q[X]/(X^3 - \zeta_{575}),
\]

where \( \zeta_i \) denote all the 576-th roots of unity. Finally, for LAC, we do

\[
\text{For } \texttt{ntruhss701} \text{ and } \texttt{ntruhsps2048677} \text{ we compute}
\]

\[
\mathbb{Z}_q[X]/(X^{512} + 1) \rightarrow \mathbb{Z}_q[X]/(X^2 - \zeta_0) \times \cdots \times \mathbb{Z}_q[X]/(X^2 - \zeta_{255}), \text{ and}
\]

\[
\mathbb{Z}_q[X]/(X^{1024} + 1) \rightarrow \mathbb{Z}_q[X]/(X^4 - \zeta_0) \times \cdots \times \mathbb{Z}_q[X]/(X^4 - \zeta_{255}),
\]

where \( \zeta_i \) denote all the primitive 512-th.

### 4.3 Register allocation

Intel’s Skylake and later microarchitectures have a throughput of 2 vector multiplications per clock cycle with a latency of 5 cycles [Fog20]. The addition and subtraction instructions have a throughput of 3 instructions per cycle since they can go to a third execution port that is not able to execute multiplications. Their latency is 1 cycle. Hence, the subtraction instruction in Algorithm 15 ideally does not compete with the multiplication instructions for execution resources, and the maximum theoretical throughput is 2/3 vector mulmod operations per cycle, or 32/3 scalar modular multiplications per cycle. On the other hand, the critical path of a vector mulmod consists of two multiplication instructions and a subtraction and thus has a latency of 11 cycles. In order for the code to not be completely latency bound and get near the maximum throughput, it is important that there are always many independent mulmods that can be computed in parallel. In principle, the out of order execution capability allows the CPU to find independent mulmods. But in practice the code will not come from the small uop cache and the instruction fetch from the L1 instruction cache is limited to 16 bytes per cycle, which translates to only less than about three vector instructions per cycle on average. So the code is likely to bottleneck on the front-end of the pipeline and the instruction decoding will not be able to run sufficiently far ahead for the CPU to be able to find independent instructions if they are far apart in the code. Hence it is important to schedule the instructions so that as many mulmods as possible are as close as possible. We achieve this by filling as many vector registers as possible with polynomial coefficients to operate on under the constraint that we also
need auxiliary registers for constants and scratch registers for intermediate results. Then we can compute several NTT layers with loading coefficients only once, and, after only a few layers, arrive at polynomials that we completely load into the registers. We also experimented with more refined approaches to scheduling where we implemented several parallel mulmods in an interleaved fashion so that we could schedule the addition and subtraction instructions in a way that they do not steal executions resources from the multiplication instructions. The downside of this approach is that by interleaving mulmod operations one needs more scratch registers so that one can either only operate on fewer polynomial coefficients at a time or needs to temporarily store away some of the coefficients. In the end we found that not doing this and letting the register renaming capability of the CPU take care of allocating scratch registers from the register file leads to superior results. In the two-power NTTs we always have 8 vector registers with a total of 128 polynomial coefficients loaded whereas in the NTT for NTRU whose length 1536 is divisible by 3 we have always 12 registers with 192 coefficients loaded.

4.4 Range Analysis

For the two primes \( q = 7681 \) and \( q = 10753 \) that we use, it is not possible to compute all the layers of the NTT using straight-forward radix-2 steps without performing additional modular reductions. We assume that the input polynomials we want to transform have coefficients less than \( \frac{q}{4} \) in absolute value. This is true for all our applications without first reducing the polynomials modulo \( q \). Now, by [Sei18, Lemma 2], the output coefficients of Algorithm 15 lie in the interval \([-q, q]\). So, using this approximation, we find for the forward negacyclic NTT with Cooley-Tukey butterflies that the coefficients grow by at most \( q \) in absolute value in each layer of the NTT. It then follows that we can only perform 2 layers without additional reductions. Instead, we use a more refined range analysis where for each layer and a given input range we compute the maximum range of the modular products. This then determines the range of the output coefficients, which form the inputs for the next layer. With this analysis we find that we can compute three layers of radix-2 splittings without additional reductions, both in the cyclic and in the negacyclic NTT. After these three layers we twist all the factors into rings of the form \( \mathbb{Z}_q[X]/(X^n-1) \). The advantage of twisting the factors instead of merely reducing coefficients is that this results in fewer modular multiplications in subsequent layers. Moreover, the mulmods as in Algorithm 15 are even slightly more efficient than for example Barrett reductions as they have the same throughput but shorter dependency chains. Concretely, splitting rings of the form \( \mathbb{Z}_q[X]/(X^n-1) \) does not need any mulmod. But for later factors of this form we do in fact sometimes multiply coefficients by 1 in order to reduce them. We then recursively compute the following maps with \( 16n \) mulmods, where \( \zeta \in \mathbb{Z}_q \) is a primitive 8-th root of unity,

\[
\begin{align*}
Z_q[X]/(X^{8n}-1) &
\rightarrow Z_q[X]/(X^{4n}-1) \times Z_q[X]/(X^{4n}+1) \\
&
\rightarrow Z_q[X]/(X^{2n}-1) \times Z_q[X]/(X^{2n}+1) \times Z_q[X]/(X^{2n}-\zeta^2) \times Z_q[X]/(X^n+\zeta^2) \\
&
\rightarrow Z_q[X]/(X^n-1) \times Z_q[X]/(X^n+1) \times Z_q[X]/(X^n-\zeta^2) \times Z_q[X]/(X^n+\zeta^2) \\
&
\times Z_q[X]/(X^n-\zeta) \times Z_q[X]/(X^n+\zeta) \times Z_q[X]/(X^n-\zeta^2) \times Z_q[X]/(X^n+\zeta^2) \\
&
\rightarrow Z_q[X]/(X^n-1) \times \cdots \times Z_q[X]/(X^n-1)
\end{align*}
\]

5 Results

In this section, we describe the benchmarking results for our Saber, NTRU, and LAC implementations. First, we describe our benchmarking setup for the Cortex-M4 and
Table 5: Saber Performance results in clock cycles for core arithmetic operations on Cortex-M4 and Skylake. The Inner-product computation in our AVX2 implementation for SABER does not contain the cost of computing the NTT of one of the input vectors. In encryption the NTT of the secret vector is already computed for the matrix vector product. For decryption the secret vector can be stored in NTT form in the secret key, which does not need to be compatible with other implementations.

<table>
<thead>
<tr>
<th>MatrixVectorMul</th>
<th>Cortex-M4</th>
<th>Skylake (AVX2)</th>
<th>speed-up</th>
<th>Cortex-M4</th>
<th>Skylake (AVX2)</th>
<th>speed-up</th>
</tr>
</thead>
<tbody>
<tr>
<td>(l = 2)</td>
<td>159k</td>
<td>66k</td>
<td>58%</td>
<td>7002</td>
<td>5215</td>
<td>25%</td>
</tr>
<tr>
<td>(l = 3)</td>
<td>317k</td>
<td>125k</td>
<td>61%</td>
<td>14145</td>
<td>9579</td>
<td>32%</td>
</tr>
<tr>
<td>(l = 4)</td>
<td>528k</td>
<td>205k</td>
<td>61%</td>
<td>24342</td>
<td>14959</td>
<td>39%</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>InnerProduct(^a)</th>
<th>Cortex-M4</th>
<th>Skylake (AVX2)</th>
<th>speed-up</th>
<th>Cortex-M4</th>
<th>Skylake (AVX2)</th>
<th>speed-up</th>
</tr>
</thead>
<tbody>
<tr>
<td>(l = 2)</td>
<td>73k</td>
<td>41k</td>
<td>44%</td>
<td>4016</td>
<td>2125</td>
<td>47%</td>
</tr>
<tr>
<td>(l = 3)</td>
<td>99k</td>
<td>57k</td>
<td>42%</td>
<td>5977</td>
<td>2706</td>
<td>55%</td>
</tr>
<tr>
<td>(l = 4)</td>
<td>126k</td>
<td>73k</td>
<td>42%</td>
<td>8040</td>
<td>3278</td>
<td>60%</td>
</tr>
</tbody>
</table>

\(^a\)[BMKV20] report cycles on a different platform with a slightly newer Kabylake processor. We have re-benchmarked their code on our Skylake platform.

Skylake and then we report our results for Saber, NTRU, and LAC in Sections 5.1, 5.2, and 5.3.

**Benchmarking setup for the Cortex-M4.** Our benchmarking setup is based on the pqm4 [KRSS] benchmarking framework and as such produces comparable cycle counts to previous work [BMKV20, KRS19]. We target the STM32F407-DISCOVERY board which has a STM32F407VG core. We clock it at 24 MHz with no flash wait states to obtain similar cycle counts as the ones reported in pqm4. For obtaining randomness, we use the hardware random number generator. As both NTRU and Saber make use of SHA-3 and SHAKE, we make use of the optimized assembly implementations of Keccak from the XKCP\(^2\) which is also contained in pqm4. LAC relies on AES and SHA-2 which we source from [SS17] and SUPERCOP\(^3\) respectively. All cycle counts in the following were obtained for implementations compiled with gcc and -O3 (arm-none-eabi-gcc, Version 10.2.0). Each cycle count reported is the average of 100 executions.

**Benchmarking setup for Skylake.** The cycle counts for AVX2 were obtained on a Intel Core i7-6600U (Skylake) processor with a base frequency of 2.6 GHz. As usual we disable TurboBoost and hyperthreading. We compile our implementations with gcc version 7.5.0 and use the compiler flags -O3, -fomit-frame-pointer, -march=native, -mtune=native. All cycle counts are the median cycle counts of 10000 executions.

## 5.1 Saber results

Table 5 contains the performance results for the polynomial arithmetic speed-ups in Saber. We report the results for matrix-vector multiplication \(A \cdot s\) as used in key generation and encryption and vector-vector inner multiplication \(b^T \cdot s\) as used in encryption and decryption separately. The dimension of the matrix is \(l \times l\) and the dimension of the

\(^2\)https://github.com/XKCP/XKCP

\(^3\)https://bench.cr.yp.to/supercop.html
Table 6: Performance results in clock cycles for Lightsaber, Saber, and Firesaber

<table>
<thead>
<tr>
<th></th>
<th>Cortex-M4</th>
<th></th>
<th>Skylake (AVX2)</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>[BMKV20]*</td>
<td>Our Work</td>
<td>[BMKV20]*</td>
<td>Our Work</td>
</tr>
<tr>
<td><strong>Lightsaber</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>CPA</td>
<td>K</td>
<td>383k</td>
<td>294k (−23%)</td>
<td>49132</td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>448k</td>
<td>330k (−26%)</td>
<td>46311</td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>93k</td>
<td>58k (−38%)</td>
<td>7842</td>
</tr>
<tr>
<td>CCA</td>
<td>K</td>
<td>466k</td>
<td>360k (−23%)</td>
<td>61325</td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>653k</td>
<td>513k (−21%)</td>
<td>75876</td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>678k</td>
<td>498k (−27%)</td>
<td>70228</td>
</tr>
<tr>
<td><strong>Saber</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>CPA</td>
<td>K</td>
<td>738k</td>
<td>554k (−25%)</td>
<td>86502</td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>830k</td>
<td>606k (−27%)</td>
<td>84852</td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>128k</td>
<td>79k (−38%)</td>
<td>10909</td>
</tr>
<tr>
<td>CCA</td>
<td>K</td>
<td>853k</td>
<td>658k (−23%)</td>
<td>104832</td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>1103k</td>
<td>864k (−22%)</td>
<td>125835</td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>1127k</td>
<td>835k (−26%)</td>
<td>118553</td>
</tr>
<tr>
<td><strong>Firesaber</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>CPA</td>
<td>K</td>
<td>1191k</td>
<td>879k (−26%)</td>
<td>135986</td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>1312k</td>
<td>947k (−28%)</td>
<td>136075</td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>162k</td>
<td>101k (−38%)</td>
<td>14474</td>
</tr>
<tr>
<td>CCA</td>
<td>K</td>
<td>1340k</td>
<td>1008k (−25%)</td>
<td>157915</td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>1642k</td>
<td>1255k (−24%)</td>
<td>184322</td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>1679k</td>
<td>1227k (−27%)</td>
<td>177864</td>
</tr>
</tbody>
</table>

* [BMKV20] only reports cycle counts for the CCA-secure Saber. The CPA-secure cycle counts are our own benchmarks.

vectors is $l \times 1$. The dimension $l = 2, 3, 4$ correspond to the parameter sets Lightsaber, Saber, and Firesaber.

On Cortex-M4, we obtain speedups between 58% and 61% for $A \cdot s$ and between 42% and 44% for $b^T \cdot s$. The speed-ups on Skylake range from 25% to 39% for $A \cdot s$ and from 47% to 60% for $b^T \cdot s$.

Table 6 illustrates the resulting performance of Lightsaber, Saber, and Firesaber on the Cortex-M4 when our fast MatrixVectorMul and InnerProduct are plugged into them. In addition to the full CCA-secure KEM schemes, we also report cycle counts for the underlying CPA-Secure PKE. While those are not explicitly exposed in the Saber specification, all our optimizations were inside of the CPA primitives and, hence, the overhead of the CCA transformation did not change. Moreover, some schemes use considerably more expensive CCA transforms than others. For example, Saber and Kyber include very costly public key and ciphertext hashes in their CCA transforms that could be omitted in a different choice of transform.

On Cortex-M4, we achieve significant speed-ups of consistently more than 20%. For CPA-secure decryption, we get the most notable speed-up of 38%.

### 5.2 NTRU results

Table 7 shows the results for polynomial multiplication for NTRU for the four different polynomial degrees used in ntruhps2048509, ntruhps2048677, ntruhps701, and ntruhps4096821. On the Cortex-M4, for the smallest polynomial size $n = 509$, our implementation using NTTs is performing only slightly better than the Toom4 implementation [KRS19]. For the larger sizes, the speed-up on the Cortex-M4 is more pronounced with 10% or more. On AVX2, $n = 509$ is the only polynomial size for which we were not able to obtain a speed-up using NTTs. All other parameter sets have small speed-ups of 7% to 15%. The reason why we didn’t achieve a speed-up for $n = 509$ is partly because we
Table 7: NTRU Performance results in clock cycles for polynomial multiplication on Cortex-M4 and Skylake

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>509</td>
<td>104k</td>
<td>101k</td>
<td>3%</td>
<td>6643</td>
<td>8540</td>
<td>-29%</td>
</tr>
<tr>
<td>677</td>
<td>175k</td>
<td>156k</td>
<td>11%</td>
<td>11103</td>
<td>10373</td>
<td>7%</td>
</tr>
<tr>
<td>701</td>
<td>173k</td>
<td>156k</td>
<td>10%</td>
<td>11242</td>
<td>10373</td>
<td>8%</td>
</tr>
<tr>
<td>821</td>
<td>230k</td>
<td>199k</td>
<td>13%</td>
<td>15507</td>
<td>13247</td>
<td>15%</td>
</tr>
</tbody>
</table>

*KRS19* only reports cycle counts for *n* = 701, but their code generator has been used to generate Toom-4 polynomial multiplication code to speed-up the other NTRU parameter sets. See https://github.com/mupq/pqm4/pull/86

chose a different vector layout and shuffling strategy in the length-1024 NTT compared to the other NTTs. The advantage of the different vector layout is that it is easier to precompute the constant vectors and they need less space. But they require more loads. In principal the loads don’t compete with the arithmetic because they go to separate execution ports and can be dispatched in parallel. Unfortunately, it turned out that this does incur a penalty, most likely because the code is bottlenecking on the front-end. We leave it as future work to optimize the length-1024 NTTs as much as the other NTTs.

Table 8 reports the results for the full NTRU schemes. As we only optimize polynomial multiplication in this paper and key generation is dominated by polynomial inversion, we do not see a big difference in cycle counts across all parameter sets and platforms. On the Cortex-M4, encapsulation is 1% to 6% faster while decapsulation is 2% to 4% faster. For the underlying CPA-secure PKE, we achieve higher speed-ups of 2% to 13% which comes as no surprise as we did not modify the CCA transformation.

5.3 LAC results

Table 9 summarizes the speed of the (big by small) polynomial multiplication in LAC. We can see that our code is faster than that of [LLZ+18] by a factor of 10× on the Cortex-M4 and a factor of 3× to 7× on Skylake.

Table 10 summarizes the results for the full LAC schemes. We can see that for LAC-128 we see a 3× up speedup on the Cortex-M4 while there is a more modest 20–50% speedup for AVX2. For LAC-192 and LAC-256 there is a roughly 4× speedup for the Cortex-M4 and roughly a 2× speedup for Skylake.

Acknowledgements

This work has been supported by the European Commission through the ERC Starting Grant 805031 (EPOQUE) and by the SNSF ERC starting transfer grant FELICITY. Taiwanese authors were supported by Taiwan Ministry of Science and Technology Grants 109-2923-E-001-001-MY3 and 109-2221-E-001-009-MY3, Sinica Investigator Award ASIA-109-M01, Executive Yuan Data Safety and Talent Cultivation Project (AS-KPQ-109-DSTCP). We thank Daniel J. Bernstein for the idea of trying NTT-based multiplication for schemes that are not specifically designed for NTTs and several important insights.
Table 8: Performance results in clock cycles for NTRU

<table>
<thead>
<tr>
<th></th>
<th>Cortex-M4</th>
<th></th>
<th>Skylake (AVX2)</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>[KRS19]*</td>
<td>Our Work</td>
<td>[ZCH+19]</td>
<td>Our Work</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>speed-up</td>
<td>speed-up</td>
</tr>
<tr>
<td>ntruhs2048509</td>
<td>K</td>
<td>79639k</td>
<td>155306</td>
<td>164952 ( +6% )</td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>160k</td>
<td>10183</td>
<td>12052 ( +18% )</td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>441k</td>
<td>27314</td>
<td>31340 ( +15% )</td>
</tr>
<tr>
<td></td>
<td>CPA</td>
<td>79617k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>152k (−5%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>434k (−2%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>CPA</td>
<td>79617k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>537k</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>538k (−1%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>CPA</td>
<td>79682k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>79617k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>79617k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>CPA</td>
<td>143759k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>143725k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>143725k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>CPA</td>
<td>153794k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>154377k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>154377k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>CPA</td>
<td>208892k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>208771k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>208771k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>CPA</td>
<td>208892k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>E</td>
<td>208771k (±0%)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>D</td>
<td>208771k (±0%)</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

*a[KRS19] only reports cycle counts for the CCA-secure ntruhrss701 from the first round of the NIST competition. Cycle counts in this table are our own benchmarks of the second round code contained in pqm4 [KRSS].

Table 9: LAC polynomial multiplication clock cycles on Cortex-M4 and Skylake

<table>
<thead>
<tr>
<th></th>
<th>Cortex-M4</th>
<th></th>
<th>Skylake (AVX2)</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>[LLZ+18]</td>
<td>Our Work speed-up</td>
<td>[LLZ+18]</td>
<td>Our Work speed-up</td>
</tr>
<tr>
<td>LAC-128</td>
<td>638k</td>
<td>65k 90%</td>
<td>14691</td>
<td>4552 69%</td>
</tr>
<tr>
<td>LAC-192</td>
<td>1274k</td>
<td>131k 90%</td>
<td>73955</td>
<td>10119 86%</td>
</tr>
<tr>
<td>LAC-256</td>
<td>1701k</td>
<td>132k 92%</td>
<td>73955</td>
<td>10119 86%</td>
</tr>
</tbody>
</table>
Table 10: Performance results in clock cycles for LAC

<table>
<thead>
<tr>
<th></th>
<th>Cortex-M4</th>
<th>Skylake (AVX2)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>[LLZ+18]</td>
<td>Our Work</td>
</tr>
<tr>
<td>CPA</td>
<td></td>
<td></td>
</tr>
<tr>
<td>LAC-128</td>
<td>K 850k</td>
<td>282k (−67%)</td>
</tr>
<tr>
<td></td>
<td>E 1424k</td>
<td>444k (−60%)</td>
</tr>
<tr>
<td></td>
<td>D 528k</td>
<td>113k (−70%)</td>
</tr>
<tr>
<td>CCA</td>
<td>K 850k</td>
<td>282k (−67%)</td>
</tr>
<tr>
<td></td>
<td>E 1430k</td>
<td>450k (−60%)</td>
</tr>
<tr>
<td></td>
<td>D 1 960k</td>
<td>565k (−71%)</td>
</tr>
<tr>
<td>CPA</td>
<td></td>
<td></td>
</tr>
<tr>
<td>LAC-192</td>
<td>K 1 506k</td>
<td>373k (−75%)</td>
</tr>
<tr>
<td></td>
<td>E 2 417k</td>
<td>601k (−75%)</td>
</tr>
<tr>
<td></td>
<td>D 899k</td>
<td>210k (−77%)</td>
</tr>
<tr>
<td>CCA</td>
<td>K 1 507k</td>
<td>373k (−75%)</td>
</tr>
<tr>
<td></td>
<td>E 2 427k</td>
<td>610k (−75%)</td>
</tr>
<tr>
<td></td>
<td>D 3 329k</td>
<td>824k (−75%)</td>
</tr>
<tr>
<td>CPA</td>
<td></td>
<td></td>
</tr>
<tr>
<td>LAC-256</td>
<td>K 2 019k</td>
<td>459k (−77%)</td>
</tr>
<tr>
<td></td>
<td>E 3 623k</td>
<td>739k (−80%)</td>
</tr>
<tr>
<td></td>
<td>D 1 690k</td>
<td>359k (−79%)</td>
</tr>
<tr>
<td>CCA</td>
<td>K 2 020k</td>
<td>459k (−77%)</td>
</tr>
<tr>
<td></td>
<td>E 3 633k</td>
<td>748k (−79%)</td>
</tr>
<tr>
<td></td>
<td>D 5 327k</td>
<td>1 111k (−79%)</td>
</tr>
</tbody>
</table>
References


Algorithm 16 Size-4 NTT

Input: \( \{a_0, a_1, a_2, a_3\} \), \( \{\zeta'_0, \zeta'_1, \zeta'_2\} \) where \( \zeta'_i = (2^{32} \zeta_i \mod q') \).

Output: \( \{a_0'', a_1'', a_2'', a_3''\} \) where

\[
\begin{align*}
a_0'' &= a_0' + \zeta_1 a_1' \\
a_1'' &= a_0' - \zeta_1 a_1' \\
a_2'' &= a_2' + \zeta_2 a_3' \\
a_3'' &= a_2' - \zeta_2 a_3'
\end{align*}
\]

1: \((r_4, r_5, r_6, r_7) = (a_0, a_1, a_2, a_3)\)
2: \text{montgomeryM} r_6, r_6, \zeta'_0 \quad \triangleright r_6 = \zeta_0 a_2
3: \text{montgomeryM} r_7, r_7, \zeta'_0 \quad \triangleright r_7 = \zeta_0 a_3
4: \text{addSub2} r_4, r_6, r_5, r_7
5: \text{montgomeryM} r_5, r_5, \zeta'_1 \quad \triangleright r_4 : a_0', r_5 : a_1', r_6 : a_2', r_7 : a_3'
6: \text{montgomeryM} r_7, r_7, \zeta'_1 \quad \triangleright r_5 = \zeta_1 a_1'
7: \text{montgomeryM} r_7, r_7, \zeta'_2 \quad \triangleright r_7 = \zeta_2 a_3'
8: \text{addSub2} r_4, r_5, r_6, r_7
9: \text{montgomeryM} r_5, r_5, \zeta'_2 \quad \triangleright r_4 : a_0'', r_5 : a_1'', r_6 : a_2'', r_7 : a_3''
Algorithm 17 4 × 4 schoolbook

Input: \( \{k_0, k_1, k_2, k_3\}, \{b_0, b_1, b_2, b_3\}, \zeta' = (2^{32} \zeta \mod q') \)

Output: \( \{c'_0, c'_1, c'_2, c'_3\} \) where

\[
\begin{align*}
c'_0 &= \frac{(k_0b_0 + \zeta(k_1b_3 + k_2b_2 + k_3b_1))}{2^{32}} \mod q' \\
c'_1 &= \frac{(k_0b_1 + k_1b_0 + \zeta(k_2b_3 + k_3b_2))}{2^{32}} \mod q' \\
c'_2 &= \frac{(k_0b_2 + k_1b_1 + k_2b_0 + \zeta(k_3b_3))}{2^{32}} \mod q' \\
c'_3 &= \frac{(k_0b_3 + k_1b_2 + k_2b_1 + k_3b_0)}{2^{32}} \mod q' \\
\end{align*}
\]

```
1: smull r0, r1, k1, b3
2: smlal r0, r1, k2, b2
3: smlal r0, r1, k3, b1
4: montgomeryR r0, r1
5: smull r0, r1, r1, \zeta'
6: smlal r0, r1, k0, b0
7: montgomeryR r0, r1
8: vmov s0, r1
9: smull r0, r1, k2, b3
10: smlal r0, r1, k3, b2
11: montgomeryR r0, r1
12: smull r0, r1, r1, \zeta'
13: smlal r0, r1, k0, b1
14: smlal r0, r1, k1, b0
15: montgomeryR r0, r1
16: vmov s1, r1
17: smull r0, r1, k3, b3
18: montgomeryR r0, r1
19: smull r0, r1, r1, \zeta'
20: smlal r0, r1, k0, b2
21: smlal r0, r1, k1, b1
22: smlal r0, r1, k2, b0
23: montgomeryR r0, r1
24: vmov s2, r1
25: smull r0, r1, k0, b3
26: smlal r0, r1, k1, b2
27: smlal r0, r1, k2, b1
28: smlal r0, r1, k3, b0
29: montgomeryR r0, r1
30: vmov s3, r1
```

\( \triangleright r1 = c'_0 \)

\( \triangleright r1 = c'_1 \)

\( \triangleright r1 = c'_2 \)

\( \triangleright r1 = c'_3 \)

\( \triangleright \text{vmov for accumulation later on} \)
Algorithm 18 Saber $4 \times 4$ schoolbook refined ($l = 3$ as an example)

Input:
\[ \{k_0, k_1, k_2, k_3\} \]
\[ \{b_0, b_1, b_2, b_3\} \]
\[ \{k_n, k_{n+1}, k_{n+2}, k_{n+3}\} \]
\[ \{b_n, b_{n+1}, b_{n+2}, b_{n+3}\} \]
\[ \{k_{2n}, k_{2n+1}, k_{2n+2}, k_{2n+3}\} \]
\[ \{b_{2n}, b_{2n+1}, b_{2n+2}, b_{2n+3}\} \]
\[ (\zeta' = (2^{32} \zeta \mod q')) \]

Output: \( \{c_0'', c_1'', c_2'', c_3''\} \) where \( c_i'\)'s are as in Algorithm 17

\[ c_0'' = (c_0' + c_n' + c_{2n}) \mod q' \]
\[ c_1'' = (c_1' + c_{n+1}' + c_{2n+1}) \mod q' \]
\[ c_2'' = (c_2' + c_{n+2}' + c_{2n+2}) \mod q' \]
\[ c_3'' = (c_3' + c_{n+3}' + c_{2n+3}) \mod q' \]

1: smlal r0, r1, k1, b3
2: smlal r0, r1, k2, b2
3: smlal r0, r1, k3, b1
4: montgomeryR r0, r1
\[ \triangleright r1 = (k_1b_3 + k_2b_2 + k_3b_1)/2^{32} \mod q' \]
5: smlal r0, r1, r1, \( \zeta' \)
6: smlal r0, r1, k0, b0
7: vmov s0, s1, r0, r1
\[ \triangleright 2^{32}r1 + r0 = 2^{32}c_0' \mod q' \]
8: smlal r0, r1, k_{n+1}, b_{n+1}
9: smlal r0, r1, k_{n+2}, b_{n+2}
10: smlal r0, r1, k_{n+3}, b_{n+3}
11: montgomeryR r0, r1
\[ \triangleright r1 = (k_{n+1}b_{n+3} + k_{n+2}b_{n+2} + k_{n+3}b_{n+1})/2^{32} \mod q' \]
12: vmov r0, r4, s0, s1
13: smlal r0, r4, r1, \( \zeta' \)
14: smlal r0, r4, k_n, b_n
15: vmov s0, s1, r0, r1
\[ \triangleright 2^{32}r4 + r0 = 2^{32}(c_0' + c_n') \mod q' \]
16: smlal r0, r1, k_{2n+1}, b_{2n+1}
17: smlal r0, r1, k_{2n+2}, b_{2n+2}
18: smlal r0, r1, k_{2n+3}, b_{2n+3}
19: montgomeryR r0, r1
\[ \triangleright r1 = (k_{2n+1}b_{2n+3} + k_{2n+2}b_{2n+2} + k_{2n+3}b_{2n+1})/2^{32} \mod q' \]
20: vmov r0, r4, s0, s1
21: smlal r0, r4, r1, \( \zeta' \)
22: smlal r0, r4, k_{2n}, b_{2n}
\[ \triangleright 2^{32}r4 + r1 = 2^{32}(c_0' + c_n' + c_{2n}) \mod q' \]
23: montgomeryR r0, r4
\[ \triangleright r4 = c_0'' \]
24: compute \( c_0'' \) with above optimization
25: compute \( c_1'' \) with above optimization
26: compute \( c_2'' \) with above optimization
27: compute \( c_3'' \) with above optimization
28: Code sections for \( c_0'', c_1'', c_2'', \) and \( c_3'' \) are actually interleaved.
Algorithm 19 Cooley-Tukey FFT three layers (adapted from [ACC+ 20])

Input: \{a_0, \ldots, a_7\}, \{ζ'_0, \ldots, ζ'_6\} where ζ' = \left(2^{32}ζ \mod q'\right).

Output: \{a''_0, \ldots, a''_7\} where

\begin{align*}
a''_0 &= a'_0 + ζ_0a'_1 \\
a''_1 &= a'_0 - ζ_0a'_1 \\
a''_2 &= a'_2 + ζ_2a'_3 \\
a''_3 &= a'_2 - ζ_2a'_3 \\
a''_4 &= a'_4 + ζ_4a'_5 \\
a''_5 &= a'_4 - ζ_4a'_5 \\
a''_6 &= a'_6 + ζ_6a'_7 \\
a''_7 &= a'_6 - ζ_6a'_7
\end{align*}

1: \((r_4, \ldots, r_{11}) = (a_0, \ldots, a_7)\)
2: montgomeryM r_8, r_8, ζ'_0
3: montgomeryM r_9, r_9, ζ'_0
4: montgomeryM r_{10}, r_{10}, ζ'_0
5: montgomeryM r_{11}, r_{11}, ζ'_0
6: addSub4 r_4, r_8, r_5, r_9, r_{10}, r_{11}, r_7
7: \quad \triangleright (r_4, \ldots, r_{11}) = (a''_0, \ldots, a''_7)
8: montgomeryM r_6, r_6, ζ'_1
9: montgomeryM r_7, r_7, ζ'_1
10: montgomeryM r_{10}, r_{10}, ζ'_1
11: montgomeryM r_{11}, r_{11}, ζ'_1
12: addSub4 r_4, r_6, r_5, r_7, r_8, r_{10}, r_9, r_{11}
13: \quad \triangleright (r_4, \ldots, r_{11}) = (a''_0, \ldots, a''_7)
14: montgomeryM r_5, r_5, ζ'_2
15: montgomeryM r_7, r_7, ζ'_2
16: montgomeryM r_9, r_9, ζ'_2
17: montgomeryM r_{11}, r_{11}, ζ'_2
18: addSub4 r_4, r_5, r_6, r_7, r_8, r_9, r_{10}, r_{11}
19: \quad \triangleright (r_4, \ldots, r_{11}) = (a''_0, \ldots, a''_7)
20: \quad \triangleright \text{If } ζ = \pm 1 \text{ then no multiplications}
Algorithm 20 NTRU central reduction loop (for ntruhps2048677, ntruhrss701, and ntruhps4096821)

Input: \{a_i, \ldots, a_{i+3}, a_{n+i}, \ldots, a_{n+i+3}\}.
Output: \{a'_{i+1}||a'_i, a'_{i+3}||a'_{i+2}\} where

\[
a'_i = \frac{(a_i + a_{n+i})2^{32}}{\text{NTT}_N \mod q}
\]

\[
a'_{i+1} = \frac{(a_{i+1} + a_{n+i+1})2^{32}}{\text{NTT}_N \mod q}
\]

\[
a'_{i+2} = \frac{(a_{i+2} + a_{n+i+2})2^{32}}{\text{NTT}_N \mod q}
\]

\[
a'_{i+3} = \frac{(a_{i+3} + a_{n+i+3})2^{32}}{\text{NTT}_N \mod q}
\]

1: \((r4, r5, r6, r7, r8, r9, r10, r11) = (a_i, \ldots, a_{i+3}, a_{n+i}, \ldots, a_{n+i+3})
2: \text{add } r4, r8
3: \text{add } r5, r9
4: \text{add } r6, r10
5: \text{add } r7, r11
6: \text{montgomeryM } r4, r4, (2^{32})^2/\text{NTT}_N \mod q'
7: \text{montgomeryM } r5, r5, (2^{32})^2/\text{NTT}_N \mod q'
8: \text{montgomeryM } r6, r6, (2^{32})^2/\text{NTT}_N \mod q'
9: \text{montgomeryM } r7, r7, (2^{32})^2/\text{NTT}_N \mod q'
10: \text{centralR } r4 \quad \triangleright \quad r4 = \frac{(a_i + a_{n+i})}{\text{NTT}_N \mod q'}
11: \text{centralR } r5 \quad \triangleright \quad r5 = \frac{(a_{i+1} + a_{n+i+1})}{\text{NTT}_N \mod q'}
12: \text{centralR } r6 \quad \triangleright \quad r6 = \frac{(a_{i+2} + a_{n+i+2})}{\text{NTT}_N \mod q'}
13: \text{centralR } r7 \quad \triangleright \quad r7 = \frac{(a_{i+3} + a_{n+i+3})}{\text{NTT}_N \mod q'}
14: \text{pkhbt } r4, r4, r5, \text{lsl }#16
15: \text{pkhbt } r6, r6, r7, \text{lsl }#16
16: \text{and } r4, r4, (q-1)||(q-1) \quad \triangleright \quad r4 = a'_{i+1}||a'_i
17: \text{and } r6, r6, (q-1)||(q-1) \quad \triangleright \quad r6 = a'_{i+3}||a'_{i+2}