Multi-Voltage Low Power Convolvers Using the Polynomial Residue Number System

V. Paliouras
Electrical & Computer Engineering Dept.,
University of Patras, Greece
paliuras@ee.upatras.gr

A. Skavantzos
Electrical & Computer Engineering Dept.,
Louisiana State University, USA

T. Stouraitis
Electrical & Computer Engineering Dept.,
University of Patras, Greece

ABSTRACT
A novel approach for the reduction of the power dissipated in a signal processing application is introduced in this paper. By exploiting the properties of the Polynomial Residue Number System (PRNS) and of the arithmetic modulo \((2^n + 1)\), the power dissipation of implementing cyclic convolution is reduced up to 4 times. Furthermore, the corresponding power \(\times\) delay product is reduced up to 2.4 times, while a simultaneous reduction of area cost is achieved. The particular performance improvement becomes possible by introducing a way to minimize the forward and inverse conversion overhead associated with PRNS. The introduced minimization exploits the fact that for the conversions for particular lengths of data sequences and particular moduli, only multiplications with powers of two and additions are required, thus leading to low implementation complexity. In addition multiple supply voltages are utilized to further reduce power dissipation by more than 30\% for particular cases. Formulas that return the applicable supply voltage values per PRNS channel are derived in this paper.

Categories and Subject Descriptors
B.7.1 [Hardware]: Types and Design Styles; B.5.1 [Hardware]: Design; C.3 [Computer System Organization] Special-purpose and application-based systems—signal processing systems

General Terms
Design

Keywords
Computer arithmetic, Polynomial Residue Number System (PRNS), Low power design, signal processing

1. INTRODUCTION

Recently low power consumption has emerged as a major design optimization objective due to the need for portable electronics equipment as well as a means for increasing the reliability in high-performance systems. Several design techniques have been proposed to minimize power dissipation, spanning all levels of the design abstraction [1], from system level down to the VLSI technology level. Among the various optimization techniques, optimal arithmetic selection has been shown to have an impact on overall system power dissipation [2]. In this paper, a new approach is proposed for the design of low-power digital signal processing equipment. By utilizing a number theoretic approach, it is shown that a significant reduction of power dissipation can be achieved, combined with a reduction of the system area cost, at the cost of increased delay. Further optimization is possible, by utilizing multiple supply voltages.

The Residue Number System (RNS) [3] is an integer system capable of supporting parallel, carry-free, high-speed arithmetic. The system also offers some useful properties for error detection, error correction and fault tolerance in digital systems. Important areas of application of the RNS include Digital Signal Processing (DSP) intensive computations, such as digital filtering, convolutions, correlations and DFT and FFT computations. Recent work in RNS arithmetic has resulted in the development of the Polynomial Residue Number System (PRNS) [4] which is capable of multiplying two polynomials using minimum computational complexity.

The PRNS examines the problem of multiplying two \((N - 1)\)-degree polynomials \(\mod (x^N + 1)\) in some modular ring \(\mathbb{Z}_m = \{0, 1, \ldots, m - 1\}\), a ring which is closed with respect to the operations of addition and multiplication \(\mod m\). This system can perform the above polynomial product using the minimal number of multiplications. It should be noted that the polynomial product of two \((N - 1)\)-degree polynomials \(\mod (x^N - 1)\) implements the cyclic convolution of two \(N\)-point sequences, a task which is useful in efficiently computing linear convolutions. It should also be noted that the linear convolution of two sequences is a very useful computation because it mechanizes digital filtering. Consider two \(N\)-point sequences \(A = a_0, a_1, \ldots, a_{N-1}\) and \(B = b_0, b_1, \ldots, b_{N-1}\).
Then their cyclic convolution is an N-point sequence \( C = c_0, c_1, \ldots, c_{N-1} \), with \( c_i, i = 0, 1, \ldots, N-1 \) given by the coefficients of the polynomial \( C(x) \), where \( C(x) = \langle A(x)B(x) \rangle_{\nu^N-1} \) with \( \langle P(x) \rangle_{Q(x)} \) denoting the operation \( P(x) \mod Q(x) \) in polynomials, while \( A(x) = \sum_{i=0}^{N-1} a_i x_i, \quad B(x) = \sum_{i=0}^{N-1} b_i x_i, \quad \text{and} \quad C(x) = \sum_{i=0}^{N-1} c_i x_i. \)

By factorizing the polynomials \( x_N \pm 1 \) in \( N \) distinct factors in \( \mathbb{Z}_m \), as in

\[
x_N = (x - r_0)(x - r_1) \cdots (x - r_{N-1}),
\]

where \( r_i \in \mathbb{Z}_m, \quad i = 0, 1, \ldots, N-1 \), the product of two \((N-1)\)-degree polynomials \( \mod (x_N \pm 1) \) in \( \mathbb{Z}_m \) can be computed with only \( N \) multiplications instead of \( N^2 \). This is based on an isomorphic mapping \( f \) called the PRNS isomorphic mapping which translates polynomials of the form \( A(x) = \sum_{i=0}^{N-1} a_i x_i \) into the PRNS domain.

The PRNS isomorphic mapping is given by

\[
A(x) = a_0 + a_1 x + \cdots + a_{N-1} x^{N-1} \xrightarrow{f} A^*(x) = \sum_{i=0}^{N-1} a_i x_i,
\]

with

\[
a_i^* = \langle A(x) \rangle_{(x-r_i)} = \langle A(r_i) \rangle_m = \langle a_0 + a_1 r_i + \cdots + a_{N-1} r_i^{N-1} \rangle_m,
\]

for \( i = 0, 1, \ldots, N-1 \) and \( r_i \) are the distinct roots of \( x_N \pm 1 = 0 \) in \( \mathbb{Z}_m \).

In (3) \( \langle x \rangle_m \) denotes the operation \( x \mod m \) between integers. The inverse PRNS mapping \( f^{-1} \) is given by

\[
A'(x) = \sum_{i=0}^{N-1} a_i^* x_i \xrightarrow{f^{-1}} A(x) = a_0 + a_1 x + \cdots + a_{N-1} x^{N-1},
\]

where the coefficients of the polynomial \( A(x) = a_i, \quad i = 0, 1, \ldots, N-1 \) are given by

\[
a_i = \langle N^{-1}(a_0 r_i^{-1} + a_1 r_i^{-2} + \cdots + a_{N-1} r_i^{-N+1}) \rangle_m.
\]

Eq. (6) dictates that the product of two \((N-1)\)-degree polynomials \( \mod (x_N \pm 1) \) requires \( N \) multiplications \( \mod m \) if performed in the PRNS domain. The same task requires \( N^2 \) multiplications \( \mod m \) if performed using the traditional technique; (non-PRNS).

Consider two polynomials \( A(x) = \sum_{i=0}^{N-1} a_i x_i \) and \( B(x) = \sum_{j=0}^{N-1} b_j x_j \) and consider performing \( \langle A(x)B(x) \rangle_{\nu^N-1} \) in the ring \( \mathbb{Z}_m \).

Let \( C_{\text{PRNS}} \) and \( C_{\text{non-PRNS}} \) denote the computational requirements for computing \( \langle A(x)B(x) \rangle_{\nu^N-1} \) in \( \mathbb{Z}_m \) using the PRNS and the traditional technique respectively. Then \( C_{\text{PRNS}} \) and \( C_{\text{non-PRNS}} \) are given by

\[
C_{\text{PRNS}} = (3N(N-1)+1)\text{scalings} + 3N(N-1)\text{adds} + N\text{mults}
\]

and

\[
C_{\text{non-PRNS}} = N^2\text{mults} + N(N-1)\text{adds}.
\]

In (7) and (8) the scalings, additions and multiplications are operations \( \mod m \), while scaling means multiplication by constant.

Let \( r_i, i = 0, 1, \ldots, N-1 \) be the \( N \) distinct roots of \( x_N \pm 1 = 0 \) in \( \mathbb{Z}_m \).

If all the roots \( r_i, i = 0, 1, \ldots, N-1 \), their multiplicative inverses in \( \mathbb{Z}_m \), i.e. \( i = 0, 1, \ldots, N-1 \) and \( N-1 \) are all perfect powers of two, then all the scaling operations required by the forward and inverse PRNS mappings of (3) and (5) become multiplications by powers of two which can be implemented with simple shift operations simplifying this way the computational hardware. It can easily be shown that for several moduli of the form \( m = 2^p + 1 \) and several choices of \( N \), all the roots \( r_i, i = 0, 1, \ldots, N-1 \) of \( x_N = 0 \) their multiplicative inverses in \( \mathbb{Z}_{2^p+1} \), \( i = 0, 1, \ldots, N-1 \), (5) \( x_N \pm 1 \) are all perfect powers of two. If the diminished-1 system [5,6] is used for performing arithmetic mod \( 2^p + 1 \), then multiplications by powers of two can be implemented with leftwise rotations and complementation operations and this way very little computational hardware is required; (only the inverters responsible for complementing some bits of the number being rotated). More on diminished-1 arithmetic mod \( 2^p + 1 \) will be offered in section 2 of the paper.

**Lemma 1** \( x_N \pm 1 \) can be factorized into \( N \) distinct factors in \( \mathbb{Z}_m \), as \( x_N = \langle (x - r_0)(x - r_1) \cdots (x - r_{N-1}) \rangle_m \) if and only if \( p_i = 2k_iN+1, \quad k_i = 1, 2, 3, \ldots, i = 1, 2, \ldots, L \) where \( N \) and \( m \) are positive integers with a prime decomposition of \( m \) given in terms of powers \( e_i \) of its prime factors \( p_i \), as \( m = p_1^{e_1} p_2^{e_2} \cdots p_L^{e_L} \) with \( N < p_i \). Similarly for \( x_N \), the necessary and sufficient condition for its factorization becomes \( p_i = k_iN+1, \quad k_i = 1, 2, 3, \ldots \) and in both cases the factorization is not unique: there are \( N! \) different ways to factorize \( x_N \pm 1 \) into \( N \) distinct first-degree factors [4].

**Lemma 2** For both congruences \( x_N \pm 1 = 0 \)\( \mod m \), the multiplicative inverses of their roots are also roots of the congruences, while the additive inverses are roots of the congruences only when \( N \) is even [4].

Due to the fact that if \( N \) is even and \( r \) is a root of \( x_N \pm 1 = 0 \)\( \mod m \) then \( (r)_{m} \) is also a root, the number of scalings required for the forward PRNS mapping can be reduced to almost one half.

The remainder of the paper is organized as follows: In section 2 the basics of diminished-1 arithmetic are reviewed. In section 3 the area, time and power dissipation performance of a PRNS architecture that exploits the scalings by powers of two for the forward and inverse converters is quantified and compared to a non-PRNS architecture. The impact of multiple supply voltages on the PRNS convolver architecture is discussed in Section 4. Finally, conclusions are discussed in Section 5.
2. DIMINISHED-1 ARITHMETIC

Diminished-1 arithmetic has been proposed by Leibowitz [5] as an efficient means for performing arithmetic modulo \(2^n + 1\). In diminished-1 arithmetic, the quantity \((x - 1)/2^n + 1\) is used as an image of \(x \in \mathbb{Z}_{2^n + 1}\). The particular mapping allows non-zero quantities to be represented using \(n\) bits, while zero is mapped onto the quantity \(2^n\), which requires \(n + 1\) bits for its representation.

When performing arithmetic mod \((2^n + 1)\) using the diminished-1 system, all input operands and the corresponding results are expressed in diminished-1 form.

By exploiting the diminished-1 representation, addition mod \((2^n + 1)\) is performed as an end-around carry operation, in two phases: An ordinary \(n\)-bit addition is performed, the carry out of which is negated and added back. Efficient parallel VLSI diminished-1 arithmetic has also been recently proposed [7].

Negation mod \((2^n + 1)\) is performed in diminished-1 system as follows: When \(A \neq 0\) and \(A \in \mathbb{Z}_{2^n + 1}\), \(A \rightarrow A - 1\). By taking the one’s complement of \(A - 1\), the quantity \(-A\) in diminished-1 format is obtained.

In PRNS processing, scaling by powers of two is an important operation and it can be efficiently implemented in the diminished-1 system. In particular, \((2^n A)_{2^n + 1}\) is computed as follows: The number \(A - 1\) is rotated \(k\) bits to the left, where the bits shifted out of the most significant end are complemented and shifted in the least significant end. The obtained result is expressed in diminished-1 form.

3. PERFORMANCE OF PRNS ARCHITECTURES

In the following the area, time and power dissipation performance of \(N\)-point cyclic convolution using the PRNS is quantified. The organization of the PRNS system is depicted in Fig. 1. The PRNS performance is compared to the performance of an architecture that employs conventional modular arithmetic. It is shown that the PRNS can substantially reduce both the area cost and the power dissipation of a system, even when the corresponding forward and inverse conversion overhead are taken into consideration.

The comparisons assume VLSI PRNS architectures that employ the modulo-(\(2^n + 1\)) multiplier by Wang et al. [6], and carry-save (3, 2)-counter Wallace trees for the \(N\)-operand additions mod \((2^n + 1)\), required by the forward and inverse converters. These structures employ diminished-1 arithmetic mod \(2^n + 1\). The non-PRNS architectures employ the same multiplier and multi-operand adder structures for the direct computation of the cyclic convolution.

A novel observation is employed in this paper to simplify the converter design and hence reduce the PRNS implementation complexity. In particular, when certain moduli \(m = 2^n + 1\) are utilized for certain values of \(N\), the roots of the polynomial \(x^N - 1\), and the multiplicative inverses of the roots and of \(N\) in \(\mathbb{Z}_{2^n + 1}\) are powers of two, either of positive or negative sign. This is demonstrated in Table 1 for several values of \(N\) and \(m\). Hence, the scalings in (3) and (5) are reduced to scalings by powers of two, which can be efficiently implemented in diminished-1 arithmetic as rotations and bit-negation operations, with very low hardware complexity. This is demonstrated for \(N = 8\) and \(m = 2^4 + 1 = 17\) in Fig. 2, for \(a_i\). The remainder of the \(a_i\) require scalings of similar implementation complexity.

The area, time and power dissipation performance of the PRNS-enhanced cyclic convolution architectures is summarized in Table 2, for various numbers of points \(N\) and several moduli of the form \(m = 2^n + 1\). The relative performance of a PRNS and a non-PRNS architecture are compared in terms of the ratios of area, time, power dissipation and product complexity of the architecture, i.e., non-PRNS or PRNS. When a ratio assumes a value larger than one, \(r > 1\), then the performance of the PRNS system is \(r\) times better than the corresponding non-PRNS architecture; a value \(r < 1\) denotes worse
The performance of the non-PRNS system. The particular behavior area and power dissipation savings increase, over the corresponding

\[ \{1, 2, 2^2, 2^{3}, 2^{4}, 2^{5}, 2^{6}, 2^{7}, 2^{8}, 2^{9}, 2^{10}, 2^{11}, 2^{12}, 2^{13}\} \]

Table 1: Roots \( r_i \) of polynomials \( x^{N} - 1 \mod 2^{n} + 1 \), and the multiplicative inverses \( r_i^{-1} \) and \( N^{-1} \) of the roots \( r_i \) and of \( N \) in \( \mathbb{Z}_{2^{n}+1} \). It can be seen that all the quantities are powers of two.

<table>
<thead>
<tr>
<th>( N )</th>
<th>( m )</th>
<th>( \text{Invert/PRNS} )</th>
<th>( \text{Invert/PRNS} )</th>
<th>( \text{Invert/PRNS} )</th>
<th>( \text{TPR_{0.995/PRNS}} )</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>17</td>
<td>0.854</td>
<td>0.602</td>
<td>1.084</td>
<td>0.653</td>
</tr>
<tr>
<td>257</td>
<td>1.481</td>
<td>0.567</td>
<td>1.820</td>
<td>1.032</td>
<td></td>
</tr>
<tr>
<td>65537</td>
<td>2.462</td>
<td>0.556</td>
<td>2.912</td>
<td>1.619</td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>257</td>
<td>1.671</td>
<td>0.559</td>
<td>2.136</td>
<td>1.194</td>
</tr>
<tr>
<td>65537</td>
<td>2.991</td>
<td>0.553</td>
<td>3.717</td>
<td>2.055</td>
<td></td>
</tr>
<tr>
<td>32</td>
<td>65537</td>
<td>3.341</td>
<td>0.550</td>
<td>4.284</td>
<td>2.356</td>
</tr>
</tbody>
</table>

The critical path capacitance can be exploited to derive the delay along the maximum delay path of the particular multiplier architecture can be approximated by (cf. [10]):

\[ T_{\text{crit}}(m) = \frac{C_{\text{crit}}(m)V}{k(V - V_{th})^2} \]

where \( V \) is the supply voltage, \( k \) depends on implementation technology parameters, and \( V_{th} \) is the device threshold voltage. Eq. (12) implies that the delay of a particular modulo-m multiplier depends on \( m \) and the supply voltage \( V \). The different delays of the various modulo channels can be balanced by properly selecting the supply voltages of each channel. The utilization of multiple supply voltages in RNS FIR filters has been proposed by Del Re et al. [11]. In this paper, we study the application of multiple supply voltages to PRNS architectures, and derive models and formulas that return the supply voltage value per channel. Therefore, for a PRNS system employing three moduli channels \( 2^4 + 1, 2^8 + 1 \), and \( 2^{16} + 1 \), the supply voltage for each channel can be computed by posing the requirement that the channels that correspond to smaller moduli demonstrate equal delay time to the larger moduli channels, i.e.,

\[ T_{\text{crit}}(2^{16} + 1) = T_{\text{crit}}(2^4 + 1) \]

\[ T_{\text{crit}}(2^{16} + 1) = T_{\text{crit}}(2^8 + 1). \]

Let \( V \) denote the supply voltage of the modulo \( 2^{16} + 1 \), and \( V_{17} = V_{257} = V_{257} = V \) denote the supply voltages for the channels of the employed modulo \( m = 2^n + 1 \) diminished-one multiplier is

\[ C_{\text{crit}}(m) = h \log_2(m - 1) + \log_2(m - 1) + C_{\text{FA}} + C_{\text{max}} \]

where \( C_{\text{FA}} \) is the capacitance of an 1-bit full adder, \( C_{\text{max}} \) is the capacitance of an 1-bit two-input multiplexer, and \( h(L) \) returns the height of a Wallace tree that adds \( L \) operands and it is recursively computed using [9]:

\[ h(L) = 1 + h \left( \left\lfloor \frac{2L}{3} \right\rfloor \right) \]

(10)

\[ h(3) = 1. \]

(11)

4. MULTI-VOLTAGE CONSIDERATIONS

In case of a PRNS system which includes several moduli, further optimization is possible, by using a different supply voltage for each modulo channel. The capacitance along the critical path of the employed modulo \( m = 2^n + 1 \) diminished-one multiplier is

\[ C_{\text{crit}}(m) = h \log_2(m - 1) + \log_2(m - 1) + C_{\text{FA}} + C_{\text{max}}. \]
mod $2^3 + 1$ and mod $2^8 + 1$. By combining (9)–(14), equations can be formed the solution of which allows the computation of the supply voltage reduction factors $\beta_{17}$ and $\beta_{257}$:

$$\frac{V (23C_{FA} + C_{max})}{k(V - V_{th})^2} - \frac{V_{B17} (7C_{FA} + C_{max})}{k(V_{B17} - V_{th})^2} = 0 \quad (15)$$

$$\frac{V (23C_{FA} + C_{max})}{k(V - V_{th})^2} - \frac{V_{B257} (13C_{FA} + C_{max})}{k(V_{B257} - V_{th})^2} = 0 \quad (16)$$

The solution of (15) and (16) returns

$$\beta_{17} = 1 \frac{1}{2V^2(23C_{FA} + C_{max})} \left( C_{max} \left( V^2 + V_{th}^2 \right) + C_{FA} \left( 7V^2 + 32VV_{th} + 7V_{th}^2 \right) + |V - V_{th}| \sqrt{7C_{FA} + C_{max}} \sqrt{C_{max}(V + V_{th})^2 + C_{FA}(7V^2 + 78VV_{th} + 7V_{th}^2)} \right)$$

$$\beta_{257} = 1 \frac{1}{2V^2(23C_{FA} + C_{max})} \left( C_{max} \left( V^2 + V_{th}^2 \right) + C_{FA} \left( 13V^2 + 20VV_{th} + 13V_{th}^2 \right) + |V - V_{th}| \sqrt{13C_{FA} + C_{max}} \sqrt{C_{max}(V + V_{th})^2 + C_{FA}(13V^2 + 66VV_{th} + 13V_{th}^2)} \right).$$

From the two values obtained for each of $\beta_{17}$ and $\beta_{257}$, the one that leads to $V_{th} V_{257} < V_{th}$ is not legitimate [10]. In order to quantify the voltage reduction factor, capacitance values taken from a 0.7-μm CMOS standard-cell library [8] are utilized as follows: Assuming $C_{FA} \approx 0.054\text{pF}$, $C_{max} \approx 0.067\text{pF}$, $V_{th} \approx 0.6\text{V}$ and $V \approx 5\text{V}$, it is obtained that $\beta_{17} = 0.472$ and $\beta_{257} = 0.674$. Therefore, without affecting the overall system delay, the mod 17 and mod 257 residue channels can operate at supply voltages $V_{th} = 2.36\text{V}$ and $V_{257} = 3.37\text{V}$. The particular supply voltage reduction directly reduces the overall power dissipation of the system. Assuming a Wallace-tree based implementation of a binary multiplier, the power dissipated by an 8-point circular convolution by means of a three-modulus PRNS system which offers a dynamic range of 28 bits, is reduced by 30%. It is noted that the particular performance improvement does not affect the latency of the system. Furthermore, when compared to a full-parallel conventional binary (non-RNS) implementation of the the 8-point convolver, a five-times reduction in power is anticipated.

5. CONCLUSIONS

In this paper it has been shown that by properly selecting the modulus of operation, the conversion complexity inherent in a PRNS-based architecture can be reduced to scalings with powers of two and additions. This substantial reduction leads to significant power×delay product and area savings, in comparison to a traditional architecture for the implementation of N-point cyclic convolution. In addition, in a multi-modulus PRNS VLSI architecture, the different delays displayed by the residue channels can be exploited to further reduce power dissipation. This is achieved by reducing the supply voltage of the smaller residue channels, as dictated by (17) and (18).

6. REFERENCES