Power Optimization of System-Level Address Buses based on Software Profiling

W. Fornaciari † † M. Polentarutti † † D. Sciuto † † C. Silvano † †
† Politecnico di Milano
Dip. di Elettronica e Informazione
Milano, ITALY 20133
CEFRIEL
Milano, ITALY 20133

ABSTRACT
The paper aims at defining a methodology for the optimization of the switching power related to the processor-to-memory communication on system-level buses. First, a methodology to profile the switching activity related to system-level buses has been defined, based on the tracing of benchmark programs running on the Sun SPARC V8 architecture. The bus traces have been analyzed to identify temporal correlations between consecutive patterns. Second, a framework has been set up for the design of high-performance encoder/decoder architectures to reduce the transition activity of the system-level buses. Novel bus encoding schemes have been proposed, whose performance has been compared with the most widely adopted power-oriented encodings. The experimental results have shown that the proposed encoding techniques provide an average reduction in transition activity up to 74.11% over binary encoding for instruction address streams. The results indicate the suitability of the proposed techniques for high-capacitance wide buses, for which the power saving due to the transition activity reduction is not offset by the extra power dissipation introduced in the system by the encoding/decoding logic.

1. INTRODUCTION
In microprocessor-based systems, significant power savings can be achieved by reducing the transition activity of the system buses. The power consumption due to the transition activity of the I/O pads in a VLSI circuit ranges from 10% to 80% of the overall power, with a typical value of 50% for circuits optimized for low power [1]. Several encoding techniques have recently been proposed in the literature to reduce the switching activity of high-capacitance bus lines. In [2], the authors proposed a redundant encoding scheme, the Bus-Invert code, which is suitable for transmitting patterns randomly distributed in time, as on data buses. The approach introduces a delay overhead due to the majority voter included in the encoder. For address buses, on which sequential addressing usually dominates, the Gray code has been proposed in [4] and [5]. In [6], [7], a redundant encoding scheme, the T0 code, has been introduced. It avoids the transfer of consecutive addresses on the bus by using a redundant line, INC, to tell the receiving sub-system that the addresses are sequential. For infinite streams of consecutive addresses, the T0 code has the property that zero transitions occur on the bus, while Gray addressing requires one bit to switch for each pair of consecutive patterns.
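
The difference between binary, Gray, and T0 addressing on a sequential stream can be illustrated with a small sketch. The function names, the stride of 1, and the 16-word stream below are assumptions chosen for illustration; this is not the authors' implementation.

```python
def hamming(a: int, b: int) -> int:
    """Number of bus lines that toggle between two consecutive words."""
    return bin(a ^ b).count("1")


def to_gray(value: int) -> int:
    """Standard reflected binary Gray code."""
    return value ^ (value >> 1)


def transitions(words):
    """Total toggles over a sequence of bus words."""
    return sum(hamming(prev, cur) for prev, cur in zip(words, words[1:]))


def t0_encode(addresses, stride=1):
    """T0 sketch: freeze the bus and raise INC for in-sequence addresses.

    Returns (bus_value, inc) pairs; the first address is sent as-is.
    """
    encoded = [(addresses[0], 0)]
    for prev, cur in zip(addresses, addresses[1:]):
        if cur == prev + stride:
            encoded.append((encoded[-1][0], 1))  # bus lines unchanged
        else:
            encoded.append((cur, 0))
    return encoded


def t0_transitions(encoded):
    """Toggles on the address lines plus toggles on the INC line."""
    return sum(
        hamming(b0, b1) + (i0 ^ i1)
        for (b0, i0), (b1, i1) in zip(encoded, encoded[1:])
    )


addresses = list(range(16))  # fully sequential stream, stride 1 (assumed)
binary_toggles = transitions(addresses)
gray_toggles = transitions([to_gray(a) for a in addresses])
t0_toggles = t0_transitions(t0_encode(addresses))
```

On this stream the Gray code toggles exactly one line per transfer, while the T0 sketch toggles only the INC line once, when the sequence starts; for an unbounded sequential stream that single toggle is amortized to zero.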

Other encoding techniques at the system level have been reviewed in [3], while a general encoding/decoding framework aiming at reducing the transition activity has been recently proposed in [1]. Although most of the known low-power encoding techniques can be implemented by using this framework, the critical path to transmit the information on the bus can have a significant impact on the system-level performance. Other approaches directly change the way the information is stored in memory, so that the address streams already have low transition activity [8].

In this scenario, we propose a design framework to simulate the application of low-power techniques to system-level buses in microprocessor-based architectures. The most relevant features of the proposed design methodology are:

- The target system architecture is quite general and models the HW/SW communication on system-level buses in terms of the main parameters that affect the switching power of the system: power supply, frequency, transition activity and capacitive load;
- The proposed encoding/decoding architecture implements different classes of bus encoding techniques and represents a timing optimization of the architecture proposed in [1]: the critical path delay has been minimized to reduce the latency of the bus accesses;
- The methodology enables the profiling of the software execution in terms of transition activity on system-level buses: real bus traces, derived from the execution of several application programs, can be analyzed;
- Correlation metrics have been defined to characterize the streams transmitted on the buses during HW/SW communication with the purpose of identifying the encoding technique which best fits each target application.
- To further improve the transition activity on system-level buses, novel encoding techniques have been defined by extending the capabilities of previous low-power codes or by combining the previous ones. The proposed encoding techniques are suitable for address buses, characterized by high locality of references.
- Experiments have been carried out on real bus streams generated by tracing benchmark programs on the Sun SPARC V8 architecture. The results have shown an average reduction in transition activity up to 74.11% over binary encoding for instruction address streams.
- The implementation of the encoding/decoding architecture demonstrated how, for high-capacitance buses, the power saving due to the transition activity reduction is not offset by the power overhead introduced by the encoding/decoding logic.

The paper is organized as follows. In Section 2, we describe the target system architecture, the proposed high-performance encoding/decoding architecture, and the novel power-oriented bus encoding techniques. The methodology used to profile the software execution is described in Section 3, along with the correlation metrics. The experimental results in terms of both transition activity and power savings are presented in Section 4. Finally, some concluding remarks and future developments are reported in Section 5.

2. TARGET SYSTEM ARCHITECTURE
A power-oriented methodology operating at the system-level is mandatory for the HW/SW architectural exploration during the first phases of the design flow. The methodology should be tightly related to the characteristics of the system architecture, mainly in terms of the target processors, the memory sub-system, the system-level buses and the co-processors.

Figure 1 shows the block diagram of our target system architecture, which is a shared memory multi-processor system that can be implemented by using either the System-On-a-Chip approach or the multi-chip approach. The system includes one or more processors, the instruction caches (I-caches), the data caches (D-caches), the memory controller, the main memory, the I/O controllers, the peripheral units, and the co-processors to support specific applications (such as MPEG encoding). All these basic blocks are connected through address, data, and control buses implemented by using different topologies. Given the target architecture, our main focus is to investigate the HW/SW communication either on the sub-system-level buses, such as the processor-to-cache buses (which have been coloured in dark grey in Figure 1) or on the system-level buses. At these interfaces, we introduce a Bus Interface (BI) to model and optimize the four parameters which impact the switching power of the system: power supply, frequency, switching activity and capacitive load. In Figure 2, we propose four architectures to implement the BI module. These architectures model respectively voltage scaling (a), frequency multiplier/demultiplier (b), bus encoding/decoding to modify the transition activity of buses (c), and bus buffering to decouple capacitive loads (d).
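
The way the four parameters combine can be sketched with the usual first-order CMOS switching power model. The function name and the numeric values below are illustrative assumptions, and `alpha` is taken here as the average number of power-consuming transitions per line per clock cycle (with a different counting convention a factor of 1/2 appears).

```python
def switching_power(alpha, c_load, vdd, freq):
    """First-order switching power of one bus line, in watts.

    alpha  : average power-consuming transitions per cycle (assumed convention)
    c_load : capacitive load of the line, in farads
    vdd    : supply voltage, in volts
    freq   : clock frequency, in hertz
    """
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical operating point: 10 pF line, 3.3 V supply, 100 MHz clock.
p_nominal = switching_power(0.25, 10e-12, 3.3, 100e6)

# Halving the transition activity (what bus encoding targets) halves the power,
# while halving the supply voltage (voltage scaling) divides it by four.
p_encoded = switching_power(0.125, 10e-12, 3.3, 100e6)
p_scaled = switching_power(0.25, 10e-12, 1.65, 100e6)
```

In this model the encoding architecture (c) acts only on `alpha`, while architectures (a), (b), and (d) act on `vdd`, `freq`, and `c_load` respectively.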

2.1 Encoder/Decoder Architecture
A general framework for low-power bus encoding schemes has been recently proposed in [1]. The generic architecture can be specialized by using different alternatives for the internal decorrelating functions to derive most of the known low-power encoding techniques. However, the critical path to transmit the information on the bus can have a significant impact on the system-level performance. In fact, the critical path delay of the encoder is through the functions $f_1$, $f_2$, and $xor$, where $f_1$ can implement either a $xor$ or a $dbm$ logic block, while $f_2$ can implement the identity, $inv$, $vbm$, or $pbm$ functions.

Starting from the architecture in [1], we propose an encoder/decoder (Encode) which maintains wide generality while minimizing the critical path delay to reduce bus latency. The general encoder section of the Encode is shown in Figure 3. The encoder receives as input $b(t)$, the information value at time $t$, and it generates $B(t)$, the value on the encoded bus lines at time $t$. It consists of registers for $b(t-1)$ and $B(t-1)$, and three combinational logic blocks:

• a predictor block $P$, that generates a prediction $\hat{b}(t)$ of the current value of $b(t)$ based on the past value $b(t-1)$:

\[
\hat{b}(t) = P(b(t-1)) \qquad (1)
\]

• a decorrelator block $D$, that decorrelates the input $b(t)$:

\[
e(t) = D(b(t), \hat{b}(t)) \qquad (2)
\]

• a selector block $S$, that selects among its inputs $b(t)$, $B(t-1)$, and $e(t)$.
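
The structure listed above (two registers plus the $P$, $D$, and $S$ blocks) can be sketched as a small behavioural model. The class name and the particular choice of the three functions are assumptions made for illustration: they instantiate a T0-like code with an assumed stride of 1, and the redundant INC line is omitted for brevity.

```python
class PredictiveEncoder:
    """Behavioural sketch of an encoder built from P, D, and S blocks."""

    def __init__(self, predictor, decorrelator, selector):
        self.predictor = predictor        # P: b(t-1) -> predicted b(t)
        self.decorrelator = decorrelator  # D: (b(t), prediction) -> e(t)
        self.selector = selector          # S: picks among b(t), B(t-1), e(t)
        self.prev_b = None                # register holding b(t-1)
        self.prev_B = None                # register holding B(t-1)

    def step(self, b):
        if self.prev_b is None:
            B = b  # first word: nothing to predict from
        else:
            b_hat = self.predictor(self.prev_b)
            e = self.decorrelator(b, b_hat)
            B = self.selector(b, self.prev_B, e)
        self.prev_b, self.prev_B = b, B
        return B


# Hypothetical T0-like instantiation (stride 1):
#   P predicts the next sequential address,
#   D is a bitwise XOR, which is zero when the prediction is correct,
#   S freezes the bus on a correct prediction and sends b(t) otherwise.
encoder = PredictiveEncoder(
    predictor=lambda prev_b: prev_b + 1,
    decorrelator=lambda b, b_hat: b ^ b_hat,
    selector=lambda b, prev_B, e: prev_B if e == 0 else b,
)

bus_values = [encoder.step(b) for b in [4, 5, 6, 7, 20, 21]]
```

Swapping the three callables is enough to model other codes in the same skeleton, which is the point of keeping the blocks separate.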

The amount of hardware in the encoding functions has been kept as small as possible, and the critical path delay (through the $D$ and $S$ blocks) has been minimized to reduce the latency of bus accesses. A dedicated pass-gate implementation has been devised for both logic blocks on the critical path. As an example, for the $S$ block, the $mux$ function has been implemented by two pass-gates and one inverter, while the
In the T0-Offset code, we extend the capabilities of the T0 code by adopting the T0 scheme for in-sequence bus values, while for out-of-sequence bus values we use the Offset code, since encoding the difference \( b(t) - b(t-1) \) can produce fewer transitions on the bus lines than binary encoding. The T0-Xor-Offset code is derived similarly, by adopting the T0-Xor scheme for in-sequence bus values and the Offset code for out-of-sequence bus values. In the T0 code with variable stride, namely the T0-Var code, the Stride between consecutive patterns is a parameter. To represent the most frequent distances occurring between consecutive addresses, we use \( n \) values of the Stride \( S_1, S_2, \ldots, S_n \), so we need \( \log_2(S_n) \) redundant lines. In the reduced Bus-Invert code, namely the Red-BI code, we exploit the fact that the most significant bits of the system bus exhibit lower transition activity than the least significant bits. Thus we reduce the threshold above which the bus value is inverted to a number less than \( N/2 \).
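
As an illustration of the T0-Offset idea, the sketch below freezes the bus and asserts INC for in-sequence addresses and sends the difference \( b(t) - b(t-1) \) otherwise. The function names, the 32-bit width, the stride of 4, and the choice to send the first address unencoded are assumptions for this example, not details taken from the paper.

```python
def t0_offset_encode(addresses, width=32, stride=4):
    """Return (bus_value, inc) pairs for a T0-Offset style encoding."""
    mask = (1 << width) - 1
    encoded = [(addresses[0] & mask, 0)]  # first address sent as-is (assumed)
    for prev, cur in zip(addresses, addresses[1:]):
        if cur == (prev + stride) & mask:
            # In sequence: bus lines keep their value, INC line is asserted.
            encoded.append((encoded[-1][0], 1))
        else:
            # Out of sequence: send the offset, wrapped to the bus width.
            encoded.append(((cur - prev) & mask, 0))
    return encoded


def t0_offset_decode(encoded, width=32, stride=4):
    """Rebuild the address stream from (bus_value, inc) pairs."""
    mask = (1 << width) - 1
    addresses = [encoded[0][0]]
    for bus, inc in encoded[1:]:
        prev = addresses[-1]
        step = stride if inc else bus
        addresses.append((prev + step) & mask)
    return addresses


trace = [0x1000, 0x1004, 0x1008, 0x2000, 0x2004, 0x1FF0]
encoded = t0_offset_encode(trace)
decoded = t0_offset_decode(encoded)
```

Negative offsets are represented modulo the bus width, so a backward jump is recovered by the same addition the decoder uses for forward jumps.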

3. SOFTWARE EXECUTION PROFILING

A software tool, MA\textsubscript{AYA}, has been developed to trace the transition activity of system-level buses during the execution of benchmark programs, to analyze the bus traces in terms of correlation metrics, and to implement bus encoding techniques. The tracing and analysis capabilities of the tool have been kept independent of each other, to allow the acquisition of bus sequences obtained by different tracing programs. In the current version of MA\textsubscript{AYA}, we integrate the Shade-Spix tool [9], which combines an efficient instruction-set simulator with a flexible tracer of the execution of application programs. The current version of Shade runs on Sun SPARC systems and simulates the SPARC (Version 8 and 9) and MIPS I instruction sets. The results reported in this paper have been obtained on the Sun SPARC V8 architecture.

The structure of the MA\textsubscript{AYA} tool is mainly composed of the internal tracer, the external tracer and the analyzer. The internal and external tracers enable two different data acquisition modes. The internal tracer is based on the Shade tool to profile data streams derived from the execution of real application programs on the target architecture, while the external tracer receives as input a user-defined external sequence with given characteristics in terms of data correlation. Due to the length of the information streams to be processed (in the order of millions of patterns), the internal tracer provides a single pattern at a time to the analyzer. The analyzer evaluates on-the-fly the correlation metrics, implements the encoding techniques, and calculates the bus transition activity.
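
The one-pattern-at-a-time flow from tracer to analyzer can be sketched as follows. The class, the generator standing in for a tracer, and its parameters are hypothetical; the sketch only shows that the transition count can be kept with constant memory, without storing the whole stream.

```python
class TransitionCounter:
    """On-the-fly analyzer: keeps only the previous pattern and two counters."""

    def __init__(self):
        self.prev = None
        self.patterns = 0
        self.toggles = 0

    def feed(self, word):
        """Consume one bus pattern and update the running transition count."""
        if self.prev is not None:
            self.toggles += bin(self.prev ^ word).count("1")
        self.prev = word
        self.patterns += 1

    def average_toggles(self):
        """Average number of toggling lines per bus transfer."""
        transfers = self.patterns - 1
        return self.toggles / transfers if transfers > 0 else 0.0


def synthetic_trace(count, start=0x1000, stride=4):
    """Stand-in for a tracer: yields one address at a time (hypothetical)."""
    for i in range(count):
        yield start + i * stride


counter = TransitionCounter()
for pattern in synthetic_trace(4):
    counter.feed(pattern)
```

Because the analyzer only depends on the `feed` interface, the same object could be driven by an instruction-set simulator or by a user-defined external sequence.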

3.1 Correlation Metrics

The correlation metrics we introduce analyze the effects of the principle of locality on the bus streams derived from software profiling. Let \( B(t) \) be the value on the bus lines at time \( t \), and \( S \) the Stride; consecutive addresses satisfy the condition:

\[
B(t) = B(t-1) + S
\]

Given a stream composed of \( N + 1 \) values of bus lines: \( \{ B(0), \ldots, B(N) \} \), we define the metric \( P_{eq} \) as the number of in-sequence addresses over $N$:

$$P_{eq} = \frac{\sum_{t=1}^{N} \delta_{B(t),\, B(t-1)+S}}{N}$$

where $\delta$ is the Kronecker delta. Note that the value of $P_{eq}$ (where $0 \leq P_{eq} \leq 1$) multiplied by 100 represents the percentage of in-sequence addresses. For a given value of $P_{eq}$, the consecutive addresses in the stream can be distributed in many ways, as shown in Figure 4. The extreme cases are: (i) the in-sequence addresses are grouped in $(N P_{eq})$ sets of two consecutive values; (ii) the in-sequence addresses are grouped in a single set of $(N P_{eq})$ consecutive addresses. We define the $\zeta$ metric to take into consideration the distribution of consecutive addresses in the stream:

$$\zeta = \frac{\sum_{t=2}^{N} C(t)}{N}$$

where $C(t)$ is 1 when the condition $C$ holds at time $t$ and 0 otherwise. In practice, the numerator of the $\zeta$ metric is incremented by 1 each time we exit from a sequence of consecutive addresses, that is, when the condition $C$ is satisfied:

$$C: \begin{cases} B(t) \neq B(t-1) + S \\ B(t-1) = B(t-2) + S \end{cases}$$

The values of $\zeta$ can vary from $\frac{1}{N}$ (case ii) to $P_{eq}$ (case i). Given $P_{eq}$, the cases (a), (b) and (c) of Figure 4 provide different values for $\zeta$, while the cases (c), (d) and (e) provide the same value for $\zeta$. The $L$ metric (where $1 \leq L \leq (NP_{eq})$) represents the average length of the in-sequence sets:

$$L = \frac{P_{eq}}{\zeta}$$
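
The three metrics can be computed in one pass over a trace, as in the sketch below. It assumes that an exit is counted whenever an in-sequence address is immediately followed by an out-of-sequence one, so a run that reaches the end of the stream is not counted; the function name and the example stream are illustrative.

```python
def correlation_metrics(stream, stride):
    """Return (p_eq, zeta, avg_len) for a list of bus values B(0..N).

    p_eq    : fraction of in-sequence addresses
    zeta    : exits from in-sequence runs, divided by N
    avg_len : p_eq / zeta, or None when no exit is observed
    """
    n = len(stream) - 1
    if n < 1:
        raise ValueError("need at least two bus values")

    # in_seq[k] is True when B(k+1) == B(k) + stride.
    in_seq = [stream[t] == stream[t - 1] + stride for t in range(1, n + 1)]

    p_eq = sum(in_seq) / n
    exits = sum(
        1 for k in range(1, len(in_seq)) if in_seq[k - 1] and not in_seq[k]
    )
    zeta = exits / n
    avg_len = p_eq / zeta if exits else None
    return p_eq, zeta, avg_len


# Two in-sequence runs (of 2 and 3 addresses) separated by jumps.
p_eq, zeta, avg_len = correlation_metrics([0, 1, 2, 10, 11, 12, 13, 50], 1)
```

For this stream, five of the seven transfers are in sequence and two runs are exited, so the average run length is 2.5 in-sequence addresses.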

4. EXPERIMENTAL RESULTS

The aim of this section is to evaluate and compare the performance of the proposed codes in terms of bus switching activity and power dissipation. The set of benchmark programs used to compare the bus encoding schemes is summarized in Table 2, together with the main characteristics of the selected streams. For each benchmark program, we report the stream length, the percentage of in-sequence addresses, $\zeta$, and $L$ for both data and instruction address streams.

4.1 Transition Activity Results

To evaluate the effectiveness of the proposed encodings, we compared their transition activity, and that of other encoding schemes from the literature, against the transition activity of binary encoding. The results have been derived for both the instruction address streams and the data address streams.
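
The savings figure used in such a comparison can be written down explicitly. The sketch below assumes the saving is the relative reduction in the total number of transitions with respect to binary encoding; the function name and the Gray-coded example stream are illustrative and do not reproduce the paper's data.

```python
def count_transitions(words):
    """Total number of toggling lines over a sequence of bus words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def saving_percent(toggles_code, toggles_binary):
    """Percentage of transitions saved with respect to binary encoding.

    A negative value means the code toggles more lines than binary.
    """
    if toggles_binary == 0:
        raise ValueError("binary stream has no transitions to compare against")
    return 100.0 * (1.0 - toggles_code / toggles_binary)


# Illustrative stream: 16 sequential addresses, binary vs. Gray encoding.
binary_stream = list(range(16))
gray_stream = [a ^ (a >> 1) for a in binary_stream]

gray_saving = saving_percent(
    count_transitions(gray_stream), count_transitions(binary_stream)
)
```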

Overall, none of the analyzed codes is suitable for both instruction and data address streams, due to the different characteristics of the two kinds of streams. The comparison shows that the $T0$-Xor and $Red-BI$ codes are the most effective in terms of transition count for instruction and data address streams, respectively.

Table 3 reports the percentage of transition savings with respect to binary encoding for instruction address streams (the last row shows the percentage saving of each code averaged over the whole set of benchmarks). Since the percentage of in-sequence addresses is high, the savings provided by the $T0$-based codes as well as the Offset-based codes are remarkable with respect to binary. On the contrary, the $BI$-based codes do not provide any advantage over binary. The best results are given by the irredundant $T0$-Xor code (74.11%), which outperforms both the other irredundant schemes (Offset and Offset-Xor) and the redundant schemes ($T0$, $T0-BI$, $T0-Offset$, and $T0-Var$). The irredundant Offset and Offset-Xor codes present significant advantages (56.30% and 41.35%, respectively). The redundant $T0$-based codes offer advantages on the order of 60% versus the binary code. Among them, the best result is obtained by the $T0-Var$ code (64.18% saving); however, the simple $T0$ code (61.64% saving) represents the most suitable solution, due to its reduced cost in terms of redundancy and encoder/decoder logic.

Among the set of benchmark programs, the instruction address streams corresponding to gzip and quicksort present the highest values of the $P_{eq}$ and $L$ correlation metrics (see Table 2). If we consider each column of Table 3 corresponding to $T0$-based and Offset-based codes (which better exploit address locality), we can observe how the percentage transition savings corresponding to gzip and quicksort are higher than the average value over the benchmark set, thus proving the effectiveness of the proposed metrics.

As expected, the $T0$-based codes, as well as the Offset-based codes, do not show significant transition savings (at most 3.6%) for data address streams (the complete results are not reported here for reasons of space). The redundant $BI$-based codes are the most suitable for data address buses. Among them, the $Red-BI$ and $T0-BI$ encodings (which provide savings of 15.55% and 10.05%, respectively) outperform the simple $BI$ encoding (9.6%).

4.2 Power Dissipation Results

The aim of this section is to evaluate whether the power savings achieved through switching activity reduction are offset by the circuitry required to implement the encoding/decoding functions. We analyzed the power consumption when the encoding scheme is used for both on-chip and off-chip buses. The proposed high-performance encoder/decoder architecture has been used to implement the different codes by using the 0.35 $\mu$m, 3.3 V library supplied by ST Microelectronics. Power estimates have been obtained by using Synopsys Design Power at the maximum clock frequency achieved for each code implementation.
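
The trade-off being evaluated can be sketched with a first-order model: the encoding pays off once the bus power saved by the lower transition activity exceeds the power drawn by the encoder and decoder. All names and numbers below are hypothetical, and the transition counts are assumed to be power-consuming transitions per cycle summed over all bus lines.

```python
def bus_power(toggles_per_cycle, c_line, vdd, freq):
    """First-order bus switching power, in watts (assumed model)."""
    return toggles_per_cycle * c_line * vdd ** 2 * freq


def net_saving(t_bin, t_code, c_line, vdd, freq, p_codec):
    """Power saved by encoding after paying for the codec logic, in watts."""
    saved = bus_power(t_bin, c_line, vdd, freq) - bus_power(
        t_code, c_line, vdd, freq
    )
    return saved - p_codec


def break_even_capacitance(t_bin, t_code, vdd, freq, p_codec):
    """Line capacitance above which the encoding gives a net saving."""
    delta = t_bin - t_code
    if delta <= 0:
        return float("inf")  # the code never pays off in this model
    return p_codec / (delta * vdd ** 2 * freq)


# Hypothetical figures: 4 vs. 1 transitions per cycle, 3.3 V, 100 MHz,
# and an encoder/decoder pair dissipating 3.267 mW.
c_star = break_even_capacitance(4.0, 1.0, 3.3, 100e6, 3.267e-3)
gain_at_10pf = net_saving(4.0, 1.0, 10e-12, 3.3, 100e6, 3.267e-3)
```

With these assumed figures the break-even point is about 1 pF per line; below it the codec overhead dominates, which is why the benefit is expected mainly on high-capacitance buses.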

In this paper, we report the power consumption results for different values of off-chip capacitances for the following codes: Binary, $T0$, Offset, and $T0$-Xor. Two different implementations of the binary encoder have been considered: the first one includes registers and pads at the interface, while the second one includes pads only. The power consumption results have been reported in Figure 5 for different values of the bus capacitances (a single set of data has been plotted.
### Table 2: Description of the benchmark set

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>T0</th>
<th>Bus-Invert</th>
<th>CM</th>
<th>T0-Xor</th>
<th>Offset</th>
<th>Offset-Xor</th>
<th>TD</th>
<th>Offset</th>
<th>Off-HI</th>
</tr>
</thead>
<tbody>
<tr>
<td>quicksort</td>
<td>57.3</td>
<td>0.0</td>
<td>67.3</td>
<td>61.4</td>
<td>64.8</td>
<td>57.2</td>
<td>72.1</td>
<td>0.0</td>
<td></td>
</tr>
<tr>
<td>date</td>
<td>58.2</td>
<td>0.0</td>
<td>58.9</td>
<td>71.2</td>
<td>61.7</td>
<td>40.0</td>
<td>59.2</td>
<td>0.0</td>
<td></td>
</tr>
<tr>
<td>gcc</td>
<td>54.5</td>
<td>0.1</td>
<td>53.7</td>
<td>69.9</td>
<td>46.2</td>
<td>38.8</td>
<td>55.6</td>
<td>0.4</td>
<td></td>
</tr>
<tr>
<td>gzip</td>
<td>74.5</td>
<td>0.0</td>
<td>74.4</td>
<td>81.9</td>
<td>76.7</td>
<td>47.5</td>
<td>79.0</td>
<td>0.0</td>
<td></td>
</tr>
<tr>
<td>go</td>
<td>58.1</td>
<td>0.0</td>
<td>64.9</td>
<td>78.9</td>
<td>63.9</td>
<td>43.0</td>
<td>67.1</td>
<td>0.0</td>
<td></td>
</tr>
<tr>
<td>hello</td>
<td>59.1</td>
<td>0.0</td>
<td>59.1</td>
<td>72.2</td>
<td>63.8</td>
<td>41.0</td>
<td>69.0</td>
<td>0.0</td>
<td></td>
</tr>
<tr>
<td>latex</td>
<td>60.5</td>
<td>0.3</td>
<td>56.0</td>
<td>73.3</td>
<td>63.7</td>
<td>49.0</td>
<td>60.0</td>
<td>0.7</td>
<td></td>
</tr>
<tr>
<td>vi</td>
<td>52.4</td>
<td>0.2</td>
<td>51.2</td>
<td>66.0</td>
<td>39.6</td>
<td>39.0</td>
<td>51.6</td>
<td>0.0</td>
<td></td>
</tr>
<tr>
<td>ghost</td>
<td>57.8</td>
<td>0.0</td>
<td>56.4</td>
<td>71.0</td>
<td>50.2</td>
<td>39.0</td>
<td>57.4</td>
<td>0.0</td>
<td></td>
</tr>
<tr>
<td>nsfg</td>
<td>59.5</td>
<td>0.0</td>
<td>56.6</td>
<td>70.4</td>
<td>47.5</td>
<td>36.0</td>
<td>58.8</td>
<td>0.2</td>
<td></td>
</tr>
<tr>
<td>uncompress</td>
<td>68.3</td>
<td>0.0</td>
<td>68.3</td>
<td>70.0</td>
<td>64.3</td>
<td>42.0</td>
<td>64.3</td>
<td>0.0</td>
<td></td>
</tr>
<tr>
<td>Average</td>
<td>61.6</td>
<td>0.0</td>
<td>60.6</td>
<td>74.1</td>
<td>56.3</td>
<td>43.8</td>
<td>62.0</td>
<td>0.18</td>
<td></td>
</tr>
</tbody>
</table>

### Table 3: Instruction address streams: Percentage transition savings with respect to binary encoding for the benchmark set.

![Figure 5: Power dissipation vs. off-chip capacitances for several encoding techniques.](image)

5. FUTURE WORK

As a future evolution of this work, we are devising a methodology to evaluate the benefits of encoding schemes on the power consumption of system-level buses when the target system architecture includes multi-level cache memories and different bus topologies. The cache model considers any cache configuration in terms of size, associativity, and cache line size. Moreover, we are investigating system architectures based on VLIW ASIP processor cores. The system-level framework can be effectively adopted to appropriately configure the memory sub-system and system bus architecture from the power standpoint.

6. REFERENCES


