Embedded Knowledge-based Speech Detectors for Real-Time Recognition Tasks

Sabato M. Siniscalchi¹,³, Fulvio Gennaro¹, Salvatore Andolina¹, Salvatore Vitabile²,⁴, Antonio Gentile¹, and Filippo Sorbello¹,⁴

¹ Dipartimento di Ingegneria Informatica, Università di Palermo
V.le delle Scienze (Edif. 6), 90128 Palermo, Italy

² Dipartimento di Biotecnologie Mediche e Medicina Legale, Università di Palermo
Via del Vespro, 90127 Palermo, Italy

³ Center for Signal and Image Processing, School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, Georgia 30332, USA

⁴ Istituto di CAlcolo e Reti ad alte prestazioni – Consiglio Nazionale delle Ricerche
V.le delle Scienze (Edif. 11), 90128 Palermo, Italy

marco@ece.gatech.edu, {vitabile, gentile, sorbello}@unipa.it

Abstract
Speech recognition has become common in many application domains, from dictation systems for professional practices to vocal user interfaces for people with disabilities or hands-free system control. However, so far the performance of Automatic Speech Recognition (ASR) systems is comparable to that of Human Speech Recognition (HSR) only under very strict working conditions, and is in general much lower. Incorporating acoustic-phonetic knowledge into ASR design has proven to be a viable approach to raising ASR accuracy. Manner of articulation attributes such as vowel, stop, fricative, approximant, nasal, and silence are examples of such knowledge. Neural networks have already been used successfully as detectors for manner of articulation attributes starting from representations of speech signal frames. In this paper the full system implementation is described. The system has a first stage for MFCC extraction followed by a second stage implementing a sinusoidal-based multi-layer perceptron for speech event classification. Implementation details on a Celoxica RC203 board are given.

1. Introduction
In [1] the authors proposed a real-time implementation of a bank of Multi-Layer Perceptrons (MLPs) with sinusoidal activation functions to detect speech attributes, namely fricative, vowel, stop, nasal, approximant, and silence. Within the speech community, these attributes are referred to as manner of articulation events, and they are strongly related to human speech production [2]. Moreover, they show robustness to speech variations [3]. These speech attributes are generated directly from Mel-Frequency Cepstrum Coefficients (MFCCs), and the six detectors perform a mapping from the acoustic domain (MFCCs) to the articulatory domain. The term “mel” denotes a measure of the perceived frequency of a tone, which does not vary linearly with the physical frequency of that tone. A non-linear scale is employed because the human auditory system does not perceive pitch in a linear manner. The mapping between the real frequency scale (Hz) and the perceived frequency scale (mels) is given in formula (1)

\[ F_{mel} = 2595 \log_{10}\left(1 + \frac{F_{Hz}}{700}\right) \] (1)

The mapping is approximately linear below 1 kHz and logarithmic at higher frequencies, and this approximation is usually adopted in speech recognition.
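As a quick illustration of formula (1), the following minimal sketch converts between Hz and mels. It assumes that the logarithm in formula (1) is base 10, as in the usual definition of the mel scale; the function names `hz_to_mel` and `mel_to_hz` are illustrative and do not come from the original design.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a physical frequency in Hz to mels, following formula (1)."""
    # Assumption: the logarithm in formula (1) is base 10.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel: float) -> float:
    """Inverse of formula (1): map a value in mels back to Hz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in Hz become progressively smaller steps in mels.
    for f in (100.0, 500.0, 1000.0, 4000.0, 8000.0):
        print(f"{f:7.1f} Hz -> {hz_to_mel(f):8.1f} mel")
```

With these constants a 1 kHz tone maps to approximately 1000 mels, which is the anchor point of the scale.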

In this paper we propose the chip design for the entire system, aimed at embedded applications. Our interest in the manner of articulation system stems from its role in the Automatic Speech Attribute Transcription (ASAT) project [4], in which a software neural network-based architecture for these manner of articulation attributes was already implemented in [5].

The main idea of the ASAT project is that the performance of conventional knowledge-ignorant modeling approaches can be improved by integrating the knowledge sources available in the large body of speech science literature. In [3] it is shown that directly incorporating acoustic-phonetic knowledge into ASR design raises its accuracy. These “knowledge-based” features (also referred to as speech attributes in the same work) are used to augment the front-end module of a conventional ASR system by means of a set of feature detectors able to capture the speech attributes.

The rest of the paper is organized as follows. Section 2 describes the general framework of the event detector module, which we call knowledge extraction to be consistent with the nomenclature used in [1]. Section 3 describes MFCC extraction, and Section 3.1 its digital implementation. An overview of the digital implementation of the six MLP detectors is given in Section 4. Section 5 presents the experimental set-up and results with comparison to the baseline architecture. Concluding remarks are given in the last section of the paper to summarize its main contributions.

2. Knowledge Extraction Module

The Knowledge Extraction (KE) module uses a frame-based approach to provide K manner of articulation attributes \( A_i \), where \( i=1,2, \ldots K \), from an input speech signal \( s(t) \). In this paper the manner classes were chosen as in [6], and are listed in Table 1.

The KE module, depicted in Figure 1, is composed of two fundamental blocks: the feature extraction module (FE), and the attribute scoring module (SC). The FE module consists of a bank of K feature extraction blocks \( FE_i \), where \( i=1,2, \ldots K \), and it maps a speech waveform into a sequence of speech parameter vectors \( Y_i \), \( i=1,2, \ldots K \). Each \( FE_i \) is fed by the same speech waveform \( s(t) \), and for each speech frame it computes a 13-dimensional MFCC feature vector \( X_i \) (12 MFCCs + energy). The frame length is 30 ms, with a 20 ms overlap between consecutive frames. Finally, \( FE_i \) outputs a 117-feature vector \( Y_i \) that combines the current frame with the eight surrounding frames, 4 frames before and 4 after, so that each speech parameter vector represents nine frames.
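The construction of the 117-feature vector from nine 13-dimensional frames can be sketched as follows. This is only an illustration: the text does not say how the first and last frames of an utterance are handled, so clamping to the edge frames is an assumption, and the name `stack_context` is hypothetical.

```python
from typing import List

FEATURES_PER_FRAME = 13  # 12 MFCCs + energy
CONTEXT = 4              # frames taken on each side of the current frame


def stack_context(frames: List[List[float]], t: int,
                  context: int = CONTEXT) -> List[float]:
    """Concatenate frame t with `context` frames before and after it."""
    last = len(frames) - 1
    stacked: List[float] = []
    for k in range(t - context, t + context + 1):
        # Assumption: indices outside the utterance are clamped to its edges.
        stacked.extend(frames[min(max(k, 0), last)])
    return stacked


# Toy utterance: frame t is filled with the value t, so the order is visible.
frames = [[float(t)] * FEATURES_PER_FRAME for t in range(20)]
y = stack_context(frames, 10)
print(len(y))  # 9 frames x 13 features = 117
```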

The SC module is composed of six feed-forward neural networks, and its goal is to attach a score, referred to as knowledge score \( KS_i \), to each vector \( Y_i \). The input of each network is nine frames of 12 MFCCs + energy, so that the input layer has 117 nodes. The output layer has two nodes, one for the desired class and one for the anti-class. The value obtained for the desired class for case \( i \) is defined to be \( KS_i \).

Table 1. Manner of articulation attribute definition

<table>
<thead>
<tr>
<th>Articulation Manner</th>
<th>Class Elements</th>
<th>Anti-Class Elements</th>
</tr>
</thead>
<tbody>
<tr>
<td>Vowel</td>
<td>IY, IH, ER, EH, AE, AA, AH</td>
<td>HH, EL, SIL</td>
</tr>
<tr>
<td>Fricative</td>
<td>TH, V, AX</td>
<td>CH, S, Z, SH</td>
</tr>
<tr>
<td>Stop</td>
<td>T, K, DX, M, N, NG</td>
<td>TH, V, AX, IX, IH, CH, S</td>
</tr>
<tr>
<td>Nasal</td>
<td>EN, L, R, W, Y, HH</td>
<td>EL, SIL</td>
</tr>
<tr>
<td>Silence</td>
<td>SIL, EN, L, R, W, Y, HH</td>
<td>EL, SIL</td>
</tr>
<tr>
<td>Approximant (App.)</td>
<td>L, R, W, Y, EL</td>
<td>AH, AO, OY, OW, UH, UW, ER, AX</td>
</tr>
</tbody>
</table>

![Fig. 1. Knowledge Extraction Module, adapted from [6]. The detectors are based on an MLP neural network.](image)

The waveform of the input utterance is partitioned into a sequence of consecutive frames using windowing analysis. For each frame, the vector of mel frequency cepstrum coefficients is extracted from the frame samples. The resulting sequence of feature vectors represents the input utterance.

The general form of this filter bank is illustrated in Figure 2. As can be seen, the filters are triangular; they are not equally spaced along the linear frequency axis but along the mel scale, which is defined by equation (1).

Fig. 2. Triangular weighted functions in frequency domain.

The block diagram of the entire process is depicted in Fig. 3.

Fig. 3. Block diagram of the entire MFCC extraction module.

A description of each individual step is given below.

Step 1: Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M samples (M < N). This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 ms of windowing and facilitates the fast radix-2 FFT) and M = 100.
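Frame blocking can be sketched in a few lines. The sketch below uses the typical values N = 256 and M = 100 quoted above; zero-padding the last frame is an assumption, since the text only states that all the speech must be covered, and `block_frames` is an illustrative name.

```python
from typing import List, Sequence


def block_frames(signal: Sequence[float], n: int = 256,
                 m: int = 100) -> List[List[float]]:
    """Split `signal` into frames of n samples whose starts are m samples apart."""
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    frames: List[List[float]] = []
    start = 0
    while start < len(signal):
        frame = list(signal[start:start + n])
        # Assumption: the final, incomplete frame is padded with zeros.
        frame.extend([0.0] * (n - len(frame)))
        frames.append(frame)
        if start + n >= len(signal):
            break  # every sample is now covered by at least one frame
        start += m
    return frames


signal = [float(i) for i in range(1000)]
frames = block_frames(signal)
print(len(frames), len(frames[0]))
```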

Step 2: Windowing

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. If we define the window as w(n) with 0 ≤ n ≤ N−1, where N is the number of samples in each frame, then the result of windowing is the signal

\[ y_i(n) = x_i(n)w(n), \quad 0 \leq n \leq N - 1 \]  
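The windowing step is a sample-by-sample product. The text does not name the window that is used, so the sketch below assumes a Hamming window, which is a common choice in speech front-ends; `hamming` and `apply_window` are illustrative names.

```python
import math
from typing import List, Sequence


def hamming(n_samples: int) -> List[float]:
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), 0 <= n <= N - 1."""
    if n_samples < 2:
        raise ValueError("window needs at least two samples")
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(frame: Sequence[float], window: Sequence[float]) -> List[float]:
    """Compute y(n) = x(n) w(n) for one frame."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [x * w for x, w in zip(frame, window)]


w = hamming(256)
y = apply_window([1.0] * 256, w)
print(f"first={y[0]:.3f} middle={y[128]:.3f} last={y[-1]:.3f}")
```

The window tapers both ends of the frame towards a small value, which reduces the discontinuities mentioned above.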

Step 3: Fast Fourier Transform (FFT)

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT), which is defined on the set of N samples \( \{x_n\} \) as follows:

\[ X_n = \sum_{k=0}^{N-1} x_k e^{-2\pi jkn/N}, \quad n = 0, 1, 2, ..., N - 1 \]  
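To make the relation between the DFT definition and the FFT concrete, the sketch below implements the sum above directly and compares it with a textbook recursive radix-2 decimation-in-time FFT. It is a software illustration only and is not meant to reflect the structure of the hardware FFT in the prototype.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Direct evaluation of X_n = sum_k x_k exp(-2 pi j k n / N)."""
    n_samples = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
                for k in range(n_samples))
            for n in range(n_samples)]


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 FFT; the length must be a power of two."""
    n_samples = len(x)
    if n_samples < 1 or n_samples & (n_samples - 1):
        raise ValueError("length must be a power of two")
    if n_samples == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    half = n_samples // 2
    out = [0j] * n_samples
    for n in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * n / n_samples) * odd[n]
        out[n] = even[n] + twiddle
        out[n + half] = even[n] - twiddle
    return out


x = [float(i % 7) for i in range(16)]
max_diff = max(abs(a - b) for a, b in zip(dft(x), fft(x)))
print(f"largest difference between DFT and FFT: {max_diff:.2e}")
```

Both routines return the same spectrum up to rounding error; the FFT needs on the order of N log N operations instead of N², which is why N = 256 is a convenient frame length.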

Step 4: Mel-frequency Wrapping

One way to simulate the human auditory system is to process the spectrum \( S(\omega) \) of \( X_n \) with a filter bank spaced uniformly on the mel scale (see Figure 2). Each filter in the bank has a triangular bandpass frequency response, and both the spacing and the bandwidth are determined by a constant mel frequency interval. The number of mel spectrum coefficients, \( K \), is typically 20.
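A triangular filter bank of this kind can be sketched as follows. The sketch assumes K = 20 filters, a 256-point FFT, a 16 kHz sampling rate, a base-10 logarithm in formula (1), and filters that span 0 Hz to the Nyquist frequency with each filter's edges at the centres of its neighbours; none of these layout details is specified in the text, and all function names are illustrative.

```python
import math
from typing import List, Sequence


def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel: float) -> float:
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def mel_filter_bank(num_filters: int = 20, n_fft: int = 256,
                    sample_rate: float = 16000.0) -> List[List[float]]:
    """Return one triangular weight vector per filter over bins 0..n_fft/2."""
    num_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = [mel_to_hz(mel_max * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bank: List[List[float]] = []
    for j in range(1, num_filters + 1):
        low, centre, high = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        weights = []
        for b in range(num_bins):
            f = b * sample_rate / n_fft  # centre frequency of FFT bin b
            if low < f <= centre:
                weights.append((f - low) / (centre - low))    # rising slope
            elif centre < f < high:
                weights.append((high - f) / (high - centre))  # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mel_spectrum(power_spectrum: Sequence[float],
                 bank: Sequence[Sequence[float]]) -> List[float]:
    """Weight the power spectrum with each triangular filter and sum."""
    return [sum(w * p for w, p in zip(weights, power_spectrum))
            for weights in bank]


bank = mel_filter_bank()
flat = mel_spectrum([1.0] * 129, bank)
print(len(bank), len(bank[0]))
```

Because the filters are equally wide in mels, they become wider in Hz as the centre frequency increases, so a flat power spectrum yields larger outputs for the higher filters.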

Step 5: Cepstrum

In this final step, we convert the log mel spectrum back to the time domain. The result is called the mel frequency cepstrum coefficients (MFCCs). Because the mel spectrum coefficients (and therefore their logarithms) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that result from the previous step by \( \tilde{S}_k, k = 1, 2, ..., K \), we can calculate the MFCCs (\( \tilde{c}_n \)) as

\[ \tilde{c}_n = \sum_{k=1}^{K} (\log \tilde{S}_k) \cos \left( n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right), \quad n = 1, 2, ..., K \]  
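The cepstrum formula translates almost literally into code. The sketch below assumes a natural logarithm and keeps only the first 12 coefficients, matching the 12 MFCCs used elsewhere in the paper; `mfcc_from_mel_spectrum` is an illustrative name.

```python
import math
from typing import List, Sequence


def mfcc_from_mel_spectrum(mel_power: Sequence[float],
                           num_coeffs: int = 12) -> List[float]:
    """Apply the DCT of the formula above to the log mel power spectrum."""
    k_total = len(mel_power)
    # Assumption: natural logarithm; inputs must be strictly positive.
    log_s = [math.log(s) for s in mel_power]
    return [sum(log_s[k - 1] * math.cos(n * (k - 0.5) * math.pi / k_total)
                for k in range(1, k_total + 1))
            for n in range(1, num_coeffs + 1)]


# A perfectly flat mel spectrum carries no spectral shape information,
# so all of its cepstral coefficients for n >= 1 vanish.
flat = mfcc_from_mel_spectrum([math.e] * 20)
tilted = mfcc_from_mel_spectrum([float(k) for k in range(1, 21)])
print(len(flat), max(abs(c) for c in flat))
```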

3.1. Implementation on FPGA

The front-end has been implemented and prototyped on a Celoxica RC203 board equipped with a Virtex II XC2V3000-4 FPGA donated by Xilinx. The extractor is described using a C-like hardware description language, Handel-C, developed by the Oxford Hardware Compilation Group at the University of Oxford (UK).

The current prototype runs at a rather low operating frequency of 12.5 MHz, operating on 10 ms windows and requiring 2.1 ms per input frame.

4. Feed Forward Neural Network digital design

In [7] an efficient MLP digital implementation for road signs recognition and high energy physics experiments classification has been proposed. This initial design has been adapted and optimized for automatic speech classification and is presented in this section.

A single MLP digital architecture is used to implement each of the detectors shown in Figure 1. As depicted in Figure 4, the architectural design aims at high design modularity, high neuron density on the device, and high recognition rate and speed. As a result, (a) data input is serial; (b) data processing is parallel across neurons and serial within each neuron; (c) second-layer processing is pipelined with first-layer processing. The Winner-Takes-All (WTA) circuit selects, among a set of \( m \) values, the unit with the greatest activation level.

![Fig. 4. Functional block diagram of the MLP architecture](image)

The basic digital neural network elements, such as multipliers and accumulators, are designed following standard solutions. The output activation function is linear, whereas a sinusoidal activation function is used in the hidden layer. Fixed-point arithmetic with two's complement representation is used for the chip implementation of the MLP. The principal constraints of this project are the trade-off between neural network accuracy and the bit depth of the input and weight data, and the trade-off between neural network accuracy and the bit depth of the pre-synaptic and post-synaptic values of the hidden activation function.
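The accuracy versus bit-depth trade-off can be explored in software before synthesis. The following sketch quantizes a real value to a two's complement fixed-point code; the 16-bit word with 8 fractional bits is purely hypothetical, since the paper does not report the bit depths that were chosen, and all function names are illustrative.

```python
def to_fixed(value: float, total_bits: int = 16, frac_bits: int = 8) -> int:
    """Quantize `value` to a signed fixed-point integer code, with saturation."""
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def to_float(code: int, frac_bits: int = 8) -> float:
    """Recover the real value represented by a fixed-point code."""
    return code / (1 << frac_bits)


def twos_complement_bits(code: int, total_bits: int = 16) -> str:
    """Bit pattern of `code` in two's complement, most significant bit first."""
    return format(code & ((1 << total_bits) - 1), f"0{total_bits}b")


weight = -0.3
code = to_fixed(weight)
print(code, twos_complement_bits(code), to_float(code))
```

Halving the number of fractional bits doubles the worst-case rounding error, which is the effect that has to be balanced against the hardware cost of wider multipliers and accumulators.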

5. Experiments and results

The evaluation of the proposed Manner of Articulation Extraction module was performed on the TIMIT Acoustic-Phonetic Continuous Speech Corpus database [8], which is a well-known speech corpus in the speech recognition field. This database is composed of a total of 6300 sentences; it has a one-channel, 16-bit linear sampling format, and it was sampled at 16000 samples/s. The MLP detectors were trained on 3504 randomly selected utterances, and to be consistent with [3] and [9] the four phones “cl”, “vcl”, “epi”, and “sil” were treated as a single class, thus reducing the TIMIT phone set to a set of 45 context-independent (CI) phones. The front-end module is in the process of being implemented following the guidelines given in [10]. The max module, by contrast, is a simple comparator circuit. The MLP module is the focus of this work, and a detailed description is given in what follows.

Each of the six detectors is a three-layer network whose input is a window of nine frames, that is, 117 parameters. The hidden layer has 100 nodes. The output layer contains two units, and a simple linear activation function is used. Finally, the max module applies a max function to the KS outputs in order to compute the overall confusion matrix.
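The data flow through one detector and the max module can be sketched in floating point as follows. The weights are random placeholders rather than trained values, the first output node is assumed to be the class node that provides the knowledge score, and all names (`detector_forward`, `classify_frame`, `MANNERS`) are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple

MANNERS = ("vowel", "fricative", "stop", "nasal", "approximant", "silence")
# (hidden weights, hidden biases, output weights, output biases)
Detector = Tuple[List[List[float]], List[float], List[List[float]], List[float]]


def detector_forward(y: Sequence[float], detector: Detector) -> List[float]:
    """117 inputs -> 100 sinusoidal hidden units -> 2 linear outputs."""
    w_hidden, b_hidden, w_out, b_out = detector
    hidden = [math.sin(sum(w * x for w, x in zip(row, y)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_out, b_out)]


def random_detector(rng: random.Random, n_in: int = 117,
                    n_hidden: int = 100, n_out: int = 2) -> Detector:
    """Placeholder detector with small random weights (not trained)."""
    def matrix(rows: int, cols: int) -> List[List[float]]:
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return (matrix(n_hidden, n_in), [0.0] * n_hidden,
            matrix(n_out, n_hidden), [0.0] * n_out)


def classify_frame(y: Sequence[float],
                   detectors: Sequence[Detector]) -> Tuple[str, List[float]]:
    """Max module: pick the manner whose detector gives the largest KS."""
    # Assumption: output node 0 is the class node, node 1 the anti-class node.
    scores = [detector_forward(y, d)[0] for d in detectors]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return MANNERS[best], scores


rng = random.Random(0)
detectors = [random_detector(rng) for _ in MANNERS]
y = [rng.uniform(-1.0, 1.0) for _ in range(117)]
label, scores = classify_frame(y, detectors)
print(label, [round(s, 3) for s in scores])
```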

As previously stated, the detectors work in a frame-based paradigm, so their performance was evaluated in terms of frame error rate. Each frame was classified according to the neural network with the largest output value.

Table 2. Confusion matrix (%) for the manner of articulation attributes

<table>
<thead>
<tr>
<th>%</th>
<th>Vowel</th>
<th>Fricative</th>
<th>Stop</th>
<th>Nasal</th>
<th>App.</th>
<th>Silence</th>
</tr>
</thead>
<tbody>
<tr>
<td>Vow.</td>
<td>89.85</td>
<td>1.38</td>
<td>1.53</td>
<td>1.26</td>
<td>4.64</td>
<td>0.19</td>
</tr>
<tr>
<td>Fric.</td>
<td>3.16</td>
<td>87.02</td>
<td>5.53</td>
<td>1.02</td>
<td>0.89</td>
<td>1.24</td>
</tr>
<tr>
<td>Stop</td>
<td>6.32</td>
<td>7.41</td>
<td>79.89</td>
<td>1.71</td>
<td>1.57</td>
<td>1.96</td>
</tr>
<tr>
<td>Nas.</td>
<td>9.65</td>
<td>2.44</td>
<td>3.25</td>
<td>81.04</td>
<td>2.20</td>
<td>0.90</td>
</tr>
<tr>
<td>App.</td>
<td>30.82</td>
<td>2.88</td>
<td>3.26</td>
<td>2.74</td>
<td>58.07</td>
<td>1.19</td>
</tr>
<tr>
<td>Sil.</td>
<td>1.10</td>
<td>1.09</td>
<td>1.88</td>
<td>0.61</td>
<td>0.58</td>
<td>94.21</td>
</tr>
</tbody>
</table>

The global confusion matrix for the manner of articulation attributes is given in Table 2. The (p, q)-th element of the confusion matrix measures the rate of the p-th attribute being classified into the q-th class.
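For reference, a row-normalized confusion matrix and the frame error rate can be computed from per-frame labels as sketched below. The labels in the example are toy data chosen for illustration and are unrelated to the TIMIT results; the function names are hypothetical.

```python
from typing import List, Sequence


def confusion_matrix_percent(true_labels: Sequence[int],
                             predicted_labels: Sequence[int],
                             num_classes: int) -> List[List[float]]:
    """Element (p, q) is the percentage of class-p frames classified as q."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for p, q in zip(true_labels, predicted_labels):
        counts[p][q] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


def frame_error_rate(true_labels: Sequence[int],
                     predicted_labels: Sequence[int]) -> float:
    """Percentage of frames whose predicted class differs from the true one."""
    errors = sum(1 for p, q in zip(true_labels, predicted_labels) if p != q)
    return 100.0 * errors / len(true_labels)


# Toy example with two classes and six frames.
true_labels = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 1, 0]
matrix = confusion_matrix_percent(true_labels, predicted, 2)
print(matrix)  # [[75.0, 25.0], [50.0, 50.0]]
print(frame_error_rate(true_labels, predicted))
```

Each row sums to 100%, as do the rows of Table 2 up to rounding.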

The digital version of the Knowledge-based Automatic Speech Classifier is implemented on a Celoxica RC203 board [11] equipped with a Xilinx Virtex-II XC2V3000-4 FPGA. The neural architectures were described in VHDL and synthesized using the Xilinx ISE 6.3 tools.

In accordance with the results reported in [7], the number of hidden virtual neurons for each of the MLPs has been fixed to 10, which represents the best trade-off between execution time and allocated resources. The above MLP digital implementation requires 1187 cycles and, consequently, 0.0236 ms for its execution. Combined with the roughly 2 ms execution time of the front-end, the total per-frame time clearly allows for real-time execution.
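The timing budget can be checked with a few lines of arithmetic. The sketch below takes the 2.1 ms front-end time from Section 3.1 and assumes that a new frame arrives every 10 ms (a 30 ms frame with a 20 ms overlap); the MLP clock frequency is not stated in the text, so the value computed here is only what the reported cycle count and execution time imply.

```python
MLP_CYCLES = 1187          # cycles per MLP evaluation (reported)
MLP_TIME_MS = 0.0236       # MLP execution time (reported)
FRONT_END_TIME_MS = 2.1    # front-end time per frame (Section 3.1)
FRAME_PERIOD_MS = 10.0     # assumption: one new frame every 10 ms

# Clock frequency implied by the reported cycle count and time (an inference).
implied_clock_mhz = MLP_CYCLES / (MLP_TIME_MS * 1e-3) / 1e6

total_time_ms = FRONT_END_TIME_MS + MLP_TIME_MS
real_time_margin = FRAME_PERIOD_MS / total_time_ms

print(f"implied MLP clock: {implied_clock_mhz:.1f} MHz")
print(f"total per-frame time: {total_time_ms:.4f} ms")
print(f"margin with respect to the frame period: {real_time_margin:.1f}x")
```

Under these assumptions the front-end dominates the per-frame time, and the whole chain uses only a fraction of the frame period.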

Table 3 illustrates the synthesis report for the MFCC Extractor Module, for the entire scoring module and the total allocated resources required by the entire system. It is easy to see that the chosen configuration for each MLP allows the implementation of the 6 detectors in a single FPGA.

Table 3. Synthesis report for the MFCC Extractor Module, for the entire scoring module and the total allocated resources required by the entire system

<table>
<thead>
<tr>
<th>Module</th>
<th>Slices</th>
<th>FFs</th>
<th>LUTs</th>
<th>RAMs</th>
</tr>
</thead>
<tbody>
<tr>
<td>Available Resources</td>
<td>14336</td>
<td>28672</td>
<td>28672</td>
<td>96</td>
</tr>
<tr>
<td>MFCC Extractor</td>
<td>6439 (44.9%)</td>
<td>1319 (4.6%)</td>
<td>11205 (39.1%)</td>
<td>3 (3.1%)</td>
</tr>
<tr>
<td>MLP scoring module</td>
<td>4830 (33.7%)</td>
<td>4058 (14.2%)</td>
<td>8234 (28.7%)</td>
<td>60 (62.5%)</td>
</tr>
<tr>
<td>Total Resources</td>
<td>11269 (78.6%)</td>
<td>5377 (18.8%)</td>
<td>19439 (67.8%)</td>
<td>63 (65.6%)</td>
</tr>
</tbody>
</table>
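The figures in Table 3 can be cross-checked with simple arithmetic. The sketch below reads the numeric rows of the table, in order, as available resources, MFCC Extractor, MLP scoring module, and total; this reading is an interpretation, adopted because it is the one under which every reported percentage and every total is consistent.

```python
AVAILABLE = {"slices": 14336, "ffs": 28672, "luts": 28672, "rams": 96}
MFCC = {"slices": 6439, "ffs": 1319, "luts": 11205, "rams": 3}
MLP = {"slices": 4830, "ffs": 4058, "luts": 8234, "rams": 60}

# Percentages as printed in Table 3.
REPORTED = {
    "mfcc": {"slices": 44.9, "ffs": 4.6, "luts": 39.1, "rams": 3.1},
    "mlp": {"slices": 33.7, "ffs": 14.2, "luts": 28.7, "rams": 62.5},
    "total": {"slices": 78.6, "ffs": 18.8, "luts": 67.8, "rams": 65.6},
}


def utilization(used: dict) -> dict:
    """Percentage of the available device resources taken by `used`."""
    return {k: 100.0 * used[k] / AVAILABLE[k] for k in AVAILABLE}


TOTAL = {k: MFCC[k] + MLP[k] for k in AVAILABLE}
computed = {"mfcc": utilization(MFCC), "mlp": utilization(MLP),
            "total": utilization(TOTAL)}

# Largest gap between a computed and a reported percentage.
worst = max(abs(computed[m][k] - REPORTED[m][k])
            for m in REPORTED for k in AVAILABLE)
print(TOTAL)
print(f"largest deviation from the reported percentages: {worst:.3f}")
```

The sum of the two modules reproduces the total row exactly, and every computed percentage agrees with the reported one to within rounding.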

Implementation results on the FPGA show that the use of sinusoidal activation functions decreases hardware resource usage by more than 50% for slices, FFs, and LUTs, and by more than 35% for FPGA RAM, when compared with the standard sigmoid-based neuron implementation. Furthermore, neuron virtualization allows for a significant decrease in concurrent memory accesses, resulting in improved performance for the entire attribute scoring module [7].

6. Summary

The performance of Automatic Speech Recognition (ASR) systems is comparable to that of Human Speech Recognition (HSR) only under very strict working conditions, and is in general far lower. Incorporating acoustic-phonetic knowledge into ASR design has proven to be a viable approach to raising ASR accuracy. Manner of articulation attributes such as vowel, stop, fricative, approximant, nasal, and silence are examples of such knowledge. Neural networks have already been used successfully as detectors for manner of articulation attributes starting from representations of speech signal frames.

The preliminary experimental results offer good evidence of the real-time capability of the system, and they demonstrate that it can be implemented on embedded devices as part of full speech recognition systems.

In this paper a set of embedded knowledge-based speech detectors for real-time execution has been described. The system has a first stage for MFCC extraction followed by a second stage implementing a sinusoidal-based multi-layer perceptron for speech event classification. Implementation details on a Celoxica RC203 board have been given.

Execution time for the entire system is slightly above 2 ms per frame and allows for real-time speech event classification on embedded devices.

Research is currently underway to incorporate the remaining stages needed for a full large-dictionary speech recognition embedded IP engine.

7. References


