Influence of Noisy Environment on the Speech Recognition Rate Based on the Altera FPGA

This paper introduces an approach to studying the effect of different levels of environmental noise on the recognition rate of a speech recognition system that uses no filtering to deal with this issue. This is achieved by implementing an embedded SoPC (System on a Programmable Chip) with an Altera Nios II processor for real-time speech recognition. Mel Frequency Cepstral Coefficients (MFCCs) were used for speech signal feature extraction (observation vectors). The observation vectors were modeled with a Gaussian Mixture Model (GMM), which was passed to a Hidden Markov Model (HMM); the HMM processes the GMM statistically as a probabilistic model to make the decision on utterance recognition, whether for single or composite words of one or more syllables. The framework was implemented on an Altera Cyclone II EP2C70F896C6N FPGA chip on the Altera DE2-70 Development Board. Each word model (template) is stored as a transition matrix, diagonal covariance matrices, and mean vectors in system memory; each word model occupies only 4.45 Kbytes regardless of the spoken word length. The recognition rate for the trained words (digit/0 to digit/10) was 100% for the individual speaker. Tests were conducted at different sound levels of the surrounding environment (53 dB to 73 dB) as measured by a Sound Level Meter (SLM) instrument.


INTRODUCTION
Voice signals convey numerous discriminative features that can be used to identify speakers. Speech contains significant energy ranging from zero frequency up to around 5 kHz [1]. The objective of most speech recognition systems is to extract, characterize, and recognize the information carried by the uttered words.
One of the most important feature extraction methods used in speech applications is the Mel Frequency Cepstral Coefficient (MFCC). It is a non-parametric method for modeling the human auditory system. In recent speech recognition studies, MFCC parameters have achieved better recognition accuracy than other features [2]. MFCCs form random multi-dimensional continuous feature vectors; the most important class of finite mixture densities for representing such continuous feature/observation vectors as statistical distributions is the Gaussian mixture.
The Gaussian Mixture Model (GMM) is one solution for fitting the continuous feature/observation vectors (MFCCs) of each uttered word; the GMM parameters for a word are estimated from a set of MFCCs extracted from training speech signals. To obtain the optimal GMM parameters, the iterative Expectation-Maximization (EM) algorithm is used to compute the Maximum Likelihood (ML) estimate [3][4].
The Hidden Markov Model (HMM) is a probabilistic model used in most speech recognition systems, with a high recognition rate and good anti-noise performance [5]. One topology used for speech recognition is the so-called left-to-right HMM structure. A continuous HMM was used in this work, so the observations are characterized as continuous signals (vectors): the measured MFCCs form a multidimensional vector space for the speech signals. The observation PDF is a multivariate Gaussian PDF, defined by a mean vector (μ) and a covariance matrix (σ) [4].

SPEECH PRE-PROCESSING
The voice signal must be conditioned before further analysis can extract its features. Figure (1) illustrates the pre-processing functions. According to the sampling theorem, the sampling frequency cannot be smaller than twice the bandwidth of the signal; to capture more information and give a high recognition rate, 16 kHz was chosen as the sampling frequency, since the bandwidth of the speech signal is not higher than 4 kHz [6].
The analog-to-digital (A/D) converter quantizes each discrete sample x(n), n = 0, 1, …, N-1, into a specific number (a digital representation of the sample). In this work the A/D is a Sigma-Delta converter with 24-bit resolution, provided as a dedicated chip on the Altera DE2-70 Development Board.
The microphone and A/D converter may add a DC offset voltage to the output signal. This offset should be removed from the digital data by subtracting the mean value of the signal, computed over a limited time period, from all samples; in this work a period of 1.5 sec was chosen. Removing any constant DC level is essential, particularly when determining the start and end points of the uttered voice. Pre-emphasis is a digital high-pass filter governed by Eq. 1:

S(n) = X(n) - a·X(n-1), 0 ≤ n < L … (1)

where S(n) represents the signal that has been processed with pre-emphasis, X(n) represents the original signal, and L is the length of each audio frame (number of samples). The pre-emphasis filter is used to compensate for the decayed speech spectrum [7]. The coefficient a is typically chosen within 0.95 to 1.0 and reflects the degree of pre-emphasis [6]. A typical value is a = 0.95, which gives rise to more than 20 dB of amplification of the high-frequency spectrum [8].
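The two steps above can be sketched as follows; this is a minimal illustration, with `pre_process` a hypothetical helper name and the first-order filter S(n) = X(n) - a·X(n-1) assumed as the pre-emphasis form:

```python
import numpy as np

def pre_process(x, a=0.95):
    """Remove the DC offset, then apply the pre-emphasis high-pass
    filter S(n) = X(n) - a*X(n-1).

    `a` (the pre-emphasis coefficient) is typically chosen in 0.95-1.0.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()            # subtract the mean to remove the DC offset
    s = np.empty_like(x)
    s[0] = x[0]
    s[1:] = x[1:] - a * x[:-1]  # first-order high-pass (pre-emphasis)
    return s
```

In practice the mean would be estimated over the 1.5 sec acquisition window mentioned above rather than the whole recording.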
End-Point Detection (EPD) is the process of removing the unwanted parts of the recorded sound signal (silence and background-noise segments) and keeping only the speech; as a result, the amount of data to process decreases and the computation speeds up [7]. Many algorithms have been adopted to detect the boundaries of a speech segment: time-domain, frequency-domain, and mixed-parameter EPDs. This work focused on time-domain EPD, which includes two algorithms for determining the end points of the speech segment based on short-time analysis of the signal. One is Short-Time Energy (STE): the amplitude of unvoiced segments is generally much lower than the amplitude of voiced segments, and the short-time energy of the speech signal provides a convenient representation that reflects these amplitude variations [9]. The other is the Short-Time Zero Crossing Rate (STZCR), which counts how many times the signal crosses the time axis during the short-time frame; any DC offset voltage must be removed from the speech signal before the STZCR is computed. Eqs. 2 and 3 are used to compute the STE and STZCR respectively.
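A minimal sketch of the two short-time measures and a toy time-domain EPD follows. The thresholds and the frame-by-frame decision rule are illustrative assumptions, not the paper's values:

```python
import numpy as np

def short_time_energy(frame):
    """STE (Eq. 2 style): sum of squared samples in one frame."""
    frame = np.asarray(frame, dtype=float)
    return np.sum(frame ** 2)

def short_time_zcr(frame):
    """STZCR (Eq. 3 style): number of sign changes across the frame.
    Assumes any DC offset has already been removed."""
    frame = np.asarray(frame, dtype=float)
    return int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))

def detect_endpoints(x, frame_len, energy_thr, zcr_thr):
    """Toy time-domain EPD: keep a frame as speech when its energy exceeds
    `energy_thr` (voiced) or its ZCR exceeds `zcr_thr` (unvoiced)."""
    keep = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        f = x[start:start + frame_len]
        if short_time_energy(f) > energy_thr or short_time_zcr(f) > zcr_thr:
            keep.append((start, start + frame_len))
    return keep
```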

SPEECH SIGNAL FEATURE EXTRACTION
In this work, the MFCC was chosen; it is based on the human peripheral auditory system. Human perception of the frequency content of sounds does not follow a linear scale. Thus for each tone with an actual frequency f measured in Hz, a subjective pitch is measured on a scale called the "Mel scale" (Mel, an abbreviation of the word melody, is a unit of pitch [4]). The Mel frequency scale is linear below 1000 Hz and logarithmic above 1 kHz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 Mels [1]. Equation 4 is the approximate formula used to compute the Mels for a given frequency f in Hz [2].
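The body of Eq. 4 did not survive extraction; the sketch below assumes the common 2595·log10(1 + f/700) form of the Mel approximation, which matches the 1 kHz = 1000 Mels reference point stated above:

```python
import math

def hz_to_mel(f):
    """Approximate Mel-scale mapping (Eq. 4 style):
    Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)
```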
The method used to compute the MFCCs in this work can be summarized as in Figure (2). First, the pre-processed signal is blocked into overlapping frames using a Hamming window, which has the form [5]:

w(n) = 0.54 - 0.46 cos(2πn / (L-1)), 0 ≤ n ≤ L-1 … (5)

in which L is the length of the frame. The purpose of the Hamming window is to reduce the effect of discontinuities at both ends of every frame [7]. In this work a window size of 256 samples and a frame rate of 100 Hz were chosen; at a sample rate of 16 kHz this corresponds to a 16 msec window with a 10 msec frame shift. For each Hamming-windowed frame, the magnitude of the 512-point FFT (Fast Fourier Transform), |X(k)|, is computed. Each magnitude |X(k)| is then scaled in both frequency and magnitude; the frequency axis is scaled logarithmically using the Mel-frequency filter bank according to Eq. 6 [10].
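The framing and spectrum step above can be sketched as follows, using the paper's stated parameters (16 kHz sampling, 256-sample window, 100 Hz frame rate, 512-point FFT); `frame_spectra` is a hypothetical helper name:

```python
import numpy as np

FS = 16000   # sample rate (Hz), as in the paper
WIN = 256    # window size: 256 samples = 16 ms
HOP = 160    # 100 Hz frame rate -> 10 ms frame shift
NFFT = 512   # 512-point FFT

def frame_spectra(signal):
    """Split the signal into overlapping Hamming-windowed frames and
    return the 512-point FFT magnitude |X(k)| of each frame."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - WIN) // HOP
    window = np.hamming(WIN)  # w(n) = 0.54 - 0.46*cos(2*pi*n/(L-1))
    mags = np.empty((n_frames, NFFT // 2 + 1))
    for t in range(n_frames):
        frame = signal[t * HOP : t * HOP + WIN] * window
        mags[t] = np.abs(np.fft.rfft(frame, NFFT))
    return mags
```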
For m = 1, 2, …, M, where M is the number of filter banks and M << N, the Mel filter bank is a set of triangular filters defined by their center frequencies f(m), as in Eq. 7 [10]. The center frequencies of the filter banks are computed according to Eq. 4, which is approximately linear for frequencies below 1 kHz and non-linear for frequencies above 1 kHz, as shown in Figure (3).
According to this approximation and MATLAB Auditory toolbox [11], the parameters of filter banks are written as in Table 1.
Finally, the MFCCs are obtained by calculating the Discrete Cosine Transform (DCT) of X (m) using Eq. 8 [12].
The typical values of d are 0 ≤ d < 9 or 0 ≤ d < 12 coefficients [8]. Notice that MFCC(0) is equivalent to the log energy of the frame. In this work the range 0 ≤ d < 12 was chosen; Eq. 8 thus yields the frame's cepstral coefficients together with its energy coefficient. To capture the dynamic variation of the MFCCs and of the frame energy, first- and second-order differences may be used [4][8]; the first-order difference was chosen in this work. Hence the total number of coefficients becomes (1E + 12MFCC + 1ΔE + 12ΔMFCC) = 26, where E and ΔE represent the energy and its difference respectively. These coefficients construct the feature vectors, arranged as a multi-dimensional matrix MFCC[I, J], where the I = 26 rows are the coefficients and the J columns are the frames; J depends on the length of the detected speech signal.
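The assembly of the 26-dimensional vectors can be sketched as follows. A simple one-frame difference is assumed for the deltas (regression-based deltas are also common, and the paper does not state which form it uses); `add_deltas` is a hypothetical helper name:

```python
import numpy as np

def add_deltas(static):
    """Append first-order differences to the static features.

    `static` is a (13, J) array: log energy E plus 12 MFCCs per frame.
    Returns a (26, J) array: (1E + 12MFCC + 1dE + 12dMFCC), as in the text.
    """
    static = np.asarray(static, dtype=float)
    delta = np.zeros_like(static)
    delta[:, 1:] = static[:, 1:] - static[:, :-1]  # first-order difference
    return np.vstack([static, delta])
```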

GAUSSIAN MIXTURE MODEL (GMM)
The Gaussian mixture model (GMM) expresses the probability density function of a random variable as a weighted sum of components, each of which is described by a Gaussian (normal) density function [4].
A statistical model is built for each uttered word in the set and denoted by λ. For instance, word s in a set of size S can be written as in Eq. 9:

λ_s = {w_i^s, μ_i^s, σ_i^s}, i = 1, …, M; s = 1, …, S … (9)

where w is a weight, μ is a mean vector, σ is a diagonal covariance, and M is the number of GMM components. A diagonal covariance is used rather than a full covariance matrix for the word model in order to simplify the hardware design; however, this means that a greater number of mixture components is needed to provide adequate classification performance [13]. To estimate these parameters initially, the K-means algorithm is first used to cluster the feature vectors of a specific word model into several categories (K = 5 clusters in this work), and a diagonal covariance matrix and mean vector are computed for each cluster (i = 1, …, 5). These parameters define each cluster as one weighted Gaussian component of the word model, which is later used to train the HMM.
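The K-means initialization step can be sketched as below; `kmeans_init` is a hypothetical helper, the iteration count and variance floor are illustrative assumptions, and rows of `features` are the per-frame MFCC vectors:

```python
import numpy as np

def kmeans_init(features, k=5, iters=20, seed=0):
    """Cluster feature vectors (rows) with K-means and return, per cluster:
    mean vector, diagonal covariance, and weight. These initialize one
    weighted-GMM word model (K = 5 as in the paper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest center, then update centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    means, diag_covs, weights = [], [], []
    for i in range(k):
        cluster = X[labels == i]
        means.append(cluster.mean(axis=0))
        diag_covs.append(cluster.var(axis=0) + 1e-6)  # diagonal covariance
        weights.append(len(cluster) / len(X))
    return np.array(means), np.array(diag_covs), np.array(weights)
```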
The Expectation-Maximization (EM) algorithm is used to re-compute the means, covariances, and weights of each GMM component iteratively; each iteration increases the accuracy of the estimates of all three parameters. The EM equations are as follows [13]. Posterior probability:

Pr(i | x_t) = w_i g(x_t; μ_i, σ_i) / Σ_{k=1}^{M} w_k g(x_t; μ_k, σ_k) … (10)

New estimates of the mixture weights and mean vectors:

w̄_i = (1/T) Σ_{t=1}^{T} Pr(i | x_t) … (11)

μ̄_i = Σ_{t=1}^{T} Pr(i | x_t) x_t / Σ_{t=1}^{T} Pr(i | x_t) … (12)

New estimates of the diagonal elements of the i-th covariance matrix:

σ̄_i² = Σ_{t=1}^{T} Pr(i | x_t) x_t² / Σ_{t=1}^{T} Pr(i | x_t) − μ̄_i² … (13)
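One EM re-estimation pass for a diagonal-covariance GMM can be sketched as follows. `em_step` is a hypothetical helper name, and the E-step is evaluated in the log domain for numerical stability (an implementation choice, not stated in the paper):

```python
import numpy as np

def em_step(X, weights, means, diag_covs):
    """One EM re-estimation pass for a diagonal-covariance GMM.
    X: (T, D) feature vectors; weights: (M,); means, diag_covs: (M, D)."""
    T, D = X.shape
    M = len(weights)
    # E-step: posterior Pr(i | x_t) for each component and frame
    log_p = np.empty((T, M))
    for i in range(M):
        diff2 = (X - means[i]) ** 2 / diag_covs[i]
        log_p[:, i] = (np.log(weights[i])
                       - 0.5 * np.sum(np.log(2.0 * np.pi * diag_covs[i]))
                       - 0.5 * diff2.sum(axis=1))
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post = p / p.sum(axis=1, keepdims=True)   # Pr(i | x_t)
    # M-step: new weights, means, and diagonal covariances
    n_i = post.sum(axis=0)                    # effective counts per component
    new_w = n_i / T
    new_mu = (post.T @ X) / n_i[:, None]
    new_var = (post.T @ (X ** 2)) / n_i[:, None] - new_mu ** 2
    return new_w, new_mu, np.maximum(new_var, 1e-6)
```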
HIDDEN MARKOV MODEL (HMM)
The HMM is a statistical model which establishes a model for every word through the statistical analysis of large amounts of data. Figure (4) shows a left-to-right model. An HMM can be defined as λ = (A, B, π), where [5]:
• A = {a_ij}, a_ij = P(q_{t+1} = j | q_t = i): the state-transition probability matrix, describing the probability of a transition from state i to state j, 1 ≤ i, j ≤ N.
• B = {b_j(k)}: the probability of obtaining the symbol V_k in state q_j, 1 ≤ j ≤ N, 1 ≤ k ≤ M.
• π: the initial probability vector π(i), 1 ≤ i ≤ N.
• N: number of states; M: number of symbols.
• Q = {q_1, q_2, …, q_T}: the state sequence through the HMM in the interval [1, T].
The speech recognition problem is: given an observation sequence O = O_1 O_2 … O_T, where each O_t is data representing speech which has been sampled at fixed intervals, and a number of potential models M, each of which is a representation of a particular spoken utterance (e.g. word or sub-word unit), find the model M which best describes the observation sequence, in the sense that the probability P(M|O) is maximized (i.e. the probability that M is the best model given O). The Viterbi algorithm, used for recognition, is as follows [14]: to find the single best state sequence Q = {q_1, q_2, …, q_T} for the given observation sequence O = {O_1, O_2, …, O_T}, we define the quantity

δ_t(i) = max_{q_1, …, q_{t-1}} P[q_1, q_2, …, q_t = i, O_1, O_2, …, O_t | λ]

where δ_t(i) is the best score (highest probability) along a single path at time t which accounts for the first t observations and ends in state i. By induction we have δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1}). To actually retrieve the state sequence, we need to keep track of the argument that maximizes δ_t(j) for each t and j; we do this via the array Ψ_t(j). The complete procedure for finding the best state sequence then consists of initialization, recursion, termination, and path-backtracking steps. For training the HMM for multiple speakers, the HMM parameters corresponding to each uttered word are averaged. Compared to Rabiner's [15] approach, this has a number of advantages such as lower data requirements, higher detection accuracy, and lower computational complexity.
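The Viterbi procedure above can be sketched as follows, working in the log domain (as the next section also does for the observation probabilities); `viterbi` is a hypothetical helper name:

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Viterbi decoding in the log domain.

    log_A:  (N, N) log transition probabilities log a_ij
    log_pi: (N,)   log initial state probabilities
    log_B:  (T, N) log observation probabilities log b_j(O_t)
    Returns (best log score, best state sequence q_1..q_T).
    """
    T, N = log_B.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]                # initialization
    for t in range(1, T):                       # recursion
        scores = delta[t - 1][:, None] + log_A  # delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)          # track maximizing argument
        delta[t] = scores.max(axis=0) + log_B[t]
    # termination and path backtracking via psi
    q = np.empty(T, dtype=int)
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return delta[-1].max(), q
```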

COMPUTATION OF OBSERVATION PROBABILITIES
Continuous HMMs compute their observation probabilities from feature vectors extracted from the speech waveform. The computation is typically based on uncorrelated multivariate Gaussian distributions, but can be further extended by using Gaussian mixtures, where the final probability is the sum of a number of individually weighted Gaussian values. As with the Viterbi algorithm, these calculations can be performed in the log domain, resulting in Eq. 14 [16]:

log b_j(O_t) = Σ_{l=0}^{L-1} ( [-½ log(2π σ_{j,l}²)] - (O_{t,l} - μ_{j,l})² [1 / (2σ_{j,l}²)] ) … (14)

where O_t is the vector of observation values at time t; μ_j and σ_j are the mean and variance vectors respectively for state j; and O_{t,l}, μ_{j,l}, and σ_{j,l} are the elements of the aforementioned vectors, enumerated from 0 to L-1.
Note that the values in square brackets depend only on the current state, not on the current observation, so they can be computed in advance. For each vector element of each state we then require only a subtraction, a square, and a multiplication. Because each of these calculations is independent of the others at time t, they can be performed in parallel if sufficient resources are available [16].
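The precomputation idea can be sketched as below; `make_log_gaussian` is a hypothetical helper that caches the state-only terms once and leaves only the cheap per-observation work:

```python
import numpy as np

def make_log_gaussian(mu, var):
    """Precompute the state-dependent ('square bracket') terms of the
    diagonal log Gaussian once, then evaluate cheaply per observation."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    const = -0.5 * np.sum(np.log(2.0 * np.pi * var))  # state-only term
    inv2v = 1.0 / (2.0 * var)                         # state-only term
    def log_b(o):
        # per observation: subtract, square, multiply, accumulate
        return const - np.sum((o - mu) ** 2 * inv2v)
    return log_b
```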

SOUND LEVEL MEASUREMENTS (EXTERNAL NOISE LEVEL VARIATION)
The relationship between sound intensity and perceived loudness is shown in Table (2), expressed as sound intensity on a logarithmic scale in decibels SPL (Sound Pressure Level). On this scale, 0 dB SPL is a sound wave power of 10^-16 watts/cm², about the weakest sound detectable by the human ear. Normal speech is at about 60 dB SPL, while painful damage to the ear occurs at about 140 dB SPL [17].
In this work the level of external noise was treated as a variable test condition to study its effect on the speech recognition system without building any type of filter.

IMPLEMENTATION AND RESULTS
Altera provides a soft-core processor called the Nios II, defined in a Hardware Description Language (HDL). The Nios II processor and its associated memory and peripheral components are easily instantiated using the Altera SoPC Builder in conjunction with the Quartus II software. This system thus uses HDL to define the hardware components required alongside the Nios II processor, while the functions of the recognition system are programmed using the Nios II IDE/C++ development software. The Altera DE2-70 development board, with the Cyclone II EP2C70F896C6N FPGA chip on it, was chosen to implement this work. Figures (5, 6) show the function interface and the block diagram of the system, respectively.
The system is capable of running the training and recognition processes in the same design. The real-time speech signal is acquired through a microphone connected directly to the microphone-in jack of the Altera DE2-70 board; the sampling frequency is 16 kHz and the Sigma-Delta A/D has 24-bit resolution. The English digits 0 to 10 were used as the training vocabulary, with 3 recordings per digit. The duration of each training word was 1.5 sec, so each uttered word comprises 24000 samples, of which only the effective and useful samples are passed to feature extraction according to the pre-processing operations.
To study the effect of the sound of the surrounding environment on the recognition rate, a Sound Level Meter (SLM) instrument, model SL-4001, shown in Figure (7), was used to measure the sound level in dB.
Figure (8) shows sample readings of the SLM instrument as the surrounding sound level varies. The base level represents the sound level of the test room; each pulse marks an uttered word, and the pulse width shows the duration of the utterance. In this test, external random white noise (uniformly distributed with zero mean and 16e3 variance) was used to obtain multiple noise levels while the recognition process ran.
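Generating such a noise signal is straightforward; the sketch below assumes a uniform distribution on (-a, a), whose variance is a²/3, so a = sqrt(3 · 16e3) ≈ 219 for the stated variance (`white_noise` is a hypothetical helper name):

```python
import numpy as np

def white_noise(n, variance=16e3, seed=1):
    """Zero-mean, uniformly distributed noise with the requested variance.
    For U(-a, a), variance = a^2 / 3, hence a = sqrt(3 * variance)."""
    a = np.sqrt(3.0 * variance)
    rng = np.random.default_rng(seed)
    return rng.uniform(-a, a, size=n)
```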
Figure (9) shows the GMM-HMM recognition system. The recognition accuracy for the training words (digit/0 to digit/10) was 100% for the individual speaker. Table (3) summarizes the tests, taking into account the variation of the surrounding sound levels.
Figure (10) compares the recognition probabilities of the uttered words with each other; the words labeled ONE, FOUR, SIX, and EIGHT are shown as examples. Each word was uttered 20 times as a test; the lower the value, the stronger the match for the recognized word, as shown by the red curves.

CONCLUSIONS
The MFCC method provides excellent compression of the audio signal by extracting its most important features into multi-dimensional vectors: for 4800 samples, taking 13 coefficients per frame yields a compression ratio reaching 92.1%, independent of the length of the audio. Because an approximate inverse of the MFCC computation can recover the speech data, it is sufficient to process this small amount of data. The sound level of the environment does not affect all uttered words equally; some words are more sensitive than others. The system can be used in a somewhat loud environment (within certain limits), since the probability curves remain roughly parallel to each other; this means all the external sounds (i.e. noise) have an approximately equal effect on each recognition probability curve, leaving the system capable of recognizing an uttered word even in the presence of noise.
To use this system in a very loud environment (e.g. inside a noisy factory), it should be retrained under that condition to be able to recognize the uttered words correctly.
Using SoPC to build the Altera Nios II processor, with C++ as the programming language, provided a suitable platform for implementing an embedded system, and the design can be modified easily and quickly to meet future requirements.

Figure (1) Speech pre-processing flow.

Figure (10) Comparison of the probability of the recognition rate.

Table (1) Filter banks parameters (lowest frequency in Hz; * linear filters).
* The Auditory Toolbox used in MATLAB suppresses frequencies below approximately 133 Hz.