Speech Recognition Based Microcontroller for Wheelchair Movement

This paper introduces an approach to the design and implementation of a control system that moves a wheelchair by means of the human voice, intended for paralyzed patients. The Mel-Frequency Cepstral Coefficient (MFCC) technique is used for feature extraction, with Dynamic Time Warping (DTW) for feature matching. The output of the system is used to control the movement of the wheelchair through an interface between a notebook and a microcontroller. The experimental results show that the proposed method gives a recognition rate of 100% for already trained speakers with environmental noise of up to 66 dB. The tests were conducted at different sound levels of the surrounding environment (53 to 73 dB), as measured by a Sound Level Meter (SLM).


INTRODUCTION
A wheelchair is a device used for the mobility of disabled people. It is controlled either manually, by pushing the wheels with the hands, or via various automatic systems. Wheelchairs are used by people for whom walking is difficult or impossible due to illness, injury or disability.
Human beings usually communicate with each other by voice. Advances in electronics have led to voice-commanded robots, and especially voice-commanded wheelchairs, which make life easier for people with disabilities who suffer from spasms or paralysis of the extremities and who cannot use a joystick, or can do so only with difficulty.
Research on wheelchair control systems is ongoing, alongside research on Automatic Speech Recognition (ASR), with the aim of giving disabled people an easy way to control a wheelchair. Researchers continue to develop the field in both hardware and software. The relevant works are briefly described below.
Z. Abd Ghani, 2007 [1] introduced a wireless wheelchair control system that employs a voice recognition processor (HM2007) for triggering and controlling the wheelchair's movements. The wheelchair is also equipped with two infrared sensors, mounted at the front and rear, to detect obstacles for collision avoidance. A PIC controller controls the system operations: it communicates with the voice recognition processor to detect the spoken word and then determines the corresponding output command to drive the left and right motors.
H. Nik, 2009 [2] implemented a speech recognition controlled wheelchair using two digital signal processors from Microchip™ (dsPIC30F6014/A), mounted on a custom-designed printed circuit board, to perform smooth humming control and speech recognition. One DSP is dedicated to speech recognition and implements Hidden Markov Models using the dsPIC30F speech recognition library developed by Microchip; the other DSP implements Fast Fourier Transforms on humming signals.
M. Qadri and S. Ahmed, 2009 [3] implemented a voice-activated wheelchair through speech processing on a Digital Signal Processor (DSP). The Texas Instruments TMS320C6711 DSP Starter Kit (DSK) is connected to the wheelchair to process the voice signal. The DSK calculates the energy, zero crossings and standard deviation of the spoken word. It also generates different analog signals according to the spoken words, which are further amplified and converted into digital form. These digital signals are used to operate the stepper motor. Five words are recognized: forward, reverse, left, right and stop.
S. Jothilakshmi, V. Ramalingam, and S. Palanivel, 2009 [4] proposed a method for improving speaker segmentation performance by fusing the residual phase and MFCC features. The method is evaluated using television broadcast interviews and the NIST 2004 database. Support vector machines are used to detect the speaker change. The system reports a performance of 85.97%. The proposed system can be extended to detect speaker changes in conversations containing more than two speakers.
A. AL-Thahab, 2011 [5] proposed a technique in which the Multiregdilet transform is used for isolated word recognition. The outputs of a neural network (NNT) are then used to control the wheelchair through a notebook computer and special interface hardware. The recognition rate for the command "GO" is 90%, and 100% for the other commands.

Speaker Recognition
Speech recognition in this work is for the Arabic language, or for any other language, depending on how the user trains the system and on the identity of the user or users of the wheelchair. Speech recognition consists of three steps: preprocessing, feature extraction, and feature matching using DTW.

Preprocessing
The speech signal must be converted into discrete form and prepared for feature extraction. The usual steps for doing so are described here. The continuous speech signal is first sampled to obtain a discrete signal. The main purpose of the analog-to-digital (A/D) converter is to quantize each discrete sample x(n), n = 0, 1, ..., N-1, into a specific number, that is, a digital representation of the sample [6]. The signal is then passed through a filter that emphasizes the higher frequencies, which increases the energy of the signal at higher frequencies [7]. The most commonly used filter for this step is the finite impulse response (FIR) filter of equation (1) [8].
Two methods are used to detect the end points: Short-Term Energy (STE) and Short-Term Zero Crossing (STZC). STE is the most obvious and simple indicator of "voicedness". Typically, voiced sounds are several orders of magnitude higher in energy than unvoiced sounds. For the frame (of length N) ending at instant m, the energy is given by equation (2) [9].
The value of N is chosen so that the frame length is 10 to 40 ms, since over this interval the speech signal can be considered unchanged, that is, its statistical properties are relatively constant [10]. The rate at which zero crossings occur is a simple measure of the frequency content of a narrowband signal [11]. The zero-crossing rate of the frame ending at time instant m is defined by equation (3) [9].
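The three preprocessing quantities can be sketched in a few lines of Python. This is a minimal sketch and not the authors' MATLAB implementation. The first-order pre-emphasis form y(n) = x(n) - a·x(n-1), the coefficient a = 0.97, the unwindowed sum of squares for the energy and the 1/(2N) normalization of the zero-crossing rate are common textbook choices assumed here; they may differ in detail from equations (1) to (3). All function names are hypothetical.

```python
def pre_emphasis(x, a=0.97):
    """First-order FIR high-pass: y(n) = x(n) - a * x(n-1).

    a = 0.97 is an assumed, commonly quoted value; the paper does not state it.
    """
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


def short_term_energy(x, m, N):
    """Sum of squared samples over the N-sample frame ending at index m (m >= N-1)."""
    return sum(s * s for s in x[m - N + 1 : m + 1])


def zero_crossing_rate(x, m, N):
    """Sign changes inside the N-sample frame ending at index m, scaled by 1/(2N).

    Each sign change contributes |sgn(x(n)) - sgn(x(n-1))| = 2 to the sum.
    """
    def sgn(v):
        return 1 if v >= 0 else -1

    frame = x[m - N + 1 : m + 1]
    changes = sum(abs(sgn(frame[n]) - sgn(frame[n - 1])) for n in range(1, len(frame)))
    return changes / (2 * N)
```

With a hypothetical sampling rate of 8 kHz, a 20 ms frame corresponds to N = 160 samples, which lies inside the 10 to 40 ms range quoted above.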

Feature Extraction
Mel-Frequency Cepstral Coefficients (MFCCs) are based on the known variation of the human ear's critical frequency bandwidths. This variation is represented by the Mel-frequency scale, which is linear below 1000 Hz and logarithmic above 1000 Hz [12]. A popular relation between the frequency f (Hz) and the Mel-frequency scale is given in [13].
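For reference, the relation most often quoted in the literature is reproduced below. It is assumed, not confirmed, that this is the form intended by [13]; the constants 2595 and 700 are the widely used ones.

```latex
\mathrm{Mel}(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)
```

Under this relation, f = 1000 Hz maps to approximately 1000 mel, which matches the scale being close to linear below 1000 Hz.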

MFCC provides a baseline acoustic feature set for speech and SR applications [13].
MFCCs with a single energy term and their dynamic derivatives were used for feature extraction. Figure (1) shows the block diagram of the MFCC feature extraction, step by step.

Figure (1): Block Diagram for the Feature Extraction [6]
The details of the block diagram are described below:
• Framing and Windowing: After preprocessing, the speech signal is divided into frames and a window is applied to each frame. Each frame is K samples long, with adjacent frames separated by P samples. A commonly used window is the Hamming window [14]. It is calculated as:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{K-1}\right), \quad 0 \le n \le K-1$$
• Fast Fourier Transform (FFT): The FFT is a fast implementation of the Discrete Fourier Transform (DFT), which converts the N samples of each frame into a frequency spectrum.
• Mel Scaled Filter Banks: The Mel-scale filter bank used in this study includes 40 triangular filters non-uniformly spaced along the frequency axis [10].
• Signal Energy: The signal energy is added to the set of parameters. It is computed directly from the speech samples s(n) within the time window [14].
• Discrete Cosine Transform (DCT): The cepstrum is defined as the inverse Fourier transform of the log magnitude of the Fourier transform of the signal. Since the log Mel filter bank coefficients are real and symmetric, the inverse Fourier transform can be replaced by the DCT to generate the cepstral coefficients [15]. The cepstral coefficients are the DCT of the filter outputs [16], where M is the number of MFCC coefficients and X_k, k = 0, 1, ..., 39, represents the log-energy output of the k-th filter.
• Dynamic Parameters: The speech signal changes from frame to frame, for example in the slope of a formant at its transitions. Features describing the change in the cepstral features over time are therefore added: 13 delta or velocity features (for the 12 cepstral features plus energy) and 13 double-delta or acceleration features. Each of the 13 delta features represents the change between frames in the corresponding cepstral or energy feature, equation (8), while each of the 13 double-delta features represents the change between frames in the corresponding delta feature [7].
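The feature extraction chain described in the list above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the authors' MATLAB code. The frame length K = 256, the frame shift P = 128, the log of the summed squares as the energy term, the DCT form c_m = Σ X_k cos(m(k + 1/2)π/40) and the simple central difference used for the delta features are all assumptions. A naive DFT is used in place of an FFT to keep the example dependency-free, and every function name is hypothetical. The result has 13 static, 13 delta and 13 double-delta values per frame, 39 in total.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def hamming(K):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (K - 1)) for n in range(K)]


def power_spectrum(frame):
    """Naive O(K^2) DFT power spectrum for bins 0..K/2 (stand-in for an FFT)."""
    K = len(frame)
    spec = []
    for k in range(K // 2 + 1):
        re = sum(s * math.cos(2.0 * math.pi * k * n / K) for n, s in enumerate(frame))
        im = sum(s * math.sin(2.0 * math.pi * k * n / K) for n, s in enumerate(frame))
        spec.append(re * re + im * im)
    return spec


def mel_filterbank(n_filters, K, fs):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [k * fs / K for k in range(K // 2 + 1)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        bank.append([(f - lo) / (mid - lo) if lo < f <= mid
                     else (hi - f) / (hi - mid) if mid < f < hi
                     else 0.0 for f in bins])
    return bank


def dct(log_e, n_ceps):
    """Cepstral coefficients 1..n_ceps from the log filter-bank energies."""
    Kf = len(log_e)
    return [sum(x * math.cos(m * (k + 0.5) * math.pi / Kf) for k, x in enumerate(log_e))
            for m in range(1, n_ceps + 1)]


def delta(rows):
    """Central difference between neighbouring frames (edge frames are clamped)."""
    last = len(rows) - 1
    return [[(b - a) / 2.0 for a, b in zip(rows[max(t - 1, 0)], rows[min(t + 1, last)])]
            for t in range(len(rows))]


def mfcc_39(signal, fs, K=256, P=128, n_filters=40, n_ceps=12):
    window = hamming(K)
    bank = mel_filterbank(n_filters, K, fs)
    static = []
    for start in range(0, len(signal) - K + 1, P):
        frame = signal[start:start + K]
        energy = math.log(sum(s * s for s in frame) + 1e-12)  # assumed log energy
        spec = power_spectrum([s * w for s, w in zip(frame, window)])
        log_e = [math.log(sum(h * p for h, p in zip(row, spec)) + 1e-12) for row in bank]
        static.append(dct(log_e, n_ceps) + [energy])          # 12 cepstra + energy
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]     # 13 + 13 + 13 = 39
```

The small constant 1e-12 only guards the logarithm against empty filters or silent frames; it is a convenience of this sketch and is not taken from the paper.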

Features Matching using DTW
Dynamic time warping (DTW) is an algorithm for measuring the similarity between two sequences that may vary in time or speed. DTW allows a nonlinear warping alignment of one signal to another by minimizing the distance between the two, as shown in Figure (2).

Figure (2): A Warping between two time series [7]
This warping between two signals can be used to determine the similarity between them, and it is therefore very useful for feature recognition. DTW is a pattern matching algorithm with a non-linear time optimization effect based on Bellman's principle of optimality, which states that given an optimal path from A to B and a point C lying somewhere along this path, the path segments AC and CB are optimal paths from A to C and from C to B respectively [15]. The DTW objective is to find the warping path W = {w1, w2, w3, ..., wK} of contiguous elements on distMatrix (with max(TX-1, TW-1) < K < ((TX-1) + (TW-1) - 1) and wk = distMatrix(i, j)) that minimizes the cumulative distance along the path. The warping path is subject to several constraints, see Figure (3). Given wk = (i, j) and wk-1 = (i', j') with i, i' ≤ (TX-1) and j, j' ≤ (TW-1) [17]:
1. Boundary conditions. w1 = (1, 1) and wK = (TX-1, TW-1).
2. Continuity. i - i' ≤ 1 and j - j' ≤ 1.
3. Monotonicity. i - i' ≥ 0 and j - j' ≥ 0.
This path can be found by using dynamic programming to evaluate the following recurrence, which defines the cumulative distance D(i, j) as the distance d(i, j) found in the current cell plus the minimum of the cumulative distances of the adjacent elements [18]:
$$D(i, j) = d(i, j) + \min\{D(i-1, j),\ D(i-1, j-1),\ D(i, j-1)\}$$
The Euclidean distance between two sequences can be seen as a special case of DTW in which the k-th element of W is constrained such that wk = (i, j)k with i = j = k. Note that it is only defined in the special case where the two sequences have the same length [18].
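The dynamic programming evaluation of the cumulative distance can be sketched as below. This is a minimal sketch, assuming a Euclidean local distance d(i, j) between feature vectors and no path-length normalization; the function names are hypothetical, and the extra border row and column are only an implementation convenience that enforces the boundary condition.

```python
import math


def euclidean(u, v):
    """Local distance d(i, j) between two feature vectors (assumed Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(x, w, dist=euclidean):
    """Cumulative DTW distance between two sequences of feature vectors.

    D[i][j] = d(i, j) + min(D[i-1][j], D[i-1][j-1], D[i][j-1]); the three
    predecessors encode the continuity and monotonicity constraints.
    """
    tx, tw = len(x), len(w)
    inf = float("inf")
    # Border row/column of infinities forces every path to start at the first cell.
    D = [[inf] * (tw + 1) for _ in range(tx + 1)]
    D[0][0] = 0.0
    for i in range(1, tx + 1):
        for j in range(1, tw + 1):
            d = dist(x[i - 1], w[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[tx][tw]
```

Two utterances of the same word spoken at different speeds, for example the sequences [0], [1], [2] and [0], [1], [1], [2], align with zero cumulative distance, whereas a plain Euclidean comparison is not even defined for them because their lengths differ.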

Proposed System Design
The proposed wheelchair looks like a conventional mechanical wheelchair, but with added components that are cheap compared with the cost of an electric wheelchair, which does not even offer speech recognition. The proposed wheelchair, shown in Figure (4), provides both speech recognition of user instructions and manual control. The wheelchair direction control system consists of three parts: the speech recognition part, which is implemented in MATLAB and installed on a laptop or notebook that runs the speech recognition algorithm together with manual control (the keyboard works as a joystick); the microcontroller board, which consists of the microcontroller and the interface between the laptop and the microcontroller through an on-board USB-to-serial converter; and finally the interface between the microcontroller and the driver circuits of the two motors, which consist of H-bridge relays.

Figure (4): Proposed System Design

Microcontroller Interface
The control commands received from the laptop are applied to the motors through a special interface built around an Atmel AT89S8253 microcontroller on an 8051-ready additional board. The connection between the laptop and the 8051-ready additional board is made via a USB cable, as shown in Figure (5).

Driver circuit
The driver circuit was built to connect the microcontroller board to the two high-power motors. As shown in Figure (6), it consists of a ULN2803 chip and eight relays that form two H-bridges to coordinate the movement of the two motors.
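One way to picture how eight relays in two H-bridges coordinate two motors is the mapping below. It is purely illustrative: the command names, the bit assignment of the relays and the choice of turning by counter-rotating the motors are hypothetical and are not taken from the paper or from Figure (6). The only fixed idea is the general H-bridge rule that a motor is driven by closing one diagonal pair of switches at a time.

```python
# Hypothetical relay bit positions inside one H-bridge (4 relays per motor).
HIGH_A, LOW_A, HIGH_B, LOW_B = 0b0001, 0b0010, 0b0100, 0b1000

# Relay pairs closed for each motor direction. Only diagonal pairs are used, so
# the supply is never shorted through two relays on the same side of the bridge.
BRIDGE = {
    +1: HIGH_A | LOW_B,  # current flows A -> B: motor turns forward
    -1: HIGH_B | LOW_A,  # current flows B -> A: motor turns in reverse
    0: 0,                # all relays open: motor stopped
}

# Assumed command set: (left motor, right motor) directions.
COMMANDS = {
    "forward": (+1, +1),
    "reverse": (-1, -1),
    "left": (-1, +1),
    "right": (+1, -1),
    "stop": (0, 0),
}


def relay_mask(command):
    """Return an 8-bit mask: low nibble = left bridge, high nibble = right bridge."""
    left, right = COMMANDS[command]
    return BRIDGE[left] | (BRIDGE[right] << 4)
```

In such a scheme the microcontroller would write the mask to an output port whose eight lines drive the relay coils through the ULN2803; the actual port and wiring used in the proposed system are not specified here.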

Implementation and Results
The total features are arranged in the form of a matrix MFCC[I, J], where I = 39 represents the number of elements of each frame and J represents the total number of overlapped frames of the speech signal. Figure (7) shows the feature vectors of four different speakers for the Arabic word ammam "أمام"; the feature vectors of each speaker differ noticeably from those of the others.

Figure (7): Feature vectors for the word ammam "أمام"

The feature vectors extracted earlier are matched using DTW, which computes the optimal warping path between two feature vectors, as shown in Figure (8). The recognition rate under background noise was computed for four speakers by testing each word ten times at noise levels of 40 dB, 50 dB, 55 dB, 60 dB, 66 dB and 73 dB, as shown in Tables (1) to (4). In each table, the first row gives the SLM readings; under each reading is the recognition rate for each uttered word.
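The matching and scoring procedure can be sketched as follows. This is a minimal sketch under assumptions: the recognized word is taken to be the stored template with the smallest distance to the test utterance, and the recognition rate is the percentage of correct trials. The function names, the second template label and all numbers in the toy example are invented for illustration; in the real system the distance would be the DTW distance between 39-row feature matrices.

```python
def recognize(features, templates, distance):
    """Return the word whose stored template is closest to `features`."""
    return min(templates, key=lambda word: distance(features, templates[word]))


def recognition_rate(trials):
    """Percentage of (spoken, recognized) pairs in which the two words match."""
    correct = sum(1 for spoken, recognized in trials if spoken == recognized)
    return 100.0 * correct / len(trials)


# Toy illustration with invented numbers and a stand-in distance.
def toy_distance(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


templates = {"ammam": [1.0, 2.0, 3.0], "other": [9.0, 9.0, 9.0]}
word = recognize([1.0, 2.0, 4.0], templates, toy_distance)  # "ammam"

# Ten trials of one word, nine of them recognized correctly.
trials = [("ammam", "ammam")] * 9 + [("ammam", "other")]
rate = recognition_rate(trials)  # 90.0
```

With ten trials per word, as in the experiments, each misrecognized utterance lowers the rate for that word by 10 percentage points.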

Conclusions
This work uses a control technique based on speech recognition. The experimental results show that the highest recognition rate achieved by the system is 100% in background noise of up to 66 dB. The control technique has several advantages, such as low cost, good steady-state response, simple implementation and no parameter sensitivity, and it meets a social need: the independence of physically challenged people.