Speech Recognition Algorithm in a Noisy Environment Based on Power Normalized Cepstral Coefficient and Modified Weighted-KNN

Speech recognition is widely used in robot control and automation. Nevertheless, its use in robots is limited by its susceptibility to background noise. This paper proposes a speech recognition algorithm to control robots in noisy environments. The proposed algorithm is based on Power Normalized Cepstral Coefficients (PNCC), a noise-resistant feature extraction technique, and a Modified K-Nearest Neighbors (KNN) classifier with Dynamic Time Warping (DTW). A new KNN-DTW classifier is proposed that integrates weighted KNN and DTW. The proposed algorithm is the outcome of experiments comparing the PNCC and Mel-frequency cepstral coefficients (MFCC) feature extraction techniques with different classifiers, namely KNN-DTW, two types of KNN (Weighted KNN and Medium KNN), and two types of Support Vector Machine (SVM) (Linear SVM and Quadratic SVM). The database used to investigate the accuracy was the audio-visual data corpus UOTletters, which includes 30 speakers, 26 English letters, and 1560 utterances. The database is divided into 50% for training and 50% for testing. In a noise-free environment, the accuracy of the proposed algorithm reached 100%. Moreover, the proposed algorithm demonstrates greater noise immunity across all five noise levels, with an average accuracy difference of 13.67% compared to baseline algorithms.


Introduction
Speech recognition is one of the key techniques for controlling robots, home appliances, security doors, mobile applications, cars, etc. However, as users are well aware, many aspects of this technology still need to be improved, such as the identification rate, noise resistance, and consistency of accuracy [1,2]. PNCC feature extraction, an enhancement of the most popular MFCC feature extraction method, is one way to increase the accuracy of speech recognition systems in noisy environments [3][4][5][6].
Numerous studies on this topic have been conducted in recent years. In our previous work [7], the proposed algorithm was evaluated on a large dataset (Speech Commands DataSet v0.01), which contains 64,727 audio files from different individuals (30 words, five repetitions each). This was the first study to integrate the Weighted KNN classifier with PNCC for speech recognition. Although KNN and SVM classifiers are frequently used in studies [8][9][10][11][12][13], it has been demonstrated that the two classifiers have similar accuracy. As a result, the best classifiers for the two feature extraction techniques, PNCC and MFCC, were investigated. The findings of that paper demonstrate that the suggested technique, based on the PNCC-Weighted KNN algorithm, has higher accuracy and immunity to white noise. Ali et al. [14] proposed an MFCC and KNN classifier-based system for isolated word recognition of ten Pashto words (the Pashto numerals 0 to 9, with 50 individuals making up the data collection). An accuracy of 76.8% was achieved in this investigation.
Safi and Abbas [15] presented a microcontroller-driven Automatic Speech Recognition (ASR) algorithm to control a security door system. The ASR algorithm is based on DTW isolated word matching and MFCC feature extraction. The algorithm was tested with twenty-two individuals, and the database contains three passwords and three user authentications. An accuracy of 100% was achieved.

Imtiaz and Raja [16] suggested, in 2017, an MFCC feature extraction and DTW approach integrated with a KNN classifier for isolated word automatic speech recognition. The entire data set comprises two thousand audio tracks of ten English words spoken by five individuals. The system achieved a validation accuracy of 98.4%.
Anggraeni et al. [17] proposed an ASR system to direct the movements of a 5 Degree of Freedom (DOF) robotic arm for picking and placing objects. The command recognition technique uses KNN classifiers in conjunction with MFCC for feature extraction. The data set consists of 20 commands in the Indonesian language, each repeated ten times. The voice recognition rate is 85% for trained respondents, compared to 80% for untrained respondents. Adiwijaya et al. [18] proposed an ASR system for the pronunciation of the Arabic letters (Hijaiyah). Two feature extraction methods, MFCC and Linear Predictive Coding (LPC), are employed in this work as study cases, and KNN is integrated as the classifier with each method to estimate the best outcome. Six speakers and 28 letter sounds make up the data set. The findings indicate that LPC has an accuracy of 78.92 percent, compared to 59.87 percent for MFCC.
She et al. [19] created a new feature extraction technique using blended features. The combination uses the cochlear filter cepstral coefficients (CFCC) as a foundation to maximize accuracy in noisy surroundings. Three stages have been added to the feature extraction procedure. This was accomplished by obtaining the energy feature TEOCC and exploiting its compensatory effect on the auditory characteristics. Last, Principal Component Analysis (PCA) is used for selection and optimization based on the feature redundancy of the fused feature. This technique uses the SVM classifier, and the database consists of ten and twenty-two Korean words from sixteen individuals, each repeated three times. For ten words, the accuracy is 92.79 percent, and for 20, it is 88.43 percent. Korkmaz et al. [20] suggested a system for classifying vowels of the Turkish language based on a feature vector created using Wavelet Decomposition Shannon Entropy, LPC, MFCC, Energy, and Zero Crossing Rate (ZCR). After optimization using a genetic algorithm, a 1-NN Cityblock classifier classified the feature vector. The database of this paper contains 2762 observations of 8 Turkish vowels spoken by ten individuals. The recognition rate of this work reached 100%. Alasadi et al. [21] proposed different feature extraction methods for the ASR system: the Modified Group Delay Function (ModGDF), the PNCC, and the MFCC. Forty speakers contributed 18 Arabic words to the data set. The findings of this paper demonstrated that MFCC has a recognition rate of 97.5%, which is higher than ModGDF's recognition rate of 90.3%. Tuncer et al. [22] suggested a dynamic center mirror local binary pattern (DCMLBP), while the DWT was used for feature extraction. Useful features are then identified using neighborhood component analysis (NCA). The decision tree (DT), KNN, SVM, bagged tree (BT), and linear discriminant analysis (LDA) are among the classifiers employed in this work. There are 480 utterances in the database for this study, representing various ambient classes (eight classes by sixty utterances). The SVM classifier achieved an accuracy of 99.97%.
This paper proposes a speech recognition algorithm based on PNCC for feature extraction and a new KNN-DTW classifier, which integrates weighted KNN and DTW, to control robots in a noisy environment. The speech recognition algorithm interfaces with hardware to control home appliances using MATLAB 2021a and a microcontroller. The database used to investigate the accuracy was the audio-visual data corpus UOTletters, which has 30 speakers, 26 English letters, and 1560 utterances.

Research Components
This section describes the background of the feature extraction techniques and classifiers that are used and developed in this paper.

PNCC Feature Extraction
The primary motivation for the development of PNCC was the need for a trustworthy ASR feature that offers high immunity to noise at a cost reasonable enough to compete with other feature extraction techniques [6]. As shown in Figure 1, the modified PNCC in [3], which builds on the basic PNCC in [4], begins in the same way as MFCC, with pre-emphasis (Equation 1). After framing and windowing with Hamming windows of 25.6 ms and a 10 ms interval between frames, the Short Time Fourier Transform is carried out using the DFT. The next step uses a 40-channel Gammatone filter bank covering 200 Hz to 8000 Hz in place of MFCC's traditional triangular filter bank; this is the stage that sets the varieties of PNCC apart from one another. This change was made in response to the finding in [3,4] that Gammatone filter banks give higher ASR accuracy than triangular filter banks. The short-time spectral power is then obtained as shown in Equation 2, where m is the frame number, l is the channel index, K is the DFT size, and $H_l$ is the response of the $l$-th channel at frequency $\omega_k$.
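The front-end steps described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in this paper (which relies on MATLAB): the function names, the pre-emphasis coefficient of 0.97, and the DFT size of 1024 are assumptions, and the Gammatone responses `H` are taken as given rather than constructed.

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """First-order FIR high-pass, y[n] = x[n] - alpha * x[n-1].
    alpha = 0.97 is an assumed, commonly used value."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_power_spectrum(x, fs, frame_ms=25.6, hop_ms=10.0, n_fft=1024):
    """Split x into Hamming-windowed frames and return |DFT|^2 per frame."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequency bins (n_fft // 2 + 1 of them).
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

def short_time_power(power_spec, H):
    """Per-channel power: sum over bins of |X[m, k] * H_l[k]|^2.
    H has shape (channels, bins) and is assumed to be supplied by the caller."""
    return power_spec @ (np.abs(H) ** 2).T
```

With the 11025 Hz sampling rate used in this paper, the defaults give frames of 282 samples taken every 110 samples.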
The medium-time power $\tilde{Q}[m,l]$, which replaces the short-time power used in MFCC, is the refinement introduced in the final version of PNCC indicated in [3]. It is computed according to Equation 3, where M is the temporal integration factor. According to [23], M=2 is advised (corresponding to five consecutive windows with a total span of 65.6 ms). This step is based on studies showing that accuracy in noisy environments is improved by long-term processing, e.g. [24][25][26]. This is because the power related to noise changes more slowly than the power related to speech. Additionally, numerous studies [27][28][29] have demonstrated that long-term processing with Gammatone channels yields more information that is beneficial for improving speech recognition.
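As an illustration of the temporal integration step, the sketch below averages the short-time power over 2M+1 consecutive frames, which is the standard form of the medium-time power in the PNCC literature. The function name and the edge handling (repeating the first and last frames) are assumptions made for this example.

```python
import numpy as np

def medium_time_power(P, M=2):
    """Average the short-time power P[m, l] over frames m-M .. m+M.
    P has shape (frames, channels). Edge frames are repeated at the
    boundaries, which is an assumed convention."""
    P = np.asarray(P, dtype=float)
    n_frames = P.shape[0]
    padded = np.pad(P, ((M, M), (0, 0)), mode="edge")
    window = 2 * M + 1
    # Stack the 2M+1 shifted copies and average them frame by frame.
    return np.mean([padded[i:i + n_frames] for i in range(window)], axis=0)
```

With M = 2, each output frame covers 2M+1 = 5 analysis windows; with 25.6 ms windows spaced 10 ms apart this spans 25.6 + 4 × 10 = 65.6 ms, matching the figure quoted above.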
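The next stages are based on findings that the human auditory system places more emphasis on the rising edge (onset) of a power envelope than on its falling edge (e.g. [22,23]). Following the development of the lower envelope $\tilde{Q}_{le}[m,l]$, if $\tilde{Q}[m,l] \geq 2\,\tilde{Q}_{le}[m,l]$ then an "excitation segment" is assumed, and if $\tilde{Q}[m,l] < 2\,\tilde{Q}_{le}[m,l]$ then a "non-excitation segment" is assumed.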
In the following step, $\tilde{S}[m,l]$ is used to modify the short-time power $P[m,l]$ (Equation 10). The mean power $\mu[m]$ is then calculated (Equation 11); the recommended value for the forgetting factor $\lambda_\mu$ is 0.999, according to [3]. The normalized power $U[m,l]$ is then obtained from Equation 12, where k is an arbitrary constant.
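A minimal sketch of the mean power normalization follows, using the running-mean form found in the standard PNCC formulation: the mean power is updated with the forgetting factor and each frame is divided by it. The function name, the symbol `T` for the modified power, and the initial value of the running mean are assumptions made for this example.

```python
import numpy as np

def normalize_power(T, forgetting=0.999, k=1.0, mu_init=None):
    """Return U[m, l] = k * T[m, l] / mu[m], where
    mu[m] = forgetting * mu[m-1] + (1 - forgetting) * mean_l(T[m, l]).
    T has shape (frames, channels). If mu_init is not given, the running
    mean starts from the overall mean of T (an assumed initialization)."""
    T = np.asarray(T, dtype=float)
    mu = T.mean() if mu_init is None else float(mu_init)
    U = np.empty_like(T)
    for m in range(T.shape[0]):
        mu = forgetting * mu + (1.0 - forgetting) * T[m].mean()
        U[m] = k * T[m] / mu
    return U
```

Because both the power and its running mean scale together, multiplying the input by a constant gain leaves the output unchanged, which is the purpose of this step.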
The power-law nonlinearity is the final step that sets PNCC apart, as shown in Equation 13, where M is the number of PNCC coefficients and the term indexed by k = 0, 2..., K represents the power function output for the kth filter.
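The last stage can be sketched as a power-law compression followed by a discrete cosine transform across the filter channels. The exponent of 1/15 is the value commonly quoted in the PNCC literature and is an assumption here, as are the function name, the default of 13 coefficients, and the unnormalized DCT-II.

```python
import numpy as np

def power_law_cepstra(U, n_coeffs=13, exponent=1.0 / 15.0):
    """Apply V = U ** exponent, then an unnormalized DCT-II over channels.
    U has shape (frames, channels); returns (frames, n_coeffs)."""
    U = np.asarray(U, dtype=float)
    V = U ** exponent
    n_channels = V.shape[1]
    channel = np.arange(n_channels)
    # basis[c, l] = cos(pi * c * (l + 0.5) / n_channels)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), channel + 0.5)
                   / n_channels)
    return V @ basis.T
```

A flat spectrum yields energy only in the zeroth coefficient, which is a quick sanity check for the transform.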

Weighted-KNN Classifier
The KNN classifier is a simple-to-use supervised learning classifier [31,10]. It is predicated on the idea that samples of a similar nature lie close to each other [32]. To categorize a sample, the classifier examines the distances of the k objects closest to it. The sample is then assigned to the class that occurs most often among them [33]. The appropriate number of nearest neighbors (K) depends on the size and kind of the dataset [14]. The Euclidean distance is frequently used to calculate the distance (d) between the sample and the object [34], as shown in Equation 15, where S and y, for the jth of the K neighbors, are the sample feature vector and the object feature vector, respectively, and i is the cluster index.
In Weighted KNN, the calculated distances are converted into weights [35], as shown in Equation 16, where w is the weight of each of the k objects nearest to the tested sample. The category to which most of the K neighboring samples belong is then determined: the weights of the K nearest samples belonging to each cluster (i) are combined according to the formula in [36], as shown in Equation 17.
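The weighted vote can be illustrated with a short sketch. The inverse-squared-distance weight used here is one common choice and is an assumption, since the exact weighting of Equation 16 is not reproduced; the function name and the small constant `eps` are likewise hypothetical.

```python
import numpy as np

def weighted_knn_predict(train_X, train_y, sample, k=5, eps=1e-12):
    """Classify `sample` by summing distance-based weights per class
    over its k nearest training vectors (Euclidean distance)."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    d = np.sqrt(((train_X - np.asarray(sample, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    # Assumed weighting: closer neighbors count more (1 / d^2).
    w = 1.0 / (d[nearest] ** 2 + eps)
    scores = {}
    for label, weight in zip(train_y[nearest], w):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)
```

The weighting matters when a few very close neighbors are outnumbered by distant ones: with two nearby samples of one class and three far samples of another, a plain majority vote at k = 5 picks the distant class, while the weighted vote picks the nearby one.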

Modified-KNN by DTW Classifier
The Modified KNN by DTW classifier was first proposed in [15]. Instead of relying only on the Euclidean distance (d) shown in Equation 15, DTW is used to calculate the global distance (D), i.e. the similarity, between the sample and the object, as shown in Equation 18.
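For readers unfamiliar with DTW, the following sketch computes the global distance between two feature sequences with the classic dynamic programming recursion. It is a textbook formulation, not necessarily the exact variant of Equation 18; the function name and the use of the Euclidean distance as the local cost between frames are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Global DTW distance between sequences a (n, dims) and b (m, dims).
    Local cost is the Euclidean distance between frames; allowed steps
    are insertion, deletion, and match."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]
```

Unlike a frame-by-frame Euclidean distance, this measure accepts sequences of different lengths and is zero for two utterances that differ only in how long each sound is held.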

The Proposed PNCC and Modified Weighted-KNN Algorithm
The general procedure of the proposed algorithm is described in Figure 2. It starts with PNCC feature extraction from the training voices, whose features are stored in the database under their cluster labels. The test voices are processed in the same way and then classified using the proposed classifier.
The database used in this experiment is the audio of the UOTletters database, which consists of 30 speakers (10 to 62 years old) and 26 English letters with two repetitions each, for a total of 1560 utterances. The database was created by the researchers in [37]. They recorded the videos of the database in their laboratory in real, noisy surroundings without filtering, relying on the camera microphone with a sampling rate of 11025 Hz, as shown in the sound sample in Figure 3.

Features Extraction
PNCC feature extraction is described in detail in Section 2; this section describes the PNCC parameter setup of the proposed algorithm. The sampling rate of the speech signal used in these experiments was 11025 samples/sec. The feature extraction starts with pre-emphasis using Equation 1, which emphasizes the high frequencies with the common first-order FIR high-pass filter to compensate for the low power of the high-frequency components. The Short Time Fourier Transform uses the DFT after framing and windowing with Hamming windows of 25.6 ms and a 10 ms interval between frames, and the short-time spectral power is obtained for the 40 channels of the Gammatone filter bank as shown in

Classification
The proposed classifier (Weighted-K-NN-DTW) is a modification of the K-NN classifier explained in Section 2.3. The steps of the proposed classifier are as follows:

1) Set the K of the neighbor features.

2) Find the similarity between the test and trained features using DTW, as shown in Equation 18.

3) Calculate the K nearest neighbors per each class's DTW, as shown in Equation 19, where D is the similarity based on DTW.

4) Calculate the weight (w) for each set of K nearest neighbors using Equation 16.

5) Compute the sum of weights for each class (M i ) using Equation 17.

The recognized class corresponds to i max, the index of the maximum in the (M) matrix.
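The steps above can be sketched end to end as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the weight is assumed to be the inverse squared DTW distance, and step 3 is read as taking the K smallest DTW distances within each class.

```python
import numpy as np

def dtw_distance(a, b):
    """Global DTW distance between sequences of shape (frames, dims)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

def weighted_knn_dtw_predict(train_seqs, train_labels, test_seq,
                             k=5, eps=1e-12):
    """Return the label whose K nearest training sequences (by DTW)
    have the largest summed weight."""
    labels = np.asarray(train_labels)
    # Step 2: DTW similarity between the test and every trained sequence.
    dists = np.array([dtw_distance(test_seq, s) for s in train_seqs])
    scores = {}
    for label in np.unique(labels):
        # Step 3: K nearest neighbors within this class (assumed reading).
        nearest = np.sort(dists[labels == label])[:k]
        # Step 4: assumed weighting, inverse squared distance.
        weights = 1.0 / (nearest ** 2 + eps)
        # Step 5: sum of weights for the class.
        scores[label] = weights.sum()
    # Final step: the class with the maximum summed weight.
    return max(scores, key=scores.get)
```

The nested loops make each DTW comparison quadratic in the number of frames, so for a database of this size a vectorized or windowed DTW would likely be preferable in practice.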

Results and Discussion
In this paper, the accuracy of the proposed algorithm was calculated at different levels of white noise (noise-free, 20 dB, 15 dB, 10 dB, and 5 dB). The accuracy of the proposed algorithm in a noise-free environment reached 100%, as shown in Figure 5. The number of nearest neighbors (K) was set to 5 based on experiments, since this value gave the highest accuracy (100%), as shown in Figure 6.
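A noisy test condition of this kind can be simulated as sketched below, assuming the dB levels denote signal-to-noise ratios and the white noise is Gaussian; neither detail is stated explicitly above, and the function name is hypothetical.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add zero-mean white Gaussian noise so that the ratio of signal
    power to noise power equals snr_db (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

For example, `add_white_noise(x, 5)` produces a signal whose noise power is roughly 32% of the speech power, which helps explain why recognition at 5 dB is so much harder than at 20 dB, where the noise power is 1%.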
The proposed algorithm resulted from experiments comparing the PNCC and MFCC feature extraction techniques with different classifiers (the proposed KNN-DTW classifier, Weighted KNN, Medium KNN, Linear SVM, and Quadratic SVM), as shown in Table 1. The best accuracy was obtained by the proposed KNN-DTW classifier with both techniques, 100% for PNCC and 98.72% for MFCC. The Weighted KNN achieved 99.97% and 96.67% for PNCC and MFCC, respectively. For Medium KNN, the accuracy dropped to 89.54% for PNCC and 86.55% for MFCC. A significant drop in accuracy was observed for the SVM classifiers: Quadratic SVM has an accuracy of 76.97% and 74.41% for PNCC and MFCC, respectively, while the worst accuracy was for Linear SVM, with 60.27% and 40.45% for PNCC and MFCC, respectively. Consequently, modifying the weighted KNN classifier with DTW improved the accuracy, because DTW captures similarity more accurately than the Euclidean distance. Moreover, integrating the proposed classifier with PNCC feature extraction improves the accuracy in a noise-free environment.
Figure 5: Accuracy of the proposed algorithm in a noise-free environment
For the noisy environments, two approaches, PNCC feature extraction with the proposed KNN-DTW classifier (the proposed algorithm) and MFCC feature extraction with the proposed KNN-DTW classifier, were selected to investigate which approach is most immune to noise. As shown in Table 2, the approach based on MFCC feature extraction with the proposed KNN-DTW classifier gave recognition rates of 98.72%, 72.19%, 18.75%, 9.15%, and 3.16% for the noise-free case and noise levels of 20, 15, 10, and 5 dB, respectively. On the other hand, the approach based on PNCC feature extraction with the proposed KNN-DTW classifier gave recognition rates of 82.05%, 50.90%, 25.77%, and 12.05% for noise levels of 20, 15, 10, and 5 dB, respectively. As the results show, the proposed algorithm based on PNCC has more immunity to noise at all five noise levels, with an average accuracy difference of 13.67% with respect to the algorithm based on MFCC.
As a practical comparison between the proposed algorithm and related works, the proposed algorithm has higher accuracy than the algorithms based on (Weighted K-NN and PNCC) in [7], (KNN and MFCC) in [14,17], and (SVM and PNCC) in [21], as shown in Table 1.
In the noise experiments shown in Table 2, the proposed classifier, being a modification of KNN, can be taken as the optimal alternative in noisy environments to the classifiers in [14,17] for MFCC and the classifiers in [7,21] for PNCC. Under this assumption, the proposed algorithm has higher accuracy than the related works in noisy environments, as shown in Table 2. Moreover, the proposed algorithm is compared with related works based on accuracy and database size, as shown in Table 3. The results show that the proposed algorithm has higher accuracy than the baseline algorithms in [7,14,16,17,18,19,21,22] and lower cost than the algorithms in [20,22].

Conclusion and Future Works
This paper proposed a speech recognition algorithm for controlling robots in a noisy environment. The outcomes demonstrated that the suggested method is more accurate than the baseline and related work. Moreover, its accuracy in noisy environments is higher by an average of 13.67% with respect to the baseline algorithm. Nevertheless, the proposed system needs to be improved at high noise levels. One suggestion for future work is to integrate the proposed audio speech recognition with a visual lip-reading algorithm to improve accuracy in noisy environments, which is the reason for using the audio of the UOTletters database. Another is to implement a hardware interface between the proposed algorithm and a mobile robot.

Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data availability statement
The data that support the findings of this study are available on request from the corresponding author.

Conflicts of interest
The authors declare that there is no conflict of interest.