A systematic approach for segmenting voiced/unvoiced signals using fuzzy-logic system and general fusion of neural network models for phonemes-based speech recognition

Abstract

In this paper, a speech-to-text translation model is developed for Malaysian speakers based on 41 classes of phonemes. A simple data acquisition algorithm was used to develop a MATLAB graphical user interface (GUI) for recording isolated word speech signals from 35 non-native Malaysian speakers. The collected database consists of 86 words covering 41 phoneme classes, grouped as affricates, diphthongs, fricatives, liquids, nasals, semivowels and glides, stops, and vowels. The speech samples are pre-processed to eliminate undesirable artifacts, and a fuzzy voice classifier is employed to separate the samples into voiced and unvoiced sequences. Each voiced sequence is divided into frames, and Linear Predictive Coefficient (LPC) features are extracted from each frame. The feature sets are formed from the LPC features of all extracted voiced sequences and used for classification. The isolated words, chosen based on the phonemes, are associated with the extracted features to establish the input-output mapping of the classification system. The data are then normalized to a definite range and randomized. Multilayer Neural Network (MLNN) models are developed with four combinations of input and hidden activation functions. To obtain a robust model, the networks are trained with 60%, 70% and 80% of the total data samples, with 25 trials each. Each trained network model is validated by simulating the network with the remaining 40%, 30% and 20% of the set. The reliability of the trained network models is compared by measuring true positives, false negatives and network classification accuracy. The LPC features show better discrimination, and the MLNN models trained using the LPC spectral band features give better recognition.

Keywords

Fuzzy voice classifier, Malaysian English pronunciation, linear predictive coefficients (LPCC), neural network models (MLNN)

1 Introduction

Speech is an idealized model of communication and an efficient user interface for humans. The English spoken by Malaysians varies widely from one ethnic community to another [1]. The study of speech-to-text translation systems is important because they can be used in many applications, such as learning aids, keyboardless data entry and automated speech therapy for hearing-impaired persons. The related literature and the research approach towards a speech-to-text translation system based on multilayer neural network models are explained in the subsequent sections.

1.1 Malaysian English pronunciation

Malaysian English is generally non-rhotic and is derived from British English pronunciation as a result of British colonialism in present-day Malaysia. English in Malaysia has been categorized into three levels: acrolect, mesolect and basilect [2]. Acrolect speakers are the most literate group and have typically spoken English since early schooling; only a small percentage of Malaysians are proficient at this level. Malaysian English belongs to the mesolect and is mostly used by academics, professionals and other English-educated Malaysians. Malaysian English uses a pronunciation system similar to that of British English. However, most Malaysians speak with a distinctive dialect due to the influence of their mother tongue [3–5].

1.2 Voiced/unvoiced classification of speech signal

Speech signals are composed of phonemes, which contain both voiced and unvoiced portions. Automatic voiced/unvoiced separation of articulated signals enables highly adaptive speech processing models and also significantly reduces the information transfer rate. To date, researchers have used classic methods such as the Zero Crossing Rate (ZCR) and energy to separate speech signals [6]. Many research works have used statistical analyses of the wavelet-based frequency distribution, zero crossing rates and average energy of short-time segments of the speech signal. These approaches have been evaluated for various speakers and utterances on large speech databases containing a broad range of speech records [7–9]. In practice, speech signals can be categorized into two classes, buzz or hiss (periodic or aperiodic) sounds, based on the frequency components of the speech spectrum [10]. Previous studies also indicate that speech signals generally have high-frequency components, whereas speech signals that incorporate aspirates and plosives often have low-frequency components. However, there has been no research on a voiced/unvoiced segmentation system that differentiates the voiced and unvoiced portions of the various phonemes, and no methodology offers a generalized method for segmenting signals involving different phonemes. Therefore, a fuzzy-based methodology using energy and change-in-energy features is proposed in this research to identify voiced/unvoiced portions in recorded speech signals containing 41 classes of phonemes.

1.3 Phoneme based speech to text translation

Automatic Speech Recognition (ASR) focuses on translating an audible speech input into a text representation by recognizing a set of characters [7]. ASR systems have been built to classify either isolated words or phonemes. Isolated word recognition is used in applications that require a fixed vocabulary, and it is convenient to implement using Hidden Markov Models (HMM), Artificial Neural Networks (ANN) and Support Vector Machines (SVM) [11–15].

Phoneme-based speech recognition is very useful because it is independent of vocabulary. Further, the efficiency of a Large Vocabulary ASR (LVASR) system depends on the consistency of the phoneme recognition [16–18]. Phoneme-based speech recognition can be established using two proven methods. The first relies primarily on manually segmenting the speech signal into phonemes, and the second on segmenting the speech signal into an equal number of frames. A frame-wise classification system for phoneme recognition was presented in [19], in which features such as the ZCR and the MSF were proposed and the classifier performance was also compared with that obtained using LPC features.

From the literature review, ASR limitations were found to be strongly correlated with increasing complexity and an increasing number of isolated words in the database, as well as a lack of adaptation to other languages. In this context, phoneme-based speech recognition systems are an alternative way of developing ASRs that overcomes several of these limitations. Since the number of phonemes in a language ranges from 40 to 50 and certain phonemes are almost the same across languages, ASRs based on phonemes can be configured for any speaker. In this work, the extracted voiced portions are analyzed using LPCC features, and the extracted LPCC features are used to develop the MLNN models. The MLNN models are developed for the eight phoneme categories, and a simple fusion-based MLNN architecture is developed to classify the isolated words. The block diagram of the proposed speech-to-text translation system is shown in Fig. 1.

Fig. 1

Block diagram of the speech to text translation system based on phoneme classes.

2 Experimental setup and data collection

A phoneme is the smallest unit of sound in a language; changing the combination of phonemes changes the meaning of a word. Many languages exist in the world, and when different languages are compared, phonemic differences can be observed, yet the total number of phonemes is finite. To build a speech-to-text translation system for Malaysian English, an isolated word database, the Phoneme Class Word Database (PCWD), has been developed with 86 isolated words covering 41 classes of English phonemes; it is used to train and develop the speech-to-text network models.

2.1 Selection of non-native Malaysian speakers

Different speakers were involved in recording the phoneme-based isolated word speech dataset. A number of research works have discussed techniques for recording speech data from non-native speakers in the context of a language tutoring system [20, 21]. An English speech database containing newspaper sentences, conversational sentences and imitated speech (imitating the speech of native speakers) was developed in [22].

A primary concern is to establish methods and frameworks for speech-to-text translation that are feasible for use in Malaysia. Malaysia is well known for its ethnic heterogeneity, with three dominant ethnic groups (Malays, Chinese and Indians). Therefore, the PCWD database was created with 35 Malaysians of different ethnicities. The speaker population contains almost equal numbers of subjects from each ethnic group, drawn from different parts of Malaysia. The 86 words with 41 phoneme classes are listed in Appendix Table A.1. The selected 86 English words are very easy for native speakers to pronounce. The subjects were asked to pronounce the selected words, and the corresponding speech signals were recorded.

2.2 Experimental setup

Sound can be described as a phenomenon that the human hearing mechanism can generally sense between 20 Hz and 20,000 Hz [22]. In the experimental setup, the speech signals were collected using a standard phone headset (level ≥ 85 dB, sensitivity –58 dB ± 2 dB, impedance 2.2 kΩ) in a cabin room. The fluorescent lamps and the air conditioner were switched off, and the data were recorded during the daytime.

In the data acquisition phase, 19 male and 16 female subjects were chosen from the student, staff and faculty community of Universiti Malaysia Perlis, Perlis and Politeknik Tuanku Syed Sirajuddin, Perlis.

The subjects were aged between 20 and 40 years. Since the developed wordlist has 86 words, the recording for each subject was conducted in two different sessions in a day.

2.3 Phoneme Class Word Database (PCWD)

Speech recognition and translation of non-native Malaysian English speech is becoming increasingly important for speech-to-text translation systems. Because of the inherent characteristics of non-native speech, speech recognition systems still show severe performance loss when faced with it [23, 24]. The Malaysian English speech database is an isolated word speech database of non-native pronunciations of English. Such a database is essential for the ongoing development of multilingual automatic speech-to-text translation systems. In this study, the wordlist is developed with 86 words covering 41 phoneme classes, grouped as affricates, diphthongs, fricatives, liquids, nasals, semivowels and glides, stops, and vowels. A typical phoneme-based isolated word is shown in Fig. 2.

Fig. 2

Typical Isolated word speech signal of the word ‘yard’.

The PCWD database comprises 86 isolated words, each of which the subjects were asked to pronounce ten times. The PCWD database therefore contains 30,100 isolated word speech signals (86 words × 35 speakers × 10 repetitions). The collected data are stored in WAV format for further processing in MATLAB.

3 Pre-processing of speech signals

Speech pre-processing commonly includes pre-emphasis, a standard method intended to boost the magnitude of the higher frequencies of a speech signal relative to the magnitude of its lower frequencies. To increase the overall signal-to-noise ratio, ambient noise should be minimized and saturation of the recording devices avoided. Accordingly, a simple pre-emphasis procedure incorporating a high-pass digital filter has been designed in this work, splitting the speech signal across 3 dB and extracting the frequencies between 100 Hz and 1 kHz [25]. The noise in the speech signal is removed using the high-pass filter described in Equation (1).

$S [n] = S_{a} [n] - α S_{a} [n - 1]$ (1)

where S [n] is the pre-emphasized speech signal, S_a [n] is the current input speech sample and S_a [n - 1] is the previous speech sample. The filter coefficient α is a constant between 0.9 and 1 [25]. For this application, α is chosen as 0.94 for the single-zero filter through which the signal passes [26]; it is also suggested that this procedure improves the energy level of the signal at frequencies above 6000 Hz. Figure 3 shows a typical isolated word speech signal and the corresponding pre-emphasized signal.

Fig. 3

Typical normal isolated word speech signal and the corresponding pre-emphasized signal.
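The pre-emphasis filter of Equation (1) can be illustrated with a minimal Python sketch. The function name `pre_emphasize` and the handling of the first sample (treated as if the preceding sample were zero) are assumptions made for this illustration; the original work used MATLAB.

```python
def pre_emphasize(samples, alpha=0.94):
    """Return S[n] = S_a[n] - alpha * S_a[n - 1] for a list of samples.

    The first output sample is copied unchanged, which amounts to assuming
    S_a[-1] = 0 (an assumption of this sketch, not stated in the text).
    """
    if not samples:
        return []
    emphasized = [samples[0]]
    for n in range(1, len(samples)):
        emphasized.append(samples[n] - alpha * samples[n - 1])
    return emphasized


if __name__ == "__main__":
    # A constant (zero-frequency) input is strongly attenuated after the first sample.
    print(pre_emphasize([1.0, 1.0, 1.0, 1.0]))
```

With α = 0.94, a constant input is reduced to roughly 6 % of its amplitude after the first sample, which reflects the high-pass behaviour of the filter.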

4 Fuzzy voiced/unvoiced classification

In the past three decades, researchers have worked extensively on voiced/unvoiced separation through statistical analysis and machine learning algorithms [27–29].

In most of the existing algorithms, the classifier needs intensive training data and a threshold for classification [30], and the accuracy of the voiced/unvoiced classification is limited. The proposed fuzzy voice classifier has been developed using energy and change in energy as the feature set. Fuzzy logic systems are suitable for problems requiring approximate rather than exact solutions. The proposed fuzzy voice classifier consists of frame blocking of the speech signal, feature extraction algorithms and the fuzzy classifier developed for voiced/unvoiced classification.

Frame blocking

To split the isolated word speech signal into voiced and unvoiced parts and to study the features effectively, the pre-emphasized speech signals are subdivided into frames of N = 256 samples each, with an overlap of 50 percent, i.e. a frame shift of M = 128 samples (M < N). The discrete time-domain isolated speech signal is denoted X, as given in Equation (2). The first frame consists of the first N (256) samples, and each subsequent frame begins M samples after the previous one, so that consecutive frames overlap by N - M = 128 samples. The frame blocking of a speech signal is depicted in Fig. 4. This procedure is repeated until all speech samples have been taken into consideration for the segmentation process [31]; each frame is represented as in Equation (3).

Fig. 4

An isolated word speech signal blocked with a frame size of 256 samples and an overlap of 50 percent.

$X = [X_{1}, X_{2}, X_{3}, \dots, X_{i}, \dots, X_{N}]$ (2)

where X is the speech data and X_i is the i^th frame; the index N in Equation (2) counts the frames. Each frame is represented as:

$X_{i} = [x_{i 1}, x_{i 2}, x_{i 3}, \dots, x_{ij}, \dots, x_{i 256}]$ (3)

where x_ij is the j^th sample of the i^th frame.

The segmented frames are then used as input to the proposed voiced/unvoiced classifier algorithm. The following subsections present the feature extraction and voiced/unvoiced classification procedures.
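The frame blocking step can be sketched in a few lines of Python. The function name `block_frames` is hypothetical, and dropping a trailing partial frame is an assumption of this sketch, since the text does not say how leftover samples are treated.

```python
def block_frames(samples, frame_size=256, hop=128):
    """Split a signal into overlapping frames.

    frame_size corresponds to the 256-sample frame and hop to the 128-sample
    shift between frame starts (50 percent overlap). A trailing partial frame
    is dropped, which is an assumption of this sketch.
    """
    frames = []
    start = 0
    while start + frame_size <= len(samples):
        frames.append(samples[start:start + frame_size])
        start += hop
    return frames


if __name__ == "__main__":
    signal = list(range(1024))
    frames = block_frames(signal)
    print(len(frames), frames[1][0])
```

For a 1024-sample signal this yields seven frames, and the second half of each frame is identical to the first half of the next one.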

4.1 Energy and change in energy features

Energy and ZCR methods are the most common end-point detection methods for speech signals, but they are very sensitive to noise and give reasonable results only when the environmental noise is minimal. The energy per frame of the isolated word speech signal provides a preliminary indication for classifying voiced and unvoiced parts: irrespective of their periodic nature, the voiced portions of the speech signal have high energy, while the unvoiced segments have low energy.

Our study of the energy components of all phonemes has shown that the change in energy and the energy distribution across the speech signal are closely associated. Interestingly, it was observed that a frame is unvoiced if the change in energy (Δe_i) is high and the energy (e_i) is low, whereas if the change in energy (Δe_i) is low and both the energy (e_i) and the next frame energy (e_i+1) are very high, the frame is voiced. The e_i and Δe_i of the frame segments are calculated using Equations (4) to (6). The raw speech signal recorded at 16 kHz and the corresponding frame energy and change-in-energy distributions are shown in Fig. 5.

$E = [e_{1}, e_{2}, e_{3}, \dots, e_{i}, \dots, e_{N}]$ (4)

where E is the total energy, expressed as the sequence of frame energies, and e_i is the energy of the i^th frame, given by:

$e_{i} = \sum_{j = 1}^{256} x_{ij}^{2}$ (5)

where x_ij is the j^th sample of the i^th frame.

$Δ e_{i} = e_{i + 1} - e_{i}$ (6)

where Δe_i is the change in energy between the i^th and the (i + 1)^th frames.

Fig. 5

Frame energy, change in energy distribution of the recorded speech signal.
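The frame energy and change-in-energy features follow directly from their definitions as a sum of squared samples per frame and a difference between consecutive frame energies. The sketch below is illustrative only; the function names are hypothetical.

```python
def frame_energies(frames):
    """Energy of each frame: the sum of its squared samples."""
    return [sum(x * x for x in frame) for frame in frames]


def energy_changes(energies):
    """Change in energy between frame i and frame i + 1.

    The result has one element fewer than the input, because the last
    frame has no successor.
    """
    return [energies[i + 1] - energies[i] for i in range(len(energies) - 1)]


if __name__ == "__main__":
    toy_frames = [[1.0, 2.0], [3.0, 0.0], [0.0, 0.0]]
    e = frame_energies(toy_frames)
    print(e, energy_changes(e))
```

Each classifier input triple (e_i, Δe_i, e_i+1) can then be read off by pairing `e[i]`, `energy_changes(e)[i]` and `e[i + 1]`.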

The change in energy can be described as the difference between the energies of two consecutive frames of the voiced signal. Figure 6 shows the energy mapping of the voiced component of the speech signal. The e_i, Δe_i and e_i+1 features are extracted and used to develop the fuzzy voice classifier, which extracts the voiced segments from each 10-second signal recorded with ten isolated words. The voiced segments output by the fuzzy classifier are assembled into isolated words using simple algorithms, so that all speech segments corresponding to each word in the 10-second signal are extracted. Similarly, the isolated words are extracted from all speech signals to build the database.

Fig. 6

Frame energy mapping on the voiced portion of the recorded isolated word speech signal.

4.2 Fuzzy voiced/unvoiced classifier

Fuzzy set theory captures expert opinion on classification patterns, decisions, features and objects using mathematical tools and processes. Fuzzy classification is a type of pattern recognition algorithm that uses fuzzy sets during processing. Fuzzy pattern recognition is often associated with fuzzy clustering or with if-then-else systems used as classifiers [32, 33]. Fuzzy classifiers based on if-then rules can also be “transparent” or “interpretable”, i.e. the end user can verify the classification model [34]. Figure 7 shows the block diagram of the proposed fuzzy voice classifier.

Fig. 7

Block diagram of the proposed Fuzzy voice classifier.

Fuzzy logic systems are high-level decision-makers and can give reliable results even at relatively high noise levels. They are designed using fuzzy sets, i.e. sets with multiple levels of membership [35], and are usually developed using “if-then” rules and membership functions. Accordingly, the proposed fuzzy voice classifier consists of a fuzzification block, a fuzzy knowledge-based inference block and a defuzzification block.

The fuzzy voice classifier and the isolated word segmentation algorithms are designed on the MATLAB GUI software platform. The fuzzification process transforms the crisp input variables (frame energy, change in energy and next frame energy) into fuzzy inputs, and a set of fuzzy mapping rules is used to approximate the entire feature space. Three membership functions are used in the fuzzification process and two membership functions are used in the defuzzification process; they quantify the linguistic terms used in the processing.

During the initial development of the fuzzy voice classifier, standard triangular membership functions were used for both the fuzzy input sets (fuzzification process) and the fuzzy output sets (defuzzification process). The base widths of the triangular membership functions were chosen to be equal and were used to split off the voiced portion of the speech signal. Because this choice exploited only a few of the fuzzy rules and yielded very low classification accuracy, the input membership functions were later redesigned based on the observed minimum and maximum values of the e_i, Δe_i and e_i+1 features. The three linguistic terms used in the fuzzification process are low, medium and high. The designed input fuzzy sets are shown in Figs. 8–10.

Fig. 8

Membership function for e_i.

Fig. 9

Membership function for Δe_i.

Fig. 10

Membership function for energy in the (e_i+1) frame.

The fuzzy inference engine then uses the knowledge base, developed as if-then rules, to generate an output from the fuzzified inputs. Mamdani-type fuzzy inference rules have been used in this research. The fuzzy linguistic terms were used to design 21 if-then rules for identifying the voiced and unvoiced segments [27]. The fuzzy rules, chosen based on observation, are listed in Table 1.

Table 1

Rules formulated for the fuzzy voice classifier

No.		Frame Energy		Change in Energy		Energy in the Second Frame	Output Decision
1	if	low	and	low	and	low	Voiced
2	if	low	and	low	and	medium	Unvoiced
3	if	low	and	low	and	high	Unvoiced
4	if	medium	and	low	and	low	Voiced
5	if	medium	and	low	and	medium	Unvoiced
6	if	medium	and	low	and	high	Unvoiced
7	if	high	and	low	and	low	Voiced
8	if	high	and	low	and	medium	Unvoiced
9	if	high	and	low	and	high	Unvoiced
10	if	low	and	medium	and	low	Voiced
11	if	low	and	medium	and	medium	Unvoiced
12	if	low	and	medium	and	high	Unvoiced
13	if	low	and	high	and	high	Unvoiced
14	if	low	and	high	and	medium	Unvoiced
15	if	low	and	high	and	low	Voiced
16	–	–	and	–	and	low	Voiced
17	–	–	and	–	and	medium	Unvoiced
18	–	–	and	–	and	high	Unvoiced
19	–	–	and	low	and	–	Voiced
20	–	–	and	medium	and	–	Voiced
21	–	–	and	high	and	–	Voiced

The frame energy, change in energy and second-frame energy levels are fuzzified, and the weighted membership functions activate the voiced/unvoiced output membership functions. The designed output fuzzy sets are shown in Fig. 11. The linguistic terms used for the fuzzy output are ‘voiced’ and ‘unvoiced’, represented by standard triangular membership functions.

Fig. 11

Membership function for fuzzy voiced/unvoiced output.

After the rules have been evaluated in the fuzzy inference engine, the overall result is a fuzzy output. This result is defuzzified using the centroid defuzzification method [35–37] to obtain a crisp output. The defuzzified output is then used to extract the voiced portion from the speech signal using a simple algorithm. The extracted voiced portion is shown in Fig. 12. The voiced portions are segmented and then used for isolated word recognition in the speech-to-text translation system.

Fig. 12

A typical extracted isolated voiced portion from the continuous isolated words.
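The fuzzification, Mamdani inference and centroid defuzzification chain described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB FIS: the membership function breakpoints, the normalization of all three inputs to [0, 1], the 0.5 decision threshold and the two example rules (taken from the qualitative observations in Section 4.1 rather than from Table 1) are all hypothetical.

```python
def tri(x, left, peak, right):
    """Triangular membership function with feet at left/right and apex at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical input sets for features normalized to [0, 1].
INPUT_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}
# Hypothetical output sets on a [0, 1] voicing axis.
OUTPUT_SETS = {
    "unvoiced": (0.0, 0.25, 0.5),
    "voiced": (0.5, 0.75, 1.0),
}
# Illustrative rules: (e_i, delta e_i, e_i+1, decision); None means "don't care".
RULES = [
    ("low", "high", None, "unvoiced"),
    ("high", "low", "high", "voiced"),
]


def classify(e, delta_e, e_next, rules=RULES, steps=100):
    """Mamdani inference (min for AND, max aggregation) with centroid defuzzification."""
    strength = {name: 0.0 for name in OUTPUT_SETS}
    for e_term, de_term, next_term, decision in rules:
        degrees = [
            tri(value, *INPUT_SETS[term])
            for value, term in ((e, e_term), (delta_e, de_term), (e_next, next_term))
            if term is not None
        ]
        firing = min(degrees) if degrees else 1.0
        strength[decision] = max(strength[decision], firing)
    numerator = denominator = 0.0
    for k in range(steps + 1):
        y = k / steps
        # Clip each output set at its rule strength, then take the union.
        mu = max(min(strength[n], tri(y, *OUTPUT_SETS[n])) for n in OUTPUT_SETS)
        numerator += y * mu
        denominator += mu
    if denominator == 0.0:
        return None, None  # no rule fired
    score = numerator / denominator
    return score, ("voiced" if score > 0.5 else "unvoiced")


if __name__ == "__main__":
    print(classify(0.9, 0.1, 0.9))
    print(classify(0.1, 0.9, 0.1))
```

A full implementation would encode all 21 rules of Table 1 in the `rules` argument and use the membership functions of Figs. 8 to 11 in place of the placeholder triangles.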

4.3 Fuzzy voice classifier computation results

The fuzzy voice classifier for separating voiced and unvoiced portions is developed from the features extracted from the speech signals, and the voiced portions are subsequently extracted using suitable algorithms. The FIS editor available in the MATLAB fuzzy toolbox is used to create the membership functions and to model and simulate the fuzzy voice classifier. The results of the fuzzy voice classifier, i.e. the recognition rate for each class of phonemes, are presented in Appendix Fig. A.2.

From Fig. A.2, for the vowels database, the isolated word ‘may’ has the minimum recognition rate of 89.14 % and the isolated word ‘too’ has the maximum recognition rate of 97.14 %. For the diphthongs database, the isolated word ‘tour’ has the minimum recognition rate of 90 % and the isolated word ‘sky’ has the maximum recognition rate of 98.85 %. The overall mean recognition rate is 95 %.

For the consonants I database, the isolated word ‘think’ has the minimum recognition rate of 86.85 % and the isolated words ‘pot’ and ‘bit’ have the maximum recognition rate of 99.71 %. The overall mean recognition rate is 93.3 %. For the consonants II database, the isolated word ‘leisure’ has the minimum recognition rate of 79.42 % and the isolated word ‘jump’ has the maximum recognition rate of 98.85 %. The overall mean recognition rate is 92.78 %.

For the affricates database, the isolated word ‘tape’ has the minimum recognition rate of 87.42 % and the isolated word ‘tap’ has the maximum recognition rate of 99.42 %. The overall mean recognition rate is 93.42 %. For the fricatives database, the isolated word ‘theme’ has the minimum recognition rate of 86 % and the isolated word ‘thin’ has the maximum recognition rate of 99.42 %. The overall mean recognition rate is 93.42 %.

For the semivowels and glides database, the isolated word ‘once’ has the minimum recognition rate of 79.71 % and the isolated word ‘warm’ has the maximum recognition rate of 95.71 %. The overall mean recognition rate is 89.02 %. For the nasals database, the isolated word ‘bottom’ has the minimum recognition rate of 82.57 % and the isolated word ‘mass’ has the maximum recognition rate of 95.14 %. The overall mean recognition rate is 91.18 %.

5 Speech to text translation system

The speech to text translation system based on multilayer neural network models consists of pre-processing, feature extraction, development of network models and the classification of isolated words.

5.1 Feature extraction using Linear Predictive Coefficients (LPCC)

Linear Predictive Coding (LPC) is one of the most effective techniques for representing the spectral envelope of a speech signal in compressed form, and one of the most efficient methods for encoding speech at low bit rates with good quality. It gives very precise estimates of the speech parameters and is comparatively fast to compute [38].

The LPCC features are extracted from each frame of the speech signal using the built-in function available in the MATLAB signal processing toolbox. The first step is to perform autocorrelation analysis of the frame after multiplying it by a Hamming window. After the autocorrelation sequence has been computed, the Toeplitz autocorrelation matrix of size p × p is generated, and the LPC features are computed by the direct matrix multiplication method. The underlying linear predictor is given in Equation (7).

$\hat{x} (n) = \sum_{i = 1}^{p} a_{i} x (n - i),$ (7)

where \hat{x}(n) is the predicted sample, a_i are the coefficients of the predictor and p is the predictor order. The z-transform of (7) is

$\hat{X} (z) = \sum_{i = 1}^{p} a_{i} z^{- i} X (z) .$ (8)
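The autocorrelation approach to LPC can be illustrated with a short pure-Python sketch. Note that it substitutes the Levinson-Durbin recursion for the direct matrix method mentioned above; both solve the same Toeplitz normal equations. The function names are hypothetical, and the default order of 12 is an assumption based on the 12-column feature sets reported later in the paper.

```python
import math


def hamming_window(length):
    """Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]


def autocorrelation(x, max_lag):
    """Autocorrelation r[k] = sum_n x[n] x[n - k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the predictor coefficients a_1..a_p."""
    a = [0.0] * (order + 1)  # a[0] is unused so that a[i] matches a_i
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # silent or degenerate frame: remaining coefficients stay zero
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


def lpc_features(frame, order=12):
    """Window one frame and return its predictor coefficients."""
    window = hamming_window(len(frame))
    windowed = [s * w for s, w in zip(frame, window)]
    return levinson_durbin(autocorrelation(windowed, order), order)


if __name__ == "__main__":
    frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(256)]
    print(lpc_features(frame))
```

The coefficients returned here follow the sign convention of the predictor equation (the prediction is a positive weighted sum of past samples); library routines may return the prediction-error filter instead, with the opposite sign.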

5.2 MLNN architecture for PCWD database

The MLNN models [39] are developed for the eight classes of phonemes. The architecture of the proposed fusion of multilayer feedforward neural network models is shown in Fig. 13. Each MLNN consists of an input layer, hidden layers and an output layer. The inputs to the MLNN models are the features extracted using the LPC feature extraction algorithm. The hidden and output layers use the logistic sigmoid activation function, which was chosen based on the results of previous research using the speech database [40].

Fig. 13

MLNN architecture for PCWD database (Isolated word Classification).

The logistic sigmoid transfer function can be written in the form: $f (x) = \frac{1}{1 + e^{- x}}$ (9)

The predicted pattern is compared with the actual pattern, and the mean squared error (MSE) is determined using Equation (10). If the MSE is greater than the tolerance level, the weights of the MLNN architecture are modified using the backpropagation algorithm. This procedure is repeated until the MSE is below the tolerance value.

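The error for pattern p, referred to here as the MSE, is the sum of squared errors over the output neurons:

$MSE (E_{p}) = \frac{1}{2} \sum_{j = 1}^{N^{L}} {(t_{pj} - y_{pj})}^{2}$ (10)

where t_pj and y_pj are the target output and the actual output of the j^th output neuron for pattern p, respectively, and N^L is the number of neurons in the output layer.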
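The sigmoid activation and the per-pattern error can be written compactly in Python, as sketched below. The function names are hypothetical, and the default tolerance of 0.01 is taken from the training tolerance listed in Table 2.

```python
import math


def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + e^(-x)), written to avoid overflow."""
    if x < 0.0:
        z = math.exp(x)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(-x))


def pattern_error(targets, outputs):
    """Half the sum of squared differences between targets and outputs."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))


def needs_more_training(targets, outputs, tolerance=0.01):
    """True while the error is still above the training tolerance."""
    return pattern_error(targets, outputs) > tolerance


if __name__ == "__main__":
    print(sigmoid(0.0), pattern_error([1.0, 0.0], [0.5, 0.5]))
```

In a training loop, `needs_more_training` would decide whether another backpropagation weight update is required for the current pattern.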

Similarly, the mean squared error is monitored in all the multilayer neural network models until it falls below the tolerance value. The outputs from all network models are collected and discriminated using the target outputs assigned in the data pre-processing stage. The estimated output is looked up in the phoneme database to identify an isolated word. The identified word is then played back through the speaker and displayed in a GUI using a basic algorithm.

The multilayer neural network models developed for isolated word classification are based on the features extracted from the segmented voiced portion of each speech signal. The extracted LPCC feature sets have the following sizes: Vowels (3941 × 12), Diphthongs (2660 × 12), Consonants 1 (3922 × 12), Consonants 2 (3202 × 12), Affricates (3897 × 12), Fricatives (3269 × 12), Semivowels and glides (3270 × 12) and Nasals (3116 × 12). The extracted features are then labelled and associated with the eleven vowel classes. The feature set is normalized, randomized and split into training sets of 60%, 70% and 80% of the samples, while the testing set uses 100% of the samples. The processed features contain the input-output association. Each network contains three types of layers: input, hidden and output. The input layer receives the feature vectors, which constitute the input neurons of the network. The output layer is associated with the target vectors corresponding to the input vectors. The number of hidden neurons in the hidden layers is determined experimentally; the hidden neurons contribute to the weighted connections of the neural network. Each network model is trained 25 times per trial, and five trials are performed for each neural network model.
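The normalization, randomization and splitting of a feature set can be sketched as below. The function names are hypothetical; the [0, 1] target range is an assumption, since the text only speaks of a definite range; and the sketch returns the held-out remainder alongside the training subset, following the validation scheme described in the abstract.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def shuffle_and_split(features, labels, train_fraction, seed=0):
    """Shuffle sample indices and split into training and held-out pairs."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    cut = round(train_fraction * len(order))
    train = [(features[i], labels[i]) for i in order[:cut]]
    held_out = [(features[i], labels[i]) for i in order[cut:]]
    return train, held_out


if __name__ == "__main__":
    feats = min_max_normalize([[float(i), 10.0] for i in range(10)])
    labels = list(range(10))
    for fraction in (0.6, 0.7, 0.8):
        train, held_out = shuffle_and_split(feats, labels, fraction)
        print(fraction, len(train), len(held_out))
```

Fixing the seed keeps the split reproducible across the repeated trials; a different seed per trial would give a different random partition each time.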

6 Results and discussion

The consolidated training parameters of the developed MLNN models for the classification of isolated words based on the LPCC features are tabulated in Table 2. The consolidated mean recognition rates of the eight MLNN models developed for the classification of isolated words based on the phoneme classes are shown in Fig. 14. The corresponding confusion matrices for the network models are given in Appendix Figs. A.3 to A.10.

Fig. 14

Mean recognition for the MLNN models.

From Table 2, the neural network model trained using the diphthongs feature set has the minimum average training time of 342 seconds (with 60 % of the samples), and the neural network model trained using the fricatives feature set has the maximum average training time of 564 seconds (with 80 % of the samples).

Table 2

Network training parameters and Training time (LPCC features)

Training Samples	Output Neurons	0.9	Hidden Neurons in the 1st layer		25	Momentum Factor	0.9	Testing Tolerance	0.1
	Learning Rate	0.1	Hidden Neurons in the 2nd layer		25	Training Tolerance	0.01	Testing Samples	7680
Network model	Vowels MLNN	Diphthong MLNN	Consonants 1 MLNN	Consonants 2 MLNN	Affricates MLNN	Fricatives MLNN	Semivowels and glides MLNN	Nasals MLNN
Testing samples	3941	2660	3922	3202	3897	3269	3270	3116
Mean training time, 60 % samples (s)	378	342	439	450	411	488	361	412
Mean training time, 70 % samples (s)	412	372	479	491	448	531	394	450
Mean training time, 80 % samples (s)	437	395	508	520	475	564	418	477

From Fig. 14, the Consonants 1 MLNN model has the lowest average recognition rate of 82.60 % and the Vowels MLNN model has the highest average recognition rate of 86.66 % using 60% of the data samples. Using 70% of the data samples, the Consonants 1 MLNN model has the lowest average recognition rate of 83.34 % and the Vowels MLNN model has the highest of 87.44 %. Using 80% of the data samples, the Consonants 1 MLNN model has the lowest average recognition rate of 85.85 % and the Vowels MLNN model has the highest of 90.06 %.

From Fig. 15, the Vowels network model requires the lowest average number of epochs, 822, and the Consonants 1 network model the highest, 1621, using 60% of the data samples. Using 70% of the data samples, the Vowels network model requires the lowest average number of epochs, 863, and the Consonants 1 network model the highest, 1702. Using 80% of the data samples, the Vowels network model requires the lowest average number of epochs, 871, and the Consonants 1 network model the highest, 1717.

Fig. 15

MLNN classification performance for LPCC features.

7 Conclusion

With regard to the objective of the research work, the recorded isolated word speech signals are split into voiced and unvoiced portions using the proposed fuzzy voice classifier. A feature extraction algorithm based on LPCC is used to extract the features from the voiced portion of the speech signal. Data processing methods were developed to formulate the feature vectors for the classifier models. Neural network algorithms were implemented to identify the isolated words and phonemes using the features derived from the isolated word speech.

For the PCWD database, the overall minimum and maximum mean recognition of the developed fuzzy voice classifier are 79.42 % (isolated word ‘leisure’) and 99.71 % (isolated words ‘pot’ and ‘bit’). The word /zip/ has the minimum average voice extraction time of 9.61 seconds, and the words /fan/ and /may/ have the maximum average voice extraction time of 12.89 seconds. Further, from the confusion matrices, the overall minimum and maximum recognition of the isolated words for the PCWD are 55.32 % (the isolated word ‘hunt’) using the Affricatives network model and 99.35 % (the isolated word ‘hay’) using the Semivowels and Glides network model.
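Per-word recognition figures of this kind are read off the diagonal of a confusion matrix. The sketch below shows the calculation under the assumption that rows hold the true word and columns the predicted word; the 3 x 3 matrix is a made-up toy example, not data from this study, and the function names are hypothetical.

```python
def per_class_recognition(confusion):
    """Percentage of samples of each true class (row) that were
    assigned to the correct class (diagonal entry)."""
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append(100.0 * row[i] / total if total else 0.0)
    return rates


def overall_accuracy(confusion):
    """Percentage of all samples lying on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total if total else 0.0


# Toy matrix for three hypothetical words (illustrative counts only).
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 5, 5],
]
rates = per_class_recognition(toy_confusion)
worst_class = rates.index(min(rates))  # index of the least recognised word
best_class = rates.index(max(rates))   # index of the best recognised word
accuracy = overall_accuracy(toy_confusion)
```

The minimum and maximum entries of `rates` correspond to the kind of per-word extremes reported above for each network model.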

Following the current research, the following work may be carried out to improve the speech-to-text translation system.

An intelligent algorithm could be proposed to assemble the phonemes and compile the isolated word.

Advanced experimental methods can be used for capturing speech signals from different ethnic groups of students from different regions.

The proposed methodology (energy and change in energy for the fuzzy voice classification, and LPCC features for the isolated word recognition) provides more than 90 % accuracy; different techniques for extracting and optimizing features can be developed to further improve device reliability and information transfer rate.

Neuro-fuzzy classifiers can be developed to further study speech-to-text translation.
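The two inputs named above for the fuzzy voice classification, frame energy and change in energy, can be computed as in the following minimal sketch. The frame length, hop size, ramp-shaped membership function, and its thresholds `low` and `high` are assumptions made for illustration; the membership functions and rule base of the actual classifier are not reproduced here.

```python
def frame_energy_features(signal, frame_len=256, hop=128):
    """Return (energies, deltas): the mean squared amplitude of each
    frame and its change relative to the previous frame."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    # The first frame has no predecessor, so its change is set to 0.
    deltas = [0.0] + [energies[i] - energies[i - 1]
                      for i in range(1, len(energies))]
    return energies, deltas


def voiced_membership(energy, low, high):
    """Hypothetical ramp membership for the fuzzy set 'voiced':
    0 below `low`, 1 above `high`, linear in between."""
    if energy <= low:
        return 0.0
    if energy >= high:
        return 1.0
    return (energy - low) / (high - low)


# Synthetic signal: 256 silent samples followed by 256 samples of +/-1.
signal = [0.0] * 256 + [1.0, -1.0] * 128
energies, deltas = frame_energy_features(signal)
# energies -> [0.0, 0.5, 1.0], deltas -> [0.0, 0.5, 0.5]
memberships = [voiced_membership(e, low=0.1, high=0.9) for e in energies]
```

A neuro-fuzzy extension would learn thresholds such as `low` and `high` from labelled frames rather than fixing them by hand.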


Appendix

Table A.1

Isolated wordlist based on phonemic variation

Vowel phonemes	Diphthongs	Consonant phonemes I	Consonant phonemes II	Affricates	Fricatives	Semivowels & Glides	Nasals
sit may bat pot luck good ago meat car soft girl too	day sky boy beer bear tour go cow	pit bit time door cat get fan van think that send zip	man nice ring leg rat wet hat yet shop leisure chop jump	joke choke taint take tap tape taste cast hunt coach	sea zone thin them clothe shake fish theme both bath	ray way hay once one wall warm yank yarn yard	moon bottom sing made mass mind bring bang knee knife

Fig. A.2

Performance of the fuzzy voice classifier.

Fig. A.3

Recognition of the vowels MLNN confusion matrix.

Fig. A.4

Recognition of the diphthongs MLNN confusion matrix.

Fig. A.5

Recognition of the consonants I MLNN confusion matrix.

Fig. A.6

Recognition of the consonants II MLNN confusion matrix.

Fig. A.7

Recognition of the affricatives MLNN confusion matrix.

Fig. A.8

Recognition of the Fricatives MLNN confusion matrix.

Fig. A.9

Recognition of the Semivowels and Glides MLNN confusion matrix.
