Abstract
A facial emotion recognition system (FERS) recognizes a person’s emotions through several image processing stages, of which feature extraction is one of the most important. In this study, we present a hybrid approach for recognizing facial expressions that performs feature-level fusion of a local and a global feature descriptor and classifies the fused features with a support vector machine (SVM). Histogram of oriented gradients (HoG) is selected to extract global facial features and local intensity order pattern (LIOP) to extract local features. As HoG is a shape-based descriptor, it can use edge information to capture the deformations of facial muscles caused by changing emotions. In contrast, LIOP works on the intensity order of pixels and is invariant to changes in image viewpoint, illumination conditions, JPEG compression, and image blurring. Both descriptors therefore prove useful for recognizing emotions in images captured in constrained as well as realistic scenarios. The performance of the proposed model is evaluated on the lab-constrained datasets CK+, TFEID, and JAFFE and on the realistic datasets SFEW, RaF, and FER-2013. The best recognition accuracies of 99.8%, 98.2%, 93.5%, 78.1%, 63.0%, and 56.0% are achieved for the CK+, JAFFE, TFEID, RaF, FER-2013, and SFEW datasets, respectively.
Introduction
Emotions are associated with the brain’s nervous system and provide useful information about a person’s state of mind. These are the particular reactions taken by human beings in the form of variations in their speech tone, body gestures, written text, or facial expressions as a result of the change in their internal thoughts or external events that take place in the outside world. Emotions can be represented based on the 2D valence-arousal or circumplex model [1] and basic models. Ekman et al. [2, 3] analyzed the emotional variations of the subjects to propose a basic emotion model and concluded that facial expressions of the six basic emotions namely fear (Fe), anger (An), sadness (Sa), disgust (Di), surprise (Su) and happiness (Ha) remain constant and can be observed in cross-cultures. FERS have a wide range of applications in real-life such as in distance learning [4], surveillance [5], affective computing [6], healthcare [7–11], etc.
Facial expressions describe the changes that take place on the face according to the person’s intentions. In interpersonal communication, a person’s facial expressions play an essential role. It is shown by a research study that the speaker’s facial expressions convey a 55% impact of the spoken message while voice intonation contributes 38% and spoken words contribute merely 7% of the overall message effect [12]. Automatic ER from facial expressions has been a challenging and interesting task for over 30 years. The main challenges in the development of the FERS are changes in illumination conditions, face pose, emotion variability among individuals, facial image viewpoint (frontal or profile view), image scaling, and skin color complexion (refer to Fig. 1). Several approaches have been proposed in the literature to overcome these challenges but still, no universal algorithm exists that can perform effectively in complex realistic scenarios along with constrained scenarios.

Variation in face texture.
FERS can be developed based on action units [13]; feature-based approaches, including geometric features [14–16], texture features [17–22], and a combination of both [23]; model-based approaches, i.e. active shape models (ASM) [24] and active appearance models (AAM) [25]; and deep learning-based approaches [26–28]. Deep transfer learning is a deep learning technique in which knowledge learned by one model is transferred to another, so that a task can be solved by reusing all or part of a model pre-trained on a different task. It has been used for classification [29, 30] and for tasks such as the detection and classification of COVID-19 patients [31–33]. Each of these emotion recognition approaches performs best in particular scenarios. Deep learning-based approaches provide promising results but require a large dataset for network training. Approaches based on geometric features and facial models recognize emotions by measuring the variation in distance among facial parts, e.g. eyes, mouth, etc. It is challenging to train geometric feature-based approaches for person-independent ER, as facial deformation differs from person to person depending on the size and shape of the face. Geometric features are also more likely to ignore changes in facial texture, whereas appearance or texture-based features encode pixel values and are more sensitive to such changes. Whitehill et al. [34] reported that appearance-based approaches perform better than geometric features. Texture descriptors have proved very successful in image processing applications such as object recognition, image matching, and object categorization [35], and several researchers reported that they yield better recognition accuracy in FERS [36–38].
However, all these descriptors have varying advantages and shortcomings. That is why the complementary feature vectors obtained by the combination of more than one descriptor help to overcome their shortcomings.
In the proposed research work, a FERS model based on facial texture features is presented to recognize emotions in both constrained and realistic environments. Initially, color images are converted to grayscale to reduce the computation cost. The facial area is then extracted from the input image with the Viola-Jones algorithm [39]. As face size varies from person to person, which would lead to feature vectors of unequal length, the facial area is resized using various interpolation techniques and the discrete wavelet transform (DWT). We extract the visual information of the image with LIOP as a local descriptor and HoG as a global descriptor, and combine the two to capture the global facial structure as well as the deformation of local facial components such as the eyes and mouth caused by emotional changes. The combined features are invariant to several geometric and photometric transformations. The HoG descriptor [40] was originally proposed for pedestrian detection and is a shape descriptor. It counts the gradient orientations in a localized image region for edge analysis. As the deformations of facial muscles caused by changing emotions can also be analyzed through edges, HoG is a suitable descriptor for emotion classification [41–44]. The LIOP descriptor [45], on the other hand, describes the image based on the intensity order of pixels. It is invariant to several image transformations, such as changes in viewpoint and illumination conditions, JPEG compression, and image blurring. Inspired by the distinguishing nature of these two texture descriptors, we combine them for classifying emotions. HoG considers individual pixels when computing the orientation and magnitude of the gradient, while LIOP, being based on the statistical intensity order of pixels, also considers the relationship among all neighbors.
Both descriptors are therefore useful for recognizing facial expressions in images captured in constrained as well as complex realistic scenarios with varying illumination conditions, viewpoints, blurring effects, etc. The features are classified by an SVM classifier with different kernel functions, and the best results are achieved with the polynomial kernel.
The major contributions of this study are as follows: A novel model is proposed to recognize facial emotions based on complementary features. Feature level fusion of pixel intensity order based descriptor and histogram of oriented gradients based descriptor is performed for the extraction of distinguishing features being invariant to illumination conditions, image viewpoint, rotation, blurring, and JPEG compression. The performance of the proposed model is evaluated on the publicly available datasets, which contain images of both lab constrained and realistic unconstrained environments. Enhancement in recognition accuracy under a realistic environment having large variations in illumination conditions, viewpoints, etc. is also highlighted. Image scaling is performed by discrete wavelet transformation and various interpolation techniques to minimize the computational resources.
The remaining part of the paper is organized such that section 2 presents an overview of the related work. The proposed FERS model is discussed in section 3, section 4 provides a detailed description of the results and evaluation of the proposed model and section 5 concludes it.
The motivation behind the development of the ER framework is that such systems have a lot of applications in real life as in distance learning, medical, etc. These systems can be developed by analyzing the spoken words, written text, body gestures, or facial expressions. From all modalities, facial expressions are widely used means for social communication and have the highest contribution to convey the message [12, 46].
Facial expressions can be recognized through traditional machine learning (TML) and modern deep learning-based approaches (DLBA). DLBA provide promising results but require a large dataset for successful training of the network. TML approaches achieve promising accuracy even for small datasets. Such approaches can be developed based on facial texture features or geometric features. It is challenging to train geometric feature-based approaches for person-independent ER, as facial deformation differs from person to person depending on the size and shape of the face. Geometric features are also more likely to ignore changes in facial texture, whereas appearance features encode pixel values and are more sensitive to such changes. However, the various appearance descriptors have different advantages and shortcomings. That is why we combine global and local feature descriptors to overcome their individual shortcomings. Features are classified by an SVM, which is a widely used supervised learning classifier.
Related works
In this section, an overview of state-of-the-art ER approaches is presented. Such systems have evolved over time to tackle the challenges faced in realistic environments. Many years ago, Darwin introduced the concept that emotions exist in both humans and animals [47]. Ekman and Friesen extended his idea and developed the facial action coding system (FACS). In these systems, the visible movements of facial components, e.g. eyes, mouth, etc., known as action units (AUs), are monitored. The contractions or expansions of the AUs are then encoded by FACS to recognize the varying emotions. Each emotion is represented by a particular group of AUs, e.g. happiness is represented by two AUs, the lip corner puller and the cheek raiser. The benefit of FACS is that each expressive image can be assigned to a particular emotion by encoding the corresponding AUs, while its major shortcoming is that analyzing the images one by one takes a lot of time [48].
Recently, texture feature-based approaches have been preferred for FERS because they are invariant to various image transformations such as image scaling and changes in illumination [35]. Several attempts have been made to develop FERS based on texture features. Luo et al. [37] proposed a FERS framework based on a texture descriptor named improved completed local ternary patterns (ICLTP). A facial area of 150×110 pixels is extracted from the original image by locating the eyes in the frontal face image. First, the image gradient is computed with a Scharr operator, which provides better image detail than a Sobel operator. Second, CLTP features are extracted from 7×6 gradient image patches by considering two groups of neighbors, i.e. 8 neighbors with radius 1 and 16 neighbors with radius 3. The features from both neighborhoods are then concatenated to generate the final feature vector, which is used as input to KNN and sparse representation classifiers with 10-fold cross-validation. Their framework achieved optimal recognition accuracy on frontal face images captured in a constrained environment. A FERS based on the local phase quantization (LPQ) descriptor was proposed by Kherchaoui et al. [20]. They introduced an LTP coding and thresholding concept that quantizes local phases better than conventional LPQ. The face and eye regions were detected by the Viola-Jones algorithm; the core facial area was then normalized to 128×128 pixels and histogram equalization was applied. Facial features were extracted from frontal face images by the texture-based adapted gradient LPQ descriptor and classified by an SVM classifier. Hu et al. [22] proposed a novel feature descriptor based on local binary pattern (LBP) and center-symmetric LBP (CS-LBP), named center-symmetric local octonary pattern (CS-LOP), for feature extraction.
Eight neighbors were considered for computing the descriptor, as in LBP, including four center-symmetric pixels to reduce the feature vector length, as in CS-LBP. Unlike LBP and CS-LBP, the contribution of the central pixel is also considered. Detailed CS-LOP features were extracted from Gabor feature maps and gradient magnitude features, which were then combined by feature-level fusion. An SVM with a polynomial kernel was trained for classification using person-independent (N-person) and person-dependent (N-fold) cross-validation on frontal face images captured in a constrained environment. A FERS based on a 3D facial model was proposed by Qi et al. [49]. Initially, the facial area was detected based on the convex hull formed by the LBP codes of four facial regions, i.e. the left and right parts of the forehead and the zygomatic-to-chin areas. Secondly, a 3D facial model was established to divide the facial area into six sub-regions, including the eyes, nose, forehead, mouth, and cheeks, based on cognition. Promising results were achieved by considering the mouth and eye regions. The dimensionality of the extracted LBP features was then reduced by multi-dimensional scaling (MDS) based on the earth mover’s distance. Finally, emotions were recognized with SVM and convolutional neural network (CNN) classifiers with a softmax activation function, using two emotion models, i.e. the basic discrete model and the circumplex model. Higher recognition accuracy was achieved with the circumplex model and the softmax classifier for frontal face images captured in a constrained environment.
Several researchers have explored the HoG descriptor, individually as well as in combination with other descriptors, to recognize facial expressions. Donia et al. [42] presented a framework to recognize posed and spontaneous facial expressions based on HoG features extracted from frontal face images and SVM classifiers with a linear kernel. The face was detected and extracted by the continuously adaptive mean shift (CAMShift) algorithm. Facial feature points were marked manually to extract six facial parts, i.e. the eyebrows, the area between the eyes, the left and right eyes, the mouth, and the nose, for feature extraction. Manual marking and extraction of feature regions is very time-consuming for a dataset comprising a large number of images. Later, Carcagni et al. [43] presented a detailed review of HoG’s suitability for recognizing facial expressions and proposed an algorithm for frontal face images detected by the Viola-Jones algorithm. Experiments were performed using 10-fold cross-validation with the SVM classifier. The number of bins and the cell size were varied, and optimal accuracy was achieved with a 7-pixel cell size and 7 orientation bins. Sajjad et al. [44] proposed a FERS framework based on HoG, uniform local ternary pattern (U-LTP) features, and a multiclass SVM classifier. Initially, noise and blurring effects were removed from the image by applying a 3×3 median filter and histogram equalization. The face detected by the Viola-Jones algorithm was resized to 128×128 pixels by an adaptive interpolation technique. Optimal results were achieved with a 9-bin histogram per cell, 16×16 cell size, and 2×2 block size with 50% overlap for HoG features. For LTP, two 59-bin histograms were built separately for the upper and lower binary patterns, resulting in a 118-dimensional feature vector. A one-vs-rest multiclass SVM with a linear kernel was used for classification.
The framework produced promising results for frontal faces, while profile faces were not considered in the experiments.
Methodology
In this section, a detailed description of the proposed FERS, which is developed based on facial texture features, is presented (refer to Fig. 2). It comprises three major steps, i.e. image collection, discussed in subsection 4.1, and image pre-processing followed by emotion classification, described in the following subsections.

A framework of the proposed model of FER.
With advances in image acquisition techniques and technology, the number of pixels per image keeps increasing to incorporate more detail. Various devices capture color images of different sizes and contrast. As color images with three components require considerable computation power, images are first converted to grayscale, which has a single component, and normalized. The captured images contain a lot of complex background in addition to the facial part that conveys emotion. The facial area is therefore detected with the Viola-Jones algorithm [39] and extracted from the input image for feature extraction. Detected faces vary in size, which would result in feature vectors of unequal length; in addition, the LIOP descriptor requires a square image with an odd side length. DWT and various interpolation techniques are therefore employed to resize all images to an average size, as shown in Fig. 3. Interpolation can scale an image both down and up. The low-low (LL) band of the DWT retains most of the image detail while reducing the image size to one-fourth. Nearest-neighbor interpolation simply copies the value of the nearest pixel to the new pixel, while bilinear and bicubic interpolation compute a weighted average over the nearest 2×2 and 4×4 pixel neighborhoods, respectively. Experiments are performed by scaling images directly to the average size as well as by applying DWT first and then resizing to the average size with the various interpolation techniques. The results show that images resized directly with bicubic interpolation yield higher recognition accuracy than those resized after DWT (refer to Tables 3–6).

Face detected based on the viola-jones algorithm and resized based on bicubic interpolation.
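The two resizing routes described above can be illustrated with a small dependency-free sketch. It rests on several assumptions that are not stated in the source: the grayscale image is stored as a list of rows, the DWT uses a single-level orthonormal Haar wavelet, and only nearest-neighbor resizing is shown. The function names are hypothetical.

```python
def haar_ll_band(image):
    """Return the low-low (LL) band of a single-level 2-D Haar DWT.

    Each output value combines one 2x2 block of the input, so both sides are
    halved and the pixel count drops to one-fourth. With orthonormal Haar
    filters the LL coefficient of a block (a, b, c, d) is (a + b + c + d) / 2.
    """
    rows = len(image) // 2 * 2        # ignore a trailing odd row
    cols = len(image[0]) // 2 * 2     # ignore a trailing odd column
    return [
        [(image[r][c] + image[r][c + 1]
          + image[r + 1][c] + image[r + 1][c + 1]) / 2.0
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def resize_nearest(image, new_h, new_w):
    """Resize by copying, for every target pixel, one nearby source pixel."""
    h, w = len(image), len(image[0])
    return [
        [image[min(h - 1, r * h // new_h)][min(w - 1, c * w // new_w)]
         for c in range(new_w)]
        for r in range(new_h)
    ]


# Toy 4x4 "face" used only to exercise the two functions.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
ll = haar_ll_band(toy)                 # 2x2 result, one-fourth of the pixels
square = resize_nearest(toy, 129, 129) # odd side length, as LIOP requires
```

A production pipeline would more likely rely on an image library for bilinear or bicubic resizing; the sketch only makes the size relationships explicit.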
Statistical measures for performance analysis of the proposed model
Optimal values of the proposed model hyper-parameters
Classification results for CK+ dataset based on LIOP descriptors and SVM polynomial kernel
Recognition accuracy for the JAFFE dataset based on LIOP descriptors and SVM polynomial kernel
Recognition accuracy for the TFEID dataset based on LIOP descriptors and SVM polynomial kernel
Recognition accuracy for SFEW dataset based on LIOP descriptors and SVM polynomial kernel
In this subsection, we discuss the feature extraction algorithms. Texture-based features are extracted from the preprocessed facial part with the LIOP and HoG descriptors and then combined by feature-level fusion. The feature extraction and fusion algorithms are discussed in subsections 3.2.1–3.2.3. We fine-tune the descriptors’ parameters through repeated experiments to improve recognition accuracy (refer to Tables 3–6).
LIOP based facial feature extraction
For feature extraction, we first resize the image to a square with an odd side length, i.e. 129. The facial region is then divided into sub-regions based on the pixels’ intensity order. The LIOP parameters are R and N, where R is the number of sub-regions (r1, r2, r3, ..., rR) into which the facial part is divided (refer to Fig. 4) and N is the number of neighbors of each pixel Pi in region rj (refer to Fig. 5).

Facial region division into sub-regions based on pixels intensity value.

LIOP feature vector computation.
Consider the set $P^N$ of N-dimensional vectors, defined as $P^N = \{(p_1, p_2, \ldots, p_N) : p_i \in \mathbb{R}\}$, where each element $\mathbf{p} \in P^N$ is a vector representing the selected N neighbors of a pixel ‘P’, arranged in non-descending order such that:
$$p_{i_1} \le p_{i_2} \le \cdots \le p_{i_N} \tag{1}$$
In Equation (1), $p_{i_1} \le p_{i_2}$ if and only if $p_{i_1} < p_{i_2}$, or $p_{i_1} = p_{i_2}$ and $i_1 < i_2$, so that ties are resolved by neighbor index and the ordering is unique.
Suppose an index table $\delta^N$, e.g. for N = 3, having N! rows and two columns that hold each unique ordered arrangement of the elements of $\mathbf{p}$ and its corresponding index, respectively. A mapping function $\Omega: P^N \to \delta^N$ is defined as $\Omega(\mathbf{p}) = \lambda$, with $\mathbf{p} \in P^N$ and $\lambda \in \delta^N$, to map each $\mathbf{p} \in P^N$ to a particular permutation $\lambda \in \delta^N$ according to the intensity-based order. Here $\lambda = (i_1, i_2, \ldots, i_N)$ in the index table denotes the permutation of the N neighbors.
The procedure for computing the LIOP feature vector of a particular pixel is illustrated in Fig. 5. LIOP(P) denotes the feature vector of a pixel P; the feature vectors of all pixels in sub-region ri are computed in the same way and combined into des(ri). The LIOP descriptor of the entire facial area, LIOPI, is computed by concatenating all des(ri), as given in the following equation:
$$\mathrm{LIOP}_I = \big(des(r_1),\ des(r_2),\ \ldots,\ des(r_R)\big) \tag{2}$$
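The LIOP computation described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: each pixel contributes an unweighted one-hot vector, neighbor sampling is taken as given, sub-regions are equal-sized ordinal bins of the sorted intensities, and all function names are hypothetical. Under this layout the descriptor length is N!·R, so N = 4 with R = 6 gives 144 and N = 6 with R = 6 gives 4320; these lengths appear later in the paper, but the parameter values that produce them are an inference.

```python
from itertools import permutations


def build_index_table(n):
    """Map every permutation of (0, ..., n-1) to a row index (n! rows)."""
    return {perm: idx for idx, perm in enumerate(permutations(range(n)))}


def liop_of_pixel(neighbors, index_table):
    """Return the N!-dimensional one-hot LIOP vector of one pixel.

    `neighbors` holds the intensities of the N sampled neighbors. Python's
    sort is stable, so equal intensities are ordered by neighbor index,
    which keeps the mapping to a permutation unique.
    """
    order = tuple(sorted(range(len(neighbors)), key=lambda i: neighbors[i]))
    vector = [0] * len(index_table)
    vector[index_table[order]] = 1
    return vector


def liop_descriptor(pixels, num_regions, index_table):
    """Concatenate the per-region sums of pixel LIOP vectors.

    `pixels` is a list of (intensity, neighbors) pairs. Pixels are ranked by
    intensity and split into `num_regions` ordinal sub-regions.
    """
    ranked = sorted(pixels, key=lambda item: item[0])
    descriptor = []
    for r in range(num_regions):
        start = r * len(ranked) // num_regions
        stop = (r + 1) * len(ranked) // num_regions
        des_r = [0] * len(index_table)
        for _, neighbors in ranked[start:stop]:
            for k, value in enumerate(liop_of_pixel(neighbors, index_table)):
                des_r[k] += value
        descriptor.extend(des_r)
    return descriptor


table3 = build_index_table(3)
example = liop_of_pixel([86, 217, 152], table3)   # intensity order: 0, 2, 1
toy_pixels = [(i, [i, i + 1, i + 3, i + 2]) for i in range(12)]
toy_descriptor = liop_descriptor(toy_pixels, 6, build_index_table(4))
```

The original LIOP formulation additionally weights each pixel's contribution and samples neighbors in a rotation-invariant way; both refinements are omitted here for brevity.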
The HoG descriptor is primarily used for the extraction of shape-based features and was originally proposed for pedestrian detection [40]. It describes the image based on texture information. The major steps in computing the HoG descriptor are gradient computation, histogram binning, block normalization, and feature vector computation, as briefly discussed below.
3.2.2.1. Gradient computation
The HoG feature vector is computed from the orientation and magnitude of the gradient in the selected image patch. The gradient magnitude represents the rate of change of pixel values in the horizontal and vertical directions. It is obtained by convolving the image with a derivative kernel horizontally to compute $G_x$ and vertically to compute $G_y$.
The overall image gradient $G$ is then computed from $G_x$ and $G_y$ as shown in (3):
$$G = \sqrt{G_x^2 + G_y^2} \tag{3}$$
Gradient orientations represent the change in edge direction caused by the deformation of facial components during emotional change. The orientation $\theta$ is computed from $G_x$ and $G_y$ as defined in Equation (4):
$$\theta = \arctan\left(\frac{G_y}{G_x}\right) \tag{4}$$
Gradient orientations can range over 0–360° (signed gradients) or 0–180° (unsigned gradients). We achieved optimal results with unsigned gradients. The gradient magnitude varies with the expressed emotion, e.g. it is quite large for surprise and quite small for sadness.
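As a minimal sketch of the gradient step, the code below computes the magnitude and the unsigned orientation for every pixel of a grayscale image stored as a list of rows. The centered difference mask [-1, 0, 1] and the border handling are assumptions, since the source does not specify the kernel; the function name is hypothetical.

```python
import math


def gradients(image):
    """Return per-pixel gradient magnitude and unsigned orientation (degrees).

    Assumption: horizontal and vertical derivatives use the centered mask
    [-1, 0, 1]; at the border the nearest valid pixel is reused.
    """
    h, w = len(image), len(image[0])
    magnitude = [[0.0] * w for _ in range(h)]
    orientation = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = image[r][min(c + 1, w - 1)] - image[r][max(c - 1, 0)]
            gy = image[min(r + 1, h - 1)][c] - image[max(r - 1, 0)][c]
            magnitude[r][c] = math.hypot(gx, gy)
            # atan2 yields an angle in (-180, 180]; taking it modulo 180
            # folds opposite directions together (unsigned gradient).
            orientation[r][c] = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, orientation


# A horizontal edge (intensity changes from top to bottom).
mag_h, ori_h = gradients([[0, 0, 0], [0, 0, 0], [10, 10, 10]])
# A vertical edge (intensity changes from left to right).
mag_v, ori_v = gradients([[0, 5, 10], [0, 5, 10], [0, 5, 10]])
```

Using `atan2` rather than a plain arctangent of the ratio avoids a division by zero when the horizontal derivative vanishes.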
3.2.2.2. Histogram binning
The number of histogram bins à needs to be defined for the gradient orientations in an image cell. The cell size is inversely related to the feature vector length. Too small a cell size extracts irrelevant features and increases the dimensionality, requiring high computation power, while too large a cell size loses image detail (refer to Figs. 6-7). Experiments were run with varying values of à; optimal results were achieved with 7-bin histograms for CK+ and JAFFE and 9-bin histograms for the TFEID, SFEW, RaF, and FER-2013 datasets. The value of a particular bin ài is computed by accumulating the gradient magnitudes of the pixels whose orientation angle θ lies in its range.

HoG feature descriptors computation based on varying cell size (CS).

Bar diagram of the HoG feature vector for varying cell sizes and histogram bins.
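The binning rule described in this subsection can be sketched as follows. The example assumes unsigned orientations in degrees and a hard assignment of each pixel to exactly one bin; many HoG implementations interpolate between neighboring bins instead, and the source does not say which variant was used. The function name is hypothetical.

```python
def cell_histogram(magnitudes, orientations, num_bins=9):
    """Accumulate gradient magnitudes into unsigned-orientation bins.

    Each bin spans 180 / num_bins degrees. A pixel adds its magnitude to the
    bin whose range contains its orientation angle.
    """
    bin_width = 180.0 / num_bins
    histogram = [0.0] * num_bins
    for magnitude, theta in zip(magnitudes, orientations):
        index = min(int((theta % 180.0) // bin_width), num_bins - 1)
        histogram[index] += magnitude
    return histogram


# Three pixels of one cell: two near-horizontal gradients, one near 170 degrees.
hist9 = cell_histogram([2.0, 3.0, 1.0], [10.0, 15.0, 170.0], num_bins=9)
hist7 = cell_histogram([2.0, 3.0, 1.0], [10.0, 15.0, 170.0], num_bins=7)
```

With 9 bins each bin covers 20°, and with 7 bins about 25.7°, which is the trade-off between angular resolution and vector length explored in the experiments.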
3.2.2.3. Block normalization
The 1-D histograms computed from the edge orientations in the multiple cells forming a block are concatenated into a 1×D vector ƛ. The dimension of ƛ is 1×(à * ß), where à is the number of bins, as discussed in subsection 3.2.2.2, and ß is the number of cells in a block. The vector ƛ is then normalized by its L2 (Euclidean) norm to make it invariant to illumination conditions and image rotation (refer to (5)):
$$\hat{ƛ} = \frac{ƛ}{\lVert ƛ \rVert_2} \tag{5}$$
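A short sketch of the block normalization step shows why it helps against illumination changes: scaling all gradient magnitudes by a constant leaves the normalized vector unchanged. The small constant `eps` is an assumption added only to avoid division by zero for empty blocks; the function name is hypothetical.

```python
import math


def l2_normalize(block_vector, eps=1e-12):
    """Divide a block vector by its Euclidean norm (eps guards empty blocks)."""
    norm = math.sqrt(sum(v * v for v in block_vector) + eps * eps)
    return [v / norm for v in block_vector]


block = [3.0, 4.0, 0.0, 0.0]
brighter = [2.0 * v for v in block]   # same block under doubled contrast
normalized = l2_normalize(block)
normalized_brighter = l2_normalize(brighter)
```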
Block and cell size affect recognition accuracy, so we perform experiments by varying these parameters. The best recognition results are achieved with a 24×24 cell size and a 2×2 block size with 50% block overlap, resulting in a 775-dimensional feature vector for the 129×129 images of the TFEID, SFEW, and RaF datasets; the cell size is reduced to 16×16 for the 48×48 images of the FER-2013 dataset, which results in a 279-dimensional feature vector. For the lab-constrained CK+ and JAFFE datasets, a 32×32 cell size and a 2×2 block size with 50% overlap, resulting in a 400-dimensional feature vector, give optimal results.
3.2.2.4. Image feature vector
The HoG feature vector of an entire image HoGI is computed by concatenation of 1-D ƛ vector of all blocks having 1×(ƛ*x*y) dimension where x denotes the number of horizontal and y denotes the number of vertical blocks in an image.
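The feature lengths reported in the block normalization subsection (775, 279, and 400) can be checked with a short calculation. The sketch below is an observation under assumptions rather than the authors' stated implementation: the reported lengths are reproduced when the number of cells per side is rounded to the nearest integer and every cell stores 3·à + 4 values, which is the per-cell layout of the UoCTTI variant of HoG. A plain overlapping-block layout with à·ß values per block gives a different length for the same parameters. All function names are hypothetical.

```python
def cells_per_side(image_side, cell_size):
    """Number of cells along one side, rounded to the nearest whole cell."""
    return (image_side + cell_size // 2) // cell_size


def length_per_cell_layout(image_side, cell_size, num_bins):
    """Descriptor length when each cell stores 3 * num_bins + 4 values.

    Assumed layout (UoCTTI variant): 2 * num_bins signed bins, num_bins
    unsigned bins, and 4 texture-energy values per cell.
    """
    cells = cells_per_side(image_side, cell_size)
    return cells * cells * (3 * num_bins + 4)


def length_block_layout(image_side, cell_size, num_bins, block_cells=2):
    """Descriptor length for overlapping blocks moved one cell at a time.

    For 2x2 blocks a stride of one cell corresponds to 50% overlap.
    """
    cells = image_side // cell_size
    blocks = cells - block_cells + 1
    return blocks * blocks * block_cells * block_cells * num_bins


realistic = length_per_cell_layout(129, 24, 9)   # TFEID, SFEW, RaF settings
fer2013 = length_per_cell_layout(48, 16, 9)      # FER-2013 settings
lab = length_per_cell_layout(129, 32, 7)         # CK+ and JAFFE settings
classic = length_block_layout(129, 24, 9)        # block layout, for contrast
```

If a different HoG implementation was used, the agreement of these numbers would be coincidental, so the per-cell layout should be read as a plausible explanation only.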
LIOPI and HoGI features are extracted separately from the resized facial area (refer to Fig. 3), as discussed in sections 3.2.1 and 3.2.2, respectively. Initially, we perform emotion classification based on LIOP descriptors alone by varying the descriptor parameters and the multiclass SVM kernel functions. Optimal results are achieved for the CK+, JAFFE, and TFEID datasets with a large 4320-dimensional feature vector, as shown in Tables 3–5, while for the SFEW dataset a 144-dimensional feature vector gives better recognition accuracy, as presented in Table 6. The literature indicates that complementary feature descriptors can yield better recognition accuracy, and Carcagni et al. [41] showed that the HoG descriptor is quite useful for FER. Extracting more detail from the image can sometimes increase recognition accuracy. Inspired by [41], we therefore extracted HoG features as well, varying their parameters, and found that a 400-dimensional feature vector gives optimal results for the CK+ and JAFFE datasets. A higher-dimensional HoG feature vector, i.e. 775, is required for the more challenging RaF, TFEID, and SFEW datasets, while a 279-dimensional feature vector is required for the smaller FER-2013 images. We then performed feature-level fusion of the LIOPI and HoGI feature vectors extracted with the fine-tuned parameters. Experimental results show that the low-dimensional hybrid descriptor ᵮ(LIOPi +HoGi) achieves recognition accuracy comparable to the much larger 4320-dimensional LIOP feature vectors, which are computationally expensive. Finally, classification is performed on a training matrix of (Ñ x n) dimensions, where Ñ is the number of images in a dataset and n is 544 for CK+ and JAFFE, 423 for FER-2013, and 919 for the more challenging RaF, TFEID, and SFEW datasets.
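The fused vector lengths quoted above are consistent with simple concatenation of a 144-dimensional LIOP vector and the dataset-specific HoG vector. The sketch below assumes that feature-level fusion means concatenation and that the 144-dimensional LIOP setting is used for every dataset; both are inferences from the reported numbers, and the names are hypothetical.

```python
def fuse_features(liop_vector, hog_vector):
    """Feature-level fusion by concatenating the two descriptors."""
    return list(liop_vector) + list(hog_vector)


LIOP_LENGTH = 144  # assumed low-dimensional LIOP setting (length from the text)
HOG_LENGTHS = {
    "CK+ / JAFFE": 400,
    "FER-2013": 279,
    "RaF / TFEID / SFEW": 775,
}

# Placeholder zero vectors stand in for real descriptors.
fused_lengths = {
    name: len(fuse_features([0.0] * LIOP_LENGTH, [0.0] * hog_length))
    for name, hog_length in HOG_LENGTHS.items()
}
```

Each row of the training matrix would then hold one fused vector, giving the n columns listed in the paragraph.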
Classification
In this section, we discuss the details of the classification mechanism based on the SVM classifier. Experiments are performed with supervised learning: training matrices with the corresponding emotion labels are used as input to the classifier. We repeated the experiments with various SVM kernel functions. The experimental results show that optimal recognition accuracy is achieved with a polynomial kernel of degree 3, as presented in Table 3.
Support vector machine
SVM is a classification algorithm originally proposed for binary classification where data lies in a 2-D plane [50]. It separates linearly separable data belonging to distinct classes by finding a hyperplane $w \cdot x - b = 0$, where
$$w \cdot x_i - b \ge 1 \ \text{ if } x_i \in C_1, \qquad w \cdot x_i - b \le -1 \ \text{ if } x_i \in C_2 \tag{6}$$
In (6), C1 and C2 represent the classes and xi denotes the instances. Two possible extensions of the SVM classifier to multiclass classification problems are one-vs-rest and one-vs-one [51]. In one-vs-rest, N binary classifiers are trained, where N is the number of classes, while in one-vs-one, N(N−1)/2 binary classifiers are trained, one for each pair of classes.
Statistical analysis for CK+ dataset based on hybrid descriptors and SVM polynomial kernel
Statistical analysis for JAFFE dataset based on hybrid feature descriptor and SVM polynomial kernel
Statistical analysis for the TFEID dataset based on hybrid descriptors and SVM polynomial kernel
Statistical analysis for SFEW dataset based on hybrid descriptors and SVM polynomial kernel
Statistical analysis for FER-2013 dataset based on hybrid descriptors and SVM polynomial kernel
Statistical analysis for RaF dataset based on hybrid feature descriptors and SVM polynomial kernel
The polynomial kernel is defined as:
$$K(y, z) = (y^{T} z + c)^{d} \tag{7}$$
In (7), y and z are feature vectors extracted from dataset images, c is a constant that trades off the effect of the lower-order and higher-order terms, and d is the kernel degree, which is 3 in our case.
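The kernel and the two multiclass schemes can be made concrete with a small sketch. The value c = 1.0 is a placeholder, since the source does not report the constant, and the function names are hypothetical. Libraries that expose a gamma parameter typically compute (gamma·yᵀz + c)^d, which coincides with the form used here when gamma = 1.

```python
def polynomial_kernel(y, z, c=1.0, d=3):
    """Polynomial kernel (y . z + c) ** d for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(y, z))
    return (dot + c) ** d


def num_binary_classifiers(num_classes, scheme):
    """Number of binary SVMs needed by each multiclass extension."""
    if scheme == "one-vs-rest":
        return num_classes
    if scheme == "one-vs-one":
        return num_classes * (num_classes - 1) // 2
    raise ValueError("scheme must be 'one-vs-rest' or 'one-vs-one'")


k = polynomial_kernel([1.0, 2.0], [3.0, 4.0])           # (11 + 1) ** 3
ovr = num_binary_classifiers(6, "one-vs-rest")          # six basic emotions
ovo = num_binary_classifiers(6, "one-vs-one")
```

For the six basic emotions, one-vs-one therefore trains more classifiers than one-vs-rest, but each of them sees only the samples of two classes.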
The hyper-parameters of the SVM are gamma (kernel scale) and C (box constraint). The SVM classifier draws a decision boundary to separate the instances belonging to different classes, and adjusting the position of this boundary involves a trade-off that is controlled by the parameter C. Too small a value of C (close to zero) causes under-fitting, while too large a value causes over-fitting, which decreases the generalization power of the classifier. The gamma parameter determines how far the influence of an individual training instance reaches. Its value is inversely proportional to the radius of influence of the training samples selected as support vectors. Too small a gamma cannot capture the shape or complexity of the data, while with too large a gamma the radius of influence includes only the support vector itself, and over-fitting cannot be prevented even by decreasing C. In the proposed model, optimal results are achieved for C = 1 and gamma = 1.
In this section, we present the details of the datasets selected for experimentation, the performance evaluation measures, the model’s hyperparameters, and the experimental results.
Detail of experimental datasets
We evaluated the performance of the proposed FERS model based on realistic datasets i.e. SFEW [52], FER-2013 [53], RaF-DB [54] and also based on those collected in lab constrained environment i.e. TFEID [55], JAFFE [56], and CK+ [57]. All the selected datasets are briefly reviewed below in the following subsections.
Static facial expressions in the wild (SFEW) dataset
This dataset is comprised of static facial expression images extracted from acted facial expressions in the wild (AFEW) dataset [58]. AFEW is a video-based 2D expression dataset collected from 37 movie clips for the transition from the constrained lab environment to the real world. Subjects belonging to various races, age groups, and genders are considered for the collection of more realistic images having varied head poses, occlusions, etc. SFEW dataset consists of 700 images belonging to six basic emotions. The testing set is not labeled; so for results comparison, we used a validation set consisting of 321 images of six basic emotions.
FER-2013 dataset
FER-2013 is one of the best-known datasets for emotion recognition; it consists of 48×48 pixel images belonging to six basic emotions and one neutral expression. It was published for the ICML 2013 representation learning challenge. We performed experiments using the training-testing division of the dataset as well as 10-fold cross-validation. A recognition accuracy of 56% is achieved with the training-testing division and 63.0% with cross-validation.
Real-world affective faces database (RAF-DB) dataset
It is one of the larger datasets, comprising 15,339 aligned images belonging to six basic expressions and one neutral expression. It is quite challenging, as the images cover different ethnicities, genders, and age ranges, include different head poses, and were captured in challenging environments with diverse illumination conditions and occlusions. We selected the images belonging to the six basic emotions and split them according to the 80-20 rule.
Taiwanese facial expression database (TFEID) dataset
It comprises 7200 expression images of 20 males showing eight emotions: the six basic emotions, neutral, and contempt. Two CCD cameras were used to capture images at 0° and 45° viewing angles, with slight and high expression intensities. For consistency with the previous datasets, a total of 229 high-intensity grayscale images of the six basic emotions were selected for experimentation.
Extended Cohn-Kanade (CK+) dataset
It is an extended form of the Cohn-Kanade database. It comprises 593 image sequences collected from 123 subjects, representing seven emotions: the six basic emotions and neutral. We selected the images belonging to the six basic emotions for classifier training and testing.
Japanese female facial expression (JAFFE) dataset
It consists of 213 images of the facial expressions of 10 females, covering the six basic expressions and one neutral expression. Sixty Japanese subjects rated every image for each emotion; the emotion with the most votes was then assigned as the label of the image.
We divided the selected images of the RAF-DB, TFEID, CK+, and JAFFE datasets into training and testing sets according to the 80–20 rule: 80% were used for classifier training and the remaining 20% for testing. For SFEW, we used the validation set for testing, which is 31% of the overall dataset size; the remaining 69% was used for classifier training. Recognition accuracies of 39.9% and 56.1% were achieved for SFEW and FER-2013, respectively, with the training-testing division, and these improved to 56.0% and 63.0%, respectively, with 10-fold cross-validation.
In videos, a facial expression evolves from a neutral state through three expressive phases: onset, apex, and offset. Among these, the apex phase shows the peak expression, while the onset and offset phases are closer to the neutral state. We therefore selected apex-phase images from the above-mentioned datasets for experimentation.
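The two evaluation protocols described above (an 80–20 hold-out split and 10-fold cross-validation) can be sketched as index partitions. This is a minimal illustration, not the authors' implementation: the function names, the random shuffling, and the fixed seed are assumptions, since the text does not state how samples were assigned to the sets.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 229 is the number of TFEID images selected in this study.
train_idx, test_idx = train_test_split_indices(229)
print(len(train_idx), len(test_idx))  # 183 46
```

In the hold-out protocol each image is used either for training or for testing, whereas in k-fold cross-validation every image is tested exactly once and the k accuracies are averaged.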
Performance evaluation measures based on statistical analysis and model’s experimental parameters
We briefly discussed the statistical analysis measures and the model’s hyper-parameters in the following subsections.
The confusion matrix describes the overall performance of the model in terms of TP, TN, FP, and FN cases, and the statistical analysis of the proposed model is based on this matrix. The terms that we considered to assess the performance of the proposed model are:
True-positive (TP): number of instances belonging to the positive class that are correctly predicted as positive.
True-negative (TN): number of instances belonging to the negative class that are correctly predicted as negative.
False-positive (FP): number of negative instances that are incorrectly classified as positive.
False-negative (FN): number of positive instances that are incorrectly classified as negative.
The measures that we selected for statistical analysis are presented in Table 1. Recall is the portion of the relevant images that are retrieved by the classifier. Precision is the portion of the retrieved images that are relevant. Accuracy is the portion of all predictions, positive or negative, that are correct. Specificity (TNR) is the portion of the negative images that are correctly predicted. FPR and FNR are the portions of negative and positive images, respectively, that are incorrectly predicted. All of these measures range from 0 to 1. Higher values of recall, precision, specificity, and accuracy are desirable, while values equal or close to zero are desirable for FPR and FNR.
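As an illustration of how these measures follow from a multiclass confusion matrix, the sketch below treats one emotion as the positive class and all others as negative. The formulas are the standard definitions and are assumed to match Table 1; the function name and the toy matrix are hypothetical.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_class_measures(confusion, cls):
    """Compute one-vs-rest measures for class index `cls`.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                        # missed positives
    fp = sum(confusion[i][cls] for i in range(n)) - tp   # false alarms
    tn = total - tp - fn - fp
    return {
        "recall": _ratio(tp, tp + fn),
        "precision": _ratio(tp, tp + fp),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, total),
        "fpr": _ratio(fp, fp + tn),
        "fnr": _ratio(fn, fn + tp),
    }


# Toy 3-class confusion matrix (made-up counts, not results from this study).
toy = [[8, 1, 1],
       [2, 7, 1],
       [0, 2, 8]]
print(per_class_measures(toy, 0))
```

Note that FPR = 1 − specificity and FNR = 1 − recall, so the six measures carry four independent quantities.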
Model’s hyper-parameters
The hyper-parameters of the proposed model are C and gamma for the SVM, as discussed in section 3.3.1; the number of neighbors and number of bins for the LIOP descriptor, as discussed in section 3.2.1; and the cell size, block size, and number of histogram bins for the HoG descriptor, as explained in section 3.2.2. The optimal values of the hyper-parameters for the selected datasets are shown in Table 2.
Results and discussions
We examined the performance of the proposed model based on the statistical analysis measures presented in Table 1. The statistical analysis results for the selected datasets are presented in Tables 7–12. The experiments were repeated several times by varying the following parameters:
The scaling of the extracted facial area, based on various interpolation methods and DWT.
The dimension of the feature vector, based on the HoG and LIOP feature descriptors.
The kernel function of the one-vs-one multiclass SVM classifier.
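For readers unfamiliar with the one-vs-one multiclass scheme, the sketch below shows its structure: one binary classifier per pair of classes, with the final label chosen by majority vote. The binary SVMs themselves are not implemented here; the list of pairwise winners is a made-up input, and the tie-breaking rule (first label to reach the top count) is an assumption.

```python
from collections import Counter
from itertools import combinations

# Abbreviations of the six basic emotions as used in the Introduction.
EMOTIONS = ["An", "Di", "Fe", "Ha", "Sa", "Su"]


def one_vs_one_pairs(classes):
    """Return every unordered pair of classes; each pair gets one binary SVM."""
    return list(combinations(classes, 2))


def majority_vote(pair_winners):
    """Return the label that wins the most pairwise decisions."""
    return Counter(pair_winners).most_common(1)[0][0]


pairs = one_vs_one_pairs(EMOTIONS)
print(len(pairs))  # 15 binary classifiers for 6 classes

# Hypothetical outputs of the 15 pairwise classifiers for one test image.
winners = ["An", "Fe", "Ha", "Sa", "Su",
           "Fe", "Ha", "Di", "Su",
           "Ha", "Fe", "Su",
           "Ha", "Ha",
           "Su"]
print(majority_vote(winners))  # Ha
```

With k classes the scheme trains k(k−1)/2 binary classifiers, so the six basic emotions require 15.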
We selected these parameters after extensive experiments on the pre-processed images. Image resizing is a major pre-processing step because the size of the HoG descriptor depends on the image dimension, which varies from dataset to dataset and also from person to person within the same dataset. SFEW aligned faces are 143×181, TFEID rimmed images are 480×600, RAF-DB aligned faces are 100×100, FER-2013 images are 48×48, JAFFE faces are around 150×150, and the minimum face size in CK+ images is around 200×200. All images can therefore be scaled up or down to an average size so that the feature vectors have equal length. Two commonly used image resizing techniques are interpolation and DWT. We repeated the experiments with nearest-neighbor, bilinear, and bicubic interpolation and with DWT, as shown in Tables 3–6. The experimental results show that the optimal recognition accuracy is achieved by resizing images with bicubic interpolation for all the selected datasets.
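To make the dependence of the HoG length on image size concrete, the sketch below counts descriptor entries for a square image. It rests on assumptions that this section does not state: each cell contributes 3·n + 4 values for n orientations (as in UoCTTI-style HoG variants), and the number of cells per side is the side length divided by the cell size, rounded to the nearest integer. Under these assumptions the formula reproduces the HoG lengths reported in this paper (775 and 279), but the function and its parameters are illustrative only.

```python
def hog_length(side, cell_size, n_orientations):
    """Length of a cell-based HoG descriptor for a side x side image.

    Assumed layout (not confirmed by the paper): round(side / cell_size)
    cells per side and 3 * n_orientations + 4 values per cell.
    """
    cells_per_side = (side + cell_size // 2) // cell_size  # integer rounding
    values_per_cell = 3 * n_orientations + 4
    return cells_per_side ** 2 * values_per_cell


print(hog_length(129, 24, 9))  # 775, the HoG length reported for TFEID
print(hog_length(49, 16, 9))   # 279, the HoG length reported for FER-2013
print(hog_length(100, 24, 9))  # 496: a 100x100 image gives a different length
```

The last line shows why a common image size is needed: with the same cell size, a 100×100 image and a 129×129 image yield descriptors of different lengths, which a single SVM cannot accept.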
The size of the LIOP descriptor is independent of the image dimension. However, LIOP needs a square image with an odd side length, which can be achieved by scaling all selected dataset images in one of two ways:
(a) To an average size, i.e. 129×129, with the help of DWT and interpolation techniques.
(b) To a side length equal to either the shorter or the longer image side, based on the interpolation techniques.
In the JAFFE and CK+ datasets, the facial part detected by the Viola-Jones algorithm is a square image with either odd or even side length, and we scaled these images using option (a). RAF-DB and FER-2013 consist of square aligned faces of 100×100 and 48×48 dimensions, respectively, while the SFEW and TFEID datasets consist of rectangular aligned faces that can be resized using option (a) or (b). We performed experiments for all the mentioned resizing options, and our results show that the optimal accuracy is achieved for all datasets with images directly resized to 129×129 through bicubic interpolation. The exception is FER-2013, whose accuracy decreased when its 48×48 images were scaled up to 129×129, which is more than double the original size. We therefore converted FER-2013 images to 49×49 square images, and an increase in recognition accuracy was noted. We also performed classification for the complemented feature vectors ᵮ and achieved promising results for all datasets, as shown in Tables 7–12.
Experimental results show that emotions like surprise cause more facial deformation and can be easily recognized, while emotions like sadness and fear result in a high false-positive rate because of similar deformations (refer to Fig. 8).

Sample images of the six emotions from the selected datasets. From left to right, the columns show angry, disgust, fear, happy, sad, and surprise.
The comparative analysis of the recognition accuracy, precision, and recall for the datasets collected in lab-constrained and lab-unconstrained environments is shown in Figs. 10–11. The accuracy rate for the lab-constrained datasets is much higher than for the lab-unconstrained datasets because the former are captured under constant illumination conditions, head pose, etc.

Bar diagram of the SFEW dataset performance based on the variable length of a hybrid feature vector.

Bar diagram of average recognition accuracy, recall, and precision for the lab-constrained datasets.

Bar diagram of average recognition accuracy, recall, and precision for the realistic datasets.
The CK+ dataset is publicly available and is the most commonly used benchmark dataset. We therefore selected it first for experimentation, varying the kernel functions under image resizing option (a) to fine-tune the resizing techniques and the SVM kernel function. For option (a), we resized images directly to 129×129, and we also first applied DWT to extract the LL band of the image and then resized it to 129×129 with the interpolation techniques. Table 3 shows that bilinear and bicubic interpolation perform comparatively better than nearest-neighbor interpolation, so we performed the subsequent experiments for all dataset images with bicubic and bilinear interpolation. The SVM polynomial kernel outperforms the linear and RBF kernel functions, resulting in 99.7% accuracy for images directly resized to 129×129 with 4320-dimensional LIOP features, as shown in Table 3. Table 7 shows that 99.8% accuracy is achieved with a 544-dimensional hybrid feature descriptor.
Performance evaluation for JAFFE dataset
Initially, we performed experiments with LIOP features extracted by considering 4 and 6 neighbors, resulting in 144- and 4320-dimensional feature vectors, respectively. A recognition accuracy of 87.7% was achieved with 4320-dimensional LIOP features for images directly resized to 129×129 through bicubic interpolation (refer to Table 4). Experiments were then performed for the hybrid features ᵮ(LIOPi +HoGi). The 544-dimensional hybrid features with the SVM polynomial kernel resulted in the optimal recognition accuracy of 98.2%, as presented in Table 8.
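The two LIOP lengths quoted above follow from the structure of the descriptor: with N neighbors there are N! possible intensity orderings, and one histogram over these orderings is kept for each of B ordinal intensity bins, giving N!·B entries. The sketch below checks this arithmetic and illustrates feature-level fusion as plain concatenation. The value B = 6 is inferred from the reported lengths rather than stated in this section, and the function names are hypothetical.

```python
from math import factorial


def liop_length(n_neighbors, n_bins):
    """Length of a LIOP descriptor: N! orderings for each of B intensity bins."""
    return factorial(n_neighbors) * n_bins


def fuse(liop_features, hog_features):
    """Feature-level fusion by concatenating the two descriptor vectors."""
    return list(liop_features) + list(hog_features)


print(liop_length(4, 6))  # 144
print(liop_length(6, 6))  # 4320

# Placeholder vectors with the lengths reported for TFEID (144 LIOP + 775 HoG).
hybrid = fuse([0.0] * 144, [0.0] * 775)
print(len(hybrid))  # 919
```

The factorial growth explains why moving from 4 to 6 neighbors multiplies the descriptor length by 30, and why the hybrid vectors built on the 144-dimensional LIOP stay compact.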
Performance evaluation for TFEID dataset
Rimmed images with 480×600 dimensions (refer to Fig. 8) are considered for experimentation. LIOP requires square images with an odd side length, so experiments were performed by resizing images to the shorter side length, to the longer side length, and to an average size, i.e. 129×129 (refer to Table 5). Images directly resized to 129×129 resulted in a high recognition accuracy of 84.8% with the 4320-dimensional LIOP descriptor and the SVM polynomial kernel, while a recognition accuracy of 93.5% was achieved with 919-dimensional hybrid features (144-dimensional LIOP and 775-dimensional HoG feature vectors), as shown in Table 9.
Performance evaluation for SFEW dataset
We performed experiments on the 143×181 aligned facial images shown in Fig. 8. The maximum accuracy was achieved for images directly resized to 129×129 with bicubic interpolation. An accuracy rate of 37.1% was achieved for 144-dimensional LIOP descriptors extracted with four neighbors, as shown in Table 6. We then repeated the experiments by varying the size of the hybrid feature vector: we concatenated the 144-dimensional LIOP descriptor with a variable-size HoG descriptor obtained by varying the cell size and number of bins, as shown in Fig. 7. The performance for the varied feature vector lengths is shown in Fig. 9. The highest accuracy rate of 39.9% is achieved for 769- and 919-dimensional features, with the same recall and different precision values, i.e. 31.8 and 34.4, respectively. Accuracy decreases beyond that length because of overfitting caused by the high-dimensional feature vector. We then performed 10-fold cross-validation to enhance the classifier’s generalization power, which resulted in an accuracy rate of 56.0%. The statistical analysis based on 919-dimensional hybrid feature descriptors and SVM with 10-fold cross-validation is presented in Table 10.
Performance evaluation for FER-2013 dataset
This dataset comprises 48×48 grayscale images (refer to Fig. 8). As the LIOP descriptor requires a square image with an odd side length, we resized all images to 49×49 with bicubic interpolation. The images are considerably smaller than those of the other selected datasets, so the optimal results are also achieved with a smaller descriptor size, namely 423 (144-dimensional LIOP and 279-dimensional HoG descriptors). Initially, we performed experiments by dividing the dataset into training and testing parts, and an optimal accuracy rate of 56.1% was achieved with the SVM polynomial kernel and a penalty value of 0.4. As is clear from Fig. 8, the major facial region is hidden in most of the images, which badly affects the classifier’s generalization ability. To improve the generalization ability, we repeated the experiments with cross-validation. The highest accuracy rate of 63.0% is achieved with 10-fold cross-validation. The statistical analysis is shown in Table 11.
Performance evaluation for RAF-DB
We selected aligned facial images with 100×100 dimensions (refer to Fig. 8). All images were resized to 129×129 with bicubic interpolation for experimentation. We evaluated the performance with both the training-testing division and cross-validation. A recognition accuracy of 78.1% was achieved by training the classifier on the training set and testing it on the remaining dataset images. Statistical analysis was performed to measure the performance of the proposed model (refer to Table 12).
Performance comparison
We compared the performance of the proposed model with state-of-the-art approaches from the literature for recognizing emotions based on facial expressions. The comparison for each dataset is shown in Tables 13–18. In [42], the authors extracted HoG features from the facial area detected by the CAMshift algorithm and classified them with an SVM classifier; a 95% accuracy rate was reported for the CK+ dataset. In [41], an improved accuracy rate (98.8% for CK+) was reported by extracting HoG features from the facial area detected by the Viola-Jones algorithm. In [44], the authors combined HoG and U-LTP features extracted from frontal face images and classified them with an SVM classifier. They extracted HoG features with a 16×16 cell size, resulting in a comparatively large feature vector. In our case, more compact features are extracted from the processed facial area detected by the Viola-Jones algorithm. We extracted HoG features with cell size 24 (for TFEID, SFEW) and 32 (for CK+, JAFFE), and the recognition accuracy improved when they were concatenated with LIOP features. Sadeghi et al. [59] convolved images with Gabor filters and coded the outputs based on maximum and minimum responses. The final feature vectors were computed from histograms for 16 and 8 Gabor filters, with lengths of 240 and 56, respectively. Ten-fold cross-validation was performed for classification by an SVM classifier with a linear kernel. An approach based on facial patches indicating salient features was proposed for ER in [36]: LBP-based concatenated histogram features were extracted from the selected salient patches and used as input for training and classification by an SVM classifier. A framework based on meta probability codes was proposed in [60] to recognize facial expressions. Four feature extractors, i.e. LBP, Gabor wavelet, facial fiducial points, and Zernike moments, were selected for feature extraction, while SVM, KNN, a sparse representation-based classifier, and an RBF neural network were selected for classification. Zeng et al. [38] first detected 51 facial landmark points by AAM. A single feature vector was generated by concatenating HoG, LBP, and gray value-based features extracted from local patches centered at each landmark. The dimensionality of this high-dimensional feature vector was then reduced by PCA, and the final feature vector was used as input to deep sparse auto-encoders with four hidden layers of 100 hidden nodes each. The local mean binary pattern (LMBP), an appearance-based local descriptor, was proposed in [61]; the extracted features were classified by template matching as well as by a least-squares SVM.
We compared the proposed model with deep learning-based approaches as well. Jain et al. [28] first employed contrastive equalization for contrast enhancement. The pre-processed images were then classified by a DCNN with convolutional and deep residual layers. Sun et al. [62] proposed a method to identify the optimized active region from three regions: the left eye, the right eye, and the mouth. Histogram equalization, spatial normalization, and rotation correction were performed on the expression images to obtain more suitable features. The final emotion label was then obtained by majority voting over the classification results of three CNNs. In [63], the authors combined two smaller networks to reduce the overfitting caused by big networks on small datasets. The CNN model was developed by combining a network based on temporal appearance features with a network based on temporal geometric features. Both types of extracted features were then jointly fine-tuned to generate a low-dimensional feature space for classification. An algorithm for recognizing facial expressions with the help of biologically plausible features is presented in [64]. Initially, the homogeneity of the encoded facial contours was checked by a self-organizing network. Finally, several classifiers, including SON, KNN, and a multi-layer perceptron, were applied for classification.
A multithreading cascade of SURF (McSURF) was implemented by configuring the area under the curve (AUC) for ER in [19]. The three components of that work were the extraction of SURF feature descriptors from local patches, weak classifiers based on logistic regression combined with the ROC curve for testing cascade convergence, and a multi-threading cascade able to process multiple categories to speed up the training process. Weak classifiers were built for each local patch based on SURF descriptors and logistic regression, and the optimum patches were selected during boosting iterations, which were repeated until the AUC converged or a specified number of iterations was reached.
Comparison Tables 13–18 show that our proposed model based on texture features is highly suitable for recognizing emotions with promising accuracy in both constrained and unconstrained environments. We compared the results with FER frameworks developed with texture features as well as with deep learning-based approaches. The proposed model outperforms several state-of-the-art FER approaches, which demonstrates its superiority.
Comparison of the average recognition accuracy of the CK+ dataset
Comparison of the average recognition accuracy of the RaF-DB dataset
Comparison of the average recognition accuracy of the JAFFE dataset
Comparison of the average recognition accuracy of the TFEID dataset
Comparison of the average recognition accuracy of the SFEW dataset
Comparison of the average recognition accuracy of the FER-13 dataset
The proposed model was developed on a Dell Core i3 system with 4 GB RAM, the Windows 8.1 (64-bit) operating system, and MATLAB R2018a (64-bit). Either empirical or theoretical analysis can be performed to analyze the computational performance of an algorithm. The empirical analysis depends on the specifications of the computer system used for experimentation, and it varies from time to time depending on the computational load, whereas in the theoretical analysis the input size determines the required computational resources [68]. We performed a theoretical analysis of the computation time required for recognizing emotions based on 129×129 processed facial images with 16,641 pixels. Initially, we performed experiments with the LIOP descriptor alone, and the optimal recognition rate was achieved with a high-dimensional LIOP descriptor of length 4320. We then performed feature-level fusion of the LIOP and HoG descriptors, which resulted in compact and distinguishing feature vectors of 423 (for FER-2013), 544 (for JAFFE and CK+), and 919 (for TFEID and SFEW) dimensions. As the hybrid descriptors are much shorter than the 4320-dimensional LIOP descriptors while giving comparable recognition accuracy, the required computational resources are reduced to a great extent. We performed this analysis independently of the system used to execute the program code. The computation time in seconds required for feature extraction from an image of the CK+ dataset on the mentioned system is shown in Table 19.
Comparison of computational cost (in seconds) required for the per image facial features extraction from CK+ dataset
The comparison of the computation time for classifier training on the whole datasets and the per-image testing time is shown in Table 20 for the CK+ and JAFFE datasets. The computation time is reduced to a great extent because of the robust selected descriptors.
Running time in seconds for CK+ and JAFFE datasets
Conclusion and future work
In this research study, we proposed a model to recognize facial expressions based on facial texture information. The core idea behind the developed framework is to enhance the discrimination power of the texture feature descriptors. The major challenges in the development of ER systems today are the variation in the geometrical structure of the facial area from person to person and environmental conditions such as illumination. We extracted facial texture information that is invariant to the geometrical structure of the face, using descriptors that are also invariant to several image transformations, such as changes in illumination conditions, face viewpoint, and image blurring. The experimental results show that our selected complementary features are useful for recognizing emotions in both lab-constrained and realistic datasets. The complementary descriptors based on local intensity order and histogram of oriented gradients increased the inter-class distance while decreasing the intra-class distance, which enhanced the recognition accuracy as reflected in our results. We performed supervised classification with a one-vs-one SVM classifier and varying kernel functions. The experimental results show that the polynomial kernel gives optimal accuracy rates of 99.8%, 98.2%, 93.5%, 56.0%, 78.1%, and 63.0% for the CK+, JAFFE, TFEID, SFEW, RAF-DB, and FER-2013 datasets, respectively. We have not hardcoded the constraints related to intensity values, so the proposed model can be fine-tuned for distinct datasets captured in varying environments. In the future, we aim to extend the proposed framework by combining it with a deep learning model. The idea can be implemented by concatenating the selected complementary features with deep learning features. The resulting feature vectors can then be classified by either a local learning or a deep learning model.
Another possible extension is to apply any deep transfer learning approach like VGG-19 to fine-tune the pre-trained model on the selected realistic datasets to optimize its classification accuracy.
Contributions
All the authors contributed equally to this research work.
Acknowledgments
We would like to thank Patrick Lucey, Li-Fen Chen, et al., Abhinav Dhall et al., and Michael J. Lyons et al. for providing facial emotion datasets for experimental purposes.
