FS-DSN: A feature segmentation-based dual-stream network for robust facial expression recognition in uncontrolled environments

Abstract

In uncontrolled environments, facial expression recognition encounters challenges such as poor image quality, uneven lighting, facial occlusions, and head pose variations. To address the challenges of facial occlusions and head pose variations, this paper introduces the Feature Segmentation-Based Dual-Stream Network (FS-DSN). The network consists of four components: a feature pre-extraction module, a feature segmentation module, a global feature extraction module, and a local feature extraction module. The pre-extraction module extracts mid-level features from facial expression images, which are then segmented into three areas: left eye, right eye, and mouth. The global feature extraction module uses the full set of features to extract global expression features, while the local feature extraction module focuses on the segmented regions. This dual-stream approach captures both broad and subtle expression changes, enhancing the semantic interpretation of facial expressions. Empirical tests show FS-DSN's robust performance, achieving accuracies of 88.82%, 60.09%, 78.33%, and 74.17% on the RAF-DB, SFEW 2.0, FED-RO, and FER-2013 datasets, respectively.

Keywords

uncontrolled environments facial expression recognition dual-stream network feature segmentation

1. Introduction

In interpersonal communication, facial expressions convey up to 55% of emotional information (Mehrabian and Ferris, 1967), making them the most potent medium for expressing emotional states. The study of facial expressions as a natural human physiological phenomenon dates back to Darwin's research on the expressive features of humans and animals (Darwin, 2009). Darwin used facial expressions as a starting point to elucidate the connections and evolutionary aspects between humans and animals, marking profound significance in research. With the significant advancements in deep learning technology, Facial Expression Recognition (FER) has emerged as a prominent research direction within the field of computer vision. Furthermore, FER holds extensive application value in areas such as fatigue driving detection (Jeong and Ko, 2018), intelligent learning (Li et al., 2019a; Singh and Ramanujam, 2025), medical assistance (Irani et al., 2015; Li et al., 2019b; Sripian et al., 2021), and human–computer interaction (Starostenko et al., 2015).

In recent years, the advancement of deep learning has led to an abundance of solutions for FER tasks, achieving notable success (Guo et al., 2025; Li et al., 2020; Qian et al., 2023). However, these achievements often rely on datasets from controlled laboratory environments, such as CK+ and Jaffe (Lucey et al., 2010; Lyons et al., 1998). Real-world FER applications, conversely, operate under uncontrolled conditions where factors like uneven lighting, facial occlusions, and head pose variations (Liao et al., 2022) negatively affect feature detection and extraction, posing significant challenges. Farzaneh and Qi (2021) introduced a method that enhances recognition by adaptively selecting subsets of crucial feature elements using a deep attention-based center loss to enhance class compactness and separation. Xie et al. (2022) proposed improving model compactness and class separation by optimizing the loss function. These approaches, however, do not account for partial facial occlusions and head pose variations, thus limiting their effectiveness.

Cognitive science and psychology research suggest that human facial perception processes details from a holistic to a more granular, part-based approach (Roberson et al., 2012), with global and local facial features possessing distinct granularities. Therefore, extracting features globally and locally is an effective method. Liao et al. (2022) selected 16 key landmarks out of 68 facial landmarks, integrating local feature information through a classification module. Li et al. (2019c) focused on 24 landmarks, including the eyes, nose, and cheeks, applying an attention mechanism for differential weighting of these patches. Liu et al. (2021) retained 38 of 68 facial landmarks, connecting feature vectors with Gabor (Luan et al., 2018) and HOG (Dalal and Triggs, 2005) filters to form a comprehensive set of local texture features. This approach, while effective in unobstructed environments, faces inaccuracies in landmark positioning under occlusions. Li et al. (2023) proposed generating patches by sliding a fixed number of pixels along the image's x-axis, efficiently capturing local features. However, due to the excessive number of patches and fixed pixel shifts, this method lacks specificity for key areas and is inefficient in feature extraction.

Experiments by Widen et al. (2011) demonstrated that the human visual system tends to focus more on the eye and mouth regions when performing FER. To more accurately extract local features of facial expressions, Zhao et al. (2021) divided facial expression images into four equal regions using spatial axes, then extracted local features with a CBAM (Woo et al., 2018)-enhanced module. Wang et al. (2020) devised three methods for dividing local regions: fixed-position cropping, random cropping, and landmark-based cropping. Ding et al. (2020) segmented global features into m × n irregular, non-overlapping blocks, merging predictions from each to form a final outcome. Weng et al. (2021), noting facial symmetry, focused solely on one eye after defining the eyes, nose, and mouth. Liu et al. (2022a) divided faces into four sub-regions (upper left, upper right, lower left, and lower right) using a pose segmentation and filling method, ensuring effective feature segmentation under occlusions.

Wang et al. (2019) showed that occluding the eyes and mouth significantly reduces the ability to decode facial expressions, thus, recognizing facial expressions hinges on these crucial areas. Based on this, we propose a Feature Segmentation-Based Dual-Stream Network (FS-DSN), separately extracting global and local features. FS-DSN comprises feature pre-extraction, segmentation, global feature extraction, and local feature extraction modules. In the segmentation module, intermediate feature maps are divided into non-overlapping regions for the left eye, right eye, and mouth. The global feature extraction module adapts to feature map reductions, and the local feature extraction module targets the segmented features for more precise extraction. Finally, a fusion decision combines global and local features for the final recognition result.

The main contributions include:

(1)
A dual-stream network method based on feature segmentation is proposed to extract overall and partial expression features comprehensively to take into account both global and detailed features of facial expressions, so as to improve the robustness of the network.
(2)
A facial feature segmentation method inspired by human perception mechanisms and psychology is designed to efficiently delineate feature regions, extracting potential information from each.
(3)
A two-stream network is designed to more effectively target global and local feature extraction.
(4)
Experimental results on uncontrolled datasets demonstrate that FS-DSN outperforms previous methods, achieving higher recognition accuracy and stronger robustness.

2. Related work

2.1 Dual-stream network for FER

FER methods based on dual-stream networks typically divide the face into global and local regions, with separate branches for extracting features. These features are then fused to generate a comprehensive expression feature. Previous approaches, such as Zhao et al. (2021), employed a dual-stream network where the global branch used an attention module like CBAM (Woo et al., 2018), and the local branch applied multi-scale extraction to capture detailed local features. The global and local features are then fused to produce a comprehensive recognition result. Building on this architecture, Liu et al. (2022b) proposed a dual-stream network with a specialized Gate-OSA module in the global branch, enhancing global facial feature extraction. Weng et al. (2021) utilized three sub-networks to extract features, focusing on aligned facial images. The local branch relied on several key facial landmarks, including the eyes, nose, and mouth, to extract regional features, which were then fused with the global features for final classification. Li et al. (2023) presented a dual-stream approach that segments the feature map into patches along horizontal and vertical axes, with each patch processed by the local branch, while the entire feature map flows through the global branch. Their approach emphasizes the importance of patch-level feature extraction and fusion. Similarly, Li et al. (2019c) and Ding et al. (2020) proposed methods where local regions, based on facial landmarks, were either segmented into fixed or dynamic blocks for local feature extraction, which were later fused with global features at different levels. Other works incorporate dual-stream CNNs combining RGB data and texture descriptors with multi-feature fusion layers (Tiong et al., 2020), fine-grained bilinear CNNs modeling second-order interactions between local and global features (Shabbir and Rout, 2023), and dual attention networks to capture local and global facial micro-expression cues (Takalkar et al., 2021). Additionally, some methods employ 3D landmark-based attention masks to enhance focus on salient regions (Shahid and Yan, 2023) or utilize dual-stream super-resolution combined with recurrent neural classifiers to improve recognition on low-resolution images (Ullah et al., 2022).

While these two-stream architectures have made notable progress in FER, many global branches focus on limited scales. They often rely on coarse segmentation strategies, which can result in incomplete feature extraction. To address these limitations, we propose a more flexible and precise Feature Segmentation-Based Dual-Stream Network (FS-DSN). FS-DSN uses 9 × 9 and 7 × 7 convolution kernels in the backbone's first two layers to capture detailed intermediate features across multiple scales. The global branch refines this with 5 × 5 and 3 × 3 kernels to extract finer semantic details. We also incorporate CBAM in the global branch to allow the network to focus on the most informative facial regions for expression recognition. The local branch divides the facial feature map into three regions—left eye, right eye, and mouth—each processed by a sub-branch with 3 × 3 kernels. We also integrate Coordinate Attention into the local branch to enhance spatial relationships within each region. By combining multi-scale global extraction with targeted local features, FS-DSN improves upon traditional networks, enabling more accurate and reliable FER, especially in uncontrolled conditions.

2.2 Feature segmentation for FER

FER in uncontrolled environments often requires dividing the face into regions and employing specialized extraction methods for these areas. Many existing methods, such as Liao et al. (2022), rely on facial landmark segmentation, where key points are selected from a set of 68 facial landmarks, and a classification module integrates the local feature information. Similarly, Ding et al. (2020) segmented the face into 16 regions based on facial landmarks, dividing the global feature map into irregular, non-overlapping blocks to learn useful contextual information. Li et al. (2019a) introduced a regional decomposition and occlusion-aware method, recalculating 24 facial landmarks and dividing the face into 24 patches, using patch gating units to account for occlusions and focus on unobscured patches. Other methods, like Li et al. (2023), employed a sliding patch technique to extract local features by sliding a fixed-size block across the facial feature map, while Zhao et al. (2021) used a spatial axis-based segmentation method to divide the map into non-overlapping blocks. Similarly, Liu et al. (2022a) introduced a facial attribute-based LP module, dividing the face into four sub-regions and assigning facial organs to each region for effective feature extraction, especially under occlusion. Wang et al. (2020) explored cropping methods such as fixed-position, random, and landmark-based cropping, focusing on key facial landmarks like the eyes, nose, and mouth corners. Islam et al. (2018) proposed segmenting the face into four expression regions and extracting fused HOG and LBP features, followed by multiclass SVM classification. Mohseni et al. (2014) developed an anatomy-based facial graph modeling facial muscles via geometric facial points and graph edges, extracting robust features under pose variation. Dillague et al. (2024) demonstrated that incorporating eyebrow features alongside eyes and mouth improves accuracy using neural network classifiers. Rathod et al. (2023) applied super-resolution and multi-scale deep feature extraction techniques on local facial regions to enhance group emotion recognition. Aquib et al. (2024) proposed to capture the global dependencies of each local region of the face through a modified nonlocal convolutional neural network combined with an attention mechanism, which effectively enhances the local feature interaction.

While these methods achieve success in segmenting facial regions, they generally rely on pre-calibrated key points or equally divide the facial region without targeting specific features, which may fail to capture subtle changes in facial expressions due to occlusions or pose variations. FS-DSN addresses these limitations by segmenting the face into three key regions: left eye, right eye, and mouth. This targeted segmentation method enables more accurate local feature extraction. By combining fine-grained segmentation with multi-scale global feature extraction, FS-DSN overcomes the shortcomings of previous segmentation-based methods. It not only focuses on critical regions but also performs better in handling occlusions and pose variations. This enhanced segmentation and attention mechanism makes FS-DSN an effective solution for FER, especially in uncontrolled environments with partial occlusion of faces.

3. Proposed method

3.1 Overview

We introduce a Feature Segmentation-Based Dual-Stream Network for FER tasks in uncontrolled environments, as depicted in Figure 1. This network comprises two branches and four main components: the feature Pre-Extraction module (PE Module), Feature Segmentation module (FS Module), Global Extraction module (GE Module), and Local Extraction module (LE Module). Initially, layers Conv1 to Conv3 from ResNet-18 (He et al., 2016) serve as pre-extractors to derive shallow image features, resulting in feature maps of dimensions 128 × 3 × 3. Subsequently, these pre-extracted shallow features are divided into two branches for the extraction of global and local region features, respectively. Finally, feature-level fusion is utilized on the extracted global and local features to produce the final recognition result.

Figure 1.

Structure of the proposed method. PE = the pre-extraction feature module; GE = the global feature extraction module; FS = the feature segmentation module, and LE is the local feature extraction module; GAP = global average pooling; FC = a fully connected layer.

3.2 Global extraction module

The primary function of the global feature extraction module is to extract complete facial features, focusing on the positional relationships among various features. Specifically, we designed a ResNet network with different convolution kernel sizes to accommodate changes in the size of feature maps, as illustrated in Figure 2(a). Within the four layers of ResNet, we have modified the inherent convolutional kernels of size 3 × 3 to sizes of 9 × 9, 7 × 7, 5 × 5, and 3 × 3, respectively. The original image, after being processed by the feature pre-extraction module, is resized to 128 × 28 × 28, and then input into the ResNet with varying sizes of convolutional kernels for global feature extraction. The features extracted by the global feature extraction module are comprehensive, and as the input progresses deeper into the layers of ResNet, the feature maps gradually decrease in size. Larger convolutional kernels are generally more adept at handling higher-order information, making the extraction of global features more comprehensive, whereas smaller kernels can extract texture information from the images more effectively. Therefore, modifying the size of the convolutional kernels at each layer of ResNet is essential. Furthermore, within each residual block, the feature map undergoes two convolution operations before being fed into the CBAM module, which considers the image from both spatial and channel dimensions. This attention mechanism involves sequentially multiplying the attention maps by the spatial and channel maps. The spatial and channel feature maps can be defined as follows:

\begin{aligned} M_{s} & = σ (C o n v^{7 \times 7} (A v g P o o l (X); M a x P o o l (X))) \end{aligned}

(1)

\begin{aligned} M_{c} & = σ (M L P (A v g P o o l (X)) + M L P (M a x P o o l (X))) \end{aligned}

(2)

where X is the input feature map,

σ

is the sigmoid activation function, MLP is Multi Layer Perceptron. The output of CBAM can be expressed as:

\begin{aligned} Y = (X ⊙ M_{c}) ⊙ M_{s} \end{aligned}

(3)

Figure 2.

FS-DSN uses a global feature extraction module and a local feature extraction module: (a) Global ResNet and (b) Local ResNet.

After processing through the ResNet module with varying convolutional kernel sizes, the output is passed into a Global Average Pooling (GAP) layer, resulting in a feature vector of size 512. This step is crucial for condensing the spatial dimensions of the feature maps into a compact representation, while preserving essential global information conducive to the task at hand. The employment of different kernel sizes throughout the ResNet layers not only facilitates the extraction of features at various scales but also enhances the network's adaptability to the dynamic nature of facial expressions in uncontrolled environments. Subsequently, the GAP layer's role in aggregating these features ensures that the final feature vector is both comprehensive and succinct, making it well-suited for the subsequent recognition processes.

3.3 Local extraction module

In the context of FER in uncontrolled environments, challenges such as poor image quality, lighting effects, occlusions of facial regions, and changes in head posture can significantly affect the effectiveness of conventional feature extraction methods, often leading to loss of features or incomplete feature extraction. To address these issues, we adopt a simultaneous global and local feature extraction approach, which ensures comprehensive capture of the facial expression features while also enabling targeted extraction of local features that significantly influence facial expressions. Specifically, traditional methods of local feature segmentation, which divide the facial expression image into several regions of equal or unequal sizes, typically do so without specific focus, leading to less effective feature extraction. Therefore, prior to local feature extraction, we devised a more efficacious feature segmentation method. This involves defining specific local areas for targeted extraction, followed by conducting further in-depth feature extraction on the local feature maps obtained from these areas. This approach not only enhances the accuracy of FER by ensuring that critical local features are not overlooked but also improves the robustness of the recognition system against the various challenges present in uncontrolled environments.

Before initiating the local extraction process, the Feature Segmentation module divides the intermediate layer feature maps. Specifically, after the input image is processed through the feature PE module, it segments the intermediate layer feature maps along the horizontal and vertical spatial axes into three local feature maps $F_{i}$ , where $i \in {1, 2, 3}$ . These segmented maps correspond to the facial regions of the left eye, right eye, and mouth. This segmentation approach targets the extraction phase more precisely, thereby enhancing the network's ability to discern features from these critical areas. Chapter 4's ablation studies further validate that segmenting the eye regions into two distinct areas yields superior results compared to a singular region approach. The feature map obtained post the feature pre-extraction module is represented as $S \in R^{128 \times 28 \times 28}$ , where 128 signifies the number of channels and 28 the dimensional size of the image. This map is then divided into three local feature region maps $S_{i} \in R^{128 \times 14 \times 14}$ , $i \in {1, 2}$ and $S_{i} \in R^{128 \times 14 \times 28}$ , $i$ = 3. With each corresponding to one of the aforementioned facial areas, and =3 indicating the tripartite segmentation. This focused method of feature segmentation is instrumental in augmenting the accuracy and efficiency of FER by emphasizing the extraction from facial regions most indicative of emotional states.

The ResNet with Attention, as illustrated in Figure 2(b), integrates Coordinate Attention (CA) (Hou et al., 2021) into each residual block of ResNet-18. CA meticulously evaluates each pixel across both horizontal and vertical dimensions of the image, amplifying the focus on key regions. The purpose of incorporating local feature extraction is to capture finer details within each localized facial region (such as the twitching of the eye corners, or the upturning or downturn of the mouth corners). CA can seamlessly integrate into the ResNet architecture, enhancing its ability to focus on and extract detailed information from specific areas of interest. This incorporation of CA ensures that the network not only recognizes broader facial expressions but also pays close attention to subtle changes in facial features, which are critical for accurate expression recognition.

In the local extract module, the output can be written as:

\begin{aligned} Y_{C} (i, j) = X_{c} (i, j) \times g_{c}^{h} (i) \times g_{c}^{w} (j) \end{aligned}

(4)

where,

g^{h}

and

g^{w}

are the weights of the vertical and horizontal axes of the obtained image respectively, which are finally multiplied with the input image to give the final output.

In the local extraction module, the input feature maps for the left and right eye regions are sized at 128 × 14 × 14, while the input for the mouth region is 128 × 14 × 28. After processing through the local extraction module, the outputs are two feature maps of size 512 × 7 × 7 for the eyes and one feature map of size 512 × 7 × 14 for the mouth. These feature maps are then concatenated along the spatial axis—first, the feature maps of the left and right eyes are concatenated, and this concatenated map is subsequently concatenated with the feature map of the mouth region. This concatenation results in a composite feature map of dimensions 14 × 14 × 512. When this composite feature map is input into a Global Average Pooling (GAP) layer, it produces a feature vector of size 512.

Figure 3 displays the outcomes of the facial feature segmentation method under three distinct conditions: standard, facial occlusion, and head pose variations. Despite these uncontrolled environments, the Feature Segmentation (FS) module successfully divides the facial regions into corresponding local areas for the left eye, right eye, and mouth, ensuring the accuracy of feature segmentation. This capability highlights the robustness of the FS module, demonstrating its effectiveness in accurately identifying and segmenting key facial regions even in challenging scenarios. Such precision in feature segmentation is crucial for maintaining the performance of FER systems, as it allows for consistent extraction of relevant features across varying conditions.

Figure 3.

Feature segmentation module, the three purple boxes represent the left eye, right eye, and mouth regions, respectively. The first, second, and third rows are facial expression images representing normal, facial occlusion, and head posture changes acquired on the RAF-DB, FED-RO, and SFEW 2.0 datasets, respectively.

3.4 Integration mechanisms and loss functions

Decision-level fusion refers to the fusion of recognition results from multiple sensors that have independently accomplished decision classification for the purpose of making optimal decisions. This method is efficiently compatible with the feature information of multiple sensors. In the FS-DSN network, the global feature extraction module and the local region feature extraction module focus on deep extraction of global features and features of local organs of the face, respectively, and each of them has unique advantages. Therefore, we adopt a decision-level fusion approach to fuse the global features and local minutiae features with equal weight allocation (50% each). In the global and local extraction module, the feature vectors of output 512 are obtained separately, using the $v^{(A)}$ denotation, where $A \in (G E, L E)$ , the common cross-entropy loss function is used:

\begin{aligned} L_{A} = - \frac{1}{N} \sum_{i = 0}^{N - 1} \log \frac{e^{w_{Y_{i}}^{(A) T} v_{i}^{(A)} + b_{y_{i}}^{(k)}}}{\sum_{j = 0}^{C - 1} e^{w_{j}^{(A) T} v_{i}^{(A)} + b_{j}^{(k)}}} \end{aligned}

(5)

In Equation (5), N denotes the number of samples, C denotes the number of discriminated expression categories, $W^{(A)}$ is the weight, $b^{(A)}$ is the bias, $v_{i}^{(A)}$ denotes the fully connected input of the $i th$ sample, and $y_{i}$ is the label of the $i th$ sample.

Finally, the final loss function is:

\begin{aligned} L = 0.5 * L_{G E} + 0.5 * L_{L E} \end{aligned}

(6)

4. Experiments

4.1 Datasets

We conducted performance evaluations of the Feature Segmentation-based Dual-Stream Network (FS-DSN) on four widely utilized facial expression datasets in uncontrolled environments: RAF-DB, FED-RO, SFEW 2.0, and FER-2013. These datasets are known for their diversity and complexity, encompassing a wide range of facial expressions captured under various uncontrolled conditions, such as different lighting, occlusions, and head poses.

RAF-DB (Li et al., 2017): This dataset is a facial dataset captured in natural settings, comprising 29,672 facial images. These images are annotated with both basic and compound expressions by 40 markers. In our experiments, we utilized images of the seven basic expressions, totaling 15,339 images. This subset is divided into 12,271 images for training and 3068 images for testing.

FED-RO (Li et al., 2019b): This dataset is the first of its kind, exclusively featuring facial expressions with occlusions in real-world environments. It consists of 400 images, each depicting one of the seven basic expressions. This dataset was compiled by mining images of occluded faces from Bing and Google search engines. Each image was annotated by three individuals to ensure the reliability of the expression classifications. Additionally, to maintain the dataset's uniqueness, any images that were also found in the RAF-DB and AffectNet (Mollahosseini et al., 2019) datasets were removed.

SFEW 2.0 (Dhall et al., 2012): The dataset comprises static facial expression images from natural environments, totaling 1766 pictures, obtained through keyframe extraction via facial point clustering from the AFEW dataset. SFEW 2.0 is a widely recognized benchmark dataset in the field, divided into training, validation, and test sets, each featuring seven basic expressions. For our research, we focus on the training and validation sets due to the unavailability of labels for the test set. This selection facilitates a clear and measurable assessment of our model, leveraging the accessible annotations for a precise comparison with other methodologies.

FER-2013 (Goodfellow et al., 2013): FER-2013 is a dataset previously utilized in a FER challenge hosted on Kaggle. It consists of grayscale images, totaling 35,887, featuring the seven basic expressions. This dataset is divided into 28,709 training images, 3589 validation images, and 3589 testing images. Given that these images are sourced from the internet, they are characterized by significant noise and lower image quality. For our study, we selected only the training and testing images for evaluation.

4.2 Implementation details

When preprocessing the datasets, we first employ RetinaFace (Deng et al., 2020) for face detection and alignment, subsequently cropping the images to a size of 224 × 224 pixels. We adopted distinct strategies for the training process across different datasets: For RAF-DB, FED-RO, and SFEW 2.0, we initially pretrained on the AffectNet-7 dataset, followed by fine-tuning. For the FER-2013 dataset, we pre-trained on the RAF-DB dataset before proceeding to fine-tuning. Experiments were conducted using PyTorch 1.11.0 on a GeForce RTX 2060 Super platform. For RAF-DB, FED-RO, and SFEW 2.0 datasets, we set the batch size to 256. For the FER-2013 dataset, the batch size was set to 128. We selected SGD as the optimizer, with a momentum decay set to 0.9 and an initial learning rate of 0.01. The learning rate was decreased by a factor of 10 every 20 epochs, with the training spanning a total of 100 epochs. This approach ensures a comprehensive and adaptable preprocessing and training methodology tailored to the specifics of each dataset, optimizing the performance of our FER model.

4.3 Ablation experiments

In this section, we conducted a series of ablation experiments on the proposed model to validate the effectiveness of the feature segmentation module and the dual-stream network.

(1)
Feature segmentation: for the feature segmentation part, we conducted experiments on two segmentation methods, namely, dividing the left and right eyes into the same region (as shown in Figure 4), and dividing the image into two parts according to the middle of the longitudinal axis direction as a one-time experiment; and dividing it into the left and right eyes and mouth regions (as shown in Figure. 5) according to the method proposed in this paper, which is a total of three regions as a one-time experiment. They were performed on four datasets, respectively, and the experimental results are shown in Table 1.

Figure 4.
Divided into two areas.

Figure 5.
Divided into three areas.

Table 1.
Comparison of the accuracy of the two segmentation methods on four datasets.

Method of segmentation Datasets Accuracy (%)

Two parts (eyes and mouth) RAF-DB 86.52

FED-RO 71.02

SFEW 2.0 58.69

FER-2013 69.23

Three parts (left eye right eye and mouth) RAF-DB 88 . 82

FED-RO 78.33

SFEW 2.0 60.09

FER-2013 74.17

The bold values indicate the parts corresponding to our proposed method, demonstrating its superior effectiveness.

As demonstrated in Table 1, the three-part segmentation strategy exhibits superior performance over the two-part approach across all datasets, with accuracy improvements of 2.30% (from 86.52% to 88.82%) on RAF-DB, 7.31% (from 71.02% to 78.33%) on FED-RO, and modest yet consistent gains of 1.40% and 4.94% on SFEW 2.0 and FER-2013, respectively. These quantitative enhancements validate the effectiveness of our design, where independent modeling of left and right eye regions enables the capture of asymmetric muscle dynamics under occlusions (e.g., in FED-RO), while holistic eye modeling (two-part segmentation) risks losing fine-grained local textures. The observed improvements across diverse uncontrolled environments conclusively demonstrate the robustness and generalizability of our facial feature segmentation framework. (2)
The ablation study on the dual-stream network rigorously evaluates the impact of multi-scale global feature extraction by varying convolutional kernel sizes in the global branch while maintaining a fixed local branch configuration (kernel sizes: [9,7,3,3]). As shown in Table 2, our adaptive kernel adjustment strategy (9→7→5→3) achieves superior performance across datasets, with notable improvements of 7.51% (from 81.31% to 88.82%) over the uniform 9 × 9 kernel configuration and 4.43% over the baseline ResNet-18 (3 × 3 kernels) on RAF-DB. Similar trends are observed on FED-RO, where adaptive kernels yield a 9.44% accuracy gain compared to the 7 × 7 kernel setup. This enhancement stems from a hierarchical feature capture mechanism: larger kernels in shallow layers extract broad contextual patterns (e.g., facial contours), while progressively smaller kernels in deeper layers refine fine-grained textures (e.g., wrinkles and muscle tensions). The synergy between dynamically scaled global features and fixed-scale local features validates the dual-stream architecture's capability to balance holistic and localized semantic representation, particularly under uncontrolled environments with occlusions or pose variations.

Table 2.
Accuracy of ResNet with different convolutional kernel sizes compared on four datasets.

Datasets Kernel size Accuracy (%)

RAF-DB [9,9,9,9] 81.31

[7,7,7,7] 84.09

[5,5,5,5] 83.61

[3,3,3,3] 84.39

[9,7,5,3] 88.82

FED-RO [9,9,9,9] 64.74

[7,7,7,7] 68.89

[5,5,5,5] 67.62

[3,3,3,3] 69.02

[9,7,5,3] 78.33

SFEW 2.0 [9,9,9,9] 50.04

[7,7,7,7] 55.65

[5,5,5,5] 56.68

[3,3,3,3] 56.94

[9,7,5,3] 60.09

FER-2013 [9,9,9,9] 66.59

[7,7,7,7] 66.78

[5,5,5,5] 69.51.

[3,3,3,3] 72.13

[9,7,5,3] 74.17

The bold values indicate the parts corresponding to our proposed method, demonstrating its superior effectiveness.

Figure 6 presents the Class Activation Mapping (CAM) (Zhou et al., 2016), where areas of high attention are indicated in red, with deeper colors signifying greater attention weight. From Figure 6(a), the Feature Segmentation (FS) approach demonstrates superior focus on critical areas compared to the Dual-Stream Network (DSN). However, the FS-DSN exhibits more accurate overall attention to the facial image, particularly concentrating on unobstructed facial regions, thus achieving higher recognition accuracy. Figure 6(b) shows the effect on the RAF-DB dataset. It is evident that FS-DSN prioritizes the eye and mouth regions, making the feature extraction more targeted and disregarding irrelevant and non-critical areas, which not only enhances recognition accuracy but also aligns with our anticipated goals.

Figure 6.
Comparison of class activation mapping (CAM) of baseline, feature segmentation, dual-stream network, and FS-DSN on FED-RO, SFEW 2.0 dataset: (a) On FED-RO and (b) On RAF-DB.

Figure 7 showcases the accuracy curves from the ablation experiments. From Figure 7(a), it is evident that on the RAF-DB dataset, the accuracy curves of FS, DSN, and FS-DSN display a clear separation effect, indicating distinctive performance characteristics among the different configurations. Figure 7(d) reveals that on the FER-2013 dataset, FS-DSN initially exhibits lower accuracy compared to FS and DSN during the first 11 epochs, which can be attributed to the high noise levels inherent in the FER-2013 dataset. Subsequently, the recognition accuracy of FS-DSN rapidly increases, demonstrating its superior learning capability and robustness. This pattern underscores FS-DSN's ability to effectively adapt and improve over time, even in challenging conditions characterized by significant noise.

Figure 7.
Accuracy curves for four datasets in uncontrolled environments: (a) On RAF-DB, (b) On FED-RO, (c) On SFEW 2.0, and (d) On FER-2013.
4.4 Comparison with the state-of-the-art methods

Method of segmentation	Datasets	Accuracy (%)
Two parts (eyes and mouth)	RAF-DB	86.52
FED-RO	71.02
SFEW 2.0	58.69
FER-2013	69.23
Three parts (left eye right eye and mouth)	RAF-DB	88 . 82
FED-RO	78.33
SFEW 2.0	60.09
FER-2013	74.17

Datasets	Kernel size	Accuracy (%)
RAF-DB	[9,9,9,9]	81.31
[7,7,7,7]	84.09
[5,5,5,5]	83.61
[3,3,3,3]	84.39
[9,7,5,3]	88.82
FED-RO	[9,9,9,9]	64.74
[7,7,7,7]	68.89
[5,5,5,5]	67.62
[3,3,3,3]	69.02
[9,7,5,3]	78.33
SFEW 2.0	[9,9,9,9]	50.04
[7,7,7,7]	55.65
[5,5,5,5]	56.68
[3,3,3,3]	56.94
[9,7,5,3]	60.09
FER-2013	[9,9,9,9]	66.59
[7,7,7,7]	66.78
[5,5,5,5]	69.51.
[3,3,3,3]	72.13
[9,7,5,3]	74.17

In this section, we compare the optimal experimental results obtained. The best experimental outcomes for FS-DSN are juxtaposed with the state-of-the-art (SOTA) results across the four datasets: RAF-DB, FED-RO, SFEW 2.0, and FER-2013. This comparison aims to highlight the performance of FS-DSN in relation to the current leading methodologies within the domain of FER, providing a comprehensive evaluation of its efficacy and advancements over existing approaches.

(1)
In comparison with the RAF-DB dataset, Table 3 presents the performance results of our network alongside current leading methods. As indicated in Table 3, FS-DSN achieved the highest recognition accuracy of 88.82%. Compared to the MA-Net approach, which divides the image into four equal regions, our proposed facial feature segmentation method resulted in a 0.42% improvement in accuracy on the RAF-DB dataset. Furthermore, the confusion matrix for the RAF-DB dataset, presented in Figure 8(a), shows that 14% of the “sad” samples were misclassified as “angry.” This misclassification can be attributed to the similarity between the facial expressions of sadness and disgust, both of which involve slight squinting of the eyes and contraction of the eyebrow area, potentially leading to confusion between the two emotions.
(2)
Comparison with FED-RO: Table 4 showcases the performance comparison results of our network on the FED-RO dataset against current leading methods, achieving the highest recognition accuracy (78.33%). Our proposed FS-DSN network made significant improvements over other networks on this dataset. Given that the FED-RO dataset consists purely of occluded facial expressions, changes in head posture have minimal impact on recognition. This underscores our network's enhanced performance in classifying expressions under facial occlusion conditions. The confusion matrix from the FED-RO dataset, as seen in Figure 8(b), indicates substantial errors in distinguishing between anger and disgust, with each category being confused for the other more than 20% of the time. This confusion likely arises from the similarity in facial features between the two expressions, as both anger and disgust often involve open mouths and furrowed brows, making them susceptible to misclassification.
(3)
In comparison with the SFEW 2.0 dataset, Table 5 presents the performance results of our network alongside current leading methods, with FS-DSN achieving the highest recognition accuracy of 60.09%. The image quality of the SFEW 2.0 dataset is generally low, and within the validation set, several samples from the “happy” and “surprised” categories suffer from poor image quality, including pure black images and zoomed-in sections of certain facial regions (e.g., head, eyes), leading to substantial noise interference. As shown in the confusion matrix in Figure 8(c), the “happy” and “surprised” expressions exhibit the lowest recognition accuracy, while “disgust” demonstrates the highest classification accuracy. This discrepancy may be attributed to the fact that the “disgust” images predominantly feature characters with tightly closed mouths and a skimming motion, a distinguishing feature that contributes to its relatively higher recognition accuracy.
(4)
Table 6 presents a performance comparison of our network with current leading methods on the FER-2013 dataset. As shown in Table 6, FS-DSN achieved the highest recognition accuracy of 74.17%. The FER-2013 dataset consists of grayscale images, which are of lower quality and contain some inherent noise. The confusion matrix in Figure 8(d) indicates that the “fear” and “surprise” categories exhibit the lowest classification accuracy, with a notable proportion of misclassifications between the two. This misclassification may stem from the similarity in facial features associated with both emotions, as images of fear and surprise in the FER-2013 dataset typically depict characters with their mouths wide open, eye muscles contracted, and pupils dilated, thereby increasing the likelihood of confusion. Despite the challenges posed by the grayscale nature of the dataset, our model's superior performance suggests its ability to effectively capture discriminative facial expression features and demonstrates strong robustness.

Figure 8.
Confusion matrix of FS-DSN under four uncontrolled environment datasets: (a) On RAF-DB, (b) On FED-RO, (c) On SFEW 2.0, and (d) On FER-2013.

Table 3.
Comparison with SOTA on RAF-DB.

Method Backbone Year Accuracy (%)

ACNN (Li et al., 2019c) VGG16 2020 85.07

LBAN-IL (Li et al., 2021) ResNet-18 2021 85.89

SPWFA-SE (Li et al., 2023) ResNet-50 2020 86.31

RAN (Wang et al., 2020) ResNet-18 2020 86.90

DACL (Farzaneh and Qi, 2021) ResNet-18 2021 87.78

FERGCN (Liao et al., 2022) ResNet-18 2022 88.23

MA-Net (Zhao et al., 2021) ResNet-18 2021 88.40

FS-DSN (Ours) ResNet-18 2023 88.82

The bold values represent the results of our method, highlighting that it outperforms the other methods compared.

Table 4.
Comparison with SOTA on FED-RO.

Method Backbone Year Accuracy (%)

ACNN (Li et al., 2019c) VGG-16 2019 66.50

IDFL (Li et al., 2022) ResNet-50 2021 67.00

SPWFA-SE (Li et al., 2023) ResNet-50 2020 67.25

OADN (Ding et al., 2020) Manually 2020 68.11

LAENet-SA (Wang et al., 2022) ResNet-18 2022 68.25

MA-Net (Zhao et al., 2021) ResNet-18 2021 70.00

AMP-Net (Liu et al., 2022b) ResNet-34 2022 71.75

FS-DSN (Ours) ResNet-18 2023 78.33

The bold values represent the results of our method, highlighting that it outperforms the other methods compared.

Table 5.
Comparison with SOTA on SFEW 2.0.

Method Backbone Year Accuracy (%)

RAN (Wang et al., 2020) ResNet-18 2020 54.19

LBAN-IL (Li et al., 2021) ResNet-18 2021 55.28

FERGCN (Liao et al., 2022) ResNet-18 2022 56.15

RAN (Wang et al., 2020) VGG16 + ResNet-18 2020 56.40

DUME (She et al., 2021) ResNet-50 2021 58.30

AHA (Weng et al., 2021) ResNet-18 2021 58.89

MA-Net (Zhao et al., 2021) ResNet-18 2021 59.40

FS-DSN (Ours) ResNet-18 2023 60.09

The bold values represent the results of our method, highlighting that it outperforms the other methods compared.

Table 6.
Comparison with SOTA on FER-2013.

Method Backbone Year Accuracy (%)

E-FCNN (Shao and Cheng, 2021) Manually 2021 66.17

BLOCK-FRENET (Tang et al., 2021) Manually 2021 64.41

Nadir Kamel Benamara et al. (2021) Manually 2020 72.72

PASM (Liu et al., 2022a) VGG-16 2021 72.73

LBAN-IL (Li et al., 2021) Manually 2021 73.11

PASM (Liu et al., 2022b) ResNet-34 2021 73.59

AHA (Weng et al., 2021) ResNet-18 2021 73.84

FS-DSN (Ours) ResNet-18 2023 74.17

The bold values represent the results of our method, highlighting that it outperforms the other methods compared.

Table 7.
Precision, recall, and F1 on RAF-DB dataset.

Expression Category Precision Recall F1

Angry 0.80 0.91 0.85

Disgust 0.70 0.60 0.65

Fear 0.73 0.84 0.78

Happiness 0.75 0.89 0.82

Neutral 0.84 0.83 0.83

Sadness 0.75 0.57 0.65

Surprise 0.91 0.85 0.88

Table 8.
Precision, recall, and F1 on FED-RO dataset.

Expression category Precision Recall F1

Angry 0.56 0.55 0.56

Disgust 0.73 0.59 0.66

Fear 0.77 0.82 0.80

Happiness 0.81 0.83 0.82

Neutral 0.72 0.79 0.76

Sadness 0.78 0.90 0.84

Surprise 0.88 0.77 0.82

Table 9
Precision, recall, and F1 on SFEW 2.0 dataset.

Expression category Precision Recall F1

Angry 0.52 0.69 0.59

Disgust 0.57 0.70 0.63

Fear 0.71 0.53 0.60

Happiness 0.23 0.21 0.22

Neutral 0.56 0.60 0.58

Sadness 0.53 0.61 0.57

Surprise 0.25 0.14 0.18

Table 10.
Precision, recall, and F1 on FER-2013 dataset.

Expression category Precision Recall F1

Angry 0.82 0.60 0.69

Disgust 0.71 0.67 0.69

Fear 0.54 0.57 0.56

Happiness 0.65 0.85 0.74

Neutral 0.76 0.77 0.77

Sadness 0.68 0.82 0.74

Surprise 0.61 0.46 0.51

4.5 Category-wise performance analysis

Method	Backbone	Year	Accuracy (%)
ACNN (Li et al., 2019c)	VGG16	2020	85.07
LBAN-IL (Li et al., 2021)	ResNet-18	2021	85.89
SPWFA-SE (Li et al., 2023)	ResNet-50	2020	86.31
RAN (Wang et al., 2020)	ResNet-18	2020	86.90
DACL (Farzaneh and Qi, 2021)	ResNet-18	2021	87.78
FERGCN (Liao et al., 2022)	ResNet-18	2022	88.23
MA-Net (Zhao et al., 2021)	ResNet-18	2021	88.40
FS-DSN (Ours)	ResNet-18	2023	88.82

Method	Backbone	Year	Accuracy (%)
ACNN (Li et al., 2019c)	VGG-16	2019	66.50
IDFL (Li et al., 2022)	ResNet-50	2021	67.00
SPWFA-SE (Li et al., 2023)	ResNet-50	2020	67.25
OADN (Ding et al., 2020)	Manually	2020	68.11
LAENet-SA (Wang et al., 2022)	ResNet-18	2022	68.25
MA-Net (Zhao et al., 2021)	ResNet-18	2021	70.00
AMP-Net (Liu et al., 2022b)	ResNet-34	2022	71.75
FS-DSN (Ours)	ResNet-18	2023	78.33

Method	Backbone	Year	Accuracy (%)
RAN (Wang et al., 2020)	ResNet-18	2020	54.19
LBAN-IL (Li et al., 2021)	ResNet-18	2021	55.28
FERGCN (Liao et al., 2022)	ResNet-18	2022	56.15
RAN (Wang et al., 2020)	VGG16 + ResNet-18	2020	56.40
DUME (She et al., 2021)	ResNet-50	2021	58.30
AHA (Weng et al., 2021)	ResNet-18	2021	58.89
MA-Net (Zhao et al., 2021)	ResNet-18	2021	59.40
FS-DSN (Ours)	ResNet-18	2023	60.09

Method	Backbone	Year	Accuracy (%)
E-FCNN (Shao and Cheng, 2021)	Manually	2021	66.17
BLOCK-FRENET (Tang et al., 2021)	Manually	2021	64.41
Nadir Kamel Benamara et al. (2021)	Manually	2020	72.72
PASM (Liu et al., 2022a)	VGG-16	2021	72.73
LBAN-IL (Li et al., 2021)	Manually	2021	73.11
PASM (Liu et al., 2022b)	ResNet-34	2021	73.59
AHA (Weng et al., 2021)	ResNet-18	2021	73.84
FS-DSN (Ours)	ResNet-18	2023	74.17

Expression Category	Precision	Recall	F1
Angry	0.80	0.91	0.85
Disgust	0.70	0.60	0.65
Fear	0.73	0.84	0.78
Happiness	0.75	0.89	0.82
Neutral	0.84	0.83	0.83
Sadness	0.75	0.57	0.65
Surprise	0.91	0.85	0.88

Expression category	Precision	Recall	F1
Angry	0.56	0.55	0.56
Disgust	0.73	0.59	0.66
Fear	0.77	0.82	0.80
Happiness	0.81	0.83	0.82
Neutral	0.72	0.79	0.76
Sadness	0.78	0.90	0.84
Surprise	0.88	0.77	0.82

Expression category	Precision	Recall	F1
Angry	0.52	0.69	0.59
Disgust	0.57	0.70	0.63
Fear	0.71	0.53	0.60
Happiness	0.23	0.21	0.22
Neutral	0.56	0.60	0.58
Sadness	0.53	0.61	0.57
Surprise	0.25	0.14	0.18

Expression category	Precision	Recall	F1
Angry	0.82	0.60	0.69
Disgust	0.71	0.67	0.69
Fear	0.54	0.57	0.56
Happiness	0.65	0.85	0.74
Neutral	0.76	0.77	0.77
Sadness	0.68	0.82	0.74
Surprise	0.61	0.46	0.51

In this section, we evaluate the performance of the model using precision, recall, and F1 scores across four different datasets: RAF-DB (see Table 7), FED-RO (see Table 8), SFEW 2.0 (see Table 9), and FER-2013 (see Table 10). These metrics provide a comprehensive view of how well the model identifies each expression category in the datasets, particularly in the presence of class imbalance.

In the RAF-DB dataset, the model performs well overall, particularly for the “Surprise” category, which achieves high precision (91%) and F1 score (88%). However, the “Sadness” category shows lower recall (57%), indicating that the model struggles to identify sadness effectively. Despite this, the model demonstrates stable performance across most categories, especially “Anger” (Precision: 80%) and “Happiness” (Recall: 89%), which both show strong F1 scores (85% and 82%, respectively).

In the FED-RO dataset, the model exhibits performance variation across categories. The “Sadness” category performs particularly well with an F1 score of 0.84, suggesting the model effectively recognizes sadness. On the other hand, the “Anger” category shows lower performance with both precision and recall below 60%, indicating that the model struggles to detect anger. Overall, “Happiness” shows a good balance, with an F1 score of 0.82.

The SFEW 2.0 dataset performs relatively poorly overall, especially in the “happy” (precision: 23%, recall: 21%) and “surprised” (F1: 0.18) categories. These poor results can be attributed to the poor quality of sample images in these categories, including pure black images. However, the “fear” category performs relatively well, with a precision of 71% and an F1 score of 0.60.

In the FER-2013 dataset, the model shows balanced performance, particularly in “Happiness” (F1: 0.74) and “Sadness” (F1: 0.74). However, the “Surprise” category has a low precision and recall, resulting in a low F1 score of 0.51. The “Angry” and “Disgust” categories perform well with precision scores of 0.82 and 0.71, respectively, but there is room for improvement in recall.

5. Conclusion

This paper introduces a dual-stream network based on facial feature segmentation. We designed branches for global and local feature extraction, fully leveraging the differences in the sizes and focal points of global and local feature maps. This approach enables the network to more specifically extract features from different perceptual domains, thereby ensuring the maximal capture of useful features while suppressing irrelevant noise interference, enhancing the network's robustness and adaptability. A series of ablation experiments and comparative tests have demonstrated that our proposed method exhibits exceptional performance on facial expression datasets in uncontrolled environments, particularly in cases of facial occlusion. Nevertheless, the method still faces the problem of performance degradation under extreme lighting conditions or when the face is heavily occluded. In addition, the current model is relatively heavy for deployment in real-time or mobile scenarios and relies only on visual information, which may be insufficient in the case of high blur or occlusion. In future work, we will explore the integration of illumination-invariant features, study lightweight network architectures for efficient deployment, and incorporate multimodal signals such as audio or physiological cues to further enhance its robustness in complex real-world environments.

Footnotes

Acknowledgements

This study was supported by the National Key Laboratory Fund SKLK22-11 and the Key Research and Development Project of Shaanxi Provincial Department of Science and Technology Project K20220022.

ORCID iD

Daipeng Guo

Author contributions

Daipeng Guo contributed to resources, methodology, and writing-original draft. Fei Xu contributed to writing-review and editing, investigation, funding acquisition, and conceptualization. Jing Mu contributed to visualization and formal analysis.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

Data will be made available on request.

References

Aquib

Verma

Akhtar

(2024) Enhancing facial expression recognition by integrating global dependencies with modified non-local convolutional neural networks. In: 2024 IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), pp.1–6.

Benamara

Val-Calvo

Álvarez-Sánchez

, et al. (2021) Real-time facial expression recognition using smoothed deep neural network ensemble. Integrated Computer-Aided Engineering 28: 97–111.

Dalal

Triggs

(2005) Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 881, pp. 886–893.

Darwin

(2009) The Expression of the Emotions in Man and Animals. Cambridge: Cambridge University Press.

Deng

Guo

Ververas

, et al. (2020) Retinaface: Single-shot multi-level face localisation in the wild. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp .5202–5211.

Dhall

Goecke

Lucey

, et al. (2012) Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia 19: 34–41.

Dillague

JDO

Juico

JHA

NSL

, et al. (2024) Detection of facial expressions based on three feature points using image processing with artificial neural networks. In: 2024 5th International Conference on Industrial Engineering and Artificial Intelligence (IEAI), pp.29–33.

Ding

Zhou

Chellappa

(2020) Occlusion-Adaptive deep network for robust facial expression recognition. In: 2020 IEEE International Joint Conference on Biometrics (IJCB), pp.1–9.

Farzaneh

(2021) Facial expression recognition in the wild via deep attentive center loss. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp.2401–2410.

10.

Goodfellow

Erhan

Carrier

, et al. (2013) Challenges in representation learning: A report on three machine learning contests. In: Neural Information Processing. Berlin, Heidelberg: Springer Berlin, pp. 117–124.

11.

Guo

, et al. (2025) Research on facial expression recognition based on wide attention and multi-scale fusion mechanism. Journal of Ambient Intelligence and Smart Environments: 18761364241296439.

12.

Zhang

Ren

, et al. (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770–778.

13.

Hou

Zhou

Feng

(2021) Coordinate attention for efficient Mobile network design. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.13708–13717.

14.

Irani

Nasrollahi

Simon

, et al. (2015) Spatiotemporal analysis of RGB-D-T facial images for multimodal pain level recognition. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.88–95.

15.

Islam

Mahmud

Hossain

(2018) High performance facial expression recognition system using facial region segmentation, fusion of HOG & LBP features and multiclass SVM. In: 2018 10th International Conference on Electrical and Computer Engineering (ICECE), pp.42–45.

16.

Jeong

(2018) Driver’s facial expression recognition in real-time for safe driving. Sensors 18(12): 4270. doi: https://doi.org/10.3390/s18124270.

17.

Mehta

Aneja

, et al. (2019a) A facial affect analysis system for autism Spectrum disorder. In: 2019 IEEE International Conference on Image Processing (ICIP), pp.4549–4553.

18.

Wang

, et al. (2021) LBAN-IL: A novel method of high discriminative representation for facial expression recognition. Neurocomputing 432: 159–169.

19.

Jin

Akram

, et al. (2020) Facial expression recognition with convolutional neural networks via a new face cropping and rotation strategy. The Visual Computer 36: 391–404.

20.

Liu

Gong

, et al. (2019b) INDReview on facial expression analysis and its application in education. In: 2019 Chinese Automation Congress (CAC), pp.4526–4530.

21.

Deng

(2017) Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2584–2593.

22.

, et al. (2023) Facial expression recognition in the wild using multi-level features and attention mechanisms. IEEE Transactions on Affective Computing 14: 451–462.

23.

Chen

, et al. (2022) Learning informative and discriminative features for facial expression recognition in the wild. IEEE Transactions on Circuits and Systems for Video Technology 32: 3178–3189.

24.

Zeng

Shan

, et al. (2019c) Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Transactions on Image Processing 28: 2439–2450.

25.

Liao

Zhu

Zheng

, et al. (2022) FERGCN: Facial expression recognition based on graph convolution network. Machine Vision and Applications 33: 40.

26.

Liu

Cai

Lin

, et al. (2022a) Adaptive multilayer perceptual attention network for facial expression recognition. IEEE Transactions on Circuits and Systems for Video Technology 32: 6253–6266.

27.

Liu

Lin

Meng

, et al. (2022b) Point adversarial self-mining: A simple method for facial expression recognition. IEEE Transactions on Cybernetics 52: 12649–12660.

28.

Liu

Zhang

Zhou

, et al. (2021) SG-DSN: A semantic graph-based dual-stream network for facial expression recognition. Neurocomputing 462: 320–330.

29.

Luan

Chen

Zhang

, et al. (2018) Gabor convolutional networks. IEEE Transactions on Image Processing 27: 4357–4366.

30.

Lucey

Cohn

Kanade

, et al. (2010) The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pp.94–101.

31.

Lyons

Akamatsu

Kamachi

, et al. (1998) Coding facial expressions with Gabor wavelets. In: Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pp.200–205.

32.

Mehrabian

Ferris

(1967) Inference of attitudes from nonverbal communication in two channels. Journal of Consulting Psychology 31: 248–252.

33.

Mohseni

Zarei

Ramazani

(2014) Facial expression recognition using anatomy based facial graph. In: 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp.3715–3719.

34.

Mollahosseini

Hasani

Mahoor

(2019) Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10: 18–31.

35.

Qian

Tian

, et al. (2023) Facial expression recognition based on strong attention mechanism and residual network. Multimedia Tools and Applications 82: 14287–14306.

36.

Rathod

Vanazara

Pandya

(2023) Improved group facial expression recognition using super-resolved local facial multi scale features. In: 2023 11th International Conference on Intelligent Systems and Embedded Design (ISED), pp.1–6.

37.

Roberson

Kikutani

Döge

, et al. (2012) Shades of emotion: What the addition of sunglasses or masks to faces reveals about the development of facial expression processing. Cognition 125: 195–206.

38.

Shabbir

Rout

(2023) FgbCNN: A unified bilinear architecture for learning a fine-grained feature representation in facial expression recognition. Image and Vision Computing 137: 104770.

39.

Shahid

Yan

(2023) Squeezexpnet: Dual-stage convolutional neural network for accurate facial expression recognition with attention mechanism. Knowledge-Based Systems 269: 110451.

40.

Shao

Cheng

(2021) E-FCNN for tiny facial expression recognition. Applied Intelligence 51: 549–559.

41.

She

Shi

, et al. (2021) Dive into ambiguity: Latent distribution mining and pairwise uncertainty estimation for facial expression recognition. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.6244–6253.

42.

Singh

Ramanujam

(2025) M, MMSAD—A multi-modal student attentiveness detection in smart education using facial features and landmarks. Journal of Ambient Intelligence and Smart Environments: 18761364251315239.

43.

Sripian

Anuardi

MNAM

Ito

, et al. (2021) Emotion-sensitive voice-casting care robot in rehabilitation using real-time sensing and analysis of biometric information. Journal of Ambient Intelligence and Smart Environments 13: 413–431.

44.

Starostenko

Cortés

Sánchez

, et al. (2015) Unobtrusive emotion sensing and interpretation in smart environment. Journal of Ambient Intelligence and Smart Environments 7: 59–83.

45.

Takalkar

Thuseethan

Rajasegarar

, et al. (2021) LGAttNet: Automatic micro-expression detection using dual-stream local and global attentions. Knowledge-Based Systems 212: 106566.

46.

Tang

Zhang

, et al. (2021) Facial expression recognition using frequency neural network. IEEE Transactions on Image Processing 30: 444–457.

47.

Tiong

LCO

Kim

(2020) Multimodal facial biometrics recognition: Dual-stream convolutional neural networks with multi-feature fusion layers. Image and Vision Computing 102: 103977.

48.

Ullah

Hasan

, et al. (2022) Improved deep CNN-based two stream super resolution and hybrid deep model-based facial emotion recognition. Engineering Applications of Artificial Intelligence 116: 105486.

49.

Wang

Xue

, et al. (2022) Light attention embedding for facial expression recognition. IEEE Transactions on Circuits and Systems for Video Technology 32: 1834–1847.

50.

Wang

Peng

Yang

, et al. (2020) Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing 29: 4057–4069.

51.

Wang

Zhu

Chen

, et al. (2019) Perceptual learning and recognition confusion reveal the underlying relationships among the six basic emotions. Cognition and Emotion 33: 754–767.

52.

Weng

Yang

Tan

, et al. (2021) Attentive hybrid feature with two-step fusion for facial expression recognition. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp.6410–6416.

53.

Widen

Christy

Hewett

, et al. (2011) Do proposed facial expressions of contempt, shame, embarrassment, and compassion communicate the predicted emotion? Cognition and Emotion 25: 898–906.

54.

Woo

Park

Lee

J-Y

, et al. (2018) CBAM: Convolutional block attention module. In: Ferrari

Hebert

Sminchisescu

Weiss

(eds) Computer vision – ECCV 2018. Cham: Springer International Publishing, pp. 3–19.

55.

Xie

Tian

, et al. (2022) Triplet loss with multistage outlier suppression and class-pair margins for facial expression recognition. IEEE Transactions on Circuits and Systems for Video Technology 32: 690–703.

56.

Zhao

Liu

Wang

(2021) Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Transactions on Image Processing 30: 6544–6556.

57.

Zhou

Khosla

Lapedriza

, et al. (2016) Learning deep features for discriminative localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2921–2929.