Cross feature enhanced graph convolutional network for aspect-based sentiment analysis

Abstract

Traditional graph convolutional neural networks (GCN) utilizing linear feature combination methods have limited capacity to capture the interaction between complex features. While current research has extensively investigated various syntactic dependency tree structures, the optimization of GCN algorithms has often been overlooked, leading to suboptimal efficiency in practical applications. To address this issue, this paper proposes a cross-feature method that utilizes feature vector multiplication to construct non-linear combinations of GCN features and enhance the model’s capability to extract complex feature correlations. Experimental results demonstrate the superiority of the proposed method, with our models outperforming state-of-the-art methods and achieving significant improvements on three standard benchmark datasets. These results suggest that the cross-feature method can effectively extract potential connections between features, highlighting its potential for improving the performance of GCN-based models in real-world applications.

Keywords

Aspect-based sentiment analysis, syntactic dependency tree, graph convolutional neural networks, cross-feature

1 Introduction

Aspect-based Sentiment Analysis (ABSA) is a fine-grained sentiment classification task that usually consists of two sub-tasks: aspect word extraction (AE) and aspect-level sentiment classification (ALSC). This paper focuses only on ALSC, whose research object is a specific word or word group in the sentence, known as the aspect term. For instance, given the sentence "The food was well prepared, but the service is abysmal", the aspect terms are "food" and "service". The task is to classify the sentiment polarity of each aspect term as "negative", "neutral", or "positive". ALSC has been widely used in e-commerce word-of-mouth analysis, public opinion monitoring, content marketing, and other scenarios, where it provides useful reference information and creates high commercial value.

ALSC focuses on analyzing the semantic context of the aspect terms, so effectively capturing the sentiment information related to the aspect terms is key to the task. Earlier work used Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU) to extract context sequence information from the sentence [1–5]. These models use gating units to control the flow of information and achieve remarkable results with the assistance of attention mechanisms [6–8]. However, sequence models can only model contextual details and cannot extract syntactic relations between aspect terms and context words. Therefore, Graph Convolutional Networks (GCN) [9–11] built on syntactic dependency trees were introduced to model the syntactic information between words. For instance, Sun et al. [12] combine the LSTM model with the GCN model to jointly learn a sentence’s context and syntactic information. Zhang et al. [13] use syntactic trees to generate aspect term representations and design gate mechanisms to constrain context representation. Lu et al. [14] devise attention mechanisms to obtain the correlation between the syntactic information of a sentence and the aspect terms.

Further, a range of models based on GCN variants primarily endeavor to exploit the rich structural feature of the syntactic tree. Velickovic et al. [15] proposed graph attention networks (GAT) to compute the coupling factors between nodes. Wang et al. [16] consider connected edges in syntactic trees to enrich node features. Phan et al. [17] consider the syntactic distance between nodes and distinguish the degree of contribution between other nodes in the tree to the current node. Chen et al. [18] combine brain cognitive principles with GCN networks to extract semantic information based on brain understanding. Alternatively, some researchers argue that the syntactic trees generated by syntactic tools such as Stanford [19] or Spacy [20] do not cover all the nodes. They alter the original syntactic trees and construct new ways of connecting the nodes. For instance, Zhang et al. [21] used word co-occurrence probabilities to create word co-occurrence matrices to compensate for the incomplete structure of the syntactic tree. Chen et al. [22] use various attention mechanisms to construct potential graph structures that enrich the information representation of the syntactic tree.

In summary, almost all existing research on syntactic tree-based GCN models either focuses on utilizing the structural properties of syntactic trees, such as node information, edge information, and modifier attributes, or addressing the structural deficiencies of syntactic trees, such as constructing a global node information matrix. Although these methods have achieved promising experimental results, they all overlook a fundamental problem, which is the limitation of the GCN model algorithm itself, i.e., the linear feature combination cannot fully extract the correlations between features, thereby limiting the performance of GCN models in practical applications.

The multi-layer convolutional structure of the GCN is similar to a multi-layer perceptron (MLP) [23], which uses hidden layers to transform the inputs into a high-dimensional space. Nevertheless, this linear feature combination approach, a weighted summation between elements, is a weak representation of feature interaction and cannot express the underlying connections between features. Consider a sentence with ambiguous sentiment: "How can hope to stay in business with service like this", where the aspect term is "service" and the corresponding sentiment polarity is "negative". The attention mechanism and the GCN model give too much attention to "stay" or "hope", which often carry positive sentiment connotations, leading to incorrect classification results. This paper argues that aspect-level sentiment classification should rely on co-occurrence relationships between words: in the absence of indicative sentiment words, multiple words rather than individual words jointly determine the sentiment tendency of the aspect term. Notably, the GCN’s algorithmic structure is insufficient to efficiently combine multiple features, which limits the model’s generalization ability.

This paper designs a novel graph convolutional neural network based on cross-features [24–27] to solve this problem. Firstly, the model uses LSTM to learn the context information of sentences. Secondly, to ensure that the GCN model can effectively identify the co-occurrence environment of words, this paper modifies the feature transformation algorithm of the GCN. It employs a non-linear feature combination strategy based on the feature product instead of a linear feature combination method based on feature-weighted summation, which improves the ability of the GCN to model feature interactions. The contributions of this paper are as follows:

•This paper considers the feature transformation structure of the GCN as equivalent to a multilayer perceptron and shows the detailed reasoning process in the second section of the article. This method of learning feature interactions through weighted summation is less effective than the product of features in extracting correlations between aspect terms and information about their surroundings.

•This paper introduces the idea of cross-feature modeling to the aspect-level sentiment analysis task and designs a novel graph convolutional neural network model based on cross-features. The model takes advantage of the GCN’s iterative updating of node information and replaces the original linear feature combination with a non-linear feature combination based on the product of features, enhancing the GCN model’s ability to extract potential connections between components.

•Experimental results on multiple publicly available datasets show that the accuracy (Acc) and F1 scores of the model improve across datasets, which further supports the effectiveness of the cross-feature.

2 Graph convolutional neural networks with multi-layer perceptron

2.1 Graph convolutional neural networks

The traditional GCN uses the syntactic dependency tree to capture the information of the nodes adjacent to the target node and obtains the embedded representation of the target node through a summation operation. We take the initial layer as an example to introduce the convolution process of GCN in detail. The formulation is: $X_{i}^{'} = c_{i} (\sum_{j = 1}^{n} A_{ij} X_{j})$ (1) where $c_{i}$ is a normalization coefficient determined by the degree of the target node, and $X_{i}^{'}$ and $X_{j}$ represent the updated embedding representation of the target node and the original embedding representation of other nodes, respectively. $A \in R^{n \times n}$ is the adjacency matrix: $A_{ij} = 1$ when there is a connection between nodes i and j, and $A_{ij} = 0$ otherwise.

Then, the GCN utilizes a feedforward neural network to transform the updated target node embedding representation into a higher-level nonlinear feature representation. The formulation is: $h_{i} = σ (W_{i} X_{i}^{'} + b_{i})$ (2) where $W_{i}$ is a parameter matrix, $b_{i}$ is a bias term, σ is a non-linear activation function, and $h_{i}$ is the hidden layer output.

Equation (1) represents the aggregation of adjacent node information by the GCN, which enriches the embedding representation of the target node by collecting adjacent node features. Equation (2) represents the feature transformation of the GCN, which uses a weighted summation to map the target node to a higher-level feature space. Together, Equation (1) and Equation (2) represent the complete single-layer convolution operation of the GCN.
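As a minimal sketch of Equations (1) and (2), the following pure-Python function performs one convolution step on a toy graph. It assumes that c_i normalizes by the inverse node degree, that the adjacency matrix includes self-loops, that ReLU serves as σ, and that one W and b are shared by all nodes. None of these choices is fixed by the text, and the name `gcn_layer` is hypothetical.

```python
def gcn_layer(A, X, W, b):
    """One GCN step: neighbour aggregation (Eq. 1), then dense transform with ReLU (Eq. 2)."""
    n, dim = len(A), len(X[0])
    hidden = []
    for i in range(n):
        degree = sum(A[i])
        c_i = 1.0 / degree if degree else 0.0  # assumption: c_i = 1 / degree
        # Eq. (1): normalised sum of the embeddings of connected nodes
        agg = [c_i * sum(A[i][j] * X[j][d] for j in range(n)) for d in range(dim)]
        # Eq. (2): weighted summation plus bias, followed by ReLU (assumed sigma)
        hidden.append([max(0.0, sum(w * a for w, a in zip(row, agg)) + b_e)
                       for row, b_e in zip(W, b)])
    return hidden

# Toy chain graph 0-1-2 with self-loops; identity weights keep the aggregation visible.
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
H = gcn_layer(A, X, W, b)
print(H[0], H[2])
```

With identity weights the output of each node is simply the average of its neighbourhood, which makes the purely additive nature of the transformation easy to see.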

To mine deeper node information, a GCN often uses multiple convolutional layers, which makes the feature transformation process of the GCN evolve from a single-layer neural network into a multi-layer perceptron. The formulation is: $Y = F_{l} (F_{l - 1} (\dots F_{1} (X)))$ (3) $F_{l} (X) = σ (W_{l} X + b_{l})$ (4) where $W_{l}$ is a parameter matrix, $b_{l}$ is a bias term, and $Y$ represents the final hidden state vector after multiple iterations.

Fig. 1

Multi-layer perceptron construction.

2.2 Multi-Layer perceptron

As shown in Figure 1, the multi-layer perceptron consists of an input, a dense layer, an activation layer, and an output. A dense layer, also known as a fully connected or linear layer, uses linear functions to fit the relationship between features. However, linear functions cannot solve the XOR problem between features, which limits the expressiveness of linear models. Therefore, the activation layer applies a nonlinear activation function to convert the output of the dense layer into a nonlinear feature representation, which increases the expressiveness of the model. With enough neurons, a network with a single hidden layer can approximate any continuous function, but widening the layer also increases the risk of the curse of dimensionality and of overfitting. An MLP instead increases the complexity of the model by increasing the number of layers. Compared with adding more neurons to a single-hidden-layer network, an MLP needs fewer parameters and has stronger expressive capability.
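The XOR limitation mentioned above can be illustrated with a small sketch. It searches a coarse grid of weights for a single thresholded linear unit that reproduces XOR and finds none, and then shows that adding the product x1·x2 as an extra input makes XOR representable by the same kind of unit. The grid, the step activation, and all function names are illustrative assumptions.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z):
    """Threshold activation used only for this illustration."""
    return 1 if z > 0 else 0

def fits_xor(weights, bias, features):
    """True if step(w . features(x) + bias) matches XOR on all four inputs."""
    return all(
        step(sum(w * f for w, f in zip(weights, features(x1, x2))) + bias) == y
        for (x1, x2), y in XOR.items()
    )

def linear(x1, x2):
    return (x1, x2)

def crossed(x1, x2):
    return (x1, x2, x1 * x2)  # extra second-order cross feature

# Coarse grid from -3.0 to 3.0 in steps of 0.5 for both weights and the bias.
grid = [v / 2 for v in range(-6, 7)]
linear_solutions = [
    (w, b) for w in product(grid, repeat=2) for b in grid if fits_xor(w, b, linear)
]
crossed_ok = fits_xor((1.0, 1.0, -2.0), -0.5, crossed)

print(len(linear_solutions), crossed_ok)
```

The empty search result is consistent with XOR not being linearly separable, while a single product term is enough to make it so.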

However, an MLP is a neural network based on additive operations. Given an input $X \in R^{n}$, a parameter matrix $W \in R^{m \times n}$, and a bias term $b \in R^{m}$, the formulation is: $\begin{aligned} F (X) & = σ (W X + b) \\ & = σ \left( \begin{bmatrix} ω_{11} & \dots & ω_{1 n} \\ ω_{21} & \dots & ω_{2 n} \\ \dots & \dots & \dots \\ ω_{m 1} & \dots & ω_{mn} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \dots \\ x_{n} \end{bmatrix} + \begin{bmatrix} b_{1} \\ b_{2} \\ \dots \\ b_{m} \end{bmatrix} \right) \\ & = σ \left( \begin{bmatrix} ω_{11} x_{1} + \dots + ω_{1 n} x_{n} + b_{1} \\ ω_{21} x_{1} + \dots + ω_{2 n} x_{n} + b_{2} \\ \dots \\ ω_{m 1} x_{1} + \dots + ω_{mn} x_{n} + b_{m} \end{bmatrix} \right) \end{aligned}$ (5)

For example, the pre-activation output of the m-th neuron is ω_m1x₁ + ⋯ + ω_mnx_n + b_m. The weighted summation of the input features yields a linear feature combination, and the activation function is applied to it to obtain the hidden layer output, which is the input of the next layer. The interaction between features is thus achieved only by iterating layer by layer. However, weighted summation is not the most efficient way for features to interact. Moreover, Qu et al. [24] mentioned that the relationship between features is more of an "and" relationship than an "add" relationship. For instance, given the two descriptions "Food videos over 10 minutes in length" and "Videos longer than 10 minutes and food videos", the former feature combination better reflects the close relationship between the features.

3 Cross feature

Mathematically, a cross-feature is the product of two or more features. The multiplication can be read as a logical "∩" operation, i.e., the product of a series of conditions that take effect together. For example, suppose the Youth Internet Health Management Mechanism predicts whether a user is a teenager, on the premise that teenagers are more active in game apps during weekends and holidays. If the time feature "Saturday", the app category "Game", and the behavior feature "Purchase Skin" occur together for one user, the user is highly likely to be a teenager. By analogy, a k-order cross-feature is calculated as follows:

$X_{i_{1}, i_{2}, \dots, i_{k}}^{k} = \prod_{i \in \{i_{1}, i_{2}, \dots, i_{k}\}} x_{i}$ (6)

where $x_{i}$ is the i-th dimensional feature of the input. However, the feature dimension in real-world scenarios is large, and constructing cross features manually is expensive. Therefore, constructing cross features automatically by means of vector products is the key to reducing manual effort. Given a feature vector $X \in R^{n}$ of dimension n, the k-order cross of the n features is calculated as follows: $X^{k} = \underbrace{X \otimes X \otimes \dots \otimes X}_{k\ \text{copies}} \in R^{n \times n \times \dots \times n}$ (7) where ⊗ represents the tensor outer product. $X^{k}$ contains $n^{k}$ elements, so it can be flattened into a vector in $R^{n^{k}}$. The obtained k-order cross features are substituted into the GCN feature transformation algorithm to obtain the hidden layer output $H^{k} \in R^{E}$, where E represents the dimensionality of the hidden layer output. The formulation is:

$H^{k} = σ (W^{k} X^{k} + b^{k})$ (8) where $W^{k} \in R^{E \times n^{k}}$ is a parameter matrix, b^k is a bias term.
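The construction in Equations (7) and (8) can be sketched in a few lines of pure Python. The sketch flattens the k-fold outer product into a vector of length n^k and passes it through a dense layer. Using the logistic sigmoid as σ is an assumption, as are the helper names `cross_feature` and `dense_sigmoid`.

```python
import math
from itertools import product

def cross_feature(x, k):
    """Flattened k-fold outer product of x with itself (Eq. 7); has len(x) ** k entries."""
    return [math.prod(combo) for combo in product(x, repeat=k)]

def dense_sigmoid(W, v, b):
    """Eq. (8) with the logistic sigmoid standing in for sigma (an assumption)."""
    return [1.0 / (1.0 + math.exp(-(sum(w * t for w, t in zip(row, v)) + b_e)))
            for row, b_e in zip(W, b)]

x = [1.0, 2.0, 3.0]
x2 = cross_feature(x, 2)   # 3 ** 2 = 9 second-order terms x_i * x_j
x3 = cross_feature(x, 3)   # 3 ** 3 = 27 third-order terms

# A hidden layer with E = 2 outputs needs E * n**k = 18 weights for k = 2.
W2 = [[0.0] * len(x2) for _ in range(2)]
h2 = dense_sigmoid(W2, x2, [0.0, 0.0])
print(x2, len(x3), h2)
```

The length of the crossed vector grows as n^k, which is the dimensional growth that the low-rank treatment later in this section is meant to avoid.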

By expanding formula (8), the e-th dimension of the hidden layer output is $H_{e}^{k} \in R$, and the formula is: $H_{e}^{k} = σ (sum (W_{e}^{k} \circ X^{k}) + b_{e}^{k})$ (9) where $W_{e}^{k} \in R^{n^{k}}$ is the parameter tensor associated with the e-th output dimension, $b_{e}^{k} \in R$ is a bias term, ∘ represents the Hadamard product, and sum(·) adds up all elements.

However, cross-features generate new high-dimensional features. In formula (9), the time complexity of computing the e-th dimension of the hidden layer output is O (n^k) and the parameter complexity of the whole layer is O (En^k), so reducing the number of parameters is key to improving the performance of the model.

Therefore, this paper adopts the method of Feng et al. [27] to reduce the time complexity and the number of parameters. For ease of understanding, we use the second-order cross-feature as an example of the low-rank approximation of Eq. (9), omitting the bias term and the activation function for brevity. Given an input $X^{2} \in R^{n \times n}$, $H_{e}^{2}$ is the e-th dimensional feature of the hidden state vector. The specific derivation is shown in equation (10): $\begin{aligned} H_{e}^{2} & = sum (W_{e}^{2} \circ X^{2}) \\ & = sum ((ω ϖ^{T}) \circ (X X^{T})) \\ & = sum ((ω \circ X) (ϖ \circ X)^{T}) \\ & = \sum_{i = 1}^{n} \sum_{j = 1}^{n} (ω_{i} x_{i} * ϖ_{j} x_{j}) \\ & = \sum_{i = 1}^{n} (ω_{i} x_{i}) * \sum_{j = 1}^{n} (ϖ_{j} x_{j}) \\ & = (ω^{T} X) (ϖ^{T} X) \end{aligned}$ (10) where $W_{e}^{2} \in R^{n \times n}$ is a parameter matrix. According to the tensor outer product, $X^{2} = X X^{T}$. Meanwhile, following the matrix decomposition principle, $W_{e}^{2}$ is approximated by the product of two hidden vectors of dimension n, $W_{e}^{2} = ω ϖ^{T}$, with ω, $ϖ \in R^{n}$.
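The identity in equation (10) can be checked numerically. The sketch below compares the full double sum over the n × n second-order terms with the factored form, using arbitrary example vectors; the function names are hypothetical.

```python
def full_second_order(omega, varpi, x):
    """sum(W o X^2) with W = omega varpi^T and X^2 = x x^T: touches all n * n terms."""
    n = len(x)
    return sum(omega[i] * varpi[j] * x[i] * x[j] for i in range(n) for j in range(n))

def factored_second_order(omega, varpi, x):
    """(omega . x) * (varpi . x): two dot products, so only O(n) work."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return dot(omega, x) * dot(varpi, x)

# Arbitrary example values chosen for illustration.
omega = [0.5, -1.0, 2.0]
varpi = [1.0, 0.25, -0.5]
x = [2.0, 4.0, 1.0]

full = full_second_order(omega, varpi, x)
fact = factored_second_order(omega, varpi, x)
print(full, fact)
```

Both expressions give the same value, but the factored form never materialises the n × n cross-feature matrix.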

Since $H_{e}^{2}$ is only the e-th dimensional feature of $H^{2}$, calculating $H^{2}$ requires E pairs of hidden vectors, i.e., $W, \bar{W} \in R^{E \times n}$. Thus, $H^{2}$ is calculated as follows: $H^{2} = (W X) \circ (\bar{W} X) = (W^{1} X) \circ (W^{2} X)$ (11)

So, the conversion formula for the k-order cross-feature can be expressed as follows: $H^{k} = (W^{1} X) \circ (W^{2} X) \circ \dots \circ (W^{k} X)$ (12)

After the low-rank approximation, the number of parameters is reduced from E × n^k to E × k × n and the time complexity is reduced from O (En^k) to O (Ekn), so both grow only linearly with the cross order. As a result, the dimensional explosion caused by cross-features is reduced to a linear level that the model can tolerate.
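A short sketch of Equation (12) and of the parameter counts discussed above follows. The weight matrices and the sizes E = 100, n = 200, k = 2 are illustrative assumptions rather than values taken from the experiments, and the function names are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cross_transform(weights, x):
    """Eq. (12): Hadamard product of k linear projections of the same input x."""
    out = [1.0] * len(weights[0])
    for W in weights:
        out = [o * p for o, p in zip(out, matvec(W, x))]
    return out

def param_counts(E, n, k):
    """Parameters of the full k-order transform versus the factored one."""
    return E * n ** k, E * k * n

# Second-order example with two 2 x 2 projection matrices.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 1.0], [1.0, -1.0]]
h = cross_transform([W1, W2], [2.0, 3.0])

full, factored = param_counts(E=100, n=200, k=2)
print(h, full, factored)
```

For these assumed sizes the factored form needs 40,000 parameters instead of 4,000,000, and each additional cross order adds only another E × n.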

4 CF-GCN

This paper builds a graph convolutional neural network model based on cross features. As shown in Figure 2, CF-GCN is mainly composed of Input, Bi-LSTM, GCN, Mask, and Output. Each part of the model is introduced separately in this section.

4.1 Input and Bi-LSTM

Given a sentence S = {w₁, w₂, ⋯ , w_n}, Aspect = {a₁, a₂, ⋯ , a_m} is the sequence of aspect terms in the sentence. In addition to word vectors, the input layer also adds position embedding and part-of-speech embedding information. This paper uses the embedding matrix $G \in R^{| v | \times d_{e}}$ provided by the pre-trained GloVe [28] word vectors, where |v| denotes the vocabulary size and d_e denotes the dimensionality of the word embedding. The word vector, position embedding, and part-of-speech embedding are concatenated to generate a new input vector V = {v₁, v₂, ⋯ , v_n}. This input vector is fed to the Bi-LSTM, and the hidden state vector $H_{Bi - LSTM} \in R^{n \times 2 d_{h}}$ is obtained through two LSTM layers running in opposite directions, where d_h is the output dimension of a single-direction LSTM.

Fig. 2

Structure of CF-GCN model based on cross-feature.

4.2 GCN

Firstly, the hidden state vector of Bi-LSTM is the input of the GCN layer. Then, the model obtains the embedded representation of the target node by aggregating the adjacent node information. The formula is: $h_{i}^{l^{'}} = c_{i} (\sum_{j = 1}^{n} A_{i j} h_{j}^{l})$ (13) where $h_{i}^{l^{'}}$ and $h_{j}^{l} \in R^{D} (D = 2 d_{h})$ represent the embedding feature representations of the target node and of the other nodes in layer l, respectively.

Secondly, the outer product operation is performed on the target node to generate the k-order cross-feature. The formula is:

$h_{i}^{l^{k}} = h_{i}^{l^{'}} \otimes h_{i}^{l^{'}} \otimes \dots \otimes h_{i}^{l^{'}}$ (14)

where $h_{i}^{l^{k}} \in R^{D^{k}}$. According to the low-rank approximation in formula (12), the feature representation of the target node is calculated as follows:

$h_{i}^{(l + 1)^{k}} = (W_{k}^{l} h_{i}^{l^{'}}) \circ (W_{k - 1}^{l} h_{i}^{l^{'}}) \circ \dots \circ (W_{1}^{l} h_{i}^{l^{'}})$ (15)

where $h_{i}^{(l + 1)^{k}}$ is the output feature representation of layer l. Because the output of each layer contributes differently to the final result, we aggregate the outputs of all layers to enrich the information in the final output. The formula is as follows:

$h = σ (\sum_{l = 1}^{L} h^{l})$ (16) where $h^{l} \in R^{E}$ is the output vector of each layer of GCN convolution, $h \in R^{E}$ is the final aggregation vector.
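To make the data flow of Equations (13), (15), and (16) concrete, the following sketch stacks two cross-feature layers on a two-node toy graph. It assumes c_i = 1/degree, ReLU as the σ of Equation (16), and equal input and output dimensions so that layers can be stacked; all function names and the identity weight matrices are illustrative.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def aggregate(A, H):
    """Eq. (13): degree-normalised sum of neighbour states (c_i = 1 / degree is assumed)."""
    out = []
    for row in A:
        degree = sum(row)
        c_i = 1.0 / degree if degree else 0.0
        out.append([c_i * sum(a * H[j][d] for j, a in enumerate(row))
                    for d in range(len(H[0]))])
    return out

def cross_layer(A, H, weights):
    """Eq. (15): Hadamard product of k linear projections of each aggregated node."""
    result = []
    for h in aggregate(A, H):
        crossed = [1.0] * len(weights[0])
        for W in weights:
            crossed = [c * p for c, p in zip(crossed, matvec(W, h))]
        result.append(crossed)
    return result

def cf_gcn(A, H0, layer_weights):
    """Eq. (16): sum the outputs of all layers, then apply ReLU (assumed sigma)."""
    H, total = H0, None
    for weights in layer_weights:
        H = cross_layer(A, H, weights)
        total = H if total is None else [[a + b for a, b in zip(r, s)]
                                         for r, s in zip(total, H)]
    return [[max(0.0, v) for v in row] for row in total]

identity = [[1.0, 0.0], [0.0, 1.0]]
A = [[1, 1], [1, 1]]              # two connected nodes with self-loops
H0 = [[1.0, 0.0], [0.0, 1.0]]     # stand-in for the Bi-LSTM hidden states
output = cf_gcn(A, H0, [[identity, identity], [identity, identity]])
print(output)
```

Two layers with cross order 2 correspond to the configuration reported as optimal in the experiments; with identity weights each layer simply squares the aggregated features, which makes the multiplicative interaction visible.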

4.3 Mask

In this paper, a mask filtering mechanism weakens the interference of irrelevant features on the final output and retains the feature representation of the aspect term itself. The formula is: $f = mask \{h_{1}, h_{2}, \dots, h_{a_{1}}, \dots, h_{a_{m}}, \dots, h_{n}\}$ (17) $mask = \begin{cases} 1, & \text{if } h_{i} \in Aspect \\ 0, & \text{otherwise} \end{cases}$ (18)

where f is the final output. The mask depends on whether the current word belongs to the aspect term: if it does, its hidden state vector is retained; otherwise, the vector is set to 0.
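The masking step of Equations (17) and (18) amounts to zeroing every hidden state outside the aspect span. A minimal sketch follows, with a hypothetical function name and toy values.

```python
def aspect_mask(hidden, aspect_positions):
    """Keep hidden states at aspect positions and replace all others with zeros."""
    keep = set(aspect_positions)
    return [list(row) if i in keep else [0.0] * len(row)
            for i, row in enumerate(hidden)]

# Toy sentence of four tokens where the aspect term covers positions 1 and 2.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
f = aspect_mask(hidden, [1, 2])
print(f)
```

Only the rows at the aspect positions survive, so later layers see the aspect representation without the surrounding context vectors.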

4.4 Output

In this paper, the final output obtained is fed to the fully connected layer and finally classified by softmax as follows: $p = softmax (W_{p} f + b_{f})$ (19) where $p \in R^{d_{p}}$ is the sentiment polarity decision space. d_p is the number of sentiment labels.

The model is trained with a cross-entropy loss function: $loss = - \sum_{i = 1}^{d_{p}} y_{i} \log p_{i}$ (20) where $y \in R^{d_{p}}$ is the true probability distribution and $y_{i}$ is its i-th component.
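Equations (19) and (20) can be sketched as follows for a single example with three sentiment labels. The logits stand in for the output of the fully connected layer, and the numerically stabilised softmax is a standard implementation choice, not something specified in the text.

```python
import math

def softmax(logits):
    """Softmax of Eq. (19), shifted by the maximum logit for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, p):
    """Eq. (20): -sum_i y_i * log(p_i); terms with y_i = 0 are skipped."""
    return -sum(y * math.log(q) for y, q in zip(y_true, p) if y > 0)

# Three labels (negative, neutral, positive) with equal logits and a one-hot target.
p = softmax([0.0, 0.0, 0.0])
loss = cross_entropy([0.0, 1.0, 0.0], p)
print(p, loss)
```

With equal logits the prediction is uniform and the loss equals log 3, the value expected from an uninformed three-way classifier.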

5 Experiment

5.1 Datasets

ALSC determines whether the sentiment polarity of a given aspect term in a sentence is positive, negative, or neutral. The experiments are conducted on three widely used benchmark datasets for ABSA, whose statistics are summarized in Table 1:

•Twitter is a dataset gathered by Dong et al. [29];

•Restaurant and Laptop are downloaded from SemEval 2014 task 4 [30], which contains sentiment reviews for restaurant and laptop domains.

Table 1
Statistics of datasets

Dataset	Positive		Neutral		Negative
	Train	Test	Train	Test	Train	Test
Twitter	1507	172	3016	336	1528	169
Laptop	976	337	455	167	851	128
Restaurant	2164	727	637	196	807	196

5.2 Hyper-parameter

In this paper, 300-dimensional GloVe word vectors are used to initialize the word embedding, and all the weight parameters of the model are initialized with a uniform distribution. The model performance is optimal when the number of GCN convolutional layers is set to 2 and the cross order is set to 2. The hyper-parameter settings of the model are shown in Table 2.

Table 2
Hyper-parameter setting

Hyper-parameter	Setting
Word embedding	300
Position embedding	30
Part-of-speech embedding	30
Bi-LSTM embedding	100
Batch size	32
Optimizer	Adam
Learning rate	0.01
Epoch	100
GCN dropout	0.5

5.3 Baseline

The model performance is compared with a range of benchmark models, briefly described below.

•TD-LSTM [1]: using LSTM to model the correlation between target words and context.

•ATAE-LSTM [2]: based on LSTM sequence modelling, an attention mechanism is designed to calculate the weights of aspect words and different contexts.

•IAN [3]: model the aspect words and contextual features, respectively, and design an interactive attention mechanism to learn feature representations of both.

•CDT [12]: an integrated model based on LSTM and GCN, which jointly learns the context information and syntactic dependency information of sentences.

•ASGCN [13]: an integrated model based on LSTM and GCN, using the aspect term generated by GCN to distinguish the context weight of LSTM output.

•Bi-GCN [21]: taking the word co-occurrence rate into account in syntactic modelling to compensate for parsing errors made by text processing tools.

•Kuma-GCN [22]: modifying the syntactic dependency tree to improve the sensitivity of aspect terms to sentiment words.

•RGAT [16]: Pruning and reconstructing of the dependency tree to weaken the interference of invalid features in the syntactic dependency tree.

•BSSCN [18]: Semantic modeling of GCN based on cognitive guidance of the brain.

•AGGCN [14]: using a special aspect gate designed to guide the encoding of aspect-specific information and construct a graph convolution network on the sentence dependency tree.

•Rep-Walk [31]: The RepWalk model leverages the syntactic structure of the sentence to find crucial contextual information and enriches the representation for the classification.

•DGEDT [32]: dual-transformer network model, which jointly considers the flat representations learned from Transformer and graph-based representations learned from the corresponding dependency graph in an iterative interaction manner.

•ABSACap [33]: The ABSACap model improves the multi-head self-attention and proposes a context mask mechanism based on an adjustable context window to effectively obtain the internal association between aspects and context.

5.4 Experimental result

Table 3
Model performance comparison

Model	Twitter		Laptop		Restaurant
	Acc	F1	Acc	F1	Acc	F1
TD-LSTM	71.53	68.21	68.97	63.21	77.86	65.93
ATAE-LSTM	68.50	66.27	68.18	62.77	78.39	68.06
IAN	72.50	70.81	72.05	67.38	79.26	70.09
Rep-Walk	72.41	70.40	76.20	71.90	81.80	73.20
DGEDT	74.80	73.40	76.80	72.30	83.90	75.10
ABSACap	72.92	70.23	76.16	71.78	81.74	72.66
CDT	74.66	73.66	77.19	72.99	82.30	74.02
ASGCN	72.15	70.40	75.55	71.05	80.77	72.02
Bi-GCN	74.16	73.35	74.59	71.84	81.79	73.01
Kuma-GCN	72.45	70.77	76.12	72.42	81.43	73.64
RGAT	75.57	73.82	77.42	73.76	83.30	76.08
BSSCN	75.14	73.27	77.74	74.21	82.59	74.22
AGGCN	73.64	72.20	73.53	68.99	84.37	73.82
CF-GCN	75.23	73.96	77.81	74.56	83.36	76.18

Table 3 presents the performance comparison of each model on the three datasets. Among a series of LSTM-based models, the TD-LSTM model slightly outperformed the ATAE-LSTM model on the Twitter and Laptop datasets, suggesting that the aspect term distinguished the importance of context at different locations. The IAN model argues that aspect terms should be modeled independently and designs interactional attention mechanisms to couple aspect terms with context features, outperforming the TD-LSTM model and ATAE-LSTM model on all three datasets.

The CDT model, which jointly learns contextual and syntactic dependency information, outperforms the IAN model on all three datasets, especially on the Laptop dataset, where Acc improves by 5.14 percentage points and F1 by 5.61 percentage points. These results demonstrate that syntactic information enhances aspect-level sentiment classification accuracy. The ASGCN model uses a gate mechanism to distinguish the contribution of neighboring nodes to the target node during the convolution process. Compared to the CDT model, its results on the three datasets are about 2 percentage points lower on average. The AGGCN model adds aspect terms to the LSTM model to avoid missing aspect-related sentiment information. Its accuracy on the Twitter and Laptop datasets is 1.02 and 3.66 percentage points lower than that of the CDT model, respectively. The results of ASGCN and AGGCN are generally lower than those of the CDT model, which we attribute to the CDT model adding position embedding vectors and part-of-speech embedding vectors to the inputs.

The Bi-GCN model constructs word co-occurrence matrices, and the Kuma-GCN model builds multiple graph structures. Both supplement the syntactic tree structure but are less effective than the CDT model on all three datasets. Notably, the accuracy of the Kuma-GCN model is 2.21 percentage points lower on the Twitter dataset. These results illustrate that the integrated model based on LSTM and GCN is simpler and more efficient. Similarly, the RGAT model reconstructs the syntactic tree and considers both the aspect terms and the edge information of the syntactic tree. The model achieves outstanding results on the Twitter dataset with an Acc of 75.57%. In contrast, the BSSCN model is based on human cognitive principles and combines GCN with convolutional neural networks to learn semantic information. Compared to the RGAT model, it improves the accuracy on the Laptop dataset by 0.32 percentage points.

The CF-GCN model introduces nonlinear factors into the GCN feature transformation process. On the Twitter dataset, its accuracy is 75.23%, which is 0.34 percentage points lower than that of RGAT, while its F1 value is 0.14 percentage points higher. On the Laptop dataset, the accuracy is 77.81% and the F1 value is 74.56%, which are 0.07 and 0.35 percentage points higher than those of BSSCN, respectively. On the Restaurant dataset, the accuracy is 83.36% and the F1 value is 76.18%, which are 0.06 and 0.10 percentage points higher than those of RGAT, respectively. These results suggest that the CF-GCN model is simpler and more effective than the RGAT and BSSCN models, that introducing cross-features into the GCN convolution process improves the ability of the GCN model to extract sentiment, and that the strong feature interaction of cross-features outperforms the weak interaction of linear combinations. In addition, this paper compares the CF-GCN model with models that use different methods, and the CF-GCN model performs well on all three datasets. On the Twitter dataset, the accuracy of the CF-GCN model is 2.82 percentage points higher than that of the Rep-Walk model and 2.31 percentage points higher than that of the ABSACap model. On the Laptop dataset, the accuracy of the CF-GCN model is 1.01 percentage points higher than that of the DGEDT model. These results indicate that the cross-feature combination method has advantages over other feature extraction methods and can better extract the correlations between features.

5.5 Ablation

We first performed an ablation analysis for the input layer to verify the rationality of position embedding and part-of-speech embedding merging based on the BI-LSTM model, as shown in Table 4.

Table 4
Input layer ablation comparison

Model	Twitter		Laptop		Restaurant
	Acc	F1	Acc	F1	Acc	F1
BI-LSTM	70.34	69.21	75.20	70.23	77.99	63.67
w post	71.39	69.49	76.04	72.10	81.30	71.51
w pos	72.16	69.77	72.96	68.66	78.26	64.82
w post & pos	72.53	73.17	77.08	73.17	80.41	69.55

Table 4 shows that when position information is included in the BI-LSTM, words close to the aspect term receive higher weights, while words farther away from the aspect term receive relatively lower weights. The performance of the w post model improves on all three datasets, indicating that the aspect-based position embedding serves to differentiate the importance of context. When part-of-speech information is introduced into the BI-LSTM, the performance of the w pos model improves on both the Twitter dataset and the Restaurant dataset, demonstrating the positive impact of part-of-speech information on analyzing the sentiment of the aspect term. Finally, both position and part-of-speech information were added to the BI-LSTM model for testing. This variant performs best on both the Twitter dataset and the Laptop dataset, demonstrating that both kinds of information contribute positively to the model’s performance.

Then, we performed an ablation analysis of the model architecture to verify the validity of the incorporation of each part of the model, performing an ablation test based on the CF-GCN model, as shown in Table 5.

Table 5

Model architecture ablation comparison

Model	Twitter		Laptop		Restaurant
	Acc	F1	Acc	F1	Acc	F1
CF-GCN	75.23	73.96	77.81	74.56	83.36	76.18
w/o mask	73.52	71.38	75.89	71.40	82.56	74.29
w/o cross	72.52	70.51	77.29	73.64	82.02	72.75

The results in Table 5 show that the performance of CF-GCN is clearly reduced after removing the Mask mechanism: the accuracy on the Twitter and Laptop datasets drops by 1.71 and 1.92 percentage points, respectively, indicating that the Mask mechanism plays a role in filtering invalid information. Secondly, after removing the cross features, the accuracy of the model on the Twitter and Restaurant datasets decreases by 2.71 and 1.34 percentage points, respectively, indicating that the linear combination has a weak ability to learn feature interaction and cannot give full play to the advantages of GCN.

5.6 Visualization

Since the depth of information captured by GCN is directly related to the number of GCN convolutional layers, we ran experiments on both the original GCN model and the integrated model of Bi-LSTM and GCN, treating the number of GCN convolutional layers as a hyper-parameter. The results are shown in Figure 3.

Fig. 3

GCN model convolutional layers test.

Figure 3 shows that the standalone GCN model performs best with 7 convolutional layers. With fewer than 4 layers, GCN cannot capture sufficiently deep node information; in the range of 8 to 10 layers, accuracy declines as the number of layers increases, which points to over-smoothing. The integrated model of Bi-LSTM and GCN, by contrast, performs best with 2 convolutional layers: because Bi-LSTM already models global sequence information, the integrated model does not need a large convolutional depth. Its lowest accuracy, 77.27%, is on par with the highest accuracy of the GCN model, 77.29%. We therefore adopt the integrated model and set the number of convolutional layers to 2.
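The over-smoothing effect can be illustrated with a minimal sketch. It assumes a weight-free, activation-free propagation step with a row-normalised adjacency matrix, which is a simplification of a real GCN layer, and a hypothetical 5-node chain standing in for a small dependency tree. The helper names are illustrative only.

```python
import numpy as np


def propagation_matrix(adjacency):
    """Row-normalised adjacency with self-loops: D^-1 (A + I)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])
    return a_hat / a_hat.sum(axis=1, keepdims=True)


def feature_spread(features):
    """Mean over feature columns of (max - min) across nodes."""
    return float((features.max(axis=0) - features.min(axis=0)).mean())


# Hypothetical 5-node chain graph (assumption, not taken from the datasets).
adjacency = np.zeros((5, 5))
for i in range(4):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0

propagate = propagation_matrix(adjacency)
features = np.eye(5)  # one-hot node features, maximally distinct at depth 0
spreads = [feature_spread(features)]
for _ in range(10):
    features = propagate @ features  # one simplified aggregation "layer"
    spreads.append(feature_spread(features))

for depth, spread in enumerate(spreads):
    print(f"layers={depth:2d}  spread={spread:.3f}")
```

Each propagation step replaces a node's features with an average over its neighbourhood, so the spread between nodes can only shrink as depth grows. This is one way to see why very deep stacks make node representations hard to tell apart.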

To test how the cross-feature order affects performance, we fixed the number of GCN convolutional layers at 2 and varied the cross order. The results are shown in Figure 4:

Fig. 4

CF-GCN model cross-order test.

As shown in Figure 4, the best result is obtained when the cross order is 2. When the cross order is 1, the features are combined only linearly, so no direct cross effect arises. When the cross order exceeds 2, the noise generated by crossing invalid features grows with the order, and accuracy decreases sharply. At a cross order of 5, the F1 value drops by 7.61%, indicating that the noise seriously degrades the model’s performance. The cross order is therefore set to 2.
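To make the notion of cross order concrete, the sketch below uses the cross layer of the Deep & Cross Network, x_{l+1} = x_0 (x_l · w_l) + b_l + x_l, as a stand-in. This is an assumption for illustration and is not necessarily the exact operator used in CF-GCN; the function name and toy inputs are hypothetical. With no cross layer the features stay linear (order 1), and each added layer raises the polynomial degree by one.

```python
import numpy as np


def cross_features(x0, weights, biases):
    """Apply DCN-style cross layers to x0 (shape: [n_nodes, d]).

    The resulting cross order is len(weights) + 1.
    """
    x = x0
    for w, b in zip(weights, biases):
        # x @ w is one scalar per node; scaling x0 by it multiplies features
        # together, raising the polynomial degree of the output by one.
        x = x0 * (x @ w)[:, None] + b + x
    return x


x0 = np.array([[1.0, 2.0]])  # one node with two toy features
order_1 = cross_features(x0, [], [])  # no cross layer: output equals input
order_2 = cross_features(x0, [np.ones(2)], [np.zeros(2)])  # one cross layer
print(order_1, order_2)
```

Because every extra layer multiplies in another copy of the input features, uninformative features are also multiplied in, which is consistent with the noise growth observed at higher orders.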

6 Case Study

This paper analyses the model’s performance using weighted heatmaps to highlight the differences between the cross-feature GCN and traditional GCN models. The CDT and ASGCN models share two properties with our CF-GCN model: all three follow the traditional syntactic tree structure, and all three are integrated models that combine LSTM and GCN. The difference is that the CDT model uses the original form of the GCN, and the ASGCN model improves the aggregation process of the GCN algorithm by using a gate mechanism to calculate the contribution of neighboring nodes to the target node. The CF-GCN model, on the other hand, improves the feature transformation process of the GCN algorithm by replacing the weighted summation operation with a feature product operation. The CDT and ASGCN models were therefore chosen as the baseline models. Since neither CF-GCN nor CDT adopts an attention mechanism, we measure the influence of each word by masking its word vector representation in the final output and taking the difference between the final output before and after masking as the word influence score. The higher the score, the greater the importance the model places on that word. The results are shown in Figure 5:

Fig. 5

CF-GCN example weight heatmap.
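The masking-based influence score described above can be sketched as follows. This is one reading of the procedure under stated assumptions: the word's vector is zeroed, the model is re-run, and the score is the absolute change in the output. The function `toy_model` is a hypothetical stand-in for a trained classifier, and its weights are arbitrary.

```python
import numpy as np


def influence_scores(word_vectors, model):
    """Score each word by how much masking it changes the model output."""
    baseline = model(word_vectors)
    scores = []
    for i in range(len(word_vectors)):
        masked = word_vectors.copy()
        masked[i] = 0.0  # mask the i-th word representation
        scores.append(np.abs(baseline - model(masked)).sum())
    return np.array(scores)


def toy_model(vectors):
    """Hypothetical classifier: mean-pool the word vectors, then a linear map."""
    weights = np.array([[1.0, -1.0], [0.5, 2.0]])
    return vectors.mean(axis=0) @ weights


# Three toy "words" with 2-dimensional representations (illustrative values).
sentence = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]])
scores = influence_scores(sentence, toy_model)
print(scores)
```

Normalising the scores over the sentence would give the per-word weights that a heatmap can display.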

Given the example "How can hope to stay in business with service like this ?", the aspect term is "service", which has "negative" sentiment polarity. CDT pays the most attention to "service" and roughly equal attention to all other words, which means that the model fails to obtain useful information from the sentence context and therefore misclassifies the example. ASGCN pays the most attention to "stay in" and also focuses on "hope" and "like". However, the object of "hope to stay in" is "business" rather than "service", and the polysemous word "like" further confuses the sentiment polarity, so the model also gives an incorrect classification. Finally, CF-GCN pays the most attention to "this". While also attending strongly to "stay", the model correctly identifies the auxiliary role of "with", and the object referred to by "with ... this" is exactly the aspect "service". The comparison shows that our model pays more attention to the sentence context than CDT and ASGCN and focuses on mining the co-occurrence relationships among words in the context.

7 Conclusions

The feature transformation of the traditional GCN model resembles a multi-layer perceptron: it uses weighted summation to generate linear combinations of features. In a semantic environment with ambiguous sentiment information, this way of learning feature interactions is limited. To address this problem, this paper designs a novel cross-feature graph convolutional neural network model. In the convolution process of GCN, the model generates non-linear features by feature multiplication, replacing the original linear combination. The advantage is that it iteratively generates different cross-feature pairs, constructing diverse feature co-occurrence patterns and improving the GCN model’s ability to learn potential connections between features. A series of experiments demonstrates the effectiveness of the cross features.

Our model also has several limitations. (1) Cross order: increasing the cross order not only increases model complexity and the risk of overfitting but also introduces noise, which may reduce the performance of the model. (2) Conflict between cross features and information pruning: cross features extract correlations between different features and widen the range of information the model collects, whereas pruning methods suppress irrelevant features and focus on a small portion of useful information. Currently, these two methods are not compatible with each other.

In future work, we will focus on developing a selective feature-cross algorithm that crosses only node information with a positive effect on aspect terms, to further improve the accuracy of aspect-based sentiment classification.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 62166041).

Table 1

Statistics of datasets

Dataset	Positive		Neutral		Negative
	train	test	train	test	train	test
Twitter	1507	172	3016	336	1528	169
Laptop	976	337	455	167	851	128
Restaurant	2164	727	637	196	807	196

Table 2

Hyper-parameter setting

Hyper-parameter	Setting
Word embedding	300
Position embedding	30
Part-of-speech embedding	30
Bi-LSTM embedding	100
Batch size	32
Optimizer	Adam
Learning rate	0.01
Epoch	100
GCN dropout	0.5

Table 4

Input layer ablation comparison

Model	Twitter		Laptop		Restaurant
	Acc	F1	Acc	F1	Acc	F1
BI-LSTM	70.34	69.21	75.20	70.23	77.99	63.67
w post	71.39	69.49	76.04	72.10	81.30	71.51
w pos	72.16	69.77	72.96	68.66	78.26	64.82
w post & pos	72.53	73.17	77.08	73.17	80.41	69.55