Memory: An adaptive contextual Memory framework for aspect-based dialogue sentiment quadruple analysis

Abstract

Aspect-based dialogue sentiment quadruple analysis (DiaASQ) is a critical task in sentiment analysis, aiming to extract sentiment quadruples (target, aspect, opinion, sentiment polarity) from dialogues. Existing methods primarily focus on single-sentence sentiment analysis, often neglecting the rich contextual information and long-range dependencies in multi-turn dialogues. To address this limitation, we propose a novel memory framework, Memory, which incorporates adaptive contextual memory mechanisms to simulate human-like emotional refinement during conversations. Our framework consists of three key components: a Contextual Knowledge Memorizer to capture token-level syntactic-semantic dependencies, an Utterance-level Sentiment Interactor to model speaker-respondent dynamics, and a Multi-granularity Memory Integrator to fuse token-level and utterance-level information for precise sentiment relationship extraction. Extensive experiments on two benchmark datasets demonstrate the framework’s superiority, achieving 10.14% and 6.03% improvements in Micro-F1, and 13.07% and 5.60% improvements in Iden-F1 on Chinese and English datasets, respectively.

Keywords

Memory; discourse structure; graph neural network; conversational aspect-based sentiment analysis

1. Introduction

Aspect-based dialogue sentiment quadruple analysis (DiaASQ)¹ is a significant subtask of sentiment analysis. It aims to extract precise sentiment information from multi-turn conversations. Conventional methods often focus on single-sentence classification. As a result, they miss the temporal context and structural complexity of real-world dialogues. Such approaches ignore the long-range dependencies found in conversational data, which reduces their practical value. The primary task involves extracting sentiment quadruples ($t$, $a$, $o$, $p$) from dialogues. Here, $t$ denotes the target under discussion, $a$ is a specific aspect (attribute) of that target, $o$ is the opinion expressed about the aspect, and $p$ indicates the sentiment polarity. As Figure 1 illustrates, the elements of a sentiment quadruple are frequently spread across multiple utterances. This dispersion complicates the detection of long-range dependencies. Overcoming these challenges aids Intelligent Customer Service and Social Media Analysis. In scenarios like e-commerce logs or threaded discussions on Weibo and Twitter, user opinions are rarely isolated. Instead, they develop through multi-turn interactions. Accurate extraction from these structures helps enterprises sort specific complaints, such as distinguishing “battery life” from “screen resolution.” Furthermore, this process supports public sentiment tracking in threaded conversations.

Figure 1.

Example of a dialogue (top) and its sentiment quadruples (bottom). In the dialogue, different sentiment elements are highlighted in different colors. Dotted lines indicate reply relationships, and letters in circles denote different speakers. The sentiment quadruples are annotated in the same way to show how their elements relate to the dialogue.

Prior studies have sought to tackle these challenges using graph convolutional networks (GCNs)²,³ and heterogeneous graphs⁴ to improve token- and discourse-level representations. These methods increase cross-layer interaction but often integrate information only at a shallow level, and they do not use memory mechanisms to retain and filter salient content from earlier turns. Methods such as dynamic multi-scale context aggregation (DMCA)⁵ show the same limitation: they do not capture long-range dependencies or track topic shifts in multi-turn conversations.

To overcome these limitations, this paper introduces a novel memory network framework (Memory), designed to emulate human-like memory filtering for dialogue sentiment analysis. Our framework is centered on three key components:

(1) A contextual knowledge memorizer that captures and filters token-level syntactic-semantic information by leveraging the “thread” structure of dialogues (subtree structures rooted in speaker-reply relationships).

(2) An utterance-level sentiment interactor that improves information exchange between speakers and respondents at the discourse level, offering complete structural discourse features.

(3) A multi-granularity memory integrator that effectively combines the filtered token-level and discourse-level memory to support accurate sentiment relationship extraction.

Memory departs from conventional sentiment analysis: it models human-like memory through selective storage and retrieval rather than processing each utterance in isolation. Empirical evaluations on Chinese (ZH) and English (EN) datasets support the effectiveness of the framework under standard settings. It yields improvements of 10.14% and 6.03% in Micro-F1, and 13.07% and 5.60% in Iden-F1, respectively. Relative to state-of-the-art baselines, the model reports gains of 1.71% (ZH) and 0.33% (EN) in Micro-F1, and 3.60% (ZH) and 0.21% (EN) in Iden-F1, and it remains effective for long-range dependencies and cross-dialogue sentiment analysis. This work offers four contributions:

(1) We present a memory network framework built around a dedicated memory unit that supports selective storage and retrieval across the full dialogue context. It supplies integrated dialogue information to downstream modules in sentiment analysis.

(2) We design a contextual knowledge memorizer and an utterance-level sentiment interactor that enrich context and support memory-guided information filtering; this design mitigates limitations in prior methods.

(3) We design a multi-granularity memory integrator that handles cross-layer memory fusion in the overall architecture.

(4) Experiments on the DiaASQ dataset show improvements over strong baselines and establish a performance benchmark for DiaASQ.

2. Related works

2.1. Single ABSA tasks

This section reviews single aspect-based sentiment analysis (ABSA) tasks whose objective is to predict one sentiment element only. The four tasks (corresponding to four sentiment elements) are aspect term extraction, aspect category detection, aspect sentiment classification, and opinion term extraction.

Aspect term extraction (ATE) is a core task in ABSA. It identifies aspect-related phrases in text that anchor sentiment. Several methods have been proposed to improve ATE. For example, Yin et al.⁶ proposed a syntactic dependency-aware embedding framework (POD), which integrates dependency patterns with positional contextual information, and Wang et al.⁷ proposed a progressive self-training method.

Aspect category detection (ACD) identifies the aspect categories discussed in a sentence, where categories come from a predefined, often domain-specific set. Compared with ATE, ACD has two advantages. First, whereas ATE focuses on discrete aspect terms, ACD yields a consolidated representation of opinion targets through aggregated outputs. Second, ACD can detect implicit opinion targets without explicit textual mentions. For instance, in “It is very overpriced and not tasty,” ACD identifies the categories price and food, even though ATE would not apply. Prior studies explore different strategies for category assignment. In Tulkens and van Cranenburgh,⁸ category labels are assigned by cosine similarity between sentence embeddings and predefined category vectors. Recent work by Shi et al.⁹ proposed a precision-oriented discriminative mapping framework and improved alignment fidelity.

Aspect sentiment classification (ASC), also known as aspect-oriented or context-specific sentiment analysis, predicts the polarity linked to a particular aspect in text. Recent work proposes methods that directly use structural linguistic features. One line employs a graph convolutional network to model syntactic relations.¹⁰ Another line uses dependency parse trees to guide sentiment classification.¹¹

Opinion term extraction (OTE) detects sentiment expressions associated with given aspects. Because opinion and aspect terms often co-occur, extracting opinion terms in isolation is less informative. OTE is therefore divided into two sub-tasks based on whether an aspect term is provided: (1) aspect-opinion co-extraction (AOCE), which extracts aspect and opinion terms jointly, and (2) target-oriented opinion words extraction (TOWE), which extracts opinion words for a specified aspect term. For instance, Veyseh et al.¹² use syntactic structure, such as dependency-tree distance to the aspect, to aid identification of opinion terms. Mensah et al.¹³ empirically evaluate positional embeddings across encoders and report that BiLSTM-based methods have an inductive bias suited to TOWE; they also add a GCN to capture structure, but the gain is small.

2.2. Compound ABSA tasks

This section reviews compound ABSA tasks that target multiple sentiment elements. They are often cast as integrated versions of the single-element tasks above. The goal is not only to extract multiple elements but also to link them through prediction of paired (two elements), triplet (three elements), or quad (four elements) structures that encode explicit associations among aspects, opinions, and polarities within a sentence.

Studies on the aspect-opinion co-extraction (AOCE) task often find that extracting aspect and opinion terms can benefit each other. However, AOCE outputs separate sets of aspects and opinions and does not record explicit links between them. The aspect-opinion pair extraction (AOPE) task was proposed to resolve this issue; it jointly extracts aspect–opinion pairs and returns explicit mappings between each opinion target and its expression.¹⁴ Wu et al.¹⁵ presented a grid-based annotation framework (GTS), in which the model classifies word pairs into four types: intra-aspect, intra-opinion, aspect–opinion pairs, or unrelated. This change converts pair extraction into a unified token-classification problem. Later work added syntactic and other linguistic cues and reported improved extraction.¹⁶

End-to-end ABSA (E2E-ABSA) identifies aspect terms and their sentiment polarities as ($a$, $p$) pairs directly from input sentences. The task can be decomposed into aspect term extraction (ATE) and aspect sentiment classification (ASC). A simple pipeline is natural, but boundary detection and polarity prediction can support each other. For example, in “I like pizza,” the word “like” signals positive polarity and helps mark “pizza” as the target. Several end-to-end frameworks adopt this view. Because opinion terms are informative for aspect occurrence and polarity, many models also learn to extract them as an auxiliary objective.¹⁷,¹⁸ Recent studies report gains with pipeline,¹⁹ unified,²⁰ and joint¹⁷ designs.

Aspect category sentiment analysis (ACSA) jointly detects discussed aspect categories and their sentiment polarities. It resembles E2E-ABSA, but the aspect is a predefined category and may be implicit or explicit in the sentence. This flexibility supports industrial use. Cai et al.²¹ proposed a hierarchical GCN method (Hier-GCN): a lower layer models relations among categories, and a higher layer models relations between categories and category-oriented sentiments. Liu et al.²² used a sequence-to-sequence architecture for ACSA; they use a pretrained generative model and express outputs as natural-language sentences, and the method outperforms prior classification-based models. The approach also uses prior knowledge well and performs strongly in few-shot and zero-shot settings.

The aspect sentiment triplet extraction (ASTE) task²³ identifies ($a$, $o$, $p$) triplets in text, where each triplet contains the opinion target ($a$), its polarity ($p$), and the opinion term ($o$) that explains the sentiment. Triplet prediction yields a more complete view of sentiment than solving the individual subtasks in isolation. In recent years, ASTE has drawn broad attention and has led to frameworks from different paradigms.²⁴,²⁵

Compound ABSA seeks fine-grained sentiment at the aspect level, either via pairwise extraction (e.g., AOPE) or triplet extraction (e.g., ASTE). These settings are useful, but a single model that predicts all four sentiment elements in parallel provides a more complete representation. This motivation led to the aspect sentiment quad prediction (ASQP) task, which jointly extracts all four elements as unified quadruples from text. Zhang et al.²⁶ proposed a paraphrase-based framework that generates sentiment quads with end-to-end learning. They integrate annotated sentiment elements with predefined templates and then reformulate quad prediction as a text-generation problem solved with a sequence-to-sequence architecture. The method makes full use of label semantics (i.e., the contextual meaning of sentiment elements). Subsequent studies extend this line to output opinion trees²⁷,²⁸ or structured schemas²⁹ and further specify the task design.

2.3. Sentiment analysis in conversation

Sentiment analysis in conversation presents unique challenges, including multi-actor interactions, long-range dependencies, and dynamic topic evolution. Traditional approaches to conversation sentiment analysis have primarily focused on coarse-grained sentiment classification, often overlooking the fine-grained sentiment elements that are crucial for understanding complex dialogues. Early methods relied on sequence modeling or graph networks to capture dialogue features, but these approaches struggled to encode the intricate structures of multi-turn conversations. Generative models, while effective in some contexts, often fail to handle the complexity of dialogue structures, and pre-trained models are constrained by input length limitations, making them less effective for long dialogues.

Recent advancements have introduced GCNs²,³ to address these challenges by enabling fusion at multiple granularities. These methods have improved the modeling of token-level and discourse-level interactions, but still face difficulties in handling long dialogues and dialogue intersections. For instance, H2DT⁴ introduced heterogeneous graphs to model dialogue structures, but its performance on long dialogues remains suboptimal. Despite these limitations, these studies have laid a solid foundation for addressing the complexities of multi-actor interactions and long-range dependencies in conversations.

This work builds on that line with a Memory model that adds a memory mechanism to capture multi-actor interaction and to cope with long-range dependencies and topic shifts in long dialogues. The model combines token-level and discourse-level information via a multi-granularity memory framework and offers a more complete solution for dialogue sentiment analysis.

2.4. Memory-augmented models and long-context modeling

Recent progress in memory-augmented neural networks and long-context mechanisms has reshaped natural language processing. Original memory models relied on external components for data storage and retrieval to support reasoning and question-answering.³⁰ Modern large language models (LLMs) now extend context windows through specialized attention mechanisms; these systems process tens of thousands of tokens in a single pass.³¹

Comprehensive assessments like the HELMET benchmark³² reveal that current long-context models fail to reliably extract and process data embedded deep within lengthy input sequences. These performance gaps motivated new active memory management frameworks; for example, MemOS³³ organizes memory-augmented generation through a structured, multi-tier approach. Yet, adapting generic long-context methods for multi-turn dialogue sentiment analysis remains difficult due to unique linguistic properties. Standard models often process inputs as uniform, linear sequences. Such simple concatenation ignores the specific structural links found in dialogue, such as speaker dynamics and complex thread-and-reply patterns.

To fill this research gap, the Memory framework moves away from basic flat sequence modeling. The system instead includes an Adaptive Contextual Memory Unit (ACMU) that uses the structural threads of dialogues. Through human-like selective memory and forgetting, our model filters out irrelevant conversational noise. At the same time, it retains long-range sentiment cues across multiple turns. This specialized memory mechanism connects local utterance features with global discourse structures; it provides a more precise and interpretable solution for DiaASQ than earlier memory-based architectures in this field.

3. Method

Table 1 summarizes the primary mathematical notations and variables used throughout our Memory framework.

Table 1.
Notations and their corresponding descriptions used in the memory framework.

Notations	Description	Notations	Description
$D, u_{i}$	A multi-turn dialogue and its $i$-th utterance	$w_{j}, |u_{i}|$	The $j$-th token and the length of $u_{i}$
$τ_{k}, Q$	The $k$-th thread structure and sentiment quad	$t, a, o, p$	Target, aspect, opinion, and sentiment polarity
$s_{i}, r_{i}$	Speaker indicator and reply index for $u_{i}$	$[cls], ψ_{i}$	Special token and structural positional info
$A^{syn}, A^{sem}$	Syntactic and semantic adjacency matrices	$H^{syn}, H^{sem}$	Syntactic and semantic GCN representations
$h_{i}^{l}$	Token hidden state at the $l$-th GCN layer	$H^{tok}$	Fused token-level representation
$I_{t}, F_{t}, O_{t}$	Input, forget, and output gates in ACMU	${\tilde{M}}_{t}, M_{t}$	Candidate and updated memory states at step $t$
$X_{t}, H_{t}$	Input and hidden state of ACMU at step $t$	$W, b$	Learnable weight matrices and biases
$A^{utt}$	Utterance-level replying adjacency matrix	$H_{i}^{wu}$	Word-level weighted utterance representation
$H^{utt}, h_{i}^{o}$	Utterance-level representation and final state	$G, d$	Gating mechanism matrix and hidden dimension
$S^{tok-tok}$	Intra-level (token-to-token) attention score	$S^{tok-utt}, S^{utt-tok}$	Inter-level (cross-granularity) attention scores
$A^{itg}$	Integrated multi-granularity attention matrix	$H^{itg}, H^{f}$	Integrated and final fused representation
$u_{i}^{r}, v_{i}^{r}$	Transformed vectors with relative position	$R(θ, i)$	Rotary position encoding (RoPE) operator
$s_{ij}^{r}, p_{ij}^{ent}$	Relation score and entity classification prob.	$σ, LN$	Sigmoid activation function and Layer Normalization
$L_{ent}, L_{pair}, L_{pol}$	Loss components for entity, pair, and polarity	$L, β, η$	Total joint loss and hyperparameter weights

3.1. Problem definition and preliminary statement

This section formally defines the DiaASQ task and states the assumptions used in its formulation. The goal is to extract sentiment quadruples (target, aspect, opinion, and sentiment polarity) from multi-turn dialogues. These quadruples encode the sentiment information expressed in a conversational exchange.

Consider a dialogue $D = \{u_{1}, \dots, u_{n}\}$, where each $u_{i}$ denotes the textual content of the $i$-th utterance in the dialogue, and $n$ is the total number of utterances. The dialogue is paired with a replying record $r = \{r_{1}, \dots, r_{n}\}$ that encodes the hierarchical structure by linking utterances. For instance, $r_{i}$ indicates that $u_{i}$ replies to $u_{r_{i}}$. Each utterance $u_{i}$ is represented as a sequence of words $u_{i} = \{w_{1}, \dots, w_{m}\}$, where $m$ is the number of words in $u_{i}$.

Given the dialogue $D$ and the replying record $r$, the objective of DiaASQ is to extract all sentiment quadruples $Q = \{(t_{k}, a_{k}, o_{k}, p_{k})\}_{k = 1}^{K}$. Each quadruple consists of:

(1) $t_{k}$: the target, which refers to the main entity or subject of the sentiment.
(2) $a_{k}$: the aspect, which is a specific feature or attribute of the target.
(3) $o_{k}$: the opinion, which expresses the evaluation or sentiment toward the aspect.
(4) $p_{k}$: the sentiment polarity, reflecting the expressed sentiment (e.g., positive, negative, or other categories).

Each textual element of a quadruple (target, aspect, opinion) is a substring of an utterance in $D$, and the elements of one quadruple may come from different utterances. The model must therefore capture inter-utterance dependencies and long-range context across the dialogue.

In line with the grid annotation framework of Li et al.,¹ DiaASQ is decoupled into three joint subtasks: entity boundary detection, entity-pair identification, and sentiment polarity prediction. The entity labels are {tgt, asp, opi} for target, aspect, and opinion. Relations between entities are denoted by {t2t, h2h}. Sentiment polarities are labeled as {pos, neg, other}.

3.2. Base encoding

In dialogue sentiment analysis, it is essential to capture long-range dependencies and speaker-specific interactions to track sentiment over a conversation. To improve contextual extraction and handle the length limits of pre-trained language models (PLMs), we adopt a discourse unit–thread architecture. The approach uses the thematic coherence of discourse units within the same conversational thread. Each utterance $u_{i}$ is augmented with speaker identity information, yielding ${\tilde{u}}_{i} = \{[cls], u_{i}, ψ_{i}\}$, where $[cls]$ is the special classification token and $ψ_{i}$ represents the speaker embedding. The dialogue is organized into threads, and each thread $τ_{k} = \{{\tilde{u}}_{1}, {\tilde{u}}_{α}, {\tilde{u}}_{α + 1}, \dots, {\tilde{u}}_{ω}\}$ consists of discourse units linked by reply relations; this arrangement preserves context across turns. To process these threads under PLM constraints, we apply hierarchical encoding. We pass each thread $τ_{k}$ through the PLM and obtain hierarchical representations $H_{k}^{t}$ that encode token-level and contextual information.

\begin{aligned} H_{k}^{t} & = \{H_{1}^{u'}, H_{i}^{u'}, \dots, H_{j}^{u'}\} = PLMs (τ_{k}), \end{aligned}

(1)

\begin{aligned} H_{i}^{u'} & = \{h_{i}^{cls}, H_{i}^{τ}, h_{i}^{s}\}, \end{aligned}

(2)

where the utterance feature $H_{i}^{τ} \in \mathbb{R}^{m_{i} \times d}$ consists of token-level representations, and $h_{i}^{cls}$ and $h_{i}^{s}$ correspond to the $[cls]$ token and the speaker information of ${\tilde{u}}_{i}$.
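The grouping of utterances into threads can be illustrated with a minimal sketch. It assumes that the reply record is given as 0-based indices with -1 for the root, that utterance 0 is the root, that every reply points to an earlier utterance, and that a thread is the root plus one subtree attached to it. The function name `build_threads` is hypothetical and is not taken from the original implementation.

```python
from collections import defaultdict


def build_threads(replies):
    """Group utterance indices into threads that all share the root utterance.

    replies[i] is the index of the utterance that utterance i replies to,
    and the root uses -1. Assumptions of this sketch: utterance 0 is the
    root, and every reply points to an earlier utterance.
    """
    top = {}                      # utterance -> its ancestor that replies directly to the root
    members = defaultdict(list)   # that ancestor -> utterances in its subtree
    for i, parent in enumerate(replies):
        if parent == -1:
            continue              # the root is shared by every thread
        top[i] = i if parent == 0 else top[parent]
        members[top[i]].append(i)
    # Each thread is the root followed by one subtree, in dialogue order.
    return [[0] + members[a] for a in sorted(members)]


replies = [-1, 0, 1, 1, 0, 4, 5]   # hypothetical 7-utterance dialogue
threads = build_threads(replies)
print(threads)                     # [[0, 1, 2, 3], [0, 4, 5, 6]]
```

Each resulting index list would then be turned into one PLM input, so that no single input has to hold the whole dialogue.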

4. Memory model

The proposed model Memory consists of four core components. The Contextual Knowledge Memorizer extracts token-level features via GCNs that integrate syntactic and semantic representations. The Utterance-level Sentiment Interactor extends this by focusing on discourse-level interactions, constructing structured contextual representations to model sentiment at the utterance level. The Multi-granularity Memory Integrator introduces a hierarchical attention mechanism to unify token-level and utterance-level information. Finally, the Quadruple Decoding mechanism facilitates joint inference across multiple sentiment-related subtasks, optimizing performance while addressing class imbalances. Figure 2 depicts all modules of the model.

Figure 2.

The overall architecture of our proposed Memory.

4.1. Contextual knowledge memorizer

To improve token-level information extraction in dialogues, the contextual knowledge memorizer adopts a dual-path design that consists of a syntactic parsing module and a semantic module. The syntactic path captures grammatical structure features, and the semantic path captures deep semantic representations. The two outputs undergo adaptive fusion with the original BERT embeddings through gating mechanisms, and the fused representation is then passed to a memory-augmented network that retains salient information and models long-range dependencies.

Assuming that the graph contains $n$ nodes and the activation function is denoted by $σ$ , the representation of the $i$ -th node in the $l$ -th layer of GCN can be formulated as follows:

\begin{aligned} h_{i}^{l} = σ (\sum_{j = 1}^{n} A_{i j} W^{l} h_{j}^{l - 1} + b^{l}), \end{aligned}

(3)

where $A_{ij}$ is the $(i, j)$-th entry of the adjacency matrix, and $W^{l}$ and $b^{l}$ are the learnable weight matrix and bias of the $l$-th layer.
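As a minimal illustration of Eq. (3), the sketch below applies one graph-convolution layer with NumPy. It stores node features as rows, so the product $W^{l} h_{j}^{l-1}$ becomes `H[j] @ W`. The choice of ReLU for $σ$, the toy graph, and the self-loops in `A` are assumptions made for the example only.

```python
import numpy as np


def gcn_layer(A, H, W, b):
    """One GCN layer: row i of the output is sigma(sum_j A[i, j] * (H[j] @ W) + b)."""
    return np.maximum(A @ H @ W + b, 0.0)   # ReLU stands in for sigma (assumption)


# Hypothetical 3-node chain graph; self-loops are added here by choice.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
H0 = np.eye(3)          # one-hot node features
W = np.eye(3)           # identity weights make the aggregation visible
b = np.zeros(3)
H1 = gcn_layer(A, H0, W, b)   # with these inputs, H1 equals A
```

With one-hot features and identity weights, each output row simply lists the neighbors of the node, which shows that the layer aggregates features along the edges of the graph.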

Syntactic GCN (synGCN): Previous studies in ABSA have demonstrated that dependency parsing trees can effectively model semantic correlations between aspect terms and opinion terms. Inspired by this, we propose the use of GCNs for extracting syntactic features from dependency trees.

To extract syntactic features, we construct a syntactic adjacency matrix for the $k$ -th thread, which represents the syntactic relationships between words. The adjacency matrix is defined as follows:

\begin{aligned} A_{k, ij}^{syn} = \begin{cases} 1, & \text{if words } w_{i} \text{ and } w_{j} \text{ are linked by a dependency relation}, \\ 0, & \text{otherwise}. \end{cases} \end{aligned}

(4)

Here, $A_{k, ij}^{syn}$ captures the binary syntactic relationship between each pair of words in the $k$-th thread. This matrix serves as the input for the first layer of the GCN, along with the thread’s textual features $H^{t}$. The GCN then processes these inputs to produce the syntactic feature representation $H^{syn} = GCNs (A^{syn}, H^{t})$, where GCNs denotes the graph convolution operation.

Semantic GCN (semGCN): Our framework additionally extracts contextual embeddings through attention-based feature interactions. The semantic neighborhood matrix is formulated as:

\begin{aligned} A^{sem} = Atten (H^{t} W^{Q}, H^{t} W^{K}), \end{aligned}

(5)

where $Atten (Q, K)$ is the attention function that scores the relationships between the query and key matrices. It is defined as:

\begin{aligned} Atten (Q, K) = softmax (\frac{Q K^{T}}{\sqrt{d}}) . \end{aligned}

(6)

Here, $Q = H^{t} W^{Q}$ and $K = H^{t} W^{K}$ are the query and key projections of the input representations $H^{t}$, and $d$ is the dimension of the threaded text feature. The softmax operator normalizes each row of attention weights so that the weights sum to one.

Once the semantic neighborhood matrix $A^{sem}$ is computed, it is used as input to the GCN to derive the semantic feature representation $H^{sem}$, which is updated through graph convolutions: $H^{sem} = GCNs (A^{sem}, H^{t})$.
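The semantic path of Eqs. (5) and (6) can be sketched as follows. This is an illustrative NumPy version under stated assumptions: square projection matrices, random toy inputs, a single graph-convolution layer, and ReLU as the activation. None of these settings is taken from the original implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def semantic_adjacency(H, W_q, W_k):
    """A_sem = softmax(Q K^T / sqrt(d)) with Q = H W_q and K = H W_k."""
    d = H.shape[-1]
    Q, K = H @ W_q, H @ W_k
    return softmax(Q @ K.T / np.sqrt(d), axis=-1)


rng = np.random.default_rng(0)
n, d = 5, 8                                   # hypothetical thread length and feature size
H_t = rng.normal(size=(n, d))
W_q = rng.normal(scale=0.1, size=(d, d))
W_k = rng.normal(scale=0.1, size=(d, d))
A_sem = semantic_adjacency(H_t, W_q, W_k)     # dense n x n matrix whose rows sum to one

W_g = rng.normal(scale=0.1, size=(d, d))      # weights of a single graph-convolution layer
H_sem = np.maximum(A_sem @ H_t @ W_g, 0.0)    # ReLU is an assumption
```

Unlike the binary syntactic matrix, `A_sem` is dense and soft, so every token can pass information to every other token in the thread, weighted by attention.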

Adaptive Contextual Memory Unit (ACMU): ACMU is a novel neural architecture designed to effectively capture and integrate syntactic, semantic, and textual features for sentiment classification. The process of ACMU includes input representation, fusion of syntactic and semantic information, gating mechanisms, and memory update.

Let $X_{t} \in \mathbb{R}^{n \times d}$ denote the input textual features at time step $t$, where $n$ is the number of tokens and $d$ is the dimensionality of the feature space. The hidden state from the previous time step is represented as $H_{t - 1} \in \mathbb{R}^{n \times h}$, where $h$ is the dimensionality of the hidden state.

The ACMU begins by computing a fusion score $G$ using a sigmoid activation function $σ$ :

\begin{aligned} G = σ (W x + b), \end{aligned}

(7)

where $W$ and $b$ are learnable parameters. This score $G$ is used to dynamically fuse the syntactic representation $H^{syn}$ and the semantic representation $H^{sem}$ with the textual features $H^{t}$:

\begin{aligned} H^{tok} = LN (G H^{t} + (1 - G) (H^{syn} + H^{sem})), \end{aligned}

(8)

where $LN (\cdot)$ denotes layer normalization. The resulting token representation $H^{tok}$ is then passed to the next module.
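The gated fusion of Eqs. (7) and (8) can be sketched in a few lines. The text does not state what the gate input $x$ is, so this sketch assumes $x = H^{t}$ and an element-wise gate. The layer normalization here has no learned scale or shift, and all inputs are random placeholders.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def gated_fusion(H_t, H_syn, H_sem, W, b):
    G = sigmoid(H_t @ W + b)                        # Eq. (7), taking x = H_t (assumption)
    mixed = G * H_t + (1.0 - G) * (H_syn + H_sem)   # Eq. (8) before normalization
    return layer_norm(mixed)


rng = np.random.default_rng(0)
n, d = 5, 8                                         # hypothetical sizes
H_t, H_syn, H_sem = (rng.normal(size=(n, d)) for _ in range(3))
W, b = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
H_tok = gated_fusion(H_t, H_syn, H_sem, W, b)
```

A gate value near one keeps the original PLM feature for that dimension, and a value near zero replaces it with the sum of the two graph-based features.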

The ACMU employs three gating mechanisms to control the flow of information: the input gate $I_{t}$, the forget gate $F_{t}$, and the output gate $O_{t}$. These gates are computed as follows:

\begin{aligned} I_{t} & = σ (X_{t} W_{x i} + H_{t - 1} W_{h i} + b_{i}), \end{aligned}

(9)

\begin{aligned} F_{t} & = σ (X_{t} W_{x f} + H_{t - 1} W_{h f} + b_{f}), \end{aligned}

(10)

\begin{aligned} O_{t} & = σ (X_{t} W_{x o} + H_{t - 1} W_{h o} + b_{o}), \end{aligned}

(11)

where $W_{x i}, W_{x f}, W_{x o} \in \mathbb{R}^{d \times h}$ and $W_{h i}, W_{h f}, W_{h o} \in \mathbb{R}^{h \times h}$ are weight matrices, and $b_{i}, b_{f}, b_{o} \in \mathbb{R}^{1 \times h}$ are bias vectors.

The candidate memory ${\tilde{M}}_{t}$ is computed using a hyperbolic tangent activation function:

\begin{aligned} {\tilde{M}}_{t} = \tanh (X_{t} W_{x c} + H_{t - 1} W_{h c} + b_{c}), \end{aligned}

(12)

where $W_{x c} \in \mathbb{R}^{d \times h}$ and $W_{h c} \in \mathbb{R}^{h \times h}$ are weight matrices, and $b_{c} \in \mathbb{R}^{1 \times h}$ is a bias vector. The memory cell $M_{t}$ is then updated by combining the previous memory $M_{t - 1}$ and the candidate memory ${\tilde{M}}_{t}$:

\begin{aligned} M_{t} = F_{t} ⊙ M_{t - 1} + I_{t} ⊙ {\tilde{M}}_{t}, \end{aligned}

(13)

where $⊙$ denotes element-wise multiplication.

The hidden state $H_{t}$ is computed using the output gate $O_{t}$ and the updated memory $M_{t}$ :

\begin{aligned} H_{t} = O_{t} ⊙ \tanh (M_{t}) . \end{aligned}

(14)

This ensures that each entry of the hidden state remains within the interval $(-1, 1)$, allowing for efficient information propagation to the prediction layer. Algorithm 1 presents the flow of the contextual knowledge memorizer in detail.
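The gate and memory updates of Eqs. (9) to (14) follow the familiar pattern of a gated recurrent cell and can be sketched as one step function. The text does not specify what a time step indexes, so the loop below simply iterates over a list of placeholder inputs. The helper `init_params`, the sizes, and the random initialization are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, h, rng):
    """Hypothetical initializer: one (W_x, W_h, b) triple per gate and for the candidate."""
    return {name: (rng.normal(scale=0.1, size=(d, h)),
                   rng.normal(scale=0.1, size=(h, h)),
                   np.zeros((1, h)))
            for name in ("i", "f", "o", "c")}


def acmu_step(X_t, H_prev, M_prev, params):
    def affine(name):
        W_x, W_h, b = params[name]
        return X_t @ W_x + H_prev @ W_h + b

    I_t = sigmoid(affine("i"))             # input gate, Eq. (9)
    F_t = sigmoid(affine("f"))             # forget gate, Eq. (10)
    O_t = sigmoid(affine("o"))             # output gate, Eq. (11)
    M_cand = np.tanh(affine("c"))          # candidate memory, Eq. (12)
    M_t = F_t * M_prev + I_t * M_cand      # memory update, Eq. (13)
    H_t = O_t * np.tanh(M_t)               # hidden state, Eq. (14)
    return H_t, M_t


rng = np.random.default_rng(0)
n, d, h = 4, 6, 5                          # hypothetical sizes
params = init_params(d, h, rng)
H, M = np.zeros((n, h)), np.zeros((n, h))
for X in rng.normal(size=(3, n, d)):       # three hypothetical steps
    H, M = acmu_step(X, H, M, params)
```

The forget gate scales down what is kept from `M_prev` and the input gate scales what is written from the candidate, which is how the unit retains or discards content from earlier steps.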

4.2. Utterance-level sentiment interactor

The utterance-level sentiment interactor has two primary components: a Top-k selector and an utterance GCN. Both components are described below.

Top-k Selector: Let $H_{i}^{u} \in \mathbb{R}^{m_{i} \times d}$ denote the token representations of the $i$-th utterance, where $m_{i}$ is the number of tokens and $d$ is the dimensionality of the hidden states. The token scores $P^{u} = \{p_{1}^{u}, \dots, p_{n}^{u}\}$, one entry per utterance, are calculated as:

\begin{aligned} p_{i}^{u} = H_{i}^{u} \cdot W^{s} + b^{s}, \end{aligned}

(15)

where

W^{s} \in R^{d \times 1}

and

b^{s}

are learnable parameters. The resulting token scores are aggregated into a matrix

p_{i}^{u} \in R^{m_{i} \times 1}

. Using the computed token scores, we select the top

k

tokens that contribute most significantly to the sentiment representation. The index of these tokens is obtained by:

\begin{aligned} {idx}_{i}^{u} = argmax (p_{i}^{u}, k), \end{aligned}

(16)

where

k = m_{i} * λ

and

λ \in (0, 1)

is a hyperparameter controlling the proportion of tokens to be selected.

The selected tokens are emphasized by weighting them according to their scores. Specifically, we compute the weighted utterance representation as:

\begin{aligned} H_{i}^{w u} = softmax (p_{i}^{u} [{idx}_{i}^{u}]) ⊙ H_{i}^{u} [{idx}_{i}^{u}] \end{aligned}

(17)

where $⊙$ denotes the element-wise multiplication operation.
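The selection in Eqs.(15)–(17) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `topk_weighted_tokens` is hypothetical, and rounding $k$ down to an integer with a minimum of one token is an assumption, since the text does not state how a non-integer $m_{i} \cdot λ$ is handled.

```python
import numpy as np

def topk_weighted_tokens(H_u, W_s, b_s, lam):
    """Score tokens, keep the top-k, and weight them by a softmax over their scores."""
    scores = (H_u @ W_s + b_s).ravel()        # Eq. (15): one score per token, shape (m_i,)
    k = max(1, int(H_u.shape[0] * lam))       # assumption: floor(m_i * lambda), at least 1
    idx = np.argsort(-scores)[:k]             # Eq. (16): indices of the k largest scores
    sel = scores[idx]
    w = np.exp(sel - sel.max())
    w = w / w.sum()                           # softmax over the selected scores only
    return idx, w[:, None] * H_u[idx]         # Eq. (17): weights broadcast over features

# Toy utterance with 6 tokens and 2 features; the score is simply the first feature.
H_u = np.arange(12, dtype=float).reshape(6, 2)
W_s = np.array([[1.0], [0.0]])
idx, H_wu = topk_weighted_tokens(H_u, W_s, b_s=0.0, lam=0.5)
```

With $λ = 0.5$ and six tokens, three tokens are kept, so `H_wu` has shape `(3, 2)`.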

Utterance GCN (UttGCN): We build on the weighted utterance representation and form a discourse adjacency matrix $A^{u t t} \in R^{n \times n}$ to capture reply links among utterances in the dialogue. The matrix is defined as:

\begin{aligned} A_{i j}^{u t t} = \begin{cases} 1, & \text{if utterances } u_{i}, u_{j} \text{ contain a replying relationship}, \\ 0, & \text{otherwise} . \end{cases} \end{aligned}

(18)

We then propagate utterance-level features through a GCN to encode structured context and speaker information. The overall utterance representation is

\begin{aligned} H^{u t t} = LN (GCNs (A^{u t t}, H^{o}) + H^{o}), \end{aligned}

(19)

where $H^{o} = \{h_{1}^{o}, \dots, h_{n}^{o}\}$ represents the integrated utterance-level representations. Each $h_{i}^{o} \in R^{1 \times d}$ is computed by concatenating the maximum, average, and speaker features, followed by a linear transformation:

\begin{aligned} h_{i}^{o} = MLP (max (H_{i}^{w u}) ∥ avg (H_{i}^{w u}) ∥ h_{i}^{s}) . \end{aligned}

(20)

This model architecture effectively captures both local token-level sentiment and global discourse-level context. Algorithm 2 details the flow of the Utterance-level Sentiment Interactor, where $L$ refers to the split sentences, $M$ denotes the global masks, $B$ is the batch size, $N_{s}$ is the maximum number of sentences, and $ℓ$ denotes the sentence length.
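To make Eqs.(18)–(20) concrete, the sketch below builds the reply adjacency matrix, pools each utterance, and applies one graph-convolution layer with a residual connection and layer normalization. Several details are assumptions, since the text does not fix them: the adjacency is treated as symmetric, a single linear layer stands in for the MLP, and the GCN layer uses the common self-loop, symmetric-normalization form with ReLU. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def reply_adjacency(reply_to):
    """Eq. (18): reply_to[i] is the index of the utterance that i replies to, or -1."""
    n = len(reply_to)
    A = np.zeros((n, n))
    for i, j in enumerate(reply_to):
        if j >= 0:
            A[i, j] = A[j, i] = 1.0   # assumption: reply links are undirected
    return A

def pool_utterance(H_wu, h_s, W, b):
    """Eq. (20): max-pool || average-pool || speaker feature, then a linear map."""
    feats = np.concatenate([H_wu.max(axis=0), H_wu.mean(axis=0), h_s])
    return feats @ W + b

def utt_gcn(A, H_o, W_gcn):
    """Eq. (19) with one GCN layer (assumed form), residual connection, and LN."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    deg = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(deg, deg))   # symmetric degree normalization
    return layer_norm(np.maximum(A_norm @ H_o @ W_gcn, 0.0) + H_o)

rng = np.random.default_rng(0)
n, d = 3, 4                       # hypothetical sizes: 3 utterances, hidden size 4
A_utt = reply_adjacency([-1, 0, 0])   # utterances 1 and 2 both reply to utterance 0
W_mlp, b_mlp = rng.normal(size=(3 * d, d)), np.zeros(d)
H_o = np.stack([
    pool_utterance(rng.normal(size=(2 + i, d)), rng.normal(size=d), W_mlp, b_mlp)
    for i in range(n)
])
H_utt = utt_gcn(A_utt, H_o, rng.normal(size=(d, d)))
```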

4.3. Multi-granularity memory integrator

The multi-granularity memory integrator aims to bridge the gap between token-level and utterance-level representations by effectively integrating their contextual information into a unified representation. Given the token-level representations $Q^{t o k}$ and $K^{t o k}$ , as well as the utterance-level representations $Q^{u t t}$ and $K^{u t t}$ , we first compute the attention scores between different granularities:

\begin{aligned} S^{t o k - t o k} = Atten (Θ^{1} Q^{t o k}, Θ^{2} K^{t o k}), \end{aligned}

(21)

\begin{aligned} S^{t o k - u t t} = Atten (Θ^{1} Q^{t o k}, Θ^{3} K^{u t t}), \end{aligned}

(22)

\begin{aligned} S^{u t t - t o k} = Atten (Θ^{4} Q^{u t t}, Θ^{2} K^{t o k}), \end{aligned}

(23)

where $Θ^{1}$, $Θ^{2}$, $Θ^{3}$, and $Θ^{4}$ are learnable transformation matrices. The integrated attention matrix $A^{i t g}$ is then computed as:

\begin{aligned} A^{i t g} = S^{t o k - u t t} \cdot S^{u t t - t o k} + S^{t o k - t o k}, \end{aligned}

(24)

Using $A^{i t g}$, we compute the final integrated representation $H^{i t g}$ by combining the token-level values $V^{t o k}$ with a threshold matrix $M^{t h}$:

\begin{aligned} H^{i t g} = softmax (G A^{i t g} ⊙ M^{t h}) \cdot V^{t o k}, \end{aligned}

(25)

where $⊙$ denotes element-wise multiplication. Finally, the integrated representation is passed through a feed-forward network (FFN) followed by layer normalization to obtain the final unified contextual representation:

\begin{aligned} H^{f} = LN (FFN (H^{i t g}) + H^{i t g}), \end{aligned}

(26)

After that, we pass $H^{f}$ through Eqs.(9)–(14) and an MLP to obtain the final result:

\begin{aligned} v_{i}^{γ} = MLP (h_{i}^{f}) . \end{aligned}

(27)

It is important to note that the ACMU module employed here is instantiated separately from the one in Section 4.1 to distinctly capture features at different granularities.

Algorithm 3 presents the flow of the multi-granularity memory integrator in detail.
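The shape bookkeeping behind Eqs.(21)–(25) can be followed in the NumPy sketch below. It rests on stated assumptions rather than on details given in the text: `Atten` is taken to be an unnormalized scaled dot product, the transformation matrices are applied by right-multiplication to row-vector features, $G$ is treated as a scalar gate, and $M^{t h}$ is applied multiplicatively exactly as written in Eq.(25). The function names are hypothetical.

```python
import numpy as np

def scores(Q, K):
    """Scaled dot-product scores; an assumed stand-in for Atten(., .)."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def integrate(Q_tok, K_tok, V_tok, Q_utt, K_utt, thetas, M_th, G=1.0):
    """Fuse token-level and utterance-level attention into one representation."""
    T1, T2, T3, T4 = thetas
    S_tt = scores(Q_tok @ T1, K_tok @ T2)      # Eq. (21): (tokens, tokens)
    S_tu = scores(Q_tok @ T1, K_utt @ T3)      # Eq. (22): (tokens, utterances)
    S_ut = scores(Q_utt @ T4, K_tok @ T2)      # Eq. (23): (utterances, tokens)
    A_itg = S_tu @ S_ut + S_tt                 # Eq. (24): (tokens, tokens)
    return softmax(G * A_itg * M_th) @ V_tok   # Eq. (25)

rng = np.random.default_rng(0)
n_tok, n_utt, d = 5, 2, 4                      # hypothetical sizes
Q_tok, K_tok, V_tok = (rng.normal(size=(n_tok, d)) for _ in range(3))
Q_utt, K_utt = (rng.normal(size=(n_utt, d)) for _ in range(2))
thetas = [rng.normal(size=(d, d)) for _ in range(4)]

H_itg = integrate(Q_tok, K_tok, V_tok, Q_utt, K_utt, thetas, M_th=np.ones((n_tok, n_tok)))
```

The product $S^{t o k - u t t} \cdot S^{u t t - t o k}$ routes token-to-token attention through the utterance level, which is why its shape matches $S^{t o k - t o k}$ and the two can be summed.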

4.4. Quadruple decoding

Due to the limitations of the PLM, utterances are necessarily encoded in isolation, which may weaken the modeling of conversational discourse. To address this issue, we integrate rotary position embedding (RoPE)³⁴ into token representations. RoPE dynamically encodes the global relative distances between utterances at the dialogue level, providing crucial contextual information. Incorporating such distance information enhances discourse understanding:

\begin{aligned} u_{i}^{r} = R (θ, i) v_{i}^{r}, \end{aligned}

(28)

Utilizing the label-specific embeddings $u_{i}^{r}$, we compute the unary score for each token pair with respect to label $r$:

\begin{aligned} s_{i j}^{r} = (u_{i}^{r})^{T} u_{j}^{r}, \end{aligned}

(29)

where $s_{i j}^{r}$ scores how likely the relation label between $w_{i}$ and $w_{j}$ is $r$. A softmax layer is applied to all elements in each matrix to determine the relation label $r$. For example, the probability of the entity boundary matrix can be obtained as follows:

\begin{aligned} p_{i j}^{e n t} & = Softmax ([s_{i j}^{ϵ_{e n t}}; s_{i j}^{t g t}; s_{i j}^{a s p}; s_{i j}^{o p i}]) . \end{aligned}

(30)
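The decoding scores in Eqs.(28)–(30) can be illustrated with the NumPy sketch below. It uses the standard rotary formulation, in which consecutive feature pairs are rotated by position-dependent angles; the frequency base of 10000, the use of dialogue-level token indices as positions, and all function names are assumptions made for illustration.

```python
import numpy as np

def rope(v, pos, base=10000.0):
    """Rotate consecutive feature pairs of a 1-D vector v by angles pos * theta_m."""
    d = v.shape[-1]                              # must be even
    theta = base ** (-np.arange(0, d, 2) / d)    # one frequency per feature pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(v, dtype=float)
    out[0::2] = v[0::2] * cos - v[1::2] * sin
    out[1::2] = v[0::2] * sin + v[1::2] * cos
    return out

def pair_scores(V_r, positions):
    """Eq. (28) then Eq. (29): rotate each token vector, then take all dot products."""
    U = np.stack([rope(v, p) for v, p in zip(V_r, positions)])
    return U @ U.T

def label_probs(score_mats):
    """Eq. (30): softmax across the label-specific score matrices for each token pair."""
    S = np.stack(score_mats, axis=-1)            # (n, n, num_labels)
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8                                      # hypothetical sizes
positions = np.arange(n)                         # assumed dialogue-level token indices
mats = [pair_scores(rng.normal(size=(n, d)), positions) for _ in range(4)]
p_ent = label_probs(mats)                        # four labels, as in Eq. (30)

# The score depends only on the relative offset between the two positions.
a, b = rng.normal(size=d), rng.normal(size=d)
near = rope(a, 2) @ rope(b, 5)
far = rope(a, 10) @ rope(b, 13)
```

Here `near` and `far` coincide because both pairs are three positions apart, which is the relative-distance property that motivates the use of RoPE in this section.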

Each subtask $k$ is trained with a weighted cross-entropy loss:

\begin{aligned} L_{k} & = - \frac{1}{R \cdot L^{2}} \sum_{g = 1}^{R} \sum_{i = 1}^{L} \sum_{j = 1}^{L} α^{k} y_{i j}^{k} \log (p_{i j}^{k}), \end{aligned}

(31)

where $k \in \{ent, pair, pol\}$ denotes the specific task component, $L$ denotes the cumulative token count per dialogue, and $R$ corresponds to the training dataset size. $y_{i j}^{k}$ represents the ground-truth label, and $p_{i j}^{k}$ is the predicted probability. To address class imbalance, we introduce a tag-wise weighting vector $α^{k}$, which is designed as a learnable parameter to dynamically adjust the contribution of each class during training. The final loss is a weighted sum of the individual losses:

\begin{aligned} L = L_{e n t} + β L_{p a i r} + η L_{p o l} . \end{aligned}

(32)

The loss $L$ balances the contributions of the different subtasks while mitigating class imbalance. Algorithm ?? presents the overall procedure of Memory in detail.
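A minimal NumPy sketch of Eqs.(31)–(32) follows. It assumes that all dialogues are padded to a common length $L$, that labels are one-hot over the tags of a subtask, and that a small constant is added inside the logarithm for numerical safety; none of these details is specified in the text, and the function names are hypothetical.

```python
import numpy as np

def subtask_loss(P, Y, alpha, eps=1e-12):
    """Eq. (31): P and Y have shape (R, L, L, C); alpha has shape (C,)."""
    R, L = P.shape[0], P.shape[1]
    return -np.sum(alpha * Y * np.log(P + eps)) / (R * L ** 2)

def total_loss(l_ent, l_pair, l_pol, beta, eta):
    """Eq. (32): weighted sum of the three subtask losses."""
    return l_ent + beta * l_pair + eta * l_pol

rng = np.random.default_rng(0)
R, L, C = 2, 3, 4                                   # hypothetical sizes
Y = np.eye(C)[rng.integers(0, C, size=(R, L, L))]   # one-hot labels, shape (R, L, L, C)
P = np.full((R, L, L, C), 1.0 / C)                  # uniform predictions
l_uniform = subtask_loss(P, Y, alpha=np.ones(C))
```

With uniform predictions and unit weights, every token pair contributes $\log C$, so `l_uniform` equals $\log 4$; raising an entry of `alpha` increases the penalty for errors on that tag.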

4.5. Model training

The training of the model involves five essential stages, each playing a role in enhancing the performance of the overall architecture.

Step 1 (Encoding): Dialogue samples are first processed to generate the thread $τ_{k}$. The encoder then uses a pre-trained language model subsystem (PLMS) to transform $τ_{k}$ into thread-level representations $H_{k}^{t} = \{H_{1}^{u^{'}}, H_{i}^{u^{'}}, \dots, H_{j}^{u^{'}}\}$, where each $H_{i}^{u^{'}} = \{h_{i}^{c l s}, H_{i}^{u}, h_{i}^{s}\}$. These representations form the basis for later stages.

Step 2 (Contextual Knowledge Memorizer): Syntactic and semantic GCNs extract detailed token-level features, denoted as $H^{s y n}$ and $H^{s e m}$ . Alongside $H_{k}^{t}$ , these features undergo processing via the adaptive contextual memory unit (ACMU), producing $H_{t}$ .

Step 3 (Utterance-level Sentiment Interactor): This module computes token scores $p_{i}^{u}$ for each utterance, selects the top $k$ tokens, and generates a weighted representation $H^{w u}$ using Eq.(17). Further transformation through Eq.(20) yields $h^{o}$. After constructing an adjacency matrix and applying GCNs, the final output $H^{u t t}$ is obtained.

Step 4 (Multi-granularity Memory Integrator): The module integrates $H_{t}$ and $H^{u t t}$ through an attention mechanism, producing $A^{i t g}$. A fusion gate processes this to compute the score $G$, followed by softmax and layer normalization to derive $H^{f}$. To reinforce memory, $H^{f}$ generates $v_{i}^{γ}$ via Eqs.(9)–(14) within the ACMU.

Step 5 (Quadruple Decoding): The final step refines discourse comprehension by processing $v_{i}^{γ}$ through Eq.(28), resulting in $u_{i}^{r}$. Unary scores $s_{i j}^{r}$ are computed for token pairs, softmax is applied, and all quadruples are decoded. The loss $L$ is calculated using Eqs.(31)–(32).

5. Experiment

5.1. Data sets and assessment indicators

Our experimental framework is evaluated using two benchmark datasets: the Chinese dataset (denoted as ZH)¹ and the English dataset (EN).¹ Both datasets are sourced from electronic product reviews, each comprising 1,000 dialogues. Each dialogue contains an average of seven utterances and five unique speakers, providing a rich context for sentiment analysis tasks. ZH contains 5,742 quadruples, while EN contains 5,514. A notable feature of both datasets is that approximately 22% of the quadruples are cross-utterance, meaning their elements are spread over different utterances, which challenges approaches designed for single sentences. A detailed breakdown of the dataset statistics is given in Table 2. Consistent with previous studies, we use Micro-F1 and Iden-F1 as our evaluation metrics. F1 is reported for item detection (T, A, O) and pair detection (T-A, T-O, A-O); for quadruples, Micro-F1 evaluates the whole quadruple (t, a, o, s), while Iden-F1 considers only the triple (t, a, o) without sentiment polarity.
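The difference between the two quadruple metrics can be shown with a short Python sketch. It assumes exact matching of predicted and gold tuples, with counts pooled over all dialogues before computing precision and recall; the function names and the toy tuples are hypothetical and are not taken from the datasets.

```python
def f1_from_sets(pred, gold):
    """Micro-averaged F1 over per-dialogue sets of tuples (exact match)."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_f1(pred, gold):
    """Match on the full quadruple (t, a, o, s)."""
    return f1_from_sets(pred, gold)

def iden_f1(pred, gold):
    """Match on the triple (t, a, o) only, ignoring sentiment polarity."""
    def strip(quads):
        return {q[:3] for q in quads}
    return f1_from_sets([strip(p) for p in pred], [strip(g) for g in gold])

# One toy dialogue: the second prediction has the right triple but the wrong polarity.
gold = [{("phoneA", "battery", "lasts long", "pos"), ("phoneB", "screen", "too dim", "neg")}]
pred = [{("phoneA", "battery", "lasts long", "pos"), ("phoneB", "screen", "too dim", "neu")}]

micro = micro_f1(pred, gold)
iden = iden_f1(pred, gold)
```

In this toy case `micro` is 0.5 and `iden` is 1.0, which shows why Iden-F1 is always at least as high as Micro-F1 for the same predictions.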

Table 2.
Dataset specifications. “dia”, “utt”, and “spk” stand for dialogue, utterance, and speaker, respectively. “tgt”, “asp”, and “opi” refer to target, aspect, and opinion terms. “intra” and “cross” distinguish between intra-utterance and cross-utterance quadruples, where a quadruple is considered cross-utterance if at least two of its components (target, aspect, or opinion) appear in different utterances.

		Dialogue			Items			Pairs			Quadruples
		Dia	Utt	Spk	Tgt	Asp	Opi	Pair$_{t - a}$	Pair$_{t - o}$	Pair$_{a - o}$	Quad	Intra	Cross
ZH	Total	1,000	7,452	4,991	8,308	6,572	7,051	6,041	7,587	5,358	5,742	4,467	1,275
	Train	800	5,947	3,986	6,652	5,220	5,622	4,823	6,062	4,297	4,607	3,594	1,013
	Valid	100	748	502	823	662	724	621	758	538	577	440	137
	Test	100	757	503	833	690	705	597	767	523	558	433	125
EN	Total	1,000	7,452	4,991	8,264	6,434	6,933	5,894	7,432	4,994	5,514	4,287	1,227
	Train	800	5,947	3,986	6,613	5,109	5,523	4,699	5,931	3,989	4,414	3,442	972
	Valid	100	748	502	822	644	719	603	750	509	555	423	132
	Test	100	757	503	829	681	691	592	751	496	545	422	123

5.2. Compared methods

We evaluated our approach against a suite of baseline models to ensure a thorough and fair comparison. The selected baselines include:

CRF-Extract-Classify:³⁵ This paper proposes a new task termed Aspect-Category-Opinion-Sentiment (ACOS) Quadruple Extraction, which aims to extract all aspect-category-opinion-sentiment quadruples from review sentences to support the discovery of implicit aspects and opinions in textual feedback. Furthermore, the authors constructed two new datasets tailored to this task.

SpERT:²⁴ SpERT introduces a model that leverages attention mechanisms to simultaneously extract entities and relations using spans. The framework implements streamlined inference processes over BERT-generated embeddings, enhancing entity detection accuracy and selective filtering mechanisms, while enabling relationship categorization through localized, unannotated contextual patterns. The model employs intra-sentence negative sampling within a single BERT instance, improving span detection and overall performance. This approach has proven highly effective in extracting and classifying entities and their relationships.

ParaPhrase:²⁶ This study introduces a novel approach to paraphrase modeling, accompanied by two datasets that reformulate the Aspect Sentiment Quadruple Prediction (ASQP) task as a unified paraphrase generation process. The proposed model effectively predicts sentiment quadruples by integrating semantic insights derived from natural language tags. Building on this foundation, the authors present a new framework that casts the ASQP task as a generative procedure. This generative perspective offers two key advantages: firstly, it enables an end-to-end solution for ASQP, thereby circumventing potential error propagation inherent in pipeline-based methods. Secondly, it facilitates the comprehensive exploitation of sentiment element semantics by learning to generate these elements as natural language expressions.

Span-ASTE:³⁶ The article proposes a new approach to sentiment analysis that addresses the limitations of word-to-word interactions by processing multi-word targets and opinions through span-to-span interactions. The Span-ASTE model combines the ASTE, ATE and OTE tasks and employs a dual-channel span pruning strategy that reduces computational cost and optimises the pairing of target and opinion candidates.

DiaASQ:¹ In this study, the authors propose for the first time the task of aspect-level sentiment quadruple analysis in dialogue scenarios, annotate a large-scale dataset in both Chinese and English, and provide a baseline model based on word-pair relationship modelling. This fills a gap in opinion mining and fine-grained sentiment analysis for dialogue scenarios and poses a new challenge for research in the field of sentiment analysis.

Overall-QPN:³⁷ This study presents a framework for Conversational Aspect-Based Sentiment Quadruple Analysis (ConASQ). Unlike traditional aspect-based sentiment quadruple analysis, ConASQ requires modeling relationships between utterances within a dialogue context. Previous research typically employs attention mechanisms to model utterance interactions after extracting individual utterance features. However, a single self-attention or transformer layer may fail to capture these interactions effectively. To address this limitation, the authors propose a simple yet efficient approach. Specifically, all utterances are concatenated into a single sequence and processed by a pre-trained model, allowing for comprehensive utterance representation from the outset. Additionally, distinct mask matrices are used to model dialogue threads, speakers, and replies. Finally, a grid-tagging method is applied for quadruple extraction.

H2DT:⁴ This study proposes a model that combines cohesive discourse elements and three-way interaction mechanisms to achieve dialogue sentiment quadruple extraction. Specifically, a heterogeneous graph is constructed to encode reply and speaker information within a dialogue, resulting in a comprehensive discourse structure. Through a triad-based evaluation module, the system optimizes structural alignment and interdependency across quadruple components, strengthening semantic linkage between contextual units. This research lays critical groundwork for advancing fine-grained sentiment analysis in conversational contexts.

DMIN:² This study addresses the limitations of previous work and introduces a novel perspective on DiaASQ to better model discourse structure in dialogues. Specifically, token-level utterance interactions are enhanced at the thread scale, while global discourse information is captured at the utterance level across the dialogue scale, ensuring greater efficiency and a broader contextual scope. Additionally, a novel integrator is proposed to effectively unify data across different granularities, resulting in a comprehensive and cohesive contextual representation.

DMCA:⁵ The authors present a Dialogue-Multiscale Context Aggregator (DMCA) architecture for conversational aspect-based sentiment quadruple analysis. To address technical challenges in processing lengthy dialogues and identifying cross-turn quadruple relations, a hierarchical contextual windowing strategy is implemented to preserve discourse-level semantic continuity. A dual hierarchical attention (DHA) mechanism combined with a phased optimization framework is further developed to calibrate prediction confidence scores based on multiscale contextual representations.

Triple GNN:³ This work presents an innovative Triple Graph Neural Network framework that jointly combines within-turn grammatical features and cross-turn semantic relationships for dialogue aspect-based sentiment quadruple analysis. Utilizing conversational threads, contextual segments are built to capture fine-grained utterance patterns and global discourse information.

IFUSION:³⁸ This work presents IFusionQuad, a sophisticated comprehensive architecture designed for conversational aspect-based sentiment quadruple analysis (DiaASQ). The system improves localized feature awareness, allowing the model to effectively acquire both high- and low-frequency signals. It incorporates novel modules that collaboratively operate to enhance sentiment recognition and analysis in dialogue scenarios. Notably, IFusionQuad tackles key shortcomings of current leading approaches for DiaASQ by providing resolutions to pivotal issues.

These models were chosen based on their established performance and relevance to sentiment analysis tasks.

5.3. Experimental configuration

This study employs two linguistically distinct datasets to validate the proposed method’s cross-linguistic capabilities. The Chinese language component uses Chinese RoBERTa-wwm-ext-base,³⁹ pretrained on a large-scale corpus of smartphone-related conversational data, including user discussions, product reviews, and technical forums. Its 768-dimensional representation can effectively handle Mandarin’s unique characteristics, particularly the semantic ambiguity of individual characters and word segmentation challenges. The whole-word masking strategy proves essential for maintaining contextual integrity in Chinese, where multi-character compounds often carry meanings distinct from their constituent characters.

For English language modeling, we adopt RoBERTa-Large⁴⁰ with 1024-dimensional embeddings, trained primarily on BookCorpus and English Wikipedia with additional domain-specific texts. The increased dimensionality reflects English’s morphological complexity and richer syntactic structures compared to Chinese. Where Mandarin relies on contextual positioning for grammatical relationships, English requires higher-dimensional space to capture its inflectional morphology and grapheme-phoneme variations.

The optimization approach addresses these linguistic differences through tailored learning rates: $1 \times 10^{- 5}$ for BERT parameters maintains pretrained knowledge, while $1 \times 10^{- 4}$ facilitates adaptation to our graph architecture. This dual-rate system accommodates the datasets’ differing information densities, where English tokens contain more syntactic information per unit, while Chinese depends more on contextual relationships. The 2-layer GCN with 0.2 dropout rate provides sufficient regularization, particularly important for the Chinese dataset’s noisier social media content compared to the more standardized English texts.

Training extends for 40 epochs, determined through validation to balance convergence and efficiency. This accounts for the Chinese dataset’s larger sample size (offsetting shorter average sentence length) and the English data’s deeper syntactic structures. All experiments run on NVIDIA 3090 GPUs with 24GB memory, adequately handling both languages’ sequence length variations without excessive padding. The detailed configuration of the experimental setup is presented in Table 3.

Table 3.
Experimental configuration parameters.

Category	Parameter	Value
Model Architecture	Chinese Encoder	RoBERTa-wwm-ext-base (768D)
	English Encoder	RoBERTa-Large (1024D)
Optimization	BERT Learning Rate	$1 \times 10^{- 5}$
	Model Learning Rate	$1 \times 10^{- 4}$
Network	GCN Layers	2
	Dropout Rate	0.2
Training	Epochs	40
Hardware	GPU	NVIDIA RTX3090 (24GB)
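The dual learning-rate setup in Table 3 can be expressed as two parameter groups, as in the plain-Python sketch below. The grouping by a `plm.` name prefix, the parameter names, and the helper `build_param_groups` are hypothetical; the list-of-dictionaries layout follows the parameter-group convention used by common deep learning optimizers, with names standing in for the actual tensors.

```python
# Values taken from Table 3; the key names themselves are illustrative.
CONFIG = {
    "plm_lr": 1e-5,     # learning rate for the pretrained encoder
    "model_lr": 1e-4,   # learning rate for the remaining modules
    "gcn_layers": 2,
    "dropout": 0.2,
    "epochs": 40,
}

def build_param_groups(param_names, plm_prefix="plm."):
    """Split parameters into encoder and non-encoder groups with separate rates."""
    plm = [n for n in param_names if n.startswith(plm_prefix)]
    rest = [n for n in param_names if not n.startswith(plm_prefix)]
    return [
        {"params": plm, "lr": CONFIG["plm_lr"]},
        {"params": rest, "lr": CONFIG["model_lr"]},
    ]

# Hypothetical parameter names for illustration only.
names = ["plm.encoder.layer.0.weight", "gcn.0.weight", "acmu.W_xc"]
groups = build_param_groups(names)
```

Keeping the encoder rate an order of magnitude lower is what preserves the pretrained knowledge while the newly initialized graph and memory modules adapt more quickly.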

5.4. Main result

The experimental results on the two benchmark datasets, Chinese (ZH) and English (EN), demonstrate our model’s superior performance. As shown in Table 4 and Figures 3 and 4, our approach improves on the DiaASQ baseline in the two key quadruple metrics, Micro F1 and Iden F1. Specifically, Micro F1 increases by 10.14 and 6.03 points on the Chinese and English datasets, respectively, and Iden F1 by 13.07 and 5.60 points. Compared with the strongest prior models, our method still leads by 1.71 and 0.33 points in Micro F1 and by 3.60 and 0.21 points in Iden F1 on the respective datasets. These gains underscore the model’s robustness and effectiveness across different linguistic contexts. Moreover, our model excels in cross-utterance analysis, with a standout performance on the most challenging task, Cross-3: it achieves 22.22%, nearly double the baseline’s 12.50%. This result highlights the model’s ability to handle complex, cross-utterance sentiment interactions.
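The reported gains are absolute differences between F1 scores in Table 4, which the following short Python check reproduces. The scores are copied from Table 4; the dictionary layout and the helper `gain` are illustrative only. On both metrics, the strongest prior model is DMIN on ZH and H2DT on EN.

```python
# (Micro F1, Iden F1) pairs copied from Table 4.
TABLE4 = {
    "ZH": {"Memory": (45.08, 50.58), "DiaASQ": (34.94, 37.51), "best_prior": (43.37, 46.98)},
    "EN": {"Memory": (39.34, 42.40), "DiaASQ": (33.31, 36.80), "best_prior": (39.01, 42.19)},
}

def gain(dataset, reference):
    """Absolute improvement of Memory over a reference row, in F1 points."""
    ours, ref = TABLE4[dataset]["Memory"], TABLE4[dataset][reference]
    return tuple(round(o - r, 2) for o, r in zip(ours, ref))

gains_vs_baseline = {ds: gain(ds, "DiaASQ") for ds in TABLE4}
gains_vs_best = {ds: gain(ds, "best_prior") for ds in TABLE4}
```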

Table 4.
Different baseline models and the overall performance of our proposed Memory, where ‘T/A/O’ stands for target/aspect/opinion, respectively.

		Span Match (F1)			Pair Extraction (F1)			Quadruple (F1)
Dataset	Model	T	A	O	T-A	T-O	A-O	Micro.	Iden.
ZH	CRF-Extract-Classify	91.11	75.24	50.06	32.47	26.78	18.90	8.81	9.25
	SpERT	90.69	76.81	38.05	38.05	31.28	21.89	13.00	14.19
	ParaPhrase	/	/	/	37.81	34.32	27.76	23.27	27.98
	Span-ASTE	/	/	/	44.13	34.46	32.21	27.42	30.85
	Overall-QPN	/	/	/	52.86	50.98	53.33	37.77	43.56
	H2DT	91.72	76.93	61.87	50.48	48.80	52.40	40.34	42.81
	IFUSION	91.69	75.90	60.96	54.68	51.81	50.04	41.53	44.56
	DMCA	92.03	77.07	60.27	56.88	51.70	52.80	42.68	45.36
	Triple GNN	/	/	/	/	/	/	42.87	45.43
	DMIN	91.69	77.61	60.58	58.16	51.89	55.80	43.37	46.98
	DiaASQ	90.23	76.94	59.35	48.61	43.31	45.44	34.94	37.51
	Our Memory	91.67	78.38	60.36	62.94	52.65	55.48	45.08	50.58
EN	CRF-Extract-Classify	88.31	71.71	47.90	34.31	20.94	19.21	11.59	12.80
	SpERT	87.82	74.65	54.17	28.33	21.39	23.64	13.07	13.28
	ParaPhrase	/	/	/	37.22	32.19	30.78	24.54	26.76
	Span-ASTE	/	/	/	42.19	30.44	45.90	26.99	28.34
	Overall-QPN	/	/	/	50.70	49.46	50.31	35.37	39.73
	H2DT	88.69	73.81	62.61	48.69	48.84	52.47	39.01	42.19
	IFUSION	88.31	74.23	63.48	52.65	51.82	51.94	35.96	41.49
	DMCA	88.11	73.95	63.47	53.08	50.99	52.40	37.96	41.00
	Triple GNN	/	/	/	/	/	/	38.32	40.07
	DMIN	88.18	73.90	62.31	53.00	51.55	51.24	38.48	41.94
	DiaASQ	88.62	74.71	60.22	47.91	45.58	44.27	33.31	36.80
	Our Memory	89.90	75.05	63.31	56.17	52.44	52.19	39.34	42.40

Figure 3.

Different baseline models and the overall performance of our proposed Memory.

Figure 4.

Different baseline models and the overall performance of our proposed Memory.

Table 5.

Results of the ablation study on the dataset. All metrics are F1 scores, where “Cross-Utt” refers to the model’s Micro F1 score on cross-utterance quadruples.

	ZH			EN
Model	Micro F1	Ident F1	Cross-Utt	Micro F1	Ident F1	Cross-Utt
Our Memory	45.08	50.58	35.00	39.34	42.40	27.48
w/o CK Memorizer	43.21	46.23	34.73	38.62	42.38	25.00
w/o ACMU	42.37	45.94	29.87	37.50	41.07	24.46
w/o US Interactor	43.43	45.90	33.75	37.82	41.46	25.32

5.5. Ablation study

Our ablation study (Table 5) evaluates the contribution of each component in the Memory framework through controlled experiments in which key modules are removed one at a time. The resulting performance changes reveal distinct functional roles and support the architectural design.

w/o CK Memorizer: The Contextual Knowledge Memorizer (CK Memorizer) proves essential for maintaining model performance, particularly for Chinese language processing. Performance metrics show a 1.87% decrease in Micro F1 for ZH and 0.72% for EN when this component is removed, with an even more substantial 4.35% reduction in Iden F1 for Chinese. These findings align with the linguistic characteristics of Mandarin, where contextual dependencies play a more critical role in semantic interpretation compared to English. The memorizer’s capacity to preserve local contextual information significantly benefits cross-utterance analysis tasks.

w/o ACMU: The Adaptive Contextual Memory Unit (ACMU) demonstrates particularly strong effects on model performance. When disabled, we observe a 5.13% performance drop in cross-utterance analysis for Chinese, along with 2.71% and 1.84% decreases in Micro F1 for ZH and EN respectively. These results confirm the unit’s critical function in coordinating information flow across dialogue turns through its gating mechanisms, which maintain coherent sentiment analysis throughout extended conversations.

w/o US Interactor: The Utterance-level Sentiment Interactor contributes to discourse-level pattern recognition, though its impact appears somewhat less pronounced than the ACMU. Performance reductions of 1.65% (ZH) and 1.52% (EN) in Micro F1 scores accompany its removal, with corresponding decreases in cross-utterance task performance. The component shows greater importance for Chinese dialogue processing, likely due to Mandarin’s stronger reliance on discourse markers and speaker dynamics.

Language-specific analysis reveals interesting variations in component importance. The framework shows greater sensitivity to component removal in Chinese processing, particularly regarding the Contextual Knowledge Memorizer and Utterance-level Sentiment Interactor. This aligns with Mandarin’s linguistic properties, including its contextual dependency and use of discourse markers. English processing maintains more consistent performance patterns across ablations. These experimental results support our hypothesis that effective dialogue sentiment analysis requires integrated processing at multiple levels. The Memory framework’s effectiveness stems from the coordinated operation of its components: the Contextual Knowledge Memorizer handling local context, the Utterance-level Sentiment Interactor managing discourse structure, and the ACMU integrating these operations. The system’s consistent performance across languages and dialogue contexts suggests its potential as a robust solution for conversational sentiment analysis.

5.6. Cross-utterance quadruple extraction

In this section, we explicitly assess the model’s capacity to resolve complex dialogue structures characterized by long-range dependencies and frequent context shifts. Figure 5 illustrates the performance capability across varying cross-utterance distances, where the red and blue lines represent the proposed Memory framework and the DMIN baseline, respectively.

Figure 5.

Results Across Various Cross-Utterance Levels.

The “cross-$\geq$3-Utt” subset represents the highest complexity level in the dataset. In these scenarios, sentiment elements are separated by at least three turns, making them highly susceptible to semantic decay and intervening noise from unrelated speakers. Under such rigorous conditions, the DMIN baseline struggles to maintain context, limiting its Micro-F1 to 12.50%. In contrast, the Memory model attains 22.22%, nearly doubling the baseline performance by retaining distinct sentiment cues over long distances.

The impact of our architectural design on solving these complex problems is further evidenced by the yellow line (w/o ACMU). Removing the memory module causes performance on deep cross-utterance samples to drop to 11.76%. This sharp decline confirms that the ACMU is not merely an enhancement but a critical component for overcoming the forgetting problem in complex, multi-turn interactions.

5.7. Case study

This case study examines the model’s performance in extracting sentiment quadruples from a technical product discussion (Figure 6). The dialogue reveals both the framework’s strengths in contextual memory utilization and its challenges with metaphorical language interpretation.

Figure 6.

Some examples of detailed dialog in the DiaASQ English dataset.

In the utterance regarding “11U silicon-oxygen battery technology”, the model accurately identifies the target (11U) and aspect (battery) but misclassifies the sentiment polarity of the opinion phrase “has not moved”. While the phrase literally indicates stasis, its contextual meaning conveys technological stagnation - a negative sentiment the model fails to capture. This error stems from the system’s limited metaphorical comprehension, particularly for technical jargon where literal and contextual meanings diverge.

Conversely, the model demonstrates robust contextual understanding in analyzing comparative statements about battery life. Despite only partially capturing the opinion term (“better” instead of “obviously better”), it correctly identifies the negative sentiment toward the 11U’s battery performance. This success highlights the memory mechanism’s effectiveness in compensating for local information gaps through discourse-level context integration.

The contrast between these two cases reveals key insights: the memory architecture successfully maintains sentiment consistency across utterances when analyzing explicit comparisons, but struggles with implicit negative expressions conveyed through technical metaphors. This limitation points to the need for enhanced metaphorical processing capabilities in future iterations, particularly for domain-specific language.

These findings underscore the framework’s advanced contextual reasoning capabilities while identifying technical metaphor interpretation as a critical area for improvement. The case demonstrates how memory-enhanced models can overcome local information deficiencies, yet also illustrates the ongoing challenges in processing nuanced, domain-specific language constructs.

5.8. Error analysis and long-range dependency

To provide a more interpretable understanding of our memory mechanism’s capabilities and limitations, we conduct a focused analysis based on the specific qualitative cases presented in Table 6 and Figure 6.

Table 6.
A case study comparing the quadruple extraction results between the baseline and our memory framework. The target aspect is implicit in the second clause, requiring contextual reasoning.

When the Memory Mechanism Helps (Resolving Ellipsis and Long-Range Dependency): Table 6 demonstrates a common conversational challenge: cross-utterance long-range dependency. In the dialogue comparing mobile phones, the aspect “taking photo” is introduced early in the context regarding the “12”. As the conversation extends to discuss the “12x” in subsequent turns (partially abbreviated in the table for brevity), the baseline model (DMIN) loses track of this distant context, failing to associate the previously mentioned aspect with the new target. Conversely, our Adaptive Contextual Memory Unit (ACMU) successfully retains the aspect “taking photo” in its active memory state. By treating the dialogue as an interconnected thread, the model bridges this discourse distance, accurately completing the quadruple for “12x”. This confirms the memory module’s effectiveness in tracking ongoing topics across long conversational spans.

When the Memory Mechanism Fails (Error Analysis): However, the detailed dialogue analysis in Figure 6 exposes a limitation regarding deep metaphorical language. When discussing the “11U” battery, the opinion phrase “has not moved” is literally neutral (stasis), but contextually conveys a negative sentiment about technological stagnation. While our framework correctly tracks the target and aspect, it fails to capture this nuanced, domain-specific negative sentiment. This error highlights a crucial boundary: although the memory mechanism efficiently models dialogue structure and tracks entities across turns, the dual GCNs still rely heavily on surface-level syntactic-semantic features. When literal meanings diverge significantly from pragmatic sentiment, the model struggles. Addressing this by integrating external commonsense knowledge or LLM-based semantic reasoning remains a vital direction for our future work.

5.9. Case study on complex scenarios

To visually demonstrate the model’s superiority in handling complex linguistic phenomena (e.g., implicit aspects and comparative inference), we analyze a representative case from the test set, as shown in Table 6.

In this example, the user contrasts two phones (“12” vs. “12x”). The aspect “taking photo” is explicitly mentioned for the first entity but is elliptical (implicit) for the second entity (“the 12x didn’t work”). Standard baselines like DMIN often fail to capture this dependency because the aspect is not syntactically connected to “12x”. However, our Memory framework successfully extracts the correct quadruple. This indicates that the Contextual Knowledge Memorizer effectively retained the semantic focus (“taking photo”) from the preceding clause and the Integrator correctly aligned it with the target “12x”, validating the model’s ability to solve complex context-dependent problems.
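The ellipsis in this case can be made concrete with a toy sketch. The Python snippet below is only an illustration of the problem setting: the `Clause` class and the `carry_over_aspect` rule are hypothetical names introduced here, and the hand-written carry-over rule merely stands in for behaviour that the framework learns; it is not the Contextual Knowledge Memorizer itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clause:
    """One clause of the example; `aspect` is None when it is elided."""
    target: str
    aspect: Optional[str]


def carry_over_aspect(clauses: List[Clause]) -> List[Clause]:
    """Fill each missing aspect with the most recent explicit one.

    A hand-written rule used purely for illustration; the framework learns
    this behaviour and does not apply a fixed heuristic.
    """
    memory: Optional[str] = None
    resolved = []
    for clause in clauses:
        if clause.aspect is not None:
            memory = clause.aspect  # refresh memory on an explicit mention
        resolved.append(Clause(clause.target, clause.aspect or memory))
    return resolved


# Targets and aspect taken from the case discussed above.
example = [Clause("12", "taking photo"), Clause("12x", None)]
for clause in carry_over_aspect(example):
    print(clause.target, "->", clause.aspect)
```

A purely syntactic model sees no aspect attached to “12x”; any approach that solves the case has to retain “taking photo” from the earlier clause in some form of state, which is the role the memory plays here.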

Figure 7.

t-SNE visualization of sentiment representations on the dataset. (a) The baseline DMIN model shows blurred boundaries and overlapping features. (b) Our Memory framework produces well-separated and distinct clusters, indicating superior discriminative power.

5.10. Visualization analysis

To evaluate how our framework learns discriminative features, we map the high-dimensional sentiment representations of the dataset into a two-dimensional space using t-SNE. Figure 7 compares the latent spaces of the state-of-the-art DMIN baseline (Figure 7a) and our Memory framework (Figure 7b).

As shown in Figure 7(a), the baseline model forms recognizable clusters, yet significant overlap persists between sentiment categories. This confusion is most evident at the decision boundary between the Neutral and Negative classes. The overlap indicates that the baseline struggles to separate complex emotional transitions in multi-turn dialogues, which leaves its representations sensitive to contextual noise.

Figure 7(b) shows that the Memory framework learns more stable representations. The Positive, Neutral, and Negative sentiment clusters are denser and better separated from one another. These clearer margins suggest that the Adaptive Contextual Memory mechanism filters out irrelevant information, allowing the model to isolate the semantic features needed for accurate sentiment quadruple extraction.

5.11. Efficiency analysis

The Memory framework combines several architectural components, including dual GCNs, the ACMU, and an LSTM-based interaction layer. This added structural complexity calls for an analysis of computational efficiency, so we compare our model directly against the DMIN baseline. Table 7 lists the total parameter count, average training time per epoch, and GPU memory usage on the ZH dataset.

Table 7.
Efficiency comparison between the strongest baseline (DMIN) and our proposed memory framework on the ZH dataset.

Model	Parameters (M)	Training Time (s/epoch)	Memory Usage (GB)
DMIN	128.0	173	11.0
Ours (Memory)	134.5	180	12.5

Table 7 indicates that the Memory framework adds only a modest computational cost to the base architecture. The parameter count grows by 6.5 M, from 128.0 M to 134.5 M, owing to the gating mechanism and the shared LSTM network inside the interaction layer. Training time increases by 7 s per epoch, from 173 s to 180 s, and GPU memory usage rises from 11.0 GB to 12.5 GB to store additional intermediate activations. These costs are justified by substantial performance gains: the framework almost doubles the Micro-F1 score on difficult cross-utterance cases (Cross-$\geq$3-Utt) relative to the DMIN baseline. Taken together, the efficiency metrics indicate that the system remains practical for real-world deployment while offering better long-range sentiment reasoning for complex dialogue analysis.
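The relative overhead implied by Table 7 can be checked with a few lines of arithmetic. The following Python sketch uses only the values reported in the table; the dictionary layout and the helper name `relative_overhead` are illustrative assumptions rather than part of any released code.

```python
# Values copied from Table 7 (ZH dataset); names below are illustrative only.
TABLE_7 = {
    "DMIN": {"params_m": 128.0, "train_s_per_epoch": 173.0, "gpu_mem_gb": 11.0},
    "Memory": {"params_m": 134.5, "train_s_per_epoch": 180.0, "gpu_mem_gb": 12.5},
}


def relative_overhead(baseline: dict, ours: dict) -> dict:
    """Return the percentage increase of each metric over the baseline."""
    return {
        key: 100.0 * (ours[key] - baseline[key]) / baseline[key]
        for key in baseline
    }


overhead = relative_overhead(TABLE_7["DMIN"], TABLE_7["Memory"])
for metric, pct in overhead.items():
    print(f"{metric}: +{pct:.1f}%")
```

Under these figures, parameters and training time grow by roughly 5% and 4% respectively, while GPU memory grows by roughly 14%, which makes memory the largest relative cost of the added components.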

6. Conclusion

Our memory framework advances dialogue sentiment analysis through effective integration of multi-granularity sentiment information. The proposed method combines local context modeling with global context understanding via two key components: a Contextual Knowledge Memorizer and an Utterance-level Sentiment Interactor. These components collaboratively capture both fine-grained and broad conversational patterns. A hierarchical attention-based Multi-granularity Memory Integrator further bridges token-level and discourse-level information, enabling more accurate sentiment analysis through the synthesis of local and global contexts.

Experimental results demonstrate consistent performance improvements, with our framework outperforming existing methods across both Chinese and English datasets. This robust performance confirms the framework’s effectiveness in handling complex dialogue sentiment tasks. The integration of cross-level sentiment information significantly advances dialogue sentiment analysis.

Future research directions include optimizing memory utilization efficiency and developing advanced information extraction methods. These enhancements will further strengthen the model’s capabilities for aspect-level sentiment analysis in conversational settings.

Footnotes

Acknowledgment

This work is supported by the National Key R&D Program of China (Grant nos. 2023YFB3308601, 2022YFB3104700), the National Natural Science Foundation (Grant nos. 62576287, 62402395), the Chengdu “Open bidding for selecting the best candidates” Science and Technology Project (Grant no. 2023-JB00-00020-GX), the Science and Technology Program of Sichuan Province (Grant no. 2023YFS0424), the Science and Technology Service Network Initiative (Grant no. KFJ-STS-QYZD-2021-21-001), the Talents by Sichuan provincial Party Committee Organization Department, and the Chengdu-Chinese Academy of Sciences Science and Technology Cooperation Fund Project (Major Scientific and Technological Innovation Projects).

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data accessibility

The experimental data employed in this framework is publicly accessible through .

ORCID iDs

Shuoqiu Duan

Xiaoliang Chen

Baiyang Chen

Xiaolin Qin

References

1. Fei, et al. Diaasq: A benchmark of conversational aspect-based sentiment quadruple analysis. In: Findings of the association for computational linguistics: ACL, 2023, pp.13449–13467.
2. Huang, Xiao, et al. Dmin: A discourse-specific multi-granularity integration network for conversational aspect-based sentiment quadruple analysis. In: Findings of the association for computational linguistics ACL 2024, 2024, pp.16326–16338.
3. Jia, et al. Triple gnns: Introducing syntactic and semantic information for conversational aspect-based quadruple sentiment analysis. In: 2024 27th international conference on computer supported cooperative work in design (CSCWD), 2024, pp.998–1003. IEEE.
4. Fei, Liao, et al. Harnessing holistic discourse features and triadic interaction for sentiment quadruple extraction in dialogues. In: Proceedings of the AAAI conference on artificial intelligence, 2024, Vol. 38, pp.18462–18470.
5. Zhang, et al. Dynamic multi-scale context aggregation for conversational aspect-based sentiment quadruple analysis. In: ICASSP 2024-2024 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2024, pp.11241–11245. IEEE.
6. Yin, Wang, Zhang. PoD: Positional dependency-based word embedding for aspect term extraction. In: Proceedings of the 28th international conference on computational linguistics, 2020, pp.1714–1719.
7. Wang, Wen, Zhao, et al. Progressive self-training with discriminator for aspect term extraction. In: Proceedings of the 2021 conference on empirical methods in natural language processing, EMNLP 2021, 2021, pp.257–268.
8. Tulkens, van Cranenburgh. Embarrassingly simple unsupervised aspect extraction. In: Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, 2020, pp.3182–3187.
9. Shi, Wang, et al. A simple and effective self-supervised contrastive learning framework for aspect detection. In: AAAI, 2021, pp.13815–13824.
10. Chen, Feng, et al. Dual graph convolutional networks for aspect-based sentiment analysis. In: ACL-IJCNLP, 2021, pp.6319–6329.
11. Zhou, Liao, Gao, et al. To be closer: Learning to link up aspects with opinions. In: Proceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp.3899–3909.
12. Veyseh APB, Nouri, Dernoncourt, et al. Introducing syntactic structures into target opinion word extraction with deep learning. In: Proceedings of the 2020 conference on empirical methods in natural language processing, EMNLP 2020, 2020, pp.8947–8956.
13. Mensah, Sun, Aletras. An empirical study on leveraging position embeddings for target-oriented opinion words extraction. In: Proceedings of the 2021 conference on empirical methods in natural language processing, EMNLP 2021, 2021, pp.9174–9179.
14. Zhao, Huang, Zhang, et al. Spanmlt: A span-based multi-task learning framework for pair-wise aspect and opinion terms extraction. In: Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, 2020, pp.3239–3248.
15. Ying, Zhao, et al. Grid tagging scheme for aspect-oriented fine-grained opinion extraction. In: Findings of EMNLP, 2020, pp.2576–2585.
16. Fei, Ren, et al. Learn from syntax: Improving pair-wise aspect and opinion terms extraction with rich syntactic knowledge. In: Proceedings of the thirtieth international joint conference on artificial intelligence, IJCAI 2021, 2021, pp.3957–3963.
17. Liang, Meng, Zhang, et al. An iterative multi-knowledge transfer network for aspect-based sentiment analysis. In: Findings of the association for computational linguistics: EMNLP 2021, 2021, pp.1768–1780.
18. Luo, et al. Self question-answering: Aspect-based sentiment analysis by role flipped machine reading comprehension. In: Findings of the association for computational linguistics: EMNLP 2021, 2021, pp.1331–1342.
19. Mao, Shen, et al. A joint training dual-mrc framework for aspect based sentiment analysis. In: AAAI, 2021, pp.13543–13551.
20. Zhang, Deng, et al. Aspect-based sentiment analysis in question answering forums. In: Findings of EMNLP, 2021, pp.4582–4591.
21. Cai, Zhou, et al. Aspect-category based sentiment analysis with hierarchical graph convolutional network. In: Proceedings of the 28th international conference on computational linguistics, COLING 2020, 2020, pp.833–843.
22. Liu, Teng, Cui, et al. Solving aspect category sentiment analysis as a text generation task. In: EMNLP, 2021, pp.4406–4416.
23. Peng, Bing, et al. Knowing what, how and why: A near complete solution for aspect-based sentiment analysis. In: Proceedings of the AAAI conference on artificial intelligence, 2020, Vol. 34, pp.8600–8607.
24. Chia, Bing. Learning span-level interactions for aspect sentiment triplet extraction, 2021.
25. Fei, Ren, Zhang, et al. Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Trans Neural Networks Learn Syst 2021. DOI: 10.1109/TNNLS.2021.3129483.
26. Zhang, Deng, et al. Aspect sentiment quad prediction as paraphrase generation. In: Proceedings of EMNLP, 2021, pp.9209–9219. DOI: 10.18653/v1/2021.emnlp-main.726.
27. Mao, Shen, Yang, et al. Seq2path: Generating sentiment tuples as paths of a tree. In: Findings of ACL, 2022, pp.2215–2225.
28. Bao, Wang, Jiang, et al. Aspect-based sentiment analysis with opinion tree generation. In: IJCAI, 2022, pp.4044–4050.
29. Liu, Dai, et al. Unified structure generation for universal information extraction. In: ACL, 2022, pp.5755–5772.
30. Sukhbaatar, Weston, Fergus, et al. End-to-end memory networks. Adv Neural Inf Process Syst 2015; 28. DOI: 10.48550/arXiv.1503.08895.
31. Liu, Zhu, Bai, et al. A comprehensive survey on long context language modeling. arXiv preprint arXiv:2503.17407, 2025.
32. Yen, Gao, Hou, et al. Helmet: How to evaluate long-context models effectively and thoroughly. In: The Thirteenth international conference on learning representations, 2025.
33. Song, Wang, et al. Memos: An operating system for memory-augmented generation (mag) in large language models. arXiv preprint arXiv:2505.22101, 2025.
34. Ahmed, et al. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024; 568: 127063. DOI: 10.1016/j.neucom.2023.127063.
35. Cai, Xia. Aspect-category-opinion-sentiment quadruple extraction with implicit aspects and opinions. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), 2021, pp.340–350.
36. Eberts, Ulges. Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755, 2019.
37. Cai, Zhao, et al. Improving conversational aspect-based sentiment quadruple analysis with overall modeling. In: CCF International conference on natural language processing and chinese computing, 2021, pp.149–161. Springer.
38. Jiang, Chen, Miao, et al. Ifusionquad: A novel framework for improved aspect-based sentiment quadruple analysis in dialogue contexts with advanced feature integration and contextual cloblock. Expert Syst Appl 2025; 261: 125556. DOI: 10.1016/j.eswa.2024.125556.
39. Cui, Che, Liu, et al. Pre-training with whole word masking for chinese bert. IEEE/ACM TASLP 2021; 29: 3504–3514. DOI: 10.1109/TASLP.2021.3124365.
40. Liu, Ott, Goyal, et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

		Dialogue			Items			Pairs			Quadruples
		Dia	Utt	Spk	Tgt	Asp	Opi	Pair $_{t - a}$	Pair $_{t - o}$	Pair $_{a - o}$	Quad	Intra	Cross
ZH	Total	1,000	7,452	4,991	8,308	6,572	7,051	6,041	7,587	5,358	5,742	4,467	1,275
	Train	800	5,947	3,986	6,652	5,220	5,622	4,823	6,062	4,297	4,607	3,594	1,013
	Valid	100	748	502	823	662	724	621	758	538	577	440	137
	Test	100	757	503	833	690	705	597	767	523	558	433	125
EN	Total	1,000	7,452	4,991	8,264	6,434	6,933	5,894	7,432	4,994	5,514	4,287	1,227
	Train	800	5,947	3,986	6,613	5,109	5,523	4,699	5,931	3,989	4,414	3,442	972
	Valid	100	748	502	822	644	719	603	750	509	555	423	132
	Test	100	757	503	829	681	691	592	751	496	545	422	123
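
The quadruple counts in the dataset statistics above can be sanity-checked with a short script. The Python sketch below copies the Quad, Intra, and Cross columns and assumes that Intra and Cross partition Quad, an assumption the reported counts are consistent with; the variable names are illustrative only.

```python
# Quadruple counts copied from the dataset statistics table above:
# (Quad, Intra, Cross) per split. Variable names are illustrative only.
STATS = {
    "ZH": {"Total": (5742, 4467, 1275), "Train": (4607, 3594, 1013),
           "Valid": (577, 440, 137), "Test": (558, 433, 125)},
    "EN": {"Total": (5514, 4287, 1227), "Train": (4414, 3442, 972),
           "Valid": (555, 423, 132), "Test": (545, 422, 123)},
}

cross_share = {}
for lang, splits in STATS.items():
    for split, (quad, intra, cross) in splits.items():
        # Assumed partition: intra-utterance plus cross-utterance equals all.
        assert intra + cross == quad, (lang, split)
    # The three splits should add up to the reported total.
    parts = [splits[name][0] for name in ("Train", "Valid", "Test")]
    assert sum(parts) == splits["Total"][0], lang
    quad, _, cross = splits["Total"]
    cross_share[lang] = 100.0 * cross / quad
    print(f"{lang}: {cross_share[lang]:.1f}% of quadruples are cross-utterance")
```

All checks pass on the reported counts, and slightly more than one in five quadruples in each language spans multiple utterances, which is the portion of the data where long-range context matters most.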

		Span Match (F1)			Pair Extraction (F1)			Quadruple (F1)
Dataset	Model	T	A	O	T-A	T-O	A-O	Micro.	Iden.
ZH	CRF-Extract-Classify	91.11	75.24	50.06	32.47	26.78	18.90	8.81	9.25
	SpERT	90.69	76.81	38.05	38.05	31.28	21.89	13.00	14.19
	ParaPhrase	/	/	/	37.81	34.32	27.76	23.27	27.98
	Span-ASTE	/	/	/	44.13	34.46	32.21	27.42	30.85
	Overall-QPN	/	/	/	52.86	50.98	53.33	37.77	43.56
	H2DT	91.72	76.93	61.87	50.48	48.80	52.40	40.34	42.81
	IFUSION	91.69	75.90	60.96	54.68	51.81	50.04	41.53	44.56
	DMCA	92.03	77.07	60.27	56.88	51.70	52.80	42.68	45.36
	Triple GNN	/	/	/	/	/	/	42.87	45.43
	DMIN	91.69	77.61	60.58	58.16	51.89	55.80	43.37	46.98
	DiaASQ	90.23	76.94	59.35	48.61	43.31	45.44	34.94	37.51
	Our Memory	91.67	78.38	60.36	62.94	52.65	55.48	45.08	50.58
EN	CRF-Extract-Classify	88.31	71.71	47.90	34.31	20.94	19.21	11.59	12.80
	SpERT	87.82	74.65	54.17	28.33	21.39	23.64	13.07	13.28
	ParaPhrase	/	/	/	37.22	32.19	30.78	24.54	26.76
	Span-ASTE	/	/	/	42.19	30.44	45.90	26.99	28.34
	Overall-QPN	/	/	/	50.70	49.46	50.31	35.37	39.73
	H2DT	88.69	73.81	62.61	48.69	48.84	52.47	39.01	42.19
	IFUSION	88.31	74.23	63.48	52.65	51.82	51.94	35.96	41.49
	DMCA	88.11	73.95	63.47	53.08	50.99	52.40	37.96	41.00
	Triple GNN	/	/	/	/	/	/	38.32	40.07
	DMIN	88.18	73.90	62.31	53.00	51.55	51.24	38.48	41.94
	DiaASQ	88.62	74.71	60.22	47.91	45.58	44.27	33.31	36.80
	Our Memory	89.90	75.05	63.31	56.17	52.44	52.19	39.34	42.40
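
The headline improvements can be recomputed directly from the quadruple-level columns of the results table above. The Python sketch below copies only the DiaASQ, DMIN, and Memory rows; the `gain` helper and the dictionary layout are illustrative assumptions.

```python
# Quadruple-level F1 scores (Micro, Iden) copied from the results table above.
# Only the three rows needed for the comparison are included.
RESULTS = {
    "ZH": {"DiaASQ": (34.94, 37.51), "DMIN": (43.37, 46.98), "Memory": (45.08, 50.58)},
    "EN": {"DiaASQ": (33.31, 36.80), "DMIN": (38.48, 41.94), "Memory": (39.34, 42.40)},
}


def gain(lang: str, baseline: str) -> tuple:
    """Absolute F1 gain (in points) of Memory over `baseline`: (Micro, Iden)."""
    ours, base = RESULTS[lang]["Memory"], RESULTS[lang][baseline]
    return tuple(round(o - b, 2) for o, b in zip(ours, base))


for lang in RESULTS:
    for baseline in ("DiaASQ", "DMIN"):
        micro, iden = gain(lang, baseline)
        print(f"{lang} vs {baseline}: Micro +{micro:.2f}, Iden +{iden:.2f}")
```

The gains over DiaASQ (10.14 and 6.03 Micro-F1 points, 13.07 and 5.60 Iden-F1 points on ZH and EN) match the figures quoted in the abstract, which indicates that those improvements are absolute F1 points measured against the DiaASQ baseline. Against the stronger DMIN baseline the margins are smaller: 1.71 and 3.60 points on ZH, and 0.86 and 0.46 points on EN.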
