A survey of heterogeneous graph neural networks for cybersecurity anomaly detection

Abstract

Anomaly detection is a critical task in cybersecurity, where identifying insider threats, access violations, and coordinated attacks is essential for ensuring system resilience. Graph-based approaches have become increasingly important for modeling entity interactions, yet most rely on homogeneous and static structures, which limit their ability to capture the heterogeneity and temporal evolution of real-world environments. Heterogeneous graph neural networks (HGNNs) have emerged as a promising paradigm for anomaly detection by incorporating type-aware transformations and relation-sensitive aggregation, enabling more expressive modeling of complex cyber data. However, current research on HGNN-based anomaly detection remains fragmented, with diverse modeling strategies, limited comparative evaluation, and an absence of standardized benchmarks. To address this gap, we provide a comprehensive survey of HGNN-based anomaly detection methods in cybersecurity. We introduce a taxonomy that classifies approaches by anomaly type and graph dynamics, analyze representative models, and map them to key cybersecurity applications. We also review commonly used benchmark datasets and evaluation metrics, highlighting their strengths and limitations. Finally, we identify key open challenges related to modeling, data, and deployment, and outline promising directions for future research. This survey aims to establish a structured foundation for advancing HGNN-based anomaly detection toward scalable, interpretable, and practically deployable solutions.

Keywords

Heterogeneous graph neural networks; anomaly detection; cybersecurity applications; temporal modeling; graph representation learning

1. Introduction

Anomaly detection is fundamental for maintaining the security and resilience of cyber systems. In cybersecurity, anomalies often correspond to access violations, insider threats, or coordinated attacks that deviate from expected system behavior. As digital environments grow in complexity, traditional flat-feature modeling becomes insufficient. Instead, graph-structured data has emerged as a natural and powerful way to represent relational interactions among entities such as users, hosts, files, and processes.¹

Cybersecurity graphs are typically heterogeneous and dynamic, consisting of multiple entity types connected via diverse relations that evolve over time.² These structural and temporal complexities present significant challenges for anomaly detection models. While graph neural networks (GNNs) have become a standard approach for learning from graph data, most existing methods assume homogeneous and static structures.³ This limits their effectiveness in real-world cyber scenarios, where capturing semantic heterogeneity and behavioral evolution is essential. To address this gap, heterogeneous graph neural networks (HGNNs) introduce type-aware transformations and relation-sensitive aggregation mechanisms,⁴ enabling more accurate modeling of the underlying semantics in such multi-typed environments. Although HGNNs have been increasingly applied to anomaly detection in cybersecurity, the field remains fragmented, with diverse modeling strategies and limited comparative analysis, highlighting the need for a structured survey.

Moreover, most existing surveys^5–8 focus on homogeneous or static graphs, offering limited insight into models tailored for heterogeneous and dynamic graphs. In this survey, we bridge this gap by reviewing recent advances in HGNN-based anomaly detection methods, with a particular emphasis on their applicability to cybersecurity. Recent cybersecurity-specific graph anomaly detection studies further underscore the pace of change in this area, including self-supervised intrusion detection systems such as Anomal-E⁹ and TS-IDS,¹⁰ graph-based log anomaly detection frameworks with explicit explanation components,¹¹ interpretable spatio-temporal log models such as IST-GCN,¹² and 2025 intrusion-detection extensions including BS-GAT,¹³ TE-G-SAGE,¹⁴ and real-world Internet of Things (IoT) communication anomaly detection with graph learning.¹⁵ A 2025 systematic review of GNN-based malicious attack detection likewise confirms the rapid expansion and diversification of this literature across network, IoT, and web-security settings.¹⁶

At the same time, evaluating HGNN-based anomaly detection models remains a significant challenge. Existing studies often use different datasets, inconsistent evaluation metrics, and non-standardized pre-processing pipelines, making direct comparison between models difficult. Furthermore, benchmark datasets vary widely in heterogeneity, temporal scope, and label quality, which affects the reliability and reproducibility of results. Addressing this gap requires a clearer understanding of the current evaluation landscape, including what metrics and benchmarks are commonly used, how they differ, and where improvements are needed.

Despite the growing number of HGNN-based anomaly detection models, several unresolved challenges hinder their broader adoption in cybersecurity. Many approaches remain limited to static settings and struggle to capture the evolving nature of attacks and user behaviors. Data scarcity, class imbalance, and inconsistent labeling further restrict their robustness and generalizability. Moreover, interpretability and scalability remain open issues, as most models provide little explanation for detected anomalies and are computationally expensive to deploy at scale. Identifying and addressing these challenges is essential for guiding future research toward practical, interpretable, and adaptive HGNN-based solutions.

The main contributions of this survey are as follows:

Comprehensive categorization: We review and classify existing HGNN-based anomaly detection methods by anomaly type (node, edge, and subgraph-level) and by graph dynamics (static vs. dynamic graphs).

Novel taxonomy: We propose a unified taxonomy that captures how modeling strategies address structural and temporal heterogeneity, serving as a framework for organizing existing methods and identifying research gaps.

Application mapping: We review four key cybersecurity domains, namely insider threat detection, network intrusion detection, fraud detection in access logs, and advanced persistent threats (APTs), and analyze how HGNN models address the unique challenges and requirements of each setting.

Evaluation landscape: We consolidate commonly used metrics and benchmark datasets, highlighting their applicability, limitations, and the need for standardized protocols.

Open challenges: We identify limitations in current approaches and outline promising directions for future HGNN-based anomaly detection research.

The remainder of this survey is organized as follows. Section 2 reviews background on heterogeneous and dynamic graphs in cybersecurity. Section 3 presents the proposed taxonomy with a detailed discussion of node, edge, and subgraph-level approaches. Section 4 examines major application domains. Section 5 summarizes evaluation metrics and benchmark datasets. Section 6 outlines open challenges and future directions, and Section 7 concludes the survey.

2. Background

Graphs are a fundamental data structure for representing entities and their relationships in cybersecurity, where interactions among users, hosts, files, and network components form complex and interdependent systems. In a graph, nodes represent entities and edges represent interactions or dependencies. Traditional homogeneous graphs assume a single type of node and relation, which limits their expressiveness in real-world scenarios. In contrast, heterogeneous graphs support multiple node and edge types, capturing rich semantic information essential for modeling diverse cybersecurity behaviors. These graph structures serve as the foundation for graph-based machine learning, particularly GNNs, which have been extended to operate on heterogeneous and dynamic graphs.

2.1. Graph representations

A graph is a data structure formally defined as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges that represent relationships between node pairs.¹⁷

2.1.1. Homogeneous graphs

Homogeneous graphs¹⁸ are those where all nodes and edges belong to a single type and share the same features and label space. These graphs are widely used in early GNN literature and are often applied to settings such as citation networks or communication graphs, where relationships are uniform across the network. An example is an IP communication graph, where nodes represent IP addresses and edges represent undifferentiated communication links, as shown in Figure 1(a). The uniform structure simplifies both modeling and computation, making homogeneous graphs a foundational case in graph learning research.

Figure 1.

Comparison of homogeneous and heterogeneous graph structures. Panel (a) shows a single-type graph with uniform nodes and edges, while panel (b) shows a heterogeneous cyber graph with multiple entity and relation types.

2.1.2. Heterogeneous graphs

In contrast, heterogeneous graphs^18,19 involve multiple types of nodes and/or edges. A schema $S = (T_{V}, T_{E})$ defines the set of valid node types $T_{V}$ and edge types $T_{E}$, outlining the semantic structure of the graph. Real-world cybersecurity systems often involve diverse entities (e.g., users, hosts, files) and diverse relationships (e.g., login, file access). These systems are naturally modeled as heterogeneous graphs that incorporate multiple node and edge types, enabling more expressive modeling of complex cyber data. Figure 1(b) illustrates this idea through a simple cyber interaction schema in which different entity categories and relation types are represented explicitly, capturing the semantic diversity that HGNNs are designed to exploit.

2.1.3. Graph schema

A graph schema defines the semantic structure of a heterogeneous graph. Formally, a schema $S = (T_{V}, T_{E})$ consists of a set of node types $T_{V}$ and a set of edge types $T_{E} \subseteq T_{V} \times R \times T_{V}$, where $R$ is the set of relation types.²⁰ These definitions provide a blueprint for how entities interact and are essential for designing HGNNs that can leverage semantic context. Schema-aware models help distinguish between normal and abnormal behaviors based on how users, systems, and files interact in multi-typed environments.
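As a minimal sketch of how a schema can constrain valid interactions, the following Python snippet encodes $T_{E}$ as a set of (source type, relation, target type) triples and flags edges that fall outside it. The entity names, relation names, and the helper `violates_schema` are hypothetical and chosen only for illustration; they do not come from any surveyed model.

```python
# Hypothetical schema for a small cyber graph, following S = (T_V, T_E).
NODE_TYPES = {"user", "host", "file"}
EDGE_TYPES = {
    ("user", "logs_into", "host"),
    ("user", "accesses", "file"),
    ("host", "stores", "file"),
}


def violates_schema(edge, node_type):
    """Return True if the typed edge (u, relation, v) is not in EDGE_TYPES."""
    u, relation, v = edge
    return (node_type[u], relation, node_type[v]) not in EDGE_TYPES


# Illustrative node typing and edges (all names are made up).
node_type = {"alice": "user", "srv1": "host", "report.doc": "file"}
edges = [
    ("alice", "logs_into", "srv1"),
    ("srv1", "stores", "report.doc"),
    ("report.doc", "logs_into", "srv1"),  # a file cannot log into a host
]

invalid = [e for e in edges if violates_schema(e, node_type)]
```

A check of this kind only catches interactions that the schema forbids outright. Edges that are schema-valid but unusual still require a learned model.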

2.2. Graph neural networks: basics

A fundamental concept in GNNs is the node embedding, which maps each node to a vector in a continuous feature space. Nodes with similar structural roles or semantic attributes are positioned close to each other, enabling graph-structured data to be processed by standard machine learning models while preserving relational patterns. In heterogeneous graphs, embeddings must additionally encode type-specific information, ensuring that differences in node categories and relation types are reflected in the learned representations. To update these embeddings, GNNs employ a message passing mechanism. At each layer $l$, the representation of a node $v$ is updated by aggregating information from its neighbors:

$$h_{v}^{(l)} = \mathrm{UPDATE}^{(l)}\left(h_{v}^{(l-1)},\ \mathrm{AGGREGATE}^{(l)}\left(\{h_{u}^{(l-1)} : u \in N(v)\}\right)\right). \tag{1}$$

Here, $N(v)$ denotes the neighbors of $v$. The aggregation step typically uses summation, mean, or attention-weighted pooling, which ensures permutation invariance with respect to neighbor ordering, while the update step fuses the aggregated information with the node’s previous representation.
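To make Equation (1) concrete, the sketch below implements one message passing layer in plain Python, using mean aggregation and a fixed weighted sum as the update step. The scalar weights `w_self` and `w_neigh` are assumptions that stand in for the learned parameters of a real GNN layer, and the toy graph is invented for illustration.

```python
def mean_aggregate(vectors, dim):
    """Permutation-invariant AGGREGATE: element-wise mean of neighbor vectors."""
    if not vectors:
        return [0.0] * dim
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


def message_passing_layer(h, neighbors, w_self=0.5, w_neigh=0.5):
    """One layer of Eq. (1) with mean aggregation and a fixed linear UPDATE.

    h: dict mapping node -> list of floats (representations from layer l-1)
    neighbors: dict mapping node -> list of neighbor nodes N(v)
    w_self, w_neigh: hypothetical scalars standing in for learned weights
    """
    dim = len(next(iter(h.values())))
    new_h = {}
    for v in h:
        agg = mean_aggregate([h[u] for u in neighbors.get(v, [])], dim)
        new_h[v] = [w_self * h[v][k] + w_neigh * agg[k] for k in range(dim)]
    return new_h


# Toy graph with three nodes and two-dimensional features.
h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nbrs = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
h1 = message_passing_layer(h0, nbrs)
```

Because the mean does not depend on the order of its inputs, listing the neighbors of `"a"` as `["c", "b"]` yields the same updated representation.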

2.3. Dynamic graph concepts

Dynamic graphs extend the conventional static graph representation by incorporating temporal information into nodes, edges, and their attributes.²¹ While static graphs capture a single snapshot of relationships, temporal graphs describe how these relationships evolve over time, enabling the modeling of interaction sequences, evolution patterns, and time-dependent dependencies. This temporal perspective is particularly important when analyzing environments in which entity interactions are inherently time-varying.

A critical phenomenon in dynamic graph data is concept drift,²² where the statistical properties of the graph structure or associated features change over time. Such drift may arise from shifts in connectivity patterns, evolving behavioral trends, or modifications in attribute distributions. If unaccounted for, concept drift can degrade the performance of models trained on historical data, as patterns that were once indicative of normal or abnormal behavior may no longer hold. In the cybersecurity domain, dynamic graph modeling is essential for capturing the continuously changing nature of both benign and malicious activities. User behaviors, network configurations, and attack strategies are not static. They adapt in response to new technologies, security measures, and adversarial tactics.²³ By representing these evolving interactions—such as user-device access patterns, file transfers, or communication flows—temporal graph analysis enables the detection of emerging anomalies and facilitates proactive threat mitigation.²⁴ This capability goes beyond the limitations of static graph approaches, which may overlook subtle yet critical temporal signals in security-sensitive environments.
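One simple way to quantify structural drift between two snapshots is to compare their degree distributions. The sketch below uses total variation distance for this purpose. This choice of statistic is an assumption made for illustration, not a method taken from the cited works, and the snapshot contents are invented.

```python
from collections import Counter


def degree_distribution(edges):
    """Empirical distribution of node degrees for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(d, 0.0) - q.get(d, 0.0)) for d in support)


# Two hypothetical snapshots of user-host interactions.
snapshot_t1 = [("u1", "h1"), ("u2", "h1"), ("u3", "h2")]
snapshot_t2 = [("u1", "h1"), ("u1", "h2"), ("u1", "h3"), ("u2", "h1")]

drift = total_variation(
    degree_distribution(snapshot_t1), degree_distribution(snapshot_t2)
)
```

A value near 0 suggests a stable connectivity profile, while larger values indicate that a model trained on the earlier snapshot may need recalibration.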

2.4. Types of anomalies in graph-structured data

Anomalies in graph data refer to nodes, edges, or subgraphs whose behaviors significantly deviate from patterns learned from the global topology, local neighborhood context, or temporal evolution.²⁵ These deviations often correspond to critical events such as coordinated attacks, fraud, misconfiguration, or system failures, particularly in domains such as cybersecurity, financial systems, and industrial monitoring. In heterogeneous and dynamic graphs, anomaly detection becomes more challenging, as irregularities may emerge from multi-relational semantics or shift over time.²⁶

Classical anomaly detection frameworks have traditionally grouped anomalies into three categories: point, contextual, and collective, depending on whether the deviation occurs independently, within a specific context, or through a set of related instances.^27,28 Although conceptually comprehensive, these categories do not align directly with the structural components of graph data. Following recent surveys,^6,7,25 this paper instead adopts a graph-centric taxonomy that categorizes anomalies according to the type of graph element affected.

In this view, node-level anomalies describe individual nodes whose attributes, connectivity, or temporal behaviors deviate from normal patterns.^25,29 Edge-level anomalies capture irregular or suspicious interactions between nodes, such as unexpected or unauthorized links that violate relational norms.²⁷ Subgraph-level anomalies involve groups of nodes and edges that collectively exhibit abnormal topology, density, or semantics, often reflecting coordinated or covert behavior.^25,27 This taxonomy captures the full spectrum of structural irregularities observed in heterogeneous and dynamic graphs and provides a unified conceptual basis for designing and evaluating graph-based anomaly detection models.

2.4.1. Node anomalies

Node anomalies refer to individual nodes whose structural positioning, interaction patterns, or temporal behaviors deviate from the expected norms of a graph.²⁹ These deviations may be structural (e.g., a user account with unusually low connectivity compared to peers), attribute-based (e.g., a host machine with atypical system configurations), or temporal (e.g., an employee suddenly shifting access to critical servers after months of stable behavior). In cybersecurity datasets such as CERT³⁰ and UNSW-NB15,³¹ such anomalies often correspond to insider accounts, compromised devices, or misconfigured hosts that diverge from expected interaction patterns.

To detect node anomalies in static graphs, one common approach is to examine irregularities in local connectivity or interaction density. As shown in Figure 2(a), the anomalous node appears structurally distinct from the rest of the graph because its connectivity pattern differs from that of surrounding nodes. This type of structural outlier can indicate unusual behavior, misconfiguration, or compromised activity. However, static graphs provide only a snapshot view, ignoring the temporal evolution of interactions. As a result, short-lived anomalies may disappear in aggregation, and gradual behavioral shifts may remain undetected. These limitations highlight the need for dynamic graph modeling, where temporal information is explicitly incorporated into anomaly detection.
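The idea of flagging nodes by irregular local connectivity can be sketched with a degree z-score. This is a deliberately simple baseline rather than an HGNN method; the threshold of 2.0 and the toy adjacency list are assumptions.

```python
from statistics import mean, pstdev


def degree_zscores(adjacency):
    """Z-score of each node's degree relative to all nodes in the graph."""
    degrees = {v: len(nbrs) for v, nbrs in adjacency.items()}
    values = list(degrees.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All nodes have the same degree, so none stands out.
        return {v: 0.0 for v in degrees}
    return {v: (d - mu) / sigma for v, d in degrees.items()}


# Hypothetical static graph: node "x" touches far more nodes than its peers.
adjacency = {
    "x": ["a", "b", "c", "d", "e", "f"],
    "a": ["x"], "b": ["x"], "c": ["x"],
    "d": ["x"], "e": ["x"], "f": ["x"],
}

scores = degree_zscores(adjacency)
flagged = [v for v, z in scores.items() if abs(z) > 2.0]
```

A degree-only score ignores node and edge types, which is one reason type-aware models are preferred in heterogeneous settings.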

Figure 2.

Node anomalies in static (a) and dynamic (b) graphs. Panel (a) illustrates a structurally atypical node within a static graph, while panel (b) shows a node whose connectivity pattern changes over time and becomes anomalous at $t_{2}$.

In dynamic graphs, node-level anomalies can manifest through temporal inconsistencies, such as changes in access patterns or community affiliation. In Figure 2(b), a node that appears normal at time $t_{1}$ becomes anomalous at $t_{2}$ because its interaction pattern changes relative to the surrounding graph. This behavioral drift may indicate a role transition, lateral movement, or policy violation—common in stealthy attacks or insider threats. Capturing context-sensitive anomalies is challenging for static models, which rely on a fixed graph structure and cannot account for temporal or semantic shifts.²¹ In contrast, HGNN-based approaches offer enhanced capabilities for detecting complex and evolving node behaviors in heterogeneous systems by incorporating temporal reasoning, type-aware message passing, and schema-guided evolution modeling.^4,19
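A change in a node's interaction pattern between $t_{1}$ and $t_{2}$ can be approximated by comparing its neighbor sets in the two snapshots. The sketch below uses Jaccard distance, which is an illustrative choice rather than the scoring rule of any surveyed model; the account and resource names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def neighborhood_shift(nbrs_t1, nbrs_t2):
    """Per-node Jaccard distance between neighbor sets at t1 and t2."""
    nodes = set(nbrs_t1) | set(nbrs_t2)
    return {
        v: jaccard_distance(set(nbrs_t1.get(v, ())), set(nbrs_t2.get(v, ())))
        for v in nodes
    }


# Hypothetical access patterns: alice starts touching admin resources at t2.
nbrs_t1 = {"alice": ["ws1", "fileshare"], "bob": ["ws2", "fileshare"]}
nbrs_t2 = {"alice": ["ws1", "db_admin", "backup_srv"], "bob": ["ws2", "fileshare"]}

shift = neighborhood_shift(nbrs_t1, nbrs_t2)
```

Here `alice` receives a high shift score while `bob` stays at zero, mirroring the node that becomes anomalous only at $t_{2}$.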

2.4.2. Edge anomalies

Edge anomalies describe irregular interactions between nodes that diverge from expected structural, semantic, or temporal patterns.³² These deviations are particularly challenging to detect in heterogeneous graphs, where node and edge types constrain valid relationships. In static graphs, anomalous edges often arise when a connection violates known access policies or interaction norms. As shown in Figure 3(a), one highlighted edge links an otherwise typical user-device pattern to a suspicious target, making the irregularity visible at the relation level rather than the node level alone. In cybersecurity contexts, such an anomalous edge may correspond to unauthorized access, privilege escalation, or misuse of restricted resources.

Figure 3.

Edge anomalies in static (a) and dynamic (b) graphs. Panel (a) highlights an irregular edge in an otherwise normal interaction pattern, while panel (b) shows a new suspicious edge appearing over time at $t_{2}$.

In dynamic graphs, edge anomalies are characterized by abrupt changes in connectivity over time.³² In Figure 3(b), the interaction pattern at $t_{1}$ appears routine, whereas at $t_{2}$ a new suspicious edge emerges. Such changes qualify as edge anomalies because the irregularity arises from the appearance of a new connection that deviates from historical or role-consistent patterns, rather than from the intrinsic properties of the node itself. In cybersecurity contexts, these anomalous edges may signal privilege escalation or lateral movement, where attackers establish unauthorized links to new devices or systems. Detecting these anomalies requires sensitivity to both relational semantics and temporal evolution. HGNN-based models address this by learning edge-aware representations, capturing interaction dynamics across time, and incorporating context through attention-based aggregation mechanisms.^18,33
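The notion of a new connection that deviates from historical or role-consistent patterns can be sketched as a set comparison between snapshots. The function below flags edges that first appear at $t_{2}$ and whose (source role, target role) pair never occurred at $t_{1}$. The role labels and the helper name are hypothetical, and real systems would use learned edge representations rather than exact matching.

```python
def suspicious_new_edges(edges_t1, edges_t2, role):
    """New edges at t2 whose (source role, target role) pair is unseen at t1."""
    seen_edges = set(edges_t1)
    seen_pairs = {(role[u], role[v]) for u, v in edges_t1}
    return [
        (u, v)
        for u, v in edges_t2
        if (u, v) not in seen_edges and (role[u], role[v]) not in seen_pairs
    ]


# Hypothetical roles and interactions.
role = {
    "alice": "analyst", "bob": "admin",
    "ws1": "workstation", "ws2": "workstation", "dc1": "domain_controller",
}
edges_t1 = [("alice", "ws1"), ("bob", "dc1"), ("bob", "ws2")]
edges_t2 = [("alice", "ws1"), ("alice", "ws2"), ("alice", "dc1"), ("bob", "dc1")]

alerts = suspicious_new_edges(edges_t1, edges_t2, role)
```

The new edge from `alice` to `ws2` is not flagged because analysts already connect to workstations, whereas the edge to the domain controller has no historical precedent for that role.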

2.4.3. Subgraph anomalies

Subgraph anomalies arise when a group of nodes and edges collectively exhibit behavior that deviates from the structural or semantic norms of the overall graph.³² Unlike node or edge anomalies, which are localized to individual elements, subgraph anomalies involve the topology, density, or temporal dynamics of a subset, often reflecting coordinated or covert behavior. In static graphs, such anomalies are typically characterized by irregular internal connectivity, motif structures, or semantic outliers.³⁴ As shown in Figure 4(a), a compact highlighted region forms a dense local structure that stands apart from the broader graph topology. This kind of isolated dense subgraph may indicate covert collaboration, unauthorized exchanges, or coordinated malicious behavior.

Recent work on subgraph anomaly detection over dynamic graphs highlights that such irregular substructures are critical indicators of coordinated attacks and evolving malicious behaviors.³⁵ In dynamic graphs, subgraph anomalies can emerge through temporal densification or behavioral shifts.^36,37 As illustrated in Figure 4(b), a relatively sparse subgraph at $t_{1}$ becomes much denser at $t_{2}$. This sudden increase in internal connectivity suggests coordinated activity that may signal collusion or synchronized access behavior. Detecting subgraph anomalies requires reasoning over higher-order structures and capturing temporal progression. HGNN-based approaches support this by enabling subgraph-level representations, modeling group semantics, and incorporating dynamic context to identify complex, evolving group behaviors.^37–39
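Temporal densification of a candidate group can be measured with the standard density of its induced subgraph, $2m / (n(n-1))$ for $n$ nodes and $m$ internal undirected edges. The sketch below compares this value at two time steps; the group membership and edge lists are invented for illustration.

```python
def induced_density(nodes, edges):
    """Density 2m / (n (n - 1)) of the undirected subgraph induced by nodes."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Deduplicate edges regardless of direction and ignore self-loops.
    induced = {
        frozenset((u, v))
        for u, v in edges
        if u in nodes and v in nodes and u != v
    }
    return 2 * len(induced) / (n * (n - 1))


# Hypothetical group of four accounts observed at two time steps.
group = ["a", "b", "c", "d"]
edges_t1 = [("a", "b"), ("a", "z")]
edges_t2 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d"), ("a", "z")]

density_t1 = induced_density(group, edges_t1)
density_t2 = induced_density(group, edges_t2)
```

A sharp rise in density between the two steps is the kind of signal that would prompt closer inspection of the group, although density alone cannot distinguish benign collaboration from collusion.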

Figure 4.

Subgraph anomalies in static (a) and dynamic (b) graphs. Panel (a) shows an anomalous dense local structure embedded within a larger graph, while panel (b) shows a subgraph that becomes substantially denser over time and indicates coordinated behavior at $t_{2}$.

Node, edge, and subgraph anomalies represent conceptually distinct categories, yet they frequently co-occur in real-world graphs. An anomalous node may simultaneously belong to an irregular substructure and participate in unexpected interactions. The perception of anomalies is often context-dependent. A pattern that appears irregular in one region of a graph or at a particular time may be entirely typical in another. This complexity increases in heterogeneous graphs, where multiple node and relation types introduce rich semantic dependencies.⁴ Detecting such complex and evolving patterns requires models that account for structural context, semantic meaning, and temporal progression. These challenges have led to the development of graph neural frameworks that incorporate heterogeneous modeling and dynamic reasoning,^1,25,26,34 which we categorize in the next section.

3. Taxonomy of HGNN-based anomaly detection

To provide a unified view of existing methods, we propose a taxonomy that categorizes HGNN-based anomaly detection approaches according to the graph element they target as anomalous: node-level, edge-level, and subgraph-level. This perspective emphasizes the structural granularity of the anomaly, which fundamentally shapes both the modeling formulation and the evaluation protocol. Table 1 presents the proposed taxonomy and maps representative HGNN-based anomaly detection models to each category. Unlike prior surveys that group methods by architectural choices or learning paradigms, our classification organizes the field from a task-driven perspective. This highlights common challenges and solution patterns that recur within each anomaly granularity and provides a consistent framework for comparison across methods. By structuring the discussion in this way, we enable a clearer understanding of how detection objectives influence the design of HGNN architectures, and how these designs adapt to heterogeneous and potentially dynamic graph settings. Subsections 3.1–3.3 elaborate on each category in detail, discussing representative methods, their core mechanisms, and the specific challenges they address.

Table 1.
Taxonomy of HGNN-based anomaly detection methods organized by anomaly granularity (node-level, edge-level, subgraph-level) and representative modeling paradigms.

Paradigm	Node-level	Edge-level	Subgraph-level
Contrastive	GraphCAD,⁴⁰ HeCo⁴¹	–	–
Autoencoding	DOMINANT,²⁵ SpecAE⁴²	AER-AD,⁴³ eFraudCom⁴⁴	OCGNN⁴⁵
Attention	ALARM,⁴⁶ GraphConsis⁴⁷	StrGNN³²	HON-GAT⁴⁸
Temporal	GDN,⁴⁹ OCAN,⁵⁰ GCAN,⁵¹ TADDY,⁵² HRGCN,⁵³ XG-NID⁵⁴	AddGraph,⁵⁵ DynAD,⁵⁶ Bi-GCN⁵⁷	ST-GCAE,⁵⁸ TGBULLY⁵⁹
Metapath	HeCo⁴¹	–	SubAnom,³⁵ mHGNN⁶⁰
Hybrid/motif	–	–	MatchGNet,⁶¹ GraphRfi⁶²
One-class	–	–	OCGATL,⁶³ OCGNN⁴⁵
Distillation	–	–	GLocalKD⁶⁴
Structural	–	Hierarchical-GCN⁶⁵	MatchGNet⁶¹
Adversarial	–	AANE⁶⁶	–

HGNN: heterogeneous graph neural network.

Figure 5 provides a compact visual summary of the survey taxonomy. Unlike the tabular view in Table 1, the schematic makes it easier to see how reconstruction, temporal, structural, and hybrid modeling strategies recur across node-, edge-, and subgraph-level anomaly detection, thereby clarifying the broader organization of the literature at a glance.

Figure 5.

Schematic organization of HGNN-based anomaly detection methods by anomaly granularity and dominant modeling strategy. The figure complements Table 1 by visually grouping representative methods and highlighting cross-cutting trends such as temporal modeling and hybridization. HGNN: heterogeneous graph neural network.

3.1. Node-level anomaly detection

Node-level anomaly detection focuses on identifying individual nodes whose structural, semantic, or temporal behaviors deviate from expected patterns.⁶⁷ In heterogeneous graphs, this task is particularly challenging because multiple node and edge types, role-dependent interactions, and evolving neighborhood semantics make abnormality harder to define. Over the years, a wide variety of HGNN-based models have been proposed, each grounded in different methodological principles. To systematically compare these methods, we group existing models into five categories according to their core learning strategy: reconstruction-based, attention and inconsistency-aware, contrastive learning, semi-supervised, and temporal approaches. Table 2 presents this taxonomy, where each row corresponds to a representative model and each column specifies its supervision type, temporal characteristics, and key architectural mechanisms. This structured comparison illustrates how different modeling choices capture complementary aspects of node-level anomaly detection in heterogeneous graphs.

Table 2.
Comparison of HGNN models for node-level anomaly detection.

Model	Learning strategy	Temporal	Supervision	Key mechanism	Pros/cons
DOMINANT²⁵	Reconstruction	Static	Unsupervised	Dual-channel GCN autoencoder	Simple, interpretable/assumes homophily
SpecAE⁴²	Reconstruction	Static	Unsupervised	Spectral decoder + GMM density	Captures global structure/high complexity
ALARM⁴⁶	Attention	Static	Unsupervised	Multi-view GCN + entropy reg.	Handles multi-view data/sensitive to noise
GraphConsis⁴⁷	Inconsistency	Static	Unsupervised	Feature/relation/context alignment	Robust to inconsistency/needs rich schema
GraphCAD⁴⁰	Contrastive	Static	Self-supervised	Semantic augmentations + contrastive MI	Scalable, no labels/augmentation sensitive
HeCo⁴¹	Contrastive	Static	Self-supervised	Dual metapath views + contrastive loss	Strong semantic views/metapath dependent
SemiGNN⁶⁹	Semi-supervised	Static	Semi-supervised	Hierarchical attention over metapaths	Uses sparse labels/label quality dependent
GCNSI⁷⁰	Semi-supervised	Static	Semi-supervised	GCN encoder + label diffusion	Improves low-label performance/task-specific
GDN⁴⁹	Forecasting	Dynamic	Unsupervised	Graph structure learning + feature prediction	Adaptive structure/forecasting bias
OCAN⁵⁰	Forecasting	Dynamic	Unsupervised	LSTM autoencoder + GAN discriminator	Good sequential modeling/training instability
GCAN⁵¹	Forecasting	Dynamic	Unsupervised	CNN–GRU–GCN hybrid fusion	Captures propagation/higher compute cost
TADDY⁵²	Forecasting	Dynamic	Unsupervised	Transformer encoder + temporal attention	Strong temporal modeling/expensive
HRGCN⁵³	Forecasting	Dynamic	Unsupervised	Hierarchical relational–temporal modeling	Rich hierarchy/complex implementation
XG-NID⁵⁴	Forecasting	Dynamic	Unsupervised	Transformer with multi-modal inputs	Multi-modal robustness/resource intensive

HGNN: heterogeneous graph neural network; GCN: graph convolutional network; LSTM: long short-term memory.

Reconstruction-based methods operate under the assumption that anomalous nodes are harder to reconstruct than normal ones. For example, DOMINANT²⁵ optimizes a joint reconstruction loss:

$$L = (1 - \alpha) \| A - \hat{A} \|_{F}^{2} + \alpha \| X - \hat{X} \|_{F}^{2}, \tag{2}$$

where $A$ is the adjacency matrix, $X$ is the node attribute matrix, $\hat{A}$ and $\hat{X}$ are their reconstructions, and $\alpha$ is a weight that balances the structural and attribute terms. While effective for structural outliers, this objective often struggles with semantic anomalies in heterogeneous settings where relations vary significantly. DOMINANT²⁵ employs a dual-channel autoencoder to jointly reconstruct both adjacency structure and node attributes, while SpecAE⁴² extends this idea by adding spectral deconvolution and density estimation with a Gaussian mixture model. These approaches are straightforward and interpretable, but they often depend on strong feature homophily and can struggle when semantics are heterogeneous or evolving.
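The joint objective in Equation (2) can be evaluated directly once reconstructions are available. The sketch below computes it in plain Python for small dense matrices. The reconstructed values are hand-written placeholders, not outputs of a trained autoencoder, and the function names are hypothetical.

```python
def squared_frobenius(m, m_hat):
    """Squared Frobenius norm of the difference between two matrices."""
    return sum(
        (a - b) ** 2
        for row, row_hat in zip(m, m_hat)
        for a, b in zip(row, row_hat)
    )


def joint_reconstruction_loss(A, A_hat, X, X_hat, alpha=0.5):
    """Eq. (2): weighted sum of structure and attribute reconstruction errors."""
    structure_term = squared_frobenius(A, A_hat)
    attribute_term = squared_frobenius(X, X_hat)
    return (1 - alpha) * structure_term + alpha * attribute_term


# Toy two-node graph with placeholder reconstructions.
A = [[0.0, 1.0], [1.0, 0.0]]
A_hat = [[0.1, 0.9], [0.8, 0.2]]
X = [[1.0, 0.0], [0.0, 1.0]]
X_hat = [[0.5, 0.0], [0.0, 1.0]]

loss = joint_reconstruction_loss(A, A_hat, X, X_hat, alpha=0.5)
```

Setting `alpha` to 0 reduces the objective to the structural term alone, and setting it to 1 keeps only the attribute term, which makes the trade-off between the two channels explicit.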

Attention and inconsistency-aware methods shift the focus from reconstruction error to semantic irregularities. ALARM⁴⁶ constructs multiple graph convolutional network (GCN)⁶⁸ channels for different semantic neighborhoods and applies entropy-based penalties to emphasize uncertain node representations. GraphConsis⁴⁷ further models inconsistencies across features, relations, and contexts, aligning these heterogeneous views to detect stealthy anomalies. These models are particularly effective in schema-rich graphs where relation diversity provides abundant semantic cues. However, they depend heavily on balanced and well-defined relation patterns, which may limit their robustness in sparse or incomplete networks. Compared with reconstruction-based methods, attention and inconsistency-aware approaches capture finer-grained relational semantics but are generally more sensitive to noise and data imbalance.

Contrastive learning provides a self-supervised alternative by aligning node embeddings across different semantic or structural views. GraphCAD⁴⁰ generates semantic augmentations of local neighborhoods and applies contrastive objectives, while HeCo⁴¹ constructs dual metapath-based views and enforces consistency between them. These methods are effective in low-label scenarios and scale well to large graphs, but their performance is sensitive to the design of augmentations and view sampling strategies.
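
As a rough illustration of cross-view alignment, the sketch below computes a standard InfoNCE loss between two views of the same nodes. It is a generic stand-in, not the exact objective of GraphCAD or HeCo; the embeddings and the temperature value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss: node i in view A should match node i in view B."""
    total = 0.0
    for i, za in enumerate(view_a):
        logits = [cosine(za, zb) / temperature for zb in view_b]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


# Hypothetical embeddings of two nodes under two views.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b_aligned = [[0.9, 0.1], [0.1, 0.9]]
view_b_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(view_a, view_b_aligned)
loss_shuffled = info_nce(view_a, view_b_shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is lower when the two views agree on which embeddings belong to the same node, which is why the choice of views and augmentations matters so much for these methods.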

Semi-supervised approaches take advantage of limited labeled data by combining relational schema with label propagation. SemiGNN⁶⁹ uses hierarchical attention over metapath-defined subgraphs for fraud detection, whereas GCNSI⁷⁰ integrates GCN⁶⁸ encoding with label diffusion for rumor source detection. These methods can improve accuracy under label scarcity but are highly dependent on the quality and distribution of labels, which are often uneven in practice.

Temporal models extend anomaly detection into dynamic settings by explicitly modeling node behavior over time. TADDY⁵² utilizes a relational transformer to encode temporal–relational interactions:

$$z_{i}^{(t)} = \mathrm{Transformer}\left(\left\{ h_{j}^{(t - \tau)} : j \in \mathcal{N}(i),\ \tau \in [0, T] \right\}\right). \tag{3}$$

By explicitly modeling the evolution of node embeddings across time steps, TADDY can detect behavioral drift that static models overlook. However, the computational complexity of transformer-based architectures remains a bottleneck for large-scale cybersecurity streaming data. Among the other temporal models, GDN⁴⁹ forecasts multivariate node features and identifies anomalies as deviations from predicted values. OCAN⁵⁰ couples a long short-term memory (LSTM)-based autoencoder with a GAN discriminator to refine sequence modeling, while GCAN⁵¹ fuses CNN, GRU, and GCN modules to detect propagation anomalies. Recent advances such as HRGCN⁵³ and XG-NID⁵⁴ integrate hierarchical temporal cues and multi-modal input for insider threat detection. Related cybersecurity graph models have also moved toward directed and interpretable temporal reasoning: GNN-based log anomaly detection and explanation¹¹ models event dependencies with graph structure and explanation-oriented outputs, while IST-GCN¹² jointly exploits directed temporal links and undirected event-similarity structure for interpretable system-log anomaly detection. These methods capture evolving patterns effectively but introduce computational overhead and require careful temporal discretization.
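
The forecasting idea, scoring a point by how far it deviates from a predicted value, can be sketched as follows. A moving-average forecaster stands in for a learned model such as GDN, and the median/MAD normalisation is an assumption chosen for robustness; the series is hypothetical.

```python
from statistics import median

WINDOW = 3  # number of past values used for each forecast (assumed)


def moving_average_forecast(series, window=WINDOW):
    """Predict each value from the mean of the previous `window` values."""
    return [sum(series[t - window:t]) / window
            for t in range(window, len(series))]


def deviation_scores(series, window=WINDOW):
    """Absolute forecast error, normalised by the median and MAD of all errors."""
    preds = moving_average_forecast(series, window)
    errors = [abs(y - p) for y, p in zip(series[window:], preds)]
    med = median(errors)
    mad = median(abs(e - med) for e in errors) or 1e-9  # avoid division by zero
    return [(e - med) / mad for e in errors]


# Hypothetical per-interval feature of one node, with a spike at index 6.
series = [10, 11, 10, 11, 10, 11, 40, 11, 10, 11]
scores = deviation_scores(series)
most_anomalous_t = scores.index(max(scores)) + WINDOW
print("most anomalous time index:", most_anomalous_t)
```

Note that the spike also contaminates the forecasts for the following steps, which is one reason forecasting-based detectors are sensitive to window length and temporal discretization.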

For static enterprise graphs with well-defined schemas, attention-aware models like ALARM are preferred due to their interpretability. In contrast, for rapidly evolving network traffic where labels are unavailable, self-supervised temporal models like TADDY offer superior robustness to concept drift. Many modern frameworks are increasingly hybrid, combining temporal reasoning with contrastive objectives to alleviate label dependence while capturing dynamic patterns.

In summary, node-level HGNN anomaly detection exhibits diverse strategies with complementary strengths. Reconstruction-based models are transparent but assume homogeneity. Attention and inconsistency-aware methods capture rich semantics but depend on diverse relational structures. Contrastive frameworks scale well without labels but rely on high-quality augmentations. Semi-supervised approaches exploit partial labels but are constrained by the quality and distribution of labels. Temporal models excel in dynamic environments but often introduce significant computational costs. Recently, research has increasingly shifted toward hybrid and self-supervised frameworks that combine multiple learning signals. Multi-view and contrastive paradigms are gaining popularity because they alleviate label dependence while enhancing representation robustness. Temporal extensions of HGNNs are also becoming more common, driven by the need to capture evolving attack behaviors and adaptive relational patterns in cybersecurity and fraud detection. Despite these advances, several challenges remain unresolved:

Limited integration of temporal reasoning, semantic consistency, and supervision within a single unified framework.

Continued reliance on static and homogeneous benchmarks that fail to represent real-world heterogeneity and dynamics.

Evaluation protocols that assume clean data and full observability, overlooking label noise, partial views, and concept drift.

Addressing these gaps requires end-to-end dynamic HGNNs capable of adaptive reasoning across structure, time, and semantics. Evaluating such models under realistic conditions will be essential to improve robustness, generalizability, and practical deployment in high-stakes domains such as fraud detection and cybersecurity.

3.2. Edge-level anomaly detection

Edge-level anomaly detection focuses on identifying irregular or suspicious relationships between nodes, including the sudden creation, disappearance, or behavioral change of links.³²,⁷¹ Unlike node anomalies, these irregularities are more challenging to capture because they depend not only on the characteristics of the connected nodes but also on the semantic meaning and temporal dynamics of their interactions.⁷² Such complexities make edge-level analysis essential for applications such as fraud detection, abnormal communication monitoring, and information flow analysis.⁴⁷,⁷¹ To enable systematic comparison, existing HGNN-based models for edge-level anomaly detection are grouped into four categories based on their primary learning strategy: reconstruction and autoencoding, temporal modeling, hierarchical designs, and adversarial or generative frameworks. Table 3 presents this taxonomy, where each row corresponds to a representative model and each column reports its supervision type, temporal capability, and key architectural mechanism. This organization provides a clear basis for analyzing how different design philosophies address complementary aspects of edge-level anomaly detection.

Table 3.
Comparison of HGNN models for edge-level anomaly detection.

Model	Learning strategy	Temporal	Supervision	Key mechanism	Pros/cons
AER-AD⁴³	Reconstruction/autoencoding	Static	Unsupervised	Relational attention with type-specific aggregation	Good for schema-rich graphs/rarity bias
eFraudCom⁴⁴	Reconstruction/autoencoding	Static	Unsupervised	Graph autoencoder for transaction networks	Effective for fraud/assumes rare is abnormal
AddGraph⁵⁵	Temporal modeling	Dynamic	Unsupervised	Temporal GCN with attention and memory module	Long-term temporal cues/window sensitive
DynAD⁵⁶	Temporal modeling	Dynamic	Unsupervised	Gated attention over temporal snapshots	Flexible on streams/retraining overhead
StrGNN³²	Temporal modeling	Dynamic	Unsupervised	RNN-GCN fusion for spatio-temporal patterns	Strong dynamic structure/high compute
Bi-GCN⁵⁷	Temporal modeling	Dynamic	Unsupervised	Bidirectional message passing for asymmetric evolution	Captures directionality/less cyber-specific
Hierarchical-GCN⁶⁵	Hierarchical	Static	Unsupervised	Multi-scale neighborhood aggregation for edge-centric regions	Multi-scale context/hierarchy sensitive
AANE⁶⁶	Adversarial/generative	Static	Unsupervised	Adversarial autoencoder with latent prior alignment	Flexible unsupervised scoring/sparse-graph false positives

HGNN: heterogeneous graph neural network; GCN: graph convolutional network.

Reconstruction and autoencoding methods model edge anomalies by learning embeddings that reconstruct edge presence or attributes, with deviations indicating abnormal links. AER-AD⁴³ extends autoencoders with relational attention, explicitly scoring edges based on type-specific interactions between incident nodes. eFraudCom⁴⁴ applies a graph autoencoder in transaction networks, flagging anomalous edges via reconstruction error. These models are effective in schema-rich or transaction settings but often assume that rare edges are abnormal, which risks misclassifying legitimate but infrequent interactions.
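
A minimal sketch of reconstruction-based edge scoring is given below, using the inner-product decoder common to graph autoencoders. It does not reproduce AER-AD or eFraudCom; the node names and embeddings are hypothetical and would normally be produced by a trained encoder.

```python
import math


def edge_probability(z_u, z_v):
    """Inner-product decoder: sigmoid of the dot product of two node embeddings."""
    dot = sum(a * b for a, b in zip(z_u, z_v))
    return 1.0 / (1.0 + math.exp(-dot))


def edge_anomaly_scores(embeddings, edges):
    """Score each observed edge by how poorly the decoder reconstructs it."""
    return {(u, v): 1.0 - edge_probability(embeddings[u], embeddings[v])
            for u, v in edges}


# Hypothetical embeddings for a user and two resources.
embeddings = {
    "alice": [2.0, 0.0],
    "fileserver": [1.5, 0.2],
    "hr_db": [-1.0, 1.0],
}
observed_edges = [("alice", "fileserver"), ("alice", "hr_db")]

edge_scores = edge_anomaly_scores(embeddings, observed_edges)
for edge, score in edge_scores.items():
    print(edge, round(score, 3))
```

An observed edge that the decoder considers unlikely receives a high score. This is also where the rarity bias enters: a legitimate but infrequent interaction is reconstructed poorly for the same reason.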

Temporal modeling approaches capture anomalies that emerge from changes in connectivity patterns over time. AddGraph⁵⁵ integrates temporal attention into GCNs,⁶⁸ encoding historical activation patterns in a memory module to preserve long-term dependencies. DynAD⁵⁶ aggregates temporal snapshots using gated attention to adapt flexibly to streaming data. StrGNN³² combines recurrent neural networks with GCNs to jointly capture sequential dependencies and structural context. Specifically, it extracts a $k$-hop enclosing subgraph $S_{e}^{(t)}$ for each edge $e = (u, v)$ at time $t$ and applies a GCN to learn structural representations $g_{e}^{(t)}$, which are then fed into a GRU to model temporal evolution:

$$h_{e}^{(t)} = \mathrm{GRU}\left(g_{e}^{(t)}, h_{e}^{(t - 1)}\right), \quad \text{where } g_{e}^{(t)} = \mathrm{GCN}\left(S_{e}^{(t)}\right). \tag{4}$$

This allows the model to capture periodic behaviors and structural transitions in dynamic environments. Bi-GCN⁵⁷ employs bidirectional message passing to detect asymmetric temporal influences, such as abnormal rumor propagation. Compared with reconstruction and autoencoding models, temporal frameworks capture evolving anomalies more effectively and maintain robustness in dynamic environments. However, they depend on assumptions about temporal granularity and often require costly retraining in non-stationary or rapidly changing settings.
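
The pipeline in Eq. (4) can be sketched in a heavily simplified form. In the code below, the enclosing subgraph is taken as the union of the $k$-hop neighbourhoods of both endpoints, a single degree-smoothing step stands in for the GCN, and a scalar GRU cell with fixed, untrained weights stands in for the recurrent module. All of these choices, the graphs, and the weights are assumptions for illustration only.

```python
import math
from collections import deque


def k_hop_nodes(adj, source, k):
    """Nodes within k hops of `source`, found by breadth-first search."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue
        for nbr in adj.get(node, ()):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return set(depth)


def enclosing_subgraph(adj, u, v, k=1):
    """Induced subgraph on the union of the k-hop neighbourhoods of u and v."""
    nodes = k_hop_nodes(adj, u, k) | k_hop_nodes(adj, v, k)
    return {n: [m for m in adj.get(n, ()) if m in nodes] for n in nodes}


def structural_summary(sub):
    """Stand-in for GCN(S_e): smooth node degrees over neighbours, then average."""
    deg = {n: float(len(nbrs)) for n, nbrs in sub.items()}
    smoothed = [(deg[n] + sum(deg[m] for m in nbrs)) / (1 + len(nbrs))
                for n, nbrs in sub.items()]
    return sum(smoothed) / len(smoothed)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(g, h_prev, w):
    """Scalar GRU cell: h_t = (1 - z) * h_prev + z * h_candidate."""
    z = sigmoid(w["wz"] * g + w["uz"] * h_prev)
    r = sigmoid(w["wr"] * g + w["ur"] * h_prev)
    h_candidate = math.tanh(w["wh"] * g + w["uh"] * (r * h_prev))
    return (1 - z) * h_prev + z * h_candidate


def edge_states(snapshots, u, v, k, w):
    """Hidden state of edge (u, v) after each snapshot, as in Eq. (4)."""
    h, states = 0.0, []
    for adj in snapshots:
        g = structural_summary(enclosing_subgraph(adj, u, v, k))
        h = gru_step(g, h, w)
        states.append(h)
    return states


# Hypothetical undirected snapshots; the last one adds a burst of links around u.
adj_t0 = {"u": ["v", "a"], "v": ["u", "b"], "a": ["u"], "b": ["v"],
          "c": ["d"], "d": ["c"]}
adj_t2 = {"u": ["v", "a", "c", "d"], "v": ["u", "b"], "a": ["u"], "b": ["v"],
          "c": ["d", "u"], "d": ["c", "u"]}
weights = {"wz": 0.5, "uz": 0.5, "wr": 0.5, "ur": 0.5, "wh": 0.3, "uh": 0.8}

sub_t0 = enclosing_subgraph(adj_t0, "u", "v", k=1)
states = edge_states([adj_t0, adj_t0, adj_t2], "u", "v", 1, weights)
print(sorted(sub_t0), states)
```

A trained model would learn the weights and attach a classifier to the hidden state; the sketch only shows how the structural summary of each snapshot is threaded through a recurrent update.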

Hierarchical approaches embed edges by aggregating information at multiple neighborhood scales. Hierarchical-GCN⁶⁵ learns both local and global edge-centric regions, improving detection of anomalies that are invisible at the node-pair level but emerge in broader structural contexts, such as inter-community links. While this improves generalization in multi-scale networks, it increases computational complexity and is sensitive to the chosen hierarchy.

Adversarial and generative frameworks learn edge embeddings by aligning them with latent priors or generating synthetic samples. AANE⁶⁶ employs an adversarial autoencoder that pushes embeddings to match a learned prior, while unsupervised scoring identifies low-density edges as anomalies. These approaches offer flexibility in label-scarce settings but can struggle when benign rare interactions resemble true anomalies.

Comparative analysis: For edge-level detection, reconstruction methods like AER-AD are well-suited for static, schema-rich environments where anomalies manifest as rare interactions. However, in dynamic settings such as network traffic, temporal models like StrGNN and DynAD are preferred because they can capture evolving behaviors and periodic patterns. Adversarial frameworks like AANE offer strong unsupervised capabilities but may struggle with false positives in highly sparse graphs.

In summary, edge-level anomaly detection in HGNNs reflects a variety of modeling strategies. Reconstruction and autoencoding methods are intuitive but sensitive to rarity bias. Temporal models capture evolving patterns but depend on time discretization and require higher computational overhead. Hierarchical approaches improve robustness by leveraging multi-scale contexts but add structural complexity. Adversarial frameworks enhance unsupervised performance but risk false positives in sparse graphs. Recent research in edge-level anomaly detection shows a clear shift from pure reconstruction-based models toward temporal and adversarial designs. Temporal HGNNs are gaining popularity for their ability to capture evolving interaction patterns, while adversarial and generative frameworks attract attention for improving robustness and realism in training. Early hybrid architectures that combine structural reconstruction with temporal reasoning or generative adaptation are also emerging, reflecting a growing interest in unified modeling of dynamic relational behavior.

Despite these advances, edge-level anomaly detection remains an open challenge. Current methods often address structural, temporal, or semantic irregularities separately, and few frameworks attempt to integrate these dimensions in a unified manner. Many models still assume clean relational schemas or regular temporal granularity, which limits their robustness in real-world settings where edge dynamics are noisy, asynchronous, and partially observable. Evaluations are also frequently conducted on static or synthetic benchmarks, overlooking practical factors such as evolving role semantics, bursty interactions, and context-dependent edge rarity. Future work should prioritize the development of flexible HGNN frameworks capable of jointly reasoning over structure, semantics, and time, while adapting to irregular and incomplete observations. Building richer heterogeneous benchmarks that capture realistic edge evolution will likewise be essential for improving the practical applicability of these models in domains such as fraud detection, communication networks, and cybersecurity.

3.3. Subgraph-level anomaly detection

Subgraph-level anomaly detection addresses the task of identifying groups of nodes and edges that collectively exhibit abnormal behavior.⁷³,⁷⁴ Unlike node- or edge-level irregularities, which are tied to individual elements, subgraph anomalies arise from the joint behavior of multiple entities that may appear normal in isolation but anomalous as a whole.⁷⁵ This makes detection more challenging, particularly in heterogeneous graphs where diverse node and edge types give rise to complex relational patterns. Applications include coordinated fraud campaigns, cybersecurity monitoring, and anomaly detection in biological or chemical graphs. A wide range of HGNN-based models have been proposed for this task, but they vary considerably in their detection paradigms. To establish a coherent taxonomy, we categorize existing approaches into six groups: metapath- and metagraph-based methods, structural invariance and motif-based approaches, temporal models, one-class classification, knowledge distillation, and hybrid designs. Table 4 summarizes representative models across these categories, highlighting their supervision type, temporal support, and architectural mechanisms.

Table 4.
Comparison of HGNN models for subgraph-level anomaly detection.

Model	Learning strategy	Temporal	Supervision	Key mechanism	Pros/cons
SubAnom³⁵	Metapath/subgraph encoding	Static	Unsupervised	Candidate subgraphs via metapath sampling, GNN encoder for semantic deviation	Captures semantic patterns/metapath dependent
mHGNN⁶⁰	Metagraph reasoning	Static	Unsupervised	Aggregates higher-order motifs from metagraphs for multi-hop relations	Multi-hop reasoning/high compute cost
MatchGNet⁶¹	Structural invariance	Static	Supervised	Hierarchical attention + subgraph matcher for invariant patterns	Good for novel attack chains/needs labels
HON-GAT⁴⁸	Motif-based attention	Static	Unsupervised	Motif-instance representations integrated into attention layers	Encodes motifs/motif quality sensitive
ST-GCAE⁵⁸	Temporal modeling	Dynamic	Unsupervised	Spatial–temporal GCN autoencoder for evolving motif disruptions	Strong temporal motifs/expensive
TGBULLY⁵⁹	Temporal modeling	Dynamic	Semi-supervised	GRU + GAT layers for evolving social subgraph behavior	Captures staged behavior/domain specific
OCGATL⁶³	One-class classification	Static	One-class	Learns compact boundary from normal subgraphs	Works with scarce labels/boundary sensitivity
OCGNN⁴⁵	One-class + Autoencoding	Static	One-class	GIN encoder + autoencoder for molecular/transaction subgraphs	Compact normal modeling/benchmark limited
GLocalKD⁶⁴	Knowledge distillation	Static	Supervised	Bi-level knowledge transfer between node- and subgraph-level encoders	Cross-level robustness/teacher–student alignment needed
GraphRfi⁶²	Hybrid (embedding + ensemble)	Static	Supervised	GCN embeddings scored via neural random forest for review graphs	Better interpretability/less generalizable

HGNN: heterogeneous graph neural network; GCN: graph convolutional network; GRU: gated recurrent unit.

Metapath- and metagraph-based methods capture semantic consistency by leveraging schema-guided relational patterns. SubAnom³⁵ constructs candidate subgraphs through metapath sampling and learns embeddings with a GNN encoder, identifying anomalies as deviations from typical semantic patterns. mHGNN⁶⁰ extends this idea by aggregating higher-order semantic motifs encoded in metagraphs, which allows the model to capture multi-hop and inter-relational dependencies. While effective in networks with strong type-specific structure, these methods are heavily dependent on schema quality, and metagraph reasoning, though more expressive, introduces higher computational demands.

Structural invariance and motif-based approaches instead focus on subgraph topologies. MatchGNet⁶¹ uses hierarchical attention and subgraph matching to identify program structures that violate learned invariants. It computes a matching score between a target subgraph $S_{q}$ and a set of normal invariant subgraphs $\{S_{1}, \dots, S_{K}\}$:

$$\mathrm{Score}(S_{q}) = \max_{k \in [1, K]} \mathrm{sim}\left(h_{S_{q}}, h_{S_{k}}\right), \quad h_{S} = \mathrm{Readout}\left(\mathrm{GNN}(S)\right), \tag{5}$$

where $h_{S}$ represents the graph-level embedding obtained via a hierarchical readout layer. By comparing query subgraphs against a library of benign templates, MatchGNet identifies structural deviations indicative of lateral movement. HON-GAT⁴⁸ incorporates motif instances into graph attention layers, scoring subgraphs that lack expected clique or triad structures as anomalous. These approaches are robust to schema variation but depend on reliable motif definitions, and they can struggle when motif signals are weak or noisy.
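
Eq. (5) can be illustrated with a small sketch in which a mean readout over node features stands in for $\mathrm{Readout}(\mathrm{GNN}(S))$ and cosine similarity is assumed for $\mathrm{sim}$. The one-hot node-type features and the templates are hypothetical; this is not the MatchGNet implementation.

```python
import math


def mean_readout(node_features):
    """Stand-in for Readout(GNN(S)): average the node feature vectors."""
    dim = len(node_features[0])
    n = len(node_features)
    return [sum(f[d] for f in node_features) / n for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def matching_score(query, templates):
    """Eq. (5): best similarity between the query and any benign template."""
    h_q = mean_readout(query)
    return max(cosine(h_q, mean_readout(t)) for t in templates)


# Hypothetical one-hot node types: [process, file, socket].
benign_templates = [
    [[1, 0, 0], [0, 1, 0]],   # a process touching a file
    [[1, 0, 0], [0, 0, 1]],   # a process opening a socket
]
query_familiar = [[1, 0, 0], [0, 1, 0]]
query_unusual = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]

score_familiar = matching_score(query_familiar, benign_templates)
score_unusual = matching_score(query_unusual, benign_templates)
print(score_familiar, score_unusual)
```

Under this reading, a low maximum similarity means that no benign template explains the query subgraph, so lower scores are the more suspicious ones.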

Temporal models integrate subgraph evolution into the detection process. ST-GCAE⁵⁸ combines spatial and temporal autoencoders to capture motif disruptions across evolving subgraphs, while TGBULLY⁵⁹ models abusive behavior in social networks by applying GRU and GAT layers to evolving interaction patterns. These methods highlight the importance of temporal reasoning, particularly in detecting staged or progressive anomalies, but they increase computational cost and are sensitive to temporal discretization.

One-class classification methods are particularly useful when labeled anomalies are rare. OCGATL⁶³ learns a compact representation boundary from normal subgraphs to isolate abnormal ones, while OCGNN⁴⁵ integrates graph isomorphism networks with autoencoding for molecular and transactional subgraph anomaly detection. These approaches are valuable in domains with limited supervision, though their effectiveness depends on the representativeness of normal subgraphs.
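
One common way to realise a compact boundary is a hypersphere around the embeddings of normal subgraphs. The sketch below assumes that formulation, with a fixed centre and a quantile-based radius; it is not taken from OCGATL or OCGNN, and the embeddings are hypothetical.

```python
import math


def fit_hypersphere(normal_embeddings, quantile=0.95):
    """Centre = mean of normal embeddings; radius = a quantile of their distances."""
    dim = len(normal_embeddings[0])
    n = len(normal_embeddings)
    center = [sum(e[d] for e in normal_embeddings) / n for d in range(dim)]
    dists = sorted(math.dist(e, center) for e in normal_embeddings)
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return center, dists[idx]


def one_class_score(embedding, center, radius):
    """Positive when the embedding falls outside the learned boundary."""
    return math.dist(embedding, center) - radius


# Hypothetical subgraph embeddings from normal activity only.
normal = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
center, radius = fit_hypersphere(normal)

inside = one_class_score([0.5, 0.5], center, radius)
outside = one_class_score([5.0, 5.0], center, radius)
print(inside, outside)
```

The sketch makes the dependence on representative training data visible: if the normal set misses a legitimate behaviour mode, subgraphs from that mode fall outside the sphere and are flagged.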

Knowledge distillation introduces a different perspective by transferring structural information across tasks. GLocalKD⁶⁴ uses bi-level knowledge transfer between node-level and subgraph-level encoders, improving robustness by aligning local consistency with macro-level deviations. While effective in balancing fine- and coarse-grained information, the approach is sensitive to the alignment between teacher and student networks.

Hybrid designs combine deep embeddings with ensemble methods to improve robustness and interpretability. GraphRfi⁶² embeds subgraphs using GCNs⁶⁸ and applies a neural random forest classifier, capturing both local and global structural abnormalities in review networks. Unlike purely embedding-based methods, GraphRfi enhances interpretability but may be less flexible across diverse graph domains.

Comparative analysis: For subgraph-level detection, metapath and metagraph methods like mHGNN are effective when the graph schema is well-defined and anomalies follow known semantic patterns. In contrast, structural invariance models like MatchGNet are better suited for detecting novel attack chains (e.g., APTs) where the exact semantics may vary but the underlying execution structure remains anomalous. Temporal models like ST-GCAE are essential for capturing progressive anomalies but require significant computational resources.

In summary, subgraph-level HGNN anomaly detection demonstrates diverse modeling strategies with complementary strengths. Metapath- and metagraph-based models effectively encode semantic consistency but depend on schema quality and scalability. Structural and motif-based approaches capture topological invariants but are sensitive to motif strength. Temporal methods detect evolving anomalies but at a high computational cost. One-class models address label scarcity but rely on representative training data. Knowledge distillation offers cross-level robustness but requires careful alignment, while hybrid designs improve interpretability at the cost of generality.

Despite the variety of existing approaches, subgraph-level anomaly detection remains underexplored compared to node- and edge-level tasks. Most models focus on either semantic consistency or structural invariance, but few frameworks attempt to integrate both dimensions in a unified manner. Temporal methods remain limited, often requiring discretized snapshots that fail to capture long-term or irregular dynamics. One-class and distillation approaches reduce label dependence but are sensitive to representation bias, while hybrid methods improve interpretability but sacrifice flexibility. Another critical gap is the scarcity of realistic benchmarks, since most evaluations rely on synthetic anomalies or domain-specific datasets that do not capture the scale and heterogeneity of real-world systems. Future work should prioritize scalable frameworks that jointly reason across structure, semantics, and time, while also addressing cross-level anomalies and evaluating under more realistic conditions.

The methods discussed so far for detecting anomalous nodes, edges, and subgraphs share a major limitation: they primarily focus on static graphs. This static view is especially problematic for subgraph-level detection, because many key anomalies are defined by their temporal evolution rather than by structural oddities alone. For instance, a dense community might form suddenly, or a group’s behavior might change. Real-world networks in areas such as cybersecurity and finance are inherently dynamic, and this evolution over time often indicates an anomaly. To capture these patterns, researchers developed dynamic HGNNs. The next section will review these methods, which are designed for evolving graph data.

4. Applications in cybersecurity

Cybersecurity data is inherently multi-entity, multi-relation, and evolves over time.⁷⁶ Threat behaviors, whether driven by malicious insiders, external intruders, or APTs, rarely occur in isolation. They emerge as structured patterns of interaction that span users, hosts, processes, and resources across different time periods.⁷⁷ Representing these interactions as heterogeneous temporal graphs allows HGNN-based approaches to capture both the relational context and the sequential dependencies that traditional flat or tabular models often miss. This section examines four major application domains: insider threat detection, network intrusion detection, fraud detection in access logs, and APT or lateral movement detection. For each, we highlight how graph-based representations and reasoning make it possible to identify subtle, evolving, and context-dependent malicious behaviors in complex operational environments.

4.1. Insider threat detection

Insider threat detection focuses on identifying malicious actions perpetrated by legitimate users who exploit their access privileges to exfiltrate data, escalate access, or conduct sabotage. Unlike external attacks, insider threats often mimic normal behavior patterns, making them difficult to detect through traditional signature or rule-based systems. Graph-based modeling provides a powerful alternative by capturing the contextual relationships between users, systems, processes, and access events over time, thus enabling the discovery of subtle behavioral deviations embedded in complex relational structures.

In graph anomaly detection, insider threats are typically modeled using heterogeneous temporal graphs, where nodes represent users, machines, processes, or resources, and edges represent interaction events such as logins, file accesses, or privilege escalations. Temporal dynamics are essential, as malicious behavior often unfolds gradually or through behavioral drift. Bipartite or multi-partite graphs are frequently used to separate user entities from system assets, and edge timestamps allow for modeling sequential patterns or time windows.
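
The graph construction described above can be sketched as a small data structure: typed nodes, typed edges with timestamps, and fixed-length time windows. The event fields, node types, relation names, and the one-hour window are all hypothetical choices, not a schema from any specific dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int   # seconds since an arbitrary start (assumed unit)
    src: tuple       # (node_type, node_id), e.g. ("user", "alice")
    relation: str    # edge type, e.g. "logon"
    dst: tuple       # (node_type, node_id), e.g. ("host", "pc1")


def build_snapshots(events, window_seconds):
    """Group typed, timestamped edges into consecutive fixed-length windows."""
    snapshots = defaultdict(list)
    for ev in events:
        snapshots[ev.timestamp // window_seconds].append(
            (ev.src, ev.relation, ev.dst))
    return dict(sorted(snapshots.items()))


def relation_counts(snapshot):
    """Count edges per (source type, relation, target type) triple."""
    counts = defaultdict(int)
    for src, relation, dst in snapshot:
        counts[(src[0], relation, dst[0])] += 1
    return dict(counts)


# Hypothetical log events for one user.
events = [
    Event(10, ("user", "alice"), "logon", ("host", "pc1")),
    Event(50, ("user", "alice"), "accesses", ("file", "report.docx")),
    Event(3700, ("user", "alice"), "logon", ("host", "srv9")),
    Event(3720, ("user", "alice"), "logon", ("host", "srv7")),
]

snapshots = build_snapshots(events, window_seconds=3600)
for window, edges in snapshots.items():
    print(window, relation_counts(edges))
```

Each window yields one heterogeneous snapshot that a temporal HGNN could consume. The window length is a modeling decision, and the sensitivity to it is the temporal discretization issue raised for the dynamic models in Section 3.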

Several HGNN-based models discussed in Section 3 are well aligned with this task. AddGraph⁵⁵ is particularly suitable for detecting access-based anomalies by modeling temporal user behavior via graph attention mechanisms. It constructs user-event graphs across discrete time windows and applies attention-based aggregation to track behavioral consistency. This allows it to flag abnormal user transitions or irregular access patterns even in unsupervised settings. GCAN,⁵¹ originally developed for fake news propagation, demonstrates strong generalization to user behavior tracking by capturing sequential dependencies through its GRU–CNN–GCN hybrid architecture. In the context of insider threats, this allows for detecting suspicious propagation of commands, unusual file transfers, or unexpected account switching. OCAN⁵⁰ is another representative model that leverages sequential modeling through an LSTM-based autoencoder and employs adversarial training to refine detection boundaries. The model is trained on sequences of benign user activity and uses a GAN⁷⁸ discriminator to distinguish between generated and observed behavior. This approach is particularly effective for detecting insiders who slowly evolve their behavior to avoid triggering simple statistical thresholds. OCAN has been evaluated on the CERT Insider Threat Dataset,⁷⁹ a widely used benchmark consisting of simulated user activity logs from a fictitious enterprise, developed by Carnegie Mellon University. TADDY⁵² introduces a graph transformer framework capable of modeling long-range temporal dependencies across relation types, which is essential for capturing lateral movement or APTs within enterprise networks. In the context of insider threats, it enables the system to detect complex temporal sequences that may include login attempts, command execution, and privilege escalations across multiple machines and sessions. 
Complementary enterprise-scale evidence is available in the Los Alamos National Laboratory Unified Host and Network Data Set,⁸⁰ which provides authentication events, network connections, and process creation data suitable for graph-based insider threat analysis. Both datasets are high-dimensional, temporally dense, and semantically rich, making them ideal testbeds for graph-based insider threat detection research. Despite these advances, several challenges persist. Insider threat datasets suffer from extreme class imbalance, with anomalies often below 1%. Behavioral mimicry, where insiders imitate normal users, complicates detection, while gradual behavioral drift demands models that capture long-term context without overfitting to short-term noise. As a result, future research in HGNN-based insider threat detection should focus on adaptive temporal modeling, unsupervised semantic drift detection, and the fusion of structured logs with relational graph embeddings. Models that can reason over both short-term bursts and long-term deviations will be crucial for improving resilience against insider abuse.

Quantitative perspective: Application-level results in insider threat detection remain highly dependent on the underlying release, user population, and temporal aggregation strategy, so they should not be interpreted as a direct benchmark ranking. Even so, recently reported graph-based insider threat studies show that strong performance is achievable on CERT-style benchmarks. For example, HOGPNN-ITD reports accuracy, precision, and detection rate of 0.989, 0.979, and 0.973, respectively, on CERT r4.2, while also reaching 0.972 accuracy on CERT r4.1.⁸¹ These results suggest that temporally aware graph models can capture insider behavior effectively, but they also reinforce the need for standardized evaluation protocols before claims across studies can be compared fairly.
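
A short worked example shows why accuracy alone is hard to interpret when anomalies make up about 1% of the data. The confusion counts below are hypothetical and are not results from any cited study; detection rate is taken here to mean recall, which is an assumption about the metric's definition.

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and detection rate (recall) from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical test set: 10,000 instances, 100 of them anomalous (1%).
always_normal = detection_metrics(tp=0, fp=0, tn=9900, fn=100)
detector = detection_metrics(tp=80, fp=40, tn=9860, fn=20)

print("always predicts normal:", always_normal)
print("hypothetical detector: ", detector)
```

The trivial predictor reaches 0.99 accuracy while detecting nothing, whereas the hypothetical detector improves accuracy only to 0.994 despite finding 80% of the anomalies. Reporting precision and detection rate alongside accuracy therefore matters when comparing studies.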

4.2. Network intrusion detection

Network intrusion detection focuses on identifying unauthorized or malicious activities within computer networks, including port scanning, brute-force login attempts, data exfiltration, and protocol violations. Unlike traditional log-based methods, graph-based approaches offer a structural view of communication behavior, capturing not only the volume and frequency of connections but also the contextual and temporal relationships among communicating entities. Graph representations of network flows allow anomaly detection models to reason over communication patterns, session dynamics, and inter-device dependencies, offering improved sensitivity to stealthy or distributed attacks. Graphs used in intrusion detection are commonly constructed as IP-host graphs, host-session graphs, or temporal communication graphs, where nodes represent IP addresses, hosts, or processes, and edges denote communication sessions or flow records, often with timestamp and protocol features. Temporal windows are applied to construct evolving snapshots or streaming graphs, enabling the capture of dynamic attack behavior, such as multi-step exploits or time-distributed probes. HGNN-based anomaly detection models offer strong potential in this domain due to their ability to integrate temporal dynamics, relational heterogeneity, and context-aware representation learning. GDN, originally designed for cyber-physical systems, is well-suited to network intrusion settings through its graph structure learning and multivariate time series forecasting components. In GDN, each node learns an adaptive neighborhood graph and predicts future behavior based on temporal dependencies, allowing the model to flag deviations as anomalies. Its capacity to handle multivariate node features and implicit relation learning makes it applicable to intrusion detection, where relationships among traffic flows are not explicitly labeled. 
DynAD⁵⁶ extends this dynamic modeling by explicitly learning time-evolving edge behaviors. It uses gated attention mechanisms to aggregate subgraph sequences across time and detect anomalous transitions in network topology. In intrusion detection, DynAD can model the progression of low-and-slow attacks or detect the sudden appearance of anomalous connections that deviate from historical patterns. Compared to GDN, DynAD offers finer-grained modeling of edge dynamics, which is particularly useful for capturing ephemeral connections and distributed scanning behavior.

StrGNN³² introduces a spatio-temporal framework that fuses recurrent neural networks with GCN encoders to model both sequential activity and structural patterns. When applied to network graphs, StrGNN captures not only the topology of interactions but also periodic behaviors, such as repeated authentication attempts or scheduled transfers. Its recurrent component enables the model to maintain a memory of previous interactions, which is critical in identifying multi-phase attacks that unfold over extended periods. Bi-GCN,⁵⁷ although originally proposed for rumor detection, employs bidirectional message passing to model influence propagation across graph edges. When adapted for network traffic graphs, Bi-GCN can detect asymmetric information flows indicative of command-and-control behavior, backdoors, or malicious redirection. Its bidirectional design enhances sensitivity to anomalous flow sequences that would be difficult to detect with unidirectional propagation alone. More recent graph-based IDS research reinforces this trajectory. Anomal-E⁹ and TS-IDS¹⁰ use self-supervised graph learning to exploit node–edge interactions under limited labels, while 2025 models such as BS-GAT¹³ and TE-G-SAGE¹⁴ emphasize behavior-aware graph construction, edge-aware learning, chronological evaluation, and explainability for operational settings. Complementary evidence from real-world IoT communication studies further shows that graph-based anomaly detection remains effective under low false-positive constraints and dynamic traffic conditions.¹⁵ A key strength of these models is their capacity to operate in unsupervised or weakly supervised settings, which is crucial given that network intrusion datasets often lack labeled nodes or precise attack annotations. Datasets such as UNSW-NB15,⁸² CICIDS2017,⁸³ and CTU-13⁸⁴ provide session-level or flow-level labels and have been used extensively for evaluating GNN-based intrusion detection.
Graphs can be constructed by aggregating these flows over time intervals, embedding host-level attributes, and encoding session statistics (e.g., packet count, duration, protocol). However, deploying GNN-based models in real-world intrusion detection systems remains challenging. High traffic volume imposes significant computational burdens, especially for models that require neighborhood aggregation or historical memory. Furthermore, the need for streaming inference limits the feasibility of models that rely on full-batch training or multi-hop subgraph extraction. Finally, the lack of labeled ground truth for individual nodes or subgraphs hinders supervised fine-tuning and evaluation. To address these challenges, future work should focus on scalable inductive models, incremental graph construction, and semi-supervised anomaly ranking that do not require full supervision or static graphs. HGNNs capable of modeling both short-term and long-term dependencies while preserving efficient online inference will be critical for advancing the practical deployment of GNN-based network intrusion detection.
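The flow-to-graph construction described above (aggregating flows over time intervals and encoding session statistics on edges) can be illustrated with a minimal sketch. The `Flow` record, its field names, and the `build_snapshots` helper are hypothetical and chosen only for illustration; real datasets such as UNSW-NB15 or CICIDS2017 expose different and far richer schemas, and individual models use their own pre-processing.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Flow(NamedTuple):
    """Hypothetical flow record; field names are assumptions, not a dataset schema."""
    ts: float        # flow start time in seconds
    src: str         # source host or IP
    dst: str         # destination host or IP
    proto: str       # protocol label, e.g. "tcp"
    packets: int     # packet count of the flow
    duration: float  # flow duration in seconds


EdgeKey = Tuple[str, str, str]  # (src, dst, proto): one directed, protocol-typed edge


def build_snapshots(flows: Iterable[Flow], window: float) -> Dict[int, Dict[EdgeKey, dict]]:
    """Assign each flow to a fixed-length time window and aggregate
    session statistics per directed edge inside that window."""
    snapshots: Dict[int, Dict[EdgeKey, dict]] = defaultdict(dict)
    for f in flows:
        idx = int(f.ts // window)  # index of the snapshot this flow falls into
        edge = snapshots[idx].setdefault(
            (f.src, f.dst, f.proto),
            {"flows": 0, "packets": 0, "duration": 0.0},
        )
        edge["flows"] += 1
        edge["packets"] += f.packets
        edge["duration"] += f.duration
    return dict(snapshots)


# Toy input: two flows on the same edge in window 0, one flow in window 1.
flows = [
    Flow(0.5, "10.0.0.1", "10.0.0.2", "tcp", 10, 1.0),
    Flow(3.0, "10.0.0.1", "10.0.0.2", "tcp", 5, 0.5),
    Flow(12.0, "10.0.0.3", "10.0.0.2", "udp", 2, 0.1),
]
snapshots = build_snapshots(flows, window=10.0)
```

Each snapshot is an edge dictionary that a downstream model could convert into its own tensor format. The window length is a free parameter, and different choices yield different graph topologies, which is one reason results across studies are hard to compare.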

Quantitative perspective: Although results are not directly comparable across datasets or graph-construction pipelines, published network intrusion studies do provide useful application-level evidence. On NF-BoT-IoT-v2 and NF-ToN-IoT-v2, BS-GAT reports binary-classification $F 1$ -scores of 0.9899 and 0.9790 with accuracies of 0.9900 and 0.9788, respectively.¹³ Under chronological evaluation on NF-UNSW-NB15-v3, TE-G-SAGE reports macro-average accuracy, precision, recall, and $F 1$ of 0.9559, 0.4942, 0.6274, and 0.4906, outperforming a GCN baseline in recall and interpretability-oriented analysis.¹⁴ Taken together, these results support the broader conclusion of this survey: temporal and edge-aware graph models are especially valuable when intrusion detection must balance detection quality with operational interpretability.

4.3. Fraud detection in access logs

Fraud detection in enterprise environments involves identifying unauthorized, anomalous, or policy-violating behaviors within access control systems, enterprise resource usage, and cloud-based environments. Unlike intrusion detection, which often centers on external threats, access fraud is typically conducted by legitimate users through unusual activity patterns, abuse of privileges, or coordinated circumvention of access policies. Given the complex relational structure and temporal evolution of access behaviors, graph-based modeling, particularly with HGNNs, has become a compelling approach for contextualizing and detecting such fraud. Access logs are naturally represented as heterogeneous graphs, where nodes can denote users, actions (e.g., login, upload), and resources (e.g., files, APIs, servers), and edges represent observed interactions over time. Graphs can be constructed from user–action–resource triplets, forming temporal session graphs with edge attributes such as timestamps, durations, or success/failure codes. These session-based graphs capture not only co-occurrence but also behavioral sequences and semantic relationships among access events, which are critical for modeling intent and consistency.
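As a minimal sketch of the triplet-based construction described above, the following code turns access-log rows into typed edge lists. The `AccessEvent` fields, the relation names `performs` and `targets`, and the `build_hetero_graph` helper are assumptions made for illustration and do not correspond to any specific dataset or model in this survey.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class AccessEvent(NamedTuple):
    """Hypothetical access-log row; field names are illustrative."""
    ts: int
    user: str
    action: str
    resource: str
    success: bool


Node = Tuple[str, str]           # (node_type, identifier)
Relation = Tuple[str, str, str]  # (source_type, relation_name, target_type)
Edge = Tuple[Node, Node, dict]   # (source, target, edge attributes)


def build_hetero_graph(events: List[AccessEvent]) -> Dict[Relation, List[Edge]]:
    """Map each user-action-resource triplet to two typed, attributed edges,
    processed in timestamp order so each edge list is a temporal sequence."""
    graph: Dict[Relation, List[Edge]] = defaultdict(list)
    for e in sorted(events, key=lambda ev: ev.ts):
        user: Node = ("user", e.user)
        action: Node = ("action", e.action)
        resource: Node = ("resource", e.resource)
        graph[("user", "performs", "action")].append(
            (user, action, {"ts": e.ts, "success": e.success}))
        graph[("action", "targets", "resource")].append(
            (action, resource, {"ts": e.ts, "success": e.success}))
    return dict(graph)


events = [
    AccessEvent(1, "alice", "login", "hr-portal", True),
    AccessEvent(2, "alice", "download", "payroll.csv", True),
    AccessEvent(3, "bob", "login", "hr-portal", False),
]
graph = build_hetero_graph(events)
```

In this sketch, action nodes are shared across events, so the link between a specific user and a specific resource survives only through the common timestamp. An alternative design, not shown here, would introduce one node per event to keep each triplet explicit at the cost of a larger graph.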

HGNN-based models have demonstrated significant promise in this domain. SemiGNN⁶⁹ introduces a semi-supervised graph attention network designed for fraud detection in financial systems. It leverages both labeled and unlabeled nodes and incorporates hierarchical attention over metapath-based subgraphs. This makes it well-suited for access fraud scenarios where only partial labels are available and fraud behaviors differ across user roles and access types. GraphConsis⁴⁷ addresses fraud detection by modeling multiple sources of inconsistency in graph data—specifically, inconsistencies in features, relations, and context. This is particularly useful in enterprise logs, where fraudulent access may manifest through conflicting behavior patterns across access contexts (e.g., accessing an internal HR system from an unusual department). GraphConsis applies GCNs to multiple subgraph views and penalizes representational divergence, effectively surfacing anomalies that violate inter-type behavioral norms. GCNSI,⁷⁰ although originally proposed for rumor propagation, introduces a label propagation framework over GCN-encoded graphs. When adapted to access fraud detection, it allows for the detection of anomalous user behavior based on relational diffusion—capturing how misuse patterns may spread across user-resource interactions or departments. Its simplicity and scalability make it suitable for enterprise settings where access graphs are updated regularly.

Common datasets for fraud detection in access logs include enterprise simulation logs, such as synthetically generated policy violation data, or real-world transaction graphs (e.g., from Alibaba or eBay platforms). While public access to corporate logs remains limited due to privacy concerns, research communities have increasingly relied on anonymized datasets and controlled simulations to benchmark graph-based fraud detectors. This domain presents unique challenges. First, behavioral heterogeneity is high—different departments, user roles, and systems generate diverse access patterns, making “normal” behavior difficult to define. Second, fraud itself is contextual and loosely defined; what is malicious in one role may be benign in another. Finally, label imbalance is severe: fraudulent events are rare, inconsistently labeled, and often delayed in discovery, which hampers supervised training and evaluation. Future work should address these limitations by combining HGNNs with meta-learning, domain adaptation, and semantic-aware contrastive learning, allowing models to generalize across dynamic enterprise domains and usage regimes. Semi-supervised and unsupervised approaches that incorporate access semantics, user hierarchy, and policy constraints will be key to improving robustness in fraud detection systems.

Quantitative perspective: Public fraud-detection results are especially difficult to compare because many studies rely on proprietary interaction graphs, different fraud definitions, and different label ratios. Nevertheless, representative graph-based studies indicate clear practical gains. On public fraud benchmarks, HHLN-GNN reports improvements of 10.0% in $F 1$ -macro, 12.5% in area under the curve (AUC), and 17.3% in G-mean on YelpChi, with smaller but still positive gains on Amazon.⁸⁵ In industrial transaction settings, xFraud reports an AUC of 0.9074 on eBay-xlarge and demonstrates precision above 0.95 at operationally relevant recall levels under selected thresholds.⁴⁷ These figures are illustrative rather than directly comparable, but they show that graph-based fraud detectors can provide measurable advantages when relational context is preserved.

4.4. APTs or lateral movement detection

APTs refer to stealthy, multi-stage cyber intrusions that aim to maintain prolonged access within a system. Unlike one-off attacks, APTs unfold gradually through lateral movement, privilege escalation, and internal reconnaissance, often mimicking legitimate behavior to evade detection. Detecting APTs requires reasoning over long temporal sequences, diverse entity interactions, and cross-context anomalies, challenges that are well suited to heterogeneous and temporal graph modeling. In practice, graph construction for APT detection typically involves host-process graphs, user-device-process graphs, or command-session graphs, where nodes represent system entities (users, machines, processes), and edges denote interactions such as command execution, session access, or process spawning. Temporal dependencies are central to this task, as APT stages (initial access, foothold, escalation, exfiltration) unfold over time and across multiple components.
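To illustrate why temporal ordering matters for lateral movement, the following sketch computes which hosts are reachable from a starting host along paths whose session timestamps strictly increase. This is a generic time-respecting reachability computation, not the method of any model surveyed here; the `Session` tuple layout and the `earliest_reach` helper are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Session = Tuple[int, str, str]  # (timestamp, source_host, destination_host)


def earliest_reach(sessions: Iterable[Session], start: str) -> Dict[str, float]:
    """Earliest time at which each host can be reached from `start` using
    sessions with strictly increasing timestamps along the path."""
    reached: Dict[str, float] = {start: float("-inf")}
    # Sessions are scanned in time order, so the first time a host is
    # reached is also the earliest possible time.
    for ts, src, dst in sorted(sessions):
        if dst not in reached and src in reached and reached[src] < ts:
            reached[dst] = ts
    return reached


sessions = [
    (1, "A", "B"),
    (2, "B", "C"),
    (1, "C", "D"),  # happens before C could have been compromised
    (3, "C", "E"),
]
reach = earliest_reach(sessions, start="A")
```

On this toy input, a static graph would report host D as reachable from A through B and C, whereas the time-respecting view excludes it because the C-to-D session precedes any possible arrival at C. Temporal graph models aim to capture this kind of ordering constraint implicitly.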

HGNN-based models address these complexities through advanced spatio-temporal and relational reasoning. MatchGNet⁶¹ introduces an invariant graph modeling framework that captures structural and semantic invariants from benign subgraphs and uses hierarchical attention to detect substructures that violate these learned templates. This makes MatchGNet particularly effective for detecting APT lateral movements, where attackers traverse unexpected host-process paths without altering node-level features significantly. GLocalKD⁶⁴ enhances detection by distilling knowledge between node- and subgraph-level GNNs. In the context of APTs, this allows the model to align fine-grained local behaviors (e.g., user logins or process spawns) with high-level behavior patterns (e.g., suspicious movement through internal hosts), capturing the multi-resolution nature of threats. The joint learning strategy helps detect subtle escalations and role shifts, which may go unnoticed in models focusing solely on one level of granularity. TGBULLY,⁵⁹ though originally proposed for detecting online bullying, uses a temporal GAT-GRU architecture that adapts well to capturing evolving behavior patterns in user interactions. In an enterprise setting, the same framework can be applied to session graphs where sequences of access events reflect potential staging activity for lateral movement. The model’s temporal edge modeling and context attention help distinguish coordinated internal traversal from benign variability.

The DARPA Operationally Transparent Cyber (OpTC)/Transparent Computing data release⁸⁶ provides a practical source of labeled APT-like behaviors, offering detailed host, process, and network telemetry with red-team ground truth. Beyond OpTC, APT-oriented audit collections remain fragmented and are often described only through secondary reviews,⁸⁷ which reinforces the lack of a single standardized benchmark for persistent-threat detection. Despite recent progress, APT detection remains one of the most challenging applications of graph-based anomaly detection. First, APT campaigns span long temporal windows, requiring models to maintain memory over extended sequences without overfitting. Second, ground truth is sparse, making supervised training difficult and raising the need for unsupervised or weakly supervised strategies. Third, reasoning over heterogeneous relational dependencies—such as mapping user behaviors to underlying host-process interactions—requires multi-modal learning architectures that balance structural, semantic, and temporal cues. Future research should explore explainable graph transformers, online subgraph tracking, and relational motif reasoning for interpretable and real-time detection of persistent threats. Integrating domain knowledge—such as known privilege escalation paths or device roles—into HGNN architectures may further improve performance and trust in operational settings.

Quantitative perspective: Compared with intrusion detection, quantitative evidence for APT and lateral-movement detection is still relatively sparse and often task-specific. A representative example is MatchGNet, which reports 50% fewer false positives while maintaining zero false negatives in unknown-malware detection based on execution-behavior graphs.⁶¹ Although malware detection is not identical to enterprise APT monitoring, the result is highly relevant because it shows that invariant subgraph matching can materially reduce analyst burden while preserving detection sensitivity in attack-sequence-like settings. This also helps explain why structural and subgraph-level reasoning remains attractive for persistent-threat detection despite the limited availability of standardized APT benchmarks.

To complement the broader application discussion, a targeted empirical case study of representative HGNN models is presented later in the evaluation section under a unified pre-processing pipeline.

5. Evaluation metrics and benchmark datasets

Rigorous evaluation is critical for developing and comparing HGNN-based anomaly detection models. Due to the wide range of application domains and anomaly types, the community employs diverse metrics and datasets. However, evaluations are often inconsistent, and there remains a need for standardized benchmarks and reporting protocols. This section reviews commonly used evaluation metrics and benchmark datasets, structured around the taxonomy introduced in Sections 3 and 4.

5.1. Evaluation metrics

The evaluation of HGNN-based anomaly detection models centers on their ability to rank true anomalies above benign instances. In practice, most methods compute anomaly scores for nodes, edges, or subgraphs, and performance is assessed using ranking-oriented metrics. The choice of metric critically affects the interpretation of model performance, especially given the prevalence of class imbalance and heterogeneous structures in real-world graphs. These metrics are used across different anomaly types: area under the receiver operating characteristic curve (AUROC)⁸⁸ and area under the precision–recall curve (AUPRC)⁸⁹ are most common in node-level benchmarks like ACM and DBLP. Precision@ $K$ and Recall@ $K$ are suited for streaming edge-level detection tasks like AddGraph or DynAD, where only top-ranked alerts are reviewed. NDCG@ $K$ is primarily used in subgraph-level detection settings, such as in cybersecurity or molecular domains, where anomalies carry varying degrees of severity. $F 1$ -score, while widely reported in semi-supervised node-level tasks, depends on pre-defined thresholds and is often used in synthetic or label-limited benchmarks.

One of the most widely reported metrics is the AUROC. AUROC quantifies the overall ability of the model to distinguish between anomalous and normal instances across all possible thresholds. It is computed as the area under the curve that plots the true positive rate (TPR) against the false positive rate (FPR), where:

$$\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}. \tag{6}$$

Here, TP (true positives) is the count of correctly identified anomalies, FN (false negatives) is the count of missed anomalies, FP (false positives) is the count of incorrectly flagged normal instances, and TN (true negatives) is the count of correctly identified normal instances. Although AUROC provides a useful global ranking measure, it tends to overestimate performance in imbalanced datasets, as the false positive rate can remain deceptively low when the normal class dominates. To counter this, the AUPRC is often recommended, particularly in settings with severe class imbalance such as fraud detection and insider threat scenarios. The precision (P) and recall (R) are defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}. \tag{7}$$

AUPRC emphasizes performance on the minority class (anomalies), making it a more informative measure than AUROC in many HGNN use cases.
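The gap between AUROC and AUPRC under class imbalance can be made concrete with a small worked example. The sketch below uses only the Python standard library; the function names and the synthetic scores are assumptions for illustration, and the AUPRC is approximated by average precision (the mean of the precision values at the rank of each true anomaly), which is one common estimator rather than the only one.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random anomaly outscores a random
    normal instance (ties count one half). This equals the area under the
    TPR-versus-FPR curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise AUPRC estimate: mean precision at the rank of each anomaly.
    Assumes anomalies and normal instances do not share identical scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            total += tp / rank
    return total / tp


# Synthetic 1% anomaly rate: 10 anomalies among 1000 instances.
# 40 normal instances receive higher scores than every anomaly.
labels = [0] * 40 + [1] * 10 + [0] * 950
scores = [0.9] * 40 + [0.5] * 10 + [0.1] * 950

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
```

Here the AUROC is roughly 0.96 because only 40 of 990 normal instances (an FPR of about 4%) outrank the anomalies, yet the average precision is only about 0.12 because those 40 false alarms occupy the entire top of the ranking.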

For practical deployment scenarios—where analysts typically investigate only the top-ranked results—metrics such as Precision@ $K$ and Recall@ $K$ provide actionable insight. Precision@ $K$ measures the fraction of true anomalies within the top $K$ ranked predictions:

$$\text{Precision@}K = \frac{\text{number of true anomalies in top } K}{K}. \tag{8}$$

Similarly, Recall@ $K$ measures the proportion of total anomalies that appear within the top $K$ results. In subgraph-level detection tasks, where anomalies can vary in severity (e.g., minor vs. severe breaches), models like MatchGNet and GLocalKD utilize the normalized discounted cumulative gain (NDCG) to incorporate graded relevance. NDCG@ $K$ is defined as:

$$\text{NDCG@}K = \frac{1}{\text{IDCG@}K} \sum_{i = 1}^{K} \frac{2^{{rel}_{i}} - 1}{\log_{2} (i + 1)}, \tag{9}$$

where ${rel}_{i}$ represents the relevance score (or severity) of the anomaly at the $i$-th ranked position, and IDCG@ $K$ is the ideal (best possible) DCG value for normalization. In semi-supervised models where partial ground truth is available, the $F1$-score is frequently reported. It combines precision and recall into a single metric:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{10}$$

Although intuitive, the $F1$-score depends on a pre-defined threshold, which can be difficult to set reliably in unsupervised anomaly detection tasks unless threshold calibration is handled systematically.
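The ranking-based and threshold-based metrics defined in equations (8) to (10) can be sketched in a few lines of standard-library Python. The helper names, the toy labels, and the graded severity values below are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring items (ties keep original order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def precision_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> float:
    return sum(labels[i] for i in top_k(scores, k)) / k


def recall_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> float:
    return sum(labels[i] for i in top_k(scores, k)) / sum(labels)


def ndcg_at_k(relevance: Sequence[int], scores: Sequence[float], k: int) -> float:
    """NDCG@K with gain 2^rel - 1 and discount log2(i + 1), as in equation (9)."""
    def dcg(rels: Sequence[int]) -> float:
        return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels, start=1))

    ranked = [relevance[i] for i in top_k(scores, k)]
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(ranked) / idcg if idcg > 0 else 0.0


def f1_at_threshold(labels: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 after flagging every instance whose score is at least `threshold`."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy ranking: 8 instances, 3 anomalies with graded severity 3, 1, and 2.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
severity = [3, 0, 1, 0, 0, 2, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

p3 = precision_at_k(labels, scores, 3)
r3 = recall_at_k(labels, scores, 3)
n3 = ndcg_at_k(severity, scores, 3)
f1_strict = f1_at_threshold(labels, scores, 0.85)
f1_loose = f1_at_threshold(labels, scores, 0.65)
```

In this example Precision@3 and Recall@3 both equal 2/3, while NDCG@3 is about 0.80 because the severity-2 anomaly is ranked outside the top three. The $F1$-score changes from 0.5 at a threshold of 0.85 to 2/3 at 0.65 on the same scores, which illustrates the threshold dependence noted above.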

Recent studies have underscored common pitfalls in metric selection and reporting. These include threshold sensitivity, potential inflation of AUROC in highly imbalanced datasets, and a lack of standardized operational evaluation protocols. To mitigate these challenges, current best practices emphasize reporting multiple complementary metrics, ensuring clarity around threshold selection, and aligning evaluations with the intended deployment context. Table 5 summarizes the core evaluation metrics discussed above, including their mathematical formulations and recommended usage contexts within HGNN-based anomaly detection.

Given the wide range of anomaly structures and task formulations, no single metric provides a complete picture. As such, recent studies increasingly advocate for multi-metric evaluation strategies that combine global ranking metrics (e.g., AUROC) with task-oriented metrics (e.g., Precision@ $K$ or NDCG), especially when evaluating models under real-world operational constraints. In particular, for streaming cybersecurity systems, Precision@ $K$ is often the most critical metric as it directly correlates with the false alert fatigue experienced by security analysts. A high AUROC may be misleading if the top-ranked results are dominated by false positives, rendering the system unusable in practice.

5.2. Benchmark datasets

The choice of benchmark datasets strongly influences both model design and evaluation outcomes. In the context of heterogeneous graphs, benchmark datasets must capture multi-typed node and edge structures, temporal evolution, and realistic anomaly patterns. Existing benchmarks vary widely in annotation quality, graph construction assumptions, and temporal richness, often posing challenges for consistent and reproducible evaluation. This subsection reviews commonly used datasets across node-level, edge-level, and subgraph-level anomaly detection tasks, detailing how graphs are constructed from raw data and examining critical factors such as heterogeneity, temporal scope, and label quality. By highlighting dataset characteristics and limitations, we aim to clarify where current evaluation practice remains weak and what properties future benchmarks should provide.

Table 5.
Core evaluation metrics in HGNN-based anomaly detection.

Metric	Formula/description	Application context
AUROC	AUROC $= \int_{0}^{1} TPR ({FPR}^{- 1} (x)) d x$	Node-level detection (e.g., DOMINANT, HeCo); standard for global ranking.
AUPRC	AUPRC $= \int_{0}^{1} P (R^{- 1} (x)) d x$	Node-level in imbalanced datasets (GraphConsis, OCAN); robust to false positives.
Precision@ $K$	Precision@ $K = \frac{True Anomalies in Top K}{K}$	Edge-level, real-time detection (AddGraph, DynAD); used in limited analyst settings.
Recall@ $K$	Recall@ $K = \frac{True Anomalies in Top K}{Total Anomalies}$	Edge- or subgraph-level (GCAN, DynAD); ranks based on high-sensitivity detection.
$F 1$ -score	$F 1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$	Semi-supervised node detection (SemiGNN, GCNSI); sensitive to thresholding.
NDCG@ $K$	NDCG@K $= \frac{1}{{IDCG}_{K}} \sum_{i = 1}^{K} \frac{2^{{rel}_{i}} - 1}{\log_{2} (i + 1)}$	Subgraph-level (MatchGNet, GLocalKD); ranks based on graded threat relevance.

HGNN: heterogeneous graph neural network; AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision–recall curve; TPR: true positive rate; FPR: false positive rate; NDCG: normalized discounted cumulative gain; IDCG: ideal discounted cumulative gain.

5.2.1. Node-level anomalies

A significant proportion of prior work utilizes semi-synthetic citation networks such as Cora,⁹⁰ Citeseer,⁹⁰ and PubMed.⁹⁰ These datasets represent homogeneous graphs where node features and links are artificially perturbed to simulate anomalies, for instance, through attribute noise injection or random edge rewiring. Models such as DOMINANT²⁵ and SpecAE⁴² adopt these benchmarks in semi-supervised settings. While widely adopted, these datasets offer limited realism for heterogeneous or dynamic scenarios, and the anomalies they model are often simplistic. To address this, heterogeneous benchmarks such as Amazon review graphs,⁹¹ Yelp/YelpChi review graphs,⁹² and the DBLP bibliographic network⁹³ have been employed in more recent models like HeCo⁴¹ and GraphCAD.⁴⁰ These datasets are not all constructed in the same way: Amazon and Yelp-style fraud benchmarks are typically modeled as review-centered heterogeneous graphs, whereas DBLP is commonly represented as an author–paper–term–venue network. These benchmarks better reflect real-world heterogeneity but present challenges such as schema ambiguity, unclear ground-truth labels, and manual metapath specification. In addition, industrial transactional benchmarks such as the eBay/xFraud data release⁹⁴ extend graph anomaly evaluation into large-scale fraud settings. While rich in relational semantics, such transaction data are often only partially accessible, which continues to complicate reproducibility and fair comparison.
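The two perturbation styles mentioned above, attribute noise injection and random edge rewiring, can be sketched as follows. This is a generic illustration under stated assumptions and not the exact injection protocol used by DOMINANT, SpecAE, or any other cited model; the function names, the Gaussian noise scale, and the toy ring graph are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]  # undirected edge stored as (smaller_id, larger_id)


def inject_attribute_noise(features: Dict[int, List[float]], targets: Iterable[int],
                           rng: random.Random, scale: float = 5.0) -> Dict[int, List[float]]:
    """Return a copy of `features` with Gaussian noise added to target nodes."""
    noisy = {n: list(x) for n, x in features.items()}
    for n in targets:
        noisy[n] = [v + rng.gauss(0.0, scale) for v in noisy[n]]
    return noisy


def rewire_edges(edges: Set[Edge], nodes: List[int], num_rewire: int,
                 rng: random.Random) -> Tuple[Set[Edge], Set[Edge]]:
    """Replace `num_rewire` existing edges with random edges that were never
    part of the graph. Returns (perturbed_edges, injected_edges)."""
    current = set(edges)
    seen = set(edges)  # every edge that has ever existed, to avoid re-adding one
    injected: Set[Edge] = set()
    for u, v in rng.sample(sorted(current), num_rewire):
        candidates = [w for w in nodes
                      if w != u and (min(u, w), max(u, w)) not in seen]
        if not candidates:
            continue  # u is already linked to every other node; keep the edge
        w = rng.choice(candidates)
        new_edge = (min(u, w), max(u, w))
        current.remove((u, v))
        current.add(new_edge)
        seen.add(new_edge)
        injected.add(new_edge)
    return current, injected


# Toy graph: a ring of 10 nodes with 2-dimensional features.
rng = random.Random(0)
nodes = list(range(10))
edges = {(i, i + 1) for i in range(9)} | {(0, 9)}
features = {n: [float(n), 1.0] for n in nodes}

noisy = inject_attribute_noise(features, targets=[0, 5], rng=rng)
perturbed, injected = rewire_edges(edges, nodes, num_rewire=3, rng=rng)
```

The injected edges and the perturbed nodes then serve as ground-truth anomaly labels. Because both the targets and the perturbation strength are chosen by the experimenter, such labels are easy to obtain but tend to produce anomalies that are simpler than those found in operational data.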

5.2.2. Edge-level anomalies

For edge-level detection tasks, CICIDS2017⁸³ and CTU-13⁸⁴ are widely used. These datasets consist of temporal graphs modeling network flows between IP addresses and services, with ground-truth labels for attacks such as port scans, brute-force intrusions, and botnet activity. Models like DynAD,⁵⁶ StrGNN,³² and SedanSpot⁹⁵ utilize these benchmarks to identify anomalous interactions. However, they are characterized by high class imbalance and sparsely labeled attack events, which necessitate careful evaluation using metrics like AUPRC and Recall@ $K$ . The eBay/xFraud data release⁹⁴ is also used in edge-level anomaly detection, particularly in financial fraud scenarios. Here, edges represent transactions and their linkage entities, while anomalies correspond to fraudulent or suspicious interactions. Due to limited public availability and selective release constraints, this benchmark is often handled through semi-supervised or weakly supervised learning frameworks, as in models like eFraudCom.

5.2.3. Subgraph-level anomalies

Subgraph-level anomaly detection remains less developed, with relatively few public benchmarks. A notable example is the DARPA Operationally Transparent Cyber (OpTC)/Transparent Computing data release,⁸⁶ which captures host-process and network telemetry with red-team ground truth in an enterprise environment. This dataset is used in models such as MatchGNet and GLocalKD to study complex multi-step attack chains and other structured threat behaviors. Beyond OpTC, APT-oriented audit corpora remain fragmented and are often summarized only through secondary reviews,⁸⁷ underscoring the scarcity of standardized subgraph-level cybersecurity benchmarks. The Open Graph Benchmark (OGB) Series⁹⁶ is another important resource. While primarily used for node- or graph-level classification, its segmentation-based subgraph support has been adapted in models like OCGNN⁴⁵ and HON-GAT. However, OGB benchmarks often lack detailed subgraph-level anomaly labels, limiting their direct applicability.

5.2.4. Cross-cutting issues

Across all datasets, a recurring limitation is the lack of standardization in graph construction protocols. Studies often apply different pre-processing techniques, metapath definitions, time windowing methods, edge aggregation strategies, or feature engineering pipelines, which can significantly impact the resulting graph topology and feature distributions and thereby hinder reproducibility and fair comparison. Additionally, while many benchmarks are static snapshots, real-world anomaly detection increasingly requires datasets that support streaming, online detection, and temporal drift. Recent efforts like SedanSpot, evaluated on CICIDS2017,⁸³ attempt to address this by simulating streaming edge environments, but broader community adoption of dynamic benchmarking remains limited.

In summary, while existing benchmarks have laid the foundation for HGNN-based anomaly detection, there remains a critical need for diverse, standardized, and temporally aware datasets, particularly in heterogeneous or multi-phase attack scenarios. Table 6 summarizes the key characteristics of each benchmark dataset discussed in this section.

Table 6.
Summary of benchmark datasets for HGNN-based anomaly detection.

Dataset	Anomaly type	Graph construction	Heterogeneity	Temporal	Example models	Deployment context
Cora, Citeseer, PubMed⁹⁰	Node-level (synthetic)	Citation graphs with attribute/edge perturbations	Homogeneous (synthetic in HGNNs)	Static	DOMINANT, SpecAE	Academic/baseline
Amazon,⁹¹ Yelp/YelpChi,⁹² DBLP⁹³	Node-level	Review or bibliographic heterogeneous graphs (e.g., user–item–review; author–paper–term–venue)	Heterogeneous	Static	HeCo, GraphCAD	Review fraud/bibliographic benchmark
CICIDS2017⁸³	Edge-level	Network flow graphs between IPs; temporal snapshots	Semi-heterogeneous (IP, protocol)	Temporal	SedanSpot, DynAD	Network intrusion
CTU-13⁸⁴	Edge-level	Botnet flow graphs	Semi-heterogeneous	Temporal	DynAD, StrGNN	Botnet detection
eBay/xFraud⁹⁴	Edge-level	Transaction-linkage heterogeneous graphs derived from e-commerce records	Heterogeneous	Static	eFraudCom	Financial fraud
DARPA OpTC/TC⁸⁶	Subgraph-level	Enterprise host, process, and network telemetry with red-team ground truth	Heterogeneous	Temporal	MatchGNet, GLocalKD	APT/Lateral movement
OGB Series⁹⁶	Subgraph-level	Citation/biological graphs; subgraphs via segmentation	Varies	Static	OCGNN, HON-GAT	General graph learning

HGNN: heterogeneous graph neural network; APT: advanced persistent threat; OGB: Open Graph Benchmark.

As shown in Table 6, while several datasets exist for evaluating HGNNs, there is a clear gap in high-heterogeneity, dynamic benchmarks tailored for advanced threats like APTs. Addressing this gap will be crucial for advancing the field and ensuring that models are evaluated under realistic, operational conditions.

5.2.5. Targeted empirical case study

To complement the survey discussion with concrete evidence, we conducted a targeted empirical case study using public-code implementations of representative graph anomaly detection models. Although numerous graph anomaly detection methods exist, evaluating all of them is neither feasible nor necessary for a survey-aligned empirical study. Instead, we intentionally selected three models—DOMINANT, TADDY, and StrGNN—because they collectively span three major methodological paradigms discussed throughout this survey: static reconstruction-based modeling, temporal sequence modeling, and structural–temporal joint learning. These models therefore provide a compact but methodologically diverse comparison set, while also being among the few methods with reproducible public implementations suitable for evaluation under a unified pipeline.

All three models were executed under a unified local pre-processing pipeline that converted the public UNSW-NB15 release into model-specific inputs. This design supports a fair and transparent comparison under shared practical constraints while avoiding the redundancy and limited interpretive value of adding many closely related variants.

Because the publicly accessible UNSW-NB15 release does not expose raw host identities in a form directly usable for host-to-host graph construction, we generated a documented behavioral-profile proxy graph from the available flow attributes. Consequently, the following results should be interpreted as a reproducible case study under a common graph-construction protocol rather than as directly comparable benchmark numbers from the original papers. Even with this limitation, the experiment is useful because it illustrates how representative static, temporal, and structural–temporal methods behave under the same practical pre-processing assumptions.
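
The construction protocol is not spelled out in detail here, so the following is only an illustrative sketch of what a behavioral-profile proxy graph could look like, not the procedure used in this case study. It assumes that each flow record carries categorical attributes named `proto`, `service`, and `state`, treats each distinct attribute combination as a profile node, and links profiles whose flows occur close together in time. The function name `build_profile_graph` and the `window` parameter are hypothetical.

```python
from collections import Counter


def build_profile_graph(flows, window=1):
    """Build a weighted, undirected proxy graph over behavioral profiles.

    flows  : list of dicts sorted by time, each with the (assumed) keys
             'proto', 'service', and 'state'.
    window : number of subsequent flows each flow is linked to.
    """
    # Each flow is mapped to a profile node keyed by its categorical attributes.
    profiles = [(f["proto"], f["service"], f["state"]) for f in flows]
    nodes = sorted(set(profiles))

    edges = Counter()
    for i, src in enumerate(profiles):
        # Link this profile to the profiles of the next `window` flows.
        for dst in profiles[i + 1 : i + 1 + window]:
            if src != dst:  # skip self-loops
                edges[tuple(sorted((src, dst)))] += 1
    return nodes, dict(edges)


# Toy flow records (hypothetical values, already in time order).
flows = [
    {"proto": "tcp", "service": "http", "state": "FIN"},
    {"proto": "udp", "service": "dns", "state": "CON"},
    {"proto": "tcp", "service": "http", "state": "FIN"},
    {"proto": "tcp", "service": "ssh", "state": "FIN"},
]
nodes, edges = build_profile_graph(flows, window=1)
for pair, weight in edges.items():
    print(pair, weight)
```

Because no host identities are involved, a graph of this kind encodes co-occurring behavior rather than host-to-host communication, which is why results on it are not comparable with numbers reported on host-level graphs.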

The results in Table 7 show that executable reproduction is feasible, although the absolute numbers depend on the graph construction. All models were evaluated under the same pre-processing, feature extraction, and graph construction; training budgets, however, differed across models (Table 7), so the comparison is indicative rather than tightly controlled. While alternative graph construction strategies may lead to different absolute performance levels, using a single proxy graph means that observed differences reflect relative model behavior under identical structural assumptions. On the prepared UNSW-NB15 proxy graph, DOMINANT achieved an AUROC of approximately 0.365, TADDY achieved a total AUC of approximately 0.639, and StrGNN achieved an AUROC of approximately 0.733 with an average precision of approximately 0.730 under a lightweight reproducibility run. DOMINANT’s AUROC falls below the 0.5 chance level, meaning that its reconstruction-error ranking tends to place normal samples above anomalous ones. Weak performance is expected in this setting, as reconstruction-based methods assume feature homophily and static structure, which are not well aligned with the heterogeneous and behavior-driven proxy graph constructed from UNSW-NB15. Within this unified setup, the structural–temporal model produced the strongest anomaly separation, while the dynamic transformer remained clearly stronger than the static reconstruction baseline. These observations are directionally consistent with the broader survey argument that temporal and structure-aware models are better suited to evolving cybersecurity anomalies than purely static methods.

Table 7.
Targeted empirical case study under a unified public-code pipeline.

Model	Paradigm	Setting	Observed result
DOMINANT²⁵	Static reconstruction	20 training epochs	AUROC $\approx 0.365$
TADDY⁵²	Dynamic transformer	5 training epochs	Total AUC $\approx 0.639$
StrGNN³²	Structural–temporal GNN	Consistent-budget run	AUROC $\approx 0.733$; AP $\approx 0.730$

AUROC: area under the receiver operating characteristic curve; AUC: area under the curve; AP: average precision.

Results are reported on the normalized public UNSW-NB15 release and should be interpreted as case-study outputs under a reproducible local graph-construction protocol.
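
To make the AUROC values in Table 7 easier to read, the sketch below computes AUROC from its pairwise definition: the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal sample. It is a generic illustration of the metric on toy data, not the evaluation code used for the case study; the function name `auroc` and the example scores are assumptions.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (anomaly, normal) pairs ranked correctly.

    Ties between an anomaly and a normal sample count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example in which the detector mostly scores normal samples higher.
labels = [0, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.7, 0.5, 0.2]

original = auroc(labels, scores)
flipped = auroc(labels, [-s for s in scores])  # reverse the ranking
print(original, flipped)
```

Negating the scores turns an AUROC of $a$ into $1 - a$, so a value such as 0.365 indicates a ranking that is systematically inverted relative to the labels rather than merely uninformative.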

Figure 6 complements the empirical case study by providing a compact cross-model comparison that is difficult to convey through text alone. In particular, it highlights the trade-offs between dynamic modeling strength, heterogeneous support, interpretability, scalability, and deployment readiness, helping readers quickly identify which representative models are better aligned with different cybersecurity settings.

Figure 6.

Visual comparison of representative HGNN-based anomaly detection models across anomaly granularity, temporal capability, heterogeneous support, interpretability, scalability, and deployment readiness. Filled cells indicate primary task support, while shaded labels indicate relative strengths inferred from the surveyed literature. HGNN: heterogeneous graph neural network.

5.2.6. Concrete recommendations for evaluation

To address the limitations identified in current datasets and move toward operationally relevant benchmarking, we propose the following recommendations:

Standardized temporal splits: Use chronological splits (e.g., 70%/10%/20% by time) instead of random shuffling to prevent “looking into the future” and to evaluate model robustness against concept drift.

Imbalance-aware metrics: Report both AUROC and AUPRC. Given that anomalies often represent less than 1% of data, AUPRC provides a more realistic measure of a model’s ability to isolate rare threats without excessive false positives.

Analyst-centric metrics (Precision@$K$): Evaluate models using Precision@$K$, where $K$ is calibrated to the typical daily alert capacity of a Security Operations Center (SOC) (e.g., $K = 50$ or $100$). This measures what fraction of the alerts that analysts can realistically review are true anomalies, and is therefore a direct proxy for alert fatigue.

Heterogeneity reporting: Explicitly report the number of node and edge types utilized. Models that achieve high performance by collapsing a heterogeneous graph into a homogeneous one should be critically evaluated for their loss of semantic nuance.
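
The first three recommendations can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions: the 70%/10%/20% split is read as train/validation/test, each event is assumed to carry a `time` field, AUPRC is approximated by step-wise average precision, and all function names are hypothetical.

```python
def chronological_split(events, train_pct=70, val_pct=10):
    """Split events by timestamp: oldest for training, newest for testing."""
    ordered = sorted(events, key=lambda e: e["time"])
    n = len(ordered)
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return ordered[:i], ordered[i:j], ordered[j:]


def _ranked_labels(labels, scores):
    """Labels sorted by descending anomaly score."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return [y for _, y in pairs]


def precision_at_k(labels, scores, k):
    """Fraction of the k highest-scored items that are true anomalies."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(_ranked_labels(labels, scores)[:k]) / k


def average_precision(labels, scores):
    """Average precision, a common step-wise estimate of AUPRC."""
    hits, total = 0, 0.0
    for rank, y in enumerate(_ranked_labels(labels, scores), start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this anomaly's rank
    if hits == 0:
        raise ValueError("no anomalies among the labels")
    return total / hits


# Toy data: ten events, two of them anomalous (label 1).
events = [{"time": t} for t in (5, 1, 9, 3, 7, 0, 2, 8, 4, 6)]
train, val, test = chronological_split(events)
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.5]

print(len(train), len(val), len(test))
print(precision_at_k(labels, scores, k=3))
print(average_precision(labels, scores))
```

Sorting by time before splitting guarantees that every training event precedes every validation and test event, which is the property that random shuffling destroys.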

5.2.7. Practical deployment considerations

Beyond traditional metrics, the transition from research to production requires addressing several deployment constraints:

Latency budgets: Benchmark end-to-end inference latency under realistic streaming or batched conditions, since cybersecurity alerts lose value when they arrive too late for operational response.

Memory footprint: Report GPU/CPU memory consumption during both training and inference, especially for temporal and transformer-based HGNNs intended for enterprise-scale graphs.

Explainability at triage time: Measure whether the model can surface supporting entities, relations, or substructures quickly enough for analyst validation, rather than treating explainability as an offline-only feature.

Drift robustness: Evaluate how performance degrades over time without retraining, and report the refresh cadence needed to maintain acceptable performance in changing environments.
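
As a rough sketch of how the latency and memory items could be reported, the harness below times a scoring function over a sequence of batches and records peak memory. It uses only the Python standard library; `toy_scorer` is a hypothetical stand-in for model inference, and `tracemalloc` only sees memory allocated by Python objects, so GPU memory would have to be read from the deep learning framework in use.

```python
import math
import statistics
import time
import tracemalloc


def benchmark(score_batch, batches, warmup=1):
    """Return median/p95 latency (ms) and peak Python heap use (bytes)."""
    if not batches:
        raise ValueError("need at least one batch")
    # Warm-up calls are excluded so one-off setup cost does not skew timing.
    for batch in batches[:warmup]:
        score_batch(batch)

    # Pass 1: wall-clock latency per batch, measured without tracing overhead.
    latencies_ms = []
    for batch in batches:
        start = time.perf_counter()
        score_batch(batch)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    # Nearest-rank 95th percentile.
    p95 = latencies_ms[math.ceil(0.95 * len(latencies_ms)) - 1]

    # Pass 2: peak memory allocated by Python objects (not GPU memory).
    tracemalloc.start()
    for batch in batches:
        score_batch(batch)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": p95,
        "peak_python_bytes": peak_bytes,
    }


def toy_scorer(batch):
    """Hypothetical stand-in for inference: score = mean feature value."""
    return [sum(features) / len(features) for features in batch]


batches = [[[float(i + j), 1.0, 2.0] for j in range(200)] for i in range(20)]
report = benchmark(toy_scorer, batches)
print(report)
```

Reporting a tail percentile alongside the median matters because alerting pipelines are constrained by their slowest batches, not their typical ones.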

6. Open challenges and future directions

Despite notable progress in HGNNs for anomaly detection, several obstacles remain before such methods can be reliably deployed in real-world cybersecurity systems.³,¹⁸,⁷⁵ These challenges span modeling limitations, data availability, evaluation practices, and deployment feasibility. Addressing them is essential for advancing both methodological rigor and operational readiness.

Figure 7 provides a structured summary of the major research directions highlighted in this section. It organizes the field’s next steps from current limitations through methodological and evaluation priorities to deployment goals, emphasizing that progress in HGNN-based cybersecurity anomaly detection depends not only on stronger models, but also on realistic benchmarks, reproducible evaluation, and operationally grounded system design.

Figure 7.

Roadmap for advancing HGNN-based anomaly detection in cybersecurity. The figure summarizes current limitations, near-term methodological priorities, evaluation and benchmark needs, and operational deployment goals required to move HGNN-based cyber anomaly detection toward practical use. HGNN: heterogeneous graph neural network.

Critical analysis: Despite strong reported results, many current HGNN-based anomaly detection methods still struggle in practice for five recurring reasons. First, their performance is highly sensitive to graph construction choices, including node and edge definitions, temporal windowing, and schema design, which are rarely standardized across studies. Second, many models assume relatively clean and stationary relational patterns, whereas real cybersecurity environments are noisy, incomplete, and subject to concept drift. Third, benchmark gains often depend on offline training and broad graph access, making them difficult to transfer to low-latency streaming settings. Fourth, anomaly scores are frequently insufficiently interpretable for analyst triage, limiting operational trust even when ranking performance is high. Fifth, many published evaluations emphasize detection performance in isolation, without accounting for memory footprint, alert fatigue, retraining cost, or integration into existing SOC workflows. As a result, the gap between benchmark success and deployment readiness remains a central weakness of the current literature.

Table 8 summarizes these recurring failure sources, how they appear in the current literature, and why they continue to limit practical deployment in cybersecurity settings.

Table 8.

Why current HGNN-based anomaly detection methods still struggle in practice.

Failure source	Manifestation in current literature	Operational consequence
Graph construction instability	Results change substantially with different node/edge definitions, temporal windows, and heterogeneous schemas; these choices are often underreported.	Weak cross-paper comparability and poor transfer from published benchmarks to real deployments.
Distribution shift and concept drift	Models are commonly trained on static or quasi-static data, while operational cyber environments evolve continuously.	Rapid performance degradation, retraining burden, and missed emerging threats.
Limited interpretability	Many methods output anomaly scores without clear supporting entities, relations, or subgraphs for analyst validation.	Low analyst trust, slower triage, and higher false-alert fatigue.
Computational and memory overhead	Temporal, transformer-based, and subgraph-centric methods often require expensive aggregation, history storage, or offline processing.	Difficulty meeting real-time latency budgets and scaling to enterprise graphs.
Weak benchmark realism	Common datasets are synthetic, simplified, or inconsistently pre-processed, with limited heterogeneity or sparse temporal labels.	Inflated experimental results that do not reliably predict field performance.
Evaluation misalignment	Studies frequently optimize AUROC or similar metrics without measuring deployment costs, alert capacity, or maintenance effort.	Models may look strong offline yet remain impractical for SOC integration.

AUROC: area under the receiver operating characteristic curve; SOC: security operations center.

Taken together, these gaps explain why strong benchmark performance does not automatically translate into operational usefulness. They also motivate the more specific modeling, data, and evaluation challenges discussed in the following subsections.

6.1. Modeling challenges

Cybersecurity graphs are complex and constantly evolving, creating several open challenges for HGNN-based anomaly detection.¹⁷,⁶⁵,⁶⁷ A key issue is temporal dynamics: many models rely on static graph snapshots and fail to capture long-term dependencies that are critical for identifying staged attacks or gradual privilege escalation.²⁹,⁵⁶ Another challenge lies in heterogeneity and modality, since real systems combine multiple entity types and often include logs, textual reports, or event traces that current approaches rarely exploit.²²

A further challenge concerns interpretability. Anomaly scores generated by HGNNs are often opaque, which hinders analysts from understanding or trusting model outputs in high-stakes cybersecurity environments. In operational settings, explainability is not merely desirable but essential for incident investigation, compliance auditing, and human–AI collaboration. Although attention mechanisms and feature attribution have been explored to provide interpretive insights,²⁵,³⁹ these methods typically offer coarse-grained explanations and are difficult to extend to heterogeneous graphs, where multiple relation types and semantics coexist. Future research should focus on integrating fine-grained interpretability within HGNN architectures, enabling anomaly reasoning that is traceable to specific entities, relations, and causal contexts.

Future research should also emphasize hybrid modeling architectures that combine temporal reasoning, multi-modal fusion, and interpretable learning. Promising directions include graph transformers for cross-type dependency modeling,²⁵ neuro-symbolic reasoning for explainable anomaly inference, and causal or contrastive frameworks that can disentangle spurious correlations from meaningful behavioral deviations.³⁹ Advancing these directions will require balancing transparency, scalability, and robustness within unified, end-to-end HGNN architectures.

6.2. Data challenges

Data limitations remain one of the most significant barriers to advancing HGNN-based anomaly detection.⁴,⁷⁰ A major challenge is label scarcity and imbalance: malicious events are rare and costly to annotate, while benign activities dominate most datasets.²⁶,³⁶ This imbalance makes models prone to bias and reduces their ability to detect rare but high-impact anomalies. Semi-supervised, self-supervised, and weakly supervised learning strategies show promise, yet they require careful adaptation to heterogeneous and temporal graph settings.⁶⁸

Another persistent issue is benchmark realism. Many publicly available datasets are synthetic, outdated, or lack the structural and temporal complexity observed in operational systems.³³ As a result, models that perform well on controlled datasets often fail to generalize in production environments. Closing this simulation-to-real gap will require large-scale, richly annotated, and temporally consistent benchmarks that capture noise, incomplete observability, and evolving adversarial tactics.⁴⁹ Collaboration between academia and industry could facilitate the creation of shared, privacy-preserving cybersecurity graph datasets. Future directions include federated graph learning for secure data sharing,⁵¹ synthetic data generation to augment rare attack cases, and transfer learning pipelines to adapt HGNNs across different environments while respecting privacy and confidentiality constraints.⁵³,⁵⁸

6.3. Evaluation challenges

Current studies employ highly varied evaluation metrics and graph construction protocols, which complicates cross-paper comparison and limits reproducibility.⁵,²⁷ Without standardized practices, it becomes difficult to measure progress or validate new methods consistently. Establishing community-agreed benchmarks with unified metrics (e.g., AUROC, Precision@$K$) and consistent pre-processing pipelines would enable fairer comparison and stronger claims of improvement.⁶,²³

Beyond metric unification, the field needs reproducible and transparent evaluation frameworks.²⁴ Public leaderboards, standardized data splits, and open-source codebases can encourage more rigorous benchmarking and reduce fragmentation. Another pressing need is for robust evaluation under realistic conditions, such as noisy, partially observable, or evolving graphs.⁴¹,⁵⁹ These efforts will help ensure that reported improvements reflect genuine robustness and operational value rather than overfitting to idealized datasets.

6.4. Practical deployment challenges

Deploying HGNN-based anomaly detection in real environments raises critical challenges of scalability, adaptability, and operational reliability.²⁸ Cybersecurity graphs often contain millions of nodes and edges that evolve in near real time, yet many existing HGNN architectures remain too computationally expensive for production use.³⁰,⁴³ Real-time decision latency is another major constraint: anomaly detectors must process streaming updates and generate alerts within strict time budgets to support timely incident response. Achieving this balance between inference speed and detection accuracy requires efficient message-passing schemes, streaming graph processing, and resource-aware model compression.⁷²

Beyond latency and scale, deployment also introduces security and privacy concerns. HGNNs may themselves become targets of adversarial manipulation, including model poisoning or evasion attacks that exploit their learned representations.³² Moreover, training on sensitive logs or network data raises privacy and compliance risks, especially when graph embeddings inadvertently encode identifiable or confidential information.³⁸ Mitigating these risks calls for privacy-preserving learning techniques such as federated or encrypted graph learning, as well as robust model auditing, explainability dashboards, and access-control mechanisms.

Operational environments further demand reliability, interpretability, and maintainability.⁶⁹ Threat behaviors evolve continuously, and static models can quickly degrade as attack patterns shift.⁵⁵ Online and continual learning techniques are required to adapt parameters as new data arrive without catastrophic forgetting.⁴⁵,⁵⁷ Equally important is interpretability at the deployment stage: transparent anomaly explanations allow analysts to validate alerts efficiently, reduce false positives, and improve overall response speed. Designing human-in-the-loop pipelines that integrate interpretability with adaptive learning will be essential to ensure operational trust and long-term resilience.

Future progress will depend on the development of lightweight, secure, and adaptive HGNN frameworks that can meet real-time constraints while preserving privacy and interpretability. Integrating scalable inference, adversarial robustness, and continual learning into unified operational pipelines will be central to advancing HGNN anomaly detection from research prototypes to robust, field-deployable cybersecurity systems.

7. Conclusion

This survey has provided a comprehensive review of HGNN methods for anomaly detection, with a focus on node-level, edge-level, and subgraph-level tasks. We introduced a taxonomy that categorizes existing models based on their anomaly detection targets and learning paradigms, encompassing reconstruction-based, contrastive, attention-driven, and temporal models. We reviewed more than 100 studies and cited 89 representative papers to highlight how HGNNs leverage heterogeneous graph structures, metapath-based semantics, and temporal dependencies to address complex anomaly detection scenarios.

A critical examination of benchmark datasets revealed both the strengths and limitations of existing resources. While widely used datasets such as Cora, Amazon, and CICIDS2017 have supported model evaluation, they often exhibit synthetic or sparse anomalies, limited heterogeneity, and inconsistent graph construction protocols. Future progress will rely on the creation of large-scale, realistic, and temporally annotated datasets that capture the dynamic nature of cybersecurity environments. Similarly, evaluation metrics such as AUROC, AUPRC, Precision@$K$, and NDCG should evolve toward standardized, task-aware protocols that incorporate operational factors such as detection latency, interpretability, and false-alert cost. Developing such unified benchmarks and evaluation suites will be essential for ensuring fair comparison, reproducibility, and meaningful advancement in HGNN-based anomaly detection.

Several persistent challenges emerged from this survey. First, the lack of standardized and richly annotated heterogeneous graph datasets remains a significant bottleneck, hindering fair comparisons and reproducibility. Second, most benchmarks are static, limiting the evaluation of dynamic or streaming HGNN models that are increasingly relevant in real-world applications such as cybersecurity and fraud detection. Third, evaluation practices often rely on a narrow set of metrics, which may not fully capture operational performance, especially in highly imbalanced settings.

Looking ahead, we identify three key directions for future research. First, there is a pressing need to develop benchmark suites that provide standardized graph schemas, temporal annotations, and clear anomaly ground truth across diverse domains. Second, more attention should be paid to robust evaluation frameworks, incorporating adaptive thresholding, uncertainty quantification, and cost-sensitive metrics to better reflect deployment realities. Third, advancing HGNN architectures that can jointly model structural, semantic, and temporal irregularities while also maintaining scalability and interpretability remains an open research frontier. By consolidating the current landscape of HGNN-based anomaly detection, this survey aims to serve as both a reference and a catalyst for future innovation in the field. We envision that advances in HGNNs, supported by standardized datasets and robust evaluation protocols, will pave the way toward operationally viable anomaly detection systems that are not only accurate but also scalable, interpretable, and resilient against evolving threats.

Footnotes

Acknowledgments

The authors acknowledge the Curtin Cybersecurity Research Group for their support and feedback during this study.

ORCID iDs

Laura Jiang

Reza Ryan

Qian Li

Nasim Ferdosian

Ethical considerations and informed consent

Not applicable. This article does not involve studies with human participants or animals.

Author contributions

All authors contributed to the conceptualization, literature analysis, and manuscript writing. Laura Jiang led the survey design, taxonomy development, and final editing. Reza Ryan supervised the research framework and provided technical review. Qian Li contributed to data interpretation and literature validation. Nasim Ferdosian reviewed the manuscript and offered critical revisions.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

Data sharing not applicable to this article as no new data were created or analyzed in this study.

References

Liu

Pan

Wang

, et al. Anomaly detection in dynamic graphs via transformer. IEEE Trans Knowl Data Eng 2023; 35: 12081–12094.

Sikos

. Cybersecurity knowledge graphs. Knowl Inf Syst 2023; 65: 3511–3531.

Scarselli

Gori

Tsoi

, et al. The graph neural network model. IEEE Trans Neural Netw 2009; 20: 61–80.

Zhang

Song

Huang

, et al. Heterogeneous graph neural network. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, 2019, pp.793–803, Association for Computing Machinery.

Wang

Kong

Huang

, et al. Survey of graph neural network. Jisuanji Gongcheng/Comput Eng 2021; 47: 1–12.

Lamichhane

Eberle

. Anomaly detection in graph structured data: a survey. 2024.

Xue

, et al. A comprehensive survey on graph anomaly detection with deep learning. IEEE Trans Knowl Data Eng 2023; 35: 12012–12038.

Pan

Chen

, et al. A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 2021; 32: 4–24.

Caville

Layeghy

, et al. Anomal-E: a self-supervised network intrusion detection system based on graph neural networks. Knowl Based Syst 2022; 258: 110030.

10.

Nguyen

Kashef

. TS-IDS: traffic-aware self-supervised learning for IoT network intrusion detection. Knowl Based Syst 2023; 279: 110966.

11.

Shi

van Leeuwen

. Graph neural network based log anomaly detection and explanation. CoRR. 2023;abs/2307.00527.

12.

. Interpretable spatial-temporal graph convolutional network for system log anomaly detection. Adv Eng Inform 2024; 62: 102803.

13.

Wang

Han

, et al. BS-GAT: a network intrusion detection system based on graph neural network for edge computing. Cybersecurity 2025; 8: 27.

14.

Luša

Pintar

Vranić

. TE-G-SAGE: explainable edge-aware graph neural networks for network intrusion detection. Modelling 2025; 6: 165.

15.

Carletti

Foggia

Rosa

, et al. Detecting malicious IoT network communication through graph neural networks in real-world conditions. Pattern Recognit Lett 2025; 189: 92–98.

16.

Alshehri

Sharaf

Molla

. Systematic review of graph neural network for malicious attack detection. Information 2025; 16: 470.

17.

Duan

Huang

Chen

, et al. Semi-supervised classification of fundus images combined with CNN and GCN. J Appl Clin Med Phys 2022; 23: e13746.

18.

Wang

Cui

, et al. Heterogeneous graph attention network. In: The web conference 2019—Proceedings of the world wide web conference, WWW 2019. 2019.

19.

Dong

Wang

, et al. Heterogeneous graph transformer. In: Proceedings of the web conference 2020, 2020, pp.2704–2710. New York, NY, USA: ACM.

20.

Samy

Giaretta

Kefato

, et al. SchemaWalk: Schema aware random walks for heterogeneous graph embedding. In: Companion proceedings of the web conference 2022, 2022, pp.1157–1166. New York, NY, USA: ACM.

21.

Rossi

Chamberlain

Frasca

, et al. Temporal graph networks for deep learning on dynamic graphs. 2020.

22.

Liu

Dong

, et al. Learning under concept drift: a review. IEEE Trans Knowl Data Eng 2019; 31: 2346–2363.

23.

Ding

Liu

. Interactive anomaly detection on attributed networks. In: Proceedings of the twelfth ACM international conference on web search and data mining, 2019, pp.357–365. New York, NY, USA: ACM.

24.

Liuliakov

Schulz

Hermes

, et al. One-class intrusion detection with dynamic graphs. Lect Notes Comput Sci 2023; 14254: 537–549.

25.

Ding

Bhanushali

, et al. Deep anomaly detection on attributed networks. In: SIAM international conference on data mining, SDM 2019, 2019.

26.

Jia

Xiong

Nan

, et al. MAGIC: detecting advanced persistent threats via masked graph representation learning. 2023.

27.

Kiani

Keshavarzi

Bohlouli

. Detection of thin boundaries between different types of anomalies in outlier detection using enhanced neural networks. Appl Artif Intell 2020; 34: 345–377.

28.

Hayes

Capretz

. Contextual anomaly detection framework for big sensor data. J Big Data 2015; 2: 2.

29.

Chen

Weinberger

. Fast flux discriminant for large-scale sparse nonlinear classification. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp.621–630. New York, NY, USA: ACM.

30.

Glasser

Lindauer

. Bridging the gap: a pragmatic approach to generating insider threat data. In: 2013 IEEE security and privacy workshops, 2013, pp.98–104. IEEE.

31.

Moustafa

Slay

. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In: 2015 Military communications and information systems conference (MilCIS), 2015, pp.1–6. IEEE.

32.

Cai

Chen

Luo

, et al. Structural temporal graph neural networks for anomaly detection in dynamic graphs. In: International conference on information and knowledge management, proceedings. 2021.

33.

Schlichtkrull

Kipf

Bloem

, et al. Modeling relational data with graph convolutional networks. Lect Notes Comput Sci 2018; 10843: 593–607.

34.

Zhang

Zhao

. Unsupervised deep subgraph anomaly detection. In: Proceedings—IEEE International conference on data mining, 2022. ICDM.

35.

Zhang

Xiang

Guo

, et al. SubAnom: efficient subgraph anomaly detection framework over dynamic graphs. In: 2023 IEEE international conference on data mining workshops (ICDMW), 2023, pp.1178–1185. IEEE.

36.

Zong

Zhuang

Shao

, et al. Structural–temporal coupling anomaly detection with dynamic graph transformer. 2025.

37.

Shao

, et al. Temporal subgraph contrastive learning for anomaly detection on dynamic attributed graphs. Appl Intell 2025; 55: 667.

38.

Yuan

Zhou

, et al. Higher-order structure based anomaly detection on attributed networks. In: Proceedings—2021 IEEE international conference on big data, big data 2021, 2021.

39.

Feng

You

Zhang

, et al. Hypergraph neural networks. In: 33rd AAAI conference on artificial intelligence, 2019.

40.

Chen

Zhang

, et al. Graph contrastive learning for anomaly detection. 2021.

41.

Wang

Liu

Han

, et al. Self-supervised heterogeneous graph neural network with co-contrastive learning. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. 2021.

42.

Huang

, et al. SpecAE. In: Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp.2233–2236. New York, NY, USA: ACM.

43.

Fang

Feng

Gui

, et al. Anonymous edge representation for inductive anomaly detection in dynamic bipartite graph. Proc VLDB Endowment 2023; 16: 1154–1167.

44.

Zhang

Huang

, et al. eFraudCom: an e-commerce fraud detection system via competitive graph neural networks. ACM Trans Inf Syst 2022; 40: 1–29.

45.

Wang

Cui

, et al. OCGNN: one-class classification with graph neural networks.

46.

Peng

Luo

, et al. A deep multi-view framework for anomaly detection on attributed networks (extended abstract). In: 2023 IEEE 39th international conference on data engineering (ICDE), 2023, pp.3799–3800. IEEE.

47.

Liu

Dou

, et al. Alleviating the inconsistency problem of applying graph neural network to fraud detection. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, 2020, pp.1569–1572. New York, NY, USA: ACM.

48.

Zhang

H-K

Zhang

Y-G

Zhou

, et al. HONGAT: graph attention networks in the presence of high-order neighbors. Proc AAAI Conf Artif Intell 2024; 38: 16750–16758.

49.

Deng

Hooi

. Graph neural network-based anomaly detection in multivariate time series. 2021.

50.

Zheng

Yuan

, et al. One-class adversarial nets for fraud detection. Proc AAAI Conf Artif Intell 2019; 33: 1286–1293.

51.

Y-J

C-T

. GCAN: graph-aware co-attention networks for explainable fake news detection on social media. In: Proceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp.505–514.

52.

Liu

Zheng

Song

, et al. TADDY: temporal anomaly detection in dynamic graphs via transformer representations. In: Proceedings of the ACM international conference on information and knowledge management (CIKM). 2022.

53.

Pang

Chen

, et al. HRGCN: heterogeneous graph-level anomaly detection with hierarchical relation-augmented graph neural networks. In: 2023 IEEE 10th international conference on data science and advanced analytics (DSAA), 2023, pp.1–10. IEEE.

54. Farrukh, Wali, Khan, et al. XG-NID: dual-modality network intrusion detection using a heterogeneous graph neural network and large language model. Expert Syst Appl 2025; 287: 128089.

55. Zheng, et al. AddGraph: anomaly detection in dynamic graph using attention-based temporal GCN. 2019.

56. Zhu, Liu. A flexible attentive temporal graph networks for anomaly detection in dynamic networks. In: 2020 IEEE 19th international conference on trust, security and privacy in computing and communications (TrustCom), 2020, pp.870–875. IEEE.

57. Bian, Xiao, et al. Rumor detection on social media with bi-directional graph convolutional networks. Proc AAAI Conf Artif Intell 2020; 34: 549–556.

58. Chang, Liu. Human-related anomalous event detection via spatial–temporal graph convolutional autoencoder with embedded long short-term memory network. Neurocomputing 2022; 490: 482–494.

59. Cheng, Liu. Improving cyberbullying detection with user interaction. In: Proceedings of the web conference 2021, 2021, pp.496–506. New York, NY, USA: ACM.

60. Zhu, Yan, et al. MHGNN: multi-view fusion based heterogeneous graph neural network. Appl Intell 2024; 54: 8073–8091.

61. Wang, Chen, et al. Heterogeneous graph matching networks for unknown malware detection. 2019.

62. Zhang, Yin, Chen, et al. GCN-based user representation learning for unifying robust recommendation and fraudster detection. In: SIGIR 2020—Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, 2020, pp.689–698. Association for Computing Machinery, Inc.

63. Jiang, Liu, Wang, et al. OCGATL: one-class graph attention networks with transformation learning for anomaly detection for ARGO data. 2024, pp.152–173.

64. Pang, Chen, et al. Deep graph-level anomaly detection by glocal knowledge distillation. In: Proceedings of the fifteenth ACM international conference on web search and data mining, 2022, pp.704–714. New York, NY, USA: ACM.

65. Zhu, et al. Hierarchical graph convolutional networks for semisupervised node classification. 2019.

66. Huang. Accelerated attributed network embedding. In: Proceedings of the 2017 SIAM international conference on data mining, 2017, pp.633–641.

67. Pan, Liu, Zheng, et al. PREM: a simple yet effective approach for node-level graph anomaly detection. 2023.

68. Kipf, Welling. Semi-supervised classification with graph convolutional networks. In: 5th international conference on learning representations, ICLR 2017. 2017.

69. Wang, Lin, Cui, et al. A semi-supervised graph attentive network for financial fraud detection. 2020. Available at: https://arxiv.org/abs/2003.01171.

70. Dong, Zheng, Quoc Viet Hung, et al. Multiple rumor source detection with graph convolutional networks. In: Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp.569–578. New York, NY, USA: ACM.

71. Eswaran, Faloutsos, Guha, et al. SpotLight: detecting anomalies in streaming graphs. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, 2018.

72. Shah, Beutel, Hooi, et al. EdgeCentric: anomaly detection in edge-attributed networks. In: IEEE international conference on data mining workshops, ICDMW. 2016.

73. Jiang, Cui, Beutel, et al. CatchSync: catching synchronized behavior in large directed graphs. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. 2014.

74. Hooi, Song, Beutel, et al. FRAUDAR: bounding graph fraud in the face of camouflage. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. 2016.

75. Sun, Wang, et al. Anomaly subgraph detection through high-order sampling contrastive learning. In: Proceedings of the thirty-third international joint conference on artificial intelligence, 2024, pp.2362–2369.

76. Dong, Wang, et al. Heterogeneous graph transformer. In: Proceedings of the web conference 2020, 2020, pp.2704–2710. New York, NY, USA: ACM.

77. Milajerdi, Gjomemo, Eshete, et al. HOLMES: real-time APT detection through correlation of suspicious information flows. In: 2019 IEEE symposium on security and privacy (SP), 2019, pp.1137–1152. IEEE.

78. Goodfellow, Pouget-Abadie, Mirza, et al. Generative adversarial networks. Commun ACM 2020; 63: 139–144.

79. Lindauer. CERT Insider Threat Dataset [Internet]. Carnegie Mellon University; 2020. Available at: https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247/1 (accessed 12 June 2025).

80. Turcotte MJM, Kent, Hash. Unified host and network data set. In: Data science for cyber-security, 2018, pp.1–22. World Scientific.

81. Lavanya, Glory, Aggarwal, et al. Unmasking insider threats using a robust hybrid optimized generative pretrained neural network approach. Sci Rep 2025; 15: 26718.

82. Moustafa, Slay.

83. Sharafaldin, Lashkari, Ghorbani. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: Proceedings of the 4th international conference on information systems security and privacy (ICISSP), 2018, pp.108–116.

84. García, Grill, Stiborek, et al. The CTU-13 dataset: a labeled dataset with botnet, normal and background traffic. 2014.

85. Zheng, Song, et al. Financial transaction fraud detector based on imbalance learning and graph neural network. Appl Soft Comput 2023; 149: 110984.

86. FiveDirections and DARPA. Operationally Transparent Cyber (OpTC) Data Release [Internet]. Available at: https://github.com/FiveDirections/OpTC-data.

87. Stojanović, Hofer-Schmitz, Kleb. APT datasets and attack modeling for automated detection methods: a review. Comput Secur 2020; 92: 101734.

88. Chang, Newman. Receiver operating characteristic (ROC) curves: the basics and beyond. Hosp Pediatr 2024; 14: e330–e334.

89. Danesh Pazho, Alinezhad Noghre, Rahimi Ardabili, et al. CHAD: Charlotte anomaly dataset. 2023, pp.50–66.

90. Sen, Namata, Bilgic, et al. Collective classification in network data. AI Mag 2008; 29: 93–106.

91. McAuley. Amazon product data [Internet]. University of California San Diego. Available at: https://cseweb.ucsd.edu/jmcauley/datasets/amazon/links.html.

92. Yelp Open Dataset [Internet]. Yelp. Available at: https://www.yelp.com/dataset.

93. DBLP computer science bibliography [Internet]. Available at: https://dblp.org/.

94. eBay. xFraud [Internet]. GitHub repository. Available at: https://github.com/eBay/xFraud.

95. Eswaran, Faloutsos. SedanSpot: detecting anomalies in edge streams. In: 2018 IEEE international conference on data mining (ICDM), 2018, pp.953–958. IEEE.

96. Fey, Zitnik, et al. Open graph benchmark: datasets for machine learning on graphs steering committee [Internet]. Available at: https://ogb.stanford.edu.