Abstract
Knowledge graphs are freely aggregated, published, and edited in the Web of data, and thus may overlap. Hence, a key task resides in aligning (or matching) their content. This task encompasses the identification, within an aggregated knowledge graph, of nodes that are equivalent, more specific, or weakly related. In this article, we propose to match nodes within a knowledge graph by (i) learning node embeddings with Graph Convolutional Networks such that similar nodes have low distances in the embedding space, and (ii) clustering nodes based on their embeddings, in order to suggest alignment relations between nodes of a same cluster. We conducted experiments with this approach on the real world application of aligning knowledge in the field of pharmacogenomics, which motivated our study. We particularly investigated the interplay between domain knowledge and GCN models with the two following focuses. First, we applied inference rules associated with domain knowledge, independently or combined, before learning node embeddings, and we measured the improvements in matching results. Second, while our GCN model is agnostic to the exact alignment relations (e.g., equivalence, weak similarity), we observed that distances in the embedding space are coherent with the “strength” of these different relations (e.g., smaller distances for equivalences), letting us considering clustering and distances in the embedding space as a means to suggest alignment relations in our case study.
Introduction
The Semantic Web [3] offers tools and standards that facilitate the construction of knowledge graphs [17] that may aggregate data and elements of knowledge of various provenances. The combined use of these scattered elements of knowledge allows access to a larger extent of the available knowledge, which is beneficial to many applications, such as fact-checking or query answering. For this conjoint use to be possible, one crucial task lies in matching units across knowledge graphs or within an aggregated knowledge graph, i.e., finding alignments or correspondences between nodes, edges, or subgraphs. This task is well-studied in the Ontology Matching research field [12] and is challenging since knowledge graphs differ in quality, completeness, vocabularies, and languages. Consequently, different alignment relations may hold between units: some may indicate that two units are equivalent, weakly related, or that one is more specific than the other.
In the present work, we focus on matching specific nodes within an aggregated knowledge graph represented within Semantic Web standards. We view such a knowledge graph as a directed and labeled multigraph in which nodes represent entities of a world – also named individuals – (e.g., places, drugs), literals (e.g., dates, integers), or classes of individuals (e.g.,

Outline of our approach. Gold clusters are computed from existing alignments between the nodes to match in the knowledge graph (e.g.,
We propose to match specific individuals that represent n-ary relationships through an approach that combines graph embedding and clustering, outlined in Fig. 1. Graph embeddings are low-dimensional vectors that represent graph substructures (e.g., nodes, edges, subgraphs) while preserving as much as possible the properties of the graph [5]. More precisely, we learn node embeddings with Graph Convolution Networks (GCNs) [20,29] such that similar nodes have a low distance between their embeddings. We employ graph embeddings since their continuous nature may provide the needed flexibility to cope with the heterogeneous representations of nodes to match [15]. GCNs compute the embedding of a node by considering the embeddings of its neighbors in the graph. Hence, nodes with similar neighborhoods will have similar embeddings, what is well-adapted to a structural and relational matching approach [26,33].
To suggest alignment relations from node embeddings, we apply a clustering algorithm on the embedding space and consider nodes that belong to the same cluster as similar. The resulting clusters are evaluated by comparison with gold clusters, i.e., reference clusters that we aim to reproduce. We define these gold clusters as groups of nodes linked directly or indirectly by preexisting alignments we obtained from a rule-based method previously published [21]. These pre-existing alignments use five different alignment relations. For example, nodes may be identical (
Within our approach, we particularly investigated the interplay between GCNs and domain knowledge through the two following aspects. First, similarly to existing works with different embedding models [18], we applied various inference rules associated with domain knowledge (e.g., class and predicate hierarchies, symmetry and transitivity of predicates), independently or combined, before learning node embeddings, and we measured the improvements or declines in matching results. Second, we explored how embeddings can differentiate between different types of alignment relations. We made our GCN model agnostic to these exact relations during learning. However, we observed that distances between the embeddings of similar nodes are different and coherent with the type and “strength” of each alignment relation (e.g., smaller distances for equivalences, larger distances for weak similarities). Such results allow us to think that distances in the embedding space can be used to suggest alignment relations to connect nodes, in respect with distinct types of similarities. To the best of our knowledge, our approach is the first one to investigate these aspects in a matching task, combining GCNs and clustering.
Our approach based on GCNs was motivated by the need to align pharmacogenomic (PGx) knowledge that we previously aggregated in a knowledge graph named PGxLOD [22]. The biomedical domain of PGx studies the influence of genetic factors on drug response phenotypes. As an example, Fig. 2 depicts the relationship

Representation of a PGx relationship between gene
The remainder of this paper is organized as follows. In Section 2, we outline some works related to node matching in knowledge graphs and graph embeddings. We detail the core of our matching approach (node embeddings and clustering) in Section 3, and how inference rules associated with domain knowledge are considered in Section 4. In Section 5, we conduct experiments with this approach on PGxLOD, a large knowledge graph we built that contains 50,435 PGx relationships [22]. Finally, we discuss our results and conclude in Section 6 and 7.
Matching
Numerous papers exist about knowledge graph matching. The interested reader could refer to the book of Euzenat and Shvaiko [12] for a formalization of the matching task, and a detailed presentation of the main methods. In the following, we focus on graph embedding techniques. Such techniques have been successfully applied on knowledge graphs for various tasks such as node classification, link prediction, or node clustering [5,32]. Interestingly, the task of matching nodes can be alternatively tackled as a link prediction task (i.e., predicting alignments between nodes) or as a node clustering task (i.e., grouping similar nodes into clusters). Here, we choose the node clustering approach.
Graph embedding
Existing papers about graph embedding differ in the considered type of graphs (e.g., homogeneous graphs, heterogeneous graphs such as knowledge graphs) or in the embedding techniques (e.g., matrix factorization, deep learning with or without random walk). The survey of Cai et al. [5] presents a taxonomies of graph embedding problems and techniques. Hereafter, a few specific examples are detailed but a more thorough overview can be found in the following surveys [5,24,32]. Some approaches are translational. For example, TransE [4] computes for each triple
Graph Convolutional Networks (GCNs)
The approach adopted in this article is based on Graph Convolutional Networks (GCNs). GCNs have been introduced for semi-supervised classification over graphs [20] and extended for entity classification and link prediction in knowledge graphs [29]. In contrast with TransE and RDF2Vec that work at the triple and sequence levels, GCNs compute the embedding of a node by considering its neighborhood in the graph. Hence, as aforementioned, we believe GCNs are well-suited for our application of matching reified n-ary relationships since similar relationships have similar neighborhoods. Other existing works rely on this assumption that similar nodes have similar neighborhoods and use GCNs for their matching. For example, Wang et al. [33] propose to align cross-lingual knowledge graphs by using GCNs to learn node embeddings such that nodes representing the same entity in different languages have close embeddings. Pang et al. [26] use the same approach to align two knowledge graphs, but introduce an iterative aspect. Some newly-aligned entities are selected and used when learning embeddings in the next iteration. To avoid introducing false positive alignments, the newly-aligned entities are selected with a distance-based criteria proposed by the authors. Interestingly, the two previous approaches take into account literals in the embedding process and use the triplet loss, also used by TransE. On the contrary, in our work, we discard literals and use the Soft Nearest Neighbor loss [13] to consider all positive and negative examples instead of sampling.1
GCNs and the Soft Nearest Neighbor loss are further detailed in Section 3.2.
However, previous methods do not consider inference rules associated with domain knowledge represented in knowledge graphs on the contrary of recent papers [27]. For example, Iana and Paulheim [18] evaluate the RDF2Vec embedding model when inferred triples associated with subproperties, symmetry, and transitivity of predicates are added to the knowledge graph. Interestingly, the addition of inferred triples seems to degrade the performance of RDV2Vec embeddings in downstream applications (e.g., regression, classification). Instead of materializing inferred triples into the knowledge graph, d’Amato et al. [10] propose to inject domain knowledge in the learning process by defining specific loss functions and scoring functions for triples. Logic Tensor Networks [30] learn groundings of logical terms and logical clauses. The grounding of a logical term consists in a vector of real numbers (i.e., an embedding) and the grounding of a logical clause is a real number in the interval
These related works and our preliminary results [23] inspired the present work where we investigate how (i) inference rules associated with domain knowledge can improve the performances in node matching and (ii) the distance in the embedding space is representative of the type and “strength” of alignment relations, and thus can be used to suggest the specific relation to use between matched nodes.
Matching nodes with Graph Convolutional Networks and clustering
Approach outline
Our approach is outlined in Fig. 1.
It takes as input an aggregated knowledge graph
Learn embeddings for all nodes in
Apply a clustering algorithm only on the embeddings of nodes from S and consider nodes belonging to the same cluster as similar (Section 3.3).
It should be noted that gold clusters can result from another automatic matching method or a manual alignment by an expert. For example, in Section 5, our gold clusters are computed from alignments semi-automatically obtained with rules manually written by experts [21]. These alignments can use different alignment relations (e.g., equivalence, weak similarity). We further detail in Section 5.1 how distinct relations are taken into account in our experiments.
Learning node embeddings with Graph Convolutional Networks and the Soft Nearest Neighbor loss
To learn embeddings for all nodes in
Graph Convolutional Networks (GCNs) can be seen as a message-passing framework of multiple layers, in which the embedding
The number of predicates in
Recall that our objective is to cluster similar nodes, which differs from previous applications of GCNs (e.g., node classification, link prediction [20,29,33]). Hence, we propose to train GCNs from scratch by minimizing the Soft Nearest Neighbor (SNN) loss which, to the best of our knowledge, has never been used with GCN models before. This loss was defined by Frosst et al. [13] and is presented in Eq. (3),
A set N of nodes belonging to the gold clusters (see Section 5.2). A set Y of labels for nodes in N. These labels corresponds to the assignments of nodes in N to the gold clusters. A temperature T. Embeddings h of nodes. These embeddings are the output of the last layer of the GCN model.
Minimizing the SNN loss corresponds to minimizing intra-cluster distances and maximizing inter-cluster distances for the gold clusters of nodes in N. The temperature T determines how distances influence the loss. Indeed, distances between widely separated embeddings are taken into account when T is large whereas only distances between close embeddings are taken into account when T is small. To avoid T as an hyperparameter of the model, we adopt the same learning procedure as Frosst et al. [13]: T is initialized to a predefined value and is optimized by learning
The computation of
This step enables to learn embeddings for all nodes in
Matching nodes by clustering their embeddings
After embeddings of all nodes in the graph have been output by the last layer of the GCN, we perform a clustering on embeddings
We conduct comparative experiments with three distinct clustering algorithms presented in Table 1, in regards with three classical metrics presented in Table 2. Within the large set of existing algorithms, our choice has been guided by the constraints of our task that requires to handle an important number of clusters, potentially large, and with uneven sizes (see Section 5.1 and Fig. 3 for the sizes of gold clusters computed on PGxLOD). We decided to arbitrarily limit ourselves to three algorithms, but decided to opt for algorithms that cover some diversity in the various family of algorithms. Our three algorithms differ in their parameters: in particular they require either the number of clusters to find (Ward and Single) or the minimum size of clusters (OPTICS). We used our set of gold clusters to set these parameters. Considering these distinct algorithms allows us to evaluate the influence of inference rules in different settings (see Section 4).
Clustering algorithms applied on the embeddings of nodes in S. Nodes that belong to the same predicted cluster are considered as similar
Clustering algorithms applied on the embeddings of nodes in S. Nodes that belong to the same predicted cluster are considered as similar
Performance metrics used to compare the clusters predicted by the algorithms presented in Table 1 with gold clusters
Semantic Web knowledge graphs are represented within formalims such as Description Logics [2] that are equipped with inference rules. Hence, we propose to evaluate the improvements in the results of our matching approach (detailed in Section 3) when considering such inference rules, independently or combined. Here, we only consider the inference rules associated with the following logic axioms: class and predicate assertions, equivalence axioms between entities or classes, subsumption axioms between classes or predicates, and axioms defining predicate inverses. We consider this limited set of inference rules because they are the only ones actionable in PGxLOD, the knowledge graph that motivated our study. Accordingly, we generate six different graphs (
Visual summary of the transformations of
to evaluate the influence of the application of inference rules associated with domain knowledge on node matching.
is the baseline that corresponds to no inference rules being run and the systematic addition of abstract inverses
Visual summary of the transformations of
For a predicate For a predicate Otherwise, for a predicate
We conducted experiments with PGxLOD,2
PGxLOD presents several characteristics needed in the scope of our study. First, PGxLOD contains nodes whose matching is well-adapted to a structure-based approach such as ours. Additionally, alignments are expected to be found between these nodes. Indeed, PGxLOD contains 50,435 PGx relationships resulting from:
an automatic extraction from the reference database PharmGKB; an automatic extraction from the biomedical literature; a manual representation of 10 studies made from Electronic Health Records of hospitals.
Alignments are expected to be found between such relationships since, for example, PharmGKB is manually curated by experts after a literature review. Recall that PGx relationships are n-ary, and thus they are reified as nodes, as illustrated in Fig. 2 [25]. Hence, nodes representing these relationships form our set S of nodes to match. The reification process entails that neighbors of such nodes are the drugs, genetic factors, and phenotypes involved in the relationships. Consequently, similar relationships have similar neighborhoods, which makes a structure-based approach such as ours well-adapted for their matching.
Second, PGxLOD contains
Third, PGxLOD contains subsumption axioms between classes and between predicates, which makes possible the transformations represented in
Alignment relations considered in each gold clustering to compute the gold clusters used in our experiments. We indicate whether a relation is transitive (T or ¬ T) and symmetric (S or ¬ S)
Alignment relations considered in each gold clustering to compute the gold clusters used in our experiments. We indicate whether a relation is transitive (T or ¬ T) and symmetric (S or ¬ S)
Fourth, some PGx relationships in S are already labeled as similar through alignments resulting from the application of five matching rules previously published [21]. These alignments use the five following alignment relations: Number of gold clusters (y-axis) by size (x-axis) for each gold clustering. The max value is the maximum size of gold clusters (in terms of number of nodes). The minimum size is 1 for every gold clustering. Only gold clusters larger than 10, 20, and 50 nodes are later used to compute performance metrics. Gold clusterings are defined in Table 4.
We experimented our approach with different pairs
An architecture formed by 3 GCN layers is used to learn node embeddings. The input layer consists of a featureless approach as in [20,29], i.e., the input is just a one-hot vector for each node of the graph. It should be noted that this one-hot vector encoding can scale to relatively large knowledge graphs because (i) we use look-up mechanisms in weight matrices based on node indices, and thus one-hot vectors are actually never stored in memory or used in computations, and (ii) we use a basis-decomposition to limit the number of parameters. All three layers of our architecture have an output dimension of 16. Therefore, output embeddings for all nodes in the knowledge graph are in
The 3-hop neighborhood of a node n consists of all the nodes that can be reached with a breadth-first traversal that starts at n and traverses at most 3 edges.
Statistics of PGxLOD and its transformations as described in Section 4. Statistics for PGxLOD discard literals and edges incident to literals. As we use a 3-layer architecture, statistics for all
Only the embeddings of nodes in S (here, the PGx relationships) are considered in our clustering task. Hence, only these embeddings are constrained in the SNN loss. However, in
Clustering algorithms are only applied on the embeddings of nodes in
Results on all gold clusterings and graphs
Results on
During learning and clustering, our model is unaware of the different alignment relations holding between similar nodes. Indeed, the SNN loss only considers labels of gold clusters that do not indicate the alignment relations used to compute these clusters. This is particularly relevant for gold clusterings

Distributions of distances between similar nodes by alignment relation for each test set, the
Impact of inference rules
It appears that
In
Impact of clustering set-up
Clustering performances are generally better for gold clusterings
Among the considered clustering algorithms, Single generally performs better than the others. For
Distance analysis
Regarding the distance analysis of node embeddings, Fig. 4 shows that distances between similar nodes are different depending on the alignment relation holding between them. Recall that our GCN model is agnostic to these alignment relations when computing the SNN loss. Interestingly, distances reflect the “strength” of the alignment relations: strong similarities (i.e.,
Towards a further integration of domain knowledge in GCNs
Our results highlight the interest of considering domain knowledge associated with knowledge graphs in embedding approaches and seem to advocate for a further integration of domain knowledge within embedding models. Future works may investigate the same targets with additional inferences rules (e.g., from OWL 2 RL semantics) or different embedding techniques, whether based on graph neural networks [11] or others (e.g., translational approaches such as TransE). Additionally, we did not use attention mechanisms, which could also consider domain knowledge as in Logic Attention Network [31]. Here, inference rules associated with domain knowledge are used to transform the knowledge graph as a pre-processing operation. However, we could envision to consider such mechanisms directly in the model (e.g., weight sharing between predicates and their super-predicates). Literals could also be taken into account [33]. In a larger perspective, one major future work lies in investigating if and how other semantics than types of alignments can emerge in the output embedding space.
Generalization to other knowledge graphs
Despite our approach being motivated by the matching of individuals within an aggregated knowledge graph, its transposition to distinct graphs could be explored. Such a perspective could allow to assess the generalization of our approach and its results. In particular, we could consider knowledge graphs that are not completely independent such as LOD datasets that are connected through major LOD hubs. Recall that our approach is supervised since gold clusters are computed from preexisting alignments. Hence, testing our approach on different knowledge graphs of the LOD would require such preexisting alignments or using ontology alignment systems in a distant supervision process [8]. In this setting, merging the different graphs into one and learning a “global” embedding, as we did, may provide positive results but may pose additional scalability issues.
Conclusion
In this paper, we proposed to match entities of a knowledge graph by learning node embeddings with Graph Convolutional Networks (GCNs) and clustering nodes based on their embeddings. We particularly investigated the interplay between formal semantics associated with knowledge graphs and GCN models. Our results showed that considering inference rules associated with domain knowledge tends to improve performance. Additionally, even if our GCN model was agnostic to the exact alignment relations holding between entities (e.g., equivalence, weak similarity), distances in the embedding space are coherent with the “strength” of the alignment relations. These results seem to advocate for a further integration of formal semantics within embedding models.
Footnotes
Acknowledgements
This work was supported by the PractiKPharma project, founded by the French National Research Agency (ANR) under Grant ANR15-CE23-0028, and by the Snowball Inria Associate Team.
Detailed results of clustering experiments
Detailed results of clustering experiments are available Table 8, Table 9 and Table 10.
Results of clustering nodes that belong to gold clusters whose size is greater or equal to 50 for graphs Results of clustering nodes that belong to gold clusters whose size is greater or equal to 20 for graphs Results of clustering nodes that belong to gold clusters whose size is greater or equal to 10 for graphs Results of clustering nodes that belong to gold clusters whose size is greater or equal to 50 for Results of clustering nodes that belong to gold clusters whose size is greater or equal to 20 for Results of clustering nodes that belong to gold clusters whose size is greater or equal to 10 for
ACC
ARI
NMI
ACC
ARI
NMI
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
ACC
ARI
NMI
ACC
ARI
NMI
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
ACC
ARI
NMI
ACC
ARI
NMI
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
Ward
Single
OPTICS
ACC
ARI
NMI
ACC
↓
ARI
↓
NMI
↓
↓
↓
ACC
ARI
NMI
↓
ACC
↓
ARI
NMI
↓
ACC
↓
↓
ARI
↓
↓
NMI
↓
↓
↓
ACC
ARI
NMI
↓
Ward
Single
OPTICS
ACC
ARI
NMI
ACC
↓
ARI
↓
NMI
↓
↓
ACC
ARI
↓
NMI
↓
ACC
↓
ARI
↓
NMI
ACC
↓
↓
ARI
↓
↓
↓
NMI
↓
ACC
ARI
NMI
↓
Ward
Single
OPTICS
ACC
ARI
NMI
ACC
ARI
NMI
↓
ACC
↓
↓
ARI
↓
NMI
↓
↓
↓
ACC
↓
ARI
NMI
↓
ACC
↓
ARI
↓
NMI
↓
ACC
↓
ARI
NMI
↓
↓
