Abstract
Knowledge Graph Embeddings, i.e., projections of entities and relations to lower dimensional spaces, have been proposed for two purposes: (1) providing an encoding for data mining tasks, and (2) predicting links in a knowledge graph. Both lines of research have been pursued rather in isolation from each other so far, each with their own benchmarks and evaluation methodologies. In this paper, we argue that both tasks are actually related, and we show that the first family of approaches can also be used for the second task and vice versa. In two series of experiments, we provide a comparison of both families of approaches on both tasks, which, to the best of our knowledge, has not been done so far. Furthermore, we discuss the differences in the similarity functions evoked by the different embedding approaches.
Keywords
Introduction
In the recent past, the topic of knowledge graph embedding – i.e., projecting entities and relations in a knowledge graph into a numerical vector space – has gained a lot of traction. An often cited survey from 2017 [42] lists already 25 approaches, with new models being proposed almost every month, as depicted in Fig. 1.

Publications with Knowledge Graph Embedding in their title or abstract, created with dimensions.ai (as of November 15th, 2021).
Even more remarkably, two mostly disjoint strands of research have emerged in that vivid area. The first family of research works focus mostly on link prediction [12], i.e., the approaches are evaluated in a knowledge graph refinement setting [24]. The optimization goal here is to distinguish correct from incorrect triples in the knowledge graph as accurately as possible. The evaluations of this kind of approaches are always conducted within the knowledge graph, using the existing knowledge graph assertions as ground truth.
A second strand of research focuses on the embedding of entities in the knowledge graph for downstream tasks outside the knowledge graph, which often come from the data mining field – hence, we coin this family of approaches embeddings for data mining. Examples include: the prediction of external variables for entities in a knowledge graph [30], information retrieval backed by a knowledge graph [36], or the usage of a knowledge graph in content-based recommender systems [32]. In those cases, the optimization goal is to create an embedding space which reflects semantic similarity as well as possible (e.g., in a recommender system, similar items to the ones in the user interest should be recommended). The evaluations here are always conducted outside the knowledge graph, based on external ground truth.
In this paper, we want to look at the commonalities and differences of the two approaches. We look at two of the most basic and well-known approaches of both strands, i.e., TransE [4] and RDF2vec [30], and analyze and compare their optimization goals in a simple example. Moreover, we analyze the performance of approaches from both families in the respective other evaluation setup: we explore the usage of link-prediction based embeddings for other downstream tasks based on similarity, and we propose a link prediction method based on node embedding techniques such as RDF2vec. From those experiments, we derive a set of insights into the differences of the two families of methods, and a few recommendations on which kind of approach should be used in which setting.
As pointed out above, the number of works on knowledge graph embedding is legion, and enumerating them all in this section would go beyond the scope of this paper. However, there have already been quite a few survey articles.
The first strand of research works – i.e., knowledge graph embeddings for link prediction – has been covered in different surveys, such as [42], and, more recently, [8,14,33]. The categorization of approaches in those reviews is similar, as they distinguish different families of approaches: translational distance models [42] or geometric models [33] focus on link prediction as a geometric task, i.e., projecting the graph in a vector space so that a translation operation defined for relation r on a head h yields a result close to the tail t.
The second family among the link prediction embeddings are semantic matching [42] or matrix factorization or tensor decomposition [33] models. Here, a knowledge graph is represented as a three-dimensional tensor, which is decomposed into smaller tensors and/or two-dimensional matrices. The reconstruction operation can then be used for link prediction.
The third and youngest family among the link prediction embeddings are based on deep learning and graph neural networks. Here, neural network training approaches, such as convolutional neural networks, capsule networks, or recurrent neural networks, are adapted to work with knowledge graphs. They are generated by training a deep neural network. Different architectures exist (based on convolutions, recurrent layers, etc.), and the approaches also differ in the training objective, e.g., performing binary classification into true and false triples, or predicting the relation of a triple, given its subject and object. [33].
While most of those approaches only consider graphs with nodes and edges, most knowledge graphs also contain literals, e.g., strings and numeric values. Recently, approaches combining textual information with knowledge graph embeddings using language modeling techniques have also been proposed, using techniques such as word2vec and convolutional neural networks [45] or transformer methods [9,43]. [11] shows a survey of approaches which take such literal information into account. It is also one of the few review articles which considers embedding methods from the different research strands.
Link prediction is typically evaluated on a set of standard datasets, and uses a within-KG protocol, where the triples in the knowledge graph are divided into a training, testing, and validation set. Prediction accuracy is then assessed on the validation set. Datasets commonly used for the evaluation are FB15k, which is a subset of Freebase, and WN18, which is derived from WordNet [4]. Since it has been remarked that those datasets contain too many simple inferences due to inverse relations, the more challenging variants FB15k-237 [39] and WN18RR [10] have been proposed. More recently, evaluation sets based on larger knowledge graphs, such as YAGO3-10 [10] and DBpedia50k/DBpedia500k [34] have been introduced.
The second strand of research works, focusing on the embedding for downstream tasks (which are often from the domain of data mining), is not as extensively reviewed, and the number of works in this area are still smaller. One of the more comprehensive evaluations is shown in [7], which is also one of the rare works which includes approaches from both strands in a common evaluation. They show that at least the three methods for link prediction used – namely TransE, TransR, and TransH – perform inferior on downstream tasks, compared to approaches developed specifically for optimizing for entity similarity in the embedding space.
Co-citation likelihood of different embeddings approaches, obtained from Google scholar, July 12th, 2021. An entry (row,column) in the table reads as: this fraction of the papers citing column also cites row
Co-citation likelihood of different embeddings approaches, obtained from Google scholar, July 12th, 2021. An entry (row,column) in the table reads as: this fraction of the papers citing column also cites row
A third, yet less closely related strand of research works is node embeddings for homogeneous graphs, such as node2vec and DeepWalk. While knowledge graphs come with different relations and are thus considered heterogeneous, approaches for homogeneous graphs are sometimes used on knowledge graphs as well by first transforming the knowledge graph into an unlabeled graph, usually by ignoring the different types of relations. Since some of the approaches are defined for undirected graphs, but knowledge graphs are directed, those approaches may also ignore the direction of edges.
For the evaluation of entity embeddings for data mining, i.e., optimized for capturing entity similarity, there are quite a few use cases at hand. The authors in [25] list a number of tasks, including classification and regression of entities based on external ground truth variables, entity clustering, as well as identifying semantically related entities.
Most of the above mentioned strands exist mainly in their own respective “research bubbles”. Table 1 shows a co-citation analysis of the different families of approaches. It shows that the Trans* family, together with other approaches for link prediction, forms its own citation network, so do the approaches for homogeneous networks, while RDF2vec and KGlove are less clearly separated.
Works which explicitly compare approaches from the different research strands are still rare. In [48], the authors analyze the vector spaces of different embedding models with respect to class separation, i.e., they fit the best linear separation between classes in different embedding spaces. According to their findings, RDF2vec achieves a better linear separation than the models tailored to link prediction.
In [6], an in-KG scenario, i.e., the detection and correction of erroneous links, is considered. The authors compare RDF2vec (with an additional classification layer) to TransE and DistMult on the link prediction task. The results are mixed: While RDF2vec outperforms TransE and DistMult in terms of Mean Reciprocal Rank and Precision@1, it is inferior in Precision@10. Since the results are only validated on one single dataset, the evidence is rather thin.
Most other research works in which approaches from different strands are compared are related to different downstream tasks. In many cases, the results are rather inconclusive, as the following examples illustrate:
[5] and [15] both analyze drug drug interaction, using different sets of embedding methods. The finding of [5] is that RDF2vec outperforms TransE and TransD, whereas in the experiment in [15], ComplEx outperforms RDF2vec, KGlove, TransE, and CrossE, and, in particular, TransE outperforms RDF2vec. [3,6], and [44] all analyze link prediction in different graphs. While [3] state that RotatE and TransD outperform TransE, DistMult, and ComplEx, which in turn outperforms node2vec, [6] reports that DistMult outperforms RDF2vec, which in turn outperforms TransE, while [44] reports that KG2vec (which can be considered equivalent to RDF2vec) outperforms node2vec, which in turn outperforms TransE. [2] compare the performance of RDF2vec, DistMult, TransE, and SimplE on a set of classification and clustering datasets. The results are mixed. For classification, the authors use four different learning algorithms, and the variance induced by the learning algorithms is most often higher than that induced by the embedding method. For the clustering, they report that TransE outperforms the other approaches.1 We think that these results must be taken with a grain of salt. To evaluate the clustering quality, the authors use an intrinsic evaluation metric, i.e., Silhouette score, which is computed in the respective vector space. It is debatable, however, whether Silhouette scores computed in different vector spaces are comparable.
Traditionally, most data mining methods are working on propositional data, i.e., each instance is a row in a table, described by a set of (binary, numeric, or categorical) features. For using knowledge graphs in data mining, one needs to either develop methods which work on graphs instead of propositional data, or find ways to represent instances of the knowledge graph as feature vectors [31]. The latter is often referred to as propositionalization [29].
RDF2vec [30] is a prominent example from the second family. It adapts the word2vec approach [22] for deriving word embeddings (i.e., vector representations for words) from a corpus of sentences. RDF2vec creates such sentences by performing random walks on an RDF graph and collecting the sequences of entities and relations, then trains a word2vec model on those sequences. It has been shown that this strategy outperforms other strategies of propositionalization. The relation between propositionalization and embedding methods has also recently been pointed out by [17].
Data mining is based on similarity
Predictive data mining tasks are predicting classes or numerical values for instances. Typically, the target is to predict an external variable not contained in the knowledge graph (or, to put it differently: use the background information from the knowledge graph to improve prediction models). One example would be to predict the popularity of an item (e.g., a book, a music album, a movie) as a numerical value. The idea here would be that two items which share similar features should also receive similar ratings. The same mechanism is also exploited in recommender systems: if two items share similar features, users who consumed one of those items are recommended the other one.
RDF2vec has been shown to be usable for such cases, since the underlying method tends to create similar vectors for similar entities, i.e., position them closer in vector space [32]. Figure 2 illustrates this using a 2D PCA plot of RDF2vec vectors for movies in DBpedia. It can be seen that clusters of movies, e.g., Disney movies, Star Trek movies, and Marvel related movies are formed.
Many techniques for predictive data mining rely on similarity in one or the other way. This is more obvious for, e.g., k-nearest neighbors, where the predicted label for an instance is the majority or average of labels of its closest neighbors (i.e., most similar instances), or Naive Bayes, where an instance is predicted to belong to a class if its feature values are most similar to the typical distribution of features for this class (i.e., it is similar to an average member of this class). A similar argument can be made for neural networks, where one can assume a similar output when changing the value of one input neuron (i.e., one feature value) by a small delta. Other classes of approaches (such as Support Vector Machines) use the concept of class separability, which is similar to exploiting similarity: datasets with well separable classes have similar instances (belonging to the same class) close to each other, while dissimilar instances (belonging to different classes) are further away from each other [38].

RDF2vec embeddings for movies in DBpedia, from [32].
To understand how (and why) RDF2vec creates embeddings that project similar entities to nearby vectors, we use the running example depicted in Fig. 3 and Fig. 4, showing a number of European cities, countries, and heads of those governments.

Example graph used for illustration.

Triples of the example knowledge graph.
As discussed above, the first step of RDF2vec is to create random walks on the graph. To that end, RDF2vec starts a fixed number of random walks of a fixed maximum length from each entity. Since the example above is very small, we will, for the sake of illustration, enumerate all walks of length 4 that can be created for the graph. Those walks are depicted in Fig. 5. It is notable that, since the graph has nodes without outgoing edges, some of the walks are actually shorter than 4.

Walks extracted from the example graph.
In the next step, the walks are used to train a predictive model. Since RDF2vec uses word2vec, it can be trained with the two flavors of word2vec, i.e., CBOW (context back of words) and SG (skip gram). The first predicts a word, given its surrounding words, the second predicts the surroundings, given a word. For the sake of our argument, we will only consider the second variant, depicted in Fig. 6. Simply speaking, given training examples where the input is the target word (as a one-hot-encoded vector) and the output is the context words (again, one hot encoded vectors), a neural network is trained, where the hidden layer is typically of smaller dimensionality than the input. That hidden layer is later used to produce the actual embedding vectors.

The skip gram variant of word2vec [30].
To create the training examples, a window with a given size is slid over the input sentences. Here, we use a window of size 2, which means that the two words preceding and the two words succeeding a context word are taken into consideration. Table 2 shows the training examples generated for three instances.
Training examples for instances Paris, Berlin, Mannheim, Angela Merkel, Donald Trump, and Belgium (upper part) and majority predictions (lower part)
A model that learns to predict the context given the target word would now learn to predict the majority of the context words for the target word at hand at the output layer called output in Fig. 6, as depicted in the lower part of Table 2. Here, we can see that Paris and Berlin share two out of four predictions, so do Mannheim and Berlin. Angela Merkel and Berlin share one out of four predictions.2
Note that in the classic formulation of RDF2vec (and word2vec), the position at which a prediction appears does not matter. The order-aware variant RDF2vec oa [27] uses an order-aware formulation of word2vec [20].
Considering again Fig. 6, given that the activation function which computes the output from the projection values is continuous, it implies that similar activations on the output layer requires similar values on the projection layer. Hence, for a well fit model, the distance on the projection layer of Paris, Berlin, and Mannheim should be comparatively lower than the distance of the other entities, since they activate similar outputs.3
Note that there are still weights learned for the individual connections between the projection and the output layer, which emphasize some connections more strongly than others. Hence, we cannot simplify our argumentation in a way like “with two common context words activated, the entities must be projected twice as close as those with one common context word activated”.
Figure 7 depicts a two-dimensional RDF2vec embedding learned for the example graph.4
Created with PyRDF2vec [41], using two dimensions, a walk length of 8, and standard configuration otherwise.

The example graph embedded with RDF2vec.
From Fig. 7, we can assume that link prediction should, in principle, be possible. For example, the predictions for heads of governments all point in a similar direction. This is in line with what is known about word2vec, which allows for computing analogies, like the well-known example
RDF2vec does not learn relation embeddings, only entity embeddings.5
Technically, we can also make RDF2vec learn embeddings for the relations, but they would not behave the way we need them.
With the same idea, we can also average the relation vectors r for each relation that holds between all its head and tail pairs, i.e.,
It can also be observed that the vectors for locatedIn and capitalOf point in reverse directions, which makes sense because they form connections between two clusters (countries and cities) in opposite directions.

Average relation vectors for the example.
A larger body of work has been devoted on knowledge graph embedding methods for link prediction. Here, the goal is to learn a model which embeds entities and relations in the same vector space.
Link prediction is based on vector operations
As the main objective is link prediction, most models, more or less, try to find a vector space embedding of entities and relations so that
In most approaches, negative examples are created by corrupting an existing triple, i.e., replace the head or tail with another entity from the graph (some approaches also foresee corrupting the relation). Then, a model is learned which tries to tell apart corrupted from non-corrupted triples. The formulation in the original TransE paper [4] defines the loss function L as follows:
Figure 9 shows the example graph from above, as embedded by TransE.6
Created with PyKEEN [1], using 128 epochs, a learning rate of 0.1, the softplus loss function, and default parameters otherwise, as advised by the authors of PyKEEN:
This does not mean that TransE does not work. The training data for the very small graph is rather scarce, and two dimensions might not be sufficient to find a good solution here.

Example graph embedded by TransE.
Like in the RDF2vec example above, we can observe that the two vectors for locatedIn and capital point in opposite directions. Also similar to the RDF2vec example, we can see that entities in similar classes form clusters: cities are mostly in the upper part of the space, people in the left, countries in the lower right part.
As discussed above, positioning similar entities close in a vector space is an essential requirement for using entity embeddings in data mining tasks. To understand why an approach tailored towards link prediction can also, to a certain extent, cluster similar instances together (although not explicitly designed for this task), we first rephrase the approximate link prediction Equation (8) as
Using the triangle inequality for the first inequation.
This also carries over to entities sharing the same two-hop connection. Consider two further triples
In the examples above, we can see that embeddings for link prediction have a tendency to project similar instances close to each other in the vector space. Here, the notion of similarity is that two entities are similar if they share a relation to another entity, i.e.,
The argument in Section 4.2 would also work for shared relations to common heads.
RDF2vec, on the other hand, covers a wider range of such similarities. Looking at Table 2, we can observe that two entities sharing a common relation to two different objects are also considered similar (Berlin and Mannheim both share the fact that they are located in Germany, hence, their predictions for
However, there in RDF2vec, similarity can also come in other notions. For example, Germany and USA are also considered similar, because they both share the relations headOfGovernment and capital, albeit with different object (i.e., their prediction for
On a similar argument, RDF2vec also positions entities closer which share any relation to another entity. Although this is not visible in the two-dimensional embedding depicted in Fig. 7, RDF2vec would also create vectors with some similarity for Angela Merkel and Berlin, since they both have a (albeit different) relation to Germany (i.e., their prediction for
To compare the two sets of approaches, we use standard setups for evaluating knowledge graph embedding methods for data mining as well as for link prediction.
Experiments on data mining tasks
In our experiments, we follow the setup proposed in [28] and [25]. Those works propose the use of data mining tasks with an external ground truth, e.g., predicting certain indicators or classes for entities. Those entities are then linked to a knowledge graph. Different feature extraction methods – which includes the generation of embedding vectors – can then be compared using a fixed set of learning methods.
The setup of [25] comprises six tasks using 20 datasets in total:
Five classification tasks, evaluated by accuracy. Those tasks use the same ground truth as the regression tasks (see below), where the numeric prediction target is discretized into high/medium/ low (for the Cities, AAUP, and Forbes dataset) or high/low (for the Albums and Movies datasets). All five tasks are single-label classification tasks. Five regression tasks, evaluated by root mean squared error. Those datasets are constructed by acquiring an external target variable for instances in knowledge graphs which is not contained in the knowledge graph per se. Specifically, the ground truth variables for the datasets are: a quality of living indicator for the Cities dataset, obtained from Mercer; average salary of university professors per university, obtained from the AAUP; profitability of companies, obtained from Forbes; average ratings of albums and movies, obtained from Facebook. Four clustering tasks (with ground truth clusters), evaluated by accuracy. The clusters are obtained by retrieving entities of different ontology classes from the knowledge graph. The clustering problems range from distinguishing coarser clusters (e.g., cities vs. countries) to finer ones (e.g., basketball teams vs. football teams). A document similarity task (where the similarity is assessed by computing the similarity between entities identified in the documents), evaluated by the harmonic mean of Pearson and Spearman correlation coefficients. The dataset is based on the LP50 dataset [18]. It consists of 50 documents, each of which have been annotated with DBpedia entities using DBpedia spotlight [21]. The task is to predict the similarity for each pair of documents. An entity relatedness task (where semantic similarity is used as a proxy for semantic relatedness), evaluated by Kendall’s Tau. The dataset is based on the KORE dataset [13]. The dataset consists of 20 seed entities from the YAGO knowledge graph, and 20 related entities each. Those 20 related entities per seed entity have been ranked by humans to capture the strength of relatedness. The task is to rank the entities per seed by relatedness. Four semantic analogy tasks (e.g., Athens is to Greece as Oslo is to X), which are based on the original datasets on which word2vec was evaluated [22]. The datasets were created by manual annotation. In our evaluation, we aim at predicting the fourth element (D) in an analogy
Table 3 shows a summary of the characteristics of the datasets used in the evaluation. It can be observed that they cover a wide range of tasks, topics, sizes, and other characteristics (e.g., balance). More details on the construction of the datasets can be found in [25] and [28].
Note that all datasets are provided with predefined instance links to DBpedia. For the smaller ones, the creators of the datasets created and checked the links manually; for the larger ones, the linking had been done heuristically. We used the links provided in the evaluation framework as is, including possible linkage errors.
Overview on the evaluation datasets
Overview on the evaluation datasets
We follow the evaluation protocol suggested in [25]. This protocol foresees the usage of different algorithms on each task for each embedding (e.g., Naive Bayes, Decision Tree, k-NN, and SVM for classification), and also performs parameter tuning in some cases. In the end, we report the best results per task and embedding method. Those results are depicted in Table 4.
Results of the different data mining tasks. DM denotes approaches originally developed for node representation in data mining, LP denotes approaches originally developed for link prediction
All embeddings are trained on DBpedia 2016-10.10
The code for the experiments as well as the resulting embeddings can be found at
It is noteworthy that the default settings for node2vec and DeepWalk differ in one crucial property. While node2vec interprets the graph as a directed graph by default and only traverses edges in the direction in which they are defined, DeepWalk treats all edges as undirected, i.e., it traverses them in both directions.
From the table, we can observe a few expected and a few unexpected results. First, since RDF2vec is tailored towards classic data mining tasks like classification and regression, it is not much surprising that those tasks are solved better by using RDF2vec (and even slightly better by using RDF2vec oa ) vectors. Still, some of the link prediction methods (in particular TransE and RESCAL) perform reasonably well on those tasks. In contrast, KGloVe rarely reaches the performance level of RDF2vec, while the two approaches for unlabeled graphs – i.e., DeepWalk and node2vec – behave differently: while the results of DeepWalk are at the lower end of the spectrum, node2vec is competitive. The latter is remarkable, showing that pure neighborhood information, ignoring the direction and edge labels, can be a strong signal when embedding entities.
Referring back to the different notions of similarity that these families of approaches imply (cf. Section 4.3), this behavior can be explained by the tendency of RDF2vec (and also node2vec) to positioning entities closer in the vector space which are more similar to each other (e.g., two cities that are similar). Since it is likely that some of those dimensions are also correlated with the target variable at hand (in other words: they encode some dimension of similarity that can be used to predict the target variable), classifiers and regressors can pick up on those dimensions and exploit them in their prediction model.
What is also remarkable is the performance on the entity relatedness task. While RDF2vec embeddings, as well as node2vec, KGlove, and, to a lesser extent, DeepWalk, reflect entity relatedness to a certain extent, this is not given for any of the link prediction approaches. According to the notions of similarity discussed above, this is reflected in the RDF2vec mechanism: RDF2vec has an incentive to position two entities closer in the vector space if they share relations to a common entity, as shown in Equations (21)-(24). One example is the relatedness of Apple Inc. and Steve Jobs – here, we can observe the two statements
The same behavior of RDF2vec – i.e., assigning close vectors to related entities – also explains the comparatively bad results of RDF2vec on the first two clustering tasks. Here, the task is to separate cities and countries in two clusters, but since a city is also related to the country it is located in, RDF2vec may position that city and country rather closely together (RDF2vec oa changes that behavior, as argued in [27], and hence produces better results for the clustering problems). Hence, that city has a certain probability of ending up in the same cluster as the country. The latter two clustering tasks are different: the third one contains five clusters (cities, albums, movies, universities, and companies), which are less likely to be strongly related (except universities and companies to cities) and therefore are more likely to be projected in different areas in the vector space. Here, the difference of RDF2vec to the best performing approaches (i.e., TransE-L1 and TransE-L2) is not that severe. The same behavior can also be observed for the other embedding approaches for data mining, i.e., node2vec, DeepWalk, and KGlove, which behave similarly in that respect.
The problem of relatedness being mixed with similarity does not occur so strongly for homogeneous sets of entities, as in the classification and regression tasks, where all entities are of the same kind (cities, companies, etc.) – here, two companies which are related (e.g., because one is a holding of the other) can also be considered similar to a certain degree (in that case, they are both operating in the same branch). This also explains why the forth clustering task (where the task is to assign sports teams to clusters by the type of sports) works well for RDF2vec – here, the entities are again homogeneous.
At the same time, the test case of clustering teams can also be used to explain why link prediction approaches work well for that kind of tasks: here, it is likely that two teams in the same sports share a relation to a common entity, i.e., they fulfill Equations (19) and (20). Examples include participation in the same tournaments or common former players.
The semantic analogies task also reveals some interesting findings. First, it should be noted that the relations which form the respective analogies (capital, state, and currency) is contained in the knowledge graph used for the computation. That being said, we can see that most of the link prediction results (except for RotatE and RESCAL) perform reasonably well here. Particularly, the first cases (capitals and countries) can be solved particularly well in those cases, as this is a 1:1 relation, which is the case in which link prediction is a fairly simple task. On the other hand, most of the data-mining-centric approaches (i.e., node2vec, DeepWalk, KGlove) solve this problem relatively bad. A possible explanation is that the respective entities belong to the strongly interconnected head entities of the knowledge graphs, and also the false solutions are fairly close to each other in the graph (e.g., US Dollar and Euro are interconnected through various short paths). This makes it hard for approaches concentrating on a common neighborhood to produce decent results here.
On the other hand, the currency case is solved particularly bad by most of the link prediction results. This relation is an n:m relation (there are countries with more than one official, unofficial, or historic currency, and many currencies, like the Euro, are used across many countries. Moreover, looking into DBpedia, this relation contains a lot of mixed usage and is not maintained with very high quality. For example, DBpedia lists 33 entities whose currency is US Dollars14
RDF2vec, in contrast, can deal reasonably well with that case. Here, two effects interplay when solving such tasks: (i) as shown above, relations are encoded by the proximity in RDF2vec to a certain extent, i.e., the properties in Equations (3) and (4) allow to perform analogy reasoning in the RDF2vec space in general. Moreover, (ii) we have already seen the tendency of RDF2vec to position related entities in relative proximity. Thus, for RDF2vec, it can be assumed that the following holds:
In a second series of experiments, we analyze if we can use embedding methods developed for similarity computation, like RDF2vec, also for link prediction. We use the two established tasks WN18 and FB15k for a comparative study.
While link prediction methods are developed for the task at hand, approaches developed for data mining are not. Although RDF2vec computes vectors for relations, they do not necessarily follow the same notion as relation vectors for link prediction, as discussed above. Hence, we investigate two approaches:
We average the difference for each pair of a head and a tail for each relation r, and use that as average as a proxy for a relation vector for prediction, as shown in Equation (4). The predictions are the entities whose embedding vectors are the closest to the approximate prediction. This method is denoted as avg. For predicting the tail of a relation, we train a neural network to predict an embedding vector of the tail based embedding vectors, as shown in Fig. 10. The predictions for a triple
We trained the RDF2vec embeddings with 2,000 walks, a depth of 4, a dimension of 200, a window of 5, and 25 epochs in SG mode. For the second prediction approach, the two neural networks each use two hidden layers of size 200, and we use 15 epochs, a batch size of 1,000, and mean squared error as loss. KGlove, node2vec, and DeepWalk do not produce any vectors for relations. Hence, we only use the avg strategy for those approaches.

Training a neural network for link prediction with RDF2vec.
The results of the link prediction experiments are shown in Table 5.15
The code for the experiments can be found at
While the results are not overwhelming, they show that similarity of entities, as RDF2vec models it, is at least a useful signal for implementing a link prediction approach.
Results of the link prediction tasks on WN18 and FB15K. Results for TransE and RESCAL from [4], results for RotatE from [37], results for DistMult from [46], results for TransR from [19]. DM denotes approaches originally developed for node representation in data mining, LP denotes approaches originally developed for link prediction
Results of the link prediction tasks on WN18 and FB15K. Results for TransE and RESCAL from [4], results for RotatE from [37], results for DistMult from [46], results for TransR from [19]. DM denotes approaches originally developed for node representation in data mining, LP denotes approaches originally developed for link prediction
Closest concepts to Angela Merkel in the different embedding approaches used
As already discussed above, the notion of similarity which is conveyed by RDF2vec mixes similarity and relatedness. This can be observed, e.g., when querying for the 10 closest concepts to Angela Merkel (the chancellor, i.e., head of government in Germany) in DBpedia in the different spaces, as shown in Table 6. The approach shows a few interesting effects:
While most of the approaches (except for RotatE, KGlove and DeepWalk) provide a clean list of people, RDF2vec brings up a larger variety of results, containing also Germany and Berlin (and also a few results which are not instances, but relations; however, those could be filtered out easily in downstream applications if necessary). This demonstrates the property of RDF2vec of mixing similarity and relatedness. The people in the RDF2vec result set are all related to Angela Merkel: Joachim Gauck was president during her chancellorship, Norbert Lammert was the head of parliament, Stanislaw Tillich was a leading board member in the same party as Merkel, and Andreas Voßkuhle was the head of the highest court during her chancellorship.
The approaches at hand have different foci in determining similarity. For example, TransE-L1 outputs mostly German politicians (Schröder, Gauck, Trittin, Gabriel, Westerwelle, Wulff) and former presidents of other countries (Buchanan as a former US president, Sarkozy and Chirac as former French presidents) TransE-L2 outputs a list containing many former German chancellors (Schröder, Kohl, Adenauer, Schmidt, Kiesinger, Erhardt), TransR mostly lists German party leaders (Gabriel, Steinmeier, Rösler Schröder, Wulff, Westerwelle, Kohl, Trittin). Likewise, node2vec produces a list of German politicians, with the exception of Merkel’s husband Joachim Sauer.16
The remaining approaches – RotatE, DistMult, RESCAL, ComplEx, KGlove, DeepWalk – produce lists of (mostly) persons which, in their majority, share no close link to the query concept Angela Merkel.
In contrast, the persons in the output list of RDF2vec are related to the query entity in different respects. In particular, they played different roles during Angela Merkel’s chancellorship (Gauck was the German president, Lammert was the chairman of the parliament, and Voßkuhle was the chairman of the federal court). Here, there is no dominant property, instead, similarity (or rather: relatedness) is encoded along various properties. RDF2vec oa brings up a results which is slightly closer to the politicians lists of the other approaches, while the result list of KGlove looks more like a random list of entities. A similar observation can be made for DeepWalk, which, with the exception of the first result (which is a German politician) does not produce any results seemingly related to the query concept at hand.
With that observation in mind, we can come up with an initial set of recommendations for choosing embedding approaches:
Approaches for data mining (RDF2vec, KGlove, node2vec, and DeepWalk) work well when dealing with sets of homogeneous entities. Here, the problem of confusing related entities (like Merkel and Berlin) is negligible, because all entities are of the same kind anyways. In those cases, RDF2vec captures the finer distinctions between the entities better than embeddings for link prediction, and it encodes a larger variety of semantic relations. From the approaches for data mining, those which respect order (RDF2vec
oa
and node2vec) work better than those which do not (classic RDF2vec, KGlove, and DeepWalk).17 As discussed above, this comments holds for the default configuration of node2vec and DeepWalk used in this paper.
For problems where heterogeneous sets of entities are involved, embeddings for link prediction often do a better job in telling different entities apart.
The same argument underlies an observation made by Zouaq and Martel [48]: the authors found that RDF2vec is particularly well suited for distinguishing fine-grained entity classes (as opposed to coarse-grained entity classification). For fine-grained classification (e.g., distinguishing guitar players from singers), all entities to be classified are already of the same coarse class (e.g., musician), and RDF2vec is very well suited for capturing the finer differences. However, for coarse classifications, misclassifications by mistaking relatedness for similarity become more salient.
From the observations made in the link prediction task, we can come up with another recommendation:
For relations which come with rather clean data quality, link prediction approaches work well. However, for more noisy data, RDF2vec has a higher tendency of creating useful embedding vectors.
For the moment, this is a hypothesis, which should be hardened, e.g., by performing controlled experiments on artificially noised link prediction tasks.
In this paper, we have compared two use cases and families of knowledge graph embeddings which have, up to today, not undergone any thorough direct comparison: approaches developed for data mining, such as RDF2vec, and approaches developed for link prediction, such as TransE and its descendants.
We have argued that the two approaches actually do something similar, albeit being designed with different goals in mind. To support this argument, we have run two sets of experiments which examined how well the different approaches work if applied in the respective other setup. We show that, to a certain extent, embedding approaches designed for link prediction can be applied in data mining and vice versa, however, there are differences in the outcome.
From the experiments, we have also seen that proximity in the embedding spaces works differently for the two families of approaches: in RDF2vec, proximity encodes both similarity and relatedness, while TransE and its descendants rather encode similarity alone. On the other hand, for entities that are of the same type, RDF2vec covers finer-grained similarities better. Moreover, RDF2vec seems to work more stably in cases where the knowledge graphs are rather noisy and weakly adherent to their schema.
These findings give rise both for a recommendation and some future work. First, in use cases where relatedness plays a role next to similarity, or in use cases where all entities are of the same type, approaches like RDF2vec may yield better results. On the other hand, for cases with mixed entity types where it is important to separate the types, link prediction embeddings might yield better results.
Since the set of knowledge graphs used in our experiments is limited, we can, however, not come up with recommendations of which kind of embedding is better suited for which kind of knowledge graph. While we expect that there are differences with respect to different characteristics of the graph – e.g., homogeneity, link degree and cardinality distributions, density and sparsity, schema size and variety – both theoretical considerations and experimental evaluations in that direction are subject to future work.
Moreover, the open question remains whether it is possible to develop embedding methods that combine the best of both worlds – e.g., that provide both the coarse type separation of TransE and its descendants and the fine type separation of RDF2vec, or that support competitive link prediction while also representing relatedness. We expect to see some interesting developments along these lines in the future.
Footnotes
Acknowledgement
The publication of this article was funded by the Ministry of Science, Research and the Arts Baden-Württemberg and the University of Mannheim.
