Abstract
An increasing number of structured knowledge bases have become available on the Web, enabling many new forms of analyses and applications. However, the fact that the data is being published by different parties with different vocabularies and ontologies means that there is a high degree of heterogeneity and no common schema. At the same time, the abundance of different human languages across unstructured data presents a similar problem, because most text mining tools only cater to the English language. This paper presents solutions for these two kinds of heterogeneity. It introduces Klint, a Web-based system that automatically creates mappings to transform knowledge from heterogeneous sources into FrameBase, which is a broad linked data schema that enables the representation of a wide range of knowledge. With Klint, a user can review and edit the mappings with a streamlined interface, which in turn allows for human-level accuracy with minimum human effort. The paper further describes how FrameBase can be extended to support multilingual labels, which can aid in extending current tools for integrating English text into FrameBase knowledge.
Introduction
The Web of Data includes a rich and increasing number of structured knowledge bases, and has enabled many new applications and forms of analyses. Such open knowledge bases often provide their data in RDF format [34], based on Subject-Predicate-Object triples. When coupled with the use of global identifiers in the form of Internationalized Resource Identifiers (IRIs), this facilitates the linking and merging of data from disparate sources into a single graph, usually referred to as the LOD (Linked Open Data) cloud. However, despite these advantages, different datasets may still model equivalent or overlapping information using different vocabularies and different triple patterns. This poses two important problems.
Querying data can be cumbersome. In order to capture all pertinent knowledge, a structured query will have to consist of a disjunction of all possible semantic patterns occurring in the myriad of heterogeneous vocabularies used in the data.
Linking data is typically carried out by creating triples with a predicate
Beyond these, there is also the inevitable challenge that most data on the Web is in the form of natural language [17]. This can be regarded as a further, particularly challenging case of heterogeneity, whether text is considered in its most primitive form as strings of characters or modelled as a syntactic or semantic network emitted by a state-of-the-art natural language processing library such as CoreNLP [38].
Automatic integration could solve this challenge, either from fully structured and disambiguated heterogeneous sources, or from syntactic/semantic networks produced using existing text mining methods. However, it is often an AI-hard problem, especially since many applications expect knowledge bases to be highly accurate. This expectation stems from the generally high accuracy of linked data KBs [23], many of which were built manually (e.g., through folksonomic approaches, as with Wikidata) or using integration rules that exploit highly reliable markup and hyperlinking structures (Wikipedia infoboxes and taxonomic links in DBpedia [35] and YAGO [53]).
In this paper, we present two attempts at tackling these issues, both related to FrameBase, a dataset that has been able to integrate a wide range of knowledge from other sources.
Firstly, we describe Klint (Knowledge integrator), a Web-based system enabling semi-automatic schema integration. Given one or more existing RDF ontologies, Klint generates tentative integration rules for these ontologies by representing mappings in a unified schema. For this unified schema, Klint relies on FrameBase [44], a wide-coverage, highly expressive and extensible schema that can be used to represent and integrate [47] a wide spectrum of knowledge from many sources in a homogeneous and seamless way.
Secondly, we introduce a method for adding multilingual support to FrameBase. There are already several systems to extract FrameBase-modelled knowledge from natural language [2,11], but these are restricted to input sources given in the English language. Linked data is increasingly being embraced in domains such as government data, digital libraries, cultural heritage, and linguistics. As an increasing number of organizations around the world seek to adopt the principles of linked data, it is important to also facilitate the interlinking of data captured in different natural languages. This is particularly important for multinational institutions such as those within the European Union. We evaluate results obtained for the Swedish language, but the method is generalizable to other languages.
This paper is structured as follows. Section 2 discusses related work in this area. Section 3 introduces the FrameBase schema, on top of which the work described in this paper is based. Section 4 describes the details of our Klint system for semi-automatic integration of heterogeneous structured data. Section 5 presents the algorithms used to add multilingual support to FrameBase, and provides an evaluation of the results. Finally, Section 6 provides a conclusion.
Related work
In this section we present previous research that is related to the work described in this paper. Section 2.1 describes work in the field of ontology alignment. Section 2.2 describes existing tools for ontology visualization and editing, and for the creation of mappings between different ontologies. Section 2.3 describes existing work addressing the problem of linguistic heterogeneity and multilinguality in the context of ontologies and linked data.
Ontology alignment
Integrating knowledge from different sources is a long-standing, ubiquitous problem [31]. In the context of linked data and ontology alignment [16,22], there has been extensive work in tasks such as identifying equivalent classes from different ontologies, and in some cases also equivalent instances and properties [4,36,41,52].
Since knowledge can be expressed in structurally different ways, one-to-one equivalences are not sufficient to exhaustively connect equivalent information across knowledge bases, and complex mappings become necessary. There has been some work addressing the problem of declaring and finding these mappings, either semi-automatically or fully automatically. The EDOAL (Expressive and Declarative Ontology Alignment Language) format [14] has been proposed to express complex relationships between properties, and complex correspondence patterns between ontologies have been described and classified in an ontology [49]. However, these works do not address the problem of identifying these correspondence patterns. The iMAP tool [21] searches a space of possible complex relationships between the values of entries in two knowledge bases, e.g.,
FrameBase [44,45] declares integration rules using SPARQL CONSTRUCT queries, which enables arbitrarily complex mappings between external knowledge bases and a hub schema based on frames (classes representing events or processes) and frame elements (properties of frames), extracted from FrameNet [26,48]. Different methods have been developed to automatically create certain types of complex rules [45]. This will be explained in more depth in Section 3.
Graphical tools for addressing structural heterogeneity
There are several tools for visualizing linked data [7,13,29], but these do not provide facilities to create arbitrary mappings.
WebProtégé [55] is a widely used ontology engineering tool for RDF, with extensive support for OWL reasoning, but it is designed to facilitate creating individual ontologies, not mappings between them.
LodLive [9] allows users to browse RDF resources using a dynamic visual graph and to create 1-to-1 links between resources stored in different endpoints. However, it does not allow for creating the more complex kinds of mappings we consider in this work.
Fusion [1] provides facilities to create mappings that are more complex than 1-to-1 relations between entities, but it is still limited to mappings from a single relation in the source ontology, and hence does not support arbitrary mappings. More specifically, it is limited to the following two kinds.
Mapping a single property in the source ontology to a path (in the sense of property paths in SPARQL [32], i.e., a sequence of properties).
Fusion also allows for mapping a single property in the source ontology to a complex pattern in another, which is defined in the imperative programming language Ruby.
FrameBase and Klint, by contrast, allow complex mappings by means of arbitrarily complex but declarative graph patterns implemented in SPARQL.
There are knowledge bases that include multilingual labels, such as YAGO 3 [37], BabelNet [40], MENTA [18], and DBpedia [35]. A large part of the entities and relations in these ontologies are obtained from Wikipedia articles and their mutual hyperlinks. As part of this approach, multilingual labels can be obtained by exploiting the cross-lingual links in Wikipedia [19], which constitutes a fundamental part of these ontologies. However, these ontologies are not explicitly built with the purpose of knowledge integration.
In order to solve the problem of linguistic heterogeneity in the context of knowledge bases, there has been work on linking knowledge entries with lexical labels in different languages, using machine translation [27,28], graph models [56], or a combination of dictionary and monolingual ontology mapping tools [54].

Figure 1: Example of how FrameBase represents the knowledge that two entities, John and Mary, married in 1964. Blue entities constitute reified knowledge while green ones constitute dereified knowledge.
Our work differs from these two types of previous efforts in several ways. First, we do not use the cross-lingual links in Wikipedia, as done by YAGO 3 and DBpedia. Instead, we add multilingual labels to FrameBase by exploiting its connections to WordNet and FrameNet as well as the existence of high-quality multilingual versions of these linguistic resources. Second, we do not attempt to directly and reciprocally interlink multilingual knowledge bases, since FrameBase is meant as a central and unique hub for integrating knowledge, where multilinguality can be expressed by means of labels that will be annotated with Lemon [39], reflecting their language and morphosyntactic properties.
FrameBase [44–47] provides a unified schema designed to integrate knowledge from heterogeneous datasets such as those in the linked data cloud. What makes FrameBase particularly well-suited for integrating heterogeneous knowledge is that it is highly expressive, both lexically (it contains thousands of lexical entries) and structurally, because the frame-based representation it relies on can represent n-ary relations in an extensible way.
The backbone of FrameBase stems from FrameNet [26,48], a linguistic resource that compiles frames to annotate how specific words, denoted as Lexical Units (LUs), evoke them when they appear in text. FrameBase extends its frame coverage using synsets, which are sets of synonymous disambiguated words taken from another linguistic resource called WordNet [24]. However, it is possible to understand FrameBase without detailed knowledge of these resources.
The core elements of the FrameBase schema are classes that correspond to frames, which are conceptual structures representing situations, events, or processes of any kind. The frame classes have properties called Frame Elements (FEs), in the sense that each of these properties has a specific frame class as its
Figure 1 provides an example of how FrameBase represents the knowledge that two entities, John and Mary, married in 1964 (with the frame classes and FE properties depicted in blue).
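The reified pattern can be illustrated with a minimal Python sketch. It assumes a placeholder prefix and hypothetical frame and FE names (`fb:frame-Marriage`, `fb:fe-Partner_1`, and so on); these are not FrameBase's actual IRIs, and the second marriage is invented purely to show that each event keeps its own participants and time.

```python
# Minimal sketch: a frame instance groups all participants of one event.
# The prefix and all frame/FE names below are hypothetical placeholders.
FB = "fb:"

def marriage_frame(event_id, partner_1, partner_2, year):
    """Return the triples describing one marriage event as a frame instance."""
    return [
        (event_id, "rdf:type", FB + "frame-Marriage"),
        (event_id, FB + "fe-Partner_1", partner_1),
        (event_id, FB + "fe-Partner_2", partner_2),
        (event_id, FB + "fe-Time", year),
    ]

def describe(triples):
    """Group property values by the event node they are attached to."""
    events = {}
    for subject, predicate, obj in triples:
        events.setdefault(subject, {})[predicate] = obj
    return events

triples = marriage_frame("_:m1", "ex:John", "ex:Mary", "1964")
# Hypothetical second event for the same person, for illustration only.
triples += marriage_frame("_:m2", "ex:John", "ex:Anna", "1980")
events = describe(triples)
```

Because every value hangs off its own event node, the year of each marriage stays attached to the right pair of partners, which a flat set of binary properties on `ex:John` could not guarantee.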

Example illustrating the hierarchy of frames. The prefix before the colon is “
In general, frame classes in FrameBase are organized into a hierarchy of three different kinds of frames: microframes, miniframes, and macroframes. Figure 2 shows some example frames to illustrate this hierarchy.
Microframes have very specific meanings and are directly connected to sense-disambiguated terms. Microframes are divided into two categories.
LU-microframes, each associated with an LU in FrameNet. Some examples are illustrated as green nodes in Fig. 2, such as
synset-microframes, each associated with a synset in WordNet. Some examples are illustrated as blue nodes in Fig. 2, such as
Miniframes cluster together microframes with near-equivalent meaning. Two examples are illustrated as pink nodes in Fig. 2:
Macroframes, finally, constitute an upper level of general types of situations or events. For example,
The purpose of miniframes is to offer an intermediate level of granularity, in which microframes representing equivalent concepts can be merged (“wedding”, “spouse”, and “marry” are under the same miniframe), thus reducing data sparsity. At the same time, this avoids the conflation of clearly different concepts that would occur if macroframes were used instead (“wedding”, “spouse”, and “marry” are under the same macroframe as “divorce”).
The FrameBase hierarchy contains around 130,000 frames in total, and supports efficient RDFS inference with a few extensions [10].
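The effect of the three levels can be sketched in Python as follows. The sketch assumes a simple child-to-parent table; the identifiers (`micro:wedding.n`, `mini:marry`, `macro:ExampleRelationship`) are hypothetical and chosen only to mirror the “wedding”/“marry”/“divorce” example above.

```python
# Minimal sketch of the microframe -> miniframe -> macroframe hierarchy.
# All identifiers are hypothetical placeholders, not real FrameBase IRIs.
PARENT = {
    "micro:wedding.n": "mini:marry",
    "micro:spouse.n": "mini:marry",
    "micro:marry.v": "mini:marry",
    "micro:divorce.v": "mini:divorce",
    "mini:marry": "macro:ExampleRelationship",
    "mini:divorce": "macro:ExampleRelationship",
}

def ancestors(frame):
    """Return the chain of parents of a frame, nearest first."""
    chain = []
    while frame in PARENT:
        frame = PARENT[frame]
        chain.append(frame)
    return chain

def closest_shared_frame(frame_a, frame_b):
    """Return the most specific frame that subsumes both arguments, if any."""
    b_ancestors = set(ancestors(frame_b))
    for frame in ancestors(frame_a):
        if frame in b_ancestors:
            return frame
    return None
```

In this toy table, `closest_shared_frame("micro:wedding.n", "micro:marry.v")` yields the miniframe, whereas pairing either with `micro:divorce.v` only meets at the macroframe level.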
The FE properties are also imported from FrameNet. FE properties change across macroframes, but microframes and miniframes inheriting a given macroframe share the same FE properties. For instance, both microframes
The main purpose of FrameBase is being able to represent a wide range of knowledge in a homogeneous way. This should ease the difficulty associated with jointly querying data represented by an unbounded mix of different schemas that are heterogeneous not only at a lexical level but also structurally. This structural heterogeneity also prevents linking data by means of binary properties such as
Knowledge can directly be created under the FrameBase schema, or integrated from existing knowledge bases by means of integration rules (usually implemented as SPARQL CONSTRUCT queries), which allow what could be termed as “complex linked data”, or “non-binary linked data”.
The frame-based representation is the core of FrameBase and is chosen because it is very expressive and unambiguous, while striking a balance in terms of space complexity. One cannot unambiguously represent multiple marriages of a given person using

Figure 3: Overview of the structure of the FrameBase system (ontology and rules).
Figure 3 provides a general overview of FrameBase. This paper also introduces a new version of FrameBase, with an extended set of synset-microframes and a more homogeneous structure in the IRIs. Previous versions of FrameBase (1.x) used identifiers for synset-microframes that included the name of the parent macroframe [44]. The parent macroframe was identified using an automatic method that was not able to map the entirety of WordNet Synsets into FrameBase. Some synsets should not even be mapped to an existing macroframe because this may not be available in the original FrameNet resource. However, including these orphan synset-microframes turned out to be advisable because even without a proper integration into the upper hierarchy (which can be added later), these synsets increase the expressivity of FrameBase. Hence, a new syntax has been defined for FrameBase 2.0, which removes parent frame information from synset-microframe IRIs, and allows orphan synset-microframes to be defined as children of the upper frame

Figure 4: Two rules automatically created by the integration engine, integrating different properties from Wikidata: “time of death” and “place of death”. There is some knowledge overlap among them (the death event of type “to die” and the property “protagonist”), and also with the results from the rule in Fig. 5. This is impossible to capture by simply linking the source schemas via equivalence or subsumption properties.
We have developed an integration engine that uses Reification-Dereification (ReDer) rules to map properties from source KBs into FrameBase, matching them with Direct Binary Predicates (DBPs) after a process of canonicalization [45,47]. Figure 4 includes an example. In order to map classes from the source KBs, a Support Vector Machine model is trained to classify (Class, Frame) pairs based on a range of lexical similarity features.
Because of its ties to computational linguistics, FrameBase is not only appropriate for integrating heterogeneous structured knowledge, but also for integrating knowledge extracted from natural language. Currently, there are several third-party systems that extract FrameBase-modelled knowledge from natural language [2,11], but these are restricted to source texts provided in the English language.

Example of Klint’s interface: integrating elements from DBpedia – Klint used the contextual and lexical information from the source elements to suggest two candidate values for the integrated type (selected node, “conflict”), for which the correct assigned value,
We describe Klint in three parts. First, Section 4.1 describes the features of the application from a functional perspective. Then, Section 4.2 provides non-functional details about the algorithm used for representing the graph. Finally, Section 4.3 illustrates the steps taken in Klint to edit a graph representing an integration rule.
Working modes
Klint supports three working modes:
The Assisted Schema Integration mode, which helps a user curate and create rules to integrate knowledge from external sources into FrameBase. It is explained in detail in Section 4.1.1.
The Visual Knowledge Building mode, which assists a user in creating FrameBase knowledge directly. It is explained in detail in Section 4.1.2.
The Visual Query Building mode, which assists a user in creating queries to retrieve existing Frames. It is explained in detail in Section 4.1.3.
Assisted schema integration mode
Under the Assisted Schema Integration Mode, the user can integrate one or more entire knowledge bases (KBs) into FrameBase with reduced effort in comparison to an entirely manual or non-graphical approach. The input KBs can be loaded from an RDF file or a SPARQL endpoint. Other formats can also be used after pre-processing with a suitable RDF converter.
Integration heuristics. Klint automatically creates complex integration rules (i.e., rules that go beyond 1-to-1 equivalences between entities) for each element in the source schema, using integration algorithms based on linguistic annotations in FrameBase [47]. These heuristics can create two kinds of integration rules:
Property-Frame integration rules, where a property P in the source KB is mapped to a pattern consisting of an instantiated frame (that represents a situation or eventuality evoked by P). The frame is connected to the subject and object of P by two FE properties that represent “semantic roles” [25]. Figure 4 includes examples of these rules. These integration rules are created by matching the properties of the source KB with the FrameBase DBPs after a process of canonicalization [45,47].
Class-Frame integration rules, where a class is mapped to a frame instance, and some of its outgoing properties are mapped to FE properties of that frame, when there is textual overlap for all of them.
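A Property-Frame rule can be illustrated with a short Python sketch. The text states that rules are usually implemented as SPARQL CONSTRUCT queries, so this is only an illustration of the resulting pattern, not the integration engine itself. The rule table, the property names (`wd:time_of_death`, `wd:place_of_death`), and the frame and FE names are hypothetical stand-ins for the “time of death” and “place of death” rules of Fig. 4.

```python
from itertools import count

# Hypothetical rule table: source property -> (frame class, FE for the
# subject, FE for the object). Names are placeholders, not real IRIs.
RULES = {
    "wd:time_of_death": ("fb:frame-Death-die.v", "fb:fe-Protagonist", "fb:fe-Time"),
    "wd:place_of_death": ("fb:frame-Death-die.v", "fb:fe-Protagonist", "fb:fe-Place"),
}

def integrate(source_triples, rules=RULES):
    """Rewrite each matching source triple as a frame instance with two FEs."""
    fresh = count(1)
    integrated = []
    for subject, prop, obj in source_triples:
        if prop not in rules:
            continue  # no rule for this property; leave it unintegrated
        frame, subject_fe, object_fe = rules[prop]
        event = f"_:e{next(fresh)}"  # one new frame instance per source triple
        integrated += [
            (event, "rdf:type", frame),
            (event, subject_fe, subject),
            (event, object_fe, obj),
        ]
    return integrated

source = [
    ("ex:Person1", "wd:time_of_death", "1852"),
    ("ex:Person1", "wd:place_of_death", "ex:London"),
    ("ex:Person1", "ex:unmapped", "x"),
]
result = integrate(source)
```

Both rules produce a frame of the same type with the same subject FE, which is the overlap that simple equivalence or subsumption links between the two source properties could not express. Minting one instance per source triple is a simplification made here; whether and how such instances are merged is not covered by this sketch.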
Graphical interface. Each integration rule is represented as a graph with nodes and edges in the right pane of Klint’s graphical interface (Fig. 5). Users can navigate across different integration rules with the buttons at the top bar, making modifications in a given graph if necessary. The nodes can be of different types.
Grounded nodes represent single entities, and can themselves be of three types:
Source nodes (green) represent resources from the source KB and connect two variable nodes.
FrameBase nodes (blue) represent FrameBase resources and also connect variable nodes. They provide the translation of the source pattern to FrameBase.
Auxiliary nodes (gray) represent resources from third-party KBs, usually representing common idioms or very specific entities.
Variable nodes (presented in red) represent universally quantified variables over entities. They bind the pattern from the source KB to the integrated FrameBase pattern. The remaining nodes are classified according to the type of entity they represent.
The nodes are connected via directed edges representing triples. Since an RDF triple involves three elements, each triple is represented by two successive edges, one from the subject to the predicate and another from the predicate to the object.
In the Assisted Schema Integration Mode, both edges and any of the above-mentioned types of nodes can be added, deleted, and edited. When a node is selected, the node is highlighted and the left panel is activated, where the user can change its name and unique identifier (in RDF, this is an IRI). Klint automatically deduces which consecutive pairs of edges define triples.
Automatic suggestions. When a new FrameBase node is created, its identifier is initially unspecified. Klint will try to use its context (the type of the neighboring nodes) and the integration engine to suggest possible values automatically. Specifically, if the node is a property emanating from a frame class, the FE properties of that frame class will be suggested, and if it is the subject of a FE property, the frame class to which this FE property belongs will be suggested. In case there are too many suggestions, the search function may be used instead.
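The two context-based suggestion cases can be sketched in Python as follows. The table of frames and FE properties is a hypothetical toy index, and the function names are illustrative rather than part of Klint.

```python
# Hypothetical toy index of frame classes and their FE properties.
FRAME_FES = {
    "frame:Marriage": ["fe:Partner_1", "fe:Partner_2", "fe:Time", "fe:Place"],
    "frame:Death": ["fe:Protagonist", "fe:Time", "fe:Place"],
}

def suggest_properties(frame_class):
    """New node is a property leaving an instance of frame_class: offer its FEs."""
    return list(FRAME_FES.get(frame_class, []))

def suggest_frame_classes(fe_property):
    """New node is the subject of fe_property: offer the frames that own it."""
    return [frame for frame, fes in FRAME_FES.items() if fe_property in fes]
```

A FE name shared by several frames, such as `fe:Time` in this toy index, returns several candidates, which is the situation in which a search function becomes the more practical option.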
Visual knowledge building
In this mode, the user is able to introduce FrameBase nodes as well as external nodes, but not variable nodes. Accordingly, the resulting graph represents knowledge rather than a query, and can be exported in any of the common RDF formats. Unlike the Assisted Schema Integration mode, this is not meant to produce massive amounts of data from external structured sources, but rather serves as a source-neutral way of evaluating the expressivity of FrameBase when creating small examples of knowledge.
Visual query building
In this mode, the user can introduce FrameBase nodes and variable nodes, but not external nodes. The resulting graph represents any kind of query, with the purpose of retrieving existing knowledge under the FrameBase schema, instead of an integration rule implemented specifically as a CONSTRUCT query, as in the case of the Assisted Schema Integration mode.
Specifically, the user may choose between the following different options:
Obtain a SELECT SPARQL query, suited for a FrameBase KB, selecting all variables.
Obtain a CONSTRUCT SPARQL query, suited for a FrameBase KB, extracting all the knowledge that follows a given pattern.
Run the SELECT/CONSTRUCT SPARQL query directly, visualizing the results.
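How a graph pattern might be serialized into the first two options can be sketched in Python. This is an assumption about the general shape of such an export, not Klint's actual code; the pattern uses hypothetical prefixed names, and the prefix declarations that a complete SPARQL query needs are omitted.

```python
def variables(pattern):
    """Collect the variables of a pattern in order of first appearance."""
    seen = []
    for triple in pattern:
        for term in triple:
            if term.startswith("?") and term not in seen:
                seen.append(term)
    return seen

def where_block(pattern):
    """Serialize each triple of the pattern as one SPARQL triple pattern."""
    return "\n".join(f"  {s} {p} {o} ." for s, p, o in pattern)

def select_query(pattern):
    """Build a SELECT query that projects every variable in the pattern."""
    projection = " ".join(variables(pattern))
    return f"SELECT {projection}\nWHERE {{\n{where_block(pattern)}\n}}"

def construct_query(pattern):
    """Build a CONSTRUCT query that extracts everything matching the pattern."""
    body = where_block(pattern)
    return f"CONSTRUCT {{\n{body}\n}}\nWHERE {{\n{body}\n}}"

# Hypothetical pattern: frame instances of one class with one FE value.
pattern = [
    ("?e", "a", "fb:frame-Marriage"),
    ("?e", "fb:fe-Partner_1", "?p"),
]
```

Using the same pattern in both the template and the WHERE clause of the CONSTRUCT query matches the described intent of extracting all knowledge that follows a given pattern.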
Representation layer
RDF graphs are commonly regarded as directed labelled graphs, with each Subject-Predicate-Object triple as an edge from the subject to the object, labelled with the predicate. However, this is a simplification: predicates can also be subjects or objects of other triples, so RDF graphs are properly bipartite graphs [33], in which each triple is represented by two consecutive directed edges, one from subject to predicate and one from predicate to object. The graph representation in Klint maintains the bipartite model, so RDF is fully supported, but it also maintains a data layer equivalent to a directed labelled graph (where each triple is an edge labelled with the predicate), which is the one used to provide an intuitive interface to the user.
To make the presentation of this graph clearer, Klint relies on a combination of visual cues: it uses physics simulation algorithms to maintain a similar orientation for edges of the same triple; it uses a distinctly different representation for subject and object nodes in relation to predicates; and it creates alias nodes for predicates, so that if the same resource is used as a predicate in two different triples, or as a predicate in one and as a subject in another, it is represented by two different nodes that are internally linked to the same resource. To create human-readable labels, Klint looks for an explicit label for a node and, if none is found, falls back on a heuristic that extracts the last part of the IRI, which is often human-readable.
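The label fallback can be sketched in Python as follows. The function names and the example IRIs are illustrative, and replacing underscores with spaces is an extra assumption made here for readability, not something the text specifies.

```python
def fallback_label(iri):
    """Derive a label from the last part of an IRI (after '#' or '/')."""
    trimmed = iri.rstrip("/#")
    cut = max(trimmed.rfind("#"), trimmed.rfind("/"))
    # Assumption: underscores are shown as spaces to aid readability.
    return trimmed[cut + 1:].replace("_", " ")

def node_label(iri, explicit_labels):
    """Prefer an explicit label; otherwise fall back on the IRI heuristic."""
    return explicit_labels.get(iri) or fallback_label(iri)
```

For instance, `fallback_label("http://dbpedia.org/ontology/birthPlace")` returns `"birthPlace"`, while an entry in `explicit_labels` for the same IRI would take precedence.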
When users want to make a modification in the graph, they can do so in a simple, visual way. Subject–Predicate–Object triples can be added or removed by adding or removing edges between nodes. The system automatically creates a new triple after subject-predicate and predicate-object edges are created, sharing the predicate. Temporary links are shown for unfinished subject-predicate edges.
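The derivation of triples from pairs of edges can be sketched in Python. The sketch assumes that every use of a predicate has its own alias node, as described above, so each alias has at most one incoming and one outgoing edge; node identifiers and the function name are hypothetical.

```python
def derive_triples(edges, alias_resource):
    """Pair subject->alias and alias->object edges into triples.

    edges: list of (source_node, target_node) pairs drawn by the user.
    alias_resource: maps each predicate alias node to the resource it stands for.
    Returns (complete triples, unfinished subject-predicate links).
    """
    incoming, outgoing = {}, {}
    for source, target in edges:
        if target in alias_resource:
            incoming[target] = source      # subject -> predicate edge
        elif source in alias_resource:
            outgoing[source] = target      # predicate -> object edge
    triples, pending = [], []
    for alias, subject in incoming.items():
        if alias in outgoing:
            triples.append((subject, alias_resource[alias], outgoing[alias]))
        else:
            pending.append((subject, alias_resource[alias]))
    return triples, pending

# Two alias nodes for the same (hypothetical) resource used in two places.
alias_resource = {"p1": "fb:fe-Time", "p2": "fb:fe-Time"}
edges = [("_:frame1", "p1"), ("p1", "past"), ("_:frame2", "p2")]
triples, pending = derive_triples(edges, alias_resource)
```

Here the first alias yields a complete triple, while the second has only its subject-predicate edge and is therefore reported as pending, corresponding to a temporary link in the interface.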
Editing an integration rule: Example
We next consider the steps one takes to edit a graph representing an integration rule so as to make it more precise. Figure 6 presents an integration rule that expresses that the DBpedia property

Figure 6: (a) The starting point is the initial integration rule in the lower part of the screen, which includes a frame instance labeled “An entity”. After clicking the empty area in the upper space, a new node is created. Klint marks it green, assuming it belongs to a source KB. (b) Once an edge has been created between the frame instance and the new node, Klint recognizes that this is meant to be a property whose domain is the type of the frame instance, i.e. the frame class, and therefore, when the new node is selected, it automatically proposes all the FE properties of that frame as possible values, including “Time”. (c) After selecting “Apply” for “Time”, the new node is re-assigned as a FrameBase node. (d) After creating a literal node “past” and a second edge from “Time” to “past”, Klint recognizes the new triple and updates the rule.
While we have thus far not carried out any full-fledged user studies with untrained users, the manual construction of integration rules within the ePOOLICE EU project [46] showed that the use of a graphical interface of this sort would be advantageous.

Algorithm 1: WordNet-based algorithm

Algorithm 2: FrameNet-based algorithm
We add multilingual support to FrameBase by adding multilingual labels to microframe classes, alongside the already existing ones in English. We consider this more appropriate than creating new microframe classes for terms in another language, since microframe classes are semantic elements that ought to be language-independent. While there may be connotational and denotational differences in nuance between translated terms, and there are terms that do not have a direct translation into other languages, the first phenomenon can be disregarded in most applications of knowledge representation and linked data, and the second phenomenon only pertains to a comparably small number of cases (and FrameNet as well as FrameBase allow for multi-word expressions, which can potentially alleviate this).
We do not consider multilingual labels of RDF terms such as FE properties, macroframes or FrameBase-specific classes and properties, because the lexical labels for these terms are not meant to be used in text mining.
Method description
We implement our method for the case of Swedish, but it can be applied for another second language (which we will denote throughout the paper as
WordNet-based method
This method, described in Algorithm 1, depends on the availability of a mapping from Princeton WordNet synsets to some lexicon for the other language (we denote this mapping as
This algorithm can be generalized for any language for which there is a multilingual WordNet version. Currently there are manually created open-source WordNets for over 30 languages, while many further languages are covered by automatically produced WordNets such as the Universal WordNet [20].
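The general idea of the WordNet-based method can be sketched in Python under stated assumptions. The sketch assumes that each synset-microframe is keyed by a WordNet synset identifier and that the mapping to the other language is available as a dictionary from synset identifiers to lemmas; the function name, the identifiers, and the toy data are hypothetical, and the sketch is not a transcription of Algorithm 1.

```python
def add_labels(synset_microframes, synset_to_lemmas, lang="sv"):
    """Attach target-language labels to synset-microframes via a synset mapping.

    synset_microframes: maps microframe id -> WordNet synset id (assumed layout).
    synset_to_lemmas: maps synset id -> lemmas in the target language.
    Returns microframe id -> list of (label, language tag) pairs.
    """
    labels = {}
    for frame, synset in synset_microframes.items():
        lemmas = synset_to_lemmas.get(synset, [])
        if lemmas:  # microframes without a mapped synset stay unlabelled
            labels[frame] = [(lemma, lang) for lemma in lemmas]
    return labels

# Toy data for illustration only.
frames = {"frame:dog": "synset-dog-n", "frame:unmapped": "synset-x"}
mapping = {"synset-dog-n": ["hund"]}
labels = add_labels(frames, mapping)
```

Because labels are copied directly from the mapping, the coverage of the result is bounded by the coverage of that mapping.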
Algorithm 2 uses the following two resources.
A function
A dictionary function
This algorithm relies on the premise that among the possible translations for the term associated with an English LU (the elements of
We test this algorithm with Swedish in Section 5.2, but it can easily be extended to other languages.
As mentioned above,
Because
A particular example of a mistaken translation that is avoided by the algorithm is the following one. The LU-microframe
Results and evaluation
The results from Algorithm 1 covered 6,796 synset-microframes of a total of 117,659 in FrameBase 2.0, producing in total 7,331 Swedish labels. This does not need further evaluation for correctness because it relies on manual mappings.
Results of adding Swedish labels to FrameBase
Lossless use of manual mappings.
Examples of obtained Swedish annotations
The results from Algorithm 2 on LU-microframes produced 3,542 labels for 2,847 LU-microframes out of a total of 11,114. We evaluated 100 labels, of which 99 were correct. The labels were sampled uniformly at random, without replacement. A label was judged correct if an English sentence could be constructed such that (1) the English LU defining the LU-microframe evokes the macroframe, and (2) a translation of the sentence would include the Swedish label as the translation of the English LU. The evaluation was jointly performed by two annotators, and it is available at
This evaluation yields a precision of 0.99, and the recall, under the simplifying assumption of one correct translation per LU, is 0.26. The somewhat low coverage is due to the low coverage of SweFn
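The reported figures can be checked with a few lines of Python. The counts are taken from the text; reading the recall of 0.26 as the fraction of LU-microframes that received a label is an interpretation made here, which appears consistent with the numbers but is not spelled out in the text.

```python
def ratio(part, whole):
    """Plain fraction, kept as a helper so each figure reads like the text."""
    return part / whole

precision = ratio(99, 100)               # correct labels in the evaluated sample
lu_coverage = ratio(2847, 11114)         # LU-microframes that received a label
synset_coverage = ratio(6796, 117659)    # synset-microframes covered by Algorithm 1
labels_per_lu = ratio(3542, 2847)        # average Swedish labels per covered LU

print(round(precision, 2), round(lu_coverage, 2),
      round(synset_coverage, 3), round(labels_per_lu, 2))
```

Rounded, the LU-microframe coverage is 0.26 and the synset-microframe coverage from Algorithm 1 is about 0.058.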
We also tested Algorithm 2 on synset-microframes, and we used SweFn
Besides simple lexical annotations using
Conclusion
We have presented two approaches to advance the integration of heterogeneous knowledge, both structured and unstructured, into FrameBase, a schema able to represent a wide range of knowledge using linguistic structures.
Klint is a Web-based framework that allows users to supervise an automatic integration of heterogeneous structured knowledge by providing a user-friendly graph-based interface that enables the reviewing and curation of complex integration rules produced by state-of-the-art integration algorithms. We have also demonstrated how FrameBase can be extended to support multilingual labels by reusing publicly available lexical resources, which has the potential to benefit text mining in languages other than English.
