Abstract
The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by the traditional Extract-Transform-Load (ETL) tools because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Different to other ETL tools, we automate the ETL data flows by creating metadata at the schema level. Therefore, it relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETL CONSTRUCT ), based on the innovative ETL constructs proposed here. To evaluate SETL CONSTRUCT , we create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset using it and compare it with the previous programmable framework SETL PROG in terms of productivity, development time, and performance. The evaluation shows that 1) SETL CONSTRUCT uses 92% fewer Number of Typed Characters (NOTC) than SETL PROG , and SETL AUTO (the extension of SETL CONSTRUCT for generating ETL execution flows automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETL CONSTRUCT , the development time is almost cut in half compared to SETL PROG , and is cut by another 27% using SETL AUTO ; and 3) SETL CONSTRUCT is scalable and has similar performance compared to SETL PROG . We also evaluate our approach qualitatively by interviewing two ETL experts.
Introduction
Semantic Web (SW) technologies enable adding a semantic layer over the data; thus, the data can be processed and effectively retrieved by both humans and machines. The Linked Data (LD) principles are the set of standard rules to publish and connect data using semantic links [9]. With the growing popularity of the SW and LD, more and more organizations natively manage data using SW standards, such as Resource Description Framework (RDF), RDF-Schema (RDFs), the Web Ontology Language (OWL), etc. [17]. Moreover, one can easily convert data given in another format (database, XML, JSON, etc.) into RDF format using an RDF Wrappers [8]. As a result, a lot of semantic datasets are now available in different data portals, such as DataHub,1
OLAP is a well-recognized technology to support decision making by analyzing data integrated from multiple sources. The integrated data are stored in a Data Warehouse (DW), typically structured following the Multidimensional (MD) Model that represents data in terms of facts and dimensions to enable OLAP queries. The integration process for extracting data from different sources, translating them according to the underlying semantics of the DW, and loading them into the DW is known as Extract-Transform-Load (ETL). One way to enable OLAP over semantic data is by extracting those data and translating them according to the DW’s format using a traditional ETL process. [47] outlines such a type of semi-automatic method to integrate semantic data into a traditional Relational Database Management System (RDBMS)-centric MD DW. However, the process does not maintain all the semantics of data as they are conveying in the semantic sources; hence, the integrated data no more follow the SW data principles defined in [27]. The semantics of the data in a semantic data source is defined by 1) using Internationalized Resource Identifiers (IRIs) to uniquely identify resources globally, 2) providing common terminologies, 3) semantically linking with published information, and 4) providing further knowledge (e.g., logical axioms) to allow reasoning [7].
Therefore, considering semantic issues in the integration process should be emphasized. Moreover, initiatives such as Open Government Data3
Semantic Web technologies address the problems described above, as they allow adding semantics at both data and schema level in the integration process and publish data in RDF using the LD principles. On the SW, the RDF model is used to manage and exchange data, and RDFS and OWL are used in combination with the RDF data model to define constraints that data must meet. Moreover, Data Cube (QB) [12] and Data cube for OLAP (QB4OLAP) [19] vocabularies can be used to define data with MD semantics. [44] refers to an MD DW that is semantically annotated both at the schema and data level as a Semantic DW (SDW). An SDW is based on the assumption that the schema can evolve and be extended without affecting the existing structure. Hence, it overcomes the problems triggered by the evolution of an RDBMS-based data warehousing system. On top of that, as for the physical storage of the facts and pre-aggregated values, a physically materialized SDW, the setting we focus on, store both of these as triples in the triplestore. Thus, a physical SDW is a new type of OLAP (storage) compared to classical Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), and their combination Hybrid OLAP (HOLAP) [61]. In general, physical SDW offers more expressivity at the cost of performance [35]. In [44], we proposed SETL (throughout this present paper, we call it SETL
PROG
), a programmable semantic ETL framework that semantically integrates both semantic and non-semantic data sources. In SETL
PROG
, an ETL developer has to create hand-code specific modules to deal with semantic data. Thus, there is a lack of a well-defined set of basic ETL constructs that allows developers having a higher level of abstraction and more control in creating their ETL process. In this paper, we propose a strong foundation for an RDF-based semantic integration process and a set of high-level ETL constructs that allows defining, mapping, processing, and integrating semantic data. The unique contributions of this paper are:
We structure the integration process into two layers: Definition Layer and Execution Layer. Different to SETL
PROG
or other ETL tools, here, we propose a new paradigm: the ETL flow transformations are characterized once and for all at the Definition Layer instead of independently within each ETL operation (in the Execution Layer). This is done by generating a mapping file that gives an overall view of the integration process. This mapping file is our primary metadata source, and it will be fed (by either the ETL developer or the automatic ETL execution flow generation process) to the ETL operations, orchestrated in the ETL flow (Execution Layer), to parametrize themselves automatically. Thus, we are unifying the creation of the required metadata to automate the ETL process in the Definition layer. We propose an OWL-based Source-To-target Mapping (S2TMAP) vocabulary to express the source-to-target mappings. We provide a set of high-level ETL constructs for each layer. The Definition Layer includes constructs for target schema4 Here, we use the terms “target” and “MD SDW” interchangeably.
We propose an approach to automate the ETL execution flows based on metadata generated in the Definition Layer.
We create a prototype SETL CONSTRUCT , based on the innovative ETL constructs proposed here. SETL CONSTRUCT allows creating ETL flows by dragging, dropping, and connecting the ETL operations. In addition, it allows creating ETL data flows automatically (we call it SETL AUTO ).
We perform a comprehensive experimental evaluation by producing an MD SDW that integrates an EU farm Subsidy dataset and a Danish Business dataset. The evaluation shows that SETL CONSTRUCT improves considerably over SETL PROG in terms of productivity, development time, and performance. In summary: 1) SETL CONSTRUCT uses 92% fewer Number of Typed Characters (NOTC) than SETL PROG , and SETL AUTO further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETL CONSTRUCT , the development time is almost cut in half compared to SETL PROG , and is cut by another 27% using SETL AUTO ; 3) SETL CONSTRUCT is scalable and has similar performance compared to SETL PROG . Additionally, we interviewed two ETL experts to evaluate our approach qualitatively.
In this section, we provide the definitions of the notions and terminologies used throughout the paper.
RDF graph
An RDF graph is represented as a set of statements, called RDF triples. The three parts of a triple are subject, predicate, and object, respectively, and a triple represents a relationship between its subject and object described by its predicate. Each triple, in the RDF graph, is represented as
Semantic data source
We define a semantic data source as a Knowledge Base (KB) where data are semantically defined. A KB is composed of two components, TBox and ABox. The TBox introduces terminology, the vocabulary of a domain, and the ABox is the assertions of the TBox. The TBox is formally defined as a 3-tuple:
Semantic data warehouse
A semantic data warehouse (SDW) is a DW with the semantic annotations. We also considered it as a KB. Since the DW is represented with Multidimensional (MD) model for enabling On-Line Analytical Processing (OLAP) queries, the KB for an SDW needs to be defined with MD semantics. In the MD model, data are viewed in an n-dimensional space, usually known as a data cube, composed of facts (the cells of the cube) and dimensions (the axes of the cube). Therefore, it allows users to analyze data along several dimensions of interest. For example, a user can analyze sales of products according to time and store (dimensions). Facts are the interesting things or processes to be analyzed (e.g., sales of products) and the attributes of the fact are called measures (e.g., quantity, amount of sales), usually represented as numeric values. A dimension is organized into hierarchies, composed of several levels, which permit users to explore and aggregate measures at various levels of detail. For example, the location hierarchy (
We use the QB4OLAP vocabulary to describe the multidimensional semantics over a KB [19]. QB4OLAP is used to annotate a TBox with MD components and is based on the QB vocabulary which is the W3C standard to publish MD data on the Web [15]. QB is mostly used for analyzing statistical data and does not adequately support OLAP MD constructs. Therefore, in this paper, we choose QB4OLAP. Figure 1 depicts the ontology of QB4OLAP [62]. The terms prefixed with “qb:” are from the original QB vocabulary, and QB4OLAP terms are prefixed with “qb4o:” and displayed with gray background. Capitalized terms represent OWL concepts, and non-capitalized terms represent OWL properties. Capitalized terms in italics represent concepts with no instances. The blue-colored square in the figure represents our extension of QB4OLAP ontology.

QB4OLAP vocabulary.
In QB4OLAP, the concept
The ontology of the Danish Business dataset. Due to the large number of datatype properties, they are not included.
We create a semantic Data Warehouse (SDW) by integrating two data sources, namely, a Danish Agriculture and Business knowledge base and an EU Farm Subsidy dataset. Both data sources are described below.
Description of Danish Agriculture and Business knowledge base The Danish Agriculture and Business knowledge base integrates a Danish Agricultural dataset and a Danish Business dataset. The knowledge base can be queried through the SPARQL endpoint
We start the description from the concept
Description of the EU subsidy dataset Every year, the European Union provides subsidies to the farms of its member countries. We collect EU Farm subsidies for Denmark from

The ontology of the Subsidy dataset. Due to the large number of datatype properties, all are not included.

The ontology of the MD SDW. Due to the large number, data properties of the dimensions are not shown.
Listing 1 shows the example instances of Example instances of the DBD and the Subsidy dataset
Description of the Semantic Data Warehouse Our goal is to develop an MD Semantic Data Warehouse (SDW) by integrating the Subsidy and the DBD datasets. The

The overall semantic data integration process. Here, the round-corner rectangle, data stores, dotted boxes, ellipses, and arrows indicate the tasks, semantic data sources and SDW, the phases of the ETL process, ETL operations and flow directions.
In this paper, we assume that all given data sources are semantically defined and the goal is to develop an SDW. The first step of building an SDW is to design its TBox. There are two approaches to design the TBox of an SDW, namely source-driven and demand-driven [61]. In the former, the SDW’s TBox is obtained by analyzing the sources. Here, ontology alignment techniques [40] can be used to semi-automatically define the SDW. Then, designers can identify the multidimensional constructs from the integrated TBox and annotate them with the QB4OLAP vocabulary. In the latter, SDW designers first identify and analyze the needs of business users as well as decision makers, and based on those requirements, they define the target TBox with multidimensional semantics using the QB4OLAP vocabulary. How to design a Target TBox is orthogonal to our approach. Here, we merely provide an interface to facilitate creating it regardless of whatever approach was used to design it.
Summary of the ETL operations
After creating the TBox of the SDW, the next step is to create the ETL process. ETL is the backbone process by which data are entered into the SDW and the main focus of this paper. The ETL process is composed of three phases: extraction, transformation, and load. A phase is a sub-process of the ETL which provides a meaningful output that can be fed to the next phase as an input. Each phase includes a set of operations. The extraction operations extract data from the data sources and make it available for further processing as intermediate results. The transformation operations are applied on intermediate results, while the load operations load the transformed data into the DW. The intermediate results can be either materialized in a data staging area or kept in memory. A data staging area (temporary) persists data for cleansing, transforming, and future use. It may also prevent the loss of extracted or transformed data in case of the failure of the loading process.
As we want to separate the metadata needed to create ETL flows from their execution, we introduce a two-layered integration process, see Fig. 5. In the Definition Layer, a single source of metadata truth is defined. This includes: the target SDW, semantic representation of the source schemas, and a source to target mapping file. Relevantly, the metadata created represents the ETL flow at the schema level. In the Execution Layer, ETL data flows based on high-level operations are created. This layer executes the ETL flows for instances (i.e., at the data level). Importantly, each ETL operation is fed the metadata created to parameterize themselves automatically. Additionally, the Execution Layer automatically checks the correctness of the created flow, by checking the compatibility of the output and input of consecutive operators. Overall the data integration process requires the following four steps in the detailed order.
Defining the target TBox with MD semantics using QB and QB4OLAP constructs. In addition, the TBox can be enriched with RDFS/OWL concepts and properties. However, we do not validate the correctness of the added semantics beyond the MD model. This step is done at the Definition Layer. Extracting source TBoxes from the given sources. This step is done at the Definition Layer. Creating mappings among source and target constructs to characterize ETL flows. The created mappings are expressed using the proposed S2TMAP vocabulary. This step is also done at the Definition Layer. Populating the ABox of the SDW implementing ETL flows. This step is done at the Execution Layer.
Figure 5 illustrates the whole integration process and how the constructs of each layer communicate with each other. Here, we introduce two types of constructs: tasks and operations. On the one hand, a task requires developer interactions with the interface of the system to produce an output. Intuitively, one may consider the tasks output as the required metadata to automate operations. On the other hand, from the given metadata, an operation produces an output. The Definition Layer consists of two tasks (TargetTBoxDefinition and SourceToTargetMapping) and one operation (TBoxExtraction). These two tasks respectively address the first and third steps of the integration process mentioned above, while the TBoxExtraction operation addresses the second step. This is the only operation shared by both layers (see the input of SourceToTargetMapping in Fig. 5). Therefore, the Definition Layer creates three types of metadata: target TBox (created by TargetTBoxDefinition), source TBoxes (created by TBoxExtraction), and source-to-target mappings (created by SourceToTargetMapping). The Execution Layer covers the fourth step of the integration process and includes a set of operations to create data flows from the sources to the target. Figure 5 shows constructs (i.e., the Mediatory Constructs) used by the ETL task/operations to communicate between them. These mediatory constructs store the required metadata created to automate the process. In the figure,
Since traditional ETL tools (e.g., PDI) do not have ETL operations supporting the creation of an SDW, we propose a set of ETL operations for each phase of the ETL to process semantic data sources. The operations are categorized based on their functionality. Table 1 summarizes each operation with its corresponding category name, compatible successors, and its objectives. Next, we present the details of each construct of the layers presented in Fig. 5.
This layer contains two tasks (TargetTBoxDefinition and Source2TargetMapping) and one operation TBoxExtraction. The semantics of the tasks are described below.
TargetTBoxDefinition The objective of this task is to define a target TBox with MD semantics. There are two main components of the MD structure: dimensions and cubes. To formally define the schema of these components, we use the notation from [12] with some modifications.
A → is a strict partial order on
Figure 4 shows that our use case MD SDW has two dimensions:
A
The cube schema of our use case, shown in Fig. 4, is formally defined as follows:
In Section 2.3, we discussed how the QB4OLAP vocabulary is used to define different constructs of an SDW. Listing 2 represents the

QB4OLAP representation of
TBoxExtraction After defining a target TBox, the next step is to extract source TBoxes. Typically, in a semantic source, the TBox and ABox of the source are provided. Therefore, no external extraction task/operation is required. However, sometimes, the source contains only the ABox, no TBox. In that scenario, an extraction process is required to derive a TBox from the ABox. Since the schema level mappings are necessary to create the ETL process, and the ETL process will extract data from the ABox, we only consider the intentional knowledge available in the ABox in the TBox extraction process. We formally define the process as follows.
The TBox extraction operation from a given ABox, ABox is defined as C: By checking the unique objects of the triples in ABox where H: The taxonomies among concepts are identified by checking the instances they share among themselves. Let P, D, R: By checking the unique predicates of the triples, P is derived. A property
Note that proving the formal correctness of the approach is beyond the scope of this paper and left for future work.
SourceToTargetMapping Once the target and source TBoxes are defined, the next task is to characterize the ETL flows at the Definition Layer by creating source-to-target mappings. Because of the heterogeneous nature of source data, mappings among sources and the target should be done at the TBox level. In principle, mappings are constructed between sources and the target; however, since mappings can get very complicated, we allow to create a sequence of SourceToTargetMapping definitions whose subsequent input is generated by the preceding operation. The communication between these operations is by means of a materialized intermediate mapping definition and it is meant to facilitate the creation of very complex flows (i.e., mappings) between source and target.
A source-to-target mapping is constructed between a source and a target TBox, and it consists of a set of concept-mappings. A concept-mapping defines i) a relationship (equivalence, subsumption, supersumption, or join) between a source and the corresponding target concept, ii) which source instances are mapped (either all or a subset defined by a filter condition), iii) the rule to create the IRIs for target concept instances, iv) the source and target ABox locations, v) the common properties between two concepts if their relationship is join, vi) the sequence of ETL operations required to process the concept-mapping, and vii) a set of property-mappings for the properties having the target concept as a domain. A property-mapping defines how a target property is mapped from either a source property or an expression over properties. Definition 4 formally defines a source-to-target mapping.
Let
The semantics of each concept-mapping tuple is given below.
represents the relationship between the source and target concept. The relationship can be either
, or a left-outer join
. A join relationship exists between two sources when there is a need to populate a target element (a level, a (QB) dataset, or a concept) from multiple sources. Since a concept-mapping represents a binary relationship, to join n sources, an ETL process requires
In principle, an SDW is populated from multiple sources, and a source-to-target ETL flow requires more than one intermediate concept-mapping definitions. Therefore, a complete ETL process requires a set of source-to-target mappings. We say a mapping file is a set of source-to-target mappings. Definition 5 formally defines a mapping file.

Graphical overview of key terms and their relationships to the S2TMAP vocabulary.
To implement the source-to-target mappings formally defined above, we propose an OWL-based mapping vocabulary: Source-to-Target Mapping (S2TMAP). Figure 6 depicts the mapping vocabulary. A mapping between a source and a target TBox is represented as an instance of the concept
To map at the property stage, a property-mapping (an instance of
Listing 3 represents a snippet of the mapping file of our use case MD SDW and the source datasets. In the Execution Layer, we show how the different segments of this mapping file will be used by each ETL operation.
An S2TMAP representation of the mapping file of our use case

The conceptual presentation of Listing 3.
A mapping file is a Directed Acyclic Graph (DAG). Figure 7 shows the DAG representation of Listing 3. In this figure, the sources, intermediate results and the SDW are denoted as nodes of the DAG and edges of the DAG represent the operations. The dotted-lines shows the parts of the ETL covered by concept-mappings, represented by a rectangle.
In the Execution Layer, ETL data flows are constructed to populate an MD SDW. Table 1 summarizes the set of ETL operations. In the following, we present an overview of each operation category-wise. Here, we give the intuitions of the ETL operations in terms of definitions and examples. To reduce the complexity and length of the paper, we place the formal semantics of the ETL operations in the Appendix. In this section, we only present the signature of each operation. That is, the main inputs required to execute the operation. As an ETL data flow is a sequence of operations and an operation in the sequence communicates with its preceding and subsequent operations by means of materialized intermediate results, all the operations presented here have side effects5
An operation has a side effect if it modifies some state variable value(s) outside its local environment (
Developers can use either of the two following options: (i) The recommended option is that given a TBox construct aConstruct (a concept, a level, or a QB dataset) and a mapping file aMappings generated in the Definition Layer, the automatic ETL execution flow generation process will automatically extract the parameter values from aMappings (see Section 7 for a detailed explanation of the automatic ETL execution flow generation process). (ii) They can manually set input parameters at the operation level. In this section, we follow the following order to present each operation: 1) we first give a high-level definition of the operation; 2) then, we define how the automatic ETL execution flow generation process parameterizes the operation from the mapping file, and 3) finally, we present an example showing how developers can manually parameterize the operation. When introducing the operations and referring to their automatic parametrization, we will refer to aMappings and aConstruct as defined here. Note that each operation is bound to exactly one concept-mapping at a time in the mapping file (discussed in Section 7).
Extraction is process of data retrieval from the sources. Here, we introduce two extraction operations for semantic sources: (i) GraphExtractor – to form/extract an RDF graph from a semantic source and (ii) TBoxExtraction – to derive a TBox from a semantic source as described in Section 5. As such, TBoxExtraction is the only operation in the Execution Layer generating metadata stored in the Mediatory Constructs (see Fig. 5).
GraphExtractor(Q, G, outputPattern, tABox) Since the data integration process proposed in this paper uses RDF as the canonical model, we extract/generate RDF triples from the sources with this operation. GraphExtractor is functionally equivalent to SPARQL CONSTRUCT queries [39].
If the ETL execution flow is generated automatically, the automatic ETL execution flow generation process first identifies the concept-mapping cm from aMappings where aConstruct appears (i.e., in aMapping,
Listing 1 shows the example instances of the Danish Business Dataset (DBD). To extract all instances of Q=((?ins, To make it easily distinguishable, here, we use comma instead of space to separate the components of a triple pattern and an RDF triple.
G=“/map/dbd.ttl”,
outputPattern= (?ins,?p,?v),
tABox=“/map/com.ttl”.7
We present the examples in Turtle format for reducing the space and better understanding. In practice, our system prefers N-Triples format to support scalability.
Example of GraphExtractor
TBoxExtraction is already described in Section 5, therefore, we do not repeat it here.
Transformation operations transform the extracted data according to the semantics of the SDW. Here, we define the following semantic-aware ETL transformation operations: TransformationOnLiteral, JoinTransformation, LevelMemberGenerator, ObservationGenerator, ChangedDataCapture, UpdateLevel, External linking, and MaterializeInference. The following describe each operation.
TransformationOnLiteral(sConstruct, tConstruct, sTBox, sABox propertyMappings, tABox) As described in the SourceToTargetMapping task, a property (in a property-mapping) of a target construct (i.e., a level, a QB dataset, or a concept) can be mapped to either a source concept property or an expression over the source properties. An expression allows arithmetic operations, datatype (string, number, and date) conversion and processing functions, and group functions (sum, avg, max, min, count) as defined in SPARQL [25]. This operation generates the instances of the target construct by resolving the source expressions mapped to its properties.
If the ETL execution flow is generated automatically, the automatic ETL execution flow generation process first identifies the concept-mapping
Listing 5 (lines 16–19) shows the transformed instances after applying the operation TransformationOnLiteral(sConstruct, tConstruct, sTBox, sABox, PropertyMappings, tABox), where
sConstruct = tConstruct= sTBox=”/map/subsidyTBox.ttl”, sABox= source instances of propertyMappings = lines 2–14 in Listing 5, tABox=”/map/temp1.ttl”. Example of TransformationOnLiteral

JoinTransformation(sConstruct, tConstruct, sTBox, tTBox, sABox, tABox, comProperty, propertyMappings) A TBox construct (a concept, a level, or a QB dataset) can be populated from multiple sources. Therefore, an operation is necessary to join and transform data coming from different sources. Two constructs of the same or different sources can only be joined if they share some common properties. This operation joins a source and a target constructs based on their common properties and produce the instances of the target construct by resolving the source expressions mapped to target properties. To join n sources, an ETL process requires n-1 JoinTransformation operations.
If the ETL execution flow is generated automatically, the automatic ETL execution flow generation process first identifies the concept-mapping cm from aMappings, where aConstruct appears and the operation to process
A developer can also manually set the parameters. Once it is parameterized, JoinTransformation joins two constructs based on comProperty, transforms their data based on the expressions (specified through
The recipients in sConstruct= tConstruct= sTBox= “/map/businessTBox.ttl”, tTBox= “/map/subsidyTBox.ttl”, sABox= source instances of tABox= source instances of comProperty = lines 31, 34–37 in Listing 3, propertyMappings = lines 73–96 in Listing 3. Example of JoinTransformation
Listing 6 shows the output of the joinTransformation operation.

LevelMemberGenerator(sConstruct, level, sTBox, sABox, tTBox, iriValue, iriGraph, propertyMappings, tABox) In QB4OLAP, dimensional data are physically stored in levels. A level member, in an SDW, is described by a unique IRI and its semantically linked properties (i.e., level attributes and roll-up properties). This operation generates data for a dimension schema defined in Definition 1.
If the ETL execution flow is generated automatically, the automatic process first identifies the concept-mapping cm from aMappings, where aConstruct appears and the operation to process cm is LevelMemberGenerator. Then it parameterizes LevelMemberGenerator as follows: 1) sConstruct is the source construct defined by
A level is termed as a level property in QB4OLAP, therefore, throughout this paper, we use both the term “level” and “level property” interchangeably.
A rule can be either a source property, an expression or incremental, as described in Section 5.
The IRI graph is an RDF graph that keeps a triple for each resource in the SDW with their corresponding source IRI.
A developer can also manually set the paramenters. Once it is parameterized, LevelMemberGenerator operation generates QB4OLAP-compliant triples for the level members of level based on the semantics encoded in tTBox and stores them in tABox.
Listing 3 shows a concept-mapping (lines 40–54) describing how to populate sConstruct= level= sTBox= “/map/subsidyTBox.ttl”, sABox= “/map/subsidy.ttl”, shown in Example 7, tTBox = “/map/subsidyMDTBox.ttl”, iriValue = iriGraph = “/map/provGraph.nt”, propertyMappings= lines 98–115 in Listing 3, tABox=”/map/temp.ttl”. Example of LevelMemberGenerator

ObservationGenerator(sConstruct, dataset, sTBox, sABox, tTBox, iriValue, iriGraph, propertyMappings, tABox) In QB4OLAP, an observation represents a fact. A fact is uniquely identified by an IRI, which is defined by a combination of several members from different levels and contains values for different measure properties. This operation generates data for a cube schema defined in Definition 2.
If the ETL execution flow is generated automatically, the way used by the automatic ETL execution flow generation process to extract values for the parameters of ObservationGenerator from aMappings is analogous to LevelMemberGenerator. Developers can also manually set the parameters. Once it is parameterized, the operation generates QB4OLAP-compliant triples for observations of the QB datasetdataset based on the semantics encoded in tTBox and stores them in tABox.
Listing 8 (lines 21–25) shows a QB4OL-AP-compliant observation create by the ObservationGenerator(sConstruct, dataset, sTBox, sABox, tTBox, iriValue, iriGraph, propertyMappings, tABox) operation, where
sConstruct = dataset = sTBox = “/map/subsidyTBox.ttl” sABox = “/map/subsidy.ttl”, shown in Example 6, tTBox = “/map/subsidyMDTBox.ttl”, shown in Listing 2, iriValue = “Incremental”, iriGraph = “/map/provGraph.nt”, propertyMappings = lines 8–19 in Listing 8, tABox = “/map/temp2.ttl”. Example of ObservationGenerator

ChangedDataCapture(nABox, oABox, flag) In a real-world scenario changes occur in a semantic source both at the schema and instance level. Therefore, an SDW needs to take action based on the changed schema and instances. The adaption of the SDW TBox with the changes of source schemas is an analytical task and requires the involvement of domain experts, therefore, it is out of the scope of this paper. Here, only the changes at the instance level are considered.
If the ETL execution flow is generated automatically, the automatic process first identifies the concept-mapping cm from aMappings, where aConstruct appears and the operation to process cm is ChangedDataCapture. Then, it takes
Developers can also manually set the parameters. From the given inputs, ChangedDataCapture outputs either 1) a set of new instances (in the case of SDW evolution, i.e.,
Suppose Listing 6 is the old ABox of
Example of ChangedDataCapture
UpdateLevel(level, updatedTriples, sABox, tTBox, tABox, propertyMappings, iriGraph) Based on the triples updated in the source ABox sABox for the level level (generated by ChangedDataCapture), this operation updates the target ABox tABox to reflect the changes in the SDW according to the semantics encoded in the target TBox tTBox and level property-mappings propertyMappings. Here, we address three update types (Type1-update, Type2-update, and Type3-update), defined by Ralph Kimball in [37] for a traditional DW, in an SDW environment. The update types are already defined in tTBox for each level attribute of level (as discussed in Section 2.3), so they do not need to be provided as parameters. As we consider only instance level updates, only the objects of the source updated triples are updated. To reflect a source updated triple in level, the level member using the triple to describe itself, will be updated. In short, the level member is updated in the following three ways: 1) A Type1-update simply overwrites the old object with the current object. 2) A Type2-update creates a new version for the level member (i.e., it keeps the previous version and creates a new updated one). It adds the validity interval for both versions. Further, if the level member is a member of an upper level in the hierarchy of a dimension, the changes are propagated downward in the hierarchy, too. 3) A Type3-update overwrites the old object with the new one. Besides, it adds an additional triple for each changed object to keep the old object. The subject of the additional triple is the instance IRI, the object of the triple is the old object, and the predicate is concat(oldPredicate, “oldValue”).
If the ETL execution flow is generated automatically, this operation first identifies the concept-mapping cm from aMappings, where aConstruct appears and the operation to process cm is UpdateLevel. Then it parameterizes LevelMemberGenerator as follows: 1) level is defined by the
Listing 10 describes how different types of updates work by considering two members of the Example of different types of updates
Besides the transformation operations discussed above, we define two additional transformation operations that cannot be run by the automatic ETL dataflows generation process.
ExternalLinking (sABox, externalSource) This operation links the resources of sABox with the resources of an external source externalSource. externalSource can either be a SPARQL endpoint or an API. For each resource
MaterializeInference(ABox, TBox) This operation infers new information that has not been explicitly stated in an SDW. It analyzes the semantics encoded into the SDW and enriches the SDW with the inferred triples. A subset of the OWL 2 RL/RDF rules, which encodes part of the RDF-Based Semantics of OWL 2 [52], are considered here. The reasoning rules can be applied over the TBox TBox and ABox ABox separately, and then together. Finally, the resulting inference is asserted in the form of triples, in the same spirit as how the SPARQL regime entailments11
Loader(tripleSet, tsPath) An SDW is represented in the form of RDF triples and the triples are stored in a triplestore (e.g., Jena TDB). Given a set of RDF triples triplesSet and the path of a triple store tsPath, this operation loads triplesSet in the triple store.
If the ETL execution flow is generated automatically, this operation first identifies the concept-mapping cm from aMappings, where aConstruct appears and the operation to process cm is Loader. Then, it takes values of
Automatic ETL execution flow generation
We can characterize ETL flows at the Definition Layer by means of the source-to-target mapping file; therefore, the ETL data flows in the Execution Layer can be generated automatically. This way, we can guarantee that the created ETL data flows are sound and relieve the developer from creating the ETL data flows manually. Note that this is possible because the Mediatory Constructs (see Fig. 5) contain all the required metadata to automate the process.



Algorithm 1 shows the steps required to create ETL data flows to populate a target construct (a level, a concept, or a dataset). As inputs, the algorithm takes a target construct x, the mapping file G, and the IRI graph
Here, some functions used in the algorithms are characterized by a pattern over G, shown at its side as comments in the corresponding algorithm.
Algorithm 2 generates an ETL data flow for a concept-mapping c and recursively does so if the current concept-mapping source element is connected to another concept-mapping, until it reaches a source element. Algorithm 2 recursively calls itself and uses a stack to preserve the order of the partial ETL data flows created for each concept-mapping. Eventually, the stack contains the whole ETL data flow between the source and target schema.
Algorithm 2 works as follows. The sequence of operations in c is pushed to the stack after parameterizing it (lines 1–2). Algorithm 3 parameterizes each operation in the sequence, as described in Section 6 and returns a stack of parameterized operations. As inputs, it takes the operation sequence, the concept-mapping, the mapping file, and the IRI graph. For each operation, it uses the

The graph presentation of a part of Listing 3.
Then, Algorithm 2 traverses to the adjacent concept-mapping of c connected via c’s source concept (line 3–4). After that, the algorithm recursively calls itself for the adjacent concept-mapping (line 6). Note that here, we set a restriction: except for the final target constructs, all the intermediate source and target concepts can be connected to at most one concept-mapping. This constraint is guaranteed when building the metadata in the Definition Layer. Once there are no more intermediate concept-mappings, the algorithm pushes a dummy starting ETL operation (StartOp) (line 7) to the stack and returns it. StartOp can be considered as the root of the ETL data flows that starts the execution process. The stacks generated for each concept-mapping of
In this section, we show how to populate
Here, lines 95–112 in Listing 3.
(..) indicates that the operation is parameterized.
Then,
lines 32, 36–39 in Listing 3.
Lines 76–93 in Listing 3.
We created a GUI-based prototype, named SETL
CONSTRUCT
[17] based on the different high-level constructs described in Sections 5 and 6. We use Jena 3.4.0 to process, store, and query RDF data and Jena TDB as a triplestore. SETL
CONSTRUCT
is developed in Java 8. Like other traditional tools (e.g., PDI [11]), SETL
CONSTRUCT
allows developers to create ETL flows by dragging, dropping, and connecting the ETL operations. The system is demonstrated in [17]. On top of SETL
CONSTRUCT
, we implement the automatic ETL execution flow generation process discussed in Section 7; we call it SETL
AUTO
. The source code, examples, and developer manual for SETL
CONSTRUCT
and SETL
AUTO
are available at
To evaluate SETL
CONSTRUCT
and SETL
AUTO
, we create an MD SDW by integrating the Danish Business Dataset (DBD) [3] and the Subsidy dataset (
In this evaluation process, we use SETL PROG as our competitive system. We could not directly compare SETL CONSTRUCT and SETL AUTO with traditional ETL tools (e.g., PDI, pygramETL) because they 1) do not support semantic-aware data, 2) are not compatible with the SW technology, and 3) cannot supprot a data warehouse that is semantically defined. On the other hand, we could also not compare them with existing semantic ETL tools (e.g., PoolParty) because they do not support multidimensional semantics at the TBox and ABox level. Therefore, they do not provide any operations for creating RDF data following multidimensional principles. Nevertheless, SETL PROG supports both semantic and non-semantic source integration, and it uses the relational model as a canonical model. In [44], SETL PROG is compared with PDI to some extent. We also present the summary of that comparison in the following sections. We direct readers to [44] for further details.
Comparison among the productivity of SETL
PROG
, SETL
CONSTRUCT
, and SETL
AUTO
for the SDW
Comparison among the productivity of SETL PROG , SETL CONSTRUCT , and SETL AUTO for the SDW
(Continued)
Comparison between the ETL processes of SETL PROG and PDI for SDW (Reproduced from [44])
SETL PROG requires Python knowledge to maintain and implement an ETL process. On the other hand, using SETL CONSTRUCT and SETL AUTO , a developer can create all the phases of an ETL process by interacting with a GUI. Therefore, they provide a higher level of abstraction to the developer that hides low-level details and requires no programming background. Table 2 summarizes the effort required by the developer to create different ETL tasks using SETL PROG , SETL CONSTRUCT , and SETL AUTO . We measure the developer effort in terms of Number of Typed Characters (NOTC) for SETL PROG and in Number of Used Concepts (NOUC) for SETL CONSTRUCT and SETL AUTO . Here, we define a concept as a GUI component of SETL CONSTRUCT that opens a new window to perform a specific action. A concept is composed of several clicks, selections, and typed characters. For each ETL task, Table 2 lists: 1) its sub construct, required procedures/data structures, number of the task used in the ETL process (NUEP), and NOTC for SETL PROG ; 2) the required task/operation, NOUC, and number of clicks, selections, and NOTC (for each concept) for SETL CONSTRUCT and SETL AUTO . Note that NOTC depends on the user input. Here, we generalize it by using same user input for all systems. Therefore, the total NOTC is not the summation of the values of the respective column.
To create a target TBox using SETL PROG , an ETL developer needs to use the following steps: 1) defining the TBox constructs by instantiating the Concept, Property, and BlankNode classes that play a meta modeling role for the constructs and 2) calling conceptPropertyBinding() and createTriples(). Both procedures take the list of all concepts, properties, and blank nodes as parameters. The former one internally connects the constructs to each other, and the latter creates triples for the TBox. To create the TBox of our SDW using SETL PROG , we used 24,509 NOTC. On the other hand, SETL CONSTRUCT and SETL AUTO use the TargetTBoxDefinition interface to create/edit a target TBox. To create the target TBox of our SDW in SETL CONSTRUCT and SETL AUTO , we use 101 concepts that require only 382 clicks, 342 selections, and 1905 NOTC. Therefore, for creating a target TBox, SETL CONSTRUCT and SETL AUTO use 92% fewer NOTC than SETL PROG .
To create source-to-target mappings, SETL PROG uses Python dictionaries where the keys of each dictionary represent source constructs and values are the mapped target TBox constructs, and for creating the ETL process it uses 23 dictionaries, where the biggest dictionary used in the ETL process takes 253 NOTC. In total, SETL PROG uses 2052 NOTC for creating mappings for the whole ETL process. On the contrary, SETL CONSTRUCT and SETL AUTO use a GUI. In total, SETL CONSTRUCT uses 87 concepts composed of 330 clicks, 358 selections, and 30 NOTC. Since SETL AUTO internally creates the intermediate mappings, there is no need to create separate mapping datasets and concept-mappings for intermediate results. Thus, SETL AUTO uses only 65 concepts requiring 253 clicks, 274 selections, and 473 NOTC. Therefore, SETL CONSTRUCT reduces NOTC of SETL PROG by 98%. Although SETL AUTO uses 22 less concepts than SETL CONSTRUCT , SETL CONSTRUCT reduces NOTC of SETL AUTO by 93%. This is because, in SETL AUTO , we write the data extraction queries in the concept-mappings where in SETL CONSTRUCT we set the data extraction queries in the ETL operation level.
To extract data from either an RDF local file or an SPARQL endpoint, SETL PROG uses the query() procedure and the ExtractTriplesFromEndpoint() class. On the other hand, SETL CONSTRUCT uses the GraphExtractor operation. It uses 1 concept composed of 5 clicks and 20 NOTC for the local file and 1 concept with 5 clicks and 30 NOTC for the endpoint. SETL PROG uses different functions/procedures from the Petl Python library (integrated with SETL PROG ) based on the cleansing requirements. In SETL CONSTRUCT , all data cleansing related tasks on data sources are done using TransformationOnLiteral (single source) and JoinTransformation (for multi-source). TransformationOnLiteral requires 12 clicks and 1 selection, and JoinTransformation takes 15 clicks and 1 selection.
To create a level member and observation, SETL PROG uses createDataTripleToFile() and takes 125 NOTC. The procedure takes all the classes, properties, and blank nodes of a target TBox as input; therefore, the given TBox should be parsed for being used in the procedure. On the other hand, SETL CONSTRUCT uses the LevelMemberGenerator and ObservationGenerator operations, and each operation requires 1 concept, which takes 6 clicks and 6 selections. SETL PROG provides procedures for either bulk or trickle loading to a file or an endpoint. Both procedures take 113 and 153 NOTC, respectively. For loading RDF triples either to a file or a triple store, SETL CONSTRUCT uses the Loader operation, which needs 2 clicks and 2 selections. Therefore, SETL CONSTRUCT reduces NOTC for all transformation and loading tasks by 100%.
SETL AUTO requires only a target TBox and a mapping file to generate ETL data flows through the CreateETL interface, which takes only 1 concept composed of 21 clicks and 16 selections. Therefore, other ETL Layer tasks are automatically accomplished internally. In summary, SETL CONSTRUCT uses 92% fewer NOTC than SETL PROG , and SETL AUTO further reduces NOUC by another 25%.
Comparison between SETL PROG and PDI
PDI is a non-semantic data integration tool that contains a rich set of data integration functionality to create an ETL solution. It does not support any functionality for semantic integration. In [44], we used PDI in combination with other tools and manual tasks to create a version of an SDW. The comparison between the ETL processes of SETL
PROG
and PDI to create this SDW are shown in Table 3. Here, we outline the developer efforts in terms of used tools, languages, and Lines of Codes (LOC). As mentioned earlier, SETL
PROG
provides built-in classes to annotate MD constructs with a TBox. PDI does not support defining a TBox. To create the SDW using PDI, we use the TBox created by SETL
PROG
. SETL
PROG
provides built-in classes to parse a given TBox and users can use different methods to parse the TBox based on their requirements. In PDI, we implement a Java class to parse the TBox created by SETL
PROG
which takes an RDF file containing the definition of a TBox as input and outputs the list of concepts and properties contained in the TBox. PDI is a non-semantic data integration tool, thus, it does not support processing semantic data. We manually extract data from SPARQL endpoints and materialize them in a relational database for further processing. PDI provides activities (drag & drop functionality) to pre-process database and CSV files. On the other hand, SETL
PROG
provides methods to extract semantic data either from a SPARQL endpoint or an RDF dump file batch-wise. In SETL
PROG
, users can create an IRI by simply passing arguments to the createIRI () method. PDI does not include any functionality to create IRIs for resources. We define a Java class of 23 lines to enable the creation of IRIs for resources. SETL
PROG
provides the createTriple() method to generate triples from the source data based on the MD semantics of the target TBox; users can just call it by passing required arguments. In PDI, we develop a Java class of 60 lines to create the triples for the sources. PDI does not support to load data directly to a triple store which can easily be done by SETL
PROG
. Finally, we could not run the ETL process of PDI automatically (i.e., in a single pass) to create the version of a SDW. We instead made it with a significant number of user interactions. In total, SETL
PROG
takes 401 Lines of Code (LOC) to run the ETL, where PDI takes
Therefore, we can conclude that SETL PROG uses 14.9% less LOC than PDI, SETL CONSTRUCT uses 92% fewer NOTC than SETL PROG , and SETL AUTO further reduces NOUC by another 25%. Combining these results, we see that SETL CONSTRUCT and SETL AUTO have significantly better productivity than PDI for building an SDW.
Development time
We compare the time used by SETL PROG , SETL CONSTRUCT , and SETL AUTO to build the ETL processes for our use case SDW. As the three tools were developed within the same project scope and we master them, the first author conducted this test. We chose the running use case used in this paper and created a solution in each of the three tools and measured the development time. We used each tool twice to simulate the improvement we may obtain when we are knowledgeable about a given project. The first time it takes more time to analyze, think, and create the ETL process, and in the latter, we reduce the interaction time spent on analysis, typing, and clicking. Table 4 shows the development time (in minutes) for main integration tasks used by the different systems to create the ETL processes.
ETL development time (in minutes) required for SETL
PROG
, SETL
CONSTRUCT
, and SETL
AUTO
ETL development time (in minutes) required for SETL PROG , SETL CONSTRUCT , and SETL AUTO

The segment of the main method to populate the

The ETL flows to populate the
SETL PROG took twice as long as SETL CONSTRUCT and SETL AUTO to develop a target TBox. In SETL PROG , to create a target TBox construct, for example a level l, we need to instantiate the Concept() concept for l and then add its different property values by calling different methods of Concept(). SETL CONSTRUCT and SETL AUTO create l by typing the input or selecting from the suggested items. Thus, SETL CONSTRUCT and SETL AUTO also reduce the risk of making mistakes in an error-prone task, such as creating an ETL. In SETL PROG , we typed all the source and target properties to create a mapping dictionary for each source and target concept pair. However, to create mappings, SETL CONSTRUCT and SETL AUTO select different constructs from a source as well as a target TBox and only need to type when there are expressions and/or filter conditions of queries. Moreover, SETL AUTO took less time than SETL CONSTRUCT in mapping generation because we did not need to create mappings for intermediate results.
To create ETL data flows using SETL
PROG
, we had to write scripts for cleansing, extracting, transforming, and loading. SETL
CONSTRUCT
creates ETL flows using drag-and-drop options. Note that the mappings in SETL
CONSTRUCT
and SETL
AUTO
use expressions to deal with cleansing and transforming related tasks; however, in SETL
PROG
we cleansed and transformed the data in the ETL design phase. Hence, SETL
PROG
took more time in designing ETL compared to SETL
CONSTRUCT
. On the other hand, SETL
AUTO
creates the ETL data flows automatically from a given mapping file and the target TBox. Therefore, SETL
AUTO
took only two minutes to create the flows. In short, SETL
PROG
is a programmatic environment, while SETL
CONSTRUCT
and SETL
AUTO
are drag and drop tools. We exemplify this fact by means of Figs 9 and 10, which showcase the creation of the ETL data flows for the
ETL execution time (in minutes) required for each sub-phase of the ETL processes created using SETL PROG and SETL CONSTRUCT
Since the ETL processes of SETL CONSTRUCT and SETL AUTO are the same and only differ in the developer effort needed to create them, this section only compares the performance of SETL PROG and SETL CONSTRUCT . We do so by analyzing the time required to create the use case SDW by executing the respective ETL processes. To evaluate the performance of similar types of operations, we divide an ETL process into three sub-phases: extraction and traditional transformation, semantic transformation, as well as loading and discuss the time to complete each.
Table 5 shows the processing time (in minutes), input and output size of each sub-phase of the ETL processes created by SETL PROG and SETL CONSTRUCT . The input and output formats of each sub-phase are shown in parentheses. The extraction and traditional transformation sub-phases in both systems took more time than the other sub-phases. This is because they include time for 1) extracting data from large RDF files, 2) cleansing and filtering the noisy data from the DBD and Subsidy datasets, and 3) joining the DBD and Subsidy datasets. SETL CONSTRUCT took more time than SETL PROG because its TransformationOnLiteral and JoinTransformation operations use SPARQL queries to process the input file whereas SETL PROG uses the methods from the Petl Python library to cleanse the data extracted from the sources.
SETL CONSTRUCT took more time during the semantic transformation than SETL PROG because SETL CONSTRUCT introduces two improvements over SETL PROG : 1) To guarantee the uniqueness of an IRI, before creating an IRI for a target TBox construct (e.g., a level member, an instance, an observation, or the value of an object or roll-up property), the operations of SETL CONSTRUCT search the IRI provenance graph to check the availability of an existing IRI for that TBox construct. 2) As input, the operations of SETL CONSTRUCT take RDF (N-triples) files that are larger in size than the CSV files (see Table 5), used by SETL PROG as input format. To ensure our claims, we run an experiment for measuring the performance of the semantic transformation procedures of SETL PROG and the operations of SETL CONSTRUCT by excluding the additional two features introduced in SETL CONSTRUCT operations (i.e., a SETL CONSTRUCT operation does not lookup the IRI provenance graph before creating IRIs and takes a CSV input). Figure 11 shows the processing time taken by SETL CONSTRUCT operations and SETL PROG procedures to create level members and observations with increasing input size. In the figure, LG and OG represent level member generator and observation generator operations (in case of SETL CONSTRUCT ) or procedures (in case of SETL PROG ).

Comparison of SETL CONSTRUCT and SETL PROG for semantic transformation. Here, LG and OG stand for LevelMemberGenerator and ObservationGenerator.

Scalability of LevelMemberGenerator and ObservationGenerator.
In summary, to process an input CSV file with 500 MB in size, SETL CONSTRUCT takes 37.4% less time than SETL PROG to create observations and 23.4% less time than SETL PROG to create level members. The figure also shows that the processing time difference between the corresponding SETL CONSTRUCT operation and the SETL PROG procedure increases with the size of the input. In order to guarantee scalability when using the Jena library, a SETL CONSTRUCT operation takes the (large) RDF input in the N-triple format, divides the file into several smaller chunks, and processes each chunk separately. Figure 12 shows the processing time taken by LevelMemberGenerator and ObservationGenerator operations with the increasing number of triples. We show the scalability of the LevelMemberGenerator and ObservationGenerator because they create data with MD semantics. The figure shows that the processing time of both operations increase linearly with the increase in the number of triples, which ensures that both operations are scalable. SETL CONSTRUCT takes less time in loading than SETL PROG because SETL PROG uses the Jena TDB loader command to load the data while SETL CONSTRUCT programmatically load the data using the Jena API’s method.
In summary, SETL PROG and SETL CONSTRUCT have similar performance (29% difference in total processing time). SETL CONSTRUCT ensures the uniqueness of IRI creation and uses RDF as a canonical model, which makes it more general and powerful than SETL PROG .
Besides the differences of the performances already explained in Table 5, SETL CONSTRUCT also includes an operation to update the members of the SDW levels, which is not included in SETL PROG . Since the ETL process for our use case SDW did not include that operation, we scrutinize the performance of this specific operation of SETL CONSTRUCT in the following.

Performance of UpdateLevel based on processing time and target ABox size with the changing of the
Performance analysis of UpdateLevel operation Figure 13 shows the performance of the UpdateLevel operation. To evaluate this operation, we consider two levels:
Figure 13c shows the processing time when increasing the number of updated cities. Since
In [44], we compare the performance of SETL
PROG
with a non-semantic data integration tool, PDI. We populate the SDW for
In overall, we conclude that SETL PROG is better in performance compared to PDI and SETL CONSTRUCT . However, SETL CONSTRUCT is a more correct and scalable framework. On top of that, SETL CONSTRUCT supports SDWs update.
Qualitative evaluation
To evaluate our proposed high-level ETL qualitatively, we selected and interviewed two experts with expertise across traditional ETL and semantic integration. We find that two experts are sufficient, because both of them have 10+ years research experiences in ETL and are the architects of popular open-source ETL tools (pygramETL [60] and QUARRY [33]). In this section, we present a high-level summary of the expert interviews. We follow the procedure described in [10,42] to structure the questionnaire and the answers of the experts. The two experts participated in the interviews via email. Interviews included nine semi-structured questions17
Summary of the expert interviews
On the significance of the proposed paradigm category, both experts provide positive responses to all questions. Their only suggestion is to allow technical users to extend the metadata artifacts because ETL tool palettes can typically be much more complex. We use RDF as the canonical model in our paradigm, and it allows users to extend metadata artifacts based on their business requirements. Both experts think that our proposed ETL operations are sound and complete considering that the ETL is specific to semantic data and multidimensional analysis. However, one of the suggestions is to make the set of ETL operations open and to introduce user-defined functions (UDFs) so that users can write their own functions if there is a requirement for additional transformations (e.g., row normalizer/denormalizer, language conversion). In our future work, we will introduce UDFs in the set of ETL operations. Both experts do not support complete automation of the mapping file considering the complex nature of ETL scenarios. However, they think that semi-automation is feasible. Expert 1 thinks that we can apply this two-layered RDF-based data integration approach in the context of Big data and Data Lakes. However, Expert 2 thinks that there is a need to support more complex data transformations.
In summary, both experts acknowledge the contribution of our work and think that the set of ETL operations is complete and sound to annotate a knowledge base with multidimensional semantics. However, to fit this approach in the context of Big Data and data lakes, there is a need to include UDFs to support more complex transformations.
Nowadays, combining SW and BI technologies is an emerging research topic as it opens interesting research opportunities. As a DW deals with both internal and (increasingly) external data presented in heterogeneous formats, especially in the RDF format, semantic issues should be considered in the integration process [1]. Furthermore, the popularity of SW data gives rise to new requirements for BI tools to enable OLAP-style analysis over this type of data [32]. Therefore, the existing research related to semantic ETL is divided into two lines: 1) on the one hand, the use of SW technologies to physically integrate heterogeneous sources and 2) on the other hand, enabling OLAP analysis over SW data.
One prominent example following the first research line is [58], which presents an ontology-based approach to enable the construction of ETL flows. At first, the schema of both the sources and the DW are defined by a common graph-based model, named the datastore graph. Then, the semantics of the datastore graphs of the data sources and the DW are described by generating an (OWL-based) application ontology, and the mappings between the sources and the target are established through that ontology. In this way this approach addresses heterogeneity issues among the source and target schemata and finally demonstrates how the use of an ontology enables a high degree of automation from the source to the target attributes, along with the appropriate ETL transformations. Nonetheless, computationally complex ETL operations like slowly changing dimensions and the annotation of the application ontology with MD semantics are not addressed in this work. Therefore, OLAP queries cannot be applied on the generated DW.
Another piece of existing work [6] aligned to this line of research proposes a methodology describing some important steps required to make an SDW, which enables to integrate data from semantic databases. This approach also misses the annotation of the SDW with MD semantics. [59] has proposed an approach to support data integration tasks in two steps: 1) constructing ontologies from XML and relational sources and 2) integrating the derived ontologies by means of existing ontology alignment and merging techniques. However, ontology alignment techniques are complex and error-prone. [5] presents a semantic ETL framework at the conceptual level. This approach utilizes the SW technologies to facilitate the integration process and discusses the use of different available tools to perform different steps of the ETL process. [3] presents a method to spatially integrate a Danish Agricultural dataset and a Danish Business dataset using an ontology. The approach uses SQL views and other manual processes to cleanse the data and Virtuoso for creating and storing integrated RDF data. [38] presents UnifiedViews, an open-source ETL framework that supports management of RDF data. Based on the SPARQL queries, they define four types of Data Processing Units (DPUs): Extractor, Transformer, Loader, and Quality Assessor. However, the DPUs do not support to generate MD RDF data.
In the second line of research, a prominent paper is [47], which outlines a semi-automatic method for incorporating SW data into a traditional MD data management system for OLAP analysis. The proposed method allows an analyst to accomplish the following tasks: 1) designing the MD schema from the TBox of an RDF dataset, 2) extracting the facts from the ABox of the dataset and populating the MD fact table, and 3) producing the dimension hierarchies from instances of the fact table and the TBox to enable MDX queries over the generated DW. However, the generated DW no longer preserves the SW data principles defined in [27]; thus, OLAP analysis directly over SW data is yet to be addressed. To address this issue, [13] introduces the notion of a lens, called the analytical schema, over an RDF dataset. An analytical schema is a graph of classes and properties, where each node of the schema presents a set of facts that can be analyzed by traversing the reachable nodes. [28] presents a self-service OLAP endpoint for an RDF dataset. This approach first superimposes an MD schema over the RDF dataset. Then, a semantic analysis graph is generated on top of that MD schema, where each node of the graph represents an analysis situation corresponding to an MD query, and an edge indicates a set of OLAP operations.
Both [13] and [28] require either a lens or a semantic analysis graph to define MD views over an RDF dataset. Since most published SW data contains facts and figures, W3C recommends the Data Cube (QB) [15] vocabulary to standardize the publication of SW data with MD semantics. Although QB is appropriate to publish statistical data and several publishers (e.g., [51]) have already used the vocabulary for publishing statistical datasets, it has limitations to define MD semantics properly. The QB4OLAP [20] vocabulary enriches QB to support MD semantics by providing constructs to define 1) a cube structure in terms of different level of dimensions, measures, and attaching aggregate functions with measures and 2) a dimension structure in terms of levels, level attributes, relationships, the cardinality of relationships among the levels, and hierarchies of the dimension. Therefore, MD data can be published either by enriching data already published using QB with dimension levels, level members, dimension hierarchies, and the association of aggregate functions to measures without affecting the existing observations [62] or using QB4OLAP from scratch [22,29,44].
In [36], the authors present an approach to enable OLAP operations on a single data cube published using the QB vocabulary and shown the applicability of their OLAP-to-SPARQL mapping in answering business questions in the financial domain. However, their OLAP-to-SPARQL mapping may not result in the most efficient SPARQL query and requires additional efforts and a longer time to get the results as they consider that the cube is queried on demand and the DW is not materialized. While some approaches have proposed techniques to optimize execution of OLAP-style SPARQL queries in a federated setting [30], others have considered view materialization [21,31]. The authors in [62] present a semi-automatic method to enrich the QB dataset with QB4OLAP terms. However, there is no guideline for an ETL process to populate a DW annotated with QB4OLAP terms.
After analyzing the two research lines, we can draw some conclusions. Although each approach described above addresses one or more aspects of a semantic ETL framework, there is no single platform that supports them all (target definition, source-to-target mappings generation, ETL generations, MD target population, and evolution). To solve this problem, we have proposed a Python-based programmable semantic ETL (SETL PROG ) framework [44] that provides a number of powerful modules, classes, and methods for performing the tasks mentioned above. It facilitates developers by providing a higher abstraction level that lowers the entry barriers. We have experimentally shown that SETL PROG performs better in terms of programmer productivity, knowledge base quality, and performance, compared to other existing solutions. However, to use it, developers need a programming background. Although SETL PROG enables to create an ETL flow and provides methods by combining several tasks, there is a lack of a well-defined set of basic semantic ETL constructs that allow users more control in creating their ETL process. Moreover, how to update an SDW to synchronize it with the changes taking place in the sources is not discussed. Further, in a data lake/big data environment, the data may come from heterogeneous formats, and the use of the relational model as the canonical model may generate an overhead. Transforming JSON or XML data to relational data to finally generate RDF can be avoided by using RDF as the canonical model instead. To this end, several works have discussed the appropriateness of knowledge graphs for data integration purposes and specifically, as a canonical data model [14,54,55]. An additional benefit of using RDF as a canonical model is that it allows adding semantics without being compliant to a fixed schema. The present paper presents the improvements introduced on top of SETL PROG to remedy its main drawbacks discussed above. As there are available RDF Wrappers (e.g., [18,57]) to convert another format to RDF, in this paper, we focus on only semantic data and propose an RDF-based two-layered (Definition Layer and Execution Layer) integration process. We also propose a set of high-level ETL constructs (tasks/operations) for each layer, with the aim of overcoming the drawbacks identified above for SETL PROG . We also provide an operation to update an SDW based on the changes in source data. On top of that, we characterize the ETL flow in the Definition Layer by means of creating an RDF based source-to-target mapping file, which allows to automate the ETL execution flows.
Conclusion and future work
In this paper, we proposed a framework of a set of high-level ETL constructs to integrate semantic data sources. The overall integration process uses the RDF model as canonical model and is divided into the Definition and Execution Layer. In the Definition Layer, ETL developers create the metadata (target and source TBoxes, and source-to-target mappings). We propose a set of high-level ETL operations for semantic data that can be used to create ETL data flows in the Execution Layer. As we characterize the transformations in the Definition Layer in terms of source-to-target mappings at the schema level, we are able to propose an automatic algorithm to generate ETL data flows in the Execution Layer. We developed an end-to-end prototype SETL CONSTRUCT based on the high-level constructs proposed in this paper. We also extended it to enable automatic ETL execution flows generation (and named it SETL AUTO ). The experiment shows that 1) SETL CONSTRUCT uses 92% fewer NOTC than SETL PROG , and SETL AUTO further reduces NOUC by another 25%; 2) usingSETL CONSTRUCT , the development time is almost cut in half compared to SETL PROG , and is cut by another 27% using SETL AUTO ; 3) SETL CONSTRUCT is scalable and has similar performance compared to SETL PROG .
SETL CONSTRUCT allows users to create source-to-target mappings manually. However, an extension of this work is to create the mappings (semi-)automatically. Although this kind of ETL mappings is different from traditional ontology schema matching techniques, we can explore those techniques to establish the relationships among the concepts across the source and target ontologies. Taking the output of this schema matching approach as a suggestion, the expert can enrich the mappings according to the business requirements [41]. As the correctness of an ontology schema matching technique depends on the semantic-richness of the given ontologies, besides the extensional knowledge given in the ABox, we need to take other external standard ontologies into account in the process of the TBoxExtraction operation.
The source-to-target mappings act as a mediator to generate data according to the semantics encoded in the target. Therefore, we plan to extend our framework from purely physical to also virtual data integration where instead of materializing all source data in the DW, ETL processes will run on demand. When considering virtual data integration, it is important to develop query optimization techniques for OLAP queries on virtual semantic DWs, similar to the ones developed for virtual integration of data cubes and XML data [48,49,63]. Another interesting work will be to apply this layer-based integration process in a Big Data and Data Lake environment. The metadata produced in the Definition Layer can be used as basis to develop advanced catalog features for Data Lakes. We will also introduce user-defined functions (UDFs) and other relevant ETL operations to make the approach fit in the Big Data context, as suggested by the two interviewed experts. Furthermore, we plan to investigate how to integrate provenance information into the process [2]. Another aspect of future work is the support of spatial data [23,24].
Footnotes
Acknowledgements
This research is partially funded by the European Commission through the Erasmus Mundus Joint Doctorate Information Technologies for Business Intelligence (EM IT4BI-DC), the Poul Due Jensen Foundation, and the Danish Council for Independent Research (DFF) under grant agreement no. DFF-8048-00051B.
