Abstract
Cultural heritage (CH) contents are typically strongly interlinked, but published in heterogeneous, distributed local data silos, making it difficult to utilize the data on a global level. Furthermore, the content is usually available only for humans to read, and not as data for Digital Humanities (DH) analyses and application development. This application report addresses these problems by presenting a collaborative publication model for CH Linked Data and six design principles for creating shared data services and semantic portals for DH research and applications. This Sampo model has evolved gradually in 2002–2021 through lessons learned when developing the Sampo series of linked data services and semantic portals in use, including MuseumFinland (2004), CultureSampo (2009), BookSampo (2011), WarSampo (2015), Norssit Alumni (2017), U.S. Congress Prosopographer (2018), NameSampo (2019), BiographySampo (2019), WarVictimSampo 1914–1922 (2019), MMM (2020), AcademySampo (2021), FindSampo (2021), WarMemoirSampo (2021), and LetterSampo (2022). These Semantic Web applications surveyed in this paper cover a wide range of application domains in CH and have attracted up to millions of users on the Semantic Web, suggesting feasibility of the proposed Sampo model. This work shows a shift of focus in research on CH semantic portals from data aggregation and exploration systems (1. generation systems) to systems supporting DH research (2. generation systems) with data analytic tools, and finally to automatic knowledge discovery and Artificial Intelligence (3. generation systems).
Breaking data silos of cultural heritage
Cultural Heritage content is published independently by different memory organizations, such as museums, libraries, archives, galleries, and media companies. The traditional web publishing model, where everybody can easily publish content for everybody to read, facilitates fast and flexible publication on the Web. However, using related local contents from separate data sources on a global level is difficult because of the incompatible data silos: the local databases and online systems of the publishers are related in content, but heterogeneous in their data models, annotated using different thesauri and vocabularies, distributed geographically, based on different natural languages, and used with different kinds of user interfaces. An even more fundamental problem is that the contents are typically published only for humans to read and not as data for computational analyses and application development. This means that end users typically have to learn and use several different applications to cater to their information needs about a topic. For the data publishers, a lot of costly redundant work is needed in creating the data silos, e.g., in developing the vocabularies, data services, and user interfaces. For application developers, the availability of the data in a usable open form is a prerequisite for their work.
To mitigate these problems, various massive international data aggregation systems have been created, such as Europeana1
This paper concerns using Semantic Web (SW) technologies [14] and Linked Open Data (LOD) publishing [11,18] to address the data silo and data publishing problems above. A general model, called the Sampo Model, is presented for the purpose. As empirical evidence of the feasibility of applying the model in practice, the Sampo series of data services and semantic portals is presented.11
This paper is organized as follows. Section 2 presents the principles of the Sampo model. In Section 3, a survey of Sampo systems is presented as a proof-of-concept, illustrating use cases of the model and how it has evolved in 2002–2021. In conclusion, related works are discussed, contributions of the paper are summarized, and challenges and directions for further research are outlined. This paper extends substantially the earlier short paper [19] about the Sampo model at the DHN 2020 conference.
Sampo model principles P1–P6
The Sampo Model is an informal collection of principles for LOD publishing and designing semantic portals listed in Table 1, supported by an ontology and data infrastructure and software tools for user interface design and data publication. Principles P1–P3 can be seen as a foundation for developing data services; principles P4–P6 are related to creating semantic portals.12
The numbering of the principles P3 and P6 is switched in the table with respect to [24] to clarify this.
P1. Support collaborative data creation and publishing. The model is based on the idea of collaborative content creation. The data is aggregated from local data silos into a global service, based on a shared ontology and publishing infrastructure [18]. The local data are harmonized and enriched with each other by linking and reasoning, based on Semantic Web standards. In this model everybody can win, including the data publishers by enriched data and shared publishing infrastructure, and the end users by richer global content and services. However, collaborative publishing also complicates the publication process, as more agreements are needed within the community.
This model addresses the problems of semantic data interoperability and distributed content creation at the same time. A shared semantic ontology infrastructure that includes shared metadata schemas [87] and domain ontologies for populating the data models is used for harmonizing and interlinking data from separate silos. If the content providers provide the system with metadata about their contents using the shared infrastructure, the data is automatically linked and enriched with each other and forms a knowledge graph [7]. For example, if metadata about a painting created by Picasso comes from an art museum, it can be enriched (linked) with, e.g., biographies from Wikipedia and other sources, photos taken of Picasso, information about his wives, books in a library describing his works of art, related exhibitions open in museums, and so on. At the same time, the contents of any organization in the portal having Picasso-related material get enriched by the metadata of the new artwork entered in the system.
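The linking effect described above can be illustrated with a minimal sketch. It assumes two hypothetical silos (a museum and a library) with different local field names and name forms. All URIs, field names, and records are invented for illustration and are not taken from any Sampo system. Once both silos map their local labels to the same shared identifier, the records become connected without any pairwise agreement between the publishers.

```python
# Hypothetical shared ontology URI (an assumption for illustration only).
PICASSO = "http://example.org/actor/picasso"

# Each silo maps its own local name forms to URIs of the shared ontology.
MUSEUM_VOCAB = {"Picasso, Pablo": PICASSO}
LIBRARY_VOCAB = {"Pablo Picasso": PICASSO}

# Local records with silo-specific field names (invented sample data).
museum_records = [{"id": "m1", "title": "Painting X", "artist": "Picasso, Pablo"}]
library_records = [{"id": "b7", "name": "A book on Picasso", "subject": "Pablo Picasso"}]


def to_triples(records, base, label_key, link_key, predicate, vocab):
    """Transform local records into (subject, predicate, object) triples."""
    triples = []
    for rec in records:
        subject = base + rec["id"]
        triples.append((subject, "rdfs:label", rec[label_key]))
        # Replace the local literal with the shared URI when a mapping exists.
        target = vocab.get(rec[link_key])
        if target is not None:
            triples.append((subject, predicate, target))
    return triples


# Aggregate both silos into one graph, modelled here as a set of triples.
graph = set()
graph.update(to_triples(museum_records, "http://example.org/museum/",
                        "title", "artist", "dct:creator", MUSEUM_VOCAB))
graph.update(to_triples(library_records, "http://example.org/library/",
                        "name", "subject", "dct:subject", LIBRARY_VOCAB))


def related_resources(triples, entity):
    """Return all resources, from any silo, that point to the shared entity."""
    return sorted(s for (s, p, o) in triples if o == entity)


print(related_resources(graph, PICASSO))
```

In this sketch the museum's painting and the library's book are both found through the shared identifier, although neither silo refers to the other. A real system would use an RDF store and richer data models, so this only mirrors the principle.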

Publishing and using heterogeneous distributed data in the MMM Sampo system.
Figure 1 depicts as an example how the collaborative Sampo publication model (P1) was used in the Mapping Manuscript Migrations (MMM) system [24,43]. MMM includes three key datasets about ca.
P2. Use a shared open ontology infrastructure. The Sampo model is based on a shared LOD ontology infrastructure with which the local datasets are made compatible. Re-using the same infrastructure, and developing it further step by step in each Sampo portal and application, saves a lot of effort for the developers of next Sampos and other applications. For example, the linked data-based geogazetteer of contemporary placenames in Finland, using data from the National Survey and introduced in NameSampo [40] for open use, contains some
The infrastructure includes harmonising shared metadata models (schemas) for representing individuals as well as domain ontologies (thesauri, vocabularies) that are used in populating (instantiating) the metadata models. This can be done by using data transformations and by aligning ontologies, as described in detail in [43,46] for the WarSampo and MMM systems, respectively. The Sampo portals use in practice both Dublin Core-based models and the dumb-down principle14
Many Sampo systems make use of the national FinnONTO ontology infrastructure [38]. Its development started in 2003 and the work is carried on today by the National Library of Finland as the Finto.fi ontology service,18
P3. Make a clear distinction between the LOD service and the user interface (UI). The Sampo Model argues for the idea of separating the underlying Linked Data service completely from the user interface via a SPARQL API. The rationale for this is twofold. First, this simplifies the portal architecture. Second, the data service can be opened for data analysis research in Digital Humanities. For example, the YASGUI23
On top of a SPARQL API it is possible to define more dedicated and simpler APIs depending on the application and user needs, such as grlc [61]. SPARQL querying can be computationally inefficient and has led some developers to use other tools for developing search engines, such as Elasticsearch26
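The separation of the data service from the UI (P3) can be sketched from the client side as follows. This is a minimal illustration, not the implementation used in the Sampo portals. It assumes only the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter) and the standard SPARQL JSON results format. The endpoint URL, the query, and the canned response are hypothetical, and the sketch does not access the network.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint URL; no real Sampo endpoint address is assumed here.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?person ?label WHERE {
  ?person <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(document):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


# Canned response in the standard SPARQL JSON results format, so that the
# sketch runs without a live endpoint (invented sample data).
SAMPLE = json.loads("""
{
  "head": {"vars": ["person", "label"]},
  "results": {"bindings": [
    {"person": {"type": "uri", "value": "http://example.org/p/1"},
     "label": {"type": "literal", "value": "Example Person"}}
  ]}
}
""")

request = build_request(ENDPOINT, QUERY)
# A UI or an analysis script would now call urllib.request.urlopen(request)
# and pass the decoded JSON body to rows_from_results().
print(rows_from_results(SAMPLE))
```

Because the UI only depends on the query protocol and the result format, the same endpoint can serve a portal, a notebook, or a third-party application without changes to the data service.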
P4. Provide multiple perspectives to the same data. The Sampo model fosters the idea that on top of a LOD service different thematic application perspectives to the data can be created by re-using the data service. This means that the underlying data can be re-used without modifying it, which is typically costly [28] when dealing with Big Data.
The application perspectives are provided on the landing page of the Sampo portal, and they enrich each other by data linking. By selecting a perspective the corresponding application is opened. In addition, completely separate applications can be created on top of the data service by third parties, which is of help to memory organizations that typically are not strong in IT application development but are often willing to share the content openly through multiple channels.

Landing page of WarSampo with nine application perspectives.
For example, Fig. 2 depicts the landing page of WarSampo [22] with the following nine interlinked application perspectives for accessing the underlying LOD service data:
Major events (1050) of the Second World War (WW2) in Finland visualized on a timeline and maps with related linked data
People (
Army Units (
Places perspective for searching the war zone events using contemporary and historical maps
Kansa taisteli magazine articles (3360) (1957–1986) containing memoirs of the soldiers after the war
Casualties data (
Authentic photographs (
War Cemeteries (630) of the casualties of WW2 with 3000 photographs
Finnish Prisoners of War (4500) in the Soviet Union in 1939–1945
P5. Standardize portal usage by a simple filter-analyze two-step cycle. In later Sampos, the application perspectives are used through a two-step research cycle. First, the focus of interest, the target group, is filtered out using faceted semantic search [32,81,83]. Second, the target group is visualized or analyzed using the ready-to-use DH tools of the application perspectives. The general idea here is to try to “standardize” the UI logic so that the portals are easier to use for the end users [39].

Comparing the life charts of two target groups in BiographySampo, admirals and generals (left) and clergy (right) of the historical Grand Duchy of Finland (1809–1917).
For example, Fig. 3 depicts a situation in BiographySampo where the user compares the life charts of two prosopographical groups in 1809–1917 when Finland was an autonomous Grand Duchy within the Russian Empire: 1) Finnish generals and admirals in the Russian armed forces (on the left). 2) Members of the Finnish clergy (1800–1920) (on the right). With a few selections from the facets the user can filter out the two target groups and see that, for some reason, quite a few officers moved to Southern Europe when they retired, like retirees today, while the Lutheran ministers usually stayed in Finland until their death.
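The filter-analyze cycle can be sketched in a few lines. The example below is a simplified illustration, assuming an in-memory list of hypothetical person records. The names, occupations, and places are invented and are not BiographySampo data, and a real portal would run the filtering as faceted search against the LOD service.

```python
from collections import Counter

# Hypothetical person records (invented sample data).
PEOPLE = [
    {"name": "A", "occupation": "general", "death_place": "Nice"},
    {"name": "B", "occupation": "admiral", "death_place": "Helsinki"},
    {"name": "C", "occupation": "minister", "death_place": "Turku"},
    {"name": "D", "occupation": "minister", "death_place": "Turku"},
    {"name": "E", "occupation": "general", "death_place": "Nice"},
]


def filter_by_facets(records, selections):
    """Step 1: keep records matching the selected values of every facet."""
    return [
        r for r in records
        if all(r.get(facet) in values for facet, values in selections.items())
    ]


def facet_counts(records, facet):
    """Hit counts per facet value, as typically shown next to facet options."""
    return Counter(r[facet] for r in records)


def analyze_death_places(records):
    """Step 2: a simple analysis of the filtered target group."""
    return Counter(r["death_place"] for r in records).most_common()


# Filter out two target groups and run the same analysis on both.
officers = filter_by_facets(PEOPLE, {"occupation": {"general", "admiral"}})
clergy = filter_by_facets(PEOPLE, {"occupation": {"minister"}})

print(analyze_death_places(officers))
print(analyze_death_places(clergy))
```

Keeping the filtering step independent of the analysis step is what allows any analysis or visualization to be applied to any target group, which is the point of standardizing the cycle.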
P6. Support data analysis and knowledge discovery in addition to data exploration. Three generations of semantic portals for Digital Humanities can be identified according to the vision [21] underlying the work on Sampos. Ten years ago the research focus in semantic portal development was on data harmonization, aggregation, search, and browsing (1. generation systems). However, the Sampo model aims not only at data publishing with search and data exploration [59]. The rise of DH research has started to shift the focus to providing the user with integrated tools for solving research problems in interactive ways (2. generation systems). The next step ahead is based on Artificial Intelligence: future portals not only provide tools for the human to solve problems but are used for finding research problems in the first place, for addressing them, and even for solving them automatically under the constraints set by the human researcher and explaining the results (3. generation systems). A step towards this is the relational search application perspective in BiographySampo where the machine tries to find “interesting” semantic connections in linked data and also explain them in natural language [30]. Principles P4–P6 are related to creating 2. and 3. generation systems.
FAIR Linked Data The widely used modern FAIR principles30
The FAIR principles are listed here and their numbering is based on
Findable: F1. (Meta)data are assigned a globally unique and persistent identifier; F2. Data are described with rich metadata (defined by R1 below); F3. Metadata clearly and explicitly include the identifier of the data they describe; F4. (Meta)data are registered or indexed in a searchable resource
Accessible: A1. (Meta)data are retrievable by their identifier using a standardised communications protocol; A2. Metadata are accessible, even when the data are no longer available
Interoperable: I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation; I2. (Meta)data use vocabularies that follow FAIR principles; I3. (Meta)data include qualified references to other (meta)data
Reusable: R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
Obviously, the Sampo model principles are compatible with the FAIR principles above. This follows from the fact that the model was developed using the standards, linked data principles,31
The Sampo model originates from applications in the CH domain but is generic and not bound to its origin. The model has been applied in other domains, too. An example of this is the HealthFinland system [37,78] for health promotion information, which was deployed by the National Institute for Health and Welfare in Finland. HealthFinland received the international Semantic Web Challenge Award at the ISWC 2008 conference.
In [27] the idea of developing the Sampo model into domain-specific Sampo frameworks is presented using three epistolary datasets as case studies. The LetterSampo framework presented includes data models for representing metadata about historical letters as well as some domain-specific configurations for the Sampo-UI tool for user interfaces. In this way the LetterSampo framework could be applied easily to create portals for three different datasets related to the Republic of Letters [15] from the University of Oxford (Early Modern Letters Online33
Linked data publications on the SW are typically evaluated with the W3C “5-star model”,36
The Sampo model has evolved gradually in 2002–2021 via lessons learned in developing the Sampo series of semantic portals and related LOD services in various projects. This section briefly overviews a selection of these systems, listed in Table 2, in order to provide a proof-of-concept of the model and to give some examples and historical background of the work. For each system, the year of publication, application domain, number of end users, size of the underlying triplestore, and primary data owners are listed. Below, each system is described briefly with a reference to its research homepage, to the portal online, and to research articles for more detailed information. These references provide links to related works and additional publications, and to the data services online.
A selection of Sampo portals and LOD services for Digital Humanities; user counts by Google Analytics, October 2021
Original KG size in 2011; the size is much larger today, including also non-fiction works.
This count includes only data of Kotus; the total number of triples of all sources is 241M.
Interest in WarSampo led to a new Sampo in the same application domain of war history:
Both WarSampo and WarVictimSampo have a feedback channel through which users can comment on the data, and indeed hundreds of comments and correction suggestions have been collected for the data owner, the National Archives of Finland, to consider. Based on this activity, a new citizen science project for collecting and maintaining Sampo data is underway.46
A key idea in WarSampo is to reassemble the life stories of the soldiers based on data linking from different data sources. This biographical and prosopographical idea was a source of inspiration for several later biographical Sampos discussed below.
The idea of publishing textual biographies as structured LOD for data exploration and analysis was also developed in the Sampos
The NameSampo project developed, based on the SPARQL Faceter tool [44] used in many earlier Sampos, the first version of the Sampo-UI framework [39] that has been used after this in all Sampos, supporting implementation of principles P3–P6 from a UI point of view. Sampo-UI has also been re-used in Norway by the Norwegian Language Collections for creating a national service similar to NameSampo: Norske stedsnavn.53
The Bibale web service from the Institute for Research and History of Texts (IRHT) in Paris is described in
The
Portal:
In addition, new Sampos are already in prototype phase and planned to be published in 2022:
Design principles, models, and methods for software development are extensively studied and used in the field of Software Engineering [76]. The idea of trying to formulate general design principles for publishing and using linked data has turned out to be useful from a practical point of view. For example, the four Linked Data Principles and the 5-star model coined by Tim Berners-Lee have been quite influential, and ontology design patterns66
Related work. The principles of Table 1 behind the Sampo model have been explored and developed before in different contexts:
The principle of collaborative content creation by data linking (P1) is a fundamental idea behind the Linked Open Data Cloud movement,67
The importance of developing shared open data models, thesauri, and ontologies for interoperability (P2) is a driving force behind the work of virtually all related standardization efforts. In our work, the ambitious goal has been to develop not only individual standards and datasets but an infrastructure in a national level effort [38] in terms of open ontology services [82,86] and LOD services [36].
The principle P3 of separating data related services from UI design is in line with modern software architectures, such as the Model-View-Controller (MVC) structure.70
The principle P4 of providing multiple analyses and visualizations for a set of filtered search results has been used in different contexts and also in other portals, such as the ePistolarium71
Regarding principle P5, faceted search [10,32,81], also known as “view-based search” [67] and “dynamic taxonomies” [73], is a well-known paradigm for explorative search and browsing [59] in computer science and information retrieval, based on S. R. Ranganathan’s original ideas of faceted classification in Library Science in the 1930s. The two-step usage model in the Sampo model is also used as a general research method in prosopographical research [85].
The principle of supporting data analysis and knowledge discovery (P6) based on Big Data is fundamental in, e.g., distant reading [63], Humanities Computing [60], and Digital Humanities [6] in general. However, what is still largely missing in the DH methodology and tools in semantic portals is the next conceptual level of automatic knowledge discovery from data [66]. The Sampo model aims to integrate such tools into a consolidated approach for creating portals and LOD services.
In addition to the Sampo portals, Linked Data and ontologies have been used as a basis for publishing collections in many museums [1,74], libraries [9], and archives [8,84]. Linked data has been used in building knowledge graphs and infrastructures, such as the Europeana linked open data [41] and ARIADNEplus72
Contributions and Challenges. The novelty of the Sampo model lies in the consolidated combination of the principles P1–P6 and in operationalizing them using an infrastructure and tooling for developing applications in Digital Humanities in a cost-efficient way. The approach aims at developing a gradually growing sustainable national LOD infrastructure: the work started with the Semantic Web Kick-off in Finland seminar [16] a few months after the seminal Semantic Web paper [2] was published in Scientific American and the W3C launched its Semantic Web Activity in 2001. The work presented demonstrates a shift of focus in research on CH semantic portals in three generations towards knowledge discovery and Artificial Intelligence [21]. The future work on the Sampo model aims at AI-based DH tools that are able not only to present the data to the human researcher in useful ways but also to find and solve DH research problems with explanations. AI techniques are also useful when creating and enriching the knowledge graph underlying a semantic portal.
The model also has its limitations and challenges. For example, it assumes that the data is created by a separate pipeline. As suggested in [46], the transformation should be automatic and re-doable without a human in the loop, but optimally the RDF should be produced already when cataloging the data, not by correcting and aligning the data afterwards [17]. As Albert Einstein put it: “Intellectuals solve problems but geniuses prevent them”.
Neither does the Sampo model include principles for maintaining the knowledge graphs. This is a great challenge when using LD in general since the effects of a change may propagate all over the interlinked knowledge graph depending on the change. Especially changes in ontologies may have dramatic effects. Such knowledge management issues are discussed in [45] in relation to the WarSampo system, but more research is needed in this field.
A challenge of the semantic portal concept is related to the quality of the data, which is typically produced by more or less automatic means, leading to problems of incomplete, skewed, and erroneous data [57]. This, as well as conceptual difficulties in modeling complex real-world ontologies, such as historical geogazetteers, sometimes becomes embarrassingly visible when the knowledge structures are exposed to end users. In traditional systems the same problems are there, but are hidden in the unstructured presentations of the data. In general, more data literacy [47] is usually needed from the end user when using semantic portals and their data analytic tools. In spite of these challenges, the linked data approach is, in our experience, useful in finding interesting phenomena in Big Data using distant reading [63], but for interpreting the results traditional close reading is needed as before.
Footnotes
Acknowledgements
Tens of people74
