Abstract
This article explains the idea of pattern systems that develop gradually. A pattern system is a form of symbolic communication consisting of symbols, syntax, and layout rules. Some pattern systems change over time, like historical scripts. The scientific study of the evolution of pattern systems is called pattern evolution research, and scriptinformatics is concerned with modelling the evolution of scripts. A symbol sequence consists of symbols from a pattern system, while a graph sequence is a symbol sequence implemented with a specific technology. This article describes a method for examining tested pattern systems to confirm their classification, which focuses on more ancient features. The method’s effectiveness was tested on Rovash scripts and graph sequences. Multivariate analysis was carried out by using PAST4 software, employing principal coordinates analysis ordination and
Introduction
A pattern system is a form of symbolic communication consisting of symbols, syntax and layout rules that determine their use. Among the possible pattern systems, this research focuses on those with evolutionary properties. The evolutionary study of pattern systems is called pattern evolution research. In biology, a taxonomic unit (taxon) is usually a species, while in pattern evolution research a taxon can be any pattern system (Table 1), such as Morse code, different character encodings or various scripts (human writing systems). For example, Morse code has evolved in several versions over time, from the code invented by Samuel Finley Breese Morse, through the code modified by Alfred Lewis Vail (the American Morse code) to the international Morse code, which itself evolved in several steps. Similarly, the evolution was a long and complicated process from Jean Maurice Émile Baudot’s 5-bit character code system through the American Standard Code for Information Interchange (ASCII), various international and national code systems to the ISO multilingual standard (ISO/IEC DIS 10646) and the BTRON encoding system [37].
An interesting type of pattern system is represented by the various human writing systems, or scripts for short. The different scripts and variants used by mankind have evolved over a long period of time. Understanding the evolution of these scripts can be important for deciphering the many untranslated past inscriptions, the so-called script remains [11]. In addition, deciphering the origins of ancient manuscripts may require an accurate description of the evolution and interaction of the large number of script variants associated with a single script. The study of the development of historical scripts and script variants is called scriptinformatics. The long-term goal of scriptinformatics is to help the interpretation and deciphering of script remains by exploring the evolutionary relationships between scripts [14, 15]. Scriptinformatics investigates the evolution of symbols and reveals the relationships between scripts. Phylogenetic methods of biology can generally be applied to describe the evolutionary processes of scripts. The focus of this research is the evolution of historical scripts; however, the developed method can also be applied to other types of pattern systems. The purpose of pattern evolution research is to better understand and model the evolution of pattern systems, including historical scripts.
Similarly to the biological species or various objects in machine learning, a pattern system can be described by a set of features. However, in the phylogenetic analysis of pattern systems, some of their features are not informative or even ambiguous. Therefore, prior to the actual phylogenetic modelling, a pre-processing step called feature selection (a kind of feature engineering) is required, which creates a filtered subset of features from the original full set of features for each pattern system by filtering out uninformative and useless features. Feature selection is a typical task of data mining and machine learning. In some cases, data analysis methods such as regression, clustering, and classification can be performed more precisely in the reduced-dimensional feature space than in the original space of features [4, 6, 21, 29, 42, 43].
Phylogenetic modeling can be performed not only on the scripts but also on inscriptions written in the scripts under study. In the present research, both scripts (in general, pattern systems) and inscriptions (in general, graph sequences; see Table 1) are analyzed using the developed composite method. The objective of the research is to verify whether the developed composite multivariate method detects the difference between the use of the full and the filtered feature sets.
The article first discusses the scientific background of pattern evolution research, then describes the developed composite multivariate method, and then presents and evaluates the results of the analysis in the Results and Discussion section. Finally, the Conclusions section summarizes the obtained results. At the end of the article, the public availability of the data used for the analysis is given. The multivariate analyses were performed with PAST 4.12a [7, 8]; the figures were resized with Inkscape 1.2 [18].
Background
Terms and concepts
Basic terms and concepts of pattern evolution research and scriptinformatics
Basic concepts of pattern evolution research are presented in Table 1. On this basis, the evolution of pattern systems can be studied in a similar way to evolutionary biology and bioinformatics. In many cases, a surviving graph sequence is very short. In principle, however, the creator (e.g., a scribe) of the symbol sequence implemented by the graph sequence had available, at the time of its creation, a picture of the pattern system, i.e. the feature set (symbol set, syntax and layout rules) that the scribe knew; only a small part of this feature set may have been needed to implement the given symbol sequence, or the graph sequence representing its implementation by a certain technology. Table 2 summarizes the corresponding concepts of pattern evolution research and scriptinformatics (a subfield of pattern evolution research).
Corresponding concepts of pattern evolution research and scriptinformatics
There is some research that is similar to pattern evolution investigation, and especially to scriptinformatics. An example is phylomemetics, which refers to the phylogenetic analysis of non-genetic data, such as the evolutionary analysis of manuscripts [16]. Another example of non-biological evolution is software evolution, which studies the evolutionary context of software [35]. Another body of research is scriptinformatics, which applies bioinformatic methods and convolutional neural networks [2, 34]. Cladistic tools have been used to examine manuscript versions of India’s national epic, the Mahabharata, written in various Brahmic scripts [30]. It was established that the phylogenetic relationships of the scripts used for each manuscript version differ from those of the individual parts of the text itself. Phillips-Rodríguez pointed out that the phylogenetic relationships of the scripts used for the manuscripts should be considered external (codicological) data, while those of the text versions of the manuscripts should be considered internal (stemmatological) data [31].
In the case of pattern systems that have not been used for a long time and have often been forgotten, the features of the pattern systems can only be recognized from the surviving graph sequences (e.g. inscriptions), such as the pattern system of the Elymian script once used in western Sicily (Italy) [27] or the Khotanese version of the Brahmic pattern system [26]. The research efforts of the author’s research group cover a broad range of topics such as applying machine learning methods to explore similarities among scripts [11], reconstructing lineages of symbols in various scripts and investigating methods for testing the appropriateness of the reconstructed lineages [13].
The features of the examined pattern systems (e.g. scripts) can be handled as categorical variables since these features have the two states, e.g., “presence of a variant of a symbol” and “absence of a variant of a symbol”. Therefore, the two feature states (presence or absence) can be represented as binary variables [40]; in other words, the categorical variables are transformed into Boolean indicator variables. If the number of objects is
Contingency table describing the matching probabilities of two objects in binary features
In the case of pattern systems, the presence and the absence of a feature are not symmetrical, because only a minority of all features are actually present in each object (pattern system or graph sequence) under study. Since the absence of a feature in an object is therefore not characteristic, the most appropriate similarity measure for the present dataset is one that highlights similarities in the presence of features. Hence the Sørensen–Dice coefficient [3, 39] is a good choice, since it emphasizes the effect of the co-existence of feature states; see Eq. (1).
The Sørensen–Dice dissimilarity (distance) between objects
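For orientation, the Sørensen–Dice coefficient and the corresponding dissimilarity are commonly written in the following standard form. The symbols are assumptions made here for illustration and may differ from the notation of the contingency table above: $a$ is the number of features present in both objects $i$ and $j$, $b$ the number present only in $i$, and $c$ the number present only in $j$. Features absent from both objects do not enter either formula, which is why the measure suits data where absence is not characteristic.

```latex
% Assumed notation: a = features present in both objects,
% b = present only in object i, c = present only in object j.
\begin{align}
  S_{ij} &= \frac{2a}{2a + b + c}, \\
  d_{ij} &= 1 - S_{ij} = \frac{b + c}{2a + b + c}.
\end{align}
```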
There are many other measures that describe the similarity or dissimilarity of objects
Ordination methods are a type of machine learning techniques that represent similarity relationships in some dimensions; they aim to reduce the dimensionality of large data structures with the least information loss. The dimensionality of a dataset is the number of features of the objects. Ordination extracts artificial variables to reduce the dimensionality of the original feature set of objects [32]. Dimensionality reduction is reducing the number of features while preserving as large a fraction of the variation in the original dataset as possible.
Multidimensional scaling (MDS), one of the ordination methods, performs a kind of feature extraction by creating new variables from original features. It transforms multidimensional spatial data into a space with a smaller (
One type of MDS is principal coordinates analysis (PCoA), an exploratory analysis method [5]. Along with principal component analysis and correspondence analysis, PCoA belongs to the group of linear scaling ordination methods. These methods are widely used in data analysis, e.g. [17].
PCoA handles cases where the analysis starts from a single dissimilarity matrix and defines a measure for the input data [33, 36, 41]. It performs a linear mapping of the differences between objects into a Cartesian ordination space, and it tries to explain the largest variance of the original dataset. The purpose of PCoA is to compute Euclidean distances representing a set of any dissimilarities of the objects [1]. PCoA maximizes the agreement between the calculated Euclidean distances in ordination space and the dissimilarities in the original space. In contrast to principal component analysis, PCoA can be applied not only to quantitative data but also to any proximity measure (similarity or dissimilarity). Using the Euclidean distance gives results similar to principal component analysis. In the present case, however, the Sørensen–Dice dissimilarity Eq. (2) was applied.
Both principal components analysis and PCoA are eigenanalysis techniques, i.e. objects are projected onto axes (coordinates), with the first axis explaining the greatest variation of the objects in the original space, the second axis explaining as much of the remaining variation as possible and so on. PCoA finds the eigenvalues and eigenvectors of the dissimilarity matrix of objects. Each eigenvalue has an associated eigenvector, which keeps its direction when it is multiplied by the transformation matrix. The associated eigenvalue is the scaling factor of the transformation along the direction of the eigenvector. The eigenvalue can be determined for each coordinate, giving the amount of variance caused by the corresponding eigenvector (coordinate). The importance of these axes is measured by the eigenvalues. The goodness of PCoA can be measured by the percentage of variation explained by each coordinate. However, in the case of PCoA, there is no direct connection between the components and the original variables, so interpreting the role of the variables can be difficult because, unlike in principal component analysis, the components are not linear combinations of the original variables.
PCoA creates a set of orthogonal (uncorrelated) axes to summarize the variability in the dataset. Each axis has an eigenvalue, the magnitude of which indicates the amount of variation captured in that coordinate. The ratio of a given eigenvalue to the sum of all eigenvalues expresses the relative importance of each coordinate. The PCoA result ideally generates a few coordinates with relatively large eigenvalues, capturing more than half of the variation in the input data, while all other coordinates have small eigenvalues. The interpretation of a PCoA diagram is that objects that are closer together are more similar than those that are farther apart. In the PCoA plot, coordinate 1 accounts for the largest share of the variation in the data, and coordinate 2 explains the largest proportion of the remaining variation.
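The eigenanalysis described above can be sketched in a few lines of NumPy. This is a minimal illustration of classical PCoA (double-centering of the squared dissimilarities followed by an eigendecomposition), not the implementation used in PAST. The presence/absence matrix `X`, the helper names `dice_dissimilarity` and `pcoa`, and the choice to report percentages over the positive eigenvalues only are assumptions made for this example.

```python
import numpy as np

def dice_dissimilarity(x, y):
    """Sørensen–Dice dissimilarity between two binary (0/1) feature vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    a = np.sum(x & y)    # features present in both objects
    b = np.sum(x & ~y)   # features present only in the first object
    c = np.sum(~x & y)   # features present only in the second object
    denom = 2 * a + b + c
    return 0.0 if denom == 0 else (b + c) / denom

def pcoa(D):
    """Classical PCoA of a symmetric dissimilarity matrix D.

    Returns the positive eigenvalues (descending), the object coordinates
    scaled by the square root of each eigenvalue, and the percentage of
    variation per coordinate (relative to the positive eigenvalues only,
    which is an assumption of this sketch).
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared dissimilarities
    eigvals, eigvecs = np.linalg.eigh(B)     # eigh: B is symmetric
    order = np.argsort(eigvals)[::-1]        # sort axes by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-10                   # drop zero and negative eigenvalues
    coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    percent = 100.0 * eigvals[keep] / eigvals[keep].sum()
    return eigvals[keep], coords, percent

# Hypothetical presence/absence data: 4 objects (rows) x 6 features (columns).
X = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 1],
])
n_obj = X.shape[0]
D = np.array([[dice_dissimilarity(X[i], X[j]) for j in range(n_obj)]
              for i in range(n_obj)])
eigenvalues, coordinates, percent = pcoa(D)
print(np.round(percent, 2))  # share of variation per coordinate
```

Because the Sørensen–Dice dissimilarity is not guaranteed to be Euclidean, some eigenvalues of the double-centered matrix can be negative; the sketch simply discards them, whereas statistical packages may handle them differently.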
The scatter plot represents all objects in the coordinate system given by the PCoA, where the eigenvalue scaling was used, so each axis is scaled to the square root of the eigenvalue. In the PCoA process, before eigenanalysis, the dissimilarity values are raised to the power of the transformation exponent
If the structure of the objects under study is not continuous, a scatter plot in the ordinated space using two or three principal variables may be sufficient to reveal the group structure of the objects [23]. A common goal in machine learning is to form well-distinguished clusters in a low-dimensional space; this requires combining an ordination method with a clustering method, such as in [28]. In the present research,
K-means
In the
where
The cluster assignments are initially random. In an iterative procedure, objects are then moved to the cluster which has the closest cluster mean, and the cluster means are updated accordingly. This procedure continues until objects are no longer jumping to other clusters. The result of the
The
There are various methods to find the best value of
where
The optimal
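The iterative procedure described in this subsection (random initial assignment, moving each object to the cluster with the closest mean, updating the means, and stopping when no object changes cluster) can be sketched as follows. This is a minimal illustration, not the PAST implementation; the function name `kmeans`, the parameters `n_iter` and `seed`, and the sample points (standing in for ordination coordinates) are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means with random initial cluster assignments.

    Returns the cluster label of each point and the cluster means.
    Assumes the number of points is at least k.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    n, dim = points.shape
    # Random initial assignment in which every cluster receives at least one point.
    labels = rng.permutation(np.arange(n) % k)
    means = np.zeros((k, dim))
    for _ in range(n_iter):
        # Update each cluster mean; an emptied cluster keeps its previous mean.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
        # Move every point to the cluster with the closest mean.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no point changed cluster: converged
        labels = new_labels
    return labels, means

# Hypothetical two-dimensional coordinates forming two well-separated groups.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, means = kmeans(pts, k=2)
print(labels)
```

Because the starting assignment is random, different seeds can lead to different local optima on less clearly separated data, so running the procedure several times and comparing the results is a common precaution.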
In the present study, the examined pattern systems are the Rovash scripts, which were used by some populations of the Eurasian steppe [13]; namely, Turkic Rovash (TR, also known as Turkic Runic), Székely-Hungarian Rovash (SHR), Carpathian-Basin Rovash (CBR) and Steppe Rovash (SR). In a process of data selection, the possible ancestors of these pattern systems were estimated by means of a phenetic-based successive elimination analysis [14]. These preliminary studies suggest that their probable ancestor could be a hybrid of the Aramaic, Middle Iranian and Brahmic scripts [15]. This hypothetical common ancestor (called Proto-Rovash, PR) was reconstructed [14], and it is used in the present analysis. As the Rovash scripts evolved, they gradually lost features inherited from PR. Only a proportion of their features have their origin in PR; these features are collectively called the filtered feature set. The full feature set contains 119 features, of which 72 belong to the filtered feature set. The rest of the features resulted from later, local or occasional influences of various other scripts. Table 4 presents some properties of the pattern systems under study.
Main properties of tested pattern systems
Of the pattern systems under study, only SHR has remained in use until the present day; the others became extinct a millennium ago. Very few surviving graph sequences (inscriptions) are known for the pattern systems examined in this research, and the age of most of them is either unknown or can only be estimated approximately. Since PR, TR, CBR and SR became extinct at the end of the 1st millennium AD, the attribution of surviving Rovash graph sequences to one of the Rovash pattern systems is not always clear. Table 5 shows the name and traditional classification of each surviving Rovash graph sequence into pattern systems. There are no surviving graph sequences that can be classified as PR. The detailed descriptions of the graph sequences in Table 5 are presented in the literature [9, 10, 14].
Classification of graph sequences under study
Some of the key questions of pattern evolution research are the following: what features characterize the pattern systems under study, and at what stage of the evolution of the tested pattern systems did these features evolve? The former question has been addressed by extended phenetic modelling, the latter by the successive elimination of extended phenetic modelling; both were performed in previous research [14, 15]. These examinations can be called a feature engineering phase or simply feature selection, where the tests were applied to special pattern systems, the Rovash scripts (see above). During this phase, the existence of the putative common ancestors of these pattern systems was confirmed, and it was determined which part of the full feature set of these pattern systems might have existed at the beginning of their evolution (in their hypothetical common ancestor, denoted by PR, see above). This earliest part of the full feature set is called the filtered set of features. Features that are not part of the filtered set may have evolved during the independent evolution of each pattern system. It is important to emphasize that the feature selection carried out in previous research [14, 15] was based exclusively on pattern systems; the surviving graph sequences were not directly included in that study.
The question of the present research is whether the correctness of the selection of the filtered feature set can be verified by the surviving graph sequences made with the pattern systems under study. The approach is to check whether the results obtained using the filtered feature set are indeed less reflective of the separation of the tested pattern systems from each other than those obtained using the full feature set. Figure 1 shows the flowchart of both feature selection as a feature engineering phase and verification of feature selection as a newly developed machine learning method based on PCoA and
Main steps of the composite multivariate analysis method.
PCoA results of the tested pattern systems with full feature set (left) and with filtered feature set (right).
The new composite multivariate method performs the PCoA of the pattern systems and the associated graph sequences separately, both for the full feature set and for the filtered feature set, respectively.
PCoA of pattern systems and graph sequences
The PCoA scatter plots are shown in Fig. 2 for the pattern systems tested using the full feature set and the filtered feature set, respectively. The dissimilarity measure used was the Sørensen–Dice dissimilarity Eq. (2). Objects that are closer together have smaller dissimilarity scores in the scatter plot than those that are farther apart.
The eigenvalue is given for each axis, which gives a measure of the variance caused by the corresponding eigenvector (coordinate). The eigenvalues and the percentages of variance due to these components are given in Table 4.1 for the full feature set and filtered feature set cases, respectively.
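| Axis | Eigenvalue (full feature set) | Variance, % (full feature set) | Eigenvalue (filtered feature set) | Variance, % (filtered feature set) |
| --- | --- | --- | --- | --- |
| 1 | 4.3879 | 19.685 | 4.594 | 19.642 |
| 2 | 3.9130 | 17.555 | 3.480 | 14.880 |
| 3 | 1.4887 | 6.679 | 1.732 | 7.405 |
| 4 | 1.1509 | 5.163 | 1.321 | 5.649 |
| 5 | 1.1084 | 4.972 | 1.276 | 5.456 |
| 6 | 0.9148 | 4.104 | 1.015 | 4.341 |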
In case of
