Abstract
Discovering motifs in time series data has been widely explored. Various techniques have been developed to tackle this problem. However, when it comes to spatial-time series, a clear gap can be observed according to the literature review. This paper tackles such a gap by presenting an approach to discover and rank motifs in spatial-time series, denominated Combined Series Approach (CSA). CSA is based on partitioning the spatial-time series into blocks. Inside each block, subsequences of spatial-time series are combined in a way that hash-based motif discovery algorithm is applied. Motifs are validated according to both temporal and spatial constraints. Later, motifs are ranked according to their entropy, the number of occurrences, and the proximity of their occurrences. The approach was evaluated using both synthetic and seismic datasets. CSA outperforms traditional methods designed only for time series. CSA was also able to prioritize motifs that were meaningful both in the context of synthetic data and also according to seismic specialists.
Keywords
Introduction
Under the data deluge scenario, Data Scientists are pushed to provide new ways for efficiently collecting, storing, processing, and organizing a large amount of data [55]. We are immersed in a scenario with massive databases from many sources, types, and formats. However, such scenario opens a set of research opportunities involving knowledge discovery [14, 47]. In this context, many phenomena can be observed and organized as a sequence of observations in a timeline that can be modeled as a time series, enabling discoveries.
A relevant area that is being explored in time series analysis is finding patterns [43]. Patterns are sub-sequences of time series that are related to some special properties or behaviors [13]. A particular pattern that occurs a significant number of times in time series is denominated motif[17].
The discovery of motifs enables the understanding of some specific behaviors observed in time series, in many areas of knowledge, such as weather prediction [29], wind generation [12], image recognition [7], seismic amplitude [3], and computation biology (such as protein discovery) [15, 25]. A vast number of motifs discovery techniques, methods, and algorithms have been developed [34, 44, 53, 65]. They include discovering motifs of a particular/variable length [23, 52], or without constraints (parameter-free algorithms) [40], or in multivariate time series [30, 64].
However, various important time-series phenomena present different behaviors when observed at points of space (for example, series collected by sensors and IoT) and are better modeled as spatial-time series, in which each time series is associated to a position in space. Under such scenarios, motifs might not be discovered when we restrict the analysis to the time-only dimension.
Consider, for example, the scenario where speed sensors are present in each corner of Manhattan to monitor vehicles speed. Imagine that a car accident occurs in one corner. Such occurrence decreases the vehicles speed there. It is a punctual phenomenon that might not occur again, i.e., this pattern is not observable using motif discovery techniques for univariate time series. However, due to the accident, congestions may occur in the nearby corners, decreasing vehicles speed. This pattern repeatedly occurs in the nearby corners, possibly with a time lag. It characterizes, intuitively, what we would call as a spatial-time motif. The challenge becomes to identify regions of space and time where the motifs are frequently observed. Finding patterns that are frequent in a constrained space and time, i.e., finding spatial-time motifs, may enable us to comprehend how a phenomenon occurs.
This paper tackles such problem by presenting an approach to discover and rank motifs in spatial-time series, denominated Combined Series Approach (CSA). CSA partitions the dataset into space-time blocks. Subsequences of time series inside these blocks are combined into a single time series. Then, it applies a traditional motif discovery algorithm over the combined time series to discover them. Subsequences occurring above a spatial-time threshold are selected as motifs. Finally, motifs are ranked considering their entropy, their number of occurrences, and the proximity of their occurrences.
We have evaluated our approach using Synthetic and Seismic datasets. It was able to identify spatial-time motifs that could not be identified using the traditional approach. CSA was also able to prioritize motifs that were meaningful both in the context of synthetic data and also according to seismic specialists.
In addition to this introduction, this work is organized into more seven sections. Section 2 presents the background for time series data mining. It includes a brief literature review about motifs in time series and the main concepts and techniques that support the motif discovering processes. Section 3 presents the Related Works. Section 4 formalizes the problem definition. Section 5 presents the CSA algorithm to find motifs in spatial-time series. Section 6 presents a synthetic dataset showing how the CSA behaves. Section 7 describes and explains the experiments that were made using seismic dataset. Finally, Section 8 concludes.
Background
In this section, we introduce some background for the data mining process of motif discovery.
Time series background
A
A
Figure 1 depicts the application of the above definitions to a time series. The blue line represents a time series
An example of a time series, sub-sequences, and sliding windows.
A spatial-time series
Data preprocessing techniques are key activities for enabling or improving the quality of data mining. In time series context, data is usually a continuous numerical value. For motif discovery, processing directly numerical representation is not efficient [9]. Due to that, during motif discovery, two data preprocessing techniques are commonly applied in sequence: (i) data normalization and (ii) Symbolic Aggregate Approximation (SAX).
Normalization is commonly used to enable the effectiveness of time series comparison methods. One of the most common normalization methods is z-score [19]. As a result of this normalization method, the normalized time series has zero as average and one as standard deviation. Equation (1) describes z-score normalization, where
SAX is an indexing technique. It consists primarily in partitioning the domain of a variable into ranges such that each range is associated with a particular symbol [26]. The SAX alphabet size defines the number of partitions for the domain. Thus, all values are replaced by their respective associated symbol. Given an alphabet
Given a sequence
Many methods proposed in the literature to discover motifs in time series are computationally intensive [43]. Due to that, many works aim to improve the effectiveness of motif discovery and reduce the computational resources needed. Such a process requires some data preprocessing such as normalization and indexing before running the motif discovery algorithms to increase the performance and precision of results [34].
There are some main approaches to discover motifs, such as (i) brute force; (ii) heuristic-based; and (iii) matrix profile. The brute force approach is the simplest method. It has a high computational cost, especially when used to discover sequences of greater size in large datasets [35]. It is indicated for discovering sequences of smaller size [22]. In this method, the coverage and accuracy are complete since it makes all possible comparisons between all subsequences of a time series.
The heuristic-based algorithms include methods such as random projections. The goal is to reduce the search space. The random projection was proposed to handle large dataset by reducing dimensionality. It randomly selects some of sliding windows columns for search [22]. The technique optimizes the execution time in discovering motifs [2, 8]. It adopts a collision matrix that is built by masking the projected columns of both the subsequence matrix and candidate search sequence. If they match, then the sequence is placed in a hash structure for full comparison.
The matrix profile is based on computing the distance of a sequence of size
After motif discovering process, an important task in motif analysis is how to sort the motifs according to their relevance [5]. A standard classification method is
Some approaches to evaluate the significance and relevance of motifs were proposed in the literature. A statistical approach to assess the relevance of motifs is based on information gain, which measures how expected is the motif to occurs [5]. The Log-odds considers the degree of how rare the motif is by comparing the amount of occurrence with the expected chance of having occurrence based on probabilistic distribution [62]. Castro and Azevedo[5] proposed the estimation of expectation for the frequency of a motif based on Markov Chain models. The value is assessed making the comparison between actual frequency and estimated based on hypothesis tests.
Related works
There are two recent review papers regarding motif discovery that characterize researches and trends [34, 53]. Concerning spatial-time approach, there are few initiatives such as discovering motifs in trajectory data [41] and discovering migration motifs in financial data [11]. Oates et al.[41] focused on analyzing repetitive sequences of moving objects. For that, they developed a grammar, applied SAX indexing, and searched for motifs over the trajectory. In our work, we do not have a moving object. Sensors are fixed, and we analyze a phenomenon that occurs at each position throughout time.
Meanwhile, in Du et al. [11], space is modeled by discrete attributes that resemble states of an object. In the context of their paper, they refer to the state of companies in the stock market. It is, in fact, a state-space model [47] where a trajectory is the registration of state transitions. In this way, it differs significantly from our work, since such a phenomenon may not be constrained in space and time.
Due to the absence of directly related work for the spatial-time motif discovery, for the sake of our work, we can group time series motif discovery approaches according to the exactness (exact or approximate), length (fixed or variable), and dimension (single or multiple).
Considering the exact motif discovery approach, some specific method to address the dimensionality and motif length problem were proposed. Jiang et al.[16] proposed an efficient motif discovery algorithm PMDGS (P-Motif Discovery based on Grid Structure) that processes data streams. Mueen et al.[35] proposed a motif discovery algorithm for exact time series called MK (Mueen-Keogh) and observed that MK was faster than brute-force [8]. Marang and Bhattacherjee [37] introduced the Par-MK, Par-MK-SLB, and Par-MK-DLB, which are all parallel multi-threaded algorithms for exact motif discovery that focus on load balancing. Mueen et al. [36] proposed a disk-aware algorithm to discover exact motifs in large time series databases. Cassisi et al.[3] applied a motif discovery technique for an exact time series to study recurrent patterns in seismic amplitude time series of the Etna 2011 periodic eruptive activity. Chi and Wang [6] introduced a method based on cloud model theory to extract the top
When it comes to approximate motif discovery, the approaches aim to reduce the complexity and, consequently, the computational cost. Some work proposed approaches to improve the accuracy and efficiency of Random Projection Algorithm as proposed in Chiu et al. [8]. Lin et al.[26] created a new symbolic representation for time series (SAX) for indexing. Mohammad and Nishida[31] proposed two algorithms called MCFull and MCInc that address constrained motif discovery problem. Castro and Azevedo[4] addressed motif discovery problem as an approximate top
Regarding variable-length motif discovery, Wilson et al.[60] proposed the Motif Tracking Algorithm (MTA) that uses a small number of parameters based on the implementation of the Bell immune memory theory. Yankov et al.[63] presented a novel algorithm that discovers motifs in time series with invariance to uniform scaling. It enables to reduce parameters such as motif length. Nunthanid et al.[39] described the VLMD motif discovery algorithm that does not require the motif length parameter. Such an algorithm automatically returns motif lengths from all possible sliding window lengths, reducing a set of possibilities of the sliding window lengths. Nunthanid et al.[40] presented the
Finally, in multivariate time series, Tanaka and Uehara[51] and Tanaka et al.[50] showed how to dynamically determine the optimum period length using the Minimum Description Length (MDL) principle and applied the method to the multidimensional time-series transforming into one-dimensional time-series using the Principal Component Analysis. Liu et al.[28] proposed a heuristic approach that can significantly improve the quality of motifs in
Problem definition
The motif discovery approaches presented in the literature review propose to solve the problem of discovering motifs on time series. In the context of spatial-time series, we observe a more complex scenario due to spatial constraints. In order to highlight this challenging problem, consider a synthetic dataset containing twelve spatial-time series (
If we apply to this scenario a known motif discovery method on each spatial-time series for a support
A synthetic dataset with twelve spatial-time series: 
Depending on the dataset, such similar subsequences in neighboring time series can correspond to some relevant information. Discovering and grouping motifs in spatial-time series datasets can address some real-world problems. Such a scenario was not studied in previous works as discussed in Section 3. The problem formalization for this new scenario is presented as follows.
A
Let
From the definition above, the problem can be summarized as the discovery of spatial-time motifs in a spatial-time series dataset.
To address the defined problem, we developed the Combined Series Approach (CSA) that is organized in three main steps: (i) normalization and SAX indexing; (ii) discovery of spatial-time motifs; (iii) ranking of spatial-time motifs. CSA is summarized in Algorithm 5. It takes as input a spatial-time series dataset
[!ht] [1] CSA
[1] normSAX
[1] discoverSTMotifs
[1] rankSTMotifsstmotifs
Combined series approach
Normalization and SAX indexing
The first step of the CSA, described by the normSAX function of Algorithm 5, applies z-score data normalization in the entire dataset. Right after normalization, the SAX indexing method is applied for a given alphabet
Spatial-time motif discovery
The second step of CSA corresponds to discoverSTMotifs function. In line 2, the indexed spatial-time series dataset
In line 5, all sequences inside a block are combined into a single time series
The discover function (line 6) applies a traditional motif discovery approach. More specifically, discover function applies an adapted hash-based approach [61] for exact match motif discovery [33].1
Such an approach enables the introduction of other state-of-the-art motif discovery algorithms.
Since the number of motifs can be high, especially when working with larger alphabets, it is important to establish ways to rank them in a way that more “interesting” ones are presented first. The third step of CSA, described by the rankSTMotifs function of Algorithm 5, makes a balance among three criteria: (i) the number of occurrences (significant higher occurrences are better); (ii) proximity (occurrences that are close to one another are better than the ones that are sparsely distributed in the dataset); (iii) entropy (higher entropy contains more information, which makes it more interesting).
Each motif corresponds to a sequence of SAX observations. All motifs that are discovered inside discoverSTMotifs are local block motifs. At group function (line 2), occurrences of motifs sharing the same sequence are grouped as long as they occur in neighboring blocks.
Then, for each group of motifs
At line 5 the set of occurrences
In line 7, the weight of the occurrences according to their proximity is computed. Consider the pairs of position and time for the set of occurrences
Once the entropy (
Analysis using synthetic dataset
For a better understanding of the CSA and its steps, consider a synthetic spatial-time dataset
Figure 3 depicts the result of applying the first step of CSA (normSAX) into the synthetic dataset. Values are being replaced by letters (
Synthetic dataset partitioned into blocks.
The second step of CSA (discoverSTMotifs) encompasses combine, discover, and validate functions. Figure 3 also shows the partitioning of the dataset according to CSA, where each orange box corresponds to a block. Since the dataset has 12 spatial-time series and 20 observations, the dataset is divided into 6 blocks, where each one contains 40 observations.
Figure 4 shows the result of the combine and discover functions presented in the discoverSTMotifs to our synthetic dataset. Each block produces a combined time series (cs) with 40 observations. In each cs, the discover identifies all motif with
Motif discovery algorithm applied to combined series.
The majority of motifs discovered using the CSA approach had not been found when applying traditional motif discovery algorithms on each spatial-time series. As indicated in Table 1, traditional approach (Trad.) only found two occurrences for motif
Table 1 also presents the dimensions used to rank the identified motifs. The entropy (Ent.) for both bded and ceeb was 1.5 since they contain three out of four distinct characters. The proximity metric (Prox.) is the reciprocal of the average weight of the minimum spanning tree that connects identified occurrences for each motif (the closer to one, the better it is). Thus, baba presents better proximity (0.83). The occurrences (Occ.) consider the
CSA versus traditional (Trad.) method in the synthetic dataset
Discovered motifs after spatial-temporal validation.
As a proof of concept evaluation, we applied CSA on the Netherlands seismic spatial-time series dataset, named F3 Block [10]. The database produced by the seismic reflection method was collected in a region located in the Dutch sector of the North Sea. This method consists of generating artificial seismic waves with energy sources that disturb the medium, such as explosives or air guns (called seismic shots) and record the waveforms of the various interfaces in the subsoil using sensors (geophones or hydrophones), in that acquisition air guns and hydrophones were used. The generated wave propagates through the interior of the Earth and the Sea. The partially reflected waves are used to find interfaces between layers that have significant contrasting elastic properties. The time of arrival of each reflection is related to the propagation velocities of the seismic wave in each layer. In a first approximation, the recorded amplitude is related to the contrast of the acoustic impedance, a product of velocity and density of the layers that define the interface. This method is analogous to imaging the human body using ultrasound. However, unlike medicine, where the density contrasts are imaged, on seismic exploration, the effect of the acoustic impedance difference is more studied [66].
In F3 Block dataset, each spatial-time series has a position in which the hydrophone is placed. The dataset is organized into inlines (direction of the ship navigation). We selected the inline 401 since it has been mapped by seismic specialists who have annotated some relevant information. Figure 6 shows the inline 401. Inline 401 consists of 920 spatial-time series with 440 observations in each. The horizontal axis represents the position of the receivers and vertical axis represents the time, which is also related to the depth at the subsoil.
The value of observations represents the wave amplitude reflected from the subsoil at a particular position and depth. Figure 7 depicts the probability density function (PDF), where it is possible to observe a frequency distribution with a high concentration of values close to zero and with values mainly varying from
Seismic dataset.
This section discusses the experimental setup aiming to evaluate CSA in discovering spatial-time motifs in the seismic dataset. The setup was driven for a sensitivity analysis to measure CSA performance under different aspects and also against the traditional approach designed only for time series.
The CSA algorithm requires parameters alpha, word, tb, sb,
Input parameters
Input parameters
Histogram of inline 401.
The CSA is available as an R Package (STMotif).2
In this analysis, the goal is to study the number of discovered motifs and their occurrences and computational time as we vary block size (tb and sb), word,
To evaluate the influence of block orientation, we set three orientations: vertical rectangle (
Table 3 presents the overall performance of both the traditional approach and CSA under different block orientation for all possible parameter combinations described in Table 2. The motifs column corresponds to the mean number of motifs whose occurrences were grouped with at least one neighboring block and contained more than seven occurrences (the maximum
Overall performance of CSA under different block orientation
Overall performance of CSA under different block orientation
MSE for each alphabet size.
The traditional approach, on average, discovered 43 different motifs under 449 sets of occurrences. It means that the same motif contains, on average, ten different spatial-temporal sets of occurrences. Also, the average discovery time and the average time to rank motifs were, respectively, 1.8 and 2.0 minutes. The average elapsed time was 4.7 minutes, which also includes the time to do the data normalization and SAX encoding.
The time for discovering motifs was approximately the same for all configuration, except for Horizontal orientation. In this setup, as we increase the size of the word, there is a lower number of possible motifs to discover. It becomes unnecessary to check for motifs in between two consecutive spatial-time series. It makes less possible comparisons for this setup, also meaning that a lower number of motifs are discovered. However, all CSA block orientations discovered more motifs than the traditional approach (the square had better performance. It discovered more than 2.5 times more motifs than traditional approach).
Comparing the performance of different CSA orientation (Vertical, Square, and Horizontal), we may expect that typically Horizontal orientation might break temporal sequences. However, in our dataset, patterns often occurred in a small time interval spread in space. Such behavior justifies the better performance of Horizontal orientation over Vertical one. Additionally, Square orientation had a better balance between time and space and was able to identify more patterns. The choice of block orientation is fairly domain-dependent. Users may consider their knowledge about the data to set up this parameter.
Table 4 disclosures the results of Table 3 according to the word size. It presents the number of discovered motifs and the sets of occurrences, applying the same criteria used to produce Table 3. It can be observed that as we increase the word size, the number of discovered motifs decreases. The same behavior occurs in the set of occurrences. The highest number of identified motifs occurred in CSA Square orientation for word size equals to four. Finally, the computation time (in minutes) for all discovered motifs also decreases as we increase the word size. It is due to the ranking function overhead. It has less impact on time when handling a lower number of occurrences.
Summary of discovered spatial-time motifs for different block orientation and word size
Table 5 presents the influence of
Influence of
Finally, we analyzed the top-k spatial-time motifs discovered using CSA Square block orientation for word size of four, fixing
Top five distinct motifs
Top five distinct motifs
Top-ten discovered motifs according to the ranking function.
Top-ten discovered motifs according to the number of occurrences, filtering the ones with ranking function lower than 1.0.
The highest-ranked motif (aagg) presented good proximity value, an average entropy value, and a high occurrences value. Such combination produced a rank value of 1.57. The second place (dfge), although exhibiting low occurrences value, has a good proximity and entropy values. The third place (aaag) was similar to the first motif, but with lower occurrences value. The fourth place (ggfa) compensated the low occurrences value with an excellent proximity value. Finally, the fifth place (egfa) is similar to the second, with a slightly lower proximity value.
In order to have an intuition on the quality of the ranked motifs, we have plotted the top-ten discovered motifs (Fig. 9), according to the ranking function, on top of the seismic dataset. The places where the motifs were plotted are in agreement with annotations from specialists where seismic horizons are located. Also, the yellow ones are very close to a gas reservoir.
In a complementary analysis, we sorted the motifs according to the number of occurrences. Figure 10 plots the top-ten distinct motifs sorted by their occurrences. The set of occurrences for each motif was plotted, as long as their ranking value were greater than 1.0. It can be observed that the occurrences of motifs matched more regions where seismic horizons are located.
It is worth mentioning that the ranking function was conceived for general purpose usage and did not focus on any aspect to target seismic horizons. Nevertheless, they were able to discover the majority areas in which seismic horizons were annotated. Finally, our algorithm is capable of identifying more or fewer motifs according to the used parameters. However, the meaning of the motifs and their relevance is up to the specialist evaluation.
Many applications observe phenomena whose values vary according to space and time dimensions. Discovering phenomena which are dependent on the occurrence in space and time requires extensions to traditional techniques adopted in time series analysis. In this paper, we tackle this problem by introducing a novel approach for spatial-time series motif discovery. To the best of our knowledge, this is the first work to propose a complete approach, named Combined Series Approach (CSA), for spatial-time motif discovery.
CSA supersedes traditional techniques when discovering spatial-time motifs, as it has been shown in our experimental evaluation. Additionally, CSA exhibits two major strengths points. Firstly, it is a divide-and-conquer algorithm that starts by discovering motifs inside a given spatial-time block. These blocks are then merged if neighboring blocks increase the number of occurrences of the discovered motifs. Such a technique makes the algorithm resilient to the initial block selection. Secondly, once the blocks have been defined, the approach is isolated from actual motif discovery algorithms applied. Such property enables exploring different motif discovery algorithms, such as Random Projection and Matrix Profile, exploring the effectiveness (precision), efficiency, and scalability targeting the improvement of space-time series discovery.
We have evaluated CSA against the traditional approach using both synthetic and seismic dataset. CSA was able to identify more motifs and occurrences than the traditional approach. Also, the identified motifs were well ranked considering both spatial-time constraints, number of occurrences and the motif entropy. Due to the potential of the technique applied to a spatial-time series, it opens opportunities for exploring other real-world applications modeled as spatial-time series. There are also opportunities to automate parameters for CSA. Finally, there are also opportunities to explore different ranking functions for targeting domain-specific problems.
Footnotes
Acknowledgments
The authors thank CNPq, CAPES (finance code 001), and FAPERJ for partially funding this research.
