Hierarchical clustering of heavy-tailed data using a new similarity measure

Abstract

Clustering is the primary technique used to divide data into groups based on unknown models inherent to the data. Controlling the entire clustering procedure is complicated and subject to several uncertainties. Choosing a similarity measure, that is, deciding how the similarity between two objects is to be measured, is one of the first decisions to be made. This research focuses on the influence of similarity measures in hierarchical clustering used to uncover patterns in heavy-tailed data. Stable distributions are the most important subclass of heavy-tailed distributions. A well-known similarity measure is based on the correlation between two objects. However, this measure cannot be used for heavy-tailed data. We illustrate how to perform a hierarchical cluster analysis of heavy-tailed data by extending the correlation-based similarity measure. We introduce a new similarity measure based on the covariation coefficient. We evaluate the performance of covariation similarity and compare it to other measures using external and internal criteria.

Keywords

Hierarchical clustering, dissimilarity measure, stable distribution, currency market

1. Introduction

In real life, we attempt to organize similar objects together and arrange them into groups, a simple and essential way of discovering order amid confusion. The aim of clustering is to uncover a system of organizing objects in which members of a group have similar properties. Clustering, a main unsupervised learning method, can be used to investigate the structure of large and complex data. It has been applied to pattern recognition, data mining and image processing [25, 7].

Clustering methods are usually divided into two groups: non-hierarchical and hierarchical. Non-hierarchical clustering generates a single partition, whereas hierarchical clustering generates a tree called a dendrogram, from which consistent partitions can be obtained at various levels [25, 7]. Hierarchical clustering has been applied extensively, for example to document clustering [12], the study of gene expression data, regulatory networks and protein interaction networks [3, 34], and the foreign currency exchange market [6, 35, 17, 19].

Investigators usually apply the most common distance measures in clustering without checking whether the conditions those measures rely on hold. When these conditions are not considered, the conclusions drawn are unclear and may lead decision-makers to wrong decisions. Furthermore, the researcher must be aware that the choice of dissimilarity measure can meaningfully affect the results of the clustering, and that some dissimilarity measures are unsuitable when the variables do not satisfy the required conditions. For example, in this article we show that Euclidean distance is inadequate when the distribution of the data is far from normal, and that our covariation similarity remedies this defect by improving the hierarchical clustering of heavy-tailed data. Identifying the proper dissimilarity measure to apply under given conditions is therefore the foremost motivation of researchers working on this topic.

This study aims to support better decisions on the choice of an appropriate dissimilarity measure for data with a heavy-tailed distribution. A common description of heavy-tailed data is a minority of large values in the tails and a majority of small values in the middle of the corresponding density. For instance, a country’s population often shows heavy-tailed behavior, with a minority of people in the countryside and the large majority in urban areas [29, 31].

Hierarchical cluster analysis can be agglomerative or divisive. This article concentrates on hierarchical agglomerative clustering, a statistical procedure in which groups are produced by successively merging similar clusters, as dictated by the dissimilarity and linkage measures chosen by the user [9, 28].

Data clustering is naturally connected to the concept of proximity. Proximities express the similarity or dissimilarity between records or objects. The first step an investigator must take is to determine the measurement that will be utilized to calculate the dissimilarity or similarity between objects. A similarity measure is often hard to define, and different similarity criteria can lead to different partitions. Most importantly, the choice of the dissimilarity measure will depend on the distribution of data.

The Minkowski distance of exponent $p$ is commonly used to measure the dissimilarity between two objects. It is defined as

${d_{p}}\left({\bm{x},\bm{y}}\right)={\left({\sum\limits_{i=1}^{d}{|{x_{i}}-{y_{i}}|^{p}}}\right)^{1/p}}$

where $\bm{x^{\prime}}=\left({{x_{1}},\ldots,{x_{d}}}\right)$ and $\bm{y^{\prime}}=\left({{y_{1}},\ldots,{y_{d}}}\right)$ are the two objects being compared on $d$ variables (attributes). For $p\geqslant 1$, the Minkowski distance is a metric. The most commonly used distance measure for continuous variables is the Euclidean distance, that is, the Minkowski distance with $p=2$.
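The definition above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `minkowski_distance` and the sample vectors are assumptions made for the example and do not come from the original study.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of exponent p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    if p <= 0:
        raise ValueError("p must be positive")
    # Sum |x_i - y_i|^p over the attributes, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical objects with d = 2 attributes.
x = [0.0, 0.0]
y = [3.0, 4.0]
euclidean = minkowski_distance(x, y, p=2)  # p = 2 gives the Euclidean distance
manhattan = minkowski_distance(x, y, p=1)  # p = 1 gives the city-block distance
```

The same function accepts exponents below one, in which case the result is a dissimilarity rather than a metric.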

Pearson’s correlation coefficient is another commonly used tool to measure the similarity between two objects [16]. Pearson similarity is defined as

$r\left({\bm{x},\bm{y}}\right)=\frac{{\sum\limits_{i=1}^{d}{\left({{x_{i}}-\bar{x}}\right)\left({{y_{i}}-\bar{y}}\right)}}}{{\sqrt{\sum\limits_{i=1}^{d}{{{\left({{x_{i}}-\bar{x}}\right)}^{2}}}}\sqrt{\sum\limits_{i=1}^{d}{{{\left({{y_{i}}-\bar{y}}\right)}^{2}}}}}}$

where $\bar{x}$ and $\bar{y}$ are the means of the attribute values of $\bm{x}$ and $\bm{y}$.

Pearson similarity calculates the correlation between two objects with respect to all attribute values.
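A minimal sketch of Pearson similarity follows, using only the Python standard library. The function name `pearson_similarity` and the sample vectors are illustrative assumptions. The example shows that two objects with the same shape but very different magnitudes receive the maximal similarity.

```python
import math


def pearson_similarity(x, y):
    """Pearson correlation between two objects over their d attributes."""
    d = len(x)
    if d != len(y) or d < 2:
        raise ValueError("x and y need the same length, at least 2")
    mean_x = sum(x) / d
    mean_y = sum(y) / d
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant object")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical objects: same shape at different scales, then opposite shapes.
r_same_shape = pearson_similarity([1, 2, 3, 4], [10, 20, 30, 40])
r_opposite = pearson_similarity([1, 2, 3, 4], [4, 3, 2, 1])
```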

A problem arises when a cluster contains more than one object: an ordinary distance is defined only between a pair of objects and cannot compare three or more objects simultaneously. This is addressed by a linkage measure, with which the investigator determines how to compute the proximity between two clusters. Once again, the aim is to find the two clusters that are closest to each other and merge them. The traditional linkages for hierarchical clustering are the Single, Complete, Average, Centroid, Median, Ward’s [28] and E-distance methods [32].

The rest of the paper is organized as follows. In the next section, we review related work. Multivariate symmetric stable distributions are explained in Section 3. Section 4 introduces the similarity measure based on covariation. In Section 5, we evaluate the performance of covariation similarity and compare it to other dissimilarities using various criteria on both artificial data and real exchange rate datasets. Section 6 presents the conclusions drawn from the results.
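To make the role of the linkage concrete, the following is a naive sketch of agglomerative clustering on a precomputed dissimilarity matrix. It covers only the single, complete and average linkages; Ward’s, centroid, median and E-distance linkages are not implemented here. The function name `agglomerative` and the toy data are assumptions made for illustration, not the implementation used in the experiments.

```python
def agglomerative(dissimilarity, n_clusters, linkage="average"):
    """Naive agglomerative clustering on a square dissimilarity matrix.

    Returns one cluster label (0 .. n_clusters - 1) per object.
    """
    n = len(dissimilarity)
    if not 1 <= n_clusters <= n:
        raise ValueError("n_clusters must lie between 1 and the number of objects")
    clusters = [[i] for i in range(n)]  # start with every object on its own

    def between(a, b):
        # All pairwise dissimilarities between members of clusters a and b.
        pairs = [dissimilarity[i][j] for i in a for j in b]
        if linkage == "single":
            return min(pairs)
        if linkage == "complete":
            return max(pairs)
        if linkage == "average":
            return sum(pairs) / len(pairs)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > n_clusters:
        # Find the two clusters that are most proximate under the linkage.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = between(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical one-dimensional objects forming two obvious groups.
points = [0.0, 0.1, 5.0, 5.2]
D = [[abs(p - q) for q in points] for p in points]
labels = agglomerative(D, n_clusters=2, linkage="complete")
```

Because the function takes the matrix as given, any dissimilarity can be plugged in; for an asymmetric dissimilarity, this sketch reads the entries row-wise, which is one possible convention among several.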

2. Related work

So far, several papers have presented methods for clustering normally distributed data based on Euclidean distance, Mahalanobis distance, correlation distance, Minkowski distance, etc. [18, 11, 21, 1, 8]. Jia and Darrell [15], in 2011, proposed a heavy-tailed distribution for the statistics of gradient-based image descriptors and used a distance measure derived from the likelihood ratio test under that distribution. In [30, 33], the authors used model-based clustering for heavy-tailed data. In these works, a mixture of symmetric stable distributions is presented and compared to a mixture of Gaussians; the proposed methodology proved more robust to outliers than the mixture of Gaussians. Amorim and Mirkin [2] proposed a way to overcome the lack of defense against noisy features by using feature weights in K-Means based on the Minkowski distance. Furthermore, Aggarwal et al. [1] showed that the fractional Minkowski distance is more robust to the presence of noise in the data. Fractional Minkowski dissimilarity has been preferred to the Minkowski distance when the noise affecting high-dimensional data is far from normal [1, 8]. Many practical applications require an asymmetric dissimilarity measure. Asymmetric dissimilarity is widely used for binary data and improves the efficiency of clustering algorithms [5, 10]. Guerra et al. [13] compared some of the best-known criteria used to cluster data with outliers or noise.

3. Stable distributions

Stable distributions are a class of probability models with interesting theoretical and practical properties. Their usefulness in modelling derives from the fact that they extend the normal distribution and allow heavy tails and skewness, which are frequently observed in biological, financial, physical and big data. Strong empirical evidence, together with the Generalized Central Limit Theorem, is used to justify stable probability models. Examples of applications of stable models are given in [22, 4, 27]. Such data sets are not well characterized by a normal model, but some can be well characterized by a stable model.

(Symmetric $\alpha$ Stable, S $\alpha$ S [31]).

A $d$ -dimensional stable random vector $\bm{X^{\prime}}=(X_{1},\ldots,X_{d})$ is symmetric about $\bm{\mu}\in\mathbb{R}^{d}$ if it has the following characteristic function:

$\phi_{\bm{X-\mu}}\left(\bm{t}\right)=\left\{\begin{array}{ll}\exp\left\{-\int_{\mathbb{S}_{d}}\left|\sum\limits_{i=1}^{d}t_{i}s_{i}\right|^{\alpha}\Gamma\left(\textup{d}\bm{s}\right)\right\}&0<\alpha<2,\\ \exp\left\{-\frac{\bm{t}^{\prime}\Sigma\bm{t}}{2}\right\}&\alpha=2,\end{array}\right.$

where $\bm{t^{\prime}}=(t_{1},\ldots,t_{d})$, $\bm{s^{\prime}}=(s_{1},\ldots,s_{d})$, $\Sigma$ is a positive definite matrix and $\Gamma$ is a finite Borel measure on the unit sphere $\mathbb{S}_{d}$ in $\mathbb{R}^{d}$. The index $\alpha$ is called the index of stability.

The parameter $\alpha$ governs the tail thickness (kurtosis) of the distribution. The normal distribution is the S $\alpha$ S distribution with index $\alpha=2$. When $\alpha<2$, the variance is infinite and the tails are asymptotically equivalent to a Pareto law, i.e. stable distributions follow a power law in the tails. Moreover, as $\alpha\nearrow 2$, S $\alpha$ S distributions get closer to the normal distribution; conversely, as $\alpha\searrow 0$, they move farther from it. Apart from the case $\alpha=2$, stable distributions are the most applicable subclass of heavy-tailed distributions.
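The effect of $\alpha$ on the tails can be illustrated with a small univariate simulation. The sketch below uses the Chambers-Mallows-Stuck formula for standard symmetric stable variates, which is a swapped-in univariate technique and not the multivariate generator of [20] used in the experiments. The function name `sample_sas`, the seed, the sample size and the threshold of 10 are all assumptions chosen for illustration.

```python
import math
import random


def sample_sas(alpha, rng):
    """Draw one standard symmetric alpha-stable variate, 0 < alpha <= 2."""
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    u = rng.uniform(-math.pi / 2.0, math.pi / 2.0)  # uniform angle
    w = rng.expovariate(1.0)                        # unit exponential
    if alpha == 1.0:
        return math.tan(u)                          # Cauchy case
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


rng = random.Random(0)  # fixed seed so that the run is repeatable
n = 20000
tail_fraction = {}
for alpha in (2.0, 1.0, 0.5):
    draws = [sample_sas(alpha, rng) for _ in range(n)]
    # Share of draws whose absolute value exceeds the (arbitrary) threshold 10.
    tail_fraction[alpha] = sum(abs(v) > 10.0 for v in draws) / n
```

With $\alpha=2$ the formula reduces to a normal variate with variance 2, so values beyond 10 are practically never drawn, whereas smaller $\alpha$ places a visibly larger share of the sample in the tails.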

([31]).

Let $\bm{X}$ be a $d$-dimensional stable random vector with $\alpha>1$ and let $\Gamma$ be the spectral measure of the random vector $(X_{i},X_{j})$, $i,j=1,\ldots,d$. The covariation and covariation coefficient of $X_{i}$ on $X_{j}$ are defined respectively as follows:

$\displaystyle{\left[{{X_{i}},{X_{j}}}\right]_{\alpha}}=\int_{{\mathbb{S}}_{d}}{{s_{1}}{{\left|{{s_{2}}}\right|}^{\alpha-1}}\textup{sign}\left({{s_{2}}}\right)\Gamma\left({\textup{d}\bm{s}}\right)},$

$\displaystyle\lambda=\frac{{{{\left[{{X_{i}},{X_{j}}}\right]}_{\alpha}}}}{{{{\left[{{X_{j}},{X_{j}}}\right]}_{\alpha}}}}.$

The covariance is a particular case of the covariation. If $\alpha=2$ ,

${\left[{{X_{1}},{X_{2}}}\right]_{2}}=\frac{\textit{Cov}({X_{1}},{X_{2}})}{2}.$
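As a quick consistency check (a worked step added for illustration, not part of the original derivation), substituting this identity into the definition of the covariation coefficient shows that, for $\alpha=2$, the coefficient of $X_{1}$ on $X_{2}$ reduces to the least-squares regression coefficient of $X_{1}$ on $X_{2}$:

$\lambda=\frac{{\left[{X_{1}},{X_{2}}\right]_{2}}}{{\left[{X_{2}},{X_{2}}\right]_{2}}}=\frac{\textit{Cov}({X_{1}},{X_{2}})/2}{\textit{Cov}({X_{2}},{X_{2}})/2}=\frac{\textit{Cov}({X_{1}},{X_{2}})}{\textit{Var}({X_{2}})}.$

Unlike the correlation coefficient, this quantity is not symmetric in its two arguments.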

4. Similarity measure based on covariation

We describe the estimation of the covariation coefficient as a generalization of Pearson’s correlation coefficient for S $\alpha$ S distributions, and then introduce the associated dissimilarity measure. One defect of Pearson similarity is that it assumes the objects are normally distributed and therefore may not be appropriate for objects that are far from the normal distribution [16]. Another defect is that it is not robust to outliers [14]. The correlation distance cannot be used for clustering data with two features. The Mahalanobis distance requires the inverse of the covariance matrix, which sometimes cannot be computed. To remedy these defects, we introduce the covariation coefficient as the similarity measure.

4.1 Covariation similarity

We introduce the covariation coefficient [22] as a tool to measure the similarity between two objects. The covariation similarity between $\bm{x^{\prime}}=\left({{x_{1}},\ldots,{x_{d}}}\right)$ and $\bm{y^{\prime}}=\left({{y_{1}},\ldots,{y_{d}}}\right)$ is defined as

${\hat{\lambda}_{\left(p\right)}}=\frac{{\sum\limits_{i=1}^{d}{{x_{i}}|{y_{i}}|^{p-1}\mbox{sign}\left({{y_{i}}}\right)}}}{{\sum\limits_{i=1}^{d}{|{y_{i}}|^{p}}}}$

for $1\leqslant p\leqslant 2$. To switch from similarity to dissimilarity, one could apply any monotone decreasing transformation to the similarity or its absolute value [9]; we implemented the following:

$\displaystyle CD_{p}(\bm{x},\bm{y})=\frac{1}{{c+|{{\hat{\lambda}}_{\left(p\right)}}|}},$

where $c$ is an arbitrary positive constant. We choose $c=0.05$ based on our experimental results. Like Pearson similarity, covariation similarity focuses on the shapes of the observations rather than their magnitudes. Based on the covariation dissimilarity measure, we perform hierarchical clustering and compare the result with those of other dissimilarity measures on artificial and real datasets.
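The two formulas above translate directly into code. The following is a minimal sketch; the function names `covariation_similarity` and `covariation_dissimilarity` and the sample vectors are assumptions made for illustration, while the default $c=0.05$ is the value reported in the text.

```python
def covariation_similarity(x, y, p=2.0):
    """Estimated covariation coefficient of x on y, for 1 <= p <= 2."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    if not 1.0 <= p <= 2.0:
        raise ValueError("p must lie in [1, 2]")

    def sign(v):
        return (v > 0) - (v < 0)

    numerator = sum(a * abs(b) ** (p - 1.0) * sign(b) for a, b in zip(x, y))
    denominator = sum(abs(b) ** p for b in y)
    if denominator == 0:
        raise ValueError("y must have at least one non-zero attribute")
    return numerator / denominator


def covariation_dissimilarity(x, y, p=2.0, c=0.05):
    """CD_p(x, y) = 1 / (c + |lambda_hat_(p)|), with c > 0."""
    return 1.0 / (c + abs(covariation_similarity(x, y, p)))


# Hypothetical objects with d = 3 attributes.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
lam_xy = covariation_similarity(x, y, p=2)   # 28 / 56
lam_yx = covariation_similarity(y, x, p=2)   # 28 / 14
cd_xy = covariation_dissimilarity(x, y, p=2)
cd_yx = covariation_dissimilarity(y, x, p=2)
```

Swapping the arguments changes the value, so in this sketch the order in which two objects are passed matters.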

There are three groups of dissimilarity measures: metric, semi-metric and non-metric. Metric dissimilarities are more correctly called distance measures; they satisfy all the axioms of a metric (non-negativity, identity of indiscernibles, symmetry and the triangle inequality). Covariation dissimilarity satisfies none of these axioms except non-negativity. Thus, covariation dissimilarity is non-metric. To show that covariation dissimilarity does not satisfy the triangle inequality, a counterexample is given by the following choice

$\displaystyle\bm{x}=\left(\begin{array}{c}2\\ 7\end{array}\right),\quad\bm{y}=\left(\begin{array}{c}3\\ 1\end{array}\right),\quad\bm{z}=\left(\begin{array}{c}3\\ 3\end{array}\right).$
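The violation can be checked by hand. The worked step below assumes $p=2$ and $c=0.05$ and picks one ordering of the three objects for which the inequality fails; the text does not state which $p$ or which ordering was intended.

$\hat{\lambda}_{(2)}(\bm{y},\bm{x})=\frac{3\cdot 2+1\cdot 7}{2^{2}+7^{2}}=\frac{13}{53},\quad\hat{\lambda}_{(2)}(\bm{y},\bm{z})=\frac{3\cdot 3+1\cdot 3}{3^{2}+3^{2}}=\frac{12}{18},\quad\hat{\lambda}_{(2)}(\bm{z},\bm{x})=\frac{3\cdot 2+3\cdot 7}{2^{2}+7^{2}}=\frac{27}{53}$

$CD_{2}(\bm{y},\bm{x})=\frac{1}{0.05+13/53}\approx 3.387,\quad CD_{2}(\bm{y},\bm{z})=\frac{1}{0.05+12/18}\approx 1.395,\quad CD_{2}(\bm{z},\bm{x})=\frac{1}{0.05+27/53}\approx 1.788$

Hence $CD_{2}(\bm{y},\bm{x})\approx 3.387>3.183\approx CD_{2}(\bm{y},\bm{z})+CD_{2}(\bm{z},\bm{x})$, so the triangle inequality fails for this choice.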

Sometimes non-metric measures perform better than metrics. For example, it has been shown that fractional Minkowski dissimilarity (Minkowski with an exponent less than one) can significantly improve the effectiveness of clustering algorithms [1, 8].

5. Experimental section

In this section, we compare the covariation similarity measure with traditional distances in hierarchical clustering. Many algorithms have been proposed for different applications and different data. To compare the results of different clustering algorithms, validity criteria are needed. In general, there are two fundamental kinds of criteria for investigating cluster validity: external and internal. With external criteria, we evaluate the results of a clustering algorithm against a predefined structure that is imposed on the dataset and reflects its intuitive structure. Internal criteria evaluate the clustering structure produced by an algorithm using only quantities and features inherent to the dataset. Accordingly, the misclassification and internal indices are used as validation criteria to compare clustering performance. See [13, 9] for more information about these criteria. The proportion of misclassified objects is defined as

$\displaystyle\frac{{\sum\limits_{i=1}^{n}{{I_{\left\{{\mbox{class}_{i}\neq\mbox{cluster}_{i}}\right\}}}}}}{n},$ (1)

where $\text{class}_{i}$ is the true class of the $i^{th}$ object, $\text{cluster}_{i}$ is the class assigned to the $i^{th}$ object by the clustering, $I$ is the indicator function and $n$ is the number of objects. All numerical results were obtained by running Algorithm 1 on the artificial dataset. We simulate data from a mixture of bivariate S $\alpha$ S distributions and assess the clustering performance of our dissimilarity measure compared with other standard dissimilarity measures.
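Equation (1) can be sketched as follows. Cluster labels produced by an algorithm are arbitrary, so this sketch assumes that they are first matched to the true classes by the relabelling that gives the fewest errors; that matching step is an assumption of the example (it is consistent with a maximum of 50 percent for two clusters), and the function name `misclassification` is hypothetical.

```python
from itertools import permutations


def misclassification(true_classes, cluster_labels):
    """Proportion of misclassified objects under the best label matching."""
    n = len(true_classes)
    if n == 0 or n != len(cluster_labels):
        raise ValueError("both label lists must be non-empty and equally long")
    classes = sorted(set(true_classes))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch expects no more clusters than classes")
    best = n
    # Try every assignment of distinct class names to the cluster labels.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        errors = sum(1 for t, c in zip(true_classes, cluster_labels)
                     if mapping[c] != t)
        best = min(best, errors)
    return best / n


# Hypothetical labels: the clustering swaps the names and misplaces one object.
true_classes = [0, 0, 0, 1, 1, 1]
cluster_labels = [1, 1, 0, 0, 0, 0]
rate = misclassification(true_classes, cluster_labels)
```

Enumerating permutations is only practical for a handful of clusters, which is enough for the two- and three-cluster settings considered here.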

Algorithm 1

Computing the misclassifications in Table 1

1: Parameters of the stable distribution: $\alpha,\Gamma,\beta$ , $\bm{\mu_{1}},\bm{\mu_{2}}$

2: Generate a dataset using the stable distribution [20]

3: for each dissimilarity measure do

4:     for each linkage measure do

5:         Run the hierarchical clustering algorithm

6:         Compute the misclassification using (1)

7:     end for

8: end for

9: Repeat steps 2 to 8 2000 times

10: Average the misclassifications
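The loop structure of Algorithm 1 can be sketched as a small harness. Data generation, clustering and scoring are passed in as callables, because the stable generator of [20] and the E-distance linkage are not reproduced here. The function name `average_misclassification` and all argument names are assumptions made for this illustration.

```python
def average_misclassification(generate, dissimilarities, linkages,
                              cluster, score, repeats=2000):
    """Average the misclassification over repeated simulated datasets.

    Caller-supplied pieces (all hypothetical interfaces):
      generate()                    -> (data, true_classes)
      dissimilarities[name](x, y)   -> dissimilarity between two objects
      cluster(matrix, linkage_name) -> one cluster label per object
      score(true_classes, labels)   -> misclassification proportion
    """
    totals = {(d, l): 0.0 for d in dissimilarities for l in linkages}
    for _ in range(repeats):
        # Draw a fresh dataset for this repetition.
        data, true_classes = generate()
        for d_name, d_fn in dissimilarities.items():
            # Pairwise dissimilarity matrix, filled row by row.
            matrix = [[d_fn(x, y) for y in data] for x in data]
            for l_name in linkages:
                labels = cluster(matrix, l_name)
                totals[(d_name, l_name)] += score(true_classes, labels)
    # Average over the repetitions for each (dissimilarity, linkage) pair.
    return {key: total / repeats for key, total in totals.items()}
```

Keeping the components as parameters makes it possible to swap in any dissimilarity or linkage without changing the loop.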

Table 1

Percentage of misclassification of hierarchical clustering with various dissimilarity and linkage measures. The best value for each column is bold

Dissimilarity	Linkage	$\alpha$ parameter of stable distribution
measure	measures	0.25	0.5	0.75	1	1.25	1.5	1.75	2
$\textit{CD}_{2}$	E-distance
	0.25 ${}^{*}$	28	31	32	28	24	18	13	9
	0.50	27	31	33	30	26	21	16	12
	0.75	26	31	33	31	28	24	19	15
	1.00	26	31	34	33	31	27	22	18
	1.25	26	30	34	35	32	30	25	21
	1.50	26	30	35	35	34	31	28	23
	1.75	26	30	35	36	35	33	30	26
	2.00	25	30	35	37	36	34	31	27
$\textit{CD}_{2}$	Complete	36	39	40	41	39	38	36	34
	Ward	26	30	35	37	36	33	31	27
	Average	45	45	46	45	43	40	37	32
	Median	48	47	47	47	45	42	40	37
	Single	49	49	49	49	49	49	49	49
	Center	49	48	48	48	48	46	43	40
	McQuitty	43	43	43	43	42	40	37	36
$\textit{CD}_{1}$	Complete	38	40	41	40	40	38	36	34
	Ward	46	45	44	42	40	38	36	34
	Average	48	47	46	45	43	40	38	35
	Median	48	47	47	45	43	41	39	36
	Single	49	49	49	49	49	49	49	49
	Center	49	48	48	47	45	43	40	37
	McQuitty	47	46	45	44	42	40	38	35
$\textit{CD}_{1}$	E-distance
	0.25 ${}^{*}$	47	46	44	42	40	38	36	33
	0.50	47	46	44	42	40	38	36	34
	0.75	47	46	44	42	40	38	36	34
	1.00	46	46	44	42	40	38	36	34
	1.25	46	46	44	42	40	38	36	34
	1.50	46	46	44	42	40	38	36	34
	1.75	46	46	44	42	40	38	36	34
	2.00	46	45	44	42	40	38	36	34
Euclidean	E-distance
	0.25 ${}^{*}$	49	48	47	47	38	33	30	28
	0.50	49	49	49	48	40	34	30	28
	0.75	49	49	49	49	44	35	31	28
	1.00	49	49	49	49	47	39	32	28
	1.25	49	49	49	49	49	42	33	29
	1.50	49	49	49	49	49	44	35	29
	1.75	49	49	49	49	49	46	37	29
	2.00	49	49	49	49	49	47	38	29
Minkowski	Ward
0.25 ${}^{**}$		49	49	49	49	48	42	33	29
0.50		49	49	49	49	48	41	33	29
0.75		49	49	49	49	48	41	33	29
1.00		49	49	49	49	48	40	32	29
1.25		49	49	49	49	48	40	32	28
1.50		49	49	49	49	48	40	32	28
1.75		49	49	49	49	47	39	32	28
2.00		49	49	49	49	47	39	32	28
Mahalanobis	Ward	NA	NA	49	49	49	49	46	36

${}^{*}$ The exponent of E-distance. ${}^{**}$ The exponent of Minkowski distance.

5.1 Artificial experiments

Data are simulated from a mixture of bivariate S $\alpha$ S distributions with various values of $\alpha$. To simulate the mixture, data are drawn from two bivariate S $\alpha$ S distributions with two different shift parameters, so the generated data have two attributes and two clusters. For the simulation, the STABLE library for R, written by John Nolan, was used [23, 26]. Table 1 shows the performance of the covariation similarity measure compared with other frequently used distances in hierarchical clustering. According to Table 1, hierarchical clustering with covariation dissimilarity gives the best results for S $\alpha$ S distributions across the various values of $\alpha$. Moreover, the covariation similarity measure with E-distance linkage outperformed the other methods for all values of $\alpha$. The misclassification of covariation similarity is approximately 17 to 25 percent lower than that of the others. According to these results, E-distance linkage performed meaningfully better than the other linkages, and single linkage gave the poorest results. When the dissimilarity measure is Minkowski, Ward’s linkage is the best. Also, covariation similarity with $p=1,2$ is better than with other values of $p$. For these reasons, other linkages and values of $p$ are not included in Table 1. When $\alpha$ is less than or equal to 0.5, the high variance prevented us from computing the inverse covariance matrix and hence the Mahalanobis distance (NA, not available). These results were obtained from 2000 runs of Algorithm 1. Note that the maximum misclassification when clustering data with two clusters is 50 percent.

5.2 Real data experiments

We use daily FX time series for a set of 45 major currencies from March 28, 2014, to January 2, 2015, on three continents (Europe, Asia and America), with three corresponding base currencies: the Euro, the Chinese Yuan Renminbi and the U.S. Dollar. The empirical data consist of daily FX rates collected from the website of the Pacific Exchange Rate Service (http://fx.sauder.ubc.ca/data.html). These data raise several questions, including the interaction of international currencies, the clustering of currency nodes and the model of price influences. Given the price time series of a currency exchange rate, we consider its return series to analyze its behavior. The return of an exchange rate with price ${P}\left(t\right)$ at discrete time $t$ is defined by

${R}\left(t\right)=\ln\frac{{{P}\left(t\right)}}{{{P}\left({t-1}\right)}}.$
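The return series can be computed as in the sketch below. The function name `log_returns` and the price values are hypothetical and serve only to illustrate the formula.

```python
import math


def log_returns(prices):
    """Log returns R(t) = ln(P(t) / P(t-1)) of a price series."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive")
    # Pair each price with its predecessor; the result is one element shorter.
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical daily exchange rates.
prices = [1.00, 1.10, 1.21, 1.21]
returns = log_returns(prices)
```

A series of $T$ prices yields $T-1$ returns, and an unchanged price gives a return of zero.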

We use the exchange rate dataset for the following reasons:

• The authors of [24] have shown that the distribution of exchange rates is stable; because stable distributions have infinite variance, Pearson similarity is not theoretically applicable, whereas covariation similarity is defined for stable distributions. In addition, Pearson’s correlation coefficient assumes approximately Gaussian attributes and may not be robust for attributes that are not Gaussian distributed.
• Pearson similarity is usually used for clustering exchange rates [6, 35, 17, 19]. In the same manner, covariation similarity considers two objects to be similar if their attributes are highly dependent, even though the observed values may be far apart in terms of Minkowski distance. For this reason, we did not use the Minkowski distance.
• We use hierarchical clustering based on covariation similarity to study both nonlinear and linear relationships in the foreign exchange market, whereas Pearson similarity only captures linear relationships between attributes.

The covariation dissimilarity or Pearson distance represents how closely exchange rates move together based on their dependence or correlation. This paper proposes an improvement to the clustering of exchange rates through a new similarity measure based on the covariation coefficient, which both theory and the simulation results indicate is a more appropriate choice for stable data.

Figure 1.
Internal index graphic for determining the best number of clusters and dissimilarity measure in the exchange rate dataset.

5.2.1 Internal criteria

Figure 1 shows the efficiency of covariation dissimilarity according to the internal indices and compares it with Pearson dissimilarity on the exchange rate dataset. Stable data approximately resemble data with outliers and noise: outliers are objects that do not belong to any predefined cluster, and random noisy dimensions are features that do not contribute to separating the clusters. Therefore, following [13], we used five internal indices: the Silhouette, Calinski-Harabasz, C, Gamma and Davies-Bouldin indices. For simplicity, we multiply the Davies-Bouldin and C indices by $-1$, so that for every index, as for the Silhouette, Calinski-Harabasz and Gamma indices, the best number of clusters and dissimilarity measure correspond to the greatest value of the index. According to Fig. 1, hierarchical clustering with covariation dissimilarity gives the best results on the exchange rate dataset and outperforms Pearson similarity. Based on the Calinski-Harabasz, Silhouette (with Ward and E-distance linkage) and C indices, the covariation similarity measure is better than Pearson for every number of clusters (Fig. 1a–d). For the Gamma and Davies-Bouldin indices, the covariation dissimilarity measure is better than Pearson dissimilarity at all but one point (Fig. 1e and f).
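As an illustration of how an internal index is evaluated from dissimilarities alone, the following sketch computes the Silhouette index in its standard form from a precomputed dissimilarity matrix. The function name `silhouette_index`, the toy data and the row-wise reading of the matrix (which matters for an asymmetric dissimilarity) are assumptions of this example, not details taken from the experiments.

```python
def silhouette_index(dissimilarity, labels):
    """Mean silhouette width from a square dissimilarity matrix and labels."""
    n = len(labels)
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton contributes 0 by convention
        # a: mean dissimilarity from i to the other members of its cluster.
        a = sum(dissimilarity[i][j] for j in own) / len(own)
        # b: smallest mean dissimilarity from i to any other cluster.
        b = min(sum(dissimilarity[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Hypothetical one-dimensional objects and two candidate partitions.
points = [0.0, 0.1, 5.0, 5.2]
D = [[abs(p - q) for q in points] for p in points]
good = silhouette_index(D, [0, 0, 1, 1])
bad = silhouette_index(D, [0, 1, 0, 1])
```

Larger values indicate a better partition, which is the orientation the sign change of the Davies-Bouldin and C indices is meant to reproduce.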

Table 2
Percentage of misclassification of hierarchical clustering of exchange rate dataset with various dissimilarity and linkage measures

Dissimilarity	Ward	Exponent of E-distance
measure	linkage	0.25	0.5	0.75	1	1.25	1.5	1.75	2
$\textit{CD}_{2}$	8	11	11	11	11	11	11	11	8
Pearson	47	60	49	49	48	47	47	47	47

5.2.2 External criterion

Based on the results of [17, 35], the returns of the exchange rate dataset were classified into three clusters according to geographical location (continent) and the three base currencies. In Table 2, we evaluate the performance of covariation dissimilarity using the misclassification and compare it with the Pearson dissimilarity measure on the exchange rate dataset.

According to Table 2, hierarchical clustering with covariation dissimilarity gives the best results on the exchange rate dataset. According to these results, E-distance and Ward’s linkages performed significantly better than the other clustering procedures. The misclassification of covariation dissimilarity is approximately 38 to 49 percent lower than that of Pearson. Owing to the high variance of the exchange rate dataset, we could not compute the inverse covariance matrix and hence the Mahalanobis distance (NA).

6. Conclusion

In this article, we have introduced a similarity measure based on the covariation coefficient for heavy-tailed data. The covariation similarity measure gave promising results, improving hierarchical clustering of heavy-tailed data. It was evaluated on both artificial and real datasets and compared with other distances. The covariation similarity measure with E-distance linkage outperformed the other methods for all values of $\alpha$ in S $\alpha$ S distributions. One of the most interesting findings was that the Mahalanobis distance performed the worst on the heavy-tailed dataset. Clustering of exchange rates most often relies on Pearson similarity. The empirical study shows that the covariation similarity measure outperforms Pearson dissimilarity in clustering an exchange rate dataset. To sum up, finding an adequate dissimilarity measure remains an open problem, but this article provides new guidelines for choosing a dissimilarity measure according to the characteristics and distribution of the dataset. The method suggested in this paper could be extended to other areas of clustering, and a possible line of future work is to apply it in other clustering algorithms for heavy-tailed data, such as K-means.

Acknowledgments

The authors would like to thank anonymous referees for their helpful comments and for careful reading that greatly improved the article.

References

1. Aggarwal C.C., Hinneburg and Keim D.A., On the surprising behavior of distance metrics in high dimensional spaces, 8th International Conference Lecture Notes in Computer Science 1973 (2001), 420–434.

2. Amorim R.C. and Mirkin, Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition 45(3) (2012), 1061–1075.

3. Assent, Clustering high dimensional data, WIREs Data Mining and Knowledge Discovery 2(4) (2012), 340–350.

4. Embrechts, Klüppelberg and Mikosch, Modelling extremal Events: for insurance and finance, Springer, Berlin Heidelberg, 2013.

5. Faith D.P., Asymmetric binary similarity measures, Oecologia 57(3) (1983), 287–290.

6. Fenn D.J., Porter M.A., Mucha P.J., McDonald, Williams, Johnson N.F. and Jones N.S., Dynamical clustering of exchange rates, Quantitative Finance 12(10) (2012), 1493–1520.

7. Filippone, Camastra, Masulli and Rovetta, A survey of kernel and spectral methods for clustering, Pattern Recognition 41(1) (2008), 176–190.

8. Francois, Wertz and Verleysen, The concentration of fractional distances, IEEE Transactions on Knowledge and Data Engineering 19(7) (2007), 873–886.

9. Gan Ch. and Wu, Data clustering: Theory, algorithms, and applications, ASA-SIAM, Philadelphia, 2007.

10. Garg, Enright C.G. and Madden M.G., On asymmetric similarity search, 2015 IEEE 14th International Conference on Machine Learning and Applications (2015), 649–654.

11. Geva A.B., Steinberg, Bruckmair Sh. and Nahum, A comparison of cluster validity criteria for a mixture of normal distributed data, Pattern Recognition Letters 21(6–7) (2000), 511–529.

12. Gil-Garcia and Pons-Porrata, Dynamic hierarchical algorithms for document clustering, Pattern Recognition Letters 31(6) (2010), 469–477.

13. Guerra, Robles, Bielza and Larrañaga, A comparison of clustering quality indices using outliers and noise, Intelligent Data Analysis 16(4) (2012), 703–715.

14. Heyer, Kruglyak and Yooseph, Exploring expression data: Identification and analysis of coexpressed genes, Genome Research 9 (1999), 1106–1115.

15. Jia and Darrell, Heavy-tailed Distances for Gradient Based Image Descriptors, Advances in Neural Information Processing Systems 24 (2011), 379–405.

16. Jiang, Tang and Zhang, Cluster analysis for gene expression data: A survey, IEEE Transactions on Knowledge and Data Engineering 16(11) (2004), 1370–1386.

17. Kwapień, Gworek, Drożdż and Górski, Analysis of a network structure of the foreign currency exchange market, Journal of Economic Interaction and Coordination 4(1) (2009), 1860–7128.

18. Martos, Munoz and Gonzalez, Generalizing the Mahalanobis distance via density kernels, Intelligent Data Analysis 18(6) (2014), 19–31.

19. McDonald, Suleman, Williams, Howison and Johnson S.N.F., Detecting a currency’s dominance or dependence using foreign exchange network trees, Physical Review E 72(4) (2005), 046106.

20. Modarres and Nolan J.P., A Method for simulating stable random vectors, Computational Statistics 9 (1994), 11–19.

21. Nielsen and Nock, Clustering multivariate normal distributions, Lecture Notes in Computer Science 5416 (2009), 164–174.

22. Nikias C.L. and Shao, Signal processing with alpha-stable distributions and applications, Wiley, New York, 1995.

23. Nolan J.P., Stable: Functions for working with stable distributions, R package version 5.1, 2009.

24. Nolan J.P., Panorska A.K. and McCulloch J.H., Estimation of stable spectral measures, Mathematical and Computer Modelling 34 (2001), 1113–1122.

25. Omran M.G., Engelbrecht A.P. and Salman, An overview of clustering methods, Intelligent Data Analysis 11(6) (2007), 583–605.

26. R Core Team, R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. ISBN 3-900051-07-0, URL http://www.R-project.org/.

27. Reiss R.D. and Thomas, Statistical Analysis of Extreme Values: with applications to insurance, finance, hydrology and other fields, Springer, 2007.

28. Rencher A.C., Methods of multivariate analysis, Wiley, New York, 2003.

29. Resnick S.I., Heavy-tail phenomena: Probabilistic and statistical modeling, Springer, New York, 2007.

30. Salas-Gonzalez, Kuruoglu E.E. and Ruiz D.P., Modelling with mixture of symmetric stable distributions using Gibbs sampling, Signal Processing 90(3) (2010), 774–783.

31. Samorodnitsky and Taqqu M.S., Stable non-Gaussian random processes: Stochastic models with infinite variance, Chapman & Hall, New York, 1994.

32. Szekely G.J. and Rizzo M.L., Hierarchical clustering via joint between-within distances: Extending Ward’s minimum variance method, Journal of Classification 22(2) (2005), 151–183.

33. Teimouri, Rezakhah and Mohammadpour, EM algorithm for symmetric stable mixture model, Communications in Statistics-Simulation and Computation (2017), 1532–4141.

34. Wang, Chen and Pan, A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks, IEEE Transactions on Computational Biology and Bioinformatics 8(3) (2011), 607–620.

35. Wang and Xie, Tail dependence structure of the foreign exchange market: A network view, Expert Systems with Applications 46 (2016), 164–179.
