Statistical data integration: Challenges and opportunities

Abstract

Morris and Baladandayuthapani should be congratulated on this comprehensive and compelling review of statistical contributions in bioinformatics. In the last section, the authors discuss a burgeoning new area of research they term ‘integromics’, which involves integrating and jointly analyzing multiple types of ‘omics’ data. While we prefer the term ‘data integration’, we agree that this is an exciting new area of statistical and bioinformatics research. In this piece, we aim to complement Morris and Baladandayuthapani by further discussing data integration from a methodological and practical standpoint. Specifically, we outline some of the major challenges of data integration, describe some recent successes, and highlight open areas for future research.

1 Statistical data integration and multi-view data

The term data integration has come to refer to many things in many different fields. The type encountered in bioinformatics is typically integration of what we call ‘multi-view’ or ‘multi-modal’ data. Suppose that there are multiple types of omics data profiled on the same set of subjects or samples. The multiple omics platforms offer multiple views of the data or multiple data modes. If the data are organized as a typical data matrix where the rows correspond to observations and the columns to features, then multi-view data yield a series of coupled data matrices, each with different features (different columns) but measured on the same observations (shared rows). In this sense, the task of integrating multi-view data can be thought of as the opposite of meta-analysis: in data integration, common observations are analyzed across different sets of features, whereas in meta-analysis, common features are analyzed across different sets of observations.
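As a minimal sketch of this coupled-matrix structure, the following Python snippet aligns two toy views on their shared subjects. The subject IDs, feature values and the helper `couple_views` are hypothetical and serve only to illustrate the shared-rows, different-columns layout.

```python
# Toy multi-view data: each view maps a subject ID to a feature vector.
# All IDs and values are made up for illustration.
expression = {"s1": [5, 0, 12], "s2": [3, 1, 7], "s3": [8, 2, 9]}            # counts
methylation = {"s1": [0.10, 0.85], "s2": [0.40, 0.55], "s3": [0.25, 0.70]}   # in [0, 1]


def couple_views(views):
    """Align views on their shared subjects and return (subjects, matrices).

    Each returned matrix has one row per shared subject, in the same order,
    and its own set of columns (features).
    """
    shared = set.intersection(*(set(view) for view in views.values()))
    subjects = sorted(shared)
    matrices = {name: [view[s] for s in subjects] for name, view in views.items()}
    return subjects, matrices


subjects, matrices = couple_views({"expression": expression, "methylation": methylation})

# Same rows (subjects) in every view, different numbers of columns (features).
for name, matrix in matrices.items():
    print(name, len(matrix), "rows x", len(matrix[0]), "columns")
```

A meta-analysis would reverse the roles: the matrices would share columns (features) and differ in their rows (observations).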

1.1 Why data integration?

We first pause to address a logical question: Why is data integration necessary and, especially, why are new statistical methods needed for data integration? From a scientific perspective, and as Morris and Baladandayuthapani well motivate, integration of multi-view bioinformatics data is critically important for scientists to gain a holistic understanding of biological systems. Statistically, we are accustomed to analyzing multivariate data, but the problem of data integration, especially for omics data, brings new challenges. First, each single view of multi-view data is typically high-dimensional, meaning that the number of features in each view is larger than the number of observations. Analyzing high-dimensional data of one type is often a challenge, and thus jointly analyzing multi-view data where each data view is high-dimensional is doubly challenging; many high-dimensional statistical techniques cannot be straightforwardly applied to high-dimensional multi-view data. Second, many data integration problems in bioinformatics involve multi-view data of ‘mixed types’, meaning that each data view consists of variables from a different domain (e.g., continuous, count-valued, categorical, skewed continuous, bounded, among others). For example, in integrative genomics, genotype data is typically categorical, gene expression as measured via RNA-sequencing is count-valued or non-negative skewed continuous, DNA methylation data is bounded on the interval zero to one, and so forth. While many techniques have been developed to model each of these individual data types separately, there are currently few methods that can jointly analyze high-dimensional mixed multi-view data.

2 Data challenges

Multi-view omics data pose many data challenges that must be addressed before statistical modelling can occur. As Morris and Baladandayuthapani note, many of these challenges have not been the primary focus of statisticians, but their involvement is important. First, acquiring and preparing multi-view omics data for statistical data integration can be challenging. For example, with The Cancer Genome Atlas (TCGA) data available from the TCGA data portal (TCGA Research Network, 2011, 2017), different omics types are often represented and stored differently, each requiring specific domain expertise to understand and pre-process the data. Thus, acquiring and formatting the data and linking the shared subjects to yield coupled data matrices in a unified format conducive to statistical modelling is nontrivial. For TCGA data, we developed the TCGA2STAT R package (Wan et al., 2015), which automatically downloads and wrangles the TCGA data, yielding a series of coupled data frames linked by subjects or genes that are ready for integrated statistical analyses. Related to this challenge, statisticians or bioinformaticians often specialize in analyzing a few, but not all, types of omics data. This means that in order to pre-process and jointly analyze multi-view omics data, a collaborative team is often needed, with members who have expertise in each data type. This in turn creates challenges for ensuring reproducible research, as different team members often use different pipelines and platforms for processing different omics data. Care must be taken to ensure that all of these processing steps are fully documented and reproducible across platforms and across team members.

Finally, problems such as batch effects and missing data, which can be worrisome for individual datasets, can be further exacerbated when working with multi-view omics data. Batch effect detection and correction methods are typically designed for a single dataset, and with multi-view data, batch effects are usually removed independently for each data type (Leek et al., 2010). However, the statistical power to detect and remove batch effects could be substantially increased if all views of multi-view data are considered; new statistical techniques for this task are needed. Furthermore, such considerations can inform the experimental design of multi-view studies: if batches of observations are different and randomized across data views (i.e., the group of samples comprising a batch in one data view is not grouped together in a batch for another data view), then this can be exploited to improve detection and removal of batch effects from multi-view data. Missing data, or more appropriately, missing views from multi-view data, can also be a major problem. Consider TCGA ovarian cancer, for example, where there are $n = 592$ unique subjects, including $n = 210$ patients with somatic mutation data, $n = 296$ with RNA-sequencing gene expression, $n = 578$ with miRNA-array expression, $n = 588$ with array-CGH copy number variation and $n = 572$ with methylation array data; only $n = 204$ patients have complete data views across these omics types (TCGA Research Network, 2011). The pattern of missing data, then, is not at all random; instead, entire data views are completely missing for many subjects. If one only uses the complete cases across all data views, then the statistical power is severely limited. On the other hand, it is ill-advised to simply impute an entire data view that is missing for a given subject. Thus, this is a wide open area of statistical research that is of great practical importance.
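In the ovarian cancer example, a complete-case analysis would retain only 204 of 592 subjects, or about 34%. The bookkeeping behind such counts can be sketched in a few lines of Python; the subject IDs and view names below are hypothetical toy values, not TCGA data.

```python
from collections import Counter

# Hypothetical subject IDs with data in each view (toy example, not TCGA).
available = {
    "mutation": {"p1", "p2", "p5"},
    "rnaseq": {"p1", "p2", "p3", "p5"},
    "methylation": {"p1", "p2", "p3", "p4", "p5", "p6"},
}

# Subjects observed in at least one view, and in every view.
all_subjects = set.union(*available.values())
complete_cases = set.intersection(*available.values())


def view_pattern(subject):
    """Return the tuple of views observed for one subject, in sorted order."""
    return tuple(view for view in sorted(available) if subject in available[view])


# Tabulate how many subjects share each pattern of observed views.
pattern_counts = Counter(view_pattern(s) for s in all_subjects)

print(len(complete_cases), "of", len(all_subjects), "subjects are complete cases")
for pattern, count in sorted(pattern_counts.items()):
    print(count, "subject(s) observed in:", ", ".join(pattern))
```

Tabulating the patterns, rather than only the complete cases, makes the view-wise nature of the missingness explicit: whole views are absent for a subject, not scattered entries.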

3 Statistical modelling challenges

Integrating multi-view omics data poses many statistical modelling challenges and has become a ripe area for statistical research. If one's goal is to use multi-view data to predict an outcome for each of the observations or subjects, then there are several readily available approaches. First, some existing machine learning methods such as random forests or deep learning are adept at handling data of mixed types; these methods, however, are not ideally suited to high-dimensional data and hence, may not be the best for multi-view omics data. Second, one could build a completely independent predictive model for each data view and then use ensemble learning techniques to combine the predictions from each data view. And finally, one could use feature learning techniques such as feature selection, dimension reduction or pattern recognition techniques to learn features for each data view separately; then a joint predictive model can be fit to the learned features from all the data views. Thus for prediction with multi-view data, several possible approaches exist, although there is certainly room for further research and methods development in this area.
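The second strategy, an independent model per view followed by an ensemble step, can be sketched as follows. The choice of a nearest-centroid classifier per view and a majority vote across views is our assumption for illustration, as are all function names and toy values; any per-view learner and combination rule could be substituted.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_view(X, y):
    """Fit a nearest-centroid classifier on one view; return the class centroids."""
    return {label: centroid([x for x, lab in zip(X, y) if lab == label])
            for label in sorted(set(y))}


def vote_view(model, x):
    """Predict the label whose centroid is closest to x in this view."""
    return min(model, key=lambda label: sq_dist(x, model[label]))


def predict_late_fusion(models, x_by_view):
    """Combine the per-view predictions by majority vote."""
    votes = [vote_view(models[view], x_by_view[view]) for view in models]
    return max(sorted(set(votes)), key=votes.count)


# Toy training data: three views measured on the same four subjects.
views_train = {
    "expression": [[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [4.2, 5.1]],
    "methylation": [[0.1], [0.2], [0.8], [0.9]],
    "copy_number": [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [1.0, 0.9, 1.1], [0.9, 1.0, 1.0]],
}
y = [0, 0, 1, 1]

# One independent model per view, fit on that view's features only.
models = {view: fit_view(X, y) for view, X in views_train.items()}

new_subject = {"expression": [4.1, 4.9], "methylation": [0.3], "copy_number": [1.0, 1.0, 1.0]}
prediction = predict_late_fusion(models, new_subject)
print(prediction)
```

Because each view is modelled separately, this approach sidesteps the mixed-type problem, but it cannot exploit dependencies between features in different views.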

Beyond prediction, one's goal is typically to explore the data to make ‘data-driven discoveries’ which generate new hypotheses from data. Techniques for this purpose include exploratory data visualization, dimension reduction, pattern recognition, clustering, feature selection and network structure learning, among many others. With multi-view data and especially mixed multi-view data, existing techniques for data-driven discovery cannot be applied in a straightforward manner, and new statistical approaches need to be developed. Recently, there has been a flurry of data integration techniques proposed for dimension reduction based on canonical correlation analysis (Rossouw et al., 2008; Witten and Tibshirani, 2009), coupled matrix factorizations (Acar et al., 2011; Lock et al., 2013), multi-step principal components analysis (PCA) methods (Di et al., 2009), and the generalized singular value decomposition (SVD) (Van Loan, 1976; Alter and Golub, 2004). These methods offer important advances for multi-view data, but they typically assume that all data views consist of the same types of variables (e.g., all continuous data). Hence, they are not ideally suited for mixed multi-view data. One could use these integrated dimension reduction methods in a latent variable or hierarchical model to capture different types of variables in each data view, but such models may not capture a full range of dependencies and can be computationally more demanding. Another area of recent success in data integration methodology is integrative clustering (Shen et al., 2009; Lock and Dunson, 2013). As with the dimension reduction methods, these techniques typically use latent variable or hierarchical models to capture mixed types of data and hence are subject to some of the same caveats. An open area of research is to develop clustering or dimension reduction techniques that can more directly model mixed multi-view data.

3.1 Integration via mixed graphical models

One example where new statistical methods have recently been developed that directly model mixed multi-view data is that of graphical models. Graphical models, when applied to bioinformatics data, typically treat each gene, miRNA, CpG site or other biomarker as a node in a network; they then seek to model and estimate relationships between different biomarkers and represent these as a network in which an edge between two biomarkers denotes a form of dependence between them. Recently, Yang et al. (2012, 2015) proposed to build graphical models by assuming that every variable conditional on all others arises from a univariate exponential family distribution. This leads to a joint graphical model distribution that is suitable for data from a variety of domains (e.g., Poisson or negative binomial graphical models for count-valued data such as from next generation sequencing) and that greatly extends the class of graphical models beyond the typical examples of Ising or Gaussian graphical models, which are special cases. To yield graphical models that are appropriate for mixed multi-view data, Yang et al. (2014a) and Chen et al. (2015) proposed to build graphical models by assuming all conditional distributions arise from potentially different exponential families. While this idea is appealing, Yang et al. (2014a) and Chen et al. (2015) also show that the types of dependencies permitted between variables of different types are severely limited, making this model impractical. In another line of work, Lauritzen and Wermuth (1989) and Lauritzen (1996) proposed chain graphical models consisting of a Gaussian graphical model conditional on a discrete (Ising) model; Lee and Hastie (2015) and Cheng et al. (2016) later considered structural graph estimation in the high-dimensional case. These models, however, are only appropriate for integrating continuous and discrete-valued variables and thus are not suitable for count-valued data such as those from next generation sequencing.
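To make the exponential family construction concrete, its pairwise form can be sketched as follows, in notation that we introduce here for illustration: $B$ denotes the sufficient statistic, $C$ the base measure, $N(s)$ the neighbourhood of node $s$ in a graph with node set $V$ and edge set $E$, and $D$ and $A$ the respective log-normalizing terms. Each node-conditional distribution is a univariate exponential family whose natural parameter is linear in the sufficient statistics of the neighbouring nodes:

$$P(X_s \mid X_{V \setminus s}) = \exp\Big\{ B(X_s)\Big(\theta_s + \sum_{t \in N(s)} \theta_{st} B(X_t)\Big) + C(X_s) - D(X_{V \setminus s}; \theta) \Big\},$$

and the corresponding joint distribution takes the form

$$P(X) = \exp\Big\{ \sum_{s \in V} \theta_s B(X_s) + \sum_{(s,t) \in E} \theta_{st} B(X_s) B(X_t) + \sum_{s \in V} C(X_s) - A(\theta) \Big\}.$$

For instance, $B(x) = x$ with $C(x) = -\log(x!)$ on the non-negative integers gives a Poisson graphical model, and $B(x) = x$ with $C(x) = 0$ on binary data gives the Ising model. The joint distribution is valid only when $A(\theta)$ is finite, which can restrict the admissible parameters for some families.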

Most recently, Yang et al. (2014b) proposed to combine the concept of chain graphical models and graphical models via exponential families to yield mixed chain graphical models. These models assume that groups of variables form a chain graph and that all relevant conditional distributions arise from potentially different exponential families. Interestingly, by conditioning on other groups of variables through chain graphs, Yang et al. (2014b) show that this class of models permits a wide and flexible range of dependencies between variables of different types. Furthermore, the chaining of groups of variables is a particularly relevant assumption for omics data where, for example, we know that mutations influence gene expression but gene expression does not influence mutations. Hence, we could assume that mutations point to gene expression variables in the mixed chain graphical model. As Figure 1 in Morris and Baladandayuthapani nicely illustrates, this chaining or directionality assumption is known from the underlying biology for integrative analyses of biomedical data; hence, mixed chain graphical models could be a particularly relevant tool for modelling mixed multi-view omics data. Related to these models, however, there is still much room for further research to yield a practical tool that can be applied to large-scale integrative analyses. Some examples of open areas include developing methods to better fit the models and learn the graph structure in high-dimensional settings, methods to test the model's parametric assumptions or even use semi-parametric approaches as in Yang et al. (2014c), and finally methods to assess the model fit or model uncertainty. Overall, mixed chain graphical models yield an exciting approach to directly integrating mixed multi-view data that could be used to make many discoveries about how biomarkers of different types are related.
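As a two-block sketch of this chaining idea, let $x$ denote binary mutation indicators and $y$ count-valued expression measurements; the notation and the specific linear dependence on $x$ below are our illustrative assumptions and not necessarily the exact parameterization of Yang et al. (2014b). The chain graph assumption corresponds to the factorization

$$P(x, y) = P(x)\, P(y \mid x),$$

where $P(x)$ could be an Ising model and each node-conditional distribution of $y$ given $x$ could be an exponential family whose natural parameter depends on both the other expression variables and the mutations:

$$P(y_s \mid y_{\setminus s}, x) = \exp\Big\{ B(y_s)\Big(\theta_s + \sum_{t \neq s} \theta_{st} B(y_t) + \sum_{j} \gamma_{sj} x_j\Big) + C(y_s) - D(y_{\setminus s}, x; \theta, \gamma) \Big\}.$$

Here a nonzero $\gamma_{sj}$ represents a directed edge from mutation $j$ to expression variable $s$, while a nonzero $\theta_{st}$ represents an undirected edge within the expression block.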

4 Discussion

In summary, mixed multi-view data found in bioinformatics offers a host of opportunities for new statistical research. The specific challenges that we have outlined with this data are likely to yield a whole new sub-field of high-dimensional statistics that will spur a flurry of research over the next decade. As new statistical techniques that allow scientists to explore their data holistically are developed, statisticians are poised to lead the way with data-driven scientific discoveries.

Acknowledgments

The author acknowledges support from NSF DMS-1554821 and NSF DMS-1264058.

References

Acar, Kolda and Dunlavy (2011) All-at-once optimization for coupled matrix and tensor factorizations. arXiv preprint arXiv:1105.3422.

Alter and Golub (2004) Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription. Proceedings of the National Academy of Sciences of the United States of America, 101, 16577–82.

Chen, Witten and Shojaie (2015) Selection and estimation for mixed graphical models. Biometrika, 102, 47.

Cheng, Levina and Zhu (2016) High-dimensional mixed graphical models. Journal of Computational and Graphical Statistics (to appear).

Di C-Z, Crainiceanu, Caffo and Punjabi (2009) Multilevel functional principal component analysis. The Annals of Applied Statistics, 3, 458.

Lauritzen (1996) Graphical Models, volume 17. New York: Clarendon Press.

Lauritzen and Wermuth (1989) Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics, 17, 31–57.

Lee and Hastie (2015) Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24, 230–53.

Leek, Scharpf, Bravo, Simcha, Langmead, Johnson, Geman, Baggerly and Irizarry (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11, 733–39.

Lock and Dunson (2013) Bayesian consensus clustering. Bioinformatics, btt425.

Lock, Hoadley, Marron and Nobel (2013) Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7, 523.

TCGA Research Network (2011) Integrated genomic analyses of ovarian carcinoma. Nature, 474, 609–615.

TCGA Research Network (2017) The cancer genome atlas. URL http://cancergenome.nih.gov/ (last accessed 21 April 2017).

Rossouw, Robert-Granié, Besse et al. (2008) A sparse PLS for variable selection when integrating omics data. Genetics and Molecular Biology, 7, 35.

Shen, Olshen and Ladanyi (2009) Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25, 2906–12.

Van Loan (1976) Generalizing the singular value decomposition. SIAM Journal on Numerical Analysis, 13, 76–83.

Wan Y-W, Allen and Liu (2015) TCGA2STAT: Simple TCGA data access for integrated statistical analysis in R. Bioinformatics, btv677.

Witten and Tibshirani (2009) Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8, 1–27.

Yang, Allen, Liu and Ravikumar (2012) Graphical models via generalized linear models. Advances in Neural Information Processing Systems, 25, 1358–66.

Yang, Baker, Ravikumar, Allen and Liu (2014a) Mixed graphical models via exponential families. In International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP, 33, 1042–50.

Yang, Ravikumar, Allen, Baker, Wan Y-W and Liu (2014b) A general framework for mixed graphical models. arXiv preprint arXiv:1411.0288.

Yang, Ravikumar, Allen and Liu (2015) Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16, 3813–47.

Yang, Ning and Liu (2014c) On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697.