Abstract
Large-scale cohorts with combined genetic and phenotypic data, coupled with methodological advances, have produced increasingly accurate genetic predictors of complex human phenotypes called polygenic risk scores (PRSs). In addition to the potential translational impacts of identifying at-risk individuals, PRS are being utilized for a growing list of scientific applications, including causal inference, identifying pleiotropy and genetic correlation, and powerful gene-based and mixed-model association tests. Existing PRS approaches rely on external large-scale genetic cohorts that have also measured the phenotype of interest. They further require matching on ancestry and genotyping platform or imputation quality. In this work, we present a novel reference-free method to produce a PRS that does not rely on an external cohort. We show that naive implementations of reference-free PRS either result in substantial overfitting or prohibitive increases in computational time. We show that our algorithm avoids both of these issues and can produce informative in-sample PRSs over a single cohort without overfitting. We then demonstrate several novel applications of reference-free PRSs, including detection of pleiotropy across 246 metabolic traits and efficient mixed-model association testing.
Introduction
Individual genetic polymorphisms typically explain only a small proportion of the heritability, even for traits that are highly heritable (Nolte et al., 2017). Polygenic risk scores (PRSs) aggregate the contributions of multiple genetic variants to a phenotype (Torkamani et al., 2018). These scores can be calculated using routinely recorded genotypes (Nolte et al., 2017; Torkamani et al., 2018), are strongly associated with heritable traits (Nolte et al., 2017), and are independent of environmental exposures or other factors that are uncorrelated with germ line genetic variants. These properties have motivated a rapidly expanding list of applications from basic science (e.g., causal inference and Mendelian randomization (Burgess and Thompson 2013), hierarchical disease models (Cortes et al., 2017), and identification of pleiotropy (Krapohl et al., 2016) to translation (e.g., estimating disease risk (Maas et al., 2016; Khera et al., 2018), identifying patients who are likely to respond well to a particular therapy (Natarajan et al., 2017), or flagging subjects for modified screening (Seibert et al., 2018).
PRSs are calculated as a weighted sum of genotypes. In some applications, all genotyped single-nucleotide polymorphisms (SNPs) may be used, but often only a small set is given nonzero weight. A subset of SNPs selected to contribute to a PRS may be a genome-spanning-but-uncorrelated LD (Linkage Disequilibrium)-pruned set or a set of SNPs with independent evidence of association with the phenotype of interest. Gene-specific PRSs are also generated using selected sets of SNPs within a region of the genome, such as a window around the coding region of a particular gene (Gamazon et al., 2015; Gusev et al., 2016). The weights on the SNPs included in a polygenic score are often derived from the marginal regression coefficients of an external genome-wide association study (GWAS) (Wray et al., 2007; Dudbridge, 2016), but they may instead be based on predictive models using all SNPs. Joint predictive models include linear mixed models (LMMs) and their sparse extensions (Yang et al., 2011; Zhou et al., 2013; Vilhjálmsson et al., 2015) and other regularized regression models such as the lasso or elastic net (Rakitsch et al., 2012; Warren et al., 2014; Gamazon et al., 2015; Gusev et al., 2016). The predictions from these joint analyses using genome-wide variation are also approximated by postprocessing of GWAS summary statistics (Vilhjálmsson et al., 2015; Gusev et al., 2016).
For these SNP weights to accurately reflect the SNPs' joint association with the phenotype and to generate informative and interpretable PRSs, the reference data set must match the target data set in many ways: the populations must have similar ancestry; the trait of interest must be measured and in a similar way; and identical genotypes must be assayed or imputed. Furthermore, the reference data must be large enough to accurately learn the PRS weights.
An alternative approach is to use the particular data set under consideration to build a reference-free PRS. This eliminates the need for an external reference data set with matched genotypes, phenotypes, and populations. However, as we show below, naive approaches can easily overfit genetic effects. This overfitting results in PRSs correlated with nongenetic components of phenotype, which will induce bias or other errors in downstream applications. Cross-validation is one established approach to mitigate overfitting, which in this context involves holding out and computing a polygenic score for each sample in turn. The main hurdle to this approach is computation time, as standard leave-one-out (LOO) cross-validation requires fitting the PRS model N times in a sample with N individuals.
Here we report an efficient method to generate PRSs using the out-of-sample predictions from a cross-validated LMM. Our approach generates LOO PRSs, which we call cvBLUPs (cross-validated Best Linear Unbiased Predictors), with computational complexity linear in sample size after a single LMM fit. In addition to eliminating the reliance on external data and guaranteeing the PRSs are generated from a relevant population and phenotype, we describe several applications that are only feasible with cvBLUPs. We first demonstrate several desirable statistical properties of cvBLUPs and then consider applications, including evidence of polygenicity across metabolic phenotypes, a novel formulation of mixed model association studies, and selection of relevant principal components for control of confounding by population structure. To facilitate their use, we have incorporated the calculation of cvBLUPs in the genetic analysis program genome-wide complex trait analysis (GCTA) (Yang et al., 2011).
Methods
We consider the continuous phenotype y measured on N individuals, which depends on an N-by-M matrix of additively coded genotypes G, other covariates X, and random noise
For each subject i, the polygenic score PS
where S is the set of SNPs in the polygenic model,
Our objective is to produce an LOO cross-validated polygenic score (PRS) for each subject. We generate our PRS as a genetic prediction from an LMM. LMMs are widely used for genetic prediction (Robinson, 1991), heritability estimation (Kang et al., 2008, 2010; Yang et al., 2010; Lippert et al., 2011; Zhou and Stephens, 2012), and other polygenic analyses (Kang et al., 2008; Lippert et al., 2011; Lee et al., 2012; Zhou and Stephens, 2012). The LMM (Equation 3) jointly models the contributions of all SNPs Z and other covariates X to the phenotype y. Following others (Yang et al., 2011), we define Z by centering and scaling columns of G to have mean 0 and variance 1.
The key LMM parameters are the genetic variance
These variance estimates are then used to estimate b, the genetic effect sizes (i.e., weights), or Zb, the genetic predictions (i.e., BLUPs):
where K is the genetic relatedness matrix computed from the centered and scaled genotypes Z by
While the BLUPs in
We avoid this penalty by leveraging the fact that for linear models, where fitted values are a linear transformation of phenotypes,
In more detail, the LOO prediction errors are the differences between the LOO genetic predictions and the observed residual phenotypes after subtracting fixed effects,
where
We can rearrange these expressions to calculate the LOO predictions, or cvBLUPs, given the standard BLUPs, the phenotype residuals
Because all of these elements are computed when fitting an LMM, cvBLUPs can be produced with no additional computational complexity.
These cvBLUPs are cross-validated genetic predictions or PRSs calculated without using an external reference data set to identify relevant SNPs or to set their weights in the PRS model. Additional or external observations may be assigned PRSs using the data set where the cvBLUPs are calculated as a reference, using the standard BLUP effect size estimates in Equation 4 as SNP weights.
Empirical confirmation of cross-validated predictions
To examine the properties of the proposed cvBLUP formulation, we conducted a set of simulations. We generated 1000 data sets with
Figure 1 shows scatter plots for one simulation of nongenetic (

Independence of cvBLUPs and nongenetic factors. Correlations of genetic predictions, BLUPs and cvBLUPs, true genetic factors Zb, and independent environmental factors
Table 1 shows the mean correlations of true simulated values with standard BLUPs and cvBLUPs. Again, standard BLUPs are clearly correlated with the noise term ɛ due to overfitting, but cvBLUPs are appropriately uncorrelated with
Mean Correlations (and Standard Errors) of BLUPs and cvBLUPs with Each Component of the Additive Simulation Model,
Statistically significant correlations (
BLUPs, Best Linear Unbiased Predictors; cvBLUPs, cross-validated Best Linear Unbiased Predictors; SNP, single-nucleotide polymorphism.
This type of correlation causes downstream problems for residual analyses, predictions, and causal inference. Importantly, the cvBLUPs are independent of all unmodeled effects as desired.
We next applied cvBLUP in an analysis of Finnish men from the metabolic syndrome in men (METSIM) cohort (Laakso et al., 2017). This cohort comprised 10,197 men aged 45 to 73 at recruitment between 2005 and 2010 in Kuopio, Finland. Blood serum samples were collected from each participant, and 228 metabolites in the samples were quantified by nuclear magnetic resonance spectroscopy (NMR). In addition to the metabolites, biometric traits, including height and weight, and epidemiological traits such as diagnoses or family history of diabetes and coronary heart disease (CHD) were recorded for a total of 248 phenotypes. Continuous phenotypes were quantile normalized. All samples were genotyped at 665,478 SNPs on the Illumina OmniExpress chip. After removing subjects with missing rates above 5% and SNPs with missing rates above 5%, 10,070 subjects and 609,131 SNPs remain.
We initially consider genetic predictions of the metabolic, biometric, and epidemiological traits in an unrelated subset of subjects (with genetic relatedness less than 0.05). Since the metabolic traits are expected to be affected by statins and by pharmaceutical interventions for diabetes, we exclude subjects with diabetes or who use statins from the initial analysis and calculation of cvBLUPs. There are no comparable data sets with the set of metabolic measurements available in the METSIM cohort, but cvBLUPs allow computationally efficient genetic predictions of all 246 phenotypes (excluding diabetes status and statin use). With the genetic prediction models learned in the restricted set of subjects, we extended predictions to the excluded subjects with standard BLUP effect-size estimates (Equation 4). Thus, cvBLUPs allow analyses of reference-free genetic predictions in a subset of subjects that is restricted to avoid confounding by known environmental exposures (statins and responses to diagnoses of diabetes) or by family structure; and these genetic predictions may then be extended to subjects who are initially excluded to avoid confounding.
We first estimated

Cross-trait correlations with cvBLUPs. Correlations of phenotypes (rows) and genetic predictions (cvBLUPs, columns) across 246 phenotypes. Many cvBLUPs are strongly correlated with additional phenotypes.
Note that the cvBLUPs are calculated from LMMs that include fixed effects (age, age-squared, and 10 genetic principal components [PCs]). The cvBLUPs are actually genetic predictions for the residual phenotype after projecting out the fixed-effect contributions, and so, there may be cases where the true genetic effect is negatively correlated with the phenotype because of fixed-effect contributions that are negatively correlated with the genetic contributions to the trait. Focusing on the 198 traits with significant heritabilities, the correlations between cvBLUP and phenotype have mean of 0.078 and standard deviation of 0.027. Of the 246 phenotypes, 136 have p-values for significance tests or correlation between cvBLUPs and phenotypes below the Bonferroni-corrected value of 0.05/246 = 0.000203, and 172 correlations are significant at false discovery rate (FDR) = 0.05 (Benjamini and Yekutieli, 2001).
The off-diagonal blue patches in the figure represent cvBLUPs that are positively correlated and predictive of different phenotypes, while red patches represent cvBLUPs that are predictive but negatively correlated with different phenotypes. The off-diagonal correlations show the widespread pleiotropy of genetic effects on metabolism with over 16,203 off-diagonal cvBLUP-phenotype associations at FDR = 0.05 (Benjamini and Yekutieli, 2001). Many cvBLUP-trait correlations are sign-consistent with the respective trait/trait correlations. For example, high-density lipoprotein (HDL) and low-density lipoprotein (LDL) cholesterol are well known be negatively correlated (Terry et al., 1989), and our results demonstrate this is partially due to negative genetic pleiotropy. That is, we observe that SNPs associated with increased LDL are also associated with decreased HDL in aggregate (and vice versa).
In addition to use in detection of pleiotropy, polygenic modeling is a widely used tool in association testing (Kang et al., 2008; Lippert et al., 2011; Zhou and Stephens, 2012; Yang et al., 2014; Loh et al., 2015), and we therefore consider cvBLUPs in this context. We compare the relative performance of five groups of methods. First, unadjusted regression; second, principal component-adjusted regression; third, standard LMM association tests; fourth, LMM residual-based methods; and fifth, cvBLUP-adjusted regression. Standard LMM-based methods use association tests where the covariance of the observations based on the genetic relatedness of the subjects is estimated and used to calculate effect estimates and test statistics by generalized least squares (Kang et al., 2008; Lippert et al., 2011; Zhou and Stephens, 2012; Yang et al., 2014). The LMM residual-based methods perform association tests on the BLUP residualized phenotypes and possibly genotypes. Then, to try and account for the bias inherent in using the overfit BLUPs, they perform an adjustment step on the resulting test statistics and effect sizes. These methods were pioneered by GRAMMAR/GRAMMAR-γ and include the recent BOLT and BOLT-INF methods (Aulchenko et al., 2007; Svishcheva et al., 2012; Loh et al., 2015).
The typical sources of confounding for associations with germ line genetic markers are population structure, family structure, and batch effects in the data collection (Listgarten et al., 2010). Genetic principal components as adjustment covariates may suffice to control for confounding by population structure or batch effects, but LMMs are often more effective at controlling these sources of confounding (Kang et al., 2010), while also helping to control confounding by family structure and boosting power to detect true associations over standard fixed-effect regression models (Yang et al., 2014). These benefits of LMMs come at the expense of an increased computational burden over standard linear regression.
In GWAS, the statistical significance of each variant gj is tested individually. Here, the SNP is jointly analyzed with covariates X and the contributions of unmodeled variants contribute to a larger error term
Often a better estimate of the effect size for a particular SNP gj may be made by accounting for the contributions of the other variants to the phenotype y, and by blocking the effects of confounders of the associations of genotypes and the phenotype—by adjustment with an appropriate set of fixed-effect covariates or other means (Zaitlen et al., 2012a, 2012b; Yang et al., 2014).
Here we demonstrate the use of cvBLUPs as adjustment covariates in a linear regression model that efficiently captures some of the benefits of a standard mixed model association study. To compare the performance of cvBLUP-adjusted analyses to existing methods for association testing under a range of study scenarios, simulations were used. Methods compared were unadjusted linear regression, PC-adjusted linear regression, a standard LMM association test (GCTA), and BOLT association tests. BOLT results were collected both for the infinitesimal genetic model (BOLT-INF) and the sparse causal genetic model (BOLT-LMM). Association tests conducted with cvBLUP adjustment, GCTA, and BOLT were done with leave-one-chromosome-out schemes, in which the variance components, cvBLUPs, and phenotypic predictions and residuals (BOLT) were calculated using only SNPs that are on different chromosomes than the test-SNPs.
In each simulation, data sets were generated with
In Table 2, the results of analyzing data sets with independent subjects and an infinitesimal genetic architecture are shown. All methods produced unbiased effect estimates and well-calibrated tests under the null (
Simulations with Infinitesimal Genetic Model, Without Population or Family Structure
Mean and standard error for effect estimates and test statistics of association tests at 1000 null SNPs with true
LMM, linear mixed model; PC, genetic principal component; BLUP, Best Linear Unbiased Predictor; cvBLUP, cross-validated Best Linear Unbiased Predictor.
In Table 3, the results of analyzing data sets with confounding by population structure are shown. Subjects were drawn from five distinct populations with pairwise Fst set to 0.03. Population-specific allele frequencies were generated by the Balding–Nickels model (Balding and Nichols, 1995). The component
Simulations with Infinitesimal Genetic Model and Population Structure
Mean and standard error for effect estimates and test statistics of association tests at 1000 null SNPs with true
cvBLUPs correct for population structure because they are weighted combinations of ALL principal components, where weights are based on the singular value corresponding to the principal component and on the strength of the association of the principal component with the outcome (see Supplementary Data). Conceptually, cvBLUPs control for population structure as if all PCs were considered and the most relevant ones for the analysis were kept.
In Table 4, the results of analyzing data sets with confounding by family structure are shown. Here the 2000 subjects in each simulation represented 200 families with 10 subjects each. Families were generated in pedigrees with four founders and six of their descendants, with descendants' genotypes selected independently by drop-down from their parents. In the data-generating model, there were covariate effects correlated with family membership,
Simulations with Infinitesimal Model and Family Structure
Mean and standard error for effect estimates and test statistics of association tests at 1000 null SNPs with true
Tables 5, 6, and 7 are analogous to Tables 2, 3, and 4, showing the results of analyses on simulated groups of independent subjects, data sets with population structure, and data with family structure respectively, but the traits analyzed in Tables 5–7 were generated with sparse genetic architectures while those in Tables 2–4 were generated from the infinitesimal model. In Table 5 we see that linear regression with cvBLUP adjustment in a sparse model gives the same effect size estimates and power as the standard linear mixed model (LMM) GTCA.
Simulations with Sparse Genetic Model and Independent Subjects
Mean and standard error for effect estimates and test statistics of association tests at 1000 null SNPs with true
Simulations with Sparse Genetic Architecture and Population Structure
Mean and standard error for effect estimates and test statistics of association tests at 1000 null SNPs with true
Simulations with Sparse Genetic Architecture and Family Structure
Mean and standard error for effect estimates and test statistics of association tests at 1000 null SNPs with true
Finally, we applied these methods to the METSIM data described above. Tables 8 and 9 show results of GWASs for low density lipoprotein cholesterol (LDLc) and high density lipoprotein cholesterol (HDLc) in the METSIM cohort using cvBLUP-adjusted linear regression and comparison methods as in the simulations above. All p-values were GC adjusted for comparison purposes. All mixed-model-based methods, including LR+cvBLUP, were more powerful than standard linear regression. As expected, BOLT-LMM had the highest power due to modeling of noninfinitesimal structure. In this data analysis, unlike in the simulations above, there is no observed evidence of deflation or bias in effect size estimates by BOLT relative to estimates made by standard linear regression or standard LMM effect size estimates (Tables 8 and 9), suggesting that there is not a problem with deflation of effect size estimates in practice.
Genome-Wide Association Study Results for Baseline Low-Density Lipoprotein Cholesterol Using an Unrelated Subset of Subjects from the Metabolic Syndrome in Men Cohort
LMM with GLS analysis of SNP effects implemented in GCTA; cvBLUP, cross-validated prediction-adjusted linear regression; BOLT-INF; BOLT assuming infinitesimal genetic model; BOLT-LMM, BOLT using mixture of normal distributions as prior for SNP effect sizes, that is, sparse genetic architecture. cvBLUP-adjusted analyses, LMM, and BOLT were used in a leave-one-chromosome-out scheme with variance components, cvBLUPs, covariance models (LMM, GCTA), and genetic predictions and residuals (BOLT) generated using SNPs on chromosomes other than that of the test-SNPs.
GCTA, genome-wide complex trait analysis; GWAS, genome-wide association studies; LR, linear regression; METSIM, metabolic syndrome in men.
Genome-Wide Association Study Results for Baseline High-Density Lipoprotein Cholesterol Using an Unrelated Subset of Subjects from the Metabolic Syndrome in Men Cohort
LMM with GLS analysis of SNP effects implemented in GCTA, cvBLUP: cross-validated prediction-adjusted LR, BOLT-INF; BOLT assuming infinitesimal genetic model, BOLT-LMM, BOLT using sparse genetic architecture. cvBLUP-adjusted analyses, LMM, and BOLT were used in a leave-one-chromosome-out scheme with variance components, cvBLUPs, covariance models (LMM, GCTA), and genetic predictions and residuals (BOLT) generated using SNPs on chromosomes other than that of the test-SNPs.
In the Supplementary Material we further characterize the cvBLUPS, showing the “polygenic shrink” relationship between the variance of the cvBLUPs and the trait heritability (Supplementary Tables S1 and S2), and the correlations between cvBLUPs and polygenic risk scores (PRSs) based on external reference data sets (Supplementary Tables S3 and S4).
Here we describe a new and computationally efficient approach for generating PRSs directly from an LMM. We show that the LMM framework allows direct calculation of out-of-sample genetic predictions. Our approach will have immediate utility for the growing list of applications that rely on PRS, and we provide examples of several additional application areas, including detection of pleiotropy, powerful association testing, and estimation of polygenic shrink.
The elimination of overfitting by cvBLUPs relative to BLUPs suggests a solution to the bias problem in residual-based methods. Rather than post hoc correction of residual test statistics as in GRAMMAR-
There are several limitations to this approach. First, cvBLUPs are calculated from the standard LMM framework, which corresponds to the infinitesimal genetic model. As we see in Tables 5–7, in simulations with sparse genetic architectures, sparse models have considerably higher power. In future work, we intend to use sparse analogs of cvBLUPs generated as cross-validated predictions. Unfortunately, sparse models, including the BOLT-LMM model and LASSO, are not amenable to the fast LOO cross-validation as in Equation (7).
Another limitation is that when used as adjustment covariates, cvBLUPs do not control confounding by family structure or cryptic relatedness as well as standard mixed-model association tests. Rather than using all subjects for computation of cvBLUPs, an alternative protocol is to calculate cvBLUPs for an unrelated subset, and also to calculate BLUP effect size estimates
In our cross-trait analysis of the METSIM data set, we show the cross-correlations of PRSs for one trait and actual (normalized) phenotypic measures for other traits. We are working to extend these cross-trait analyses—in particular by using correlations of cvBLUPs for pairs of traits as estimates of the genetic correlation. However, even naive correlations of cvBLUPs give an effective picture of the genetic correlations between traits. Since cvBLUPs are efficiently calculated one at a time, and genetic correlations are then estimated in trivial pairwise analyses of traits, the pairwise correlations of hundreds or thousands of traits may be efficiently calculated this way. Furthermore, in the pairwise analyses of the METSIM phenotypes, and in our pairwise analyses of RNA expression that underlie our trans-eQTL analysis (Liu et al., 2018), we actually generate asymmetric cross-correlation matrices because the correlation of cvBLUP for trait A with measured trait B is not the same as the correlation of measured trait A with cvBLUP for trait B. We are exploring applications of these asymmetric matrices for network analysis and causal inference.
Efficient generation of out-of-sample genetic predictions using LOO cross-validation of the predictions from an LMM is an effective way to generate PRSs, and opens the application of analyses based on PRS to scenarios where there is no available reference data to generate a typical scoring model. It is now well known that PRSs and genetic predictions transfer poorly to populations that are distinct from the reference data set used to learn the genetic model (Scutari et al., 2016; Martin et al., 2017). We look forward to using reference-free PRS methods based on cvBLUPs for applications with data from underrepresented populations.
The principle of using cross-validated predictions from polygenic models as PRSs may be extended to predictions from sparse or complex models, but the cross-validated predictions from a standard LMM, which we call cvBLUPs, are particularly simple to calculate and have novel and interpretable applications. To make the results of this work accessible to the community, we have implemented them in the GCTA software package (Yang et al., 2011).
Footnotes
Acknowledgments
We thank the individuals who participated in the METSIM study.
Author Disclosure Statement
The authors declare they have no competing financial interests.
Funding Information
N.Z., J.M., and D.P. were funded by NIH grants (K25HL121295, U01HG009080, and R01HG006399). This study was funded by the National Institutes of Health (NIH) grants HL-095056, HL-28481, and U01 DK105561. The funders had no role in study design, data collection, and analysis, decision to publish, or preparation of the article. MAK works in a unit that is supported by the University of Bristol and U.K. Medical Research Council (
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
