This dissertation is submitted for the degree of Doctor of Philosophy

Genetic association of high-dimensional traits

Hannah Verena Meyer

September 2017

University of Cambridge
Jesus College
European Bioinformatics Institute

The source code of the thesis is available at https://github.com/HannahVMeyer/thesis.

Acknowledgements

I am immensely thankful to all the people who have supported me throughout my PhD. First and foremost, a big thank you to my supervisor Ewan Birney who gave me the opportunity to work in his research group. Ewan’s enthusiasm, ideas and support were invaluable in guiding me through the past four years. I would also like to thank the other members, current and past, of the Birney research group: Ian Dunham, Sander Timmer, Sandro Morganella, Valentina Iotchkova (thank you for all your advice on the statistics), Helena Kilpinen, Leland Taylor, Nils Kölling, Tom Fitzgerald, Carl Barton, Anat Melamed and the other Hannah (Currant). I really valued the helpful discussions and feedback and enjoyed our times in the Northumbrian wilderness. A special thanks to Tracey Andrews, Stacy Knoop, Debbie Howe and Christina Karikidis for their help with all the administrative tasks and for finding time in Ewan’s busy schedule to meet with me.

My PhD project would not have been possible without my collaboration partners. Paolo Casale and Oliver Stegle supported me with their experience and constructive advice in the method development part of this thesis. Konrad Rudolph helped with the development of the R package. The heart project was based on a close collaboration with Stuart Cook’s research group in London, in particular with Antonio De Marvao and Declan O’Regan. My thesis advisory committee with John Marioni, Carl Anderson and Jan Korbel was very helpful with their constructive criticism and suggestions, keeping me and Ewan on track and helping with the timely submission of this thesis.
I would also like to thank Sylvia Richardson and Nicholas Timpson for their time examining my thesis and for the interesting discussions during my viva. Many thanks to the people who helped with proofreading of this thesis: the two Nils (Eling and Kölling), Jack Monahan, Hannah Currant, Carl Barton, Paolo Casale, Ewan Birney and David Pattinson (finding the biggest typo in a chapter heading on the day of submission).

Aside from the academic support, there are many people who deserve a special thank you for their moral support and friendship over the past years. My experience would not have been the same without the countless hours of early mornings on the Cam with Jesus College Boat Club, the exciting volleyball matches with the Cambridge University Volleyball Club and the great people I met in both teams. I would like to thank my fellow EMBL and Cambridge PhD students for the fun times during the predocs course, the many Cambridge formals and warm welcomes when coming back to Heidelberg, especially Nils, Jack, Hannah, Michael, Konrad, Silvia, Christina, Julia, Laura, Joran, Kostek, Chris, Paul and my (former) housemates Sarah, Maria, Nils and Dani. To my close friends from afar, Micha, Melanie, Jana, Lissy, Geli, Léonie and Meike (not quite so far): your support and encouragement, at whatever time and place was wunderbar. I would like to thank Dave, for the discussions and input on my research, the outdoors activities to distract from them, but mostly, for everything else.

Finally, I would like to thank my family who supported me in all my decisions, encouraged me in difficult times and shared my happiness in good ones: vielen Dank!

Declaration of Originality

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text.
It is not substantially the same as any that I have submitted, or, is being concurrently submitted for a degree or diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the Preface and specified in the text. I further state that no substantial part of my dissertation has already been submitted, or, is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the Preface and specified in the text.

This dissertation does not exceed the limit of 60,000 words as specified by the Biology Degree Committee.

Contents

List of Figures
List of Tables
List of Abbreviations
Summary
1. Introduction
1.1. From ancient ideas of inheritance to the birth of modern genetics
1.1.1. Mendelian Laws of Inheritance
1.1.2. Biometrics
1.1.3. Molecular basis of inheritance
1.2. The Laws of Inheritance on a cellular level
1.3. Genetic linkage
1.4. Towards quantitative genetics
1.5. Progress in deciphering the molecular mechanisms of inheritance
1.5.1. Novel genotype mapping techniques
1.5.2. Deciphering DNA sequences
1.6. From genetic linkage analysis to genome-wide association studies
1.6.1. Genotype-phenotype mapping until the 1990s
1.6.2. “Common disease–common variant” hypothesis
1.6.3. Databases of human variation
1.6.4. Genotyping of large cohorts
1.6.5. Genome-wide association studies
1.7. Linear models for genome-wide association studies
1.7.1. Linear regression
1.7.2. Simple linear model for genotype associations with a single trait
1.7.3. Testing the genotype association for significance
1.7.4. Correcting for multiple hypotheses testing in GWAS
1.7.5. Accounting for population structure and genetic kinship
1.7.6. Linear mixed models
1.7.7. Joint analysis of multiple phenotypes
1.7.8. Linear mixed models for the joint analysis of multiple phenotypes
2. Cardiac biology
2.1. Cardiac cycle
2.2. Conduction system
2.3. Heart development
2.4. Common cardiovascular diseases
2.5. Genetics of cardiovascular diseases
2.6. Thesis outline
3. PhenotypeSimulator
3.1. Genotype simulation
3.2. Phenotype simulation
3.2.1. Phenotype components
3.2.2. Scaling and phenotype construction
3.2.3. Case study
3.3. Conclusion
4. Extending linear mixed models to high-dimensional phenotypes
4.1. LiMMBo: Linear mixed modeling with bootstrapping
4.2. Covariance estimation via bootstrapping
4.3. Data simulation
4.4. Scalability of LiMMBo
4.5. LiMMBo yields covariance estimates consistent with REML estimates for moderate trait numbers
4.6. mtGWAS with LiMMBo-derived covariance matrices are well calibrated across all phenotype sizes
4.7. Multi-trait genotype to phenotype mapping increases power for high-dimensional phenotypes
4.8. LiMMBo for multi-trait GWAS and beyond
5. LiMMBo applied to multi-trait GWAS in Saccharomyces cerevisiae
5.1. Dataset and imputation
5.1.1. Missing data mechanism
5.1.2. Imputation via MICE
5.2. Multi-trait GWAS with LiMMBo
5.2.1. Estimating the genetic relationship in the yeast cross
5.2.2. LiMMBo increases power in detecting genetic associations
5.2.3. Multi-trait effect size estimates as indicators for common biology
5.3. Summary
6. Low-dimensional representations of very high-dimensional data
6.1. Review of dimensionality reduction methods
6.2. Visualisation of data structures by dimensionality reduction
6.3. Quantification of dimensionality reduction performance
6.4. Dimensionality reduction for feature extraction
6.4.1. Stability of dimensionality reduction
6.4.2. Stable features enable discovery of genetic associations
6.5. Dimensionality reduction is a powerful tool for genetic association studies
7. GWAS of left ventricular wall thickness
7.1. Data
7.1.1. Genotypes
7.1.2. Phenotypes
7.2. Dimensionality reduction yields stable low-dimensional phenotype representations
7.3. Multi-trait GWAS detects three loci associated with heart wall thickness
7.4. Successful imaging genetics of cardiac phenotypes
8. GWAS of left ventricular trabeculation
8.1. Left ventricular trabeculation
8.2. Image acquisition and phenotyping
8.3. The complexity of trabeculation shows a consistent base to apex pattern
8.4. Relationship between trabeculation phenotypes and covariates
8.5. Left ventricular trabeculation is associated with two genomic loci
8.6. Summary
9. Concluding remarks
Appendix
A. Supplementary tables
A.1. Additional information chapter 2
A.2. Additional results chapter 7
B. Supplementary Figures
B.1. Additional results chapter 4
B.2. Additional results chapter 5
B.3. Additional results chapter 6
B.4. Additional results chapter 7
B.5. Additional results chapter 8
C. Derivations
References

List of Figures

1.1. Genetics over time.
1.2. Distributions in LLR testing in GWAS.
2.1. Anatomy, circulatory and conductory system of the human heart.
2.2. Embryonic heart development.
2.3. GWAS on heart-related phenotypes.
3.1. Genetic relationship matrices and principal components of three simulated European ancestry cohorts.
3.2. Phenotype simulation scheme.
3.3. Phenotype simulation.
3.4. Comparison of multi-trait to single-trait GWAS.
3.5. Relationship between p-values, allele frequencies and simulated effect sizes.
4.1. Variance decomposition.
4.2. Scalability of LiMMBo compared to standard REML.
4.3. Comparison of trait-by-trait covariance estimates derived from standard REML and LiMMBo.
4.4. Calibration of mtGWAS based on covariance estimates from standard REML and LiMMBo.
4.5. Calibration of mtGWAS via a simple linear model and a linear mixed model.
4.6. Power comparison for mvLMM and uvLMMs of high-dimensional phenotypes.
5.1. Generation of yeast dataset.
5.2. Frequencies and distributions of missing values in the yeast phenotype data.
5.3. Correlations of observed phenotypes with missing data values.
5.4. Pairwise correlations of 46 growth traits in Saccharomyces cerevisiae.
5.5. Correlation between imputed and experimentally observed trait values.
5.6. Manhattan plot of p-values from single-trait and multi-trait GWAS.
5.7. Hierarchical clustering of mtGWAS effect size estimates.
6.1. Correlation of flowering phenotypes.
6.2. Visualisation of the Iris dataset in two dimensions.
6.3. Three-dimensional embedding of datapoints lying on a two-dimensional plane.
6.4. Visualisation of the roll dataset in two dimensions.
6.5. Quality of the dimensionality reduction in the Iris dataset.
6.6. Quality of the dimensionality reduction on the 2D manifold embedded in 3D.
6.7. Performance of dimensionality reduction techniques on simulated datasets.
6.8. Stability of dimensionality reduction techniques for different background noise models.
6.9. Stability of dimensionality reduction techniques for different genetic variant and observational noise models.
6.10. Genetic association of stable components from dimensionality reduction.
6.11. Effect size distribution of discovered SNPs.
7.1. Overview of SNP numbers after imputation and imputation quality control.
7.2. Cardiac phenotyping based on cardiac magnetic resonance images.
7.3. Phenotype reproducibility.
7.4. Distribution of covariates in 3D heart phenotype cohort.
7.5. Pair-wise scatterplots of low-dimensional components derived from left-ventricular wall thickness.
7.6. Dimensionality reduction of 3D heart phenotypes.
7.7. Correlation of low-dimensional components across methods.
7.8. Manhattan plot of the multi-trait GWAS on 3D heart phenotypes.
7.9. Quantile-quantile plot of the multi-trait GWAS on 3D heart phenotypes.
7.10. Genomic context of loci associated with 3D heart phenotypes.
7.11. Regulatory context of locus with strongest association.
7.12. Effect size estimates and trait correlation from the 3D heart GWAS.
7.13. Association of rs139971383 with left ventricular wall thickness.
8.1. FD phenotyping scheme.
8.2. FD measurements from base to apex.
8.3. Relationship between FD measures and covariates.
8.4. Manhattan plot of multi-trait GWAS on left ventricular trabeculation.
8.5. Quantile-quantile plot of multi-trait GWAS on left ventricular trabeculation.
8.6. Genomic context of loci associated with left ventricular trabeculation.
8.7. Effect estimates of associated FD SNPs with other cardiovascular phenotypes.
B.1. All parameter combinations of power comparison for multivariate and univariate LMMs of high-dimensional phenotypes.
B.2. Manhattan plot of traits with strong single-trait associations.
B.3. Additional scatterplots for visual assessment of low-dimensional components derived from left-ventricular wall thickness.
B.4. Number of DNA probes on the different genotyping chips and their overlap.
B.5. Genotyping quality control per sample.
B.6. Genotyping quality control per SNP.
B.7. Ethnicity of samples within the Digital Heart project.
B.8. Manhattan plots for GWAS on stable components from a single dimensionality reduction method.
B.9. Manhattan plot of two single-trait GWAS on left ventricular trabeculation.

List of Tables

4.1. Linear mixed model frameworks for genetic association studies.
4.2. Parameters for phenotype simulation.
4.3. Parameter values of simulated phenotypes for assessing scalability, calibration and power.
5.1. Comparison of loci detected in single-trait and multi-trait GWAS.
6.1. Dimensionality reduction methods.
6.2. R functions for dimensionality reduction methods and their parameters.
6.3. Simulation parameters of phenotypes used for stability estimation.
7.1. Sample and SNP numbers before and after the QC.
7.2. Strongest genotype-phenotype association per locus for 3D heart GWAS.
8.1. Association of FD basal max and FD apical max with covariates.
8.2. SNPs with strongest association in left ventricular trabeculation GWAS.
A.1. GWAS catalogue trait descriptions relating to cardiovascular diseases.
A.2. Number of SNPs after imputation, imputation QC and filtering for deviation from HWE and low MAF.

List of Abbreviations

BFGS algorithm: Broyden–Fletcher–Goldfarb–Shanno algorithm
CCA: canonical correlation analysis
CEU: Utah Residents (CEPH) with Northern and Western European Ancestry
CV: coefficient of variation
DRR: dimensionality reduction via regression
FD: fractal dimension
FDR: false discovery rate
FIN: Finnish in Finland
FWER: family-wise error rate
GBR: British in England and Scotland
GWAS: genome-wide association study
IBD: identical by descent
ICA: independent component analysis
kPCA: kernel principal component analysis
LD: linkage disequilibrium
LiMMBo: linear mixed model with bootstrapping
LLE: locally linear embedding
LLR: log-likelihood ratio
LMM: linear mixed models
LVM: left ventricular mass
LVNC: left ventricular non-compaction
MAF: minor allele frequency
MAR: missing at random
MCAR: missing completely at random
MDS: classical multi-dimensional scaling
MICE: multiple imputation by chained equations
MLE: maximum likelihood estimator
MNAR: missing not at random
MRI: magnetic resonance imaging
mtGWAS: multi-trait genome-wide association study
mvLM: multivariate linear model
mvLMM: multivariate linear mixed model
nMDS: non-metric multi-dimensional scaling
PC: principal component
PCA: principal component analysis
PEER: probabilistic estimation of expression residuals
QC: quality control
REML: restricted maximum likelihood
RFLP: restriction fragment length polymorphism
RMSD: root mean squared deviation
RRM: realised relationship matrix
RSS: residual sum of squares
SNP: single nucleotide polymorphism
stGWAS: single-trait genome-wide association study
TSI: Toscani in Italia
tSNE: t-Distributed stochastic neighbourhood embedding
uvLMM: univariate linear mixed model

Summary

Over the past ten years, more than 4,000 genome-wide association studies (GWAS) have helped to shed light on the genetic architecture of complex traits and diseases. In recent years, phenotyping of the samples has often gone beyond single traits and it has become common to record multi- to high-dimensional phenotypes for individuals. Whilst these rich datasets offer the potential to analyse complex trait structures and pleiotropic effects at a genome-wide level, novel analytic challenges arise. This thesis summarises my research into genetic associations for high-dimensional phenotype data.

First, I developed a novel and computationally efficient approach for multivariate analysis of high-dimensional phenotypes based on linear mixed models, combined with bootstrapping (LiMMBo). Both in simulation studies and on real data, I demonstrate the statistical validity of LiMMBo and that it can scale to hundreds of phenotypes. I show the gain in power of multivariate analyses for high-dimensional phenotypes compared to univariate approaches, and illustrate that LiMMBo allows for detecting pleiotropy in a large number of phenotypic traits.

Aside from the computational challenges they pose in GWAS, very high-dimensional phenotypes have a true dimensionality that is often unknown and lies hidden in high-dimensional space. Retaining maximum power for association studies of such phenotype data relies on using an appropriate phenotype representation. I systematically analysed twelve unsupervised dimensionality reduction methods based on their performance in finding a robust phenotype representation in simulated data of different structure and size. I propose a stability criterion for choosing low-dimensional phenotype representations and demonstrate that stable phenotypes can recover genetic associations.
Finally, I analysed genetic variants for associations with high-dimensional cardiac phenotypes based on MRI data from 1,500 healthy individuals. I used an unsupervised approach to extract a low-dimensional representation of cardiac wall thickness and conducted a GWAS on this representation. In addition, I investigated genetic associations with a trabeculation phenotype generated from a supervised feature extraction approach on the cardiac MRI data.

In summary, this thesis highlights and overcomes some of the challenges in performing genetic association studies on high-dimensional phenotypes. It describes new approaches for phenotype processing and genotype to phenotype mapping for high-dimensional datasets, and provides new insights into the genetic structure of cardiac morphology in humans.

1. Introduction

The field of quantitative genetics has come far since Fisher’s initial studies on human growth traits in 1918. Although the concept of inheritance existed at this time, little was known about the molecule responsible. The discovery of the DNA structure in the 1950s and technical breakthroughs in analysing its sequence in the following decades have made it possible to investigate genetic variance on a detailed scale, moving from whole chromosomes and linkage studies to the analysis of DNA variation at the level of single base pairs.

The developments in genotyping and sequencing technologies in recent years have made large-scale studies on genetic variation feasible. With the falling costs of genotyping techniques, the number of samples has risen and studies investigating the effects of single DNA bases often comprise thousands of individuals, in particular in the field of human genetics. Together with the increased number of samples, the number of phenotypes that are measured for each individual has grown from a few measurements to tens, hundreds or even thousands.
The availability of these rich datasets provides great opportunities when studying the influence of genetic variation on phenotypic variance. However, it also poses technical challenges when analysing these datasets.

In this thesis, I identify some of these challenges and propose new methods for the genetic analysis of high-dimensional datasets. These new methods are first explored in simulation studies and subsequently applied to real datasets. Specifically, I developed a new approach for the joint genetic association testing of a large number of phenotypic traits and applied this method to a publicly available dataset of yeast growth traits. I explored different dimensionality reduction methods for very high-dimensional datasets and propose a new measure to define the stability of the dimensionality reduction. Finally, I analysed human heart morphology data for genetic associations, applying the methods from the dimensionality reduction study on simulated data. In this introduction, I will first give a general overview of the history and methods in quantitative genetics, followed by the description of statistical models relevant for this thesis. In order to help with an understanding of the genetic association studies on the human heart morphology data, I also introduce basic concepts of cardiac structure and development and their underlying genetics.

1.1. From ancient ideas of inheritance to the birth of modern genetics

The formulation of the concept of human inheritance (the passing on of traits from parents to offspring) can already be found in works of Hippocrates and Aristotle. In addition to their theory of the inheritance of acquired traits, Hippocrates and Democritus also describe a possible mechanism of inheritance [Zirkle, 1935], a concept later formalised as “pangenesis” by the English naturalist Charles Darwin [1868] and others such as the French Comte de Buffon [1749] and Genevan naturalist Charles Bonnet [1779]. The theory of pangenesis, which translates to whole (Greek: pan) origin (Greek: genesis) or birth (Greek: genos), describes how the entire parental organism participates in passing on traits to the offspring. In this developmental theory of heredity, all cells in an organism were believed to secrete small particles called gemmules, which circulate through the body to congregate in the gonads. While this theory was quickly refuted, Darwin became renowned for his ideas about trait variation and the link to inheritance. In his famous work On the Origin of Species [1859], he postulates natural selection as the central concept of evolution, based on his observations of phenotypic variance in a population, differential fitness based on phenotype and the concept of heritability of this fitness [Lewontin, 1970]. The milestones in genetics made since Darwin’s work On the Origin of Species as well as accompanying statistical models and techniques in molecular biology are depicted in figure 1.1.
Figure 1.1: Genetics over time. A. Statistical concepts and B. techniques in molecular biology crucial for the advances in genetics. C. The developments in genetics from its birth by Mendel’s Laws of Inheritance to large databases cataloguing genetic variation of thousands of individuals. Whilst there are many independent studies in all three areas contributing to the successes in genetics that we observe today, I have attempted to depict all major events that led to the specific field of human quantitative genetics in the GWAS era. The development of mathematical models is focused towards models used in this thesis. The legend below the timelines specifies the symbols of the organisms used in the respective studies. As references for each entry, the first author of the corresponding publication is shown. Discoveries where multiple authors are named indicate independent studies at the same time making the same discovery/developing techniques.

1.1.1. Mendelian Laws of Inheritance

The Austrian friar Gregor Mendel was the first to systematically study the mechanisms of heritability. By cross-breeding different varieties of pea plants, he was able to follow the inheritance patterns of a number of visually-observable traits such as flower colour, seed shape and plant height.
In 1866, he presented his observations in the paper Versuche über Pflanzenhybride (experiments on plant hybridisation) where he proposes three general concepts of inheritance which later became known as the Mendelian Laws of Inheritance: i) the Law of Independent Segregation (every individual contains two alleles for each trait which segregate in germ cells leading to a random transmission of alleles to the offspring), ii) the Law of Independent Assortment (traits are inherited independently of each other) and iii) the Law of Dominance (recessive alleles will be masked by dominant alleles and the trait corresponding to the dominant allele will be observed) [Mendel, 1866]. Although his work went largely unnoticed during his lifetime, his meticulous studies and documentation ensured his recognition as the father of genetics. In 1900, his work was independently rediscovered by the Dutch botanist Hugo de Vries [De Vries, 1900; Hannah & De Vries, 1950, translation into English] and, although contested by some based on their seeming lack of understanding of Mendel’s work [Keynes & Cox, 2008; Monaghan & Corcos, 1986; Monaghan & Corcos, 1987], the German botanist Carl Correns [Correns, 1900; Piernick & Correns, 1950, translation into English] and the Austrian agronomist Erich Tschermak [1900].

Around the same time, the British geneticist William Bateson set out to make Mendel’s work accessible to scientists not proficient in Mendel’s native language German. He translated Mendel’s original papers on the Laws of Inheritance [Mendel, 1866] and cross-breeding studies in Hieracium [Mendel, 1869] into English and published them in Mendel’s Principles of Heredity: a Defense [Bateson, 1902]. In [1909], Bateson published an extended version of his original book which allowed Mendel’s work to become known in the greater scientific world [Keynes & Cox, 2008], more than 40 years after its original publication.
In addition to this work, the book Recent Progress in the Study of Variation, Heredity, and Evolution by his former student Robert H. Lock should be mentioned as the first English textbook embracing Mendel’s ideas of inheritance [Lock, 1906; Edwards, 2013].

In addition to the rediscovery and translation of Mendel’s ideas into English, two other branches of investigation contributed to the understanding of heredity and the identification of the molecular basis of the Laws of Inheritance from 1900 onward: biometrics and molecular biology.

1.1.2. Biometrics

Inspired by Darwin’s work on evolution, his half-cousin Francis Galton was interested in mathematically describing and analysing evolutionary concepts. In 1886, he published Regression Towards Mediocrity in Hereditary Stature, offering a statistical approach towards understanding inheritance. Based on measurements of height in parents and their children, he observed that “[t]he height-deviate of the offspring is, on the average, two-thirds of the height-deviate of its mid-parentage”. He achieved the quantification of the deviation from the mean by fitting straight lines to the observed heights and finding their slope, thereby developing the technique of linear regression analysis and introducing the concept of correlation [Galton, 1886]. An extension of this work and descriptions of different statistical distributions and processes in heredity were published in his book Natural inheritance [Galton, 1889]. Karl Pearson formalised and extended Galton’s statistical models for quantifying the effects of inheritance on trait variance by introducing the concept of p-values, the Chi-square test and principal component analysis (PCA) [Pearson, 1900; Pearson, 1901]. The marine biologist Walter Weldon applied these statistical concepts to data he had collected on shrimps and crabs [Weldon, 1890; Weldon, 1892], demonstrating selection in natural populations.
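As a brief aside on Galton’s slope estimate described above, the idea of fitting a straight line to offspring and mid-parent height deviations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the height deviations are hypothetical values constructed so that the expected offspring deviation is two-thirds of the mid-parent deviation (they are not Galton’s measurements), and the helper name least_squares_slope is introduced here purely for the example.

```python
def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line through the points (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


# Hypothetical mid-parent height deviations from the population mean (inches).
midparent_dev = [-3, -2, -1, 0, 1, 2, 3]

# Two hypothetical children per family: the expected deviation is two-thirds
# of the mid-parent deviation, plus or minus one inch of unexplained variation.
parents, children = [], []
for d in midparent_dev:
    for noise in (-1.0, 1.0):
        parents.append(d)
        children.append(2.0 / 3.0 * d + noise)

# The fitted slope quantifies how strongly offspring deviate with their parents.
slope = least_squares_slope(parents, children)
print(round(slope, 3))
```

Because the unexplained variation in this toy dataset is symmetric around each family’s expected value, the fitted slope recovers the two-thirds relationship; with real measurements the estimate would of course vary with sampling noise.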
Together, Galton, Pearson and Weldon are known as the founders of biometrics, the science of applying statistical methods to the study of evolution on quantitative traits, or as Galton described it: “The primary object of Biometry is to afford material that shall be exact enough for the discovery of incipient changes in evolution which are too small to be otherwise apparent.” [Galton, 1901, editorial]. Despite the progress in understanding evolution in the light of statistical concepts, their direct study of heredity was impeded by their reluctance to acknowledge the validity of Mendelian genetics [Bulmer, 2003].

1.1.3. Molecular basis of inheritance

Advances in understanding the molecule responsible for inheritance were made by the Swiss physician and biologist Johannes Miescher and the German anatomist Walther Flemming. Miescher [1871] was the first to successfully isolate a substance he called nuclein, later known as DNA, from the nucleus. Flemming’s experiments on salamander cells led to the discovery of structures that could easily be stained by basophilic dyes and he named them chromatin, “coloured material” (Greek: khrōmat). He later found chromatin to originate from the cell nucleus and conducted further studies into understanding cell division and mitosis [1878]. Although both Miescher’s and Flemming’s methods and discoveries were crucial in the later identification of DNA as the carrier of inheritance, neither of them made the connection at the time.

With these advances in molecular and statistical techniques and the rediscovery of the Mendelian laws, the new discipline of genetic research attracted much attention.

1.2. The Laws of Inheritance on a cellular level

The first two scientists proposing how Mendel’s Laws could work on a cellular level were the German biologist Theodor Boveri and the American Walter Sutton.
Through experimentally induced double fertilisations of sea urchin eggs and subsequent observations of developmental processes in the resulting embryos, Boveri claimed “dass eine bestimmte Kombination von Chromosomen zur normalen Entwicklung notwendig ist, und dieses bedeutet nichts anderes, als dass die einzelnen Chromosomen verschiedene Qualitäten besitzen müssen.”, i.e. “that a specific combination of chromosomes is necessary for a normal development which in turn means that each chromosome must harbour different qualities” [Boveri, 1902]. At the same time, Sutton described his observations in reduction division (later known as meiosis) and postulated that different chromosomes play different roles in development. Similar to Boveri, he came to the conclusion that “the phenomena of germ cell division and of heredity are seen to have the same essential features […], with purity of units (chromosomes, characters) and the independent transmission of the same” [1903]. Both studies demonstrated the link between the Mendelian Laws of Inheritance and chromosomes as their carrier and are the basis for the chromosome theory of inheritance, also known as the Boveri-Sutton chromosome theory.

Around the same time, Bateson worked together with Edith Saunders and Reginald Punnett on experiments similar to Mendel’s pea hybrids to understand the physiology of heredity. While they confirmed Mendel’s original observations, they also discovered traits whose segregation did not follow the Law of Independent Assortment. Although they could not explain the mechanism behind these observations, their results led them to propose the concept of coupling or co-inheritance of traits [Bateson & al., 1905]. The first suggestion that this coupling of traits might result from genes lying on the same chromosome came from Lock [Lock, 1906; Edwards, 2013].
With the progress in understanding Mendelian Laws on a cellular level came the establishment of terms describing certain entities and properties that are still in use today. In addition to his scientific contributions and his translation of Mendel’s works into English, Bateson became known for coining key terms in the field of genetics, even the term genetics itself [Dunwell, 2007]. He defined the units of inheritance transmission as allelomorphs, later abbreviated to alleles, and introduced the terms homozygote and heterozygote for individuals carrying the same or different allelomorphs [Bateson, 1902]. The word gene as a term for the Mendelian factors or units of inheritance was introduced by the Danish botanist Johannsen [1911]. He also introduced the terms phenotype as the outward appearance of an individual and genotype as their genetic traits. The terms polygenetic, for traits that are governed by multiple genes [East, 1910], and pleiotropic, for genes that affect multiple, seemingly unrelated phenotypes [Plate, 1910, page 597], also made their first appearance at that time. While these terms are standard in today’s field of genetics, their use at the time rose only slowly. For simplicity, however, I will from now on refer to any description of Mendelian factors or units as genes.

1.3. Genetic linkage

The American embryologist Thomas Morgan was critical of the ideas of Mendelian inheritance and chromosomes as its carrier [Allen, 1968], yet he would become a crucial figure in establishing the chromosomal theory of heredity and introducing other important concepts of inheritance. In his famous Fly Room at Columbia University, he worked on mutation and breeding experiments in the fruit fly Drosophila melanogaster, aiming to discover mutations that would lead to the emergence of new species, as described in De Vries’ mutation theory [Allen, 1968].
Instead, his experiments on fruit fly mutants for eye color (white instead of red) showed that the pattern of inheritance of the mutant trait followed the Mendelian Law of Dominance. In addition, he discovered that the factor determining eye color was linked to the factor for sex determination [Morgan, 1910; Morgan, 1911a], pointing towards the coupling of traits as observed by Bateson.

In subsequent years, Morgan and his students carried out extensive research on mutant fruit flies which led to the discovery of crossing over (exchange of paternal and maternal chromosomal material during meiosis) and the formalisation of the concept of genetic linkage [Morgan, 1911b]. Based on the hypothesis that the degree of linkage between phenotypes would be inversely correlated with the linear distance of their genes on a chromosome, they developed the technique of genetic mapping: the localisation of genes underlying phenotypes on the basis of correlation with inheritance patterns [DNAvariation], without the need for prior hypotheses about biological function. Using this technique, where the recombination rate between traits is used to estimate the relative distance of their genes, his student Sturtevant [1913] published the first genetic map¹ describing the relative distances between genes on the X chromosome of Drosophila melanogaster. Together with Herman Muller and Calvin Bridges, two other students of Morgan’s, they published the book The Mechanism of Mendelian Heredity [1915], describing additional genetic maps for chromosomes 2 and 3 and listing groups of genes that are jointly inherited.

1.4. Towards quantitative genetics

With their development of genetic mapping and cross-breeding of D. melanogaster lines, Morgan, Sturtevant, Muller and Bridges conducted the first genotype-phenotype analysis studies.
As in Mendel’s original experiments and later, similar work by Bateson, Saunders and Punnett, the phenotypes they observed were predominantly categorical, such as color of seeds and flowers in pea plants or the white-eyed phenotype in Drosophila melanogaster. In contrast, biometricians like Galton and Pearson analysed quantitative traits such as height. Their models fit with the Darwinian model of gradual change through natural selection, but did not explain the mode of inheritance. A great advance in genotype-phenotype mapping allowing for the analysis of quantitative traits came about with the work of the British statistician and biologist Ronald Aylmer Fisher.

While still an undergraduate student at the University of Cambridge, Fisher [1912] published his first paper, On an absolute criterion for fitting frequency curves, in which he outlined the fundamental ideas of maximum likelihood estimation. He later extended this work and, by 1922, had established properties such as consistency and minimum variability [Fisher, 1922b] of the maximum likelihood estimator, which is still used today [Hald, 1999]. He demonstrated the utility of maximum likelihood estimation in genetics by solving a number of equations to elucidate a genetic map of eight Drosophila melanogaster genes based on their crossing over frequencies [Fisher, 1922d].
In the same year and the years to follow, he published a series of papers where he derived the distribution and significance testing of regression coefficients, correlation ratios and multiple regression coefficients [Fisher, 1922c; Fisher, 1928], an exact test for two-by-two contingency tables with small expectations (Fisher’s exact test) [Fisher, 1922a], partial correlation coefficients [Fisher, 1924b] and the variance ratio, later named after Fisher as the F statistic [Fisher, 1924a].

¹ As opposed to physical maps, which are based on exact chromosomal position and only became possible with the development of molecular biology techniques to examine DNA molecules directly [Brown, 2002].

In 1918, the cornerstone for quantitative genetics was laid with his publication The correlation between relatives on the supposition of Mendelian inheritance, where he showed that biometrics and Mendelianism are not contradictory but complementary [Fisher, 1918]. Specifically, by analysing levels of phenotypic correlation between individuals of differing degrees of relatedness, he showed that the observed phenotypic variation can result from Mendelian inheritance. He further distinguished between two different types of genetic components contributing to the phenotype, one simply ascribed to genotypes and the other to “essential genotypes”. Today, these components are known as broad-sense and narrow-sense heritability. Broad-sense heritability is the proportion of phenotypic variance explained by the entire genetic variation including additive, dominance (allelic interaction within loci) and epistatic (allelic interaction between loci) genetic effects, while narrow-sense heritability is defined as the ratio of additive genetic variance to total phenotypic variance.

As an additional statistical concept, it was in this work that Fisher introduced the term variance for the square of the standard deviation, i.e. the mean squared deviation from the mean.
The analysis of variance in biological experiments would be of interest to Fisher in his appointment at Rothamsted Experimental Station, where he analysed data from crop experiments with respect to different variance components and developed statistical techniques such as the analysis of variance (ANOVA) [Fisher, 1921; Fisher & Mackenzie, 1923; Eden & Fisher, 1929].

Extending his 1918 work on trait correlation in light of Mendel’s Laws, Fisher published the book The Genetical Theory of Natural Selection, where he reconciled the long-standing ideas of Darwin’s evolutionary theory and Mendelian inheritance. He gave the first comprehensive quantitative theory of sexual selection, evolution of recombination rates, polymorphism and many more concepts found in today’s field of population genetics [Fisher, 1930].

1.5. Progress in deciphering the molecular mechanisms of inheritance

Large steps forward in the molecular understanding of inheritance were the discovery of DNA as the genetic material in 1944 [Avery & al., 1944] and its composition from the four bases adenine, thymine, cytosine and guanine [Vischer & Chargaff, 1948; Chargaff & al., 1949; Chargaff & al., 1952] as well as the resolution of the DNA structure almost a decade later [Watson & Crick, 1953]. These insights brought forward an understanding of other biological concepts such as protein synthesis and enabled Francis Crick to postulate the central dogma of biology: information is transmitted from DNA and RNA to proteins, but information cannot be transmitted from a protein to DNA [Crick, 1958]. The deciphering of the genetic code by Nirenberg and others followed a few years later [Nirenberg & Matthaei, 1961; Crick & al., 1961; Matthaei & al., 1962].

1.5.1. Novel genotype mapping techniques

Three discoveries and novel techniques at the beginning of the 1970s opened the door for the development of new genetic mapping approaches: the discovery of restriction enzymes [Smith & Wilcox, 1970; Morrow & Berg, 1972], the ability to clone and amplify specific DNA sequences [Jackson & al., 1972; Cohen, 1973], and the detection of specific DNA sequences from a large pool of DNA fragments (Southern blot) [Southern, 1975]. Based on these techniques, restriction fragment length polymorphism (RFLP) analysis was developed, which allows for the identification of variants from within a specific genomic region using restriction enzyme-digested DNA [Grodzicker & al., 1974; Botstein & al., 1980]. Initially, RFLP analysis was used for genetic linkage maps in model organisms [Goodman & al., 1977; Cameron & al., 1979] and target genes in humans [Kan & Dozy, 1978; Jeffreys, 1979; Tuan & al., 1979]. Based on theoretical considerations of using RFLP analysis for a general, target-free genetic mapping in humans [Botstein & al., 1980], the first human genetic map was published in 1987 [Donis-Keller & al., 1987].

1.5.2. Deciphering DNA sequences

While these mapping efforts were underway, the independent development of two different DNA sequencing techniques by two groups, one led by Frederick Sanger and the other by Walter Gilbert together with Allan Maxam, was a further big leap in understanding the biological basis of genetic variation [Sanger & al., 1977; Maxam & Gilbert, 1977]. Sanger’s method of DNA sequencing with chain-terminating inhibitors eventually became the standard for DNA sequencing, and subsequent innovations led to the development of automatic sequencing machines which allowed for sequencing lengths of about one kilobase [Hunkapiller & al., 1991]. For sequencing longer stretches of DNA, a novel strategy named shotgun sequencing was developed [Staden, 1979; Anderson, 1981].
In shotgun sequencing, the long DNA of interest is randomly broken up into shorter DNA fragments which are cloned and sequenced separately. Because the fragments are generated at random, they overlap, and these overlaps allow for the in silico reconstruction of longer DNA sequences.

In 1995, the first genome of a living organism, the bacterium H. influenzae, was sequenced and assembled by shotgun sequencing [Fleischmann & al., 1995]. The genomes of other model organisms were to follow in subsequent years (yeast [Goffeau & al., 1996], C. elegans [C. elegans Sequencing Consortium, 1998], D. melanogaster [Adams & al., 2000]) until the first draft of the human genome was published in 2001 [International Human Genome Sequencing Consortium, 2001]. The sequence of the human genome, the development of faster, massively-parallel next-generation sequencing techniques (reviewed in [Shendure & Ji, 2008; Heather & Chain, 2016]) and DNA microarrays that allow for the genotyping of hundreds of thousands of genetic markers simultaneously [Wang & al., 1998] started a new era of human genetic and genomic research.

1.6. From genetic linkage analysis to genome-wide association studies

1.6.1. Genotype-phenotype mapping until the 1990s

Genotype-phenotype mapping approaches today can broadly be classified into genetic linkage analyses and population-based association studies. Genetic linkage analysis for human traits had already been applied in the 1930s [Bernstein, 1930; Penrose, 1935], while association studies only became known in the 1950s. For a clearer description of the methods and results, the following sections describe the developments in human quantitative genetics based on study type rather than in their chronological order.

Genetic linkage analysis

Genetic linkage analysis investigates the relationship between a given locus and the trait or disease of interest. As with Morgan’s linkage studies in D.
melanogaster, today’s methods are also based on the observation that genetic markers in close physical proximity on a chromosome remain mainly linked during meiosis. By following the segregation of a specific trait in family pedigrees, the recombination rates between genetic markers can be estimated and their relative genomic position determined. To quantify the likelihood of linkage, a variety of measures with different pedigree requirements have been developed. Some required full parent-offspring trios [Bernstein, 1930; Haldane, 1934], while others showed the possibility of determining genetic linkage based on sib-pairs alone [Penrose, 1935]. A commonly used test allowing for different pedigree structures is the sequential probability ratio test for linkage [Morton, 1955; Pulst, 1999]. In this test, the likelihood of the data given that the loci are linked is divided by the likelihood given that the loci are unlinked. The base-10 logarithm of this ratio, the logarithm of the odds (LOD) score, serves as the measure for the likelihood of linkage. Genetic linkage studies often require strict assumptions about the underlying genetic models such as specification of penetrance and disease gene frequency [Morton, 1955; Pulst, 1999] and are affected by a number of potential confounders such as genetic heterogeneity and diagnostic inaccuracy [Bird, 1993]. Nevertheless, linkage studies have been successful in pinpointing genomic loci associated with disease. They were initially restricted to known genes or gene products such as haemoglobin (linked to sickle-cell thalassaemia [Ingram & Stretton, 1959]) or haemophilia and colour-blindness [Haldane & Smith, 1947]; the development of techniques such as RFLP mapping (section 1.5.1) then enabled the detection of genetic markers in candidate genes.
With these markers, linkage analysis could be extended to a greater number of candidate genes and led to the discovery of genetic links to diseases such as Huntington’s disease [Gusella & al., 1983], cystic fibrosis [Kerem & al., 1989] and bipolar disorder [Baron & al., 1987].

Association studies

In contrast to linkage studies, which examine the association between locus and trait in pedigrees, association studies investigate the relationship between the frequency of a genetic marker and the trait in a population. The frequencies of the genetic markers in individuals carrying the trait (cases) are compared to those in individuals without the trait (controls). Genetic markers whose frequencies are increased in cases compared to controls are thought to be associated with the risk for disease. Often, the significance of the association is evaluated via a simple $\chi^2$-test. As with linkage analysis, population association studies were initially limited to known genes or gene products such as in the association between blood antigens and stomach cancer [Aird & al., 1953; Aird & al., 1954]. With the new techniques for determining genetic markers in candidate genes, association studies successfully identified gene-disease associations in, for instance, diastrophic dysplasia [Hästbacka & al., 1992] and Alzheimer’s disease [Strittmatter & Roses, 1996]².

Jiang & Zeng [1995] provided an extension to the population-association model, leaving the strict case-control design and proposing a method to detect association with multiple quantitative traits. In the quantitative association study, an individual’s genotype is represented numerically and a model can be fit directly to the genotypes and the continuous trait without relying on case-control status. In the linear model framework introduced by Jiang and colleagues, multiple traits are jointly analysed for genetic association, testing different models such as pleiotropic effects and gene-environment interaction [Jiang & Zeng, 1995].
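The two test statistics described in this section can be illustrated with a minimal sketch. It is not taken from the cited studies: the function names, the restriction to phase-known meioses for the LOD score and the 2×2 allele-count layout for the $\chi^2$-test are assumptions made purely for illustration.

```python
import math


def lod_score(recombinants, non_recombinants, theta):
    """LOD score for an assumed recombination fraction theta (0 < theta < 1).

    Simplifying assumption: all meioses are phase-known, so the likelihood
    is binomial in the number of recombinant offspring.
    """
    n = recombinants + non_recombinants
    # log10 likelihood of the data if the loci are linked at distance theta
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1 - theta))
    # log10 likelihood if the loci are unlinked (theta = 0.5)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked


def chi_square_2x2(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) and p-value
    for a 2x2 table of allele counts in cases and controls."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical pedigree data: 2 recombinants among 20 informative meioses
    print(lod_score(2, 18, theta=0.1))
    # Hypothetical allele counts: 30/70 in cases versus 50/50 in controls
    print(chi_square_2x2(30, 70, 50, 50))
```

In practice, the LOD score is evaluated over a grid of recombination fractions and pedigree likelihoods are considerably more involved; the sketch only shows the ratio that the score is built on.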
1.6.2. “Common disease–common variant” hypothesis

By the mid-1990s, genotype-phenotype mapping in humans was largely focused on candidate gene mapping through either linkage analysis or association studies. Linkage analysis had been very successful in identifying genes linked to Mendelian and monogenetic disorders, with 671 genes for which at least one disease-related locus³ had been detected by 1995. Population-based association studies had so far detected about 250 genes associated with disease or dichotomous traits [Hirschhorn & al., 2002]. However, the number of reproducible results was notably lower and showed the difficulties associated with case-control population association studies. Major limitations were seen in the susceptibility to population stratification [Lohmueller & al., 2003] and the low a priori probability of the tested gene being causal. In addition, for illnesses such as heart disease, diabetes or hypertension, the risk of being affected is likely a combination of multiple genetic and environmental factors [Hunter, 2005], which stands in stark contrast to the pattern observed in monogenetic diseases. In monogenetic diseases such as cystic fibrosis or Huntington’s disease, the presence of a genetic factor or factors (dominant or recessive) almost completely predicts the presence of the disease, and these factors are generally of low frequency [Sankaranarayanan, 1998]. In complex diseases, the genetic risk factor may be present at higher frequency and only lead to a small increase in disease risk [Reich & Goldstein, 2001]. Based on these arguments, the “common disease–common variant” hypothesis had been proposed, stating that common polymorphisms may play a role in the susceptibility to common diseases [Risch & Merikangas, 1996; Lander, 1996; Chakravarti, 1999; Reich & Goldstein, 2001]. For detecting common variants with small or moderate effect sizes, association studies are a more powerful tool than linkage analyses [Ott & al., 2015] and became the method of choice to investigate common disease variants on a genome-wide level. To enable systematic genome-wide screens of common variants, three components were needed: a catalogue of common variation in the human population, experimental techniques to obtain these genotypes in large cohorts, and the computational techniques for the subsequent analyses.

² Spielman & al. [1993] reconciled linkage analysis and case-control association studies by formally introducing the transmission/disequilibrium test, which tests directly for linkage between a disease and a marker locus that is known to show population association.

³ Statistics extracted from Online Mendelian Inheritance in Man: https://omim.org; search parameters: “date_updated:1981/1-1995/12”

1.6.3. Databases of human variation

The first genome-wide database of common human sequence variation was created within the scope of the International HapMap project, which was launched in 2002 [The International HapMap Consortium, 2005; The International HapMap Consortium, 2007; The International HapMap Consortium, 2010]. The HapMap project aimed at characterising the frequencies of single nucleotide polymorphisms (SNPs), i.e. variation on a single base pair level, for different human populations. Based on their genome-wide SNP frequencies, a comprehensive map of linkage disequilibrium (LD), the non-random association of alleles at different loci [Lewontin & Kojima, 1960], in different populations was created. Because parent–offspring trios were included in the analysis, computational phasing [Stephens & al., 2001] enabled determination of the SNP contribution from each parent and the combination in which they were inherited.
This particular combination of SNPs along a chromosome is termed haplotype and was the inspiration for the name of the project. The HapMap collection contains 1.6 million common SNPs in 1,184 reference individuals from 11 global populations. An extension of the work of the HapMap project, the 1000 Genomes Project aimed to detect common human genetic variation by whole-genome sequencing of individuals from multiple populations. The project finished in 2015, providing genotypes and haplotypes at more than 88 million variants, including SNPs, short insertions or deletions, and structural variants for 2,504 individuals from 26 populations [1000 Genomes Project Consortium, 2011; 1000 Genomes Project Consortium, 2012; 1000 Genomes Project Consortium, 2015]. The work of the UK10K consortium complemented the work of both previous projects and extended the spectrum of observed genetic variation to rare variants in nearly 10,000 individuals from population-based and disease collections [UK10K Consortium, 2015]. While the major focus of these consortia lay in the collection of comprehensive genotype data, a new resource combining both genotype and phenotype data of more than 500,000 individuals has recently been published. Phenotypes collected within this resource, the UK Biobank, cover, amongst others, anthropometric, cardiac and disease phenotypes [Sudlow & al., 2015].

1.6.4. Genotyping of large cohorts

Genotype data of common variants is typically obtained from DNA microarrays which allow for the genotyping of hundreds of thousands of common SNPs simultaneously [Wang & al., 1998]. Based on the LD structures found in the reference panels (described above), haplotypes of the individuals can be estimated. Comparing the estimated haplotypes of the individuals to haplotype patterns in the reference panel enables imputation of unobserved genotypes in the study cohort.
A number of different methods for genotype imputation have been developed, including IMPUTE2 [Howie & al., 2009], Beagle [Browning & al., 2007] and MaCH [Li & al., 2010] (reviewed in [Marchini & Howie, 2010]). Via imputation, the number of genotypes per individual can be extended from the hundreds of thousands on the genotyping array to the millions of variants observed in the reference datasets. Using these imputed genotypes for association studies can increase the power of the study and presents a high-resolution view of all SNPs in the associated region [Marchini & Howie, 2010].

1.6.5. Genome-wide association studies

The first successful study to test the “common disease–common variant” hypothesis without gene-based selection of genetic markers was conducted in 2005. Klein & al. [2005] carried out a case-control genome-wide association study (GWAS) for age-related macular degeneration and found a SNP in complement factor H to be associated with an increase in disease risk. Similar to population association studies of candidate genes, the significance of each SNP-disease association was tested via a $\chi^2$-test and the resulting p-values subsequently corrected for multiple testing via Bonferroni correction (see section 1.7.4). Soon after, the Wellcome Trust Case Control Consortium published large case-control GWAS for seven common diseases, including bipolar disorder, coronary heart disease and type I and II diabetes [Burton & al., 2007]. In the same year, the first GWAS on quantitative traits followed. Two research groups investigated the genetic effects on body mass index (BMI) and found links to the FTO gene. In addition, these BMI-associated SNPs also showed strong association with type II diabetes [Frayling & al., 2007], and other SNPs within the FTO gene were also associated with weight and hip-circumference [Scuteri & al., 2007].
Both studies used a simple linear model (see section 1.7) to test for association, with the genetic marker as the explanatory variable and BMI as the response variable.

In the following years, the methods for GWAS were extended to enable the genotype-phenotype mapping for sets of SNPs [Wu & al., 2010; Casale & al., 2015], the joint mapping of multiple traits [Korte & al., 2012; Yang & al., 2011; Bottolo & al., 2013; Casale & al., 2015] and the use of more complex models to account for population stratification such as mixed model approaches [Kang & al., 2010; Lippert & al., 2011; Zhang & al., 2010; Svishcheva & al., 2012] and generalised estimating equations [Cupples & al., 2007].

Based on these methods, thousands of GWAS have been conducted covering common diseases (e.g. asthma [Noguchi & al., 2011; Pickrell & al., 2016], coronary heart disease [Wild & al., 2011; Takeuchi & al., 2012; Lu & al., 2012], migraine [Pickrell & al., 2016; Gormley & al., 2016], blood pressure [Kato & al., 2011; Franceschini & al., 2013]), anthropometric traits (e.g. height [Lango & al., 2010; Wood & al., 2014], weight [Willer & al., 2009], BMI [Speliotes & al., 2010; Yang & al., 2012], waist-hip ratio [Lindgren & al., 2009; Heid & al., 2010]) and other non-disease related quantitative phenotypes (e.g. eye color [Eriksson & al., 2010; Candille & al., 2012; Zhang & al., 2013], freckling [Sulem & al., 2008], facial morphology [Paternoster & al., 2012], hair greying [Adhikari & al., 2016]). The results of these studies are collected in the GWAS catalogue, which currently contains 3,092 publications and 49,769 unique SNP-trait associations [MacArthur & al., 2017, accessed 10.09.2017].

In GWAS, the genetic variants associated with the traits of interest are often not directly informative with respect to finding the target gene and causal mechanism.
However, bioinformatics fine-mapping approaches and molecular follow-up studies have been successful in identifying target genes and proposed mechanisms for many GWAS discoveries. For some of these GWAS results, the mechanistic insights have triggered drug development and drug repurposing studies. With increasing sample sizes such as in the UK Biobank resource [Sudlow & al., 2015], many new genetic variants are likely to be discovered in the years to come. They will help to account for more genetic variation and will likely yield more accurate genetic predictors (reviewed in [Visscher & al., 2017]).

1.7. Linear models for genome-wide association studies

Simple linear models and linear mixed models are widely applied in genetic association analysis. They offer great control for confounding factors and allow for the joint analysis of multiple traits. In the following sections, I will describe the general model specifications and parameters, their estimation and application to genetic studies. I will outline the challenges for linear models in GWAS and the approaches developed to overcome these challenges.

For mathematical model descriptions throughout this thesis, I use the following notation: bold, small letters symbolise one-dimensional column vectors, e.g. $\mathbf{v}$, and bold capitalised letters matrices, e.g. $\mathbf{M}$. A normal distribution is specified by $\mathcal{N}(\text{mean}, \text{variance})$, a multivariate normal by $\mathcal{N}_{r \times c}(\text{mean}, \text{variance})$ and a matrix-variate normal by $\mathcal{MN}_{r,c}(\text{mean}, \text{variance}_{\text{rows}}, \text{variance}_{\text{columns}})$, where $r$ and $c$ are the row and column dimensions, respectively.

1.7.1. Linear regression

In the linear model, the continuous response variable (e.g. phenotype) is described as a linear function of one or more explanatory variables (e.g. genotype and covariates).
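With $N$ representing the number of samples, $y_i$ the response variable for sample $i$, $\{x_{i1}, x_{i2}, \dots, x_{iF}\}$ the $F$ explanatory variables for sample $i$ and $\beta_f$ their corresponding weights, the linear model can be cast as

$$y_i = \sum_{f=1}^{F} x_{if}\beta_f + \psi_i, \quad \text{with } \psi_i \sim \mathcal{N}(0, \sigma_e^2). \tag{1.1}$$

In this model, the residual term $\psi_i$ captures measurement noise and other unaccounted factors that influence the response variable. $\psi_i$ is modelled to follow a normal distribution with mean 0 and variance $\sigma_e^2$ and to be independent across samples, i.e. with covariance equal to zero: $\operatorname{cov}(\psi_i, \psi_j) = 0$ for $i \neq j$. Equivalently, equation (1.1) can be written in matrix form

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\psi}, \quad \text{with } \boldsymbol{\psi} \sim \mathcal{N}(\mathbf{0}, \sigma_e^2\mathbf{I}_N), \tag{1.2}$$

where the $N \times N$ identity matrix $\mathbf{I}_N$, the response vector $\mathbf{y}$, the matrix of explanatory variables $\mathbf{X}$, the weight vector $\boldsymbol{\beta}$ and the vector of residuals $\boldsymbol{\psi}$ are defined as:

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1F} \\ x_{21} & x_{22} & \cdots & x_{2F} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NF} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_F \end{bmatrix} \quad \text{and} \quad \boldsymbol{\psi} = \begin{bmatrix} \psi_1 \\ \psi_2 \\ \vdots \\ \psi_N \end{bmatrix}. \tag{1.3}$$

Maximum likelihood estimation

The model in equation (1.2) describes the probability distribution of the response variable, given the explanatory variables and the corresponding parameters $\boldsymbol{\beta}$ and $\sigma_e^2$. Viewed as a function of these parameters, this probability is known as the likelihood function or likelihood $\mathcal{L}$ and plays a key role in statistical inference of the model parameters. Casting equation (1.2) as the likelihood of the model parameters $\boldsymbol{\beta}$ and $\sigma_e^2$ yields

$$\mathcal{L}(\boldsymbol{\beta}, \sigma_e^2) = p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma_e^2) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\boldsymbol{\beta}, \sigma_e^2\mathbf{I}_N) \tag{1.4}$$

or directly expressed in terms of the response variable

$$\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma_e^2\mathbf{I}_N). \tag{1.5}$$

The parameter estimates $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}_e^2$ that maximise the likelihood function are the maximum likelihood estimators (MLEs) of $\boldsymbol{\beta}$ and $\sigma_e^2$. In order to improve numerical stability, the log-likelihood is commonly used instead of the likelihood⁴.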
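The full log-likelihood is expressed as

$$\log\mathcal{L}(\boldsymbol{\beta}, \sigma_e^2) = \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma_e^2) \tag{1.6}$$

$$= \log \prod_{i=1}^{N} p(y_i \mid \mathbf{X}, \boldsymbol{\beta}, \sigma_e^2) \tag{1.7}$$

$$= -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma_e^2 - \frac{1}{2\sigma_e^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \tag{1.8}$$

and the MLE

$$\hat{\boldsymbol{\beta}}, \hat{\sigma}_e^2 = \underset{\boldsymbol{\beta}, \sigma_e^2}{\operatorname{argmax}} \, \log\mathcal{L}(\boldsymbol{\beta}, \sigma_e^2). \tag{1.9}$$

The MLEs of $\boldsymbol{\beta}$ and $\sigma_e^2$ are found by setting the partial derivatives of equation (1.8) to zero

$$\left( \frac{\partial \log\mathcal{L}(\boldsymbol{\beta}, \sigma_e^2)}{\partial \boldsymbol{\beta}} \right)_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}, \sigma_e^2 = \hat{\sigma}_e^2} = 0 \tag{1.10}$$

$$\left( \frac{\partial \log\mathcal{L}(\boldsymbol{\beta}, \sigma_e^2)}{\partial \sigma_e^2} \right)_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}, \sigma_e^2 = \hat{\sigma}_e^2} = 0, \tag{1.11}$$

yielding

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \tag{1.12}$$

$$\hat{\sigma}_e^2 = \frac{1}{N}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \tag{1.13}$$

$$= \frac{1}{N}\left(\mathbf{y} - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\right)^T\left(\mathbf{y} - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\right). \tag{1.14}$$

⁴ Since the logarithm is monotonically increasing, maximisation of the log-likelihood is equivalent to maximising the likelihood itself, but offers mathematically convenient properties.

Restricted maximum likelihood

In Gaussian models as in equation (1.5), the MLE of the mean parameter $\hat{\boldsymbol{\beta}}$ is unbiased, whereas the MLE of the variance component $\hat{\sigma}_e^2$ suffers from a downward bias. The bias of $\hat{\sigma}_e^2$ originates from the loss in degrees of freedom as a consequence of estimating $\hat{\boldsymbol{\beta}}$ from the data. Patterson & Thompson [1971] proposed a solution for a $\boldsymbol{\beta}$-free estimation of $\hat{\sigma}_e^2$ via restricted maximum likelihood (REML). In short, for a linear regression model with

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\phi}, \quad \text{with } \boldsymbol{\phi} \sim \mathcal{N}(\mathbf{0}, \mathbf{H}(\theta)), \tag{1.15}$$

where the covariance term is now described as a general covariance matrix $\mathbf{H}(\theta)$ parameterised by $\theta$, the REML is based on the projection $\mathbf{w}$ of $\mathbf{y}$ by a matrix $\mathbf{A}$ with:

$$\mathbf{A}\mathbf{X} = \mathbf{0}. \tag{1.16}$$

Using equation (1.16) and rewriting equation (1.15) in terms of the projection $\mathbf{w}$

$$\mathbf{w} = \mathbf{A}\mathbf{y} = \mathbf{A}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\phi}) = \mathbf{A}\boldsymbol{\phi} \tag{1.17}$$

yields a transformed response that is free of $\boldsymbol{\beta}$. By maximising $\mathcal{L}(\theta \mid \mathbf{A}\mathbf{y})$ directly, the unbiased estimate for $\theta$ can be found. In the case of the linear regression in equation (1.5) with $\mathbf{H}(\theta) = \sigma_e^2\mathbf{I}_N$, the REML estimate of the variance component $\sigma_e^2$ is

$$\hat{\sigma}_e^2 = \frac{1}{N - F}\left(\mathbf{y} - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\right)^T\left(\mathbf{y} - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\right). \tag{1.18}$$

Comparing equation (1.14) and equation (1.18), it becomes evident that the MLE and REML for the variance component only differ in the denominator, where $N$ is replaced by $N - F$, reflecting the loss of $F$ degrees of freedom (the number of explanatory variables in the model).

In more complex linear models such as linear mixed models (section 1.7.6), the estimation of the variance component is correspondingly more complex, depending on the covariance structure of the residual effects. The detailed derivation of the REML estimators of parameters from the linear model framework used throughout this thesis can be found in [Casale & al., 2015, Supplementary material].

1.7.2. Simple linear model for genotype associations with a single trait

In genetic association studies, the simple linear model describes the phenotype of interest as the sum of the genetic effect and, often, additional covariate effects such as height or sex:

$$\mathbf{y} = \mathbf{x}\beta + \mathbf{F}\boldsymbol{\alpha} + \boldsymbol{\psi}, \quad \text{with } \boldsymbol{\psi} \sim \mathcal{N}(\mathbf{0}, \sigma_e^2\mathbf{I}_N) \tag{1.19}$$

with the phenotype vector for $N$ samples $\mathbf{y} \in \mathcal{R}^{N,1}$, the genetic profile of the SNP being tested $\mathbf{x} \in \mathcal{R}^{N,1}$, the effect size of the SNP $\beta \in \mathcal{R}^{1,1}$, the matrix of $K$ covariates $\mathbf{F} \in \mathcal{R}^{N,K}$ and the effect of covariates $\boldsymbol{\alpha} \in \mathcal{R}^{K,1}$. The residual noise $\boldsymbol{\psi}$ is assumed to follow a normal distribution that is independent across the $N$ samples.

In order to model the genotypes quantitatively, they have to be encoded numerically. For genetic association studies in diploid organisms, there are different inheritance models based on the combination of parental alleles $a$ and $b$ (for bi-allelic loci). In a recessive inheritance model (with respect to $b$), the phenotype is only observed in the presence of two $b$ alleles and the genotypes are encoded as $aa = 0$, $ab = 0$ and $bb = 1$. In the dominant model for $b$, where only one copy of the allele is necessary to confer the phenotype, the genotypes are $aa = 0$, $ab = 1$ and $bb = 1$.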
The additive, or allelic dosage, model for $b$ assumes that the effect on the phenotype is proportional to the allele count of $b$, with $aa = 0$, $ab = 1$ and $bb = 2$ [Bush & Moore, 2012]. For association testing without prior knowledge or assumptions about the mode of inheritance, the additive model has been widely adopted and will be used throughout this thesis. It shows reasonable performance across all three models for the majority of effects, but may suffer from a loss in power for recessive traits with a low causal allele frequency [Lettre & al., 2007].

As equation (1.5) shows, the phenotype is assumed to follow a normal distribution. In order to avoid model misspecification when testing for genetic association, it is common practice to ensure approximate normality by transforming the observed phenotypes via methods such as Box-Cox [Etzel & al., 2003; Yang & al., 2006] or inverse normal transformation [Scuteri & al., 2007; Guan & Stephens, 2008; Anttila & al., 2010; Casale & al., 2015]. For any association tests conducted throughout this thesis, a rank-based inverse normal transformation is applied to each phenotype.

1.7.3. Testing the genotype association for significance

The significance of the association between phenotypes and the genetic markers can be assessed by testing the alternative hypothesis $\mathcal{H}_1$ that the genetic variant has an effect ($\beta \neq 0$) against the null hypothesis $\mathcal{H}_0$ of it not having an effect ($\beta = 0$) on the phenotype. The log-likelihood ratio (LLR) test statistic $\Lambda$ is a commonly used statistic to compare the likelihood of the full model $\mathcal{H}_1$ to that of the null model $\mathcal{H}_0$:

\begin{align}
\mathcal{H}_1 &: \mathbf{y} \sim \mathcal{N}(\mathbf{x}\beta + \mathbf{F}\boldsymbol{\alpha}, \sigma_e^2\mathbf{I}_N), \tag{1.20} \\
\mathcal{H}_0 &: \mathbf{y} \sim \mathcal{N}(\mathbf{F}\boldsymbol{\alpha}, \sigma_e^2\mathbf{I}_N). \tag{1.21}
\end{align}

The LLR test statistic $\Lambda$ is defined as

$$ \Lambda = \log\mathcal{L}(\hat{\beta}, \hat{\boldsymbol{\alpha}}, \hat{\sigma}_e) - \log\mathcal{L}(0, \bar{\boldsymbol{\alpha}}, \bar{\sigma}_e), \tag{1.22} $$

where $\log\mathcal{L}(\hat{\beta}, \hat{\boldsymbol{\alpha}}, \hat{\sigma}_e)$ is the log-likelihood at the REML estimates of $\mathcal{H}_1$ and $\log\mathcal{L}(0, \bar{\boldsymbol{\alpha}}, \bar{\sigma}_e)$ the log-likelihood at the REML estimates of $\mathcal{H}_0$.
$2\Lambda$ follows a $\chi^2_d$-distribution with $d$ degrees of freedom equal to the number of tested parameters [Wilks, 1938] and allows for the calculation of the p-value as

$$ P(\Lambda) = \int_{2\Lambda}^{\infty} \chi^2(x; d)\,dx = 1 - F_{\chi^2}(2\Lambda; d), \tag{1.23} $$

where $F_{\chi^2}(2\Lambda; d)$ is the cumulative distribution function of the $\chi^2$-distribution with $d$ degrees of freedom. For a single-variant single-phenotype test, the degrees of freedom are $d = 1$ (figure 1.2A, blue).

The p-values derived from the $\chi^2$-distribution can be used to interpret the association. The p-value is defined as the probability of finding the observed, or more extreme, results when $\mathcal{H}_0$ is true [Krzywinski & Altman, 2013a], or in other words, it serves as an index measuring the strength of evidence against the null hypothesis [Sterne & al., 2001]. The p-values are compared to a predefined significance level $\alpha$, which specifies the probability of rejecting a true null hypothesis. If $p < \alpha$, the null hypothesis is rejected. Falsely rejected null hypotheses are classified as Type I errors, or false positives, and their frequency depends on the stringency of the $\alpha$ threshold. For instance, with $\alpha = 0.05$, 5% of all true null hypotheses are expected to be rejected. Type II errors, or false negatives, occur when the null hypothesis is falsely accepted, i.e. a true association is not detected. The power in a GWAS is the proportion of true associations that can be detected, which corresponds to power = 1 − Type II error rate [Krzywinski & Altman, 2013a; Krzywinski & Altman, 2013b].

In a GWAS, $S$ genome-wide SNPs are assessed under $\mathcal{H}_0$ (equation (1.21)). With the assumption that the vast majority of $\mathcal{H}_0$ are true and potential confounding has been properly adjusted for (section 1.7.5), the genome-wide p-values follow a uniform distribution in $(0, 1]$ (figure 1.2B).
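The test of equations (1.20) to (1.23) can be sketched for a single simulated SNP as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: it uses maximum likelihood rather than REML estimates for simplicity, the function names `max_loglik` and `llr_pvalue` and all simulation settings are hypothetical, and the p-value is obtained from the survival function `scipy.stats.chi2.sf`, which equals $1 - F_{\chi^2}$.

```python
import numpy as np
from scipy.stats import chi2


def max_loglik(y, X):
    """Maximised Gaussian log-likelihood of y = X b + noise (ML estimates)."""
    N = y.shape[0]
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b_hat
    sigma2_hat = resid @ resid / N          # ML variance estimate, cf. (1.13)
    # Equation (1.8) evaluated at the ML estimates.
    return -0.5 * N * (np.log(2.0 * np.pi * sigma2_hat) + 1.0)


def llr_pvalue(y, x, covariates, dof=1):
    """LLR statistic (1.22) and p-value (1.23) for one SNP."""
    X0 = covariates                          # null model H0: covariates only
    X1 = np.column_stack([x, covariates])    # full model H1: SNP + covariates
    llr = max_loglik(y, X1) - max_loglik(y, X0)
    return llr, chi2.sf(2.0 * llr, dof)


rng = np.random.default_rng(1)
N = 500                                              # assumed sample size
x = rng.binomial(2, 0.3, size=N).astype(float)       # additive genotype coding
covariates = np.column_stack([np.ones(N), rng.normal(size=N)])
y = 0.5 * x + covariates @ np.array([1.0, 0.3]) + rng.normal(size=N)

llr, p = llr_pvalue(y, x, covariates)
```

Because the null model is nested in the full model, the statistic is non-negative, and for the simulated effect size the resulting p-value is far below conventional significance levels.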
To visually examine the p-value distribution, p-values are often depicted in quantile-quantile (qq) plots where the expected $-\log_{10}$ p-values⁵ are plotted against the observed $-\log_{10}$ p-values (both sorted in increasing order). A GWAS is well-calibrated if the expected and observed p-value distributions only show deviations for SNPs associated with the phenotype (figure 1.2C). Deviations of the observed from the expected p-value distribution are commonly observed in GWAS of highly polygenic traits or in studies with confounding factors such as population structure and relatedness, which can create spurious associations [Marchini & al., 2004; Balding, 2006; Spielman & al., 1993; Lander & Schork, 1994]. Strategies to adjust for these confounding effects and to tell them apart from true polygenic effects are described in section 1.7.5.

1.7.4. Correcting for multiple hypotheses testing in GWAS

The underlying assumption of a GWAS is that the large majority of SNPs will have no impact on the phenotypes, i.e. for each SNP, one tests the null hypothesis of no effect versus the alternative hypothesis of a SNP effect that is different from zero and expects to accept the vast majority of these null hypotheses. However, when testing a large number of null hypotheses, it is likely to observe results with p-values below the significance level even if all null hypotheses are true. In a well-calibrated test, the number of false positive results depends on the a priori specified significance level $\alpha$. For instance, with $\alpha = 0.05$ and ten million genome-wide SNPs, $5 \times 10^5$ tests would be expected to be false positives. Methods to correct for multiple hypotheses testing, i.e. to reduce the number of Type I errors, are reviewed in detail in [Shaffer, 1995]. The most commonly applied methods based on false discovery rate (FDR) and family-wise error rate (FWER) are described below.
⁵ In practice, the expected $-\log_{10}$ p-values are obtained through $S$ equally spaced numbers in $(0, 1]$.

[Figure 1.2, panels A to C. A: probability density (y-axis) against $\chi^2$ (x-axis) for $d$ = 1, 5, 10, 50 and 100. B: histogram of p-values, count (y-axis) against $p$ (x-axis). C: observed $-\log_{10}(p)$ (y-axis) against expected $-\log_{10}(p)$ (x-axis), with causal and non-causal SNPs marked.]

Figure 1.2: Distributions in LLR testing in GWAS. A. Probability density functions of $\chi^2$-distributions with different numbers of degrees of freedom ($d$). The higher the number of degrees of freedom, the higher the $\chi^2$-statistic (x-axis) has to be to obtain a p-value below the significance level (indicated as dotted lines and shaded regions under the curves for $\alpha = 0.05$). B. P-value distribution of a well-calibrated GWAS. P-values are derived from the associations of 50,000 bi-allelic SNPs from 1,000 individuals with a single quantitative phenotype. Out of the 50,000 SNPs, five SNPs were simulated to have an effect $\beta \neq 0$. The phenotype was simulated with default parameters as described in chapter 3. C. Quantile-quantile plot of the p-values from the associations in B. The five SNPs with $\beta \neq 0$ are indicated in green.

False discovery rate

The FDR corrects for multiple testing based on the expected proportion of false discoveries. The FDR was introduced by Benjamini and Hochberg in [1995] and a number of other FDR-based correction methods were developed thereafter, e.g. [Storey, 2002; Donoho & Jin, 2006; Sarkar, 2007]. The original method by Benjamini and Hochberg set out to control the expected value of the FDR based on the ratio of the number of wrongly rejected null hypotheses $N$ (not to be confused with the sample size) and the total number of rejected null hypotheses $R$:

$$ \mathrm{FDR} = \mathbb{E}\left[\frac{N}{\max(R, 1)}\right], \tag{1.24} $$

where the maximum in the denominator protects against division by zero.
The procedure works as follows: for a total number of $m$ tests, with p-values $p_1, p_2, \dots, p_m$ sorted in increasing order and assigned the ranks $k_1, k_2, \dots, k_m$ (smallest p-value $p_1$ with $k_1 = 1$), the adjusted p-value $p'_i$ is determined as $p'_i = \frac{m p_i}{k_i}$. Choosing to accept all null hypotheses with $p'_i > \alpha$ ensures $\mathrm{FDR} \leq \alpha$.

Family-wise error rate

The FWER is the probability of observing at least one false positive result within a given experiment (family of tests) [Shaffer, 1995]. Among the FWER-based tests, the simplest procedure to adjust for multiple testing is multiplying all observed p-values $P = p_1, p_2, \dots, p_m$ by the total number of tests $m$: $P' = mP$. This method to compute the adjusted p-values $P'$ was proposed by Olive Dunn in 1961, based on properties of Bonferroni's inequalities [Dunn, 1961], and is commonly referred to as Bonferroni correction. Accepting all null hypotheses with $p'_i > \alpha$ ensures controlling for $\mathrm{FWER} \leq \alpha$. The Bonferroni bound holds under any dependence between tests, but is only close to the true FWER when the conducted tests are independent. In genome-wide tests of association, LD structure in the genome induces dependence between tests, and correction for multiple testing by a strict multiplication with the total number of tests is too conservative.

Permutation-based adjusting for FWER

In order to account for the dependency of the statistical tests in genetic association studies, one can use permutation-based approaches to control the FWER. In these approaches, the link between the parameter of interest, i.e. the genotype, and the observed phenotype is broken by random permutation of the genotype data across individuals. The association study is conducted $T$ times on $T$ random permutations of the data and the p-values of the permutation experiments $\bar{P}$ are compared to the observed p-values of the association study.
For each $p_i$, $p'_i$ is calculated by recording the number of permutations in which any of the permuted p-values $\bar{p}$ is smaller than or equal to $p_i$, and subsequently dividing this number by the total number of permutations $T$. Permutation-based approaches have been employed in whole-genome association studies (about 10,000 genotypes) for yeast [Brem & al., 2002; Ehrenreich & al., 2010; Bloom & al., 2013] and in human genotype to gene expression association studies for adjusting on gene level [1000 Genomes Project Consortium, 2015]. In these studies the computational burden is moderate, whereas permutation studies for human GWAS with millions of SNPs might become impractical.

LD-corrected genome-wide significance threshold

As an alternative to adjusting each p-value individually, a new significance level $\alpha'$ can be specified which controls for the same level of type I errors as $\alpha$ but takes the number of conducted tests into account. For the conservative Bonferroni correction, which does not consider the genomic LD structure, $\alpha' = \frac{\alpha}{m}$. For human GWAS, the correction factor for $\alpha$ has been estimated based on the estimated number of independent variants in the genome. It is based on an observation of the HapMap project [The International HapMap Consortium, 2005] (section 1.6.3), where about 150 independent, common variants were found per 500 kb region. Extrapolating this number to the human genome size of $\sim 3.3$ Gb gives $150 \times 3.3\,\text{Gb} / 500\,\text{kb} = 990{,}000$ independent variants, and for $\alpha = 0.05$ the genome-wide significance threshold was estimated as $\alpha' = 0.05 \times \left(150 \times \frac{3.3\,\text{Gb}}{500\,\text{kb}}\right)^{-1} = 5.05 \times 10^{-8}$. This estimate was later confirmed in a study using different methods for estimating the number of independent variants [Fadista & al., 2016] and is the commonly employed threshold in today's human GWAS. However, this threshold can be different in genetic studies of rare variants (for example [Xu & al., 2014]).

1.7.5. Accounting for population structure and genetic kinship

Confounding of association results based on genotypic differences between cases and controls had been a known challenge before the GWAS era [Spielman & al., 1993] and remains a critical issue. If population structure is not taken into account when testing for genotype-phenotype associations, associations might be observed that simply reflect the underlying population structure and lead to an increase in false positive results. Equally, real effects might be masked and genuine associations missed [Marchini & al., 2004]. In the case-control setting, this problem arises easily when the study consists of (undetected) subpopulations which are not evenly distributed among cases and controls. For SNPs where the allele proportions differ between the hidden subgroups, a false positive association will be recorded [Marchini & al., 2004; Balding, 2006]. Quantitative trait association studies can be subject to similar issues. If the study cohort comprises individuals from different ethnicities, spurious associations can be detected that reflect ethnicity rather than causal variation. Campbell & al. [2005] demonstrated in an association study for height in a European-ancestry cohort that the association could simply be attributed to differences in SNP frequencies across European-ancestry subpopulations. Other studies confirmed allele-frequency differences within cohorts of the same ethnicity [Tian & al., 2008a; Tian & al., 2008b], thus emphasising the need for proper control of population structure. Similar issues arise for a more fine-scaled structure in the cohort induced by samples with different degrees of relatedness. When related individuals are present in the cohort, their genotypes do not reflect random and independent draws from the population frequencies.
While this generally does not affect the allele frequency estimates, their variance might be greater than expected, leading to an overdispersed test statistic and an increased false positive rate, as demonstrated by Bacanu, Devlin and Roeder for case-control settings and quantitative traits [Devlin & Roeder, 1999; Bacanu & al., 2002]. In addition to population structure and relatedness, spurious associations might arise in studies with recently admixed populations, as described by Lander & Schork [1994] and Ewens & Spielman [1995], where false positive disease associations were found due to allele frequency differences in the parent populations. A number of different methods have been developed to correct for confounding genotype structures.

Post-hoc adjusting

The first method to adjust for genetic background structure, Genomic Control, was proposed by Devlin & Roeder [1999]. Genomic Control is based on the hypothesis that genetic background structure generates an inflation of the test statistics. Adjustment for population structure is achieved by estimating the inflation factor $\lambda$ and dividing the test statistic of each association by $\lambda$. Extensions to their initial approach for case-control studies included partially modified approaches for estimating $\lambda$ [Reich & Goldstein, 2001], its application to quantitative traits [Bacanu & al., 2002], and an adjusted approach to take the number of SNPs used for the estimation of $\lambda$ into account [Devlin & al., 2004].

The observation that inflation and sample size seemed to correlate led Yang & al. [2011] to systematically study different parameters influencing inflation, and they found $\lambda$ to be a function of sample size, LD structure and narrow-sense heritability. Importantly, they showed that $\lambda$ is also correlated with the number of causal variants, thus studies on traits with polygenic inheritance can show inflation independent of confounding. Based on this observation, Bulik-Sullivan & al.
[2015] developed LD Score regression, a regression method for distinguishing confounding structures from polygenicity in GWAS. As with Genomic Control, the estimated inflation factor from LD Score regression can be used for the post-hoc adjusting of the test statistic.

Adjusting by subsampling

Shortly after the introduction of Genomic Control, Pritchard & al. [2000] proposed the concept of Structured Associations, where genetic markers unlinked to the phenotype are used to identify subpopulations of samples. Assigning the samples to their respective unstructured subpopulations and testing for association within subpopulations will essentially overcome the problem of population structure present in the overall study population. While useful and employed in association studies for a moderate number of genetic markers and samples [Li & al., 2004; Stein & al., 2009; Kulbrock & al., 2013], it is computationally expensive for large datasets [Price & al., 2006]. In addition, human genetic diversity is better approximated by continuous measures or gradients rather than discrete cluster membership [Serre & Pääbo, 2004; Price & al., 2006].

Relatedness and population estimates as model variables

In contrast to the post-hoc and subsampling approaches, adjusting for population structure and relatedness within the association model is possible by estimating the genetic relationship of the samples and using these estimates as additional variables. Studies on genotype variation in relation to geographical distance have demonstrated that geographic ancestries of individuals can be inferred from genetic markers [Rosenberg & al., 2002; Tang & al., 2005]. Sample clustering based on the genetics is thereby largely correlated with their geographic regions [Rosenberg & al., 2005].
In addition to capturing large-scale population structure, genetic markers have also been employed to estimate shared ancestry and relatedness in natural populations [Lynch & Ritland, 1999; Ritland, 2000; Thomas, 2005]. Price & al. [2006] proposed to use genome-wide genetic markers to estimate a genetic sample-by-sample covariance matrix. The principal components (PCs) of this genetic covariance matrix represent continuous axes of genetic variation and can be used to adjust for population structure, either by a priori regression of the principal components from both the genotype and phenotype data, or by including them as additional covariates in the model. They showed that principal components correctly identified and corrected for population structure based on geographic differences. However, PCs perform poorly in modelling family structure or cryptic relatedness [Yu & al., 2006; Zhao & al., 2007; Kang & al., 2010; Casale & al., 2015]. Yu & al. [2006] have proposed to use a linear mixed model approach to control for population structure and relatedness. The key assumption in this approach is that phenotypic covariance between individuals based on population structure or relatedness is proportional to their relative relatedness. They showed together with Malosetti & al. [2007] and Zhao & al. [2007] that linear mixed models in the analysis of structured samples yield higher power while controlling better for type I errors than Genomic Control, Structured Associations and PCs.

1.7.6. Linear mixed models

Linear mixed models (LMM) describe the linear relationship between the response vector and a number of fixed (deterministic) effects and random (unknown) effects. While fixed effects are modelled by estimating the effect sizes of known explanatory variables (equation (1.2)), random effects model a random variable for which distribution parameters are estimated.
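Specifically, for the response vector of $N$ samples $\mathbf{y} \in \mathcal{R}^{N,1}$, the design matrix of $F$ fixed effects $\mathbf{X} \in \mathcal{R}^{N,F}$ and their respective effect size vector $\boldsymbol{\beta} \in \mathcal{R}^{F,1}$, the design matrix of $U$ random effects $\mathbf{Z} \in \mathcal{R}^{N,U}$ and the random effect $\mathbf{b} \in \mathcal{R}^{U,1}$, the linear mixed model is cast as

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\psi}, \quad \text{with } \mathbf{b} \sim \mathcal{N}(\mathbf{0}, \sigma_b^2\boldsymbol{\Sigma}) \text{ and } \boldsymbol{\psi} \sim \mathcal{N}(\mathbf{0}, \sigma_e^2\mathbf{I}_N). \tag{1.25} $$

As in the simple linear model (equation (1.2)), the residual noise is assumed to follow a normal distribution with mean zero and variance parameter $\sigma_e^2$. In the formulation considered here, the covariance of the random effect is described by a known covariance matrix $\boldsymbol{\Sigma}$ and its variance parameter $\sigma_b^2$. Equation (1.25) can be expressed as the likelihood of the joint probability distribution of $\mathbf{y}$ and $\mathbf{b}$

\begin{align}
p(\mathbf{y}, \mathbf{b} \mid \boldsymbol{\beta}, \sigma_b^2, \sigma_e^2) &= p(\mathbf{y} \mid \boldsymbol{\beta}, \mathbf{b}, \sigma_e^2)\, p(\mathbf{b} \mid \sigma_b^2) \tag{1.26} \\
&= \mathcal{N}(\mathbf{y} \mid \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b}, \sigma_e^2\mathbf{I}_N)\, \mathcal{N}(\mathbf{b} \mid \mathbf{0}, \sigma_b^2\boldsymbol{\Sigma}). \tag{1.27}
\end{align}

To find the estimates of the unknown parameters $\boldsymbol{\beta}$, $\sigma_b^2$ and $\sigma_e^2$, one can first marginalise out $\mathbf{b}$

\begin{align}
p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma_b^2, \sigma_e^2) &= \int p(\mathbf{y} \mid \boldsymbol{\beta}, \mathbf{b}, \sigma_e^2)\, p(\mathbf{b} \mid \sigma_b^2)\, d\mathbf{b} \tag{1.28} \\
&= \mathcal{N}(\mathbf{y} \mid \mathbf{X}\boldsymbol{\beta}, \sigma_b^2\mathbf{Z}\boldsymbol{\Sigma}\mathbf{Z}^T + \sigma_e^2\mathbf{I}_N) \tag{1.29}
\end{align}

and then find the estimates that maximise the marginal likelihood $\mathcal{L}(\boldsymbol{\beta}, \sigma_b^2, \sigma_e^2) = p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma_b^2, \sigma_e^2)$. Estimates are usually found by REML instead of MLE to avoid bias in the estimation of the variance components $\sigma_b^2$ and $\sigma_e^2$. In contrast to the simple linear model (section 1.7.1), the REML estimates of parameters in linear mixed models cannot be obtained in closed form. Different methods for the efficient estimation of the model parameters have been proposed, e.g. [Lippert & al., 2011], but will not be described in detail here. In this thesis, the LMM framework LIMIX and accompanying methods (mtSet) were used to build the association models. Within this framework, the REML estimates of the model parameters are found via Broyden's method [Broyden, 1965]. Details of the implementation can be found in [Casale & al., 2015, Supplementary material].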
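To make the marginal likelihood of equation (1.29) concrete, the following sketch evaluates it for given variance parameters and scans a grid of values for $\sigma_b^2$. It is a simplified illustration under stated assumptions and not the LIMIX implementation: it uses the ML rather than the REML criterion, profiles out $\boldsymbol{\beta}$ by generalised least squares, fixes $\sigma_e^2$ at its simulated value and replaces Broyden's method by a plain grid search. The function name `lmm_marginal_loglik`, the matrix `K` (standing for $\mathbf{Z}\boldsymbol{\Sigma}\mathbf{Z}^T$) and all simulation settings are hypothetical.

```python
import numpy as np


def lmm_marginal_loglik(y, X, K, sigma2_b, sigma2_e):
    """Log of N(y | X beta_hat, sigma2_b * K + sigma2_e * I), cf. (1.29).

    beta_hat is the generalised least squares estimate for the given
    variance parameters; K stands for Z Sigma Z^T.
    """
    N = y.shape[0]
    V = sigma2_b * K + sigma2_e * np.eye(N)
    L = np.linalg.cholesky(V)               # V = L L^T
    Xw = np.linalg.solve(L, X)              # whitened design, L^{-1} X
    yw = np.linalg.solve(L, y)              # whitened response, L^{-1} y
    beta_hat, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    r = yw - Xw @ beta_hat                  # whitened residuals
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    loglik = -0.5 * (N * np.log(2.0 * np.pi) + logdet + r @ r)
    return loglik, beta_hat


rng = np.random.default_rng(42)
N, U = 200, 50                              # assumed numbers of samples/effects
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = rng.normal(size=(N, U))
K = Z @ Z.T / U                             # Z Sigma Z^T with Sigma = I_U / U
b = rng.normal(scale=np.sqrt(0.8), size=U)  # simulated sigma_b^2 = 0.8
noise = rng.normal(scale=np.sqrt(0.5), size=N)   # simulated sigma_e^2 = 0.5
y = X @ np.array([1.0, 0.5]) + Z @ b / np.sqrt(U) + noise

# No closed-form solution exists, so the variance parameter is searched
# numerically; here by a coarse grid with sigma_e^2 held fixed.
grid = np.linspace(0.0, 2.0, 21)
logliks = [lmm_marginal_loglik(y, X, K, s2b, 0.5)[0] for s2b in grid]
best_sigma2_b = grid[int(np.argmax(logliks))]
```

Setting `sigma2_b` to zero recovers the log-likelihood of the simple linear model in equation (1.8), which offers a quick consistency check for such a function.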
Linear mixed models in genetic association studies

In genetic association studies, LMMs describe the trait of interest as the sum of genetic fixed and random effects, i.e. single variants and background genetic effects, possible additional covariates and residual noise:

$$ \mathbf{y} = \mathbf{x}\beta + \mathbf{F}\boldsymbol{\alpha} + \mathbf{g} + \boldsymbol{\psi}, \quad \text{with } \mathbf{g} \sim \mathcal{N}(\mathbf{0}, \sigma_g^2\mathbf{R}) \text{ and } \boldsymbol{\psi} \sim \mathcal{N}(\mathbf{0}, \sigma_e^2\mathbf{I}_N), \tag{1.30} $$

with the phenotype vector $\mathbf{y} \in \mathcal{R}^{N,1}$, the genetic profile of the SNP being tested $\mathbf{x} \in \mathcal{R}^{N,1}$, the effect size of the SNP $\beta \in \mathcal{R}^{1,1}$, the matrix of $K$ covariates $\mathbf{F} \in \mathcal{R}^{N,K}$, the effect of covariates $\boldsymbol{\alpha} \in \mathcal{R}^{K,1}$ and the genetic relatedness matrix $\mathbf{R} \in \mathcal{R}^{N,N}$.

In analogy to equation (1.29), the random effect $\mathbf{g}$ can be marginalised out, leading to the likelihood expression for equation (1.30) as

$$ \mathbf{y} \sim \mathcal{N}(\mathbf{F}\boldsymbol{\alpha} + \mathbf{x}\beta, \sigma_g^2\mathbf{R} + \sigma_e^2\mathbf{I}_N). \tag{1.31} $$

In equation (1.31), the genetic covariance structure of the samples, as expressed by the genetic relatedness matrix $\mathbf{R}$, is integrated in the overall covariance structure of the model. As discussed in the next section, the covariance structure introduced by $\mathbf{R}$ captures population structure and polygenic background and leads to well-behaved statistics under the null model [Yu & al., 2006; Kang & al., 2008].

Estimating the kinship between samples

Traditionally, LMMs have been widely used in association studies with pedigrees of known relationship [Eu-Ahsunthornwattana & al., 2014]. The pedigree relationship between two individuals was used to estimate their predicted proportion of the genome that is identical by descent (IBD). The concept of IBD is based on the random Mendelian sampling of chromosomes during successive meioses from a common ancestor. As such, IBD as a measure is always relative to the founders in the pedigree. IBD estimates can also be generated for a population, where they have to be defined relative to some ancestral population or time point [Browning & Browning, 2010; Glazner & Thompson, 2012].
A matrix of pair-wise IBD estimates is then used as the genetic relatedness matrix $\mathbf{R}$ in the linear mixed model (equation (1.31)).

Alternatively, the genetic relatedness matrix can be estimated from genome-wide genetic marker information. Nejati-Javaremi & al. [1997] showed in simulations that if all loci contributing to a given trait were known, the accuracy of phenotype predictions based on the relatedness matrix estimated from those loci would be higher than for matrices estimated on pedigree information. Similarly, Villanueva & al. [2005] showed that the accuracy of breeding values from relationship matrices computed based on genetic markers is higher than for matrices derived from pedigree information. Extending these simulations, Hayes & al. [2009] showed that the increased prediction accuracy also holds when the relatedness matrix is estimated for a cohort of unknown pedigree using dense genetic markers instead of all true, but unknown causal loci. The use of such a realised relationship matrix (RRM) is now widely employed in GWAS of large cohorts and plant and animal breeding studies as it is able to capture small differences in the proportion of genetic markers that are shared between seemingly unrelated individuals [Lee & al., 2010; Lopes & al., 2013].

A common strategy for the estimation of the RRM, which is used in this thesis, is to compute the average allelic correlation matrix

$$ \mathbf{R} = \frac{1}{S}\mathbf{X}\mathbf{X}^T, \tag{1.32} $$

where $S$ is the number of SNPs used for the estimation and $\mathbf{X}$ is the $N \times S$ matrix of standardised genotypes of the $N$ samples [Patterson & al., 2006; Yang & al., 2011]. To derive the standardisation of the genotypes based on their allele frequency [Patterson & al., 2006; Yang & al., 2011; Casale & al., 2015], consider the bi-allelic genotype $x_i$ at the $i$th SNP in Hardy-Weinberg equilibrium, i.e. with the allele frequencies $p_i + q_i = 1$ and the genotype frequencies $p_i^2 + 2p_iq_i + q_i^2 = 1$.
Here, $p_i$ refers to the reference allele and $q_i$ to the alternative allele, each symbol denoting both the allele and its frequency. In the additive genotype model (section 1.7.2), the genotypes can be described in terms of allele dosage $d$, with $d(p_i, p_i) = 0$, $d(p_i, q_i) = 1$ and $d(q_i, q_i) = 2$. Based on allele dosages, the expected value of the genotype is defined as

\begin{align}
E(x_i) &= d(p_i, p_i) \times p_i^2 + d(p_i, q_i) \times 2p_iq_i + d(q_i, q_i) \times q_i^2 \tag{1.33} \\
&= 2p_iq_i + 2q_i^2 = 2(1 - q_i)q_i + 2q_i^2 = 2q_i. \tag{1.34}
\end{align}

With the expected value of the genotype, its variance and standard deviation can be computed

\begin{align}
Var(x_i) &= E(x_i^2) - E(x_i)^2 \tag{1.35} \\
&= d(p_i, p_i)^2 \times p_i^2 + d(p_i, q_i)^2 \times 2p_iq_i + d(q_i, q_i)^2 \times q_i^2 - (2q_i)^2 \tag{1.36} \\
&= 2q_i(1 - q_i) \tag{1.37} \\
\sigma(x_i) &= \sqrt{Var(x_i)} = \sqrt{2q_i(1 - q_i)} \tag{1.38}
\end{align}

and the genotypes standardised as

$$ \bar{x}_i = \frac{x_i - 2q_i}{\sqrt{2q_i(1 - q_i)}}. \tag{1.39} $$

Different strategies have been proposed for the selection of genetic markers in the RRM estimation, including a two-stepped analysis approach allowing for preselection of phenotype-specific variants [Lippert & al., 2013] and grouping SNPs by haplotype [Zhao & al., 2007; Kang & al., 2008]. The latter avoids the bias introduced by the potentially unequal number of experimentally genotyped/imputed SNPs per haplotype [Speed & al., 2017]. Similarly, choosing only SNPs which are in approximate linkage equilibrium can avoid this bias [Browning, 2008]. As described in [Eu-Ahsunthornwattana & al., 2014], SNPs in approximate linkage equilibrium can be found by strict LD pruning in genomic windows of appropriate size (depending on the organism and study design). Throughout this thesis, RRM estimates are always based on LD-pruned SNP sets.

1.7.7. Joint analysis of multiple phenotypes

Many cohort studies today, ranging from studies in model organisms such as yeast and Arabidopsis thaliana to human, have rich, high-dimensional datasets including molecular, morphological or imaging-derived traits [Bloom & al., 2013; Atwell & al., 2010; Astle & Balding, 2009; Shaffer & al., 2016; Stein & al., 2010].
However, these traits have often been analysed separately, partly for simplicity and partly because of a paucity of models suitable for the analysis of high-dimensional phenotype data. A variety of multi-trait models have been developed which can be broadly grouped into three different classes: i) dimensionality reduction techniques, ii) meta-analysis approaches and iii) multivariate regression models (reviewed in [Shriner, 2012; Yang & Wang, 2012]).

Dimensionality reduction techniques

Dimensionality reduction methods in genotype-phenotype mapping seek to find a suitable projection of high-dimensional phenotypes into a lower dimensional space. Two commonly used dimensionality reduction methods are PCA and canonical correlation analysis (CCA). An overview of other methods and a more detailed description of the methods in this section will be given in chapter 6.

In PCA, the phenotype data is projected onto its principal components, the eigenvectors of the empirical covariance matrix. The amount of variance that each component explains is proportional to its corresponding eigenvalue. The dimensionality reduction is achieved by using the principal components (ordered by decreasing eigenvalue) until the cumulative sum of the eigenvalues reaches a predefined threshold of total phenotypic variance that should be retained. PCA as a dimensionality reduction technique has for instance been used in studies to find links between genotypes and facial features or obesity phenotypes [Liu & al., 2012; Claes & al., 2014; He & al., 2008]. Recently, Aschard and colleagues [Aschard & al., 2014] demonstrated that simply focusing on the principal components with the highest variance might not exploit the full potential of using PCA for genetic association. They propose a model of combined PCA where the PCs are grouped based on the level of variance they explain and show a power gain in detecting genetic associations using this approach.
While the PCA dimensionality reduction approach focuses on the phenotype space and subsequent association with the genotypes, CCA seeks to maximise the canonical (ordered) correlation between the transformed phenotypes and genotypes, i.e. finding the optimal linear transformation of the phenotypes while simultaneously testing for the association with the genotypes. For a single genetic marker, CCA finds the linear phenotype transformation that explains the maximum amount of covariance between this genotype and all traits by solving the eigendecomposition of a complex phenotype-genotype covariance term [Yang & Wang, 2012]. Ferreira and Purcell [2009] showed in simulations that CCA with multiple traits and one genetic marker controls well for type I errors and has increased power compared to univariate tests. In order to extend CCA to more than one marker, the genotypes also have to undergo a linear transformation and the maximum canonical correlation is found by solving two eigenvalue problems. As the number of genotype markers in GWAS exceeds the number of samples, estimates of the genotype covariance term become unreliable [Schäfer & Strimmer, 2005]. Several methods have been developed to circumvent this issue, making use of sparse matrices [Parkhomenko & al., 2009] or a priori grouping of the genotypes [Naylor & al., 2010].

Meta-analysis approaches

Meta-analysis approaches combine the simplicity of the univariate approaches with the advantages of the multivariate approach. For each phenotype, a univariate association study is conducted and the summary statistics of these tests are combined. Many methods for combining the summary statistics [Xu & al., 2003; Yang & al., 2010; Bolormaa & al., 2014] go back to the work by O'Brien [1984], who proposed to use a linear combination of the observed test statistics for each univariate test as the new statistic to be evaluated for significance.
Subsequent studies proposed different methods for choosing the weights in the linear combination of the univariate test statistics, or kept the same computation in principle but re-formulated the alternative hypotheses [Yang & al., 2010; Xu & al., 2003]. These studies showed an increase in power when applying the combined statistic to small marker sets or small numbers of phenotypic traits. Bolormaa & al. [2014] showed that the power gains also hold for genotype to phenotype mapping of 32 traits across all genome-wide markers.

Regression models

There are a number of different regression models that allow for the multivariate analysis of phenotypes. Among them are graphical models, generalised estimating equations and frailty models, for which a summary of methods and applications can be found in [Shriner, 2012; Yang & Wang, 2012]. Here, I will focus on describing the development of multivariate linear regression models for genotype-phenotype mapping.

Before the era of GWAS, Jiang and colleagues [1995] proposed a multi-trait model where the phenotypes are jointly modelled as the sum of the fixed genetic effects of interest, fixed effects for genetic background variation and residual noise. They showed that the joint analysis of correlated traits can increase power to detect the underlying genetics and can increase the precision of the parameter estimates. The significance of the association is determined via a likelihood ratio test of the parameter estimates under the null model, where the fixed genetic effect is zero, and the parameter estimates under the alternative model. The alternative model design depends on the underlying biological hypothesis regarding the effect of the genetic variant. Here, Jiang and colleagues differentiate between hypotheses for a simple joint mapping of phenotypes, pleiotropy and gene-environment interactions.
Methods developed thereafter often use the same underlying hypotheses for the mapping, but different techniques for the evaluation of the significance. For instance, two other groups developed methods for the joint analysis of traits based specifically on the residual sum of squares (RSS) matrix of the standard linear model estimated at each locus tested [Knott & Haley, 2000; Korol & al., 2001]. In the model proposed by Knott and Haley, the different properties and descriptors of the RSS are used to determine the significance of the association. To test for pleiotropy, for instance, the determinant of the RSS at the test locus is compared to the RSS of the null model of no association. In contrast, Korol and colleagues propose to use the RSS of the multi-trait mapping as a means for trait transformation and dimensionality reduction. The resulting one-dimensional trait per sample is then used in a single-trait test for significance.

While the methods described so far have only used fixed genetic effects, Korte and colleagues [2012] were the first to introduce a random genetic effect into the model. Based on the original model by Jiang, they replaced the fixed effect accounting for background genetics with a random effect, turning the multivariate linear model into a multivariate linear mixed model. Since this initial model for multi-trait testing, a number of publicly available linear mixed model frameworks for the genome-wide mapping of a moderate number of traits have been developed [Yang & al., 2014; Lippert & al., 2014; Zhou & Stephens, 2014; Casale & al., 2015]. Compared to the other approaches described above, multivariate linear mixed models have the additional advantage that they can control for complex relatedness and population structure (section 1.7.5).

1.7.8. Linear mixed models for the joint analysis of multiple phenotypes

The multivariate linear mixed model with $P$ phenotypes, indexed by $i \in \{1, 2, \ldots, P\}$, for $N$ samples can be derived as an extension of the univariate model with $P = 1$ phenotype for $N$ samples described in equation (1.31). Consider equation (1.31) as the model description for the $i$th phenotype (omitting covariates for simplicity):

$$\mathbf{y}_i \sim \mathcal{N}\left(\mathbf{x}\beta_i,\; \sigma^2_{g_i}\mathbf{R} + \sigma^2_{n_i}\mathbf{I}_N\right), \tag{1.40}$$

with $\mathbf{x} \in \mathcal{R}^{N,1}$ the genotype, $\beta_i$ the effect size of the genotype for trait $i$, $\sigma^2_{g_i}$ and $\sigma^2_{n_i}$ the variance terms of the genetic and noise random effect for trait $i$, $\mathbf{R} \in \mathcal{R}^{N,N}$ the realised relationship matrix estimated from the genotype data and $\mathbf{I}_N$ the identity matrix. As described by Henderson & Quaas [1976], multivariate LMMs model the covariance between trait $i$ and $j$ as

$$\mathrm{Cov}\left(\mathbf{y}_i, \mathbf{y}_j\right) = \rho_{g_{ij}}\sigma_{g_i}\sigma_{g_j}\mathbf{R} + \rho_{n_{ij}}\sigma_{n_i}\sigma_{n_j}\mathbf{I}_N, \tag{1.41}$$

with $\rho_{g_{ij}}$ and $\rho_{n_{ij}}$ the genetic and noise correlation between trait $i$ and $j$, respectively. Using the LMM described for trait $i$ in equation (1.40) and the expression of the trait-trait covariance term for traits $i$ and $j$ in equation (1.41), the multivariate LMM for all $P$ traits can be expressed in matrix form:

$$\mathbf{Y} = \mathbf{x}\boldsymbol{\beta}^T + \mathbf{G} + \boldsymbol{\psi}, \tag{1.42}$$

with the phenotype matrix $\mathbf{Y}$ and the effect size vector $\boldsymbol{\beta}$

$$\mathbf{Y} = \left[\mathbf{y}_1 \cdots \mathbf{y}_P\right] \in \mathcal{R}^{N,P}, \tag{1.43}$$

$$\boldsymbol{\beta} = \left[\beta_1 \cdots \beta_P\right]^T \in \mathcal{R}^{P,1}, \tag{1.44}$$

the random genetic effect $\mathbf{G}$ and the random noise effect $\boldsymbol{\psi}$ following matrix-variate normal distributions with row covariances $\mathbf{R}$ and $\mathbf{I}_N$ and column covariances $\mathbf{C}_g$ and $\mathbf{C}_n$

$$\mathbf{G} \sim \mathcal{MN}_{N,P}\left(\mathbf{0}, \mathbf{R}, \mathbf{C}_g\right), \tag{1.46}$$

$$\boldsymbol{\psi} \sim \mathcal{MN}_{N,P}\left(\mathbf{0}, \mathbf{I}_N, \mathbf{C}_n\right), \tag{1.47}$$

and the genetic and noise trait-by-trait covariance matrices $\mathbf{C}_g$ and $\mathbf{C}_n$

$$\mathbf{C}_g = \begin{bmatrix} \sigma^2_{g_1} & \rho_{g_{12}}\sigma_{g_1}\sigma_{g_2} & \cdots & \rho_{g_{1P}}\sigma_{g_1}\sigma_{g_P} \\ \rho_{g_{12}}\sigma_{g_1}\sigma_{g_2} & \sigma^2_{g_2} & \cdots & \rho_{g_{2P}}\sigma_{g_2}\sigma_{g_P} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{g_{1P}}\sigma_{g_1}\sigma_{g_P} & \rho_{g_{2P}}\sigma_{g_2}\sigma_{g_P} & \cdots & \sigma^2_{g_P} \end{bmatrix}, \tag{1.48}$$

$$\mathbf{C}_n = \begin{bmatrix} \sigma^2_{n_1} & \rho_{n_{12}}\sigma_{n_1}\sigma_{n_2} & \cdots & \rho_{n_{1P}}\sigma_{n_1}\sigma_{n_P} \\ \rho_{n_{12}}\sigma_{n_1}\sigma_{n_2} & \sigma^2_{n_2} & \cdots & \rho_{n_{2P}}\sigma_{n_2}\sigma_{n_P} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n_{1P}}\sigma_{n_1}\sigma_{n_P} & \rho_{n_{2P}}\sigma_{n_2}\sigma_{n_P} & \cdots & \sigma^2_{n_P} \end{bmatrix}. \tag{1.50}$$

The matrix-variate distribution of the phenotype matrix $\mathbf{Y}$ can be expressed in terms of a multivariate normal distribution (for details refer to equation (C.1) to equation (C.8) in the appendix)

$$\mathrm{vec}\left(\mathbf{Y}\right) \sim \mathcal{N}_{N \times P}\left(\mathrm{vec}\left(\mathbf{x}\boldsymbol{\beta}^T\right),\; \mathbf{C}_g \otimes \mathbf{R} + \mathbf{C}_n \otimes \mathbf{I}_N\right), \tag{1.51}$$

where the Kronecker products $\mathbf{C}_g \otimes \mathbf{R}$ and $\mathbf{C}_n \otimes \mathbf{I}_N$ follow the definition of the Kronecker product $\otimes$ for any two matrices:

$$\mathbf{C}_g \otimes \mathbf{R} = \begin{bmatrix} \mathbf{C}_{g_{11}}\mathbf{R} & \cdots & \mathbf{C}_{g_{1P}}\mathbf{R} \\ \vdots & \ddots & \vdots \\ \mathbf{C}_{g_{P1}}\mathbf{R} & \cdots & \mathbf{C}_{g_{PP}}\mathbf{R} \end{bmatrix} \quad \text{and} \quad \mathbf{C}_n \otimes \mathbf{I}_N = \begin{bmatrix} \mathbf{C}_{n_{11}}\mathbf{I}_N & \cdots & \mathbf{C}_{n_{1P}}\mathbf{I}_N \\ \vdots & \ddots & \vdots \\ \mathbf{C}_{n_{P1}}\mathbf{I}_N & \cdots & \mathbf{C}_{n_{PP}}\mathbf{I}_N \end{bmatrix}.$$

The likelihood of the multivariate linear mixed model is

$$\mathcal{L}\left(\boldsymbol{\beta}, \mathbf{C}_g, \mathbf{C}_n\right) = \mathcal{N}\left(\mathrm{vec}\left(\mathbf{Y}\right) \mid \mathrm{vec}\left(\mathbf{x}\boldsymbol{\beta}^T\right),\; \mathbf{C}_g \otimes \mathbf{R} + \mathbf{C}_n \otimes \mathbf{I}_N\right). \tag{1.52}$$

Maximising $\mathcal{L}(\boldsymbol{\beta}, \mathbf{C}_g, \mathbf{C}_n)$ requires $P$ parameter estimates for the fixed effect $\boldsymbol{\beta}$ and $\frac{1}{2}P(P + 1)$ parameter estimates for each of the symmetric $P \times P$ covariance matrices $\mathbf{C}_g$ and $\mathbf{C}_n$. Due to the large number of parameters, REML for multivariate LMM often relies on gradient-based optimisation methods. Different schemes have been used in LMM for genetic analysis, including average information REML [Gilmour & al., 1995] (used in [Yang & al., 2011]), quasi-Newton methods like Broyden’s method [Broyden, 1965] (used in [Casale & al., 2015]), and Brent’s algorithm [Brent, 1971] (used in [Lippert & al., 2011; Svishcheva & al., 2012]). The REML implementation of the framework used in this thesis builds on Broyden’s method and the detailed derivation can be found in [Casale & al., 2015, Supplementary material]. Commonly used multi-trait association frameworks and their implementation are discussed in detail in the introduction of chapter 4.

Hypothesis testing in multi-trait association studies

As described by Jiang & Zeng [1995] and Korte & al. [2012] (summarised in section 1.7.7, regression models), when testing the association of a genetic marker across multiple phenotypes, different hypotheses for the underlying genetic trait architecture can be formulated. In the most simple case, one can test if the genetic variant has an effect on any of the $P$ traits (any effect test), i.e.
the effect size of the fixed effect $\boldsymbol{\beta}$ is unequal to zero for at least one trait: $H_A: \boldsymbol{\beta} \neq \mathbf{0}_P$. In this $P$-degrees-of-freedom test, the corresponding null hypothesis of no association is that the effect size of the fixed effect is equal to zero for all traits: $H_0: \boldsymbol{\beta} = \mathbf{0}_P$. In the common effect model, the variant has the same effect size across all traits with $H_A: \boldsymbol{\beta} = \mathbf{1}_P\beta$ and is tested for significance in a one-degree-of-freedom test versus the null hypothesis of no association ($\beta = 0$). A more complicated model allows testing for specific effects of the variant on a given trait $p$. This can be tested with a one-degree-of-freedom test where a model containing a common effect across all traits and a specific effect for trait $p$ is compared against the common effect model.

2 Cardiac biology

In chapter 7 and chapter 8, I investigate genetic associations of human heart morphology. To aid with an understanding of the relevant biology and key terms, I use this chapter to give a basic overview of human heart morphology, cardiovascular diseases and their underlying genetics.

The human heart is composed of four chambers, the left and right ventricle and the left and right atrium. On the outside it is covered by a tough membranous structure, the pericardium. The innermost layer of the pericardium, the epicardium, is fused to the heart and forms part of the heart wall. It directly connects to the myocardium, the thickest layer of the heart wall, which is composed of conducting and contractile cardiomyocytes. On the inside, the myocardium is lined by the endocardium [Betts & al., 2013].

The four chambers of the heart (figure 2.1) are separated through two septal structures, the interventricular and the atrioventricular septum. The blood exchange between the atria and ventricles is enabled through a set of valves embedded in the atrioventricular septum: the mitral valve between the left atrium and ventricle and the tricuspid valve between the right atrium and ventricle.
In addition, each ventricle has a valve at its exit point. In the right ventricle, the pulmonary valve separates the ventricle from the pulmonary artery. Similarly, the aortic valve separates the left ventricle from the aorta. There is no direct blood exchange between the left and the right side of the heart in healthy individuals.

Diseases of the heart are common and one of the leading health issues worldwide. They include a wide range of disorders from atherosclerosis, diseases of the myocardium and the heart’s electrical circuit to congenital heart diseases. To help with an understanding of these disease pathologies, the circulatory and conduction system as well as the development of the heart are described below.

[Figure 2.1, panel labels. A: 1 tricuspid valve, 2 pulmonary valve, 3 aortic valve, 4 mitral valve, 5 interventricular septum; LV, RV, RA, LA, aorta, pulmonary veins, pulmonary artery, inferior vena cava, superior vena cava. B: 1 Bachmann’s bundle, 2 internodal path, 3 bundle of His, 4 atrioventricular bundle branches, 5 Purkinje fibers; SA, AV.]

Figure 2.1: Anatomy, circulatory and conduction system of the human heart. A. Circulatory system. Deoxygenated blood (blue arrows) arrives at the right atrium (RA) from the systemic circulation. From the right atrium, it enters the right ventricle (RV) through the tricuspid valve. It leaves the right ventricle through the pulmonary valve into the pulmonary artery, entering the pulmonary circuit. Oxygenated in the lung, blood (red arrows) arrives back at the heart at the left atrium (LA) through the pulmonary veins and enters the left ventricle (LV), passing the mitral valve. It leaves the left ventricle through the aorta, entering the systemic circulation. Anatomy: The myocardium of the left ventricle is significantly thicker than that of the right ventricle, as it has to overcome the greater pressure of the systemic circuit. The walls of the atria are smooth, whereas the ventricles show protrusions.
The atrioventricular septum separating atria and ventricles is not shown for simplicity. It is located at the level of the tricuspid and mitral valves. B. Conduction system. The sinoatrial (SA) node initiates the contraction of the heart by sending an action potential through the atria via cell-cell contact and specific pathways (Bachmann’s bundle and internodal paths). The potential arrives at the atrioventricular (AV) node, where it is delayed to allow for full contraction of the atria before it is passed on to the Purkinje fibers through the bundle of His and the atrioventricular bundle branches. The Purkinje fibers pass the signal on to the ventricles, leading to their contraction and the pumping of the blood out of the heart.

2.1. Cardiac cycle

The cardiac cycle begins with the contraction of the atria and ends with the relaxation of the ventricles. During the cycle, the chambers of the heart can be found in two distinct states, systole and diastole. In systole, the chambers contract and pump blood either into the ventricles (atria) or out of the heart (ventricles). In diastole, the chambers are relaxed and fill with blood. Both atria and ventricles cycle through these states, coordinated by impulses sent from the conduction system. Ventricles are in diastole when atria undergo systole and vice versa.

In atrial diastole, the valves separating atria and ventricles are open and facilitate passive blood flow into the ventricles. When the cardiac cycle starts, the atria enter systole and pump the remaining blood into the ventricles. The amount of blood contained in the ventricles at the end of atrial systole/ventricular diastole is referred to as end diastolic volume. When the ventricles enter systole, the pressure in the ventricles rises above that in the atria, which are in diastole, and the separating valves close in response to the increased pressure.
Once the ventricular pressure overcomes the pressure in the aorta and pulmonary arteries, the respective valves open and equivalent amounts of blood are pumped into the systemic and pulmonary circulation. The larger and higher resistance vessels of the systemic circulation compared to the low-pressure vessels of the pulmonary system put a higher demand on the left ventricle, which is met by a proportionally higher mass of the left ventricle compared to the right. The amount of blood that each ventricle can pump within one cardiac cycle is defined as the stroke volume. The volume of blood remaining in the ventricle at the end of systole is referred to as end systolic volume. End diastolic volume, stroke volume and end systolic volume are important clinical parameters [Betts & al., 2013].

2.2. Conduction system

The conduction system of the heart establishes the heart rhythm through electrical impulses sent by specialised myocardial conducting cells. The normal cardiac rhythm, called sinus rhythm, is established by the sinoatrial node, which is located at the junction of the superior vena cava and the right atrium (figure 2.1B). The sinus node is also called the pacemaker of the heart, since the signal leading to the activation of the myocardial contractile cells and, in consequence, their contraction starts here. Upon initiation of the action potential in the sinus node, the depolarisation spreads through the atria to the atrioventricular node via cell-cell contacts, the internodal pathways and Bachmann’s bundle [Laske & Iaizzo, 2005; Anderson & al., 2009]. The atrioventricular node is located within the atrioventricular septum, which prevents the signal from spreading directly to the ventricles without being processed. At the atrioventricular node, the signal is delayed to allow the atria to complete their contraction, which pumps the blood into the ventricles.
From the atrioventricular node, the signal is propagated along the interventricular septum through the bundle of His, which divides into the atrioventricular bundle branches. These in turn connect with Purkinje fibers at the apex of the heart, which propagate the impulse to the myocardial contractile cells in the ventricles. The contraction of the ventricles follows the direction of the impulse and travels from the apex towards the base, pumping blood out of the ventricles and into the aorta and pulmonary arteries [Laske & Iaizzo, 2005; Sigg & al., 2010].

2.3. Heart development

The heart is the first functional embryonic organ and already starts to beat by the end of the third week of development [Zambrano & al., 2002]. In the developing heart, three major processes have to be orchestrated: the formation and arrangement of the myocardium into the four-chamber heart, the development of the conduction system, and the development of the heart’s circulatory system required for nutrition and oxygen supply to the myocardium. The first two processes happen simultaneously, while the latter can only take place after proper development of the myocardium.

The development of the heart starts in the third week of development, just after gastrulation. In gastrulation, the single-layered sheet of epithelial cells that forms the embryo is re-organised into three germ layers, the ectoderm (external layer), mesoderm (middle layer) and endoderm (internal layer). Each layer will give rise to different tissues and organs in the developing embryo. The heart development begins with the formation of two cardiac crescents from the mesodermal layer (figure 2.2, 1), which are located near the head of the embryo [Christoffels & al., 2000]. Within each cardiac crescent, two structures develop, a plate of myocardial cells and a plexus of endothelial strands. These develop into cardiogenic cords, with the endothelial strands forming a tube structure enveloped by a layer of myocardial cells.
By the fusion of the two cardiogenic cords, the early tubular heart is formed (figure 2.2, 2). This early tubular structure already shows peristaltic contraction, despite the lack of valves and conduction system [Goss, 1938; de Jong & al., 1992; Moorman & Lamers, 1994]. The tubular heart then undergoes a right-ward looping where an initial differentiation into ventricular myocardium, atrial myocardium and transitional zones occurs (figure 2.2, 3). The transitional zones will form parts of the septa, valves, conduction system and fibrous heart skeleton [Gittenberger-de Groot & al., 2005]. Through the looping of the heart an inner and an outer curvature is created. The developing atria and ventricles stand out on the outer curvature, whereas transitional zones are brought into proximity on the inner curvature (figure 2.2, 4).

Figure 2.2: Embryonic heart development. 1. The mesoderm gives rise to two cardiac crescents that already show some extent of asymmetry. 2. The cardiac crescents have fused together to form a straight heart tube. 3. The straight heart tube starts a right-ward looping. Parts marked in red will develop into the ventricles, while parts marked in turquoise will become atria. 4. The looping heart with precursors of the atria (A), the left ventricle (LV), the right ventricle (RV) and the outflow tract (OFT). Ring-like structures mark the transitional zones: sinoatrial ring (SAR), atrioventricular ring (AVR), primary ring (PR), ventriculoarterial ring (VAR).

Correct looping and positioning of the transitional zones are critical for the separation of the heart into its functional components. The separation is facilitated through septation at the atria, the ventricles and the arterial pole. For the separation of the ventricles, two processes have to be considered, the inflow and outflow septation. The inflow septation, i.e.
the septation of the ventricles from one another and from the atria, is mainly achieved through the primary ring. The primary ring gives rise to the ventricular septum that separates the left and right ventricle. This process has to be coordinated with the positioning of the atrioventricular ring, which is pulled towards the right ventricle by a tightening of the inner curvature. The positioning of the atrioventricular ring above the left and right ventricle builds the base for the formation of the mitral and tricuspid valve, respectively, which will separate the atria from the ventricles. The septation controlling the blood flow from the ventricles to the arteries (outflow septation) is achieved through the twisting of the ventriculoarterial ring into the precursors of the pulmonic and aortic valve and their positioning above the right and left ventricle.

At the end of week nine in development, the heart consists of the four chambers divided by septa with integrated valves. Morphologically, atria and ventricles can be distinguished based on the structure of their myocardium. While the myocardium of the atria is thin and has a smooth surface, the ventricles show a much thicker myocardium with protrusions (trabeculations) running along the endocardial surface.

During these rearrangement processes the myocardium also differentiates into the contracting and conducting myocardium. While many components of the gene regulatory networks that control the differentiation are known today, mechanisms involved in controlling this differentiation on a cellular and region-specific level remain to be discovered [Christoffels & Moorman, 2009; Paige & al., 2015; Park & Fishman, 2017]. Structures important in the development of the conduction system are the sinoatrial ring, which will develop into the sinoatrial node, the primary ring, which will give rise to the atrioventricular conduction system, and the atrioventricular ring, which develops into Bachmann’s bundle.

2.4. Common cardiovascular diseases

According to the International Statistical Classification of Diseases and Related Health Problems, the classification system of the World Health Organisation, total cardiovascular diseases include hypertension, hypercholesterolemia, coronary heart disease, cardiac arrhythmias, congenital heart diseases and cardiomyopathies (classification codes I00-I99, Q20-28, version ICD-10 [World Health Organisation, 2016]).

The largest contributors to cardiovascular diseases are coronary heart diseases. Their major clinical manifestations are myocardial infarction (commonly known as heart attack), angina pectoris (chest pain), and sudden coronary death [Wong, 2014]. The common cause of coronary heart diseases is an interrupted blood and consequently oxygen supply to the heart through a blockage of the coronary arteries. Major risk factors are high blood pressure (hypertension) and high blood cholesterol (hypercholesterolemia) [Mackay & al., 2004].

Cardiac arrhythmias are a class of diseases where the observed cardiac rhythm is different from the regular sinus rhythm. They are caused by irregularities of impulse generation and/or conduction. Tachycardia is the condition of an increased heart rate, whereas bradycardia describes a lower than normal heart rate. They can cause a reduction in cardiac output and myocardial blood flow and may be life-threatening [Durham & Worthley, 2002].

Congenital heart diseases are diseases with structural abnormalities of the heart or intrathoracic great vessels that are of functional significance and have been present since birth [Mitchell & al., 1971]. They may be caused by genetic or environmental factors during pregnancy and include ventricular outflow tract obstructions, i.e. narrow or blocked arteries and valves, and septal defects. Of the latter, interventricular septal defects are the most common [Hoffman, 2005].

Cardiomyopathies describe a class of diseases where the heart muscle fails to function properly.
Traditionally, they are classified based on their anatomy and hemodynamics into hypertrophic, dilated, or restrictive cardiomyopathy. The incidence of the latter is rare and no changes in ventricular morphology are observed. This is in stark contrast to hypertrophic and dilated forms, where an increase in ventricular wall thickness or volume is observed, respectively. The increase in wall thickness is caused by a hypertrophy of existing myocytes rather than a hyperplasia as in the developing heart [Lorell & Carabello, 2000]. Dilated cardiomyopathy presents with an increase in cardiac chamber volume and often a modest increase in wall thickness. Both mechanisms arise in response to cardiac stress and initially improve heart function, but in the long run increase myocardial strain and raise metabolic demands [Seidman & Seidman, 2001].

Cardiovascular diseases are caused by a combination of environmental and genetic risk factors. Amongst the environmental risk factors one can distinguish between modifiable risks governed by the individual itself and exposure to risk factors which are often beyond the influence of the individual. The latter include exposure to solvents, pesticides or extremes in noise and temperature [Bhatnagar, 2004; Brook & al., 2010; Babisch, 2014]. Modifiable risk behaviours such as smoking, physical inactivity and a poor diet have been shown to be highly correlated with the incidence of cardiovascular diseases (reviewed in [O’Toole & al., 2008; Cosselman & al., 2015]). Meta-studies examining behavioural change in the English, Welsh and American populations over a period of 20 years have shown a decline in coronary heart disease mortality due to a reduction in smoking, increased physical activity and other behavioural factors [Unal & al., 2004; Ford & al., 2007]. Genetic risk factors for cardiovascular diseases are described in the next section.

2.5. Genetics of cardiovascular diseases

The genetics of cardiovascular diseases point both to simple Mendelian and complex inheritance patterns. In multiple linkage analysis studies of familial myocardial hypertrophy, several genes have been discovered where mutations segregate in a Mendelian fashion. These include mutations in cardiac myosin heavy chain [Geisterfer-Lowrance & al., 1990], 𝛼 tropomyosin, cardiac troponin T and C [Thierfelder & al., 1994; Kimura & al., 1997] and cardiac myosin binding protein [Carrier & al., 1993; Bonne & al., 1995]. Another group of familial cardiovascular diseases, familial hypertension, has been linked to mutations in epithelial sodium channels SCNN-2 and SCNN3-3 [Boyden & al., 2012; Glover & al., 2014] as well as KLH3-CUL3, genes coding for proteins building a complex involved in sodium-chloride reabsorption in the kidney [Hansson & al., 1995]. Linkage studies have also pinpointed genes for atrial and ventricular septal defects. They are linked to mutations in the transcription factors GATA4 [Schott & al., 1998] and NKX2-5 [Garg & al., 2003], respectively.

In contrast, the majority of cardiovascular traits follow complex inheritance patterns with interaction between multiple genes and non-genetic factors [Kathiresan & Srivastava, 2012]. GWAS have been successful in finding genetic loci associated with a large number of cardiovascular diseases. Out of the 4,148 studies in the GWAS catalogue (accessed 11.08.2017), 159 contain phenotype descriptions relating to cardiovascular diseases (list of query terms in table A.1 in the appendix).

[Figure 2.3: bar chart of the number of studies (y-axis) per phenotype class (x-axis): congenital heart disease, coronary heart disease, blood pressure, heart rate, electrocardiographic traits, morphological traits, heart failure, others.]

Figure 2.3: GWAS on heart-related phenotypes. Overview of 153 GWAS studies with 59 unique heart-related phenotypes (obtained from the GWAS catalogue [MacArthur & al., 2017, accessed on 11.08.2017]).
Phenotypes were grouped into eight phenotype classes. The list of query terms and their grouping can be found in table A.1 in the appendix.

The highest number of studies has been conducted on blood pressure phenotypes, followed by electrocardiographic traits and coronary heart diseases (figure 2.3). Early GWAS on these traits were conducted on samples of the Framingham Heart Study, a community-based cohort study founded in 1948 to examine the epidemiology of cardiovascular disease [Dawber & al., 1951; Kannel & McGee, 1979]. The Framingham Heart Study 100K SNP genome-wide association study resource was published in 2007 [Cupples & al., 2007] and its 1,345 participants built the basis for 17 GWAS on traits like echocardiographic dimension [Vasan & al., 2007], blood pressure [Levy & al., 2007] and heart rate [Newton-Cheh & al., 2007]. Later studies often contained larger sample sizes or re-analysed previously published studies in meta-analyses. For instance, the International Consortium for Blood Pressure conducted a meta-analysis of 29 previously published GWAS on systolic and diastolic blood pressure phenotypes and discovered 16 novel loci, ten of which were associated with known blood pressure-related genes [Ehret & al., 2011]. Similarly, the large consortium for coronary heart diseases (CARDIoGRAM) conducted a case-control meta-analysis and identified ten novel loci [Nikpay & al., 2015].

The other classes of phenotypes are smaller and more heterogeneous, comprising different congenital heart diseases, e.g. congenital left-sided heart lesion [Mitchell & al., 2015; Hanchard & al., 2016] and conotruncal heart defects (i.e. malformations of the cardiac outflow tracts) [Agopian & al., 2014], or morphological traits including cardiomyopathies [Villard & al., 2011] and cardiac wall thickness [Vasan & al., 2009; Arnett & al., 2011].

2.6. Thesis outline

In the following chapters, I describe new methods and applications for the genetic analysis of high-dimensional datasets.

In chapter 3, I introduce the R package that I developed for the simulation of complex phenotype structures. Simulated phenotypes serve as an approximation for observed biological phenotypes and are invaluable for model development. All phenotypes simulated in this thesis are generated based on the strategies described in this chapter. The simulation strategy and applications have been published in [Meyer & Birney, 2018].

Chapter 4 presents LiMMBo, a new approach for finding genetic associations in high-dimensional phenotypes using linear mixed models. I first demonstrate model calibration and power on simulated datasets before I apply LiMMBo to a publicly available dataset of yeast growth traits in chapter 5. A manuscript of LiMMBo and its application is currently under revision and already available in pre-print [Meyer & al., 2018].

In chapter 6, I systematically analyse twelve unsupervised dimensionality reduction methods for their ability to find robust phenotype representations of simulated data with different structure and size. I introduce a new stability measure for choosing the low-dimensional representations and demonstrate that the selected representation can recover genetic associations.

Finally, I investigate genetic associations for human heart morphology based on magnetic resonance imaging (MRI) data of 1,500 healthy individuals. In chapter 7, I apply the methods and measures described in chapter 6 to obtain a low-dimensional representation of the heart morphology and conduct a GWAS based on this representation. Chapter 8 describes the GWAS on a cardiac trabeculation phenotype derived from a supervised feature extraction approach on the MRI data.
The work in these chapters was done in collaboration with Antonio De Marvao, Jiashen Cai, Pawel Tokarczuk, Declan O’Regan and Stuart Cook from Imperial College London. Specifically, phenotype acquisition and feature extraction were done by my collaborators, while I was responsible for all remaining analyses, including genotype quality control and imputation. An initial paper using the imputed genotypes was recently published [Biffi & al., 2017].

3 PhenotypeSimulator

For method development in quantitative genetics, one often needs a set of well-characterised genotypes and phenotypes for which the ground truth is known and against which model performance can be compared. In the context of this thesis, genotype and phenotype simulations were crucial for the development of a new method for multi-trait mapping of high-dimensional datasets (chapter 4) and the evaluation of different dimensionality reduction techniques (chapter 6).

The complexity of the simulated phenotype components will depend on the specifics of the model that is being developed. With the detailed whole-genome genotype data available through standard techniques such as genotyping arrays and subsequent imputation, and the measurement of multiple traits per sample, the complexity of the hypotheses for testing the underlying genetics of the observed phenotypes has increased. Models range from simple linear models with a few fixed effects on a single trait to complex linear mixed models with fixed and random effect components on multiple traits [Stephens, 2013; Marigorta & Gibson, 2014; Zhou & Stephens, 2014; Loh & al., 2014]. With the increase in analysis complexity, sophisticated approaches for modelling realistic genotype and phenotype structures are needed. These simulated genotypes and phenotypes reflect our perceived understanding of the true phenotype structure and do not guarantee the biological correctness of real phenotypes. However, they are invaluable in model design, as any
However, they are invaluable in model design, as any 71 model showing flawed statistics on the possibly simplified biological model will suf- fer from at least the same flaws on the true biological data. In this chapter, I will first describe simulation strategies for genotypes with dif- ferent levels of population structure and relatedness. Following that, I introduce the phenotype simulation strategy used for all simulated datasets within this thesis. In order to broadly distribute this simulation framework, I have developed Pheno- typeSimulator, an R package for phenotype simulation that allows for a flexible and customisable simulation set-up. PhenotypeSimulator can be installed from the Com- prehensive R Archive Network [Meyer, 2017] and its code is available on github: https://github.com/HannahVMeyer/PhenotypeSimulator. PhenotypeSimulator is published as: Meyer, H. & Birney E. (2018) PhenotypeSimulator: A comprehensive framework for simulating multi-trait, multi-locus genotype to phenotype relation- ships, Bioinformatics, bty197. 3.1. Genotype simulation There are a number of different strategies to generate genotype data for genetic as- sociation studies. In the most simple case and assuming bi-allelic SNPs, each SNP is simulated from a binomial distribution with two trials and probability equal to the given allele frequencies (e.g in [Lippert & al., 2013]). This simple approach, how- ever, does not simulate any dependency between the genotypes as is observed with LD structure in the genome. In order to mimic genomic LD structure and allele frequency distributions in the simulated dataset, three general approaches exist: i) backward-time or coalescent simulation, ii) forward time and iii) resampling ap- proaches. 
The coalescent [Hudson, 2002; Ewing & Hermisson, 2010; Kelleher & al., 2016] and forward-time approaches [Peng & al., 2007; Hoggart & al., 2007; Carvajal-Rodríguez, 2008] use population genetic models to simulate genotypes and are particularly useful for studying evolution and demography. However, they often suffer from high computational demands for diploid genome-wide SNP data [Liu & al., 2008; Yuan & al., 2012]. Resampling approaches [Wright & al., 2007; Su & al., 2011; Loh & al., 2014; Casale & al., 2015] offer a practical solution that can be used to efficiently generate genetic data with different relatedness and population structures, which is particularly useful in genetic association studies. They combine existing genotype data into the genotypes of the simulated samples, thereby retaining allele frequency and LD patterns.

I chose to follow the resampling strategies described in [Loh & al., 2014; Casale & al., 2015], where each diploid individual is simulated as a mosaic of real genotypes from different populations. Depending on the simulation set-up, cohorts with differing levels of population structure and relatedness can be simulated. Cohorts with different degrees of genetic structure are valuable for evaluating the performance of genetic association models with respect to their adjustment for genetic relatedness and population structure. As far as I am aware, these structures cannot be realised with the publicly available tools described in [Wright & al., 2007; Su & al., 2011].

As the sampling dataset, I used the genotype data from 365 individuals of four European ancestry populations from the 1000 Genomes Project [1000 Genomes Project Consortium, 2015]: Utah Residents (CEPH) with Northern and Western European Ancestry (CEU), Finnish in Finland (FIN), British in England and Scotland (GBR) and Toscani in Italia (TSI). The resampling strategy works as follows:

1. each individual is randomly assigned a predefined number of unique original genotypes which will serve as its ancestors;
2. the ancestors’ genome-wide genotypes are split into blocks of 1,000 SNPs;
3. for each SNP block, one of the ancestors is chosen at random and its genotype is copied to the individual’s genome.

The number and the sub-population of ancestors that are chosen for simulating the genomes of a new cohort are critical for controlling the level of structure within the cohort.

The number of ancestors $N$ sets the level of relatedness within the cohort. Low values of $N$ introduce relatedness among individuals, while high values of $N$ lead to low levels of structure and relatedness. For instance, with $N = 2$, each individual in the newly synthesised cohort is composed of genotypes from only two out of the 365 individuals. Consider individual $g_1$, whose genotypes are drawn from ancestors $a_1$ and $a_2$. A second individual $g_2$ shares at least one ancestor with $g_1$ with probability $p = 1 - \binom{363}{2} / \binom{365}{2} \approx 0.01$. For exactly one shared ancestor, each SNP block has a 25% probability of being the same between $g_1$ and $g_2$. With $N = 10$, the probability of at least one common ancestor increases ($p = 1 - \binom{355}{10} / \binom{365}{10} \approx 0.25$). However, for exactly one shared ancestor, the sharing of SNP blocks decreases to 1%.

The choice of sub-population determines the level of population structure in the simulated genotypes: allowing for random selection of ancestors independent of the four sub-populations in the 1000 Genomes dataset yields low levels of population structure, as this leads to a random sampling of the individuals’ genotypes across the ancestors’ ethnicities. Including an a priori selection of individuals from one of the four sub-populations and subsequently restricting ancestor selection to these individuals will restrict an individual’s genotypes to a single sub-population.
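The three resampling steps and the shared-ancestor probabilities quoted above can be reproduced with a short sketch. The function names, the toy pool of 20 individuals with 50 SNPs and the block size of 10 are illustrative assumptions; the thesis uses 365 individuals from the 1000 Genomes Project and blocks of 1,000 SNPs.

```python
import math
import random


def prob_shared_ancestor(n_pool, n_ancestors):
    """Probability that two individuals share at least one ancestor when each
    draws n_ancestors unique ancestors at random from a pool of size n_pool."""
    none_shared = math.comb(n_pool - n_ancestors, n_ancestors) / math.comb(n_pool, n_ancestors)
    return 1 - none_shared


def resample_individual(pool, n_ancestors, block_size, rng):
    """Build one genotype vector as a block-wise mosaic of ancestors from pool."""
    # Step 1: assign a predefined number of unique ancestors.
    ancestors = rng.sample(range(len(pool)), n_ancestors)
    n_snps = len(pool[0])
    genome = []
    # Step 2: walk along the genome in blocks of block_size SNPs.
    for start in range(0, n_snps, block_size):
        # Step 3: copy the block from one randomly chosen ancestor.
        donor = pool[rng.choice(ancestors)]
        genome.extend(donor[start:start + block_size])
    return genome, ancestors


rng = random.Random(7)
# Hypothetical toy pool standing in for the real sampling dataset.
pool = [[rng.choice([0, 1, 2]) for _ in range(50)] for _ in range(20)]
genome, ancestors = resample_individual(pool, n_ancestors=2, block_size=10, rng=rng)

p_two = prob_shared_ancestor(365, 2)    # about 0.01, as stated in the text
p_ten = prob_shared_ancestor(365, 10)   # about 0.25, as stated in the text
```

In this sketch, population structure would be introduced by passing only the individuals of one sub-population as `pool`, which mirrors the a priori restriction of ancestor selection described above.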
As all individuals in the cohort are now composed of distinct genotype subsets, this will give rise to population structure in the simulated cohort.

I simulated three genotype sets, each with 1,000 samples, that differed in i) the number of ancestors $N$ from which the genotypes were chosen and ii) the sub-populations the ancestors were chosen from:

A. unrelatedPopStructure: unrelated individuals with prior assignment of ancestral population ($N = 10$; ancestors only from CEU, only from FIN, only from GBR or only from TSI)
B. unrelatedNoPopStructure: unrelated individuals with mixed ancestral population ($N = 10$; ancestors from CEU, FIN, GBR and TSI)
C. relatedNoPopStructure: related individuals with mixed ancestral population ($N = 2$; ancestors from CEU, FIN, GBR and TSI)

The level of structure and relatedness introduced by this simulation strategy can be visualised by examining the genetic relationship matrix and the PCs of the genotypes. The genetic relationship matrix is estimated as an RRM via equation (1.32) and serves as a measure of relatedness between the individuals, while the PCs reflect the genotypic variance in the data (section 1.7.5). The hierarchical clustering of the genetic relationship estimates and scatter plots of the first two PCs for each genotype set are shown in figure 3.1. Samples cluster tightly based on their ancestral sub-populations (figure 3.1A), while there is no clustering and an even spread in the PC plot for the cohort of unrelated individuals with ancestors sampled across all sub-populations (figure 3.1B). The cohort of related individuals shows less spread in the second principal component and higher individual genetic relationship estimates (figure 3.1C).

Figure 3.1: Genetic relationship matrices and principal components of three simulated European ancestry cohorts. The genotypes were simulated based on genotype data from four European ancestry populations (ancestry colour key in panel A).
Depending on the choice and number of ancestors for the sampling of chromosomes to simulate an individual’s genotype, cohorts with differing levels of population and relatedness structure will arise. The left column depicts the hierarchical clustering of the sample-by-sample genetic relationship coefficients (complete linkage clustering of Euclidean distance between coefficients), the right column the first and second PC of the sample genotypes for the three different cohorts: A. unrelated individuals, with population structure: $N = 10$, prior assignment to ancestral population. B. unrelated individuals, no population structure: $N = 10$, no prior assignment to ancestral population. C. related individuals, no population structure: $N = 2$, no prior assignment to ancestral population.

3.2. Phenotype simulation

In this section, I introduce PhenotypeSimulator, an R/CRAN package for the flexible simulation of phenotypes with different genetic and non-genetic variance components. PhenotypeSimulator is a framework focusing on the simulation of phenotypes, with a particular emphasis on the complexity of multiple phenotypes, multiple genetic loci and genetic background, which is not provided by other multi-phenotype simulation software [O’Reilly & al., 2012; Porter & O’Reilly, 2017]. I have written PhenotypeSimulator to be easily integrated with external genotype simulation software (such as coalescent, forward-time and resampling approaches) and it can generate output suitable as input for a number of standard genetic association tools (such as PLINK [Chang & al., 2015], GEMMA [Zhou & Stephens, 2014] or SNPTEST [Marchini & al., 2007]). In the following, I will describe the simulation strategy of the different phenotype components, and will demonstrate the usage and application of PhenotypeSimulator by simulating phenotypes to evaluate the power of different linear mixed model designs in a genetic association study.
Phenotypes are typically generated as the sum of genetic effects, effects from non-genetic factors and observational noise. Genetic effects can represent i) genetic variants that are associated with a phenotype and ii) infinitesimal genetic effects that reflect underlying population structure and relatedness in a cohort. Non-genetic effects are used to simulate environmental, experimental or other unexplained variance in the data. Although the sources of non-genetic correlation are often combined in genetic association studies, I have found it valuable to separate these components to explore the impact of different correlation structures from these sources (see chapter 6). When simulating non-genetic factors, assumptions about their distribution have to be made and this choice depends on the specific biological effects that should be modelled. Common distributions are binomial (e.g. sex), normal or uniform distributions (e.g. weight, height) or categorical variables (e.g. disease status). Correlated non-genetic effects can be used to simulate a phenotype component with a defined level of correlation between traits. For instance, such effects can reflect a correlation structure that decreases with distance in phenotypes with ordered or spatial components, e.g. in imaging data. Observational noise captures any non-specified effects that arise due to, for instance, experimental measurement error. However, PhenotypeSimulator can also be used with a combined non-genetic covariance model, similar to more standard linear mixed models [O’Reilly & al., 2012; Zhou & Stephens, 2014; Porter & O’Reilly, 2017].

Figure 3.2: Phenotype simulation scheme. PhenotypeSimulator can take genotypes from a number of different input formats and uses these as the basis for the simulation of the genetic effects. In addition to the genetic effects, non-genetic covariates, observational noise and non-genetic correlation structure can be simulated. The effect structure of the upper four components can be divided into a shared effect across traits or an independent effect for a number of traits, allowing for complex phenotype structures such as the simulation of pleiotropy. Before combining the phenotype components, they are scaled to a user-defined proportion of the total phenotypic variance. Finally, the simulated phenotype and its components can be saved into a number of different genetic output formats. Arrows, lines and rectangles mark the dimensions of each component.

The proportion of variance assigned to each component will differ depending on the biological understanding of the simulated phenotype. PhenotypeSimulator allows for the specification of these variance proportions and, in addition, provides the option to divide the explained variance into two components, one that is shared across phenotypes and a second component that acts independently on certain phenotypes. For instance, the level of shared and independent effects for a genetic variant allows for the simulation of different levels of pleiotropy.

There are many ways to simulate these phenotype components depending on the scope and the model to be tested. Typically, it is assumed that the overall phenotype structure is well represented by an additive linear combination of individual components [Stephens, 2013; Marigorta & Gibson, 2014; Zhou & Stephens, 2014; Loh & al., 2014]. For PhenotypeSimulator, I assume this phenotype structure and sum the individual phenotype components to generate the final phenotypes.

3.2.1. Phenotype components

In PhenotypeSimulator, the phenotypes $\mathbf{Y} \in \mathcal{R}^{N, P}$ of $N$ samples and $P$ traits are generated as the sum of i) genetic variant effects $\mathbf{U} \in \mathcal{R}^{N, P}$, ii) infinitesimal genetic effects $\mathbf{G} \in \mathcal{R}^{N, P}$, iii) non-genetic effects $\mathbf{C} \in \mathcal{R}^{N, P}$, iv) correlated non-genetic effects $\mathbf{T} \in \mathcal{R}^{N, P}$ and v) observational noise effects $\boldsymbol{\Psi} \in \mathcal{R}^{N, P}$ (figure 3.2).
For components i-iv, a certain percentage of their variance is shared across all traits (shared) and the remainder is independent (ind) across traits. The option to divide the variance into shared and independent allows for the simulation of phenotypes with additional complexity. For instance, the level of shared and independent fixed genetic effects allows for the simulation of different levels of pleiotropy.

1. Genetic variant effects: For the SNP genetic effects, $S$ random SNPs for $N$ samples are drawn from the (simulated) genotypes. From the $S$ random SNPs, a proportion $\theta$ is selected to be causal across all traits. The shared genetic variant effect is simulated as the matrix product of this shared causal SNP matrix $\mathbf{X}^{\text{shared}} \in \mathcal{R}^{N, \theta \times S}$ and the shared effect size matrix $\mathbf{B}^{\text{shared}} \in \mathcal{R}^{\theta \times S, P}$. The columns of the shared effect size matrix are simulated to be perfectly correlated, i.e. the effect of a SNP is proportionally the same for all affected traits. The effect sizes for $\mathbf{B}^{\text{shared}}$ can be simulated to have either normal or uniform properties. This is implemented as follows in PhenotypeSimulator: $\mathbf{B}^{\text{shared}}$ is the matrix product of the two vectors $b_s \in \mathcal{R}^{\theta \times S, 1}$ and $b_p^T \in \mathcal{R}^{1, P}$. To simulate effect sizes with approximately normal properties [Oliveira & Seijas-Macias, 2012, Eq 31-33], $b_s$ and $b_p$ are drawn from two normal distributions, where $\mu_{b_p} = 0$ and $\sigma_{b_p} = 1$, and $\mu_{b_s}$ and $\sigma_{b_s}$ are specified by the user. For the simulation of uniformly distributed effect sizes, $b_s$ and $b_p^T$ are drawn from two exponential distributions whose negative normalised log product yields an approximately uniform distribution [Song, 2005] across the user-defined range. The remaining $(1 - \theta) \times S$ SNPs are simulated to have an independent effect across a specified number of traits $P^{\text{ind}}$. To realise this structure, $\mathbf{B}^{\text{ind}} \in \mathcal{R}^{(1-\theta) \times S, P}$ is initialised with either normally or uniformly distributed entries, with $\mu_B$ and $\sigma_B$ as specified by the user (same as for the shared effect).
Subsequently, $P - P^{\text{ind}}$ traits are randomly selected and the entries of $\mathbf{B}^{\text{ind}}$ in the columns of these traits are set to zero. The independent genetic variant effect is the matrix product of $\mathbf{X}^{\text{ind}} \in \mathcal{R}^{N, (1-\theta) \times S}$ and $\mathbf{B}^{\text{ind}}$.

2. Non-genetic covariate effects: The non-genetic covariate effects are based on $K$ non-genetic covariates $\mathbf{W} \in \mathcal{R}^{N, K}$, with a proportion $\gamma$ being shared across all traits, yielding the shared covariates matrix $\mathbf{W}^{\text{shared}} \in \mathcal{R}^{N, \gamma \times K}$. The proportion $1 - \gamma$ of non-genetic covariates that are independent make up the independent covariates matrix $\mathbf{W}^{\text{ind}} \in \mathcal{R}^{N, (1-\gamma) \times K}$. The distributions for each of the $K$ non-genetic covariates are independent and can be either normal, uniform, binomial or categorical. The distribution and respective parameters are chosen by the user. The effect size matrices $\mathbf{A}^{\text{shared}} \in \mathcal{R}^{\gamma \times K, P}$ and $\mathbf{A}^{\text{ind}} \in \mathcal{R}^{(1-\gamma) \times K, P}$ are designed as described for the genetic variant effects. The final non-genetic covariate effects are the matrix products of the covariate matrices and their effect size matrices: $\mathbf{W}^{\text{ind}}\mathbf{A}^{\text{ind}}$ and $\mathbf{W}^{\text{shared}}\mathbf{A}^{\text{shared}}$.

3. Infinitesimal genetic effects: The basis of the infinitesimal genetic effect $\mathbf{U}$ is the $N \times N$ genetic relationship matrix $\mathbf{K}$, which is either provided by the user or estimated from the genotypes of the simulated samples as $\frac{1}{m}\mathbf{X}\mathbf{X}^T$, where $m$ is the mean value of the diagonal elements of $\mathbf{X}\mathbf{X}^T$. A suitable model for simulating the infinitesimal genetic effect $\mathbf{U} \in \mathcal{R}^{N, P}$ with the known $N \times N$ sample covariance $\mathbf{K}$ and trait covariance $\mathbf{C}$ is a multivariate normal distribution (as for instance in [Zhou & Stephens, 2014; Casale & al., 2015]) where

$$\mathrm{vec}(\mathbf{U}) \sim \mathcal{N}_{N \times P}\left(\mathrm{vec}(\mathbf{0}), \mathbf{C} \otimes \mathbf{K}\right) \tag{3.1}$$

The structure of $\mathbf{C}$ depends on the desired design of the covariance effect, which can be either shared or independent across traits. This distribution can be realised by simulating a random variable $\mathbf{Z} \in \mathcal{R}^{M, L}$ with iid $\mathcal{N}(0, 1)$ entries and setting

$$\mathrm{vec}(\mathbf{U}) = \mathrm{vec}(\mathbf{B}\mathbf{Z}\mathbf{A}^T) \tag{3.2}$$

where $\mathbf{B} \in \mathcal{R}^{N, M}$ reflects the genetic relationship, i.e. the sample covariance with $\mathbf{K} = \mathbf{B}\mathbf{B}^T$, and $\mathbf{A} \in \mathcal{R}^{P, L}$ reflects the trait covariance with $\mathbf{C} = \mathbf{A}\mathbf{A}^T$ ($M$ and $L$ depend on the rank of $\mathbf{K}$ and $\mathbf{C}$ and are hence bounded by $N$ and $P$, respectively). A detailed derivation of equation (3.2) from equation (3.1) can be found in chapter C and has similarly been applied in [Casale & al., 2015]. By recasting equation (3.1) as equation (3.2), the infinitesimal genetic effect $\mathbf{U}$ described by a multivariate normal distribution is effectively modelled as the product of three matrices, representing the sample covariance ($\mathbf{B}$), a normally distributed variable ($\mathbf{Z}$) and the trait covariance ($\mathbf{A}$). Different designs of $\mathbf{A}$ allow for the simulation of shared and independent genetic random effects. For the independent effect, $\mathbf{A}^{\text{ind}}$ is a diagonal matrix with normally distributed entries, $(\mathbf{A}^{\text{ind}})^T = \mathrm{diag}(a_1, a_2, \dots, a_P)$ with $a_i \sim \mathcal{N}(0, 1)$, such that $\mathrm{vec}(\mathbf{U}^{\text{ind}}) = \mathrm{vec}(\mathbf{B}\mathbf{Z}(\mathbf{A}^{\text{ind}})^T)$. $\mathbf{A}^{\text{shared}}$ of the shared effect is simulated as a matrix of column rank one, with normally distributed entries in column one and zeros elsewhere, $a_{i,1} \sim \mathcal{N}(0, 1)$ and $a_{i,j \neq 1} = 0$, such that $\mathrm{vec}(\mathbf{U}^{\text{shared}}) = \mathrm{vec}(\mathbf{B}\mathbf{Z}(\mathbf{A}^{\text{shared}})^T)$.

4. Correlated non-genetic effects: Correlated non-genetic effects are simulated from a multivariate normal distribution with a covariance matrix described by a defined trait-by-trait correlation. Any correlation structure between the phenotypes can be simulated with this effect component, as the desired correlation matrix $\mathbf{C}$ can be supplied by the user. In addition, as a simple approximation for spatially correlated phenotypes as they might occur for instance in image-based phenotypes, PhenotypeSimulator provides the construction of $\mathbf{C}$ as follows: traits of distance $d = 1$ (adjacent trait columns) will have the highest specified correlation $r$, traits with $d = 2$ have a correlation of $r^2$, up to traits with $d = (P - 1)$ with a correlation of $r^{(P-1)}$, such that the correlation is highest at the first off-diagonal element and decreases exponentially by distance from the diagonal.
The correlated non-genetic effect matrix is simulated as $\mathbf{T} \sim \mathcal{N}_{N \times P}(\mathbf{0}, \mathbf{C})$.

5. Observational noise: The observational noise effects $\boldsymbol{\Psi}$ are simulated as the sum of a shared and an independent observational noise effect. Both effect components are simulated by the matrix product of $\mathbf{B} \in \mathcal{R}^{N, P}$, with entries drawn from $\mathcal{N}(0, 1)$, and $\mathbf{A} \in \mathcal{R}^{P, P}$. To realise the shared effect $\boldsymbol{\Psi}^{\text{shared}}$, $\mathbf{A}^{\text{shared}}$ is simulated as a matrix of row rank one, with normally distributed entries in row one and zeros elsewhere: $a_{1,j} \sim \mathcal{N}(0, 1)$ and $a_{i \neq 1,j} = 0$. $\mathbf{A}$ of the independent component is a diagonal matrix with normally distributed entries: $(\mathbf{A}^{\text{ind}})^T = \mathrm{diag}(a_1, a_2, \dots, a_P)$ with $a_i \sim \mathcal{N}(0, 1)$.

3.2.2. Scaling and phenotype construction

PhenotypeSimulator requires at least one phenotype component to simulate the phenotypes. Components can be combined as specified by the user and the correlation they introduce in the trait structure can be controlled by the specified levels of independent and shared effects (at the extremes, components can be simulated to have only shared or only independent effects). If desired, a simple phenotype structure following a model as cast for instance in the multivariate normal model by [Zhou & Stephens, 2014] can be achieved by specifying only genetic variant effects, non-genetic covariate effects, infinitesimal genetic effects and observational noise. I have designed PhenotypeSimulator such that the amount of variance that each component should contribute to the total phenotypic variance can be specified by the user. Every component is thereby scaled by a factor $a$ such that its average column variance explains $x$ percent of the total variance. The scale factor $a$ is derived as follows: Let $X$ be a random variable with expected value $E[X] = \mu_x$ and variance $V[X] = E[(X - \mu_x)^2]$ and let $Y = aX$. Then

$$E[Y] = a\mu_x$$
$$V[Y] = E[(Y - \mu_y)^2]$$
$$V[Y] = E[(aX - a\mu_x)^2] = a^2 E[(X - \mu_x)^2] = a^2 V[X]. \tag{3.3}$$

Hence, the scaling of a random variable by $a$ leads to the scaling of its variance by $a^2$.
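As a numerical check of equation (3.3), the following sketch rescales a toy component so that its average column variance matches a chosen proportion. It is an illustration under stated assumptions, not the package code: the function names are hypothetical, the component is a plain list of rows, and the population variance (`statistics.pvariance`) is used for simplicity.

```python
import random
import statistics


def mean_column_variance(matrix):
    """Average of the per-column (per-trait) variances of a row-major matrix."""
    columns = list(zip(*matrix))
    return statistics.fmean([statistics.pvariance(col) for col in columns])


def scale_component(matrix, proportion):
    """Rescale a component so that its average column variance equals proportion.

    Because scaling by a multiplies every column variance by a**2, the factor
    a = sqrt(proportion / current average column variance) gives the target.
    """
    a = (proportion / mean_column_variance(matrix)) ** 0.5
    scaled = [[a * value for value in row] for row in matrix]
    return scaled, a


rng = random.Random(1)
# Hypothetical component: 500 samples, 3 traits, standard deviation 3.
component = [[rng.gauss(0, 3) for _ in range(3)] for _ in range(500)]
scaled, a = scale_component(component, proportion=0.4)

variance_before = mean_column_variance(component)
variance_after = mean_column_variance(scaled)
```

After scaling, `variance_after` equals the requested proportion up to floating point error, and `variance_after / variance_before` equals `a ** 2`, as equation (3.3) predicts.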
To scale the phenotype components such that their average column variance $V_{col} = \frac{V_1 + \dots + V_p}{p}$ explains a specified percentage $x$ of the total variance, choose the scaling factor $a$ such that:

$$x = a^2 \times V_{col}$$
$$a = \sqrt{\frac{x}{V_{col}}} \tag{3.4}$$

The final simulated phenotype $\mathbf{Y}$ is expressed as the sum of the scaled genetic variant effects, the non-genetic covariate effects, the infinitesimal genetic effects, the correlated non-genetic effects and the observational noise effects:

$$\mathbf{Y} = \mathbf{X}^{\text{shared}}\mathbf{B}^{\text{shared}} + \mathbf{X}^{\text{ind}}\mathbf{B}^{\text{ind}} + \mathbf{W}^{\text{shared}}\mathbf{A}^{\text{shared}} + \mathbf{W}^{\text{ind}}\mathbf{A}^{\text{ind}} + \mathbf{U}^{\text{shared}} + \mathbf{U}^{\text{ind}} + \mathbf{T} + \boldsymbol{\Psi}^{\text{shared}} + \boldsymbol{\Psi}^{\text{ind}}. \tag{3.5}$$

3.2.3. Case study

To demonstrate the usage and application of PhenotypeSimulator, I simulated a set of phenotypes and used them to evaluate the power of different linear mixed model designs in GWAS. In order to demonstrate the integration of PhenotypeSimulator with already established simulation and GWAS tools, I chose Hapgen2 [Su & al., 2011] for genotype simulation, used PhenotypeSimulator for the phenotype simulation based thereon and applied GEMMA (version 0.96) [Zhou & Stephens, 2014] for the GWAS. The analysis code and parameters of this case study, from the data simulation to the genome-wide association study, are supplied as a vignette to the R package.

I simulated genotype data for 1,000 individuals via Hapgen2, mimicking population structure from four populations in the 1000 Genomes Project [1000 Genomes Project Consortium, 2012] (similar to the genotype structure described in section 3.1). The simulated genotypes of this cohort served as the basis for the genetic variant and infinitesimal genetic effects. I generated a phenotype set consisting of three traits with ten genetic variant effects and four non-genetic covariates. For the ten genetic variant effects, I randomly selected ten variants from the genotypes and simulated shared genetic variant effects across all phenotypes.
I introduced additional correlation structure by including an infinitesimal genetic effect based on the individuals’ kinship estimates as well as a correlated non-genetic effect (correlation: 0.8) and an observational noise effect. The total genetic variance accounts for 60% of the variance, leaving 40% of the variance explained by the noise terms. Figure 3.3 shows the trait-by-trait correlations of the final phenotype and each of its components.

The final phenotypes served as the response variable in the GWAS based on LMMs, with the simulated SNPs and non-genetic covariates as fixed effects and the kinship estimated from the genotypes as part of the genetic random effect [Zhou & Stephens, 2014] (see section 1.7.6). I analysed the power of jointly modelling all three phenotypes (multi-trait) and the power of single-trait models where the association of each phenotype is analysed separately. The single-trait GWAS was run for all three traits. All GWAS were conducted with GEMMA (version 0.96) [Zhou & Stephens, 2014]. In both the multi-trait and the single-trait GWAS, the phenotypes (-p flag) were modelled as the sum of genetic (simulated SNPs; -g flag) and non-genetic (simulated covariates; -c flag) fixed effects, a random genetic effect (with the eigenvectors and eigenvalues of the kinship matrix; -u and -d flags) and observational noise (linear mixed model with likelihood ratio test using the -lmm 2 flag). For a comparison of the number of causal SNPs recovered in the multi-trait and single-trait GWAS, the p-values of the single-trait GWAS were adjusted by the number of tests conducted (Bonferroni adjustment for three tests).

[Figure 3.3: six heatmap panels (Y, WA, XB, U, T, Ψ), each showing Trait 1 to Trait 3 against Trait 1 to Trait 3; colour scale: Pearson correlation from −1.0 to 1.0]

Figure 3.3: Phenotype simulation. Heatmaps of the trait-by-trait correlation (Pearson correlation) of a simulated phenotype ($\mathbf{Y}$) and its five phenotype components: genetic variant effects $\mathbf{X}\mathbf{B}$, infinitesimal genetic effects $\mathbf{U}$, non-genetic covariates $\mathbf{W}\mathbf{A}$, correlated non-genetic effects $\mathbf{T}$ and observational noise $\boldsymbol{\Psi}$. The non-genetic covariates consist of four independent components, two following a binomial and two following a normal distribution. The genetic variant effect consists of ten causal SNPs with a shared effect across all traits, yielding the strong correlation structure observed above. The highest correlation for the correlated non-genetic effect was set at 0.8.

For the simulated phenotypes with shared genetic variant effects only, the multi-trait GWAS shows greater power than any of the single-trait analyses (figure 3.4). The multi-trait GWAS detected four out of the ten SNPs for which a phenotype effect was modelled, i.e. four passed the commonly used genome-wide significance threshold of $5 \times 10^{-8}$ [Fadista & al., 2016]. The single-trait GWAS only recovered three of these SNPs. The ability of linear (mixed) models to detect the SNPs for which a phenotype effect was modelled depends on the allele frequencies of these SNPs and the effect size [Cohen, 1992; Halsey & al., 2015]: the higher the effect size and/or the allele frequency, the better the power to detect the SNP effects. The p-values of all SNPs with a simulated effect on the phenotypes in relation to their allele frequencies and simulated effect sizes are depicted in figure 3.5. It shows a strong trend for SNPs with high allele frequencies and large simulated effect sizes to have low p-values.

[Figure 3.4: plot of observed −log10(P) against expected −log10(P); legend entries: Trait 1, Trait 2, Trait 3, Trait 1-3; SNPs: all, with effect; mvLMM, uvLMM]

Figure 3.4: Comparison of multi-trait to single-trait GWAS.
Quantile-quantile plots of p-values observed from the multi-trait GWAS (via multivariate linear mixed model; mvLMM) and the single-trait GWAS (via univariate linear mixed models; uvLMM) fitted to each of the about eight million genome-wide SNPs (grey), including the ten SNPs for which a phenotype effect was modelled (green).

3.3. Conclusion

PhenotypeSimulator offers a framework for complex multi-trait, multi-locus phenotype simulations in quantitative genetics, packaged in an easy-to-use manner for statistical geneticists. PhenotypeSimulator is the only simulation package I know of that can simulate complex multi-trait phenotypes with complex multi-locus genetics, including a population structure term with phenotypic correlation. It can create realistic covariate structures with similar properties to real covariates (e.g. categorical covariates or covariates drawn from different distributions). The different phenotype components can be independently extracted and scaled, for example having the “true” variance components and covariance matrices from the simulation readily available for comparison to inference schemes.

[Figure 3.5: plot of −log10(p-value) against allele frequencies; legend: |simulated β|; phenotype in GWAS: trait1, trait2, trait3]

Figure 3.5: Relationship between p-values, allele frequencies and simulated effect sizes. The p-values of all SNPs with a simulated effect on the phenotypes are depicted in relation to their allele frequencies and simulated effect sizes. SNPs with low allele frequencies and/or small simulated effect sizes do not pass the genome-wide significance threshold (horizontal line).

The underlying model for PhenotypeSimulator corresponds to the commonplace linear mixed model framework. As such, it is limited in its use for benchmarking between methods, where linear mixed model methods are likely to perform best. However, the need for an underlying model is true for any simulation package.
I have developed PhenotypeSimulator as a flexible component in the standard genetics pipeline, with the ability to both read genetic formats from widely used tools and output phenotypes compatible with many tools. It is freely available as an R/CRAN package and its code is available on GitHub (https://github.com/HannahVMeyer/PhenotypeSimulator). This allows easy large-scale deployment for comprehensive simulation across many parameter settings.

In this thesis, phenotypes simulated with PhenotypeSimulator formed the basis for the method development in chapter 4 and chapter 6.

4 Extending linear mixed models to high-dimensional phenotypes

Different strategies and challenges for multi-trait GWAS of high-dimensional phenotypes have been discussed in section 1.7.7. Phenotypes can either be transformed into a lower dimensional space prior to the association study or the summary statistics from single-trait GWAS can be combined post-hoc to obtain quasi multi-trait association results. In contrast, multivariate LMMs can directly model the genotypic association across a moderate number of phenotypes. In the following chapter, I will describe the challenges of multivariate LMMs for high-dimensional phenotypes and present LiMMBo, a new method for the genotype-phenotype mapping of high-dimensional datasets.

LMMs have become a workhorse in genetic association studies as they allow controlling for complex sample-by-sample covariance structures that can reflect population structure and relatedness (discussed in detail in section 1.7.6). In summary, LMMs commonly describe the phenotype as a linear combination of fixed effects (experimental and/or technical covariates and the genotype marker of interest), a random genetic effect and residual noise, where the latter two capture the genetic and residual covariances between traits.
The association of the genetic marker is evaluated by comparing the alternative hypothesis that the genotype has an effect on the phenotype which is unequal to zero to the null model of no effect (section 1.7.8). In practice, this means estimating the effect size of the fixed genetic effects and the random effect covariance terms for the alternative model, and the random effect covariance terms for the null model where the effect size of the genetic marker is zero.

The first LMM implementations estimated all variance components (genotype effect size and random effect covariance terms, equation (1.52)) anew for each SNP-phenotype association. However, in human genetics effect sizes are generally assumed to be small compared to the overall phenotypic variance [Kang & al., 2010; Zhang & al., 2010]. Consequently, estimates of the random effect covariance terms under the null model can serve as a good approximation. Based on these differences in the estimation of the random effect covariance terms, LMMs can broadly be grouped into two categories: exact methods, with covariance estimates under the alternative model, and approximate methods, where the random effect covariance terms are only estimated once under the null model of no fixed genetic effect and are then used as predefined random effects in the alternative models for all genome-wide associations.

Within these two categories, one can further distinguish between methods only applicable as univariate tests and methods that allow for multivariate testing. Table 4.1 summarises commonly used frameworks and describes their computational complexity¹. Among the exact methods, FaST-LMM-select reduces the complexity best in terms of sample size by selecting the number of SNPs to use for the estimation of the RRM. However, it can only be applied in univariate analyses while MTMM and GEMMA extend to multivariate cases.
BOLT-LMM scales best with increasing sample sizes in the group of approximate tests, by directly using the genotypes and not computing or storing the RRM. All other methods have an upfront $O(N^3)$ operation for the eigendecomposition of the RRM. TASSEL reduces this complexity by grouping the samples and thereby effectively reducing the size of the RRM.

With the generation of ever-increasing cohort sizes in genetic association studies, most LMM frameworks are optimised for the number of samples, as described above for BOLT-LMM and TASSEL. While the remaining methods still have the upfront cubic computation of the RRM’s eigendecomposition, subsequent steps have been adapted to scale linearly or quadratically with the number of samples for the majority of the applications.

¹ The computational complexity and algorithms for the GCTA implementations [Yang & al., 2011] of multivariate genetic variance estimation [Lee & al., 2012] and LMM for association testing [Yang & al., 2014] could not be found in the original publications and are therefore not listed.

Table 4.1: Linear mixed model frameworks for genetic association studies. A list of popular LMM frameworks, grouped by their usage of covariance estimates when fitting the alternative model (first column: E: exact, A: approximate). The complexity describes the complexity for fitting a single LMM as specified in the original publication or summarised elsewhere, as indicated by the footnotes. $P$ indicates the trait size that the model was designed for (according to the original publication). Models with specific parameters are described in more detail in the text (FaST-LMM-select and TASSEL). $N$: number of samples; $s_c$: number of SNPs used for singular value decomposition; $c$: compression factor with $c = \frac{N}{g}$ for $g$ individuals per group; $t$, $t_1$ and $t_2$: average number of iterations needed to find parameter estimates.
GRAMMAR-Gamma, FaST-LMM-select: $t$ steps of Brent's algorithm; GEMMA, MTMM: $t_1$ steps of the EM algorithm, $t_2$ steps of the NR algorithm; BOLT-LMM: $t$ steps of the variational Bayes and conjugate gradients; TASSEL: $t$ steps of the ProcMixed algorithm in SAS; mtSet: $t$ steps of the L-BFGS.

| | Framework | Complexity $O(\cdot)$ | $P$ | Reference |
|---|---|---|---|---|
| E | FaST-LMM-select | $N s_c^2 + N^2 + tN$ | 1 | [Lippert & al., 2011] |
| E | GEMMA | $N^3 + N^2 P + t_1 N P^2 + t_2 N P^6$ | 10 | [Zhou & Stephens, 2014] |
| A | MTMM | $t_1 N^3 P^3 + t_2 N^3 P^7 + N^2 P^2$ | 2 | [Korte & al., 2012]² |
| A | EMMAX | $N^3 + tN + N^2$ | 1 | [Kang & al., 2010] |
| A | TASSEL | $\frac{1}{c^3} N^3$ | 1 | [Zhang & al., 2010] |
| A | GRAMMAR-Gamma | $N^3 + tN + N$ | 1 | [Svishcheva & al., 2012] |
| A | BOLT-LMM | $tN$ | 1 | [Loh & al., 2014] |
| A | mtSet | $N^3 + t(NP^4 + P^5)$ | 10 | [Casale & al., 2015] |

² Listed in [Zhou & Stephens, 2014].

The reduced complexity in the sample term comes as a trade-off with the number of traits that can be analysed. Specifically, computations become prohibitive as soon as a few tens of traits (table 4.1, column $P$) are considered, with computational complexities ranging from $O(P^5)$ up to $O(P^7)$ for existing methods [Casale & al., 2015; Korte & al., 2012]. In practice, this limits these models to moderate trait numbers.

To overcome this limitation, I developed a simple but surprisingly effective heuristic, linear mixed model with bootstrapping (LiMMBo), to efficiently estimate large trait covariance matrices, thereby allowing for the analysis of datasets with a large number of phenotypic traits. LiMMBo and its application (chapter 5) are currently under revision and available as a pre-print [Meyer & al., 2018]. I conducted all simulations and analyses and generated all results. I provide LiMMBo as an open source Python package (https://pypi.org/project/limmbo/) with a command line interface; its source code is available on GitHub: https://github.com/HannahVMeyer/limmbo.

4.1. LiMMBo: Linear mixed modeling with bootstrapping

To extend the range of LMMs to high-dimensional phenotype sets, I chose to build on an approximate model in order to avoid the repeated estimation of the trait-by-trait covariance matrices. In that respect, the multivariate LMM developed by Lippert, Casale and colleagues [Lippert & al., 2014; Casale & al., 2015] offered many advantages. It is computationally efficient for a moderate number of traits, has successfully been used in multi-trait studies [Cannavò & al., 2016; Schor & al., 2017], and collaboration with its developers was easily realisable. Their model is cast as

$$\mathbf{Y} = \mathbf{G} + \boldsymbol{\Psi}, \tag{4.1}$$

where the $N \times P$ phenotype matrix $\mathbf{Y}$ for $N$ individuals and $P$ traits is modelled as the sum of a genetic (or polygenic) component $\mathbf{G}$ and a noise component $\boldsymbol{\Psi}$ (I have omitted additional fixed effects for notational brevity). Here, $\mathbf{G}$ and $\boldsymbol{\Psi}$ are random effects following matrix normal distributions:

$$\mathbf{G} \sim \mathcal{MN}_{N,P}(\mathbf{0}, \mathbf{R}, \mathbf{C}_g), \qquad \boldsymbol{\Psi} \sim \mathcal{MN}_{N,P}(\mathbf{0}, \mathbf{I}_N, \mathbf{C}_n), \tag{4.2}$$

where $\mathbf{R}$ denotes the $N \times N$ genetic relationship matrix, $\mathbf{I}_N$ is the $N \times N$ identity matrix and $\mathbf{C}_g$ and $\mathbf{C}_n$ are the genetic and the residual $P \times P$ trait covariance matrices, respectively. The marginal likelihood of the model in equation (4.1) can be expressed in terms of a multivariate normal distribution of the form

$$p(\mathbf{Y} \mid \mathbf{C}_g, \mathbf{R}_N, \mathbf{C}_n) = \mathcal{N}\left(\operatorname{vec}(\mathbf{Y}) \mid \mathbf{0}, \mathbf{C}_g \otimes \mathbf{R}_N + \mathbf{C}_n \otimes \mathbf{I}_N\right), \tag{4.3}$$

where the covariance structure of the phenotypes (in the shape of the phenotype vector $\operatorname{vec}(\mathbf{Y})$ of length $N \times P$, obtained by stacking the columns of the phenotype matrix) is described by the sum of the Kronecker products $\otimes$ of the sample and trait covariance terms. This model enables efficient inference schemes by exploiting Kronecker identities for the eigendecomposition of the full covariance matrix [Lippert & al., 2014; Rakitsch & al., 2013; Zhou & Stephens, 2014; Casale & al., 2015]. In particular, it allows for decoupling the decomposition of $\mathbf{C}_g$ and $\mathbf{R}_N$, which greatly increases the efficiency of the inference as $\mathbf{R}_N$ is constant.
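To make the covariance structure in equation (4.3) concrete, the following minimal Python sketch builds $\mathbf{C}_g \otimes \mathbf{R}_N + \mathbf{C}_n \otimes \mathbf{I}_N$ for a toy example. The matrix values and the helper names (`kron`, `add`, `identity`, `cov_entry`) are illustrative assumptions and are not taken from LiMMBo or LIMIX; the sketch uses plain lists rather than an optimised linear algebra library.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Toy values (assumptions): N = 3 samples, P = 2 traits.
R = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]   # sample-by-sample relationship matrix
C_g = [[0.6, 0.3],
       [0.3, 0.4]]      # genetic trait-by-trait covariance
C_n = [[0.4, 0.1],
       [0.1, 0.6]]      # residual trait-by-trait covariance

N, P = len(R), len(C_g)

# Covariance of vec(Y): an (N * P) x (N * P) matrix.
K = add(kron(C_g, R), kron(C_n, identity(N)))


def cov_entry(i, p, j, q):
    """Covariance between (sample i, trait p) and (sample j, trait q).

    vec() stacks the columns of Y, so trait p of sample i sits at index p * N + i.
    """
    return K[p * N + i][q * N + j]


print(len(K), "x", len(K[0]))
print(cov_entry(0, 0, 1, 1))  # different samples, different traits
```

In this layout, the covariance between two different samples only receives a contribution from the genetic term, while entries for the same sample also include the residual trait covariance.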
The model in equation (4.1) also corresponds to the null model when using the multi-trait LMM for genetic association mapping.

The complexity of this multivariate LMM implementation (from now on referred to as “standard REML”) is $O(N^2 + t(NP^4 + P^5))$, with $N$ the number of samples, $P$ the number of traits, and $t$ the number of iterations of Broyden's method, which uses an approximation of the second derivative for optimising the REML of the parameter estimates. From this expression, it becomes evident that the complexity increases steeply as the number of traits increases, which explains why this LMM set-up is not feasible for large trait sets (as is the case for the other inference schemes in table 4.1). To overcome the bottleneck of estimating the trait-by-trait covariance matrices, I developed a simple method that efficiently uses a subsampling approach to estimate $\mathbf{C}_g$ and $\mathbf{C}_n$.

4.2. Covariance estimation via bootstrapping

The key innovation of LiMMBo is to perform the variance decomposition on $b$ bootstrap samples of $s$ traits instead of on the whole dataset, and to use those bootstrap samples to reconstruct the full $\mathbf{C}_g$ and $\mathbf{C}_n$ matrices (figure 4.1). In detail, from the total phenotype set with $P$ traits, $b$ subsets of $s$ traits are randomly selected. $b$ depends on the overall trait number $P$ and the sampling size $s$ and is chosen such that each pair of traits is drawn together at least $c$ times (default: $c = 3$). For each subset, the variance decomposition is estimated via the null model of the multivariate linear mixed model (mvLMM), i.e. without the genetic variant effect $\mathbf{x}$ (equation (4.3)), and the $s \times s$ covariance matrices $\mathbf{C}_g^s$ and $\mathbf{C}_n^s$ are recorded. For each trait pair, the covariance estimate is averaged over the number of times the pair was drawn. The challenge lies in combining the bootstrap results in such a way that the resulting $\mathbf{C}_g$ and $\mathbf{C}_n$ matrices are true covariance matrices, i.e. positive semi-definite, and serve as good estimators of the true covariance matrices.
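The subsampling and averaging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the LiMMBo implementation: subsets are simply drawn until every trait pair has been co-sampled at least $c$ times (the text does not specify how $b$ is derived), and the per-subset variance decomposition is replaced by a noise-free stand-in that reads the $s \times s$ block from a known matrix. The function names `draw_bootstraps` and `average_estimates` are hypothetical.

```python
import itertools
import random


def draw_bootstraps(P, s, c=3, seed=0):
    """Draw random s-trait subsets until each trait pair co-occurs >= c times.

    Assumes 2 <= s <= P. Returns the subsets and the pair co-sampling counts.
    """
    rng = random.Random(seed)
    counts = {pair: 0 for pair in itertools.combinations(range(P), 2)}
    subsets = []
    while min(counts.values()) < c:
        subset = sorted(rng.sample(range(P), s))
        subsets.append(subset)
        for pair in itertools.combinations(subset, 2):
            counts[pair] += 1
    return subsets, counts


def average_estimates(P, subsets, estimates):
    """Average s x s estimates entry-wise into a P x P matrix.

    Each entry is divided by the number of subsets that contained both traits.
    """
    total = [[0.0] * P for _ in range(P)]
    draws = [[0] * P for _ in range(P)]
    for subset, est in zip(subsets, estimates):
        for a, i in enumerate(subset):
            for b, j in enumerate(subset):
                total[i][j] += est[a][b]
                draws[i][j] += 1
    return [[total[i][j] / draws[i][j] for j in range(P)] for i in range(P)]


# Toy setting (assumption): P = 20 traits, subsets of s = 5 traits, c = 3.
P, s = 20, 5
C_true = [[0.5 ** abs(i - j) for j in range(P)] for i in range(P)]

subsets, counts = draw_bootstraps(P, s, c=3)

# Stand-in for the per-subset variance decomposition: the exact sub-block.
estimates = [[[C_true[i][j] for j in sub] for i in sub] for sub in subsets]
C_hat = average_estimates(P, subsets, estimates)

print("number of bootstraps b:", len(subsets))
print("minimum co-sampling count:", min(counts.values()))
```

With noisy per-subset estimates, the averaged matrix is generally not guaranteed to be positive semi-definite, which is why a separate fitting step is needed.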
To obtain valid covariance matrices, the covariance estimates of the $b$ subsets are fitted (least-squares estimate) to the closest positive semi-definite matrices via a limited-memory version of the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS algorithm), which uses approximations of the Hessian matrix for finding the parameter estimates [Byrd & al., 1995]. The average estimates of the parameters are used to initialise the matrices.

[Figure 4.1 schematic: variance decomposition (VD) of $P$ traits and $N$ samples via REML (left) and via LiMMBo with $b$ bootstraps of $s$ traits that are subsequently combined (right)]

Figure 4.1: Variance decomposition. On the left-hand side, the phenotype set of $P$ traits and $N$ samples is decomposed into its $P \times P$ trait-to-trait covariances $\mathbf{C}_g$ and $\mathbf{C}_n$, based on the provided genetic sample-to-sample kinship estimate matrix $\mathbf{R}$. The noise sample-to-sample matrix $\mathbf{I}$ is assumed to be constant (identity matrix). By default, this is done by restricted maximum likelihood estimation of the null model of the mvLMM (equation (4.3)). However, this direct variance decomposition (VD) via the standard REML implementation only works for moderate phenotype sizes. For larger trait-set sizes, LiMMBo serves as an alternative to the standard REML (right-hand side). Here, the phenotypes' variance components are estimated on $b$ subsets of $s$ traits drawn from the $P$ traits, which are subsequently combined into the overall $P \times P$ covariance matrices $\mathbf{C}_g$ and $\mathbf{C}_n$.

4.3. Data simulation

Using PhenotypeSimulator (chapter 3), I simulated a number of different phenotype datasets to evaluate LiMMBo in terms of scalability, model calibration and power. The datasets differed in their overall trait size $P$, the percentage of variance explained by genetics $h^2$ (sum of genetic variant and infinitesimal genetic effects) and the number of different phenotype components simulated to create the final phenotype.
The phenotypes were simulated as described in section 3.2, based on the parameters and parameter values described in table 4.2 and table 4.3. Parameter values were generally chosen to cover a wide range of possible combinations and trait sizes. Parameters for levels of variance explained by the genetic and noise components were set to test their effect on the variance decomposition algorithm of the underlying LMM framework [Casale & al., 2015]. The variance decomposition is initialised by allocating an even split of variance explained to the genetic and random noise effects. The levels of variance explained were thus set to 0.5 each and to deviations from this equal split in either direction (0.2, 0.8).

4.4. Scalability of LiMMBo

The complexity of the variance decomposition of the LMM framework that LiMMBo builds on is $O(N^2 + t(NP^4 + P^5))$. The second term depends on the overall trait size and describes the complexity of estimating the trait-by-trait covariance matrices $\mathbf{C}_g$ and $\mathbf{C}_n$. By bootstrapping $s$-sized samples from the overall trait set, this complexity term changes to $bt(Ns^4 + s^5)$, with the covariance estimation carried out for $b$ bootstraps. In addition to the estimation of the covariance terms, the overall complexity of LiMMBo also depends on fitting the BFGS algorithm $n$ times to the full trait set of size $P$. LiMMBo makes use of a Cholesky decomposition of the matrices to be fitted, resulting in $\frac{1}{2}P(P+1)$ model parameters to be fitted for both $\mathbf{C}_g$ and $\mathbf{C}_n$. Thus, the overall complexity of LiMMBo is $O(N^2 + bt(Ns^4 + s^5) + nP^2)$, which is the sum of the complexity of the bootstrap variance decompositions and the complexity of fitting the BFGS algorithm.

In order to assess and compare how LiMMBo scales, I performed variance decomposition both with LiMMBo and the standard REML approach on phenotypes with trait sizes ranging from 10 to 100 traits (parameters for phenotype simulation as described in table 4.3, total of ten simulated datasets per setup).
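As a rough worked example of the two complexity expressions above, the following sketch evaluates their dominant terms for one parameter setting. The values chosen for $t$, $n$ and $b$ are purely illustrative assumptions (they are not reported in the text), constant factors hidden by the $O$-notation are ignored, and the function names are hypothetical.

```python
def reml_cost(N, P, t):
    """Operation count suggested by O(N^2 + t(N P^4 + P^5))."""
    return N**2 + t * (N * P**4 + P**5)


def limmbo_cost(N, P, s, b, t, n):
    """Operation count suggested by O(N^2 + b t (N s^4 + s^5) + n P^2)."""
    return N**2 + b * t * (N * s**4 + s**5) + n * P**2


def n_cholesky_params(P):
    """Free parameters of a P x P lower-triangular Cholesky factor."""
    return P * (P + 1) // 2


# Illustrative values (assumptions): t, n and b are not taken from the text.
N, P, s = 1000, 100, 10
t, n, b = 100, 100, 200

reml = reml_cost(N, P, t)
limmbo = limmbo_cost(N, P, s, b, t, n)

print("standard REML:", reml)
print("LiMMBo:       ", limmbo)
print("ratio:        ", round(reml / limmbo, 1))
print("Cholesky parameters per matrix:", n_cholesky_params(P))
```

Under these assumed values, the trait-dependent term of the standard REML approach dominates by more than an order of magnitude, which mirrors the qualitative argument made in the text.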
In these scalability experiments, the sampling size $s$ was set to $s = 5$ for $P = 10$ and to $s = 10$ otherwise. Figure 4.2 shows the overall time taken by the standard REML approach, LiMMBo and its two main components, the bootstrapping and the combination of the bootstrap results.

Table 4.2: Parameters for phenotype simulation. The total variance for the genetic and noise effects is the sum of the variance of their effect components and has to add to 1. Each component has a certain percentage of its variance that is shared across traits, while the rest is independent.

| Component | Variance | Shared | Independent |
|---|---|---|---|
| Genetic effects (total) | $h^2$ | | |
| genetic variant effect | $h^2_s$ | $\theta$ | $1-\theta$ |
| infinitesimal genetic effects | $h^2_g$ | $\eta$ | $1-\eta$ |
| Noise effects (total) | $(1-h^2)$ | | |
| covariate effect | $(1-h^2)\delta$ | $\gamma$ | $1-\gamma$ |
| observational noise | $(1-h^2)(1-\delta)$ | $\alpha$ | $1-\alpha$ |

Table 4.3: Parameter values of simulated phenotypes for assessing scalability, calibration and power. The “genotype” parameter specifies the simulated genotype cohort which was used to simulate genetic effects (described in section 3.1). $P$ are the different trait-set sizes that were simulated. The parameters that follow are described in table 4.2 and specify the variance explained by each of the phenotype components. A variance explained equal to zero means that this component was not simulated; corresponding non-applicable variance terms are designated with “-”.

| Parameter | Power | Calibration |
|---|---|---|
| Genotypes | relatedNoPopstructure | relatedNoPopstructure, unrelatedNoPopstructure, unrelatedPopstructure |
| $P$ | 10, 50, 100 | 10, 20, …, 100 |
| $h^2_s$ | 0.05, 0.2, 0.0125 | 0 |
| $h^2_g$ | 0.95, 0.98, 0.9875 | 1 |
| $h^2$ | 0.8, 0.5, 0.2 | 0.8, 0.5, 0.2 |
| $(1-h^2)\delta$ | 0.4 | 0 |
| $(1-h^2)(1-\delta)$ | 0.6 | 1 |
| $(1-h^2)$ | 0.2, 0.5, 0.8 | 0.2, 0.5, 0.8 |
| $\theta$ | 0.6 | - |
| $\eta$ | 0.8 | 0.8 |
| $\gamma$ | 0.6 | - |
| $\alpha$ | 0.8 | 0.8 |

[Figure 4.2 plot: process time [h] against number of traits (10 to 100); series: Combine bootstraps, Sum bootstraps, LiMMBo (total), REML]

Figure 4.2: Scalability of LiMMBo compared to standard REML.
Empirical run times for LiMMBo and the standard REML approach on three simulated datasets per phenotype size, with $N = 1{,}000$ individuals each and different amounts of variance explained by the genetic background signal (0.2, 0.5, 0.8). Points mark the mean run time across the different set-ups, error bars indicate their standard deviation. Lines were fitted for the bootstrapping step (orange): $n(Ns^4 + s^5)$; the combination of the bootstraps (blue): $\frac{1}{2}P(P+1)$; and their combined run time (turquoise): $n(Ns^4 + s^5) + \frac{1}{2}P(P+1)$. $b$: number of bootstraps, $s$: bootstrap size, $P$: phenotype size. The majority of the run time is required for the bootstrapping. The run times for the standard REML approach (red) are only depicted up to $P = 40$, where they already exceed the run times for $P = 100$ in the LiMMBo approach (REML: $O(N^2 + t(NP^4 + P^5))$).

The majority of the run time of LiMMBo is taken by the variance decomposition of the bootstrapped subsets, which accounts for at least 85% (70 traits) and on average 97% of the total run time. In comparison, the time taken by the standard REML approach quickly exceeds the time of LiMMBo and becomes unfeasible for more than 30 traits.

Besides keeping the complexity of LiMMBo effectively at $O(P^2)$, the bootstrapping has the major advantage of allowing for parallelisation of the covariance estimation step. LiMMBo computes the variance decomposition of each bootstrap independently and thus enables the use of multiple cores, allowing for an additional speed-up of the process.

The role of the bootstrap size $s$, the number of bootstraps $b$ and the co-sampling of traits $c$ on complexity has not been evaluated yet. Different combinations of these parameters will potentially yield different run times and might influence the covariance estimates and model calibration, which are described in the next sections.
For the remainder of this chapter, the bootstrap size $s = 10$ and co-sampling of traits $c = 3$, which were used for the estimation of run time differences, are adopted for all further analyses. The influence of $s$, $b$ and $c$ and additional experiments for evaluating their role in the model are discussed in section 4.8.

4.5. LiMMBo yields covariance estimates consistent with REML estimates for moderate trait numbers

I evaluated the suitability of LiMMBo for covariance estimation of $\mathbf{C}_g$ and $\mathbf{C}_n$ on simulated datasets with different strengths of infinitesimal genetic effects. I simulated phenotype sets composed of infinitesimal genetic effects $\mathbf{G}$ and observational noise effects $\boldsymbol{\Psi}$ only, omitting any genetic variant effects (additional parameters as described in table 4.3), and subsequently estimated these variance components with LiMMBo and standard REML. Variance estimation on simulated datasets allows for the comparison of the estimated covariance matrices to the true covariance matrices based on which the phenotypes were simulated. By computing the root mean squared deviation (RMSD) between the true and estimated covariance matrices from both methods, I obtain a measure that is directly comparable and independent of the trait set:

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{t=1}^{n} \left(C_{\text{true}} - C_{\text{estimate}}\right)^2}{n}} \tag{4.4}$$

Figure 4.3 shows the comparison of both standard REML and LiMMBo-derived covariance matrices to the simulated, true covariance matrices. In the regime where REML is feasible, i.e. moderate trait sizes of up to 30, the RMSD can directly be compared: both methods provide consistent estimates across trait sizes with little difference between the methods. Importantly, the RMSD stays constant for the LiMMBo-derived estimates of the covariances, even for phenotypes of higher sizes.

[Figure 4.3 plot: root mean squared deviation against number of traits (10, 20, 30, 50, 100); methods: LiMMBo, REML]

Figure 4.3: Comparison of trait-by-trait covariance estimates derived from standard REML and LiMMBo.
Phenotypes with different percentages of variance explained by genetics ($h^2 = 0.2, 0.5, 0.8$) and different trait numbers were simulated. Subsequently, the genetic and noise trait-by-trait covariance matrices $\mathbf{C}_g$ and $\mathbf{C}_n$ were estimated both via LiMMBo and standard REML. These estimates were compared to the true (simulated) covariance matrix by computing their root mean squared deviation (RMSD; equation (4.4)). The boxplots summarise the RMSD across different variance levels for ten independent simulations each. For moderate trait-set sizes ranging from 10 to 30 traits, LiMMBo and the REML approach yield consistent covariance estimates. Covariance estimation via LiMMBo remains consistent with these observations for the higher trait sizes ($P = 50, 100$).

4.6. mtGWAS with LiMMBo-derived covariance matrices are well calibrated across all phenotype sizes

One key aspect in statistical method development is to ensure that the method is well calibrated under the null model. Apart from gaining knowledge about the genetic and noise trait-by-trait covariance structure of a phenotype, variance decomposition into different random effect components yields estimates that can be supplied as known parameters to approximate mvLMM methods and multi-trait genome-wide association study (mtGWAS). As introduced by Jiang & Zeng [1995] and adapted by Korte & al. [2012], there are different model designs for mvLMM, depending on the underlying biological hypothesis regarding the effect of the genetic variant. The different models were described in section 1.7.8 and include the any effect test (effect size is unequal to zero for at least one trait), the common effect test (same effect size across all traits) and the specific effect test (specific effects of the variant on a given trait). In practice, it is common to test for any effect as a means of discovering associated genotypes and to refine the type of association later. As such, I chose to apply an any effect test for both the calibration and power analysis.
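For the any effect test, the log-likelihood ratio statistic is compared against a $\chi^2$ distribution with $P$ degrees of freedom (section 4.7). The following sketch illustrates how the same test statistic translates into very different p-values as the number of traits grows. It is not part of LiMMBo: the function name `chi2_sf_even` is hypothetical, the statistic is assumed to be already scaled as twice the log-likelihood difference, and the closed-form survival function used here only holds for even degrees of freedom (which covers $P = 10, 50, 100$).

```python
import math


def chi2_sf_even(x, df):
    """Survival function P(X > x) of a chi-square distribution.

    Uses the closed form for even df = 2m:
        exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = math.exp(-half)  # k = 0 term
    total = term
    for k in range(1, df // 2):
        term *= half / k    # builds (x/2)^k / k! iteratively
        total += term
    return min(total, 1.0)


# Same (assumed) test statistic evaluated for different trait numbers P.
llr_statistic = 30.0
for n_traits in (10, 50, 100):
    print(n_traits, chi2_sf_even(llr_statistic, n_traits))
```

The example shows the cost of the additional degrees of freedom: a statistic that is clearly significant for 10 traits is unremarkable for 50 or 100 traits.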
In order to test if LiMMBo-derived covariance estimates yield well calibrated test statistics, I simulated phenotype sets composed of infinitesimal genetic and observational noise effects only, with 10, 20, 30, 50 and 100 traits and the parameters described in table 4.3. For trait sizes of up to 30 traits, I compared the calibration of mtGWAS for LiMMBo- and standard REML-derived covariance matrices. As shown in figure 4.4, both methods yield p-values following a uniform distribution under the null model (compare figure 1.2C) across all phenotype sizes and levels of variance explained by genetics, and thus show appropriate calibration.

Figure 4.4: Calibration of mtGWAS based on covariance estimates from standard REML and LiMMBo. For moderate trait numbers ranging from 10 to 30 traits, phenotypes with different percentages of variance explained by genetics were simulated. The genetic and noise trait-by-trait covariance matrices $\mathbf{C}_g$ and $\mathbf{C}_n$ were then estimated both via LiMMBo and standard REML. The model calibration, i.e. uniform distribution of p-values under the null model, was assessed by mtGWAS with covariance estimates derived from either LiMMBo or REML. Quantile-quantile plots show a uniform distribution for both methods across all trait sizes and levels of proportion of variance explained by genetics.

For higher trait sizes, I also compared the calibration of mtGWAS using a mvLMM to using a simple multivariate linear model (mvLM). The mvLM does not require the variance decomposition into different random effects, i.e. it avoids the computational bottleneck of estimating the trait-by-trait covariance matrices, but simply uses principal components of the genotypes as fixed effects to adjust for population structure. For the residual trait-by-trait covariance structure $\sigma_n$, I used the empirical phenotypic trait-by-trait covariance. As depicted in figure 4.5, the calibration of the mvLM depends strongly on the population structure.
For populations without related individuals, the mvLM shows a uniform p-value distribution, which points to the usefulness of this simpler model approach for populations with well-defined structure. However, for structured populations, the mvLM is poorly calibrated, which clearly demonstrates the difficulty of adjusting for population structure via fixed effects in highly structured populations. In these scenarios, multi-trait mapping of high-dimensional phenotypes is only possible via LiMMBo.

4.7. Multi-trait genotype to phenotype mapping increases power for high-dimensional phenotypes

Multi-trait linear mixed models for low to moderate phenotype sizes have been shown to improve power by leveraging correlated background structure and the trait-by-trait correlations resulting from it [Casale & al., 2015]. To assess the significance of a genotype-phenotype association, the likelihood of the full model is compared to the likelihood of the null model, i.e. the model without the fixed genetic effect, and the resulting LLR statistics are translated into p-values via the appropriate $\chi^2$ distribution with $P$ degrees of freedom (section 1.7.3 and figure 1.2A, [Wilks, 1938]). In order to test if there is still a gain in power for a mvLMM with high-dimensional phenotypes, i.e. a large number of degrees of freedom, I simulated phenotypes where I varied key parameters whose influence on power I wanted to investigate. I varied trait numbers ($P = \{10, 50, 100\}$), the contribution of the genetic effects to the phenotypic variance ($h^2 = \{0.2, 0.5, 0.8\}$) and the proportion of traits that are affected by the genetic variant effects ($a = \{0.2, 1\}$). Parameters of this phenotype simulation are described in table 4.2 and table 4.3. For each of these phenotype sets, I added 20 genetic variant effects to a subset of traits, creating phenotypes with different proportions of traits affected by the genetic variant effects.
For each set-up, I simulated 50 independent phenotypes (a total of 2,250 phenotypes = 3 $h^2$ × 3 trait sizes × 50 permutations × 5 subset sizes) and estimated the trait-by-trait covariance matrices $\mathbf{C}_g$ and $\mathbf{C}_n$ via LiMMBo. I used these estimates in a mvLMM to test the association between the known causal SNPs (from the simulation) and the phenotypes. In addition, I determined the association of the causal SNPs for each trait independently via a univariate linear mixed model (uvLMM). The significance of the associations was assessed by comparing the p-values of these original associations to p-values obtained from mvLMM and uvLMM on 1,000 permutations of the genotypes. For the uvLMM, the p-values were adjusted for multiple testing by the number of traits that were tested, and the minimum adjusted p-value across all traits for a given SNP was recorded. For each SNP, the number of times the (adjusted) p-value of the permutation was less than or equal to the observed p-value was recorded and divided by the total number of permutations, yielding an empirical p-value per SNP.

Figure 4.5: Calibration of mtGWAS via a simple linear model and LiMMBo. The three phenotype sets with 100 traits each were modelled as the sum of infinitesimal genetic and observational noise effects. The three genotype cohorts simulated in section 3.1 form the basis for the infinitesimal genetic effects. The phenotypic variance explained by genetics was set to $h^2 = 0.8$. For the mvLMM (only shown for the population with related individuals), covariance estimates were derived via LiMMBo. In the mvLM, population structure was adjusted for via the first ten PCs of the genotype data. The mvLM is well calibrated for populations without related individuals. For the population containing related individuals, only the mvLMM is well calibrated.

I compared the results of the univariate and multivariate models to evaluate two key differences in the models.
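The permutation-based significance assessment described above can be sketched in a few lines. This is a minimal illustration with made-up p-values, not the analysis code used here: the function names are hypothetical, and the adjustment "by the number of traits" is assumed to be a Bonferroni-style multiplication capped at 1.

```python
def min_adjusted_p(p_values):
    """Minimum p-value across traits after adjusting by the number of traits.

    Assumption: Bonferroni-style adjustment, i.e. p * number of traits, capped at 1.
    """
    n_traits = len(p_values)
    return min(min(1.0, p * n_traits) for p in p_values)


def empirical_p(observed, permuted):
    """Fraction of permutation p-values that are <= the observed p-value."""
    return sum(p <= observed for p in permuted) / len(permuted)


# Made-up numbers for a single SNP (assumptions, for illustration only).
univariate_p = [0.01, 0.2, 0.004]           # one p-value per trait
observed = min_adjusted_p(univariate_p)     # smallest adjusted p-value

permutation_p = [0.5, 0.0005, 0.2, 0.001, 0.9]
print(observed)
print(empirical_p(0.001, permutation_p))
```

For the uvLMM, the same `min_adjusted_p` step would be applied to each permutation before counting, so that observed and permuted values are on the same scale.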
The first difference concerns which burden of multiple association testing weighs heavier: the correction for multiple testing in the uvLMM or the increased degrees of freedom in the mvLMM. This effect can be analysed by varying the number of traits in the phenotypes and keeping the other parameters constant. As depicted in figure 4.6A, for the highest number of phenotypes tested, both models are comparable in the number of causal SNPs they detect. For the other trait sizes tested, the multivariate model outperforms the univariate model by far. For these comparisons, an ideal scenario was assumed: all traits were affected by the genetic variant effects ($a = 1$) and the total genetic variance was low ($h^2 = 0.2$).

The influence of the proportion of traits affected by the causal SNPs on the power to detect these is depicted in figure 4.6B. This analysis allows for the evaluation of the second key difference in the models. The multivariate model can exploit correlated background structure and allows for the detection of pleiotropic effects, while the univariate model can only detect simple SNP-trait associations. This advantage becomes clear in figure 4.6B, where the median number of detected true SNPs depending on the proportion of traits affected by the causal SNPs is depicted. Here, the number of traits was kept constant at $P = 50$ and the mean genetic variance across all traits fixed at $h^2 = 0.2$, i.e. with an increase in the number of affected traits, the contribution of the genetic component per trait decreases. The univariate model suffers from the weaker genetic components when a large number of traits are affected and loses power. In contrast, the multivariate model can still detect increasing percentages of true causal SNPs.

The influence of the proportion of phenotypic variance explained by all genetic effects, i.e. genetic variant and infinitesimal genetic effects, is shown in figure 4.6C.
For both models, the number of detected SNPs decreases with increasing $h^2$, as the effect sizes of the SNPs become negligible compared to the overall genetic variance. However, the multivariate model is still able to exploit the correlation of the variant effects across traits and detects more SNPs in cases of high $h^2$. An overview of all parameter comparisons can be found in figure B.1 in the appendix.

[Figure 4.6 plots: % detected true SNPs for mvLMM and uvLMM against A. number of traits (10, 50, 100), B. % affected traits (20 to 100), C. $h^2$ (0.2, 0.5, 0.8)]

Figure 4.6: Power comparison for mvLMM and uvLMMs of high-dimensional phenotypes. Each panel shows the influence of one simulation parameter on the power to detect the causal SNPs. When investigating one parameter, the other parameters were fixed at a certain value. For each set-up, 50 independent datasets were simulated and analysed. A. Influence of the number of traits: proportion of traits affected and the total genetic variance fixed at $a = 1$ and $h^2 = 0.2$, respectively. B. Influence of the proportion of traits affected: trait size and total genetic variance fixed at $P = 50$ and $h^2 = 0.2$, respectively. C. Influence of total genetic variance: trait size and proportion of traits affected fixed at $P = 100$ and $a = 0.6$.

4.8. LiMMBo for multi-trait GWAS and beyond

In this chapter, I introduced LiMMBo, a new method for the multivariate analysis of large trait numbers, which uses a bootstrap method to estimate complex trait covariance matrices. The main benefit of LiMMBo is that it scales to 100s of phenotypes, both because of its inherent sub-sampling method and because the most computationally intense part of the method can be parallelised. To take advantage of the parallelisation, I implemented an optional automatic detection of multiple cores, which allows for easy realisation of this process via the Parallel Python software [Vanovschi, 2017].
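Because each bootstrap is processed independently, the per-subset computations map naturally onto a pool of workers. The sketch below illustrates this pattern only; it deliberately swaps in Python's standard `concurrent.futures` module instead of the Parallel Python package used by LiMMBo, uses threads to stay self-contained (a process pool would be the natural choice for CPU-bound work), and replaces the per-bootstrap variance decomposition with a simple sample covariance as a stand-in. All names and data are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def covariance(columns):
    """Sample covariance matrix of equally long columns.

    Stand-in for the per-bootstrap variance decomposition.
    """
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    return [
        [
            sum((x - mi) * (y - mj) for x, y in zip(ci, cj)) / (n - 1)
            for cj, mj in zip(columns, means)
        ]
        for ci, mi in zip(columns, means)
    ]


def fit_subset(task):
    """Process one bootstrap: return the trait subset and its estimate."""
    traits, subset = task
    return subset, covariance([traits[j] for j in subset])


# Toy data (assumption): four traits measured on four samples.
traits = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
]
subsets = [[0, 1], [1, 2], [0, 3], [2, 3]]

# Each bootstrap is an independent task; map() preserves the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fit_subset, [(traits, sub) for sub in subsets]))

for subset, estimate in results:
    print(subset, estimate)
```

The results come back as (subset, estimate) pairs, which is the form needed for the subsequent combination step.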
In practice, the sub-sampling and parallelisation mean that trait sizes up to 30 or 40 can be analysed in hours, rather than taking several days as for standard REML-based methods. Most notably, complex datasets of 100s of traits, which are out of scope for the REML approaches, are feasible when using LiMMBo. I showed that the covariance matrices estimated via LiMMBo are as good an estimator of the real covariance matrices as the ones of the validated REML approach. Consequently, these covariance matrices produce well calibrated null models when used in LMM for GWAS, showing the validity of the approach. To show the advantage of LiMMBo, I demonstrated the power gain for multi-trait GWAS of high-dimensional phenotypes with LiMMBo over standard single-trait models across a wide range of phenotype architectures. I made LiMMBo accessible as an open source Python module at https://github.com/HannahVMeyer/LiMMBo/tree/master/limmbo. LiMMBo is compatible with the LIMIX package for linear mixed models [Lippert & al., 2014].

The bootstrapping has proven powerful in reducing the computational complexity of estimating the covariance parameters and has made the analysis of complex datasets with high trait numbers possible. However, so far, I have only examined the complexity and calibration dependent on the size of the overall phenotype set $P$. Of additional interest would be understanding the (co-)dependence of the bootstrap size $s$, the number of bootstraps $b$ and the co-sampling of traits $c$. Based on already simulated datasets, a systematic comparison of the run times, covariance estimates and calibration of different combinations of $s$ and $b$ could be conducted. For each of these combinations, different thresholds for $c$ could be examined.

Much of the attraction of linear mixed models in genetics has been their ability to model complex genetic relatedness.
As described by [Kang & al., 2010] and demonstrated in this chapter, simple linear models are not suitable for analysing phenotypes with complex underlying genetic relatedness, whereas linear mixed models with the covariance matrices estimated by LiMMBo are appropriate and possible up to 100s of traits. Complex relatedness in populations is widespread in plant and animal breeding [Bolormaa & al., 2014; Yang & al., 2014], and increasingly common in human bottleneck populations [Tachmazidou & al., 2013]. Furthermore, as the population numbers increase in human genetics, complex cryptic relationship structures are more prevalent [Reich & Goldstein, 2001], meaning that methods such as LiMMBo will be more applicable in human genetics in the future.

Trait-by-trait covariance matrices are useful for a variety of high-dimensional big data problems across genomics, from statistical genetics to single cell analysis. The ability to accurately estimate large trait-by-trait covariance matrices using this bootstrap method may be applicable to more domains than GWAS, e.g. many gene expression studies use covariance matrices. Previous work from Schäfer & Strimmer [2005] showed that large gene dimensions coupled with small(er) sample sets mean that empirical covariance matrices cannot be accurately estimated; other investigators [Ledoit & Wolf, 2004; Furrer & Bengtsson, 2007; Bickel & Levina, 2008] used shrinkage methods to create valid covariance matrices. The work from Teng & Huang [2009] uses subsampling, but with strong shrinkage priors to generate the final covariance matrix. By fitting the average to the closest true covariance matrix, LiMMBo ensures positive semi-definiteness of the covariance while avoiding ill-conditioned matrices, which usually introduce large biases in the final use of these models. Thus, covariance estimation based on the method implemented in LiMMBo might be applicable and useful in other areas of quantitative genetics.
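The following sketch illustrates why a Cholesky parametrisation, as used for the fit in section 4.4, yields a valid covariance matrix by construction, whereas a symmetric matrix assembled entry by entry need not be positive semi-definite. It is a toy illustration with made-up numbers and hypothetical function names, and it does not reproduce the least-squares fit itself.

```python
def from_cholesky_params(params, P):
    """Build C = L L^T from the P(P+1)/2 entries of a lower-triangular L."""
    L = [[0.0] * P for _ in range(P)]
    values = iter(params)
    for i in range(P):
        for j in range(i + 1):
            L[i][j] = next(values)
    return [
        [sum(L[i][k] * L[j][k] for k in range(P)) for j in range(P)]
        for i in range(P)
    ]


def quad_form(C, x):
    """Quadratic form x^T C x; non-negative for every x if C is PSD."""
    n = len(x)
    return sum(x[i] * C[i][j] * x[j] for i in range(n) for j in range(n))


P = 3
x = [1.0, -1.0, 1.0]

# A symmetric matrix with unit diagonal that is NOT positive semi-definite:
# its pairwise entries are mutually inconsistent (made-up values).
A = [[1.0, 0.9, -0.9],
     [0.9, 1.0, 0.9],
     [-0.9, 0.9, 1.0]]

# Any parameter vector of length P(P+1)/2 gives a PSD matrix via L L^T.
params = [1.0, 0.9, 0.2, -0.9, 0.5, 0.3]
C = from_cholesky_params(params, P)

print(quad_form(A, x))  # negative, so A is not a covariance matrix
print(quad_form(C, x))  # equals |L^T x|^2 and is therefore non-negative
```

Optimising over the entries of the Cholesky factor therefore restricts the search to valid covariance matrices without the need for explicit constraints.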
The ability to generate large cohorts of well phenotyped and genotyped individuals has forced the development of many new methods in statistical genetics. With the advent of genotyped human cohorts of up to 500,000 individuals with over 2,000 different traits [Sudlow & al., 2015], and plant phenotyping routinely in the 1,000s of individuals from structured crosses with 100s of (image-based) phenotypes [Atwell & al., 2010; Yang & al., 2014], new informative and scalable methods are needed. LiMMBo extends the reach of linear mixed models into this new regime, allowing for new complex genetic associations to be made.

5 LiMMBo applied to multi-trait GWAS in Saccharomyces cerevisiae

In the previous chapter, I introduced LiMMBo and showed its calibration and power on simulated datasets. In this chapter, I will explore its utility on a real dataset. Amongst the publicly available studies, such as flowering, defense and developmental phenotypes in Arabidopsis thaliana [Atwell & al., 2010] or human blood metabolites [Shin & al., 2014], I found the dataset of 46 quantitative traits in yeast generated and analysed in the study by Bloom and colleagues [Bloom & al., 2013] most suitable for several reasons. First, they investigated the growth of a yeast F2 cross on several different substrates. The genetic architecture of an F2 cross is highly structured, making it an ideal test scenario for a linear mixed model capable of adjusting for and profiting from population structure in the sample. Second, the measured phenotypic traits have a broad spectrum of correlation, ranging from highly related phenotypes for metabolically similar compounds to very low correlation between certain chemicals. At the same time, the phenotypic measurements are all obtained by measuring the growth size of the colonies and hence, the variable type and unit are the same across phenotypes.
Lastly, the collection and quality control of the data were well described and the data were easily accessible in a user-friendly format. However, as with many studies where multiple measurements per sample are obtained, not all samples were fully phenotyped.

In the following, I will first describe the data processing and imputation strategy for the yeast phenotypes. I will then show the results of applying LiMMBo and subsequent mtGWAS to the dataset and compare the results to the associations obtained from single-trait genome-wide association studies (stGWAS). Finally, I will explore the benefits of jointly modelling large numbers of traits in genetic studies.

Like LMMs and methods based thereon, LiMMBo requires samples to be fully phenotyped, as the model cannot deal with missing values. In order to understand how to deal with missing values in the dataset, it is important to understand the underlying process generating the missing data [Rubin, 1976]. In general, one can distinguish between three processes: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [Little & Rubin, 2002]. Their formal definitions are based on the data $\mathbf{X} \in \mathbb{R}^{N \times P}$, the binary indicator matrix $\mathbf{M} \in \mathbb{R}^{N \times P}$ and $\phi$, the (unknown) parameter of the missing data process, i.e. the parameter of the conditional distribution $g_\phi$ of $\mathbf{M}$ given $\mathbf{X}$. $N$ is the number of observations and $P$ the number of observed variables. Entries in $\mathbf{M}$ take two values, $m_{ij} = 1$ if an observation is missing or $m_{ij} = 0$ if it is observed. The data $\mathbf{X}$ can formally be grouped into $\mathbf{X} = (\mathbf{X}_{\text{obs}}, \mathbf{X}_{\text{miss}})$, where $\mathbf{X}_{\text{obs}}$ and $\mathbf{X}_{\text{miss}}$ are the observed and missing parts of the data, respectively. Data are MAR if the distribution of missingness only depends on $\mathbf{X}_{\text{obs}}$:

\[ g_\phi(\mathbf{M} \mid \mathbf{X}, \phi) = g_\phi(\mathbf{M} \mid \mathbf{X}_{\text{obs}}, \phi) \quad \forall\, \mathbf{X}_{\text{miss}}, \phi. \tag{5.1} \]

If the distribution is also independent of $\mathbf{X}_{\text{obs}}$,

\[ g_\phi(\mathbf{M} \mid \mathbf{X}, \phi) = g_\phi(\mathbf{M} \mid \phi) \quad \forall\, \mathbf{X}, \phi, \tag{5.2} \]

the data are MCAR.
If, on the other hand, the distribution of missingness is dependent on $\mathbf{X}_{\text{miss}}$, hence

\[ g_\phi(\mathbf{M} \mid \mathbf{X}, \phi) = g_\phi(\mathbf{M} \mid \mathbf{X}_{\text{obs}}, \mathbf{X}_{\text{miss}}, \phi) \quad \forall\, \mathbf{X}, \phi, \tag{5.3} \]

the data are classified as MNAR.

To illustrate these cases, consider an example where there are $N$ colonies of yeast and one wants to automatically detect the size and the density of each colony with a suitable instrument ($P = 2$). If the instrument fails with a constant probability $\phi$ for any colony, independent of the measurement, then the pattern of missing values in the data is MCAR. If the probability that the density measurement is missing changes with the value of the size measurement, but is not dependent on the density of colonies with the same size, then the data are MAR. In contrast, data are MNAR if the probability of obtaining a density measurement depends on the density of colonies with the same size.

In practice, detecting the missing data mechanism often proves difficult. Testing for MCAR can be done via statistical tests [Little, 1988], but distinguishing between MAR and MNAR cannot be achieved formally, as this would require knowledge of the missing values [Little & Rubin, 2002; van Buuren & Groothuis-Oudshoorn, 2011]. However, there are visualisation tools that provide diagnostic plots and approximate measures which can help make assumptions about the missingness mechanism [Templ & al., 2012; Garson, 2015].

When analysing datasets with missing data, there are four general approaches to choose from: i) methods simply based on the complete data, ii) methods based on complete data with weighting procedures, iii) model-based and iv) imputation-based procedures. In the first class, incompletely recorded samples are simply excluded, which is the easiest method to implement, but is inefficient and can lead to major bias, especially if the data are MNAR [Little & Rubin, 2002].
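The colony example and the bias of a complete-case analysis can be made concrete with a small simulation. The following Python sketch is illustrative only: the linear relation between size and density, the logistic missingness model and the function name `simulate_missingness` are assumptions made for this example and are not part of the analysis in this chapter. It removes density values under each of the three mechanisms and compares the complete-case mean of density with a regression-based estimate that makes use of the fully observed size.

```python
import numpy as np


def simulate_missingness(mechanism, n=200_000, seed=1):
    """Simulate colony size/density and delete density values.

    Returns (complete_case_mean, regression_mean) of density.
    The true mean of density is 0 by construction (hypothetical model).
    """
    rng = np.random.default_rng(seed)
    size = rng.normal(size=n)
    # Hypothetical relation: density is correlated with size
    density = 0.8 * size + 0.6 * rng.normal(size=n)

    if mechanism == "MCAR":
        # Constant failure probability, independent of any measurement
        p_miss = np.full(n, 0.3)
    elif mechanism == "MAR":
        # Missingness of density depends only on the observed size
        p_miss = 1.0 / (1.0 + np.exp(-2.0 * size))
    elif mechanism == "MNAR":
        # Missingness of density depends on the density itself
        p_miss = 1.0 / (1.0 + np.exp(-2.0 * density))
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    observed = rng.random(n) >= p_miss

    # Approach i): complete-case analysis, drop colonies without density
    complete_case_mean = density[observed].mean()

    # Use the observed part of the data: regress density on size in the
    # complete cases and predict density for every colony
    slope, intercept = np.polyfit(size[observed], density[observed], 1)
    regression_mean = (intercept + slope * size).mean()
    return complete_case_mean, regression_mean


if __name__ == "__main__":
    for mech in ("MCAR", "MAR", "MNAR"):
        cc, reg = simulate_missingness(mech)
        print(f"{mech}: complete-case mean={cc:.3f}, regression mean={reg:.3f}")
```

Under these assumed settings, the complete-case mean stays close to the true value only for MCAR. Conditioning on the observed size removes the bias for MAR but not for MNAR, which mirrors the distinction drawn in equations (5.1) to (5.3).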
Weighting procedures also exclude incompletely sampled data, but apply a weighting to the recorded samples, where the weights attempt to adjust for the missing data as if it were part of the sample design. Model-based procedures define a model for the observed data and base inference and parameter estimates on the likelihood or posterior distribution of that model. The last class of methods, imputation-based approaches, estimate the missing values based on the observed values, so that the completed dataset can be analysed by standard methods (an extensive review of the different methods can be found in [Little & Rubin, 2002]). The precise usage of the methods and their underlying assumptions depend on the missing data mechanism.

I found the imputation approach most applicable for dealing with the missing phenotype values in the yeast dataset, as it was simple to apply, did not lead to a decreased sample size and possible loss in power (as method i would have) and did not require recasting the model underlying LiMMBo (as would have been required for method iii). There are a vast number of imputation methods available, which can be categorised by both the method for imputation and the number of times the missing values are imputed. Methods include simple mean prediction, where the missing data for a given variable are replaced by the mean of all known values of that variable, and derivations thereof such as KNN or FKM, which use the mean of the k-nearest neighbours to replace the missing values [Troyanskaya & al., 2001; Li & al., 2004]. Instead of imputing based on the mean, i.e. the centre of a distribution, other strategies use random draws from a predictive distribution of plausible values of the missing value, where the predictive distribution is conditioned on the observed data.
These techniques can then be used to either impute one value for each missing item (single imputation) or more than one value to account for imputation uncertainty (multiple imputation) [Little & Rubin, 2002]. For complex datasets, multiple imputation has emerged as the method of choice [Rubin, 1987; Schafer, 1997].

5.1. Dataset and imputation

The dataset generated by Bloom & al. [2013] consists of phenotype and genotype data of 1,008 prototrophic haploid Saccharomyces cerevisiae segregants derived from a cross between a laboratory strain (BY MATa) and a wine strain (RM MATα). In brief, the segregants were generated by mating of the haploid parental strains and subsequent sporulation of the diploid heterozygote. Sporulation resulted in 1,008 four-spore tetrads that showed 2:2 segregation of mating type and drug-resistance markers. From each tetrad, one spore was selected for further analyses (figure 5.1).

For phenotyping, these segregants were grown on agar plates in 46 growth conditions. These can broadly be grouped into growth on different carbohydrates or derivatives thereof (lactose, lactate, raffinose, maltose, mannose, sorbitol, trehalose, xylose, galactose), growth on different culture media (YPD, YNB) with different pH (YNB:pH3, YNB:pH8) or at different temperatures (YPD:4C, YPD:15C, YPD:37C), and growth on different antibiotics and xenobiotics (e.g. cadmium chloride, neomycin, zeocin, cisplatin). For a full list, see labels in figure 5.4. After incubation for 48h, the colony size of each segregant grown in the different conditions was measured. The final phenotypes were defined as the colony size normalised to the colony size on control medium. For the remainder of this chapter, a trait is defined as this normalised growth size in one condition. Out of the 1,008 segregants, 303 segregants were phenotyped for all 46 traits. Segregants were genotyped using Illumina short-read sequencing.
After mapping, quality control and filtering for unique genotype markers, all 1,008 segregants were genotyped at 11,623 unique genotypic markers.

[Figure 5.1: schematic of the cross, with the steps Mating, Heterozygous diploid, Sporulation, Tetrad dissection, Genotyping & phenotyping.]
Figure 5.1: Generation of yeast dataset. Haploid parental strains BY MATa and RM MATα were mated to generate diploid heterozygotes. These diploid heterozygotes were sporulated, during which they undergo meiosis and yield tetrads of recombinant haploids. From each tetrad, one spore was selected. For phenotyping, these segregants were grown on agar plates in different conditions. Adapted from [Bloom & al., 2013].

5.1.1. Missing data mechanism

In order to gain an understanding of the dataset, I first looked at the frequencies and distribution of missing values. There are 135 different combinations of missing values across the samples and the missing phenotypes are not evenly distributed (figure 5.2A). Some traits, such as cobalt chloride, are present for almost all samples, while others, such as sorbitol or raffinose, are missing in more than a third of the samples. I used Little's global test for MCAR to analyse whether these observed data patterns can be accounted for by a MCAR mechanism. Little's method tests the null hypothesis that the data are MCAR [Little, 1988; Beaujean, 2015], which can in this case be rejected with a p-value of $2 \times 10^{-34}$ (based on a $\chi^2$ distribution, $\chi^2 = 5{,}902$, $df = 4{,}631$). Whether data are MAR or MNAR cannot be tested formally and relies on approximate measures and assumptions based on the experimental procedures [Schafer & Graham, 2002; Garson, 2015; Templ & al., 2012]. Garson [2015] suggests using significance tests of missingness. If it can be demonstrated that one or more variables in the dataset are significantly correlated with missing values, missingness may be predictable, which is the requirement for imputing MAR data.
In order to test for predictable missingness, I created an indicator matrix for the phenotype matrix, where observed values were encoded as zero and missing values as one. For each of the 46 traits in the dataset, I correlated the observed values across all samples with each column of the indicator matrix, i.e. the missingness patterns per trait. If all values were observed for a given trait, all values in the corresponding column of the indicator matrix were equal to zero and the correlation between the trait and the missingness was set to NA. Figure 5.3 shows the correlation patterns between the phenotypes and the missing values per trait. For traits like cobalt chloride and magnesium sulfate, where little data is missing, many entries are NA. Overall, for a number of traits and missingness patterns, there is sufficient evidence for predictable missingness, and MAR assumptions for further analyses were considered valid. Most importantly, for MAR data, the missing data mechanism is ignorable for maximum likelihood based methods and no further adjustments for the mechanism have to be made in the modelling [Rubin, 1976; Little, 1988]. Thus, the MAR assumption of missingness in the yeast data allows for imputation via the likelihood-based method of multiple imputation and for the use of LMMs.

5.1.2. Imputation via MICE

Imputation of missing values requires understanding which missing trait values can be reliably imputed and finding the best parameter settings for the imputation. In order to do this, I needed a fully phenotyped dataset with the same structure as the yeast dataset, where missing values could be introduced, imputed and subsequently compared to the true values.
I chose a simple approach using the subset of the 303 fully phenotyped samples and introducing missing values with a similar pattern of missingness as observed in the original dataset. The results for the real (figure 5.2A) and simulated (figure 5.2B) datasets are similar in terms of frequencies and combinations of missing/non-missing traits. I used this simulated dataset as input to the imputation framework based on multivariate imputation by chained equations (MICE) [van Buuren & Groothuis-Oudshoorn, 2011].

[Figure 5.2: two panels (A, B), each consisting of an aggregation plot of missing-value combinations across the 46 traits, a bar chart of combination frequencies and a histogram of missing values per trait.]
Figure 5.2: Frequencies and distributions of missing values in the yeast phenotype data. In both panels, the aggregation plot (middle) depicts all existing combinations of missing (blue) and non-missing (orange) values in the traits. The bar chart on its right shows the frequencies of occurrence of the different combinations. The histogram on the top shows the frequency of missing values for each trait (R package: VIM [Templ & al., 2012]). A. The full dataset contains normalised colony sizes for growth in 46 different conditions of 1,008 genotyped yeast segregants. 306 segregants are fully phenotyped (bar chart, orange bar). B. Fully-phenotyped dataset of 306 segregants with simulated missing values based on the observed missingness pattern for the entire pool of 1,008 segregants. Generated via R function VIM::aggr.

[Figure 5.3: heatmap with the 46 phenotypes in rows and the missingness indicator per trait in columns; colour scale: Spearman correlation from −0.16 to 0.16.]
Figure 5.3: Correlations of observed phenotypes with missing data values. For each of the 46 traits, Spearman's rank correlation coefficient $\rho$ was computed with each column of the indicator matrix of the phenotypes, containing zero for observed values and one for missing values. The strength and the direction of correlations are depicted above, with the original phenotypes in rows and the indicator matrix of the phenotypes across columns. Grey squares indicate NA, i.e. columns in the indicator matrix for which no traits were missing when correlated with the observed values for a given trait. Generated via R function corrplot::corrplot.

MICE belongs to the general class of multiple imputation frameworks, where several imputed versions of the dataset are generated and each variable is imputed separately. The imputed values are chosen from plausible values drawn from a distribution that is specific for each variable, in this case for each trait. This distribution is derived from the dataset $\mathbf{X} \in \mathbb{R}^{N \times P}$ itself, with $\mathbf{X}$ split into missing and observed parts $\mathbf{X} = (\mathbf{X}_{\text{miss}}, \mathbf{X}_{\text{obs}})$, the binary indicator matrix for missingness $\mathbf{M} \in \mathbb{R}^{N \times P}$ and a set of predictor variables $Z$. The MICE algorithm is usually divided into four steps [Rubin, 1987; Van Buuren & Oudshoorn, 1999; Pigott, 2001]:

1. Specify the posterior predictive density $p(\mathbf{X}_{\text{miss}} \mid Z, \mathbf{M})$ given the non-response mechanism $p(\mathbf{M} \mid \mathbf{X})$ and the complete data model $p(\mathbf{X})$.
2. Draw imputations from this density to produce $m$ complete datasets.
3. Perform a complete-data analysis on each of the $m$ completed data matrices.
4. Pool the $m$ analysis results into final point and variance estimates.

Garson's [2015] approach allows me to obtain reliable imputation estimates while having to estimate the variance components via LiMMBo only once.
As described in the previous chapter, LiMMBo strongly reduces the computation time for the variance decomposition (section 4.4), but it is still the most time-consuming step in the analysis. The two main choices when applying MICE for imputation have to be made in step one: the type of the imputation model and the choice of predictor variables.

Imputation model. From the different imputation models available (examples described in [van Buuren & Groothuis-Oudshoorn, 2011]), I found predictive mean matching, a semi-parametric method which preserves non-linear relations in the data [Little, 1988; van Buuren & Groothuis-Oudshoorn, 2011], a fast and sensible imputation option. In brief, predictive mean matching finds the mean and covariance of the multivariate distribution $\mathbf{X}$ with missing values (often simply based on the complete cases). Subsequently, for each incomplete sample, it predicts the missing values $\mathbf{X}_{\text{miss}}$ based on $\mathbf{X}_{\text{obs}}$ and the provided predictor variables $Z$. In addition, values of the complete samples for the same set of $\mathbf{X}_{\text{miss}}$ are predicted. The predicted values of the incomplete sample are then matched to the predicted values of the complete samples and the closest match is chosen. The imputed values for the incomplete sample are set to the observed values of the closest match [Little, 1988]. In this way, only realistic and theoretically observable values (assuming proper quality control of the data prior to imputation) are imputed.

Predictor variables. Collins & al. [2001] show that as many valid predictor variables as possible should be included in the imputation to obtain the least amount of bias and maximal certainty about the predictions. In addition, Schafer [1997] demonstrated that using this strategy makes MAR assumptions more plausible. However, not all predictors will be relevant and the choice of predictors can be made on a per-variable level.
In order to select suitable predictors for each trait, I first computed the pairwise Spearman correlation coefficient $\rho$ for all traits across the 303 fully-phenotyped segregants. Some of the traits, like cadmium chloride or neomycin, show very little correlation to any of the other traits, while many of the traits based on growth on different carbohydrate resources form a large cluster of moderate to strong correlation (figure 5.4). I tested several sets of predictor variables, either using all traits as predictors or choosing predictors based on the pairwise $\rho$ of the traits. For each trait, I included predictors that showed a correlation higher than a predefined threshold ($\rho \in \{0.1, 0.2, 0.3\}$). In addition, I restricted the predictors to traits that were missing in fewer than 20% of the samples in the dataset. This excluded cadmium chloride (21% missing), hydrogen peroxide (24%), raffinose (34%), sorbitol (41%) and YPD:4C (20%) as predictor variables, but did not prevent them from being imputed.

Further parameters for MICE are the number of imputed datasets $m$ (set to $m = 20$) and the number of iterations $maxit$ (set to $maxit = 30$). For each predictor set-up, I initiated MICE with the same seed for the random number generator to ensure comparability. After imputation, I evaluated the goodness of the imputation by computing the Spearman correlation of the imputed values (averaged across the $m$ imputed datasets) to the experimentally observed ones (figure 5.5). Traits where the correlation between the imputed and the original values exceeded 0.95 in at least one of the predictor set-ups were retained in the analysis. For five traits (cadmium chloride, hydrogen peroxide, raffinose, YNB:ph8, YPD:4C), no suitable predictors could be determined and these were excluded from further analyses (figure 5.5, red labels).
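The predictor selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, the correlation threshold is applied to the absolute value of $\rho$ (the text does not specify whether the sign was taken into account), and the rank transformation ignores ties, which is only reasonable for continuous measurements.

```python
import numpy as np


def spearman_matrix(x):
    """Spearman correlation between the columns of a complete matrix x.

    Computed as the Pearson correlation of the column-wise ranks.
    Ties are not handled (assumes continuous measurements).
    """
    ranks = np.argsort(np.argsort(x, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


def select_predictors(pheno, rho_min, max_missing=0.2):
    """Choose predictor traits for each trait (column) of pheno.

    pheno: (N, P) array, np.nan marks missing values.
    rho_min: minimum absolute Spearman correlation (assumption: absolute).
    max_missing: traits missing in this fraction of samples or more are
        not used as predictors, but can still be imputed themselves.
    Returns a dict mapping trait index -> array of predictor indices.
    """
    # Correlations are estimated on the fully phenotyped samples only
    complete_rows = ~np.isnan(pheno).any(axis=1)
    rho = spearman_matrix(pheno[complete_rows])

    # Fraction of missing values per trait in the whole dataset
    frac_missing = np.isnan(pheno).mean(axis=0)
    eligible = frac_missing < max_missing

    predictors = {}
    for trait in range(pheno.shape[1]):
        keep = (np.abs(rho[trait]) > rho_min) & eligible
        keep[trait] = False  # a trait does not predict itself
        predictors[trait] = np.flatnonzero(keep)
    return predictors
```

Calling `select_predictors(pheno, rho_min=0.1)`, and likewise with 0.2 and 0.3, would give one predictor set per threshold, which could then be passed to an imputation routine and compared as in figure 5.5.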
For each trait, I chose the predictor scheme that yielded the highest correlation between the imputed and observed data for the imputation of the missing values in the full dataset. Missing values were imputed in segregants that were phenotyped for at least 80% of the traits. The final dataset contained 981 segregants with phenotypes for 41 traits each.

[Figure 5.4: clustered correlation heatmap; colour scale: Pearson correlation from −1 to 1. Trait labels in plot order: Cadmium_Chloride, Cobalt_Chloride, Neomycin, Caffeine, Menadione, YPD:37C, Paraquat, Congo_red, Zeocin, Formamide, Indoleacetic_Acid, Diamide, Hydrogen_Peroxide, Cycloheximide, E6_Berbamine, Magnesium_Sulfate, Hydroquinone, Hydroxyurea, x4NQO, Copper, Lithium_Chloride, x6−Azauracil, Ethanol, Raffinose, Magnesium_Chloride, Lactate, Lactose, Xylose, Sorbitol, Trehalose, YNB, YPD, x5−Fluorocytosine, x5−Fluorouracil, Tunicamycin, YPD:15C, YPD:4C, SDS, x4−Hydroxybenzaldehyde, Galactose, Mannose, Calcium_Chloride, Cisplatin, Maltose, YNB:ph3, YNB:ph8.]
Figure 5.4: Pair-wise correlations of 46 growth traits in Saccharomyces cerevisiae. For each trait pair, Pearson's correlation coefficient $\rho$ and the p-values of the correlation were computed. The p-values were adjusted for multiple testing according to Benjamini and Hochberg's method [Benjamini & Hochberg, 1995]. The strength and the direction of significant correlations ($FDR < 0.05$) are depicted above. Non-significant correlations are left blank. The traits are clustered based on complete-linkage clustering of $(1 - \rho)$ as distance measurement and the largest clusters are indicated by black squares. Generated via R function corrplot::corrplot.

[Figure 5.5: per-trait plot; x-axis: Phenotype (46 traits), y-axis: Pearson correlation (0.85 to 1.00); legend "Predictors": All, Corr0.1, Corr0.2, Corr0.3.]
Figure 5.5: Correlation between imputed and experimentally observed trait values. In the subset of 306 fully phenotyped samples, missing values were introduced and subsequently imputed via MICE. Different predictor sets were tested based on Pearson's correlation coefficient: traits were considered predictors if their correlation with the target trait was greater than a given threshold. For each predictor setup (all traits as predictors and predictors passing the correlation threshold $\rho \in \{0.1, 0.2, 0.3\}$), $m = 20$ imputed datasets and $maxit = 30$ iterations of MICE were conducted. The goodness of the imputation was evaluated by computing the correlation of the imputed values (averaged across iterations $m$) to the experimentally observed ones. Traits with at least one correlation greater than the 0.95 threshold (black vertical line) were retained in the dataset.
For traits labelled in red, the imputation was considered to be unreliable and the traits were excluded from further analyses.

5.2. Multi-trait GWAS with LiMMBo

In order to show the utility of LiMMBo for joint high-dimensional phenotype analyses and to demonstrate the advantages over single-trait approaches, I analysed the imputed dataset both with stGWAS and mtGWAS.

5.2.1. Estimating the genetic relationship in the yeast cross

For both analyses, I used an LMM where the sample-by-sample component of the random genetic effect is based on the RRM. To obtain an estimate of the RRM, I first pruned the genome-wide SNPs (11,623) for SNPs that are in LD within a window of 3kb and show a correlation $r^2 > 0.2$. As the dataset is based on an F2 cross, LD structure estimation is not straightforward and this window size is a simple estimate derived from a study on the population genomics of domestic and wild yeasts [Liti & al., 2009]. The LD pruning reduced the SNP set for RRM estimation to 4,105 SNPs. The RRM was estimated using the method introduced by Yang & al. [2011] (section 1.7.6). PLINK [Chang & al., 2015] was used for both LD pruning (with parameters --indep-pairwise 3kb 5 0.) and RRM estimation (with parameters --make-rel square gz). For the genotype to phenotype mapping, the full set of 11,623 SNPs was used.

5.2.2. LiMMBo increases power in detecting genetic associations

The first step in the mtGWAS is the trait-by-trait covariance estimation via LiMMBo. 1,000 bootstraps of 10 traits each were run and their trait-by-trait covariance was estimated. The combined trait-by-trait covariance estimates $\mathbf{C}_g$ and $\mathbf{C}_n$ were used as input estimates for the second step in the mtGWAS, the mvLMM (equation (1.42)) across all genome-wide SNPs. I used a mvLMM with a trait-design matrix corresponding to the any effect test, i.e. testing for an effect of each SNP on any of the traits against the null hypothesis of no association (section 1.7.8).
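The bookkeeping behind the bootstrapping step (sampling trait subsets, estimating a covariance matrix per subset and combining the estimates) can be sketched as follows. This is a deliberately simplified Python illustration and not the LiMMBo implementation: the per-subset variance decomposition is replaced by the empirical sample covariance as a stand-in, and the final fit to the closest covariance matrix is replaced by eigenvalue clipping. The function name and both substitutions are assumptions made for this example.

```python
import numpy as np


def bootstrap_trait_covariance(y, subset_size, n_boot, seed=0):
    """Combine covariance estimates from bootstrapped trait subsets.

    y: (N, P) complete phenotype matrix.
    Returns (combined P x P covariance, pair-sampling counts).
    """
    rng = np.random.default_rng(seed)
    n_traits = y.shape[1]
    total = np.zeros((n_traits, n_traits))
    counts = np.zeros((n_traits, n_traits))

    for _ in range(n_boot):
        # Sample a subset of traits without replacement
        idx = rng.choice(n_traits, size=subset_size, replace=False)
        # Stand-in for the per-subset variance decomposition:
        # the empirical covariance of the sampled traits
        sub_cov = np.cov(y[:, idx], rowvar=False)
        total[np.ix_(idx, idx)] += sub_cov
        counts[np.ix_(idx, idx)] += 1

    if (counts == 0).any():
        raise ValueError("some trait pairs were never sampled together")

    # Average each entry over the bootstraps in which the pair occurred
    average = total / counts

    # Stand-in for fitting the closest valid covariance matrix:
    # symmetrise and clip negative eigenvalues to zero
    eigval, eigvec = np.linalg.eigh((average + average.T) / 2)
    eigval = np.clip(eigval, 0.0, None)
    combined = (eigvec * eigval) @ eigvec.T
    return combined, counts
```

With the empirical covariance as stand-in, the average simply reproduces the full sample covariance, so the sketch only shows how the subset estimates are collected and combined. Assuming the traits of each bootstrap are drawn uniformly without replacement, 1,000 bootstraps of 10 out of 41 traits would sample each trait pair together about $1{,}000 \times (10 \times 9)/(41 \times 40) \approx 55$ times.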
For the stGWAS, the trait-by-trait components of the random effects are point estimates ($\sigma_g$ and $\sigma_n$) derived within the LMM framework and do not require a priori estimation. The stGWAS was performed for each trait separately, applying univariate LMMs (equation (1.30)) to test the effect of a SNP on each individual trait. To account for the number of univariate tests, the p-values obtained from the stGWAS were adjusted for multiple testing by the effective number of conducted tests $M_{\text{eff}}$. $M_{\text{eff}}$ was introduced by Galwey [2009] and adjusts for multiple testing in a manner similar to the Bonferroni method (section 1.7.4, [Dunn, 1961]). However, it is less conservative, as it does not adjust for the total number of tests, but for the estimated, effective number of tests, taking correlation between the variables and tests into account:

\[ M_{\text{eff}} = \frac{\left(\sum_{i=1}^{M} \sqrt{\lambda_i}\right)^2}{\sum_{i=1}^{M} \lambda_i}, \tag{5.4} \]

where $\lambda_i$ are the eigenvalues of the phenotypes' correlation matrix. To adjust for multiple testing in the stGWAS, the p-values are multiplied by $M_{\text{eff}}$ and set to one if the multiplication leads to values greater than one. $M_{\text{eff}}$ for the 41 growth traits was estimated to be 33.

In order to compare the single-trait and multi-trait analyses, I followed approaches of previous association studies in yeast crosses [Brem & al., 2002; Brem & Kruglyak, 2005; Ehrenreich & al., 2010], where permutations were used to estimate empirical FDR levels. With a conservative, theoretical threshold of $p_t = 10^{-5}$, at most one SNP is expected to be a false positive in a total of $s = 11{,}623$ SNPs. To find the empirical FDR corresponding to this threshold, I generated $k = 50$ permutations of the genotypes and fitted the LMMs against these permutations. These p-values were used as the empirical p-value distribution and, for $p_t = 10^{-5}$, the empirical FDRs were estimated as $\text{FDR}_{\text{mtGWAS}} = 1.2 \times 10^{-5}$ and $\text{FDR}_{\text{stGWAS}} = 8.6 \times 10^{-6}$.

Figure 5.6 shows the manhattan plot of the multi-trait and single-trait GWAS. On several chromosomes (e.g.
chr1, chr6 and chr15), mtGWAS peaks (blue) are observed whereas no stGWAS peaks (orange; minimum p-value per SNP across all 41 stGWAS, adjusted for multiple testing) can be detected. On the other hand, there are a few loci for the stGWAS where the multi-trait analysis either does not pass the FDR threshold (e.g. on chromosome 7) or does not detect any association (e.g. on chromosome 4). For these loci, the underlying genetics seem to be trait specific to magnesium sulfate and hydroquinone, respectively (figure B.2 in the appendix). Testing with a 41 degrees of freedom test, as done in the mtGWAS, hinders the detection of these strong mono-trait associations (compare distributions in figure 1.2) and confirms previous studies showing that the single-trait model is more powerful for uncorrelated traits [Korte & al., 2012]. Both the gain in power for the multi-trait associations and the burden of a multivariate test when the underlying effect is univariate confirm the results obtained from the theoretical power analysis (section 4.7).

[Figure 5.6: manhattan plot of mtGWAS (blue) and stGWAS (orange) p-values across the yeast chromosomes.]
Figure 5.6: Manhattan plot of p-values from single-trait and multi-trait GWAS. The stGWAS p-values were adjusted for multiple testing by the effective number of tests ($M_{\text{eff}} = 33$) and only the minimum adjusted p-values across all 41 traits per SNP are shown. The threshold line is drawn at the empirical $\text{FDR}_{\text{stGWAS}} = 8.6 \times 10^{-6}$.

To quantify the increase in power, I counted the number of SNPs detected above the permutation-based thresholds for both the stGWAS and the mtGWAS. Since the number of SNPs per locus is not constant (based on LD structure in the F2 cross and genotyping parameters), I needed a locus-based rather than a SNP-based count for a fair comparison of the two methods. In order to filter SNPs based on locus, I used PLINK for LD pruning of the SNPs, choosing a strict threshold of $r^2 > 0.2$ and increasing LD window sizes ranging from 3 to 100kb.
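Since both counts rest on the adjusted single-trait p-values, it may help to see the $M_{\text{eff}}$ adjustment of equation (5.4) spelled out. The following Python sketch is a minimal illustration; the function names are hypothetical, and clipping slightly negative eigenvalues (which can arise numerically) to zero is an assumption of this example.

```python
import numpy as np


def effective_number_of_tests(corr):
    """Effective number of tests from a trait correlation matrix.

    Implements M_eff = (sum_i sqrt(lambda_i))^2 / sum_i lambda_i,
    where lambda_i are the eigenvalues of corr.
    """
    eigenvalues = np.linalg.eigvalsh(corr)
    # Guard against tiny negative eigenvalues from rounding errors
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    return np.sqrt(eigenvalues).sum() ** 2 / eigenvalues.sum()


def adjust_p_values(p_values, m_eff):
    """Multiply p-values by M_eff and cap the result at one."""
    return np.minimum(np.asarray(p_values, dtype=float) * m_eff, 1.0)


if __name__ == "__main__":
    # Uncorrelated traits: M_eff equals the number of traits
    print(effective_number_of_tests(np.eye(41)))
    # Perfectly correlated traits: M_eff collapses towards one
    print(effective_number_of_tests(np.ones((5, 5))))
```

For a phenotype matrix with traits in columns, the correlation matrix could be obtained with `np.corrcoef(pheno, rowvar=False)` before calling `effective_number_of_tests`. The two limiting cases in the usage example show why the adjustment is never more conservative than a Bonferroni correction over all traits.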
The maximal LD window of 100kb covers between 6% (chromosome 4) and 43% (chromosome 1) of total chromosome length (S. cerevisiae R64-1-1, Ensembl release 90, [Aken & al., 2016]). Table 5.1 shows that the increase in power is present from narrow to broad LD pruning, with on average 29% more loci in mtGWAS.

Table 5.1: Comparison of loci detected in single-trait and multi-trait GWAS. In the column "All SNPs", the absolute number of SNPs beyond the FDR threshold for multi-trait and single-trait GWAS as well as their ratio (multi-trait:single-trait) are depicted. In order to limit the potential bias in the counting of the loci, introduced by different degrees of LD for different loci, the genome-wide SNPs were LD pruned and the ratio of associated SNPs determined for five different LD window sizes.

                                    LD pruned with r^2 ≥ 0.2
                         All SNPs   3kb     10kb    30kb   50kb   100kb
NrSNPs                   11,623     4,105   1,028   264    161    107
multitrait               1,132      384     101     24     15     9
singletrait              695        275     72      20     13     7
multitrait:singletrait   1.63       1.4     1.4     1.2    1.15   1.29

5.2.3. Multi-trait effect size estimates as indicators for common biology

As well as providing an increase in power, the mtGWAS inherently provides effect size estimates across all phenotypes for a particular locus, allowing for a richer exploration of the pleiotropic effects of each locus. To analyse the relationship between traits and SNPs based on their effect size estimates, I filtered the genome-wide SNPs for SNPs that fell within a gene body and pruned these 8,135 SNPs for SNPs in LD with $r^2 > 0.2$ and within a 3kb window (1,412 SNPs). Lastly, I filtered for SNPs passing the $FDR = 10^{-5}$, yielding 210 SNPs across 15 out of the 16 yeast chromosomes. Chromosome 5 is the only chromosome without associated SNPs in the single-trait and multi-trait GWAS (figure 5.6). To find groups of SNPs and traits with similar effect size estimates, I clustered the effect size estimates of these SNPs both across traits and SNPs (figure 5.7).
I used the hierarchical clustering algorithm pvclust, which provides bootstrap-based p-values as a measure of the stability of a given cluster [Suzuki & Shimodaira, 2006]. The clustering was based on the Pearson correlation coefficients of the effect size estimates, with 50,000 bootstraps for traits and 10,000 for SNPs. Clusters with $p < 0.05$ were considered stable.

A heatmap of effect size estimates and the clustering results is depicted in figure 5.7. Ignoring the clustering for a first impression of the results, one can clearly see that most SNPs have non-zero effects in more than one trait (figure 5.7, strong signals across columns). Furthermore, some traits have contributions from across the genome, many of which are xenobiotic growth conditions, e.g. zeocin [Krol & al., 2015] and neomycin [Foiani & al., 1991]. Turning to the clustering, figure 5.7 (dendrograms) shows that the clusters are driven by specific combinations of loci and traits, and would be hard to achieve from a single-trait analysis.

There are a number of stable clusters of traits (figure 5.7, blue branches in the row dendrogram), including classically linked carbon metabolism sources (lactose, lactate and ethanol), and other clusters for which there is literature support. For example, expression of genes involved in DNA replication has been shown to change upon treatment with hydroxyurea and 4-nitroquinoline-1-oxide (x4NQO) [Elledge & Davis, 1990], two substances that are linked in this analysis by forming a stable cluster. A study demonstrating that trehalose and sorbitol have synergistic effects on viability in yeast [Hua & al., 2015] suggests a biological link between these two sugars, which also form a cluster. For other clusters, such as SDS and hydroxybenzaldehyde or magnesium sulfate and berbamine, I was unable to find literature support. However, these could serve as candidate clusters for further investigation of growth phenotypes.
I discovered 31 stable SNP clusters (figure 5.7, blue branches in the column dendrogram), many of which represent linked loci. However, there are nine clusters (figure 5.7, grey boxes) spanning multiple chromosomes, and many clusters linking disjoint regions across a chromosome. Some SNP clusters have suggestive common annotation, such as cluster a, which has two members of the nuclear pore complex (NUP1, NUP188), and cluster b, which has a common set of vesicle-associated genes (ATG5, PXA1, VPS41; figure 5.7, labelled boxes). The small size of the clusters prevented any systematic gene-ontology-based enrichment. Nevertheless, the ability to explore clusters of both traits and genetic loci demonstrates the utility of mtGWAS for hypothesis generation.

5.3. Summary

A particular benefit of LMMs is that complex genetic relationships can be modelled, which is useful in structured populations such as this $F_2$ cross in yeast. In univariate LMMs, the kinship information is used to account for background genetic effects in associations with a single trait. When used in mvLMMs, the kinship structure allows for the estimation of complex trait covariance structure. However, it is only through a combination of appropriate phenotype imputation and a method like LiMMBo that all 41 growth traits can be mapped together efficiently in order to investigate pleiotropic effects on a genome-wide level. I demonstrated that such a multivariate analysis through LiMMBo is more powerful in detecting genetic associations in a real dataset than univariate tests. While the focus of this chapter was to demonstrate the applicability and power of LiMMBo, it also highlighted the potential of multivariate analysis for gaining insights into the underlying biology of pleiotropic loci. The effect sizes estimated by the multivariate LMM provide the relevant data to study shared pathways and regulation and can help to generate hypotheses for future research.
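As a quick arithmetic check of the power comparison summarised above, the multi-trait to single-trait ratios in table 5.1 can be recomputed from the counts listed there. The sketch below is only an illustration in plain Python (not the software used for the analysis); the variable names are hypothetical, and it assumes that the reported average of 29% is taken over the five LD-pruned window sizes.

```python
# Counts of SNPs passing the FDR threshold, copied from table 5.1.
windows = ["All SNPs", "3kb", "10kb", "30kb", "50kb", "100kb"]
multi_trait = [1132, 384, 101, 24, 15, 9]
single_trait = [695, 275, 72, 20, 13, 7]

# Ratio multi-trait:single-trait per column of the table.
ratios = {w: m / s for w, m, s in zip(windows, multi_trait, single_trait)}

# Assumption: the average increase refers to the five LD-pruned windows only.
pruned = [ratios[w] for w in windows[1:]]
mean_pruned_ratio = sum(pruned) / len(pruned)

for w in windows:
    print(f"{w:>8}: {ratios[w]:.2f}")
print(f"mean over LD-pruned windows: {mean_pruned_ratio:.2f}")
```

Under this assumption, the mean ratio over the pruned windows is consistent with the stated average increase in detected loci.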
Figure 5.7: Hierarchical clustering of mtGWAS effect size estimates. Effect size estimates of LD pruned (3kb window, $r^2 > 0.2$), trait-associated SNPs located within a gene body were clustered by loci and traits (both hierarchical, average-linkage clustering of Pearson correlation coefficients). Stable clusters (pvclust $p < 0.05$) are marked in blue. Grey boxes indicate stable SNP clusters spread across at least two chromosomes. a and b label two clusters for which suggestive common annotation was found; for details see text.

6 Low-dimensional representations of very high-dimensional data

In chapter 4 and chapter 5, I developed and applied methods for the multivariate analyses of hundreds of traits. When evaluating the suitability of LiMMBo on the simulated datasets, I considered each trait as a separate but correlated measure and used all traits for the multi-trait genotype to phenotype mapping. I used the same strategy for the application of LiMMBo to the growth traits of yeast. However, as described in chapter 3, the simulated phenotypes are generated by adding different phenotype components, and one could argue that, depending on the analysis, it might prove useful to extract relevant features representing different phenotype components prior to the multivariate analyses across all traits. For instance, given very large numbers of measurements or traits, feature extraction will reduce the number of traits and therefore the degrees of freedom for the multivariate analyses. In the following chapter, I will describe different methods to achieve the feature extraction by dimensionality reduction approaches. I will present two case studies that show how these approaches can be used for visualisation of high-dimensional data. Beyond that, I will demonstrate in simulations that they can not only be used for visualisation but also as valid proxy phenotypes for genetic association studies.
These simulation results form the basis for the genetic association study on 3D heart phenotypes described in chapter 7.

In biological and medical research, samples are often phenotyped for more than one trait. These traits can either be different attributes of the same underlying phenotype or more independent features. In the former case, multiple phenotypes can be related measurements such as length, width and circumference of plant leaves or measurements commonly regarded as covariates such as sample height and weight. For image-based and molecular phenotyping methods, the measured traits can be a mixture of independent features and attributes of the same phenotype. For instance, in computed tomography scans, functional MRI or high-resolution microscopy, each pixel or voxel can be considered a different measurement. Groups of these describe different morphologies (features) and pixels/voxels within each group can be considered attributes of that feature. In molecular phenotyping such as gene expression or metabolite profiling, several hundred or thousand measurements are collected simultaneously. Here, the classification into features and attributes is more difficult, considering the complex structure of gene expression networks and gene regulation.

In many of these cases, neither the number of attributes nor the number of independent features is known. However, when analysing these large datasets, one is often interested in extracting meaningful variables from the data or compressing the data into a more tractable number of variables. These approaches rely on the assumption that the lower number of variables is a good representation of the true complexity of the dataset. In other words, one assumes that the high-dimensional datasets occupy an intrinsically lower-dimensional space (manifold) which is embedded in the observed, high-dimensional space.
Low-dimensional representations of gene expression measurements might reflect common pathways or transcriptional profiles, and image-derived phenotypes could reflect organ shape variation, disease status or functional MRI activity scores. For a high-dimensional dataset $\mathbf{X}$ with $N$ samples and $P$ dimensions (traits), dimensionality reduction techniques aim i) to provide a meaningful low-dimensional representation $\mathbf{Z}$ of $K$ dimensions while only losing minor amounts of information:

$$\mathbf{X} \in \mathcal{R}^{N, P} \xrightarrow{\text{DimReduction}} \mathbf{Z} \in \mathcal{R}^{N, K}, \tag{6.1}$$

ii) to use only a small number of free parameters and iii) to preserve the quantities of interest in the data. Depending on the algorithm employed, these might be local proximity or global structure.

There are a variety of approaches for dimensionality reduction with different underlying mathematical concepts and parameters, and choosing the most appropriate method for a given dataset is not trivial. Fundamentally, the problem is finding an objective criterion of what a good dimensionality reduction method is.

In the following, I will first present a brief review of current dimensionality reduction methods. I will use these methods to demonstrate the application of dimensionality reduction for visualisation on small datasets with known structure. I will compare the visual results to two published criteria for measuring the quality of dimensionality reduction in terms of neighbourhood similarities in the low- and high-dimensional space. Then, I will describe the results of the different dimensionality reduction techniques on simulated high-dimensional datasets and propose an additional stability criterion which aids in choosing the dimensionality of the lower-dimensional phenotype space. Finally, I show that low-dimensional representations of the phenotypes can capture underlying genetic structure. The methods and criteria used in this chapter will be applied to clinically relevant high-dimensional heart morphology data in chapter 7.
6.1. Review of dimensionality reduction methods

The earliest dimensionality reduction techniques were two linear methods based on spectral decomposition: PCA and classical multi-dimensional scaling (MDS). The general concept of PCA was described by Pearson in 1901 [Pearson, 1901]. Hotelling [1933] was the first to describe it as a method for dimensionality reduction. In PCA, the components of the new phenotype representation are the PCs, which are the eigenvectors $\mathbf{W}$ of the empirical covariance matrix $\mathbf{C}$: $\mathbf{C} = \mathbf{X}\mathbf{X}^T = \mathbf{W}\boldsymbol{\Lambda}\mathbf{W}^T$. The eigenvalues in the diagonal matrix $\boldsymbol{\Lambda}$ corresponding to the PCs are equivalent to the variance explained by their components. The transformation of the phenotype data into PCs leads to a projection where the highest amount of phenotypic variance explained lies in the first component, the second highest variance in the second component and so on. The dimensionality reduction is achieved by using the first $K$ PCs until the cumulative sum of the eigenvalues reaches a predefined threshold of total phenotypic variance that should be retained: $\mathbf{Z} = \mathbf{W}_1, \ldots, \mathbf{W}_K$.

MDS was introduced by Gower [1966], motivated by his dissatisfaction with the overuse of PCA in biology. MDS is based on the spectral decomposition of a dissimilarity matrix $\mathbf{D}$ between the samples in $\mathbf{X}$. Classical MDS finds the low-dimensional representation $\mathbf{Z}$ whose pairwise distances match the dissimilarities $d_{ij}$ of the original data: $d_{ij} \approx \hat{d}_{ij} = \|\mathbf{z}_i - \mathbf{z}_j\|$. $\mathbf{Z}$ can be found by the eigendecomposition of the squared dissimilarity matrix $\mathbf{D}^2 = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T$, where $\mathbf{Z} = \boldsymbol{\Lambda}^{\frac{1}{2}}\mathbf{V}^T$. As in PCA, $\mathbf{Z}$ will be ordered with the components explaining most variance ranked first and dimensionality reduction can be achieved by selecting the first $K$ vectors [Gower, 1966]. MDS finds an embedding that preserves the inter-point distances and is equivalent to PCA when those distances are Euclidean.

Several decades after the introduction of PCA as a means of linear dimensionality reduction, Schoelkopf & al.
[1998] proposed its non-linear extension based on the transformation of $\mathbf{X}$ into a feature space $\mathbf{F}$ via the mapping function $\Phi$. Instead of finding the eigendecomposition of the covariance matrix of $\mathbf{X}$, the aim is the diagonalisation of the covariance $\mathbf{K}$ of the features of the data $\Phi(\mathbf{X})$: $\mathbf{K} = \Phi(\mathbf{X})\Phi(\mathbf{X})^T$. Using the kernel representation $k(\mathbf{x}_i, \mathbf{x}_j) = (\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j))$ allows the dot products in $\mathbf{F}$ to be computed without having to carry out the map $\Phi$ explicitly. This technique is commonly referred to as the kernel trick and yields the feature covariance matrix $\mathbf{K}_{ij} = (k(\mathbf{x}_i, \mathbf{x}_j))_{ij}$. The normalised eigenvectors of $\mathbf{K}$ are used to extract the PCs. This kernel principal component analysis (kPCA) approach allows for non-linear feature extraction, whilst the possibility to select different kernels (e.g. Gaussian or sigmoid) makes it applicable to a wide range of cases in which non-linearity is assumed [Schoelkopf & al., 1998].

PCA, kPCA and MDS form the basis for many other dimensionality reduction techniques. Notably, Ham and colleagues show that the class of kernel-eigenmap-based dimensionality reduction methods, such as Isomap, Locally linear embedding (LLE) and Laplacian Eigenmaps, can be understood as variants of kPCA with different kernel matrices. Methods of this class are described in detail later, but common to all these methods is the aim to obtain a global representation of the data $\mathbf{X}$ by using information about local interactions between the data points in $\mathbf{X}$. The data points are represented as the nodes of a symmetric graph, whose kernel function $k$ describes a local geometry of $\mathbf{X}$. The graph specified by $(\mathbf{X}, k)$ is used to construct a square matrix $\mathbf{M}$, which describes the transitions on the graph as a Markov chain. Using this Markov matrix $\mathbf{M}$, one can map the data into a lower-dimensional Euclidean space. The difference between the algorithms lies in the definition of the neighbourhood structure and the means to find a global embedding.
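The two spectral methods introduced at the start of this section, and their equivalence for Euclidean distances, can be illustrated with a minimal sketch. It is written in Python with NumPy purely for illustration and is not the implementation used in this thesis. It makes assumptions that go beyond the text: the data are column-centred, the covariance is computed trait-by-trait ($P \times P$) and scaled by $N - 1$, and the squared distance matrix is double-centred before the eigendecomposition. The function names `pca_scores` and `classical_mds` and the toy data are hypothetical.

```python
import numpy as np

def pca_scores(X, k):
    """Project column-centred data onto the first k principal components."""
    Xc = X - X.mean(axis=0)                    # centre each trait (column)
    C = Xc.T @ Xc / (Xc.shape[0] - 1)          # P x P empirical covariance
    eigval, eigvec = np.linalg.eigh(C)         # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:k]       # indices of the k largest
    return Xc @ eigvec[:, order], eigval[order]

def classical_mds(X, k):
    """Classical MDS on Euclidean distances (assumes double centring)."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared Euclidean distances
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    B = -0.5 * J @ D2 @ J                          # Gram matrix of centred data
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:k]
    # scale eigenvectors by the square root of their (non-negative) eigenvalues
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # toy correlated traits
Z_pca, var_explained = pca_scores(X, 2)
Z_mds = classical_mds(X, 2)
```

In this toy setting, the two embeddings agree up to the sign of each component, which corresponds to the mirror-image relationship between PCA and MDS plots on Euclidean distances.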
Table 6.1 summarises these and other commonly used linear and non-linear techniques and the list below gives a short summary of the mathematical principles.

Table 6.1: Dimensionality reduction methods. The different dimensionality reduction techniques can distinctly be classified into linear and non-linear types. The method column broadly groups techniques based on their main mathematical concept and the parameters column gives the number of parameters that need to be specified for the mathematical model.

| Type | Method | Name | Parameters |
|---|---|---|---|
| linear | spectral | PCA | 0 |
| linear | spectral | MDS | 0 |
| linear | Factor analysis | PEER | 1 |
| linear | Generative model | ICA | 2 |
| non-linear | rank-based | nMDS | 2 |
| non-linear | PCA-based | DRR | >1 |
| non-linear | spectral | kPCA | 0 |
| non-linear | Kernel eigenmap | Isomap | 1 |
| non-linear | Kernel eigenmap | LLE | 1 |
| non-linear | Kernel eigenmap | Laplacian Eigenmaps | 2 |
| non-linear | Kernel eigenmap | DiffusionMaps | >2 |
| non-linear | Probability distributions | tSNE | 4 |

1. Probabilistic estimation of expression residuals: PEER implements factor analysis methods to estimate variance components in $\mathbf{X}$. The model assumes additive effects from independent sources that influence $\mathbf{X}$ and aims at estimating these effects in a joint Bayesian inference model. By specifying the only source of variation to be due to unknown effects, PEER can be used to extract latent variables from high-dimensional datasets, where the latent variables are modelled based on a standard normal distribution and are initialised based on PCA of $\mathbf{X}$. The model specifications are complex and the interested reader is referred to the paper describing the details of the methodology [Stegle & al., 2010]. In the simplest scenario, only the maximum number of unobserved latent factors, i.e. the column dimensionality $K$ of $\mathbf{Z}$, has to be specified [Stegle & al., 2012].
2. Independent Component Analysis: ICA belongs to the class of generative models, which describe how the data $\mathbf{X}$ could have been generated by a process of mixing independent components $\mathbf{S}$ according to a mixing scheme $\mathbf{A}$: $\mathbf{X} = \mathbf{S}\mathbf{A}$. Both the independent components and the mixing matrix are unknown.
The key to "un-mixing" the signals is the underlying assumption that the latent components are independent and have a non-Gaussian distribution. In order to find $\mathbf{A}$ and $\mathbf{S}$, ICA finds the un-mixing matrix $\mathbf{U}$ which maximises the non-Gaussianity of $\mathbf{S}$: $\mathbf{X}\mathbf{U} = \mathbf{S}$. Non-Gaussianity can be quantified by approximating the negentropy of $\mathbf{S}$, i.e. the difference between the entropy of a Gaussian random variable with the same covariance matrix as $\mathbf{S}$ and the entropy of $\mathbf{S}$ itself. The parameters to be specified are the threshold for the tolerance at which the un-mixing matrix is considered to have converged and the number of components to be modelled. ICA was first described by Herault & Jutten [1986] and has seen many implementations for finding the maximum of the non-Gaussianity (reviewed in [Comon, 1994]), including FastICA [Hyvärinen & Oja, 2000]. ICA often includes a pre-processing step to make the columns of $\mathbf{X}$ uncorrelated and scale their variances to unity. This process is termed "whitening" and is achieved through PCA of $\mathbf{X}$.
3. Non-metric MDS: Extensions of the classical MDS described above relax the matching criterion of dissimilarities and distances to finding the closest match of a monotonic function of the distances to the dissimilarities: $f(d_{ij}) \approx \hat{d}_{ij} = \|\mathbf{z}_i - \mathbf{z}_j\|$. The closest match is determined by minimising a stress function [Kruskal, 1964a; Kruskal, 1964b]. In the non-metric version of these extensions, $f(d_{ij})$ simply considers the rank order of the input dissimilarities such that the rank order agreement between the distances and the dissimilarities is maximised [Minchin, 1987]. The parameters to be specified are the threshold for minimum stress at which the distances and dissimilarities are considered to have converged and the number of components to model.
4. Dimensionality reduction via regression: DRR is a PCA-based regression technique which aims to remove redundant information present in the PCs $\mathbf{W}$ of $\mathbf{X}$.
While standard PCA yields decorrelated dimensions, complete independence of its components is only certain if the high-dimensional data has a Gaussian probability density function [Laparra & al., 2015]. The main idea in DRR is to remove the redundant information contained in partially dependent components and only keep the remaining, non-predictable information in the low-dimensional representation. The removal of the redundant information is achieved in a step-wise manner by starting at the lowest variance component (i.e. smallest eigenvalue) and using it as the response variable for a multivariate non-linear regression function $f$ with all higher variance components as predictors. This process is repeated for each PC until the component with the second highest eigenvalue is reached and all redundant information has been regressed out. Formally, this iterative prediction scheme can be described as $z_i = \mathbf{w}_i - f_i(\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{i-1})$, where $z_i$ is the non-predictable information. As in PCA, the first components account for the highest variance. The number of parameters depends on the function $f$ specified for the non-linear regression. The standard method described in the original paper uses kernel ridge regression with a Gaussian kernel function, i.e. one free parameter for the bandwidth of the kernel [Laparra & al., 2015].
5. Isomap: Isomap builds on classical MDS for the dimensionality reduction and kernel eigenmaps to find the required dissimilarity matrix of $\mathbf{X}$. The dissimilarities are defined as the geodesic manifold distances between all pairs of data points. Isomap constructs a graph of all data points and sets the edge length between neighbouring points to the geodesic distance. For data points in proximity (based on $n$ nearest neighbours or a threshold on the distance measure), the Euclidean distance in input space serves as a good approximation.
The geodesic distance for points outside the proximity criterion is approximated by adding up a sequence of "short hops" between neighbouring points. The shortest distances between points of the graph are a measure for the dissimilarity between data points and serve as the input data for classical MDS [Tenenbaum & al., 2000]. The proximity threshold is the parameter to specify.
6. Local linear embedding: LLE uses kernel eigenmaps based on the local structure in the data to recover the non-linear global data structure. It assumes that any data point in $\mathbf{X}$ lies on a close to linear patch with its neighbours and can be reconstructed through linear recombination of these neighbours. The linear recombination is described in the weight matrix $\mathbf{H}$. The objective of the algorithm is to find the $\mathbf{H}$ which minimises the reconstruction error between all data points and their reconstructions. Based on the optimised $\mathbf{H}$, the data points $\mathbf{X}$ can be transformed into the lower-dimensional space $\mathbf{Z}$ by solving the eigendecomposition of $(\mathbf{I}_N - \mathbf{H})^T(\mathbf{I}_N - \mathbf{H})$ [Roweis & Saul, 2000]. LLE requires the specification of the local neighbourhood size $n$.
7. Laplacian Eigenmaps: Laplacian Eigenmaps are based on an adjacency graph representing $\mathbf{X}$. For adjacent data points (proximity measures as in Isomap), the edges of the graph are weighted based on a heat kernel of the Euclidean distance: $\mathbf{H}_{i,j} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{n}\right)$. Edge weights for points that do not fall within the proximity threshold are set to zero. Based on the weight matrix $\mathbf{H}$, a diagonal matrix $\mathbf{D}$ is constructed by $\mathbf{D}_{i,i} = \sum_j \mathbf{H}_{i,j}$ and the positive semi-definite Laplacian matrix $\mathbf{L}$ is computed as $\mathbf{L} = \mathbf{D} - \mathbf{H}$. Solving the generalised eigenvalue problem $\mathbf{L}\mathbf{V} = \lambda\mathbf{D}\mathbf{V}$ and selecting the first $K$ eigenvectors $\mathbf{V}$ yields the $K$-dimensional embedding of $\mathbf{X}$ in $\mathbf{Z}$ [Belkin & Niyogi, 2003]. For dimensionality reduction via Laplacian Eigenmaps, the threshold for the proximity criterion and $n$, the free parameter in the heat kernel, have to be specified.
Large values of $n$ give less weight to differences in distance, with $n = \infty$ setting all non-zero edge weights to one.
8. DiffusionMaps: As for all kernel eigenmap methods, DiffusionMaps first constructs a graph representation of $\mathbf{X}$, which is turned into the Markov matrix $\mathbf{M}$ used for the low-dimensional embedding. The lengths of the edges between points on the graph are computed by a kernel $k(\mathbf{x}_i, \mathbf{x}_j)$ normalised to the local connectivity of the graph, and as such capture the local geometry in the data. This normalised kernel can be interpreted as the transition kernel of $\mathbf{M}$, representing the transition probability from point $\mathbf{x}_i$ to $\mathbf{x}_j$ in one time step. Based on the eigenvalues and eigenvectors of $\mathbf{M}$, diffusion distances and maps between the data points can be computed. These are in turn used to map the data into a Euclidean space, where the distance describes the relationship between data points in terms of their connectivity. The dimensionality of the re-mapped data depends on the number of eigenvectors used for the embedding into Euclidean space. These are chosen based on the number of transitions $t$ on $\mathbf{M}$ and an accuracy term $\epsilon$, which specify the maximum eigenvalue considered informative in the mapping [Coifman & al., 2005; Coifman & Lafon, 2006]. Depending on the kernel function, additional parameters might have to be specified. Typical kernel functions are the Gaussian kernel and heat kernels.
9. t-Distributed stochastic neighbour embedding: In tSNE, the Euclidean distances of the data points in $\mathbf{X}$ are converted into joint probabilities $p_{ij}$. Similarly, for a low-dimensional representation $\mathbf{Z}$ of $\mathbf{X}$, the distances $\|\mathbf{z}_i - \mathbf{z}_j\|^2$ are converted into the joint probabilities $q_{ij}$. The objective of tSNE is to find the configuration of $\mathbf{Z}$ which minimises the Kullback-Leibler divergence $KL$ between the probability distributions $P$ and $Q$: $KL(P\|Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$.
$KL$ in general is a measure of how much one probability distribution diverges from another [Kullback & Leibler, 1951] and serves in tSNE as the criterion for finding a good low-dimensional representation. The mapping of similarities to probabilities in the low-dimensional space is based on a Student t-distribution with one degree of freedom, whereas the similarities in the high-dimensional space are converted using a Gaussian distribution. Depending on the data density around each $\mathbf{x}_i$, the standard deviation is adjusted for each Gaussian $p_i$ based on the specified perplexity, a smooth measure of the effective number of neighbours. In addition, parameters for the gradient descent function used to find the minimum $KL$ have to be specified: the number of iterations, the learning rate and the momentum. For details of these parameters refer to [Maaten & Hinton, 2008].

Despite the diversity of the dimensionality reduction techniques, there are a number of underlying features which define common properties and can give an indication of their applicability. Methods directly based on PCA (PCA, MDS and DRR) are easy to apply and the extracted features are interpretable (directions of variance). While PCA and MDS mainly work well for linear manifolds, DRR extends the applicability to non-linear manifolds. The ability to learn non-linear manifold structures in the data is also shared by the kernel eigenmap methods, nMDS and DRR [Coifman & Lafon, 2006]. However, non-linear models introduce a number of free parameters, whose choice requires prior assumptions about the manifold characteristics. Dimensionality reduction via kernel eigenmaps and tSNE depends on the assumption that distances between points far apart in the global space do not contain information and need not be preserved. Hence, these techniques are simply based on local neighbourhoods and preserve these in the low-dimensional space.
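To make the reliance on local neighbourhoods concrete, the Laplacian Eigenmaps steps from the list above (item 7) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used in this thesis: adjacency is defined by a symmetrised nearest-neighbour rule, the generalised eigenproblem is solved through the symmetric normalised Laplacian, and the trivial constant eigenvector with eigenvalue zero is skipped. The function name `laplacian_eigenmaps`, the argument `heat` (the free heat-kernel parameter called $n$ in the text) and the toy data are hypothetical.

```python
import numpy as np

def laplacian_eigenmaps(X, k_dim=2, n_neighbours=10, heat=1.0):
    """Embed X into k_dim dimensions via a heat-kernel weighted neighbour graph."""
    N = X.shape[0]
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)  # squared distances
    np.fill_diagonal(d2, 0.0)

    # Adjacency (assumption): i and j are adjacent if either one is among
    # the other's n_neighbours nearest neighbours; column 0 is the point itself.
    idx = np.argsort(d2, axis=1)[:, 1:n_neighbours + 1]
    A = np.zeros((N, N), dtype=bool)
    A[np.repeat(np.arange(N), n_neighbours), idx.ravel()] = True
    A = A | A.T

    H = np.where(A, np.exp(-d2 / heat), 0.0)   # heat-kernel weights, zero if not adjacent
    D = H.sum(axis=1)                          # diagonal of the degree matrix
    L = np.diag(D) - H                         # graph Laplacian L = D - H

    # L v = lambda D v is equivalent to a symmetric problem in u = D^{1/2} v.
    d_inv_sqrt = 1.0 / np.sqrt(D)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigval, eigvec = np.linalg.eigh(L_sym)     # ascending eigenvalues
    V = d_inv_sqrt[:, None] * eigvec           # back-transform to generalised eigenvectors

    # Skip the first (constant) eigenvector and keep the next k_dim.
    return V[:, 1:k_dim + 1], eigval[1:k_dim + 1]

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 3))               # hypothetical toy data
Z_le, lam = laplacian_eigenmaps(X_toy, k_dim=2, n_neighbours=10, heat=1.0)
```

Only distances between adjacent points enter the weight matrix, so distances between far-apart points have no direct influence on the embedding.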
This in turn requires dense data points in the low-dimensional space for these strategies to be a good estimation.

There are two main purposes for dimensionality reduction: visualisation and feature selection. For visualisation, $K$ is commonly chosen in a range from one to three such that the data can be presented in a one-, two- or three-dimensional graphic. The choice of dimensionality for feature selection is less trivial, as the dimension of the low-dimensional manifold is unknown. In general, choosing the dimensionality is easiest for PCA and PCA-based methods, where the principal components that cumulatively explain a certain fraction of the variance in the data define the dimensionality. For other methods, the task is less straightforward and different strategies have to be developed.

In the next section, I will show the results of applying the techniques described above for the visualisation of two small datasets with known structure.

6.2. Visualisation of data structures by dimensionality reduction

In high-dimensional data analysis, one is often interested in finding a clear visualisation of the data, which leads to a minimal loss of information and is capable of summarising underlying data structures. Data structures can be of biological origin, representing features of interest like cell populations or tissue types, or of technical origin such as batch effects. In high-dimensional datasets, visualisation requires either a prior selection of the dimensions of the original data to be displayed or the reduction to a dimension that can be represented. Common choices of dimensionality reduction for this task are PCA or tSNE [Deng & al., 2014; Crowley & al., 2015; Corces & al., 2016; Martinez-Jimenez & al., 2017; Huisman & al., 2017].

To understand how the visualisation via dimensionality reduction depends on the underlying dataset, and to see how the true dimensionality of the data is reflected in the visualisation, I needed datasets with known properties.
As outlined above, one high-level classification of the dimensionality reduction methods is their grouping into linear and non-linear methods. To understand the relationship between input data and linearity of the dimensionality reduction methods, I selected one dataset with approximately linear structure and created a second dataset with non-linear properties. The datasets are described below and their properties depicted in figure 6.1 and figure 6.3. To allow for an easier comparison of the input data and their visualisation, these figures are located with the figures for the low-dimensional visualisation further down in the document.

The first, linear dataset is a commonly used sample dataset for statistical functions in R (and is distributed with the R software) and consists of 150 samples of three Iris species (I. setosa, I. versicolor, I. virginica) for which four phenotypes were measured: sepal width, sepal length, petal width and petal length. In order to get an understanding of the phenotype structure, I computed the pair-wise Pearson correlation coefficient across the three species and across the four phenotypes (one sample appears twice in the dataset and was removed for subsequent analyses). The strongest correlation on species level is observed for I. virginica and I. versicolor ($r = 0.9$). On phenotype level, petal length and width correlate strongly across species ($r = 0.96$, figure 6.1).

For the second, non-linear dataset, I simulated 2,000 data points uniformly distributed on a (x,y)-plane and transformed the plane into (x,y,z) coordinates by $z = x \sin(x)$ and $x' = x \cos(x)$, where $x'$ denotes the transformed x-coordinate. The resulting "roll" structure is depicted in figure 6.3.

These datasets represent two distinct types of data: the Iris data is a four-dimensional dataset comprised of three subgroups, whereas the roll data is a two-dimensional manifold non-linearly embedded in a three-dimensional space.
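The construction of the roll dataset described above can be sketched in a few lines. The sketch below uses Python with NumPy for illustration only (the analyses in this chapter were carried out in R). The sampling ranges of the plane are not stated in the text, so the ranges `x_range` and `y_range`, the seed and the function name `simulate_roll` are hypothetical choices.

```python
import numpy as np

def simulate_roll(n_points=2000,
                  x_range=(1.5 * np.pi, 4.5 * np.pi),   # assumed range, not from the text
                  y_range=(0.0, 10.0),                  # assumed range, not from the text
                  seed=0):
    """Sample points uniformly on an (x, y)-plane and roll the plane up in 3D."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(*x_range, size=n_points)
    y = rng.uniform(*y_range, size=n_points)
    # x' = x cos(x), y unchanged, z = x sin(x)
    roll = np.column_stack([x * np.cos(x), y, x * np.sin(x)])
    plane = np.column_stack([x, y])
    return plane, roll

plane, roll = simulate_roll()
```

Because the original x-coordinate equals the radius of each point in the (x', z)-plane, it can be used to colour the points and to judge whether a two-dimensional embedding recovers the original plane.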
In the following, I applied the twelve dimensionality reduction techniques described above to both datasets and compared their low-dimensional representations.

For each technique, I used corresponding functions already implemented in publicly available R packages. Table 6.2 summarises the R packages, functions and their parameters used for the dimensionality reduction. Most functions require specification of the expected number of dimensions $n_{dim}$. For the purpose of visualisation in a Cartesian coordinate system, this parameter choice is straightforward (one, two or three) and was set to $n_{dim} = 2$. In the case of kernel eigenmap methods and tSNE, the number of $n$ nearest neighbours used in the graph construction and probability function has to be provided. This task is less intuitive and different algorithms have been implemented to estimate the optimal number of neighbours for the reconstruction. Choosing a suitable $n$ is important, as neighbourhoods chosen too large might eliminate fine structures in the data, while neighbourhoods chosen too small can lead to the division of the continuous input space into smaller, unconnected sub-manifolds [Kayo, 2006]. For any method that required specification of $n$, I provided $n$ estimated according to the method proposed by Kayo [2006], implemented as the function calc_k in the lle package. Some methods require additional, specific parameters. These are either specified in table 6.2 or the default setting was chosen. For functions that required a distance matrix or metric for the local neighbourhood estimation (MDS, DiffusionMap, Isomap, nMDS), the default is the Euclidean distance. Methods that require a kernel function (DRR, kPCA) use a Gaussian radial basis kernel by default. For ICA and DRR, I chose the default setting of the PCA pre-processing step. For PEER, the functions are implemented in an object-oriented manner and I followed the protocol described in Stegle & al. [2012].
I chose to include the optional parameter of adjusting for the mean.

Table 6.2: R functions for dimensionality reduction methods and their parameters. Most functions require a priori specification of the number of $n$ nearest neighbours and the expected intrinsic dimensionality $n_{dim}$. Any function-specific parameters different to the default settings are listed. The reference column specifies the publications the R packages are based on. LE: Laplacian Eigenmaps, DM: Diffusion Maps.

| Name | R function | Parameters | Reference |
|---|---|---|---|
| PCA | stats::prcomp | - | [Hotelling, 1933] |
| PEER | peer | ndim | [Stegle & al., 2010] |
| ICA | fastICA::fastICA | ndim, fun=logcosh, method="C" | [Hyvärinen & Oja, 2000] |
| MDS | stats::cmdscale | ndim | [Gower, 1966] |
| nMDS | vegan::metaMDS | ndim | [Ripley, 1996] |
| DRR | DRR::drr | - | [Laparra & al., 2015] |
| kPCA | kernlab::kpca | - | [Schoelkopf & al., 1998] |
| Isomap | vegan::isomap | ndim, k, fragmentedOK=TRUE | [Tenenbaum & al., 2000] |
| LLE | lle::lle | ndim, k | [de Ridder & Duin, 2002] |
| LE | loe::LOE | ndim, k | [Belkin & Niyogi, 2003] |
| DM | diffusionMap::diffuse | k | [Lafon & Lee, 2006] |
| tSNE | Rtsne::Rtsne | ndim, k | [Maaten & Hinton, 2008] |

Before applying dimensionality reduction functions to both datasets, I estimated the optimal number of neighbours for the dimensionality reduction techniques based on local neighbourhoods. For the Iris data with 596 data points (149 samples with four phenotypes each), the optimal number of neighbours is estimated to be $n = 26$. For the roll data with 2,000 data points it was estimated to be $n = 36$.

Figure 6.2 shows the two-dimensional representation of the Iris data after dimensionality reduction by the four linear and eight non-linear dimensionality reduction techniques. Assuming that the allocation to species is the correct intrinsic low-dimensional representation of the Iris dataset, I coloured the data points according to species to enable the visual comparison of the goodness of the dimensionality reduction. PCA, i.e. the representation of the data based on the direction of highest variation in the data, is able to clearly separate I. setosa from I. versicolor and I. virginica across the first principal component. However, the separation of the strongly correlated I. versicolor and I. virginica species based on the first two principal components alone is not possible. MDS with Euclidean distance is equivalent to PCA and the resulting MDS plot is a mirror image of the PCA result on the x-axis. ICA for this dataset shows the strong influence of the pre-processing via PCA, as it is the mirror image of the PCA result on the x- and y-axis. PEER is capable of separating I. setosa from the other species, but similarly fails at completely separating I. versicolor and I. virginica. Visually, the best results among the non-linear methods are obtained from DRR, Isomap and nMDS, which perform similarly to the linear methods in their ability to separate the species. The other non-linear methods are able to separate I. setosa, but do worse in separating the other two species.

The results of the dimensionality reduction for the non-linear projection of the 2D manifold into 3D space demonstrate the difficulty of the linear methods to deal with non-linear structures (figure 6.4). The colour scheme of the original embedding simply represents the location of points in the 2D plane ordered in x-direction. In a good low-dimensional representation, one should be able to observe the gradient of the original (x,y)-plane linearly across either one of the dimensions. While the general order of the points is conserved in the low-dimensional representation for the linear methods, none are able to separate them linearly in either dimension (fig-
While the general order of the points is conserved in the low-dimensional representation for the linear methods, none are able to separate them linearly in either dimension (figure 6.4A). PEER performs best in capturing the spread in y-direction compared to the other linear methods, but equally fails in separating the tight curvature in x-direction. In contrast, the non-linear method Isomap completely recovers the original 2D plane. DiffusionMap and Laplacian Eigenmaps are able to separate the structure linearly, but underestimate the spread of the original data in y-direction. LLE recovers the spread in y-direction, but fails to find the order in x-direction for the tight curvature (dark colours) in the 3D space. DRR, nMDS and tSNE suffer from the same issues as the linear methods, with DRR additionally introducing non-smoothness. kPCA recovers the plane structure for the mid-section of the roll, but scrambles the order at both ends.

[Figure 6.1, panels A and B: Pearson correlation matrices, colour scale −1 to 1. Values shown in A (I. setosa, I. versicolor, I. virginica): 0.72, 0.59, 0.9. Values shown in B (SepalLength, SepalWidth, PetalLength, PetalWidth): −0.12, 0.87, 0.82, −0.43, −0.36, 0.96.]
Figure 6.1: Correlation of flowering phenotypes. For the 149 unique samples in the Iris dataset, the pair-wise Pearson correlation for the three different Iris species across all measurements (A) and the four flowering phenotypes sepal width, sepal length, petal width and petal length across the three species (B) are depicted. The colour scheme and shapes in the upper triangle of the matrix represent the strength and direction of the correlation, the lower triangle depicts the value of the correlation. Generated via R function corrplot::corrplot.

[Figure 6.2, panel A (linear): ICA, MDS, PCA, PEER; panel B (non-linear): LaplacianEigenmap, kPCA, nMDS, tSNE, DiffusionsMaps, DRR, LLE, Isomap. Axes: Dimension 1, Dimension 2. Colour: species (setosa, versicolor, virginica).]
Figure 6.2: Visualisation of the Iris dataset in two dimensions. The number of dimensions in the Iris dataset was reduced from four to two by the dimensionality reduction techniques described in table 6.1 and computed with the functions and parameters listed in table 6.2. The number of nearest neighbours provided to the local-proximity-based methods was estimated to be 𝑛 = 26.

[Figure 6.3, panel A: scatter plot of the data points in the (x,y)-plane; panel B: 3D scatter plot with axes x, y, z.]
Figure 6.3: Three-dimensional embedding of data points lying on a two-dimensional plane. Data points uniformly distributed on a (x,y)-plane (A) are transformed into (x,y,z) coordinates by 𝑧 = 𝑥 sin(𝑥) and 𝑥 ↦ 𝑥 cos(𝑥). The colour scheme simply represents the location in x-direction of the (x,y)-plane. Generated via R function plot3D::scatter3D.

[Figure 6.4, panel A (linear): ICA, MDS, PCA, PEER; panel B (non-linear): LaplacianEigenmap, kPCA, nMDS, tSNE, DiffusionsMaps, DRR, LLE, Isomap. Axes: Dimension 1, Dimension 2.]
Figure 6.4: Visualisation of the roll dataset in two dimensions. The dimensionality reduction methods described in table 6.1 were analysed for their ability to recover the original 2D plane embedded into 3D space (figure 6.3). The 2D representation was computed with the functions and parameters listed in table 6.2, with the number of nearest neighbours provided to the local-proximity-based methods estimated to be 𝑛 = 36.

The visualisations clearly demonstrate the difference in ability of the dimensionality reduction methods to find a good low-dimensional representation of the original, known data structures. As a generalisation and unsurprisingly, linear methods perform well in separating linear data structures (Iris data) but fail in recovering non-linear structures (roll data).
Non-linear methods perform better in recovering the non-linear structure, but underperform on linear datasets compared to the linear methods.

6.3. Quantification of dimensionality reduction performance

In addition to the visualisation, it would be desirable to have a quantitative assessment of the performance of the dimensionality reduction techniques. Lee & Verleysen [2009] reviewed different methods for evaluating the quality of dimensionality reduction methods. Two criteria for the goodness of the low-dimensional representation contained in three out of the five methods reviewed are the closeness of neighbouring samples in the low-dimensional space compared to the original space (trustworthiness of the projection) and the conservation of original neighbourhoods in the low-dimensional space (continuity of the projection). Kaski and colleagues [Kaski & al., 2003] proposed two metrics quantifying the extent of trustworthiness and continuity based on the ranking of 𝑘 neighbours in the original and low-dimensional space. For trustworthiness, they define 𝑟(𝑥𝑖, 𝑥𝑗) as the rank of the distance of 𝑥𝑗 to 𝑥𝑖 in the original data space and 𝑈𝑘(𝑥𝑖) as the set of 𝑥𝑗≠𝑖 that are in the neighbourhood of 𝑥𝑖 in the low-dimensional space but not in the original space. Similarly, continuity is based on 𝑟̂(𝑥𝑖, 𝑥𝑗), the rank of the distance of 𝑥𝑗 to 𝑥𝑖 in the low-dimensional space, and 𝑉𝑘(𝑥𝑖), the set of 𝑥𝑗≠𝑖 that are in the neighbourhood of 𝑥𝑖 in the original space but not in the low-dimensional space. The trustworthiness 𝑇 and the continuity 𝐶 are defined as:

$$T = 1 - A(k) \sum_{i=1}^{N} \sum_{x_j \in U_k(x_i)} \left( r(x_i, x_j) - k \right) \tag{6.2}$$

and

$$C = 1 - A(k) \sum_{i=1}^{N} \sum_{x_j \in V_k(x_i)} \left( \hat{r}(x_i, x_j) - k \right), \tag{6.3}$$

where 𝐴(𝑘) = 2 / (𝑁𝑘(2𝑁 − 3𝑘 − 1)) is introduced as a normalising parameter scaling the values between zero and one. The projection into low-dimensional space is considered trustworthy if the 𝑘 closest neighbours of a sample in the low-dimensional space are also close in the original space.
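As an illustration of equations (6.2) and (6.3), a minimal sketch in plain Python is given below. It is not the implementation used for the analyses in this chapter. The function names (rank_matrix, trustworthiness_continuity), the use of the Euclidean distance and the tie-breaking by sample index are assumptions made for the example.

```python
import math


def rank_matrix(points):
    """ranks[i][j] is the rank of point j among the neighbours of point i
    (1 = nearest), based on Euclidean distance. Ties are broken by index,
    which is an assumption of this sketch."""
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for rank, j in enumerate(order, start=1):
            ranks[i][j] = rank
    return ranks


def _penalty(ranks_ref, ranks_other, k):
    """Sum of (rank in reference space - k) over all points that are among
    the k nearest neighbours in the other space but not in the reference."""
    n = len(ranks_ref)
    total = 0
    for i in range(n):
        for j in range(n):
            if j != i and ranks_other[i][j] <= k and ranks_ref[i][j] > k:
                total += ranks_ref[i][j] - k
    return total


def trustworthiness_continuity(original, embedded, k):
    """Return (T, C) as in equations (6.2) and (6.3)."""
    n = len(original)
    if len(embedded) != n:
        raise ValueError("both spaces must contain the same samples")
    if not 0 < k < n / 2:
        raise ValueError("the normalisation A(k) assumes 0 < k < N/2")
    r = rank_matrix(original)       # r(x_i, x_j): ranks in the original space
    r_hat = rank_matrix(embedded)   # r_hat(x_i, x_j): ranks in the low-dim space
    a_k = 2.0 / (n * k * (2 * n - 3 * k - 1))
    # U_k: neighbours in the low-dimensional space that are not neighbours
    # in the original space, penalised by their original rank.
    t = 1.0 - a_k * _penalty(r, r_hat, k)
    # V_k: neighbours in the original space that are lost in the
    # low-dimensional space, penalised by their low-dimensional rank.
    c = 1.0 - a_k * _penalty(r_hat, r, k)
    return t, c
```

Called with two lists of coordinate tuples in the same sample order, the function returns 𝑇 = 𝐶 = 1 when all 𝑘-neighbourhoods are preserved. The pairwise ranking is quadratic in the number of samples, so the sketch is only suited to small examples.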
Continuity quantifies how well the original neighbourhoods are preserved, i.e. it measures if there are neighbourhoods of 𝑘 points in the original space which are not preserved because of discontinuities in the low-dimensional space.

I applied these metrics to the results of the low-dimensional projections obtained in section 6.2. Both metrics depend on the number 𝑘 of neighbours that they are evaluated on, so I chose different neighbourhood sizes ranging from 1 to 3% of samples (rows) in the dataset. The results are depicted in figure 6.5 and figure 6.6. tSNE, LLE, the PCA-derived linear methods (PCA, ICA, MDS) and nMDS have a trustworthiness measure of more than 0.95 across all neighbourhood sizes in the Iris data (figure 6.5A). The PCA-derived non-linear method DRR performs slightly worse, as do kPCA and Isomap. Laplacian Eigenmaps perform worst, only reaching 0.9 for high neighbourhood sizes. In general, the dependency of the local methods on neighbourhood size becomes apparent, as the trustworthiness of the kernel eigenmap methods varies most strongly across the different neighbourhood sizes. The six methods performing well in terms of trustworthiness for the Iris data (tSNE, LLE, PCA, ICA, MDS and nMDS) also keep the level of discontinuities introduced in the low-dimensional space low, as seen by high measures of continuity (figure 6.5B). To get an estimate of 𝑇 and 𝐶 for a poor representation of the original data, I randomly chose neighbourhoods in the original space and computed trustworthiness and continuity measures for these and the original Iris data, leading to median measurements of 0.51 for both 𝑇 and 𝐶 (results not shown in the figure to allow for a clearer visualisation of the trustworthiness range 0.85 to 1). For the roll data, Isomap has by far the best performance in terms of trustworthiness (figure 6.6A), which confirms the visual results (figure 6.4B). LLE and nMDS also score above 0.9.
These three methods together with kPCA and DiffusionMaps are best in preserving continuities (figure 6.6B). The trustworthiness for all linear methods is similar and consistently lower than the best scoring non-linear methods. The worst results in terms of continuity are observed for tSNE and DRR and both methods show discontinuities in the visualisation (figure 6.4B). For the reference point of trustworthiness and continuity based on random neighbourhoods, results similar to those found for the Iris dataset were observed, with median 𝑇 = 0.52 and 𝐶 = 0.52. Overall, the trustworthiness and continuity measures reflect the results obtained from the visualisation of the data by their low-dimensional representation: linear methods are most suitable for linear data and non-linear methods for non-linear data.

[Figure 6.5, panels A (trustworthiness) and B (continuity). x-axis: Neighbours [% of samples], 1.0 to 3.0; y-axis: 0.85 to 1.00. Methods: DM, DRR, ICA/PCA/MDS, LLE, Isomap, LE, PEER, kPCA, nMDS, tSNE. Type: linear, non-linear.]
Figure 6.5: Quality of the dimensionality reduction in the Iris dataset. The trustworthiness (A) and continuity (B) of the projections into the low-dimensional space for the Iris dataset were computed according to equation (6.2) and equation (6.3). The neighbourhood sizes ranged from one to five neighbours, corresponding to 0.6 to 3.4% of samples.

6.4. Dimensionality reduction for feature extraction

Apart from serving as a tool for visualisation, dimensionality reduction is often used for feature extraction. While visualisation is limited to one, two or three dimensions, for feature extraction one is interested in the intrinsic dimensionality of the data, which can be much higher. Metrics such as the ones introduced in the previous section can indicate which methods provide a trustworthy dimensionality reduction.
However, they do not help with choosing the number of dimensions in the low-dimensional space. Here I propose a novel, simple stability criterion for the choice of dimension (section 6.4.1) and show that features selected based on the stability criterion are able to capture underlying genetic structure (section 6.4.2).

[Figure 6.6, panels A (trustworthiness) and B (continuity). x-axis: Neighbours [% of samples], 1 to 5; y-axis: 0.85 to 1.00. Methods: DM, DRR, ICA/PCA/MDS, LLE, Isomap, LE, PEER, kPCA, nMDS, tSNE. Type: linear, non-linear.]
Figure 6.6: Quality of the dimensionality reduction on the 2D manifold embedded in 3D. The trustworthiness (A) and continuity (B) of the projections into the low-dimensional space for the 2D manifold were computed according to equation (6.2) and equation (6.3). The neighbourhood sizes ranged from 10 to 50 neighbours, corresponding to 1 to 5% of samples.

6.4.1. Stability of dimensionality reduction

An assumption in dimensionality reduction for feature selection is that these techniques capture the variation or structure of the high-dimensional space in the low-dimensional components. In an ideal scenario, any technical or unwanted covariates have been accounted for a priori (e.g. through regression) and the low-dimensional components will only capture the true biological structure in the data. While the dimensionality reduction techniques intrinsically learn structures based on the observed data, they should be robust against small changes in the data such as removing or adding a moderate number of samples. In the following, I will describe a method of finding robust low-dimensional representations and will call these representations stable. As such, stability is a simple but effective way of ensuring reproducibility, but cannot be used to distinguish an appropriate from a less appropriate low-dimensional representation.
In contrast, a dimensionality reduction that is not stable is certainly not capable of producing reliable results.

In order to estimate the stability of the dimensionality reduction techniques and to investigate different parameters potentially influencing the stability, I used PhenotypeSimulator (chapter 3) to simulate datasets of 1,000 phenotypes with different numbers of samples and phenotype components as described in section 3.2. The sample sizes ranged from 500 samples, as observed in small cohort studies with dimensionality-reduced phenotypes [Pausova & al., 2007], to 10,000 [Liu & al., 2012]. All phenotypes were simulated with genetic variant effects, infinitesimal genetic effects and noise effects. A total of 50,000 SNPs was simulated with allele frequencies of 0.1, 0.2 and 0.4 chosen with equal probability. 20 SNPs were selected for the simulation of genetic variant effects with effect sizes drawn from 𝒩(0, 1). The genetic kinship matrix was estimated based on all simulated SNPs. For each sample size, an additional phenotype set was simulated that also contained non-genetic covariate and correlated noise effects. The parameters for the simulation are summarised in table 6.3. For each simulation set-up, ten independent datasets were simulated and subsequent analyses applied to each dataset individually.

Table 6.3: Simulation parameters of phenotypes used for stability estimation. 𝑁: number of samples, 𝑃: number of traits; ℎ²: total genetic variance, ℎ𝑠²: variance of genetic variant effects, ℎ𝑔²: variance of genetic random effects, 1−ℎ²: total noise variance, 𝛿: variance of non-genetic covariate effects, 𝜌: variance of correlated noise effects; pcorr: correlation of correlated noise effects, 𝜃: proportion of shared genetic variant effects, 𝜂: proportion of shared genetic random effects, 𝛾: proportion of shared non-genetic covariate effects, 𝛼: proportion of shared noise random effects.
Parameter            Parameter values
𝑁                    500, 1,000, 10,000
𝑃                    1,000
ℎ2                   0.4
ℎ𝑠2                  0.01
ℎ𝑔2                  0.99
(1-ℎ2)               0.6
(1-ℎ2)𝛿              0.4, 0.4
(1-ℎ2)(1-𝛿)𝜌         0.2, 0
(1-ℎ2)(1-𝛿)(1-𝜌)     0.4, 0.6
pcorr                0.4
𝜃                    0.8
𝜂                    0.8
𝛾                    0.8
𝛼                    0.8

To test the stability of dimensionality reduction techniques, I chose a cross-validation approach, where I randomly selected 80% of the simulated samples, applied a dimensionality reduction technique and recorded the results. For each dataset, I repeated this step ten times. Subsequently, I did a pairwise comparison of the ten low-dimensional representations of the dataset, i.e. 45 comparisons. For each pairwise comparison, I selected the samples common to both subsets and computed the Spearman correlation of the components across these samples. I matched each of the components in the first representation to the component in the second representation with which it had maximum correlation. The matching algorithm started at the highest correlation and allowed each component to be matched exactly once. In case of a tie, it was matched to the closest component in rank that had not been matched yet. After finding the pairs of highest correlation, I counted the number of components that passed a given threshold. Components that showed a correlation of more than 0.9 were considered stable.

I applied the twelve dimensionality reduction methods described in section 6.1 with the parameters summarised in table 6.2 to the different simulated datasets and determined the trustworthiness, continuity and stability of each method. Instead of directly using the raw simulated data as input for the dimensionality reduction, I followed standard practice and used the residuals from a linear regression of the simulated data on the known confounders (introduced as non-genetic covariate effects in the simulation). For methods that required the specification of the dimensionality, I provided an initial estimate of 𝑛𝑑𝑖𝑚 = 100.
These 100 dimensions will be the 100 components explaining most variance in the data for methods based on or including a pre-processing step that uses variance selection (PCA, DRR, tSNE, ICA, MDS, nMDS and Laplacian Eigenmaps). For PEER, which uses iterative model updates, selecting a dimensionality that is too high will be compensated for by the weights associated with the components, which will effectively set the contribution of the non-informative components to zero. In this way, an initial poor choice of too many dimensions will affect the final estimated components only minimally. In LLE, the provided dimension is only used as a maximum value and the estimation of any component is not affected by the estimation of subsequent components [Roweis & Saul, 2000; Kayo, 2006].

[Figure 6.7, panels A and B (trustworthiness) and C (measure). y-axis: 0.5 to 1.0; x-axis: DiffusionsMaps, DRR, ICA, LLE, MDS, Isomap, LaplacianEigenmap, PCA, PEER, kPCA, nMDS, tSNE. Legend: samples (500, 1000, 10000); model (with correlated background, without correlated background); type (trustworthiness, continuity).]
Figure 6.7: Performance of dimensionality reduction techniques on simulated datasets. The trustworthiness and continuity (equation (6.2) and equation (6.3)) of twelve dimensionality reduction methods on ten independent simulated datasets for each phenotype setup were computed. 1,000 phenotypes with non-genetic covariates and observational noise effects or non-genetic covariates, observational noise effects and correlated noise effects were simulated for datasets of 500, 1,000 and 10,000 samples. For each dataset, a ten-fold cross-validation of the dimensionality reduction and subsequent computation of trustworthiness and continuity was conducted. The results of ten evaluations on the ten independent datasets are summarised in the boxplots. A. Trustworthiness of the dimensionality reduction depending on the number of samples in the simulated dataset (noise background model: no correlated background). B. Trustworthiness depending on the background noise structure of the phenotypes (sample size: 10,000). C. Performance of the dimensionality reduction techniques in terms of trustworthiness and continuity (sample size: 1,000, noise background model: correlated background).

Figure 6.7 summarises the effects of sample size and background structure on the different dimensionality reduction methods. The effect is measured as the trustworthiness and continuity of the projection across the ten subsets of each dataset. For most methods, the sample size has only minor effects on the trustworthiness of the low-dimensional projection. Laplacian Eigenmaps are the exception to this observation, as the trustworthiness of the dimensionality-reduced datasets sharply increases with sample size (figure 6.7A). The effect of the background structure of the phenotype is shown in figure 6.7B. Most models perform marginally better on data without correlated background structure, while the trustworthiness of the representation found by Isomap and PEER is distinctly better on this data type. In contrast, ICA performs slightly better on datasets with correlated background structure. Two thirds of the models that yield trustworthy projections also perform well in terms of continuity (figure 6.7C). PEER and ICA seem to be better at preserving original neighbourhoods (continuity) than they are at ensuring that the samples in low-dimensional space were in proximity in the original space (trustworthiness). The opposite trend can be observed for LLE and Laplacian Eigenmaps. kPCA performs worst overall and is only marginally better than randomly simulated neighbourhoods as a low-dimensional representation (section 6.3).
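The stability criterion of section 6.4.1 (Spearman correlation of components across shared samples, greedy one-to-one matching starting at the highest correlation, counting matched pairs above the threshold) can be sketched in plain Python as follows. This is an illustrative sketch and not the code used for the analyses. The function names, the dictionary-based input format, the use of the absolute correlation (so that sign-flipped components can be matched) and the exact form of the tie-break are assumptions.

```python
from itertools import combinations


def _ranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, i.e. the Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant component carries no rank information
    return cov / (vx * vy) ** 0.5


def count_stable(rep_a, rep_b, threshold=0.9):
    """rep_a, rep_b: dict mapping sample id -> list of component values.
    Returns the number of matched component pairs with |rho| >= threshold."""
    common = sorted(set(rep_a) & set(rep_b))
    cols_a = [[rep_a[s][c] for s in common] for c in range(len(rep_a[common[0]]))]
    cols_b = [[rep_b[s][c] for s in common] for c in range(len(rep_b[common[0]]))]
    # Assumption: absolute correlation; ties are resolved in favour of the
    # pair of components that are closest in their index (rank).
    pairs = sorted(
        ((abs(spearman(ca, cb)), i, j)
         for i, ca in enumerate(cols_a) for j, cb in enumerate(cols_b)),
        key=lambda p: (-p[0], abs(p[1] - p[2])),
    )
    used_a, used_b, stable = set(), set(), 0
    for rho, i, j in pairs:
        if i in used_a or j in used_b:
            continue  # each component is matched exactly once
        used_a.add(i)
        used_b.add(j)
        if rho >= threshold:
            stable += 1
    return stable


def pairwise_stability(representations, threshold=0.9):
    """Number of stable components for every pair of representations."""
    return [count_stable(a, b, threshold)
            for a, b in combinations(representations, 2)]
```

For ten subsampled representations, pairwise_stability returns 45 counts. How these counts are summarised per method (for example by their median) is left open in this sketch.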
The stability of the dimensionality reduction techniques depending on the background model is displayed in figure 6.8. For the majority of methods (DRR, MDS, Isomap, PCA, PEER and nMDS), the background structure of the dataset does not influence the stability of the components, with three components reliably recovered in the ten-fold cross-validation. DiffusionMaps and LLE detect more stable components for both data types, with five and seven stable components in the data with correlated background and seven and five without correlated background, respectively. kPCA performs worse for both data types, while ICA has no components that pass the 0.9 correlation threshold for either of the data types. tSNE only finds stable components in the model with correlated background structure. Results for the datasets with 500 and 10,000 samples were consistent with these observations.

6.4.2. Stable features enable discovery of genetic associations

In genetic association studies of high-dimensional phenotypes, features selected by dimensionality reduction methods serve as the response variable and one aims to find genetic components that are associated with this low-dimensional phenotype representation. Studies employing these techniques range from genotype association studies on features extracted from facial images [Liu & al., 2012] and metabolic profiles [Avery & al., 2011] to genome-wide pathway association studies of multiple correlated phenotypes [Zhang & al., 2012]. These studies commonly test the association between SNPs and the top few components that explain most phenotypic variance. For instance, the first eleven PCs capturing more than 90% of variance of facial features were used as the phenotypes in the study by Liu and colleagues.
Similarly, Avery and colleagues used the first eight PCs extracted from the metabolic profiles based on 19 traits for the genotype to phenotype mapping analysis.

[Figure 6.8. y-axis: Stability, 0 to 15; x-axis: DiffusionsMaps, DRR, ICA, LLE, MDS, Isomap, LaplacianEigenmap, PCA, PEER, kPCA, nMDS, tSNE. Legend: model (with correlated background, without correlated background).]
Figure 6.8: Stability of dimensionality reduction techniques for different background noise models. The stability of twelve dimensionality reduction methods on the ten independently simulated datasets per setup were computed. 1,000 phenotypes with non-genetic covariates and observational noise effects or non-genetic covariates, observational noise effects and correlated noise effects were simulated for the datasets with 1,000 samples. For each dataset, a ten-fold cross-validation of the dimensionality reduction and subsequent evaluation of the stability was conducted. Components that passed the correlation threshold of 𝑐𝑜𝑟 = 0.9 were considered stable and the number of stable components per method is displayed. For ICA, no stable components were detected for either dataset; for tSNE the same was true for the dataset without correlated background structure.

Contrary to this common practice, Aschard & al. [2014] showed in simulations and in application to a dataset of coagulation traits that only testing the top PCs can lead to a loss in power for detecting genetic associations. They demonstrated that combining signal across PCs can increase power and that components explaining little phenotypic variance can be equally important as components explaining large variation. However, as seen in the previous section, phenotype components that reflect lower variance structures might reflect technical or biological noise and may not be recovered when subsampling the dataset.
As such, the choice of dimensionality when using the extracted features for genotype to phenotype mapping comes down to a trade-off between gain in power and stability.

In order to test if the dimensionality reduction techniques employed so far can stably capture phenotypic components that yield enough power to serve as proxy phenotypes in association studies, I simulated a new set of phenotypes with genetic variant effects that affect different proportions of traits. I used the same strategy and parameter settings for the simulation of the noise effects as described for the phenotype simulation of the stability analysis, i.e. datasets with non-genetic covariates and observational noise effects or non-genetic covariates, observational noise effects and correlated noise effects (table 6.3). For each of these datasets, I simulated different structures of genetic variant effects by adding 20 SNP effects to a subset of traits. The percentage of affected traits ranged from 1 to 100, corresponding to between ten and all 1,000 simulated traits. Independent of the subset size, the proportion of variance of the genetic variant effects in relation to the total phenotypic variance was set to 0.05, corresponding to ℎ𝑠2 = 0.02 for ℎ2 = 0.4. The basis for the simulation of the genetic effects was the genotypes and kinship estimate of the simulated cohort with related individuals described in section 3.1. For each setup, i.e. each background noise model (with/without correlated background structure) and percentage of traits affected, I generated ten datasets and applied the twelve dimensionality reduction methods to each dataset. To determine the stability of the dimensionality reduction and decide which components to use for the genetic association study, I employed the cross-validation approach described in section 6.4.1.
For the majority of dimensionality reduction methods, the percentage of traits affected does not affect their stability in the dataset with correlated background structure (figure 6.9A). LLE and Laplacian Eigenmaps do not follow this general observation and show some fluctuations in the stability, without showing an obvious trend. ICA and tSNE on average do not find stable components for any number of traits affected. In the model without background structure (figure 6.9B), there is a general trend towards more stable components in the dataset when a larger subset of traits was affected by the genetics. DiffusionMaps and Laplacian Eigenmaps show the opposite behaviour, while there is no clear trend for tSNE. ICA again cannot stably recover any components. For all methods, the median number of stable components across different proportions of traits influenced by genetics in the model without background structure is approximately the same as the number of components in the model with background structure.

For every setup, stable components were selected and used as the response variables in a multivariate LMM with an any effect trait design matrix (section 1.7.8) for the 20 causal SNPs and the kinship matrix as the random genetic effect. The significance of the association was assessed by the permutation approach described in section 4.7, where the original p-values are compared to p-values from the same association model on permuted genotypes to obtain an empirical p-value. Figure 6.10 shows the percentage of causal SNPs that could be detected with this approach (𝑝empirical < 0.01). ICA is not depicted as it was not possible to find stable components for any of the phenotype sets. In general, the percentage of detected true SNPs is lower for components derived from phenotypes with correlated background structure (figure 6.10A) as compared to those from phenotypes without correlated structure (figure 6.10B).
Similar to the observation for the stable number of components (figure 6.9), the percentage of detected SNPs does not vary much depending on the number of traits affected in the datasets with correlated background structure. For the phenotypes without correlated background structure, there is a trend towards detecting more SNPs for larger subsets of traits affected by the genetics. For both phenotype models, the PCA-based methods (DRR, MDS, PCA and PEER) and nMDS perform better in recovering the underlying genetics. These methods all perform best in finding components that allow for detecting causal SNPs in phenotypes where 40% of all traits were affected by the genetics, with up to 80% of SNPs detected on average.

[Figure 6.9, panels A and B. y-axis: Stability, 0 to 12; x-axis: DiffusionsMaps, DRR, ICA, LLE, MDS, Isomap, LaplacianEigenmap, PCA, PEER, kPCA, nMDS, tSNE. Legend: Traits affected [%]: 1, 10, 40, 80, 100.]
Figure 6.9: Stability of dimensionality reduction techniques for different genetic variant and observational noise models. A. Components from datasets with correlated background structure. B. Components from datasets without correlated background structure. The stability of twelve dimensionality reduction methods on ten independent simulations of ten datasets (two different noise background models, five subset sizes of traits affected by the genetic variant effect, 1,000 phenotypes) were computed. For each dataset, a ten-fold cross-validation of the dimensionality reduction with 80% of the 1,000 samples and subsequent evaluation of the stability was conducted. Components with 𝑐𝑜𝑟 ≥ 0.9 were considered stable and the median number of stable components per method and dataset is displayed (points). The vertical lines indicate the 25% and 75% quantile for the ten independent simulations.

[Figure 6.10, panels A and B. y-axis: Detected true SNPs [%], 0 to 80; x-axis: DiffusionsMaps, DRR, Isomap, kPCA, LaplacianEigenmap, LLE, MDS, nMDS, PCA, PEER, tSNE. Legend: Traits affected (shown as proportions): 0.01, 0.1, 0.4, 0.8, 1.]
Figure 6.10: Genetic association of stable components from dimensionality reduction. A. Detected SNPs from datasets with correlated background structure. B. Detected SNPs from datasets without correlated background structure. The stable components for each dataset were used as the response variables in a multivariate LMM with an any effect trait design matrix for the 20 causal SNPs and the kinship matrix as the random genetic effect. Vertical lines indicate the 25% and 75% quantile, points represent the median for the ten independent simulations.

The power to detect SNPs in the standard genotype-phenotype mapping approach depends, among other factors such as sample size and allele frequency, on the effect sizes of the SNP [Cohen, 1992; Halsey & al., 2015; Astle & al., 2016]. For phenotypes derived via dimensionality reduction, the effect size of the SNP has an additional influence on the outcome of the association. While the effect size of the SNP is linked to power as in any genotype-phenotype mapping, its influence is likely to also occur before the mapping, namely in finding stable components that reflect this genetic structure.

To test if finding low-dimensional components that capture the underlying genetics depends on the effect size of the causal SNPs, I computed the mean absolute value of effect sizes from the causal SNPs for all simulated datasets. I then classified these SNPs into two categories, based on passing the FDR threshold of 𝑝empirical < 0.01. SNPs with empirical p-values below that threshold are considered “detected”, the remainder are “not-detected”.
Figure 6.11 depicts the effect sizes of these SNP categories depending on the dimensionality reduction technique that was used for deriving the phenotypes, summarised across all proportions of traits affected by the genetic variant effects. ICA and kPCA are not depicted as they either did not detect stable components or their stable components did not detect associations. On average, the effect sizes of the detected SNPs are larger than those of SNPs that are not detected. The results for the linear methods (MDS, PEER, PCA) and nMDS are mostly identical, with median effect sizes for detected SNPs slightly higher in the model with correlated background (figure 6.11A) than without (figure 6.11B). DRR follows the same trend, as does LLE, albeit at marginally higher effect size levels. DiffusionMaps, Laplacian Eigenmaps, Isomap and tSNE require higher effect sizes to detect SNPs in the model without correlated background structure. The spread and number of outliers of effect sizes for undetected causal SNPs are smallest for DRR, MDS, PEER, PCA and nMDS for both noise background models. For the background model with correlated structure, the spread of effect sizes for DiffusionMaps is equally low with a median of 0.2. For the other methods, large numbers of outliers are observed, i.e. SNPs with high effect sizes that were not detected.

6.5. Dimensionality reduction is a powerful tool for genetic association studies

In this chapter, I reviewed dimensionality reduction methods with different properties and underlying mathematical concepts. I analysed their performance in terms of trustworthiness and continuity and introduced a new measure, stability, to assess the low-dimensional phenotype representations they generate. Finally, I investigated whether low-dimensional representations of the original phenotypes are capable of recovering the underlying genetic structure in simulations.
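
The idea behind the stability criterion can be illustrated with a minimal Python sketch. It is not the implementation used for the analyses in this chapter: PCA stands in for an arbitrary dimensionality reduction method, and the way components are matched (maximal absolute Pearson correlation on the samples shared by two subsets, summarised by the median over all pairs of subsets) is an assumption made for illustration. Only the subset size of 80%, the ten repetitions and the threshold 𝑐𝑜𝑟 ≥ 0.9 are taken from the text; all function and variable names are hypothetical.

```python
import numpy as np


def pca_scores(X, k):
    """First k principal component scores of X (samples x traits)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]


def count_stable_components(X, k, n_subsets=10, fraction=0.8,
                            threshold=0.9, seed=0):
    """Count components that are recovered consistently across random subsets.

    Assumption (not from the thesis): a component is called stable if the
    median, over all pairs of subsets, of its best absolute correlation with
    any component of the other subset is at least `threshold`.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    size = int(round(fraction * n))
    subsets = [np.sort(rng.choice(n, size=size, replace=False))
               for _ in range(n_subsets)]
    scores = [pca_scores(X[idx], k) for idx in subsets]

    best = [[] for _ in range(k)]
    for a in range(n_subsets):
        for b in range(a + 1, n_subsets):
            # positions of the shared samples within each subset
            _, in_a, in_b = np.intersect1d(subsets[a], subsets[b],
                                           return_indices=True)
            # k x k block of correlations between components of a and b
            cross = np.corrcoef(scores[a][in_a].T, scores[b][in_b].T)[:k, k:]
            for c in range(k):
                best[c].append(np.abs(cross[c]).max())
    return int(sum(np.median(values) >= threshold for values in best))


# Toy data: three strong latent factors plus weak independent noise.
rng = np.random.default_rng(1)
loadings, _ = np.linalg.qr(rng.normal(size=(20, 3)))  # 20 x 3, orthonormal
latent = rng.normal(size=(300, 3)) * np.array([10.0, 6.0, 3.0])
X = latent @ loadings.T + 0.5 * rng.normal(size=(300, 20))
n_stable = count_stable_components(X, k=3)
```

Any other method can be substituted for pca_scores in this sketch, provided it returns one column per component.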
[Figure 6.11, panels A and B. x-axis: detected / not-detected; y-axis: simulated effect sizes; colour: method (DiffusionMaps, DRR, Isomap, Laplacian Eigenmaps, LLE, MDS, nMDS, PCA, PEER, tSNE).]
Figure 6.11: Effect size distribution of discovered SNPs. A. Power from datasets with correlated background structure. B. Power from datasets without correlated background structure. The mean of the simulated effect sizes per SNP across all traits was computed. SNPs were classified into “detected” and “not-detected” based on the threshold 𝑝empirical < 0.01. The plot shows the dependence of detecting causal SNPs on their effect size for different background models and dimensionality reduction techniques across all proportions of traits affected by the SNPs.

I was able to show on datasets with known structure (Iris and roll dataset) that the trustworthiness and continuity criteria agree with the visual assessment of the methods’ performance. Based on these results, I used the trustworthiness and continuity criteria to evaluate the effect of sample size and phenotype structure on the performance of the different methods. For the majority of methods analysed in this thesis, the sample size has only minor effects on the performance. In general, most models perform marginally better on data without correlated background structure.

Trustworthiness and continuity are helpful in determining the correspondence of the high- and low-dimensional space. The stability criterion that I defined in this chapter evaluates a different aspect of the dimensionality reduction. It measures the number of components that can be reliably recovered in cross-validation and thus helps to determine the stable dimensions of the low-dimensional space. Applied to the two different data types, with and without background structure, it shows that background structure alone does not influence the number of stable components much.
A stronger effect on the number of stably recovered components is observed when varying the proportions of traits influenced by the genetic variant effects. This seems intuitive since SNP effects are mathematically equivalent to any other type of fixed effect confounders that are present in the data. An increase in the proportion of traits affected generally leads to an increase in components recovered. This increase is maximal for about 40 to 80% of traits affected. This trend is reflected in the number of causal SNPs that can be detected when using the stable components as phenotypes in a genetic association model. The higher number of stable components at 40 to 80% of traits affected captures more of the underlying genetics.

In the analyses of stability and power to detect genetic associations, the linear and PCA-derived methods seemed to outperform the other methods. In particular, kPCA, ICA and tSNE yielded the least promising results: ICA did not recover any stable components, the number was very low for kPCA, and tSNE only found stable components in the model with correlated background structure. In the association analyses, these components were either not associated at all (kPCA) or only associated for SNPs with large effect sizes (tSNE). However, these conclusions come with a note of caution. Foremost, the performance of all these methods is intrinsically linked to the underlying data structure. Thus, in general, the different mathematical models of the dimensionality reduction methods will make some methods more suitable for the analysis of a given dataset than others. In this simulation study, the high-dimensional datasets for stability and genetic association analyses are more similar to the Iris data than to the roll dataset. As such, it is encouraging that the chosen evaluation criteria trustworthiness and continuity show similar results for suitability of the methods, i.e.
linear methods seem to perform better on linear data than the non-linear methods. In addition, the non-linear methods all require the specification of model parameters for which I chose the default settings. Improved results might be observed when different parameter settings are evaluated for the different methods. To extend and improve this study, high-dimensional datasets more reflective of the non-linear structure of the roll dataset could be simulated and the non-linear dimensionality reduction methods evaluated on a range of parameter settings.

This simulation study has shown that dimensionality reduction methods are a valid intermediate step in genotype to phenotype mapping of high-dimensional datasets. Although methods like LiMMBo (chapter 4) enable association studies with large numbers of phenotypes, there is always a trade-off between exploiting correlated structure in the phenotypes and the joint mapping cost in the form of degrees of freedom when evaluating the test statistic. Employing dimensionality reduction techniques to find the correlated background structures in the phenotypes while simultaneously reducing the degrees of freedom offers considerable potential for the multivariate analysis of these phenotypic traits. For applications on real data, one should carefully evaluate different dimensionality reduction methods, as the choice strongly depends on the data, and investigate parameter settings to find components that best reflect the original data. The introduced stability criterion is particularly useful in genetic association studies, as dimensionality reductions that are not stable cannot be expected to produce reliable results.

7 GWAS of left ventricular wall thickness

The structure of the human heart is determined by an interplay of genetic factors and complex environmental influences [Payne & al., 1995; Sanoudou & al., 2005; O’Toole & al., 2008].
One common, heritable trait used to predict clinically relevant heart conditions is left ventricular mass (LVM). In particular, an increase in LVM is associated with an increased risk of heart failure and sudden death [Haider & al., 1998; Post & al., 1997; Lorell & Carabello, 2000]. The increase in LVM through the thickening of the left ventricular wall is a direct response to a rise in hemodynamic burden which causes the hypertrophy of existing myocytes [Lorell & Carabello, 2000]. The thickening of the wall can occur in a symmetric fashion through concentric thickening of the ventricle with a small cavity dimension. However, about 58% of all cases of left ventricular hypertrophy are asymmetric [Davies & McKenna, 1995] and the observed asymmetry patterns are diverse in distribution and occurrence [Hughes, 2004; Florian & al., 2012]. A number of genetic factors have been shown to be involved in these asymmetric changes in the structure of the left ventricle [Davies & McKenna, 1995; Chen & Chien, 1999; van der Merwe & al., 2008].

To date, GWAS in African American [Fox & al., 2013], Caucasian [Vasan & al., 2007; Vasan & al., 2009; Arnett & al., 2009] and more recently Japanese cohorts [Sano & al., 2016] have attempted to identify genomic loci that are associated with LVM, where LVM was assessed using echocardiographic measures or 2D cardiac magnetic resonance imaging. However, none of the studies found associations that pass the commonly applied genome-wide significance threshold. Many factors might have contributed to the lack of genetic associations in these studies, such as a lack of power due to small sample or effect sizes. Given the genetic effects observed for clinical LVM phenotypes [Davies & McKenna, 1995; Chen & Chien, 1999; van der Merwe & al., 2008], the assumption of a genetic contribution to the natural variation in heart morphology holds, despite the negative results obtained in these studies.
However, the asymmetric nature of changes in heart morphology might make LVM an inaccurate phenotype for detecting these genetic effects. To investigate genetic influences on overall heart structure instead of on a reduced representation such as LVM, spatially resolved, quantitative heart phenotypes are needed.

A recent advance in cardiac MRI is the use of 3D imaging of the heart as a whole as opposed to multiple transverse sections of the heart by 2D imaging. The latter technique has been the clinical gold standard, but recent studies have shown that 3D imaging improves spatial resolution, especially at the base and apex of the heart (figure 2.1), and can avoid technical issues arising from 2D imaging [de Marvao & al., 2014]. Detailed images derived from the 3D imaging technique combined with genotype data would allow for an investigation into spatially-confined changes in heart morphology.

Genetic association studies based on imaging phenotypes are widely applied in the field of neuroscience [Filippini & al., 2009; Ho & al., 2010; Jahanshad & al., 2013; Hibar & al., 2015]. The first unbiased study using genome-wide genetic markers to find genetic associations with brain activity patterns was conducted by Stein and colleagues. They associated every voxel of 3D brain scans with all genetic markers. Following this approach, associating heart morphology as represented in the 3D scans would require testing approximately 140,000 voxels. However, voxel-wise GWAS is limited in power and does not take into account any spatial correlation between the voxels [Ge & al., 2014]. To overcome these limitations and offer more practical measurements for clinical use, De Marvao and colleagues have developed a technique to extract 3D features of the cardiac morphology from the 3D scans [de Marvao & al., 2014].
As part of the digital heart project [Cook & O’Regan, 2010], they created the first at-scale cohort of about 1,500 detailed 3D statistical models of the variation in cardiac morphology from healthy volunteers. Based on these models, standard clinically relevant measurements such as LVM can be computed. Far beyond these simple 1D measurements, the 3D models allow spatially derived phenotypes such as left-ventricular wall thickness or curvature to be resolved for over 27,000 coordinates. However, the substantial challenge of handling the still large number of correlated dimensions present in these models remains.

In this chapter, I describe the GWAS of phenotypes derived from the 3D statistical models of the digital heart project. Within this project, I was responsible for the quality control and imputation of the genotypes, and conducted the GWAS of the 3D phenotypes. My colleagues collected the DNA samples, performed MRI scans and provided the 3D phenotyping. I will first describe the genotyping and phenotyping strategy and then show the results from applying different dimensionality reduction techniques to the 3D heart phenotypes. Based on the criteria described in chapter 6, I chose the most suitable methods and conducted a GWAS with components derived thereof as proxy phenotypes. Finally, I investigated the associated loci for any spatial association with the 3D heart phenotypes. Using the genotype information which I processed and imputed, a preliminary study on genetic associations was accepted for publication [Biffi & al., 2017] and we are currently planning the publication of the analyses and results described in this chapter.

7.1. Data

7.1.1. Genotypes

Quality Control. Genotyping and genotype calling were carried out at the Genotyping and Microarray facility at the Wellcome Trust Sanger Institute, UK and Duke-NUS Medical School, Singapore.
Genotypes were assessed in five batches using Illumina HumanOmniExpress-12v1-1 (Sanger, two batches), Illumina HumanOmniExpress-24v1-0 (Duke-NUS, two batches) and Illumina HumanOmniExpress-24v1-1 chips (Duke-NUS). SNPs were called via the GenCall software for clustering, calling and scoring of genotypes [Teo & al., 2007]. For batches run on the same platform, genotype signals were combined and called in a single analysis, leading to three independent genotype batches: Sanger12 (1,344 samples), Duke-NUS12 (284 samples), Duke-NUS3 (96 samples). I carried out the quality control (QC) on the raw genotype calls, the phasing and the imputation at a per-batch level. The final QC of the imputed data was conducted across all batches and only SNPs passing the control in every batch were used in subsequent analyses.

Prior to QC, I matched the rsID descriptions (chromosome, chromosomal positions and allele order) of the three batches to the reference set I would use for imputation, a combined UK10K [UK10K Consortium, 2015] and 1,000 Genomes [1000 Genomes Project Consortium, 2015] reference panel. For rsIDs not included in the reference panel, I retrieved location and allele order from the Ensembl human variation annotation (GRCh37.p13, 15.04.2016). rsIDs that matched neither reference were removed from further analyses (4,681 across all chips). In order to avoid batch effects in SNP calling simply based on the probe sequences, I confirmed that probes targeting the same SNP on different chip versions had the same sequence. As this was the case, no SNPs were removed at this stage. I followed an adapted quality control protocol from Anderson & al. [2010] to assess the quality of the genotyping on a per-individual and per-marker level. Unless stated otherwise, the PLINK software (version 1.9) [Purcell & al., 2007; Chang & al., 2015] was used for all QC analyses.
In summary, the per-individual QC included the identification of individuals with discordant sex information, high missing SNP rates (more than 3% of SNPs not called) and outlying heterozygosity rates (more than three standard deviations from the mean heterozygosity rate). Population substructure arising from different ethnic origins of samples was examined by comparing the sample genotypes to genotypes from the HapMap Phase III study [The International HapMap Consortium, 2005] for four ethnic populations (with subpopulations, figure B.7 in the appendix). Samples that clustered with HapMap III individuals of European ancestry were kept for further analyses. The per-marker QC included filtering of SNPs with a missing call rate in more than 1% of the samples and SNPs which significantly deviate from Hardy-Weinberg equilibrium (HWE, 𝑝 < 0.001). After removing samples and SNPs that failed QC, I confirmed that any pattern of missing genotype information was not batch-specific. To analyse these patterns, I treated each pair-wise combination of batches as a case-control set-up and computed the differential missingness of SNPs common to all batches. None of the 631,877 common SNPs had to be removed due to significant differential missingness (𝑝 < 10⁻⁵). Table 7.1 shows an overview of sample and SNP numbers before and after the QC described above. The QC plots for each step can be found in figures B.5 to B.7 in the appendix.

Table 7.1: Sample and SNP numbers before and after the QC. For each batch (first column), the number of male (m)/female (f) samples and SNPs before and after QC are listed. Rate specifies the genotyping rate of samples within one batch after QC.

Batch        Pre-QC samples (m/f)   Pre-QC SNPs   Post-QC samples (m/f)   Post-QC SNPs   Rate
Sanger12     1,344 (614/730)        719,665       998 (463/535)           677,036        0.998
Duke-NUS12   284 (118/166)          716,503       179 (68/111)            682,016        0.998
Duke-NUS3    96 (48/48)             7,713,014     62 (34/28)              657,497        0.998

Phasing and imputation. Phasing and imputation were conducted in two separate steps. For phasing, I used SHAPEIT (version 2.r727) [Delaneau & al., 2012; Delaneau & al., 2013] to generate estimated haplotypes for each sample that passed the quality control. The window size for phasing was set to 2 Mb, and the number of conditioning states per SNP to 200. All other parameters were set to default values. The phased genotypes were then imputed with IMPUTE2 (version 2.3.0) [Marchini & al., 2007; Howie & al., 2009] based on the combined 1,000 Genomes [1000 Genomes Project Consortium, 2015] and UK10K [UK10K Consortium, 2015] reference panel. I set the imputation interval to 3 Mb, with a buffer region of 250 kb on either side of the analysis interval. As suggested in the user manual, I used an effective population size of 20,000 and set the number of reference haplotypes to 1,000. Again, the default was used for all parameters not specified here.

Combining datasets. I combined the three genotype batches after imputation and filtered them again on a per-sample and per-marker level. On the per-sample level, I excluded related individuals because of the difficulties that might arise in adjusting for relatedness in the processing of the phenotypes via dimensionality reduction. A more detailed explanation will follow in section 7.2. Relatedness was estimated from the proportion of SNPs shared between two individuals, with identity by descent (IBD) estimated as PI_HAT on the genotyped SNPs via PLINK as described by Anderson & al. [2010]. For any pair of individuals with a PI_HAT of greater than 0.125, the individual with the higher SNP calling rate was retained in the analysis. For the quality control on the per-marker level, I used the statistical information about the imputation certainty, the info metric, given as additional output by IMPUTE2. The metric typically takes values between zero and one, with values closer to one indicating high imputation certainty.
I excluded any SNP with an info score of less than 0.4 in at least one of the batches. Approximately 60% of all imputed SNPs were excluded based on this criterion. After combining the datasets, I used SNPTEST (v2.5) [Marchini & Howie, 2010] to compute the minor allele frequency (MAF) and p-value for deviation from Hardy-Weinberg equilibrium per SNP. SNPs with a significant deviation from Hardy-Weinberg equilibrium (𝑝 < 0.001) or a minor allele count of less than 20 alleles (corresponding to a minor allele frequency of 0.008) were removed, leading to a further decrease of approximately 41%. In total, about 23% of the imputed SNPs passed every filtering criterion. A summary showing the number of imputed SNPs per batch, the number of SNPs after imputation quality filtering and the number after filtering for MAF and deviation from Hardy-Weinberg equilibrium is depicted in figure 7.1. Exact numbers can be found in table A.2.

[Figure 7.1. x-axis: chromosomes (chr1 to chr22); y-axis: number of SNPs; bars: imputed SNPs (Sanger12), imputed SNPs (Duke-NUS12), imputed SNPs (Duke-NUS3), imputed SNPs after imputation filtering (INFO > 0.4), imputed SNPs after HWE and MAF filtering.]
Figure 7.1: Overview of SNP numbers after imputation and imputation quality control. The imputation of the SNPs based on the genotypes from SNP arrays was done on a per-batch level. The number of SNPs for each batch after imputation is shown as red bars and is very similar for each of the three batches (exact numbers in table A.2). About 40% of SNPs are retained after filtering for the info metric (light grey bars). The bars in dark grey show the final number of SNPs per chromosome.

After imputation and imputation quality control, the dataset contains 9,233,118 SNPs from 1,207 samples.
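
The thresholds and percentages quoted for the post-imputation filtering can be checked with a few lines of Python. This is a sketch for illustration only, not part of the analysis pipeline: the variable and function names are hypothetical, and the percentages are the rounded values given in the text, so the results are approximate.

```python
# Minor allele count threshold expressed as a minor allele frequency.
n_samples = 1207                 # samples after QC and relatedness filtering
min_minor_allele_count = 20
maf_threshold = min_minor_allele_count / (2 * n_samples)  # two alleles per sample

# Fraction of imputed SNPs left after both filtering steps (rounded inputs).
retained_after_info = 0.40       # about 60% excluded by the info filter
removed_by_hwe_maf = 0.41        # further decrease through HWE and MAF filtering
retained_overall = retained_after_info * (1 - removed_by_hwe_maf)


def passes_info_filter(info_by_batch, threshold=0.4):
    """A SNP is kept only if its info score reaches the threshold in every batch."""
    return min(info_by_batch) >= threshold


print(f"MAF threshold: {maf_threshold:.4f}")
print(f"Fraction of imputed SNPs retained: {retained_overall:.3f}")
```

A SNP with info scores of 0.9, 0.5 and 0.39 in the three batches would be excluded under this rule, because one batch falls below the threshold.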
IMPUTE2 yields imputed genotypes encoded in triplets of posterior probabilities for the possible allele combinations (𝐴𝐴, 𝐴𝐵, 𝐵𝐵). These probabilities were converted into expected genotypes 𝐺 by the dosage model [Howie & al., 2011]:

𝐺 = 0 × 𝑝(𝐴𝐴) + 1 × 𝑝(𝐴𝐵) + 2 × 𝑝(𝐵𝐵) = 𝑝(𝐴𝐵) + 2 × 𝑝(𝐵𝐵)    (7.1)

7.1.2. Phenotypes

The phenotyping was done by my collaborators, in particular Antonio de Marvao. CMR imaging and generation of 3D models of the left ventricle derived from these images were conducted at Hammersmith Hospital, London. In the following, I will briefly describe the methodology of their automatic phenotyping approach. The technical details of the image acquisition, the analysis and their improved performance over standard methods are described in detail in [de Marvao & al., 2014].

In the automated phenotyping approach developed by Antonio de Marvao and colleagues, cardiac structures are accurately extracted from raw 3D cardiac magnetic resonance images via a multi-atlas PatchMatch algorithm (figure 7.2, 1) to generate 3D models of the individuals’ hearts (figure 7.2, 3). The cardiac structures of interest in this study were left ventricular cavity, myocardium and right ventricular bloodpool at end-diastole and end-systole. The multi-atlas PatchMatch algorithm uses a local database of segmented and quality controlled cardiac MRI atlases, to which each newly acquired image is compared. The database of atlases was created by Antonio de Marvao, who initially selected 20 subjects and manually labelled the approximately 140,000 voxels per image into the three categories named above (left ventricular cavity, myocardium and right ventricular bloodpool). These manually classified images were then divided into smaller patches (atlases), which served as the initial training dataset for the segmentation algorithm.
In the database generation phase, subsequent successful segmentations of new images by the method described below were added, yielding a total of 1,072 images in the final database. In addition to serving as a database for the segmentation algorithm, the database images were used to generate a template image of average heart size, position and orientation.

For each new image, six landmarks are manually placed on the image, which enables the subsequent image registration between the target and the atlas images. After registration, a multi-atlas PatchMatch algorithm finds corresponding patches of adjacent voxels within the atlas and target images (figure 7.2, 2). Each patch in the target image is given the label of the closest matching atlas patches; combining the labels of all patches produces the final segmentation. Lastly, the segmented image is registered to the template image to make the spatial coordinates in the 3D models consistent between all samples.

A surface rendering algorithm allows for the extraction of information from a segmentation volume such as the left ventricular myocardium into a surface representation. Through such an algorithm, the wall thickness, curvature and fractional wall thickening at 27,623 positions in the left ventricle were extracted for each individual (figure 7.2).

[Figure 7.2, panels 1 to 3; panel 2 shows a target image with two atlas images (Target, Atlas 1, Atlas 2).]
Figure 7.2: Cardiac phenotyping based on cardiac magnetic resonance images. 1. Detailed 3D images of the heart were acquired in the left ventricular short axis plane from base to apex. 2. The images were segmented into left ventricular myocardium (green), left ventricular blood pool (red) and right ventricular blood pool (yellow) and registered to a common template image via a multi atlas-based technique. 3. Through a surface rendering algorithm of the registered segmentation, a 3D model of the heart was generated and wall thickness measurements derived at 27,623 positions of the left ventricle. The left ventricle is shown in solid colors, with the color scheme representing average wall thickness, increasing from light to darker colors. As a point of reference, the right ventricle is depicted as a mesh.

To assess the reproducibility of the phenotyping approach, one individual was scanned eight times and the images segmented as described above. These repeat scans allowed for the quantification of variation in the segmentation by the coefficient of variation (CV). The CV is a standardised measure of dispersion and is defined as the ratio of the standard deviation to the mean value. I computed the CV for each of the 27,623 positions in the 3D heart model across the eight scans and projected the results onto the template image (figure 7.3). Overall, the dispersion is very low, i.e. the reproducibility is high. Only at the base of the left ventricle in proximity to the right ventricle can a slight increase in dispersion be observed (figure 7.3, red area). The low dispersion shows the accuracy of the segmentation and surface rendering methods. Based on this result and further quality control criteria such as the comparison between the segmentations and manually labelled images (details in [de Marvao & al., 2014]), the wall thickness measurements were considered reliable phenotypes for subsequent analyses.

Wall thickness measurements were successfully extracted for 1,185 of the 1,207 individuals that passed the genotyping quality control.

[Figure 7.3. Colour scale: dispersion (0.09, 0.18, 0.27, 0.36, 0.44).]
Figure 7.3: Phenotype reproducibility. The dispersion in left ventricular wall thickness at 27,623 positions was computed as the standard deviation over the mean across eight segmentations derived from independent scans of one individual. The right ventricle is shown as a point of reference (mesh structure).

7.2.
Dimensionality reduction yields stable low-dimensional phenotype representations

The detailed 3D models of the heart structure offer a rich dataset for investigating spatially-resolved genetic associations on cardiac morphology. By extracting the relevant features from the cardiac magnetic resonance images, the phenotype space has been reduced from intensity values at 140,000 voxels to wall thickness measurements at about 27,000 3D coordinates. While this processing condensed the original image space into relevant phenotype information, considering each position as a phenotype would still require about 2.8 × 10⁴ single-trait association tests (one per position), which have to be adjusted for multiple testing and which would not be able to take advantage of correlation structure in the phenotypes. In contrast, a multi-trait association test would be more powerful by modelling the correlated traits jointly; however, its test statistic would be subject to a test with about 2.8 × 10⁴ degrees of freedom. To avoid this burden of correcting for the high dimensionality of the traits while making use of intrinsic structure in the data, I applied the twelve dimensionality reduction methods tested in chapter 6 to the 27,623 heart wall thickness measurements in order to find the best low-dimensional representation of the dataset. The low-dimensional components will then serve as proxy phenotypes in the GWAS.

Before applying the dimensionality reduction methods, I adjusted each of the 27,623 left ventricular wall thickness measures independently for any known covariates such that the low-dimensional components ideally only reflect structure truly related to the underlying cardiac biology. Any covariates with an assumed linear effect on the wall thickness were used as explanatory variables in a linear model with wall thickness as a response variable.
These include the biological covariates sex (643/542, female/male), age (40.3 ± 13.3 years, mean ± standard deviation), height (170.8 ± 9.3 cm) and weight (71.9 ± 13 kg), and the technical covariates MRI operator, date of the image acquisition and date of the image segmentation. Figure 7.4 summarises these covariates by their respective univariate distributions (diagonal) and their pair-wise joint distributions across the 1,185 genotyped and phenotyped individuals in the cohort.

Other, more complicated covariance structure could arise due to related individuals in the dataset. In order to avoid confounding of subsequent analyses potentially introduced through high levels of relatedness between a number of individuals, related samples were removed from the analysis based on the quality control of the genotypes (section 7.1.1).

The parameters for the dimensionality reduction were chosen as in table 6.2 and the maximum dimension set to 100. The optimal number of neighbours was estimated as 𝑛 = 40. The dimensionality reduction was performed on the residuals of the linear regression described above. I used the new stability criterion introduced in section 6.4.1 to find the low-dimensional representations that can be reliably recovered in subsets of the dataset. As described for the simulations (section 6.4.1), I split the dataset into subsets of 80% of the samples, computed the dimensionality reduction and repeated this step ten times. For each cross-validation, I computed the trustworthiness (equation (6.2)) and continuity (equation (6.3)). Across all cross-validation subsets, I then determined the stability. Running ICA on this dataset with the fastICA::fastICA function in R was not possible; the computation failed with Fortran indexing errors. As the dimensionality reduction with ICA yielded the least stable results in the previous chapter, this dimensionality reduction strategy was not investigated further on the heart data.
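
The covariate adjustment that precedes the dimensionality reduction can be sketched as an ordinary least-squares fit per phenotype, followed by taking the residuals. The Python example below is an illustration under assumptions, not the code used for the heart data: the function name, the toy covariates (standardised for simplicity) and the simulated phenotypes are hypothetical, and categorical covariates such as sex or MRI operator are assumed to be dummy coded before they enter the design matrix.

```python
import numpy as np


def residualise(Y, covariates):
    """Regress every column of Y on the covariates (plus intercept).

    Y          : samples x phenotypes
    covariates : samples x covariates, categorical variables dummy coded
    Returns the residuals, i.e. Y with the linear covariate effects removed.
    """
    n = Y.shape[0]
    design = np.column_stack([np.ones(n), covariates])
    # One least-squares fit solves all phenotype columns at once.
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return Y - design @ beta


# Toy data: one dummy-coded and two standardised continuous covariates.
rng = np.random.default_rng(0)
n = 200
covariates = np.column_stack([
    rng.integers(0, 2, size=n),   # e.g. sex
    rng.normal(size=n),           # e.g. standardised age
    rng.normal(size=n),           # e.g. standardised height
])
effects = rng.normal(size=(3, 5))
Y = covariates @ effects + rng.normal(size=(n, 5))
residuals = residualise(Y, covariates)
```

By construction, the residuals have zero mean and are uncorrelated with every covariate in the design matrix, so linear covariate effects cannot drive the components found afterwards.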
Figure 7.4: Distribution of covariates in the 3D heart phenotype cohort. Continuous variables: the univariate distribution of each variable is depicted on the diagonal. The upper triangular matrix shows the bivariate distribution, while the lower triangular matrix shows the regression line of their linear fit. Categorical variables: distribution (row) and counts (column) are depicted. Variables shown: sex [f/m], age [years], height [cm], weight [kg], MRI operator, MRI date and segmentation date.

An initial look at the number of stable components showed a median of ten stable components across all methods. As a first manual check of the low-dimensional representations, I qualitatively analysed the distribution and pairwise density of the first ten dimensions for each method. While most methods showed a similar spread and distribution of their components with differing levels of correlation, components from kPCA and DiffusionMaps were clear outliers. Figure 7.5 shows the pairwise comparisons for components from DiffusionMaps and kPCA, as well as Laplacian Eigenmaps and PCA as references for well-behaved methods. For PCA (figure 7.5B), the components show the largely uncorrelated behaviour expected of orthogonal vectors (section 6.1). Components derived from Laplacian Eigenmaps (figure 7.5D) display different levels of correlation, from mostly uncorrelated (DR6 vs DR10) to strong non-linear correlation (DR1 vs DR2).
DiffusionMaps and kPCA show very little spread in their data, with the distribution of each component spiking at a particular value and being zero elsewhere (figure 7.5A,C, diagonal). Similar plots for the other, well-behaved methods can be found in figure B.3 in the appendix. Based on these observations, and without a clear indication as to why these results were observed (i.e. no warnings or error messages in the computation), components from DiffusionMaps and kPCA were not considered in further analyses.

For the majority of methods, the low-dimensional representation has a high level of trustworthiness, with seven methods above 90% for each cross-validation step and for the full dataset (figure 7.6A, boxplots and diamond shapes). Only PEER does not reach that threshold. The same result is observed for continuity, with the exception of LLE, whose low-dimensional representation of the full dataset does not lie above 90% (figure 7.6B). To provide a consistent a priori selection of methods, I only considered stable components reliable if their continuity and trustworthiness measures were above 90% for both the full dataset and the cross-validations. Based on these criteria, components retrieved from DRR (ten), MDS (ten), Isomap (four), Laplacian Eigenmaps (five), PCA (ten) and nMDS (ten) were considered for further analyses.

As demonstrated in chapter 6 and figure 6.2, there is a considerable degree of similarity in the low-dimensional representations for some of the methods tested, especially the linear and PCA-based methods. I analysed the extent of similarity between the stable components from the six methods passing the trustworthiness/continuity threshold by computing the absolute value of the pairwise Pearson correlation of their components (figure 7.7). The stable components from PCA, MDS and nMDS show perfect correlation, as expected given the strong mathematical similarity of these methods when using Euclidean distance as the distance measure.
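The reason for taking the absolute value is that the sign of a component is arbitrary for these methods: two methods may return the same component with opposite orientation. A minimal Python sketch of such a comparison is given below. It assumes that components are compared rank by rank (first with first, second with second) and that the comparison is truncated to the shorter list; the function names and toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_abs_correlation(components_a, components_b):
    """Mean |r| over matching components; zip truncates to the shorter list."""
    pairs = list(zip(components_a, components_b))
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)


# Toy example: the second method returns the same components with flipped sign.
method_a = [[0.5, -1.2, 0.3, 0.9, -0.5], [1.0, 0.2, -0.7, -0.1, -0.4]]
method_b = [[-v for v in component] for component in method_a]
similarity = mean_abs_correlation(method_a, method_b)
```

Without the absolute value, the two toy methods would appear perfectly anti-correlated although they describe the same structure.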
Isomap, which bridges the linear and non-linear models as it is based on MDS and kernel eigenmaps (section 6.1), shows weaker but still strong correlation to the first three methods. Components derived from Laplacian Eigenmaps are only weakly correlated with those from any other method. However, chapter 6 and figures 6.2 and 6.4 also demonstrated the differences in the low-dimensional representations of the other methods, in particular between linear and non-linear methods.

Figure 7.5: Pairwise scatter plots of low-dimensional components derived from left ventricular wall thickness. For components from DiffusionMaps (A), PCA (B), kPCA (C) and Laplacian Eigenmaps (D), pairwise scatter plots of the components (lower triangle) and density plots (upper triangle) are depicted. The diagonal of each plot shows the distribution of the respective component. Row and column labels specify the rank of the component out of the 100 low-dimensional components. Before plotting, each component was mean-centred and divided by its standard deviation in order to have comparable axis dimensions. Given the normalised scale of the data and the purpose of qualitative comparison, axis ticks were omitted for a cleaner visualisation.

Without prior knowledge about the true biological features, i.e. the “real” low-dimensional manifold of the left ventricular wall thickness measurements, it is not possible to know which methods will be most suitable for capturing this manifold structure. Instead of choosing a single method to find components to represent the manifold, I combined all components from the methods above that pass the stability criterion. From the group of highly to perfectly correlated methods (DRR, PCA, MDS and nMDS), I chose the components from PCA as it has no parameters to specify. Thus, the final low-dimensional representation of the 27,623 left ventricular wall thickness measurements comprises ten stable components from PCA, four from Isomap and five from Laplacian Eigenmaps, a total of 19 dimensions.

Figure 7.6: Dimensionality reduction of 3D heart phenotypes. The boxplots in A and B show the maximum trustworthiness and continuity across neighbourhood sizes ranging from 1 to 3% of the samples for the ten cross-validation sets for each method. The diamonds show the respective measures for the full dataset. Dotted lines are drawn at 0.9, the threshold chosen here at which a projection is considered a good representation of the original space. C. The number of traits passing the stability criterion. For methods that passed the threshold for both continuity and trustworthiness in the full dataset, the number of stable traits is printed above the bar chart. The corresponding traits are taken as input for the multi-trait GWAS. Methods shown: DRR, LLE, MDS, Isomap, Laplacian Eigenmaps, PCA, PEER and nMDS.

Figure 7.7: Correlation of low-dimensional components across methods. The Pearson correlation coefficients across the stable components of methods that passed the continuity and trustworthiness criteria were computed. The ellipses show the mean of the absolute value of the correlation across all components. For the comparison of PCA, nMDS, MDS, DRR and DiffusionMaps (ten components each) to Isomap and Laplacian Eigenmaps (three and five components), the first three or five components were chosen for comparison.

7.3. Multi-trait GWAS detects three loci associated with heart wall thickness

Treating the 19 components as proxies for the true phenotypes, I was then able to conduct a mtGWAS to capture the genetics of left ventricular wall thickness. Based on previous studies [Price & al., 2006; Patterson & al., 2006] and the results obtained in section 4.6 and figure 4.5, mtGWAS is known to be well calibrated in cohorts with little population structure and no relatedness. In order to avoid confounding by relationship structure in the dimensionality reduction step, I had already removed related individuals and individuals that were not of European ancestry (section 7.1.1). Given this genotype structure, I used a simple linear model with the components as response and the genotypes as explanatory variables for the mtGWAS of the low-dimensional heart phenotypes. As there are no prior assumptions about the genotype effects, I modelled the SNP effects based on an any effect design matrix (section 1.7.8).

The results of the mtGWAS are depicted in figure 7.8, with three loci that pass the genome-wide significance level of 5 × 10⁻⁸. The qq-plot in figure 7.9 shows a well-calibrated test statistic.

Figure 7.8: Manhattan plot of the multi-trait GWAS on 3D heart phenotypes. The 19 stable components derived from PCA, Isomap and Laplacian Eigenmaps were modelled jointly in an any effect mtGWAS. The p-values of all genome-wide SNPs are depicted. The horizontal grey line is drawn at the level of genome-wide significance: 𝑝 = 5 × 10⁻⁸.

Two loci on chromosome 1 and one locus on chromosome 10 pass the genome-wide significance level. Table 7.2 summarises the chromosomal location, p-values and SNP information of the most strongly associated SNPs per locus. Their genomic context is displayed in figure 7.10.
The locus with the strongest association is located in a regulatory region of a gene-rich area between the SKI gene on the forward strand and the MORN1 gene on the reverse strand (figure 7.10, upper panel; figure 7.11). SKI is a developmental gene in which de novo mutations are associated with a complex early developmental syndrome (Shprintzen-Goldberg syndrome) with craniofacial, bone development and cardiovascular phenotypes [Greally, 1993]. Zebrafish knockdown models of SKI orthologues give rise to complex developmental changes, including cardiac phenotypes [Doyle & al., 2012]. In addition, a non-developmental phenotype for altered expression of a SKI orthologue was observed in rat cardiomyocytes. In this system, overexpression of the rat SKI orthologue leads to a decrease in fibroblast-to-myofibroblast phenoconversion, the main mechanism of fibrotic heart disease [Cunnington & al., 2010; Cunnington & al., 2014; Zeglinski & al., 2016]. Taken together, these studies show an involvement of SKI genes in a variety of cardiac phenotypes across different tissues and stages. The other gene in proximity to rs139971383, the MORN1 gene, is relatively unstudied.

Figure 7.9: Quantile-quantile plot of the multi-trait GWAS on 3D heart phenotypes. The observed genome-wide p-values are plotted against p-values drawn from a uniform distribution on [0, 1] of the same sample size (expected p-values). The diagonal line starts at the origin and has slope one.

The second locus on chromosome 1 lies within intron nine of the MEGF6 gene (figure 7.10, middle panel), which encodes a secreted, calcium ion-binding protein [Nakayama & al., 1998]. It is also in proximity to the PRDM16 gene, in which deletions and mutations were shown to be implicated in two types of cardiomyopathies, left ventricular non-compaction (section 8.1) and dilated cardiomyopathy (section 2.4) [Arndt & al., 2013].
Based on zebrafish models of the observed human genotypes, the authors propose that PRDM16 mutations lead to a decreased proliferative capacity during cardiogenesis. Interestingly, the study also found a link between the SKI and PRDM16 genes, suggesting a functional synergy that leads to decreased cardiac output in zebrafish models with knockdown phenotypes of SKI and PRDM16. rs143266802 is located downstream of the zinc finger protein-encoding gene ZNF487 (figure 7.10, lower panel), which has no associated phenotypes in human (GRCh38.p10, Ensembl release 90 [Aken & al., 2016]).

A search of the GWAS catalogue [MacArthur & al., 2017] (based on entries in the GWAS catalogue as of 07.08.2018) and the Global Biobank Engine, a resource for estimated genetic effects on cancers, autoimmune diseases, psychiatric, neurological and cardiometabolic diseases [GBE, 2017], did not yield any other phenotypes that these SNPs were associated with.

Figure 7.10: Genomic context of loci associated with 3D heart phenotypes. The p-values and genomic location of the three associated loci from the mtGWAS on the stable components from PCA, Laplacian Eigenmaps and Isomap are shown in relation to the p-values of surrounding genotypic markers. Markers are coloured by the level of LD they share with the SNP of interest. There was no LD information available on LocusZoom for the locus depicted in the bottom panel. Generated with LocusZoom [Pruim & al., 2010].

Table 7.2: Strongest genotype-phenotype association per locus for 3D heart GWAS. For each locus, the p-values for SNPs in LD with an 𝑟² > 0.8 in a 50 kb window were compared and only the SNP with the smallest p-value per locus is listed below. Gene: gene in proximity to the SNP and described in detail in the text above. M: major allele, m: minor allele, MAF: minor allele frequency.

SNP          Gene    Chr  Position    P-value        M/m allele  MAF
rs139971383  SKI     1    2,246,921   1.09 × 10⁻¹⁰   C/G         0.013
rs113719231  PRDM16  1    3,427,138   9.04 × 10⁻⁹    C/T         0.11
rs143266802  ZNF487  10   43,978,849  1.54 × 10⁻⁸    C/T         0.022

Figure 7.11: Regulatory context of the locus with the strongest association. The SNP with the strongest association (rs139971383) in the mtGWAS lies in a promoter flanking region that is epigenetically active in myocytes from the left ventricle (Ensembl, Human Regulatory Features, GRCh37.p13).

The mvLM per SNP yields individual effect size estimates for each trait that is jointly modelled. These effect size estimates can help in understanding the genotype-phenotype association in two ways. Firstly, traits driving the association with the SNP are expected to have high effect size estimates. Secondly, traits that are similarly affected by the SNPs will have similar effect size estimates. In figure 7.12, I show the effect sizes for each of the 19 components per SNP, clustered by their Euclidean distance. For the locus with the strongest association with the 19 proxy phenotypes of wall thickness, there are two clusters of high effect size estimates (figure 7.12A, rs139971383). While one of them contains components from one method only (LaplacianEigenmaps1, 2 and 6), the other cluster contains two components from different methods, Isomap1 and PCA1. Similarly, the association of rs143266802 seems to be driven by a combination of components from all three methods (PCA2, Isomap2 and LaplacianEigenmap3). These results demonstrate the strength of this analysis approach, in which different aspects of phenotype morphology are captured by different methods that can then jointly represent a wider aspect of the phenotype structure.
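The clustering of effect size estimates by their Euclidean distance (average-linkage hierarchical clustering, as stated in the caption of figure 7.12) can be illustrated with the following Python sketch. The function name and the effect size values are hypothetical and chosen only to show the principle; they are not the estimates from the mtGWAS.

```python
import math


def average_linkage(vectors, labels):
    """Agglomerative clustering; returns the merges as (members_a, members_b, distance)."""
    clusters = {i: [i] for i in range(len(vectors))}
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average of all pairwise Euclidean distances between the clusters.
                total = sum(math.dist(vectors[i], vectors[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(labels[i] for i in clusters[a]),
                       tuple(labels[i] for i in clusters[b]), d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Hypothetical effect sizes of four components for three SNPs (one row per component).
labels = ["PCA1", "IM1", "LE1", "LE2"]
effects = [(0.8, 0.1, 0.0), (0.7, 0.1, 0.1), (-0.6, 0.0, 0.2), (-0.5, 0.1, 0.3)]
merges = average_linkage(effects, labels)
```

In this toy example, the two components with similar positive effects are merged first, followed by the two components with similar negative effects, before both groups are joined. Components from different methods end up in one cluster whenever the SNPs affect them similarly.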
The corresponding trait correlations are shown in figure 7.12B. A number of effect size clusters can seemingly be explained by the strong correlation of their respective traits (indicated by coloured boxes). In contrast, independent analyses of components from a single method could not detect these strong signals (figure B.8). Only the locus situated in the regulatory region between MORN1 and SKI was detected in a mtGWAS with the components from Laplacian Eigenmaps alone (figure B.8A; p-value: 1.36 × 10⁻⁸), confirming the effect size cluster structure observed for this locus, with large effect sizes for LaplacianEigenmaps1, 2 and 6. Additional signal in this independent analysis was overall weaker than in the combined analysis. GWAS with components from Isomap and PCA alone did not yield any associations (figure B.8B and C in the appendix). The proxy phenotypes are critical for the discovery of the genetic association but do not necessarily represent a biologically meaningful conformation. In order to understand the effect on the underlying biology without mediation via the dimensionality reduction methods, I linked the SNPs back to the original heart phenotypes.

In a first, simple approach, I used the discovered SNPs as explanatory variables in a simple linear model with left ventricular mass as the phenotype and sex, age, height and weight as additional covariates. None of the three SNPs discovered with the mtGWAS shows association with left ventricular mass (rs139971383: 𝑝 = 0.89, rs113719231: 𝑝 = 0.22, rs143266802: 𝑝 = 0.68). This result is not discouraging, however, since the hypothesis was that the stable components capture regional variation in left ventricular wall thickness. Summarising wall thickness variation in a single scalar value such as left ventricular mass might not be able to capture these regional changes in mass.
Figure 7.12: Effect size estimates and trait correlation from the 3D heart GWAS. A. The effect size estimates from the most strongly associated SNPs at each locus were clustered across components and SNPs by average-linkage hierarchical clustering of their Euclidean distances. The dendrogram of the components is labelled based on the methods used to generate the low-dimensional representation. The numbering indicates the position of the component as returned from the algorithm, i.e. for PCA the ordering based on the amount of variance explained. LE: Laplacian Eigenmaps; IM: Isomap. B. Trait-trait correlations of the corresponding low-dimensional representations used as response variables in the GWAS. For comparison, the order of the traits was matched to the clustering of the effect sizes. Effect size clusters where strong trait-trait correlations are observed are indicated by colour-matched boxes.

In order to analyse whether the discovered SNPs show association specific to certain regions in the left ventricle, I evaluated the relationship between the genotypes of the most strongly associated SNP and the original, spatially resolved left ventricular wall thickness measurements. For each of the 27,623 positions, I fitted a simple linear model with the covariate-adjusted wall thickness measurements (data identical to the input data for the dimensionality reduction, section 7.2) as the response variable and the genotype of rs139971383 as the explanatory variable.
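A per-position test of this kind can be sketched in a few lines of Python. The sketch below is illustrative only: the function name and toy data are hypothetical, and the two-sided p-value uses a normal approximation to the t-distribution, which is an assumption of this sketch (reasonable for a cohort of more than 1,000 individuals, but crude for the twelve toy samples).

```python
import math


def snp_association(genotypes, phenotype):
    """Slope, standard error and approximate two-sided p-value of phenotype ~ genotype."""
    n = len(genotypes)
    mg = sum(genotypes) / n
    mp = sum(phenotype) / n
    sxx = sum((g - mg) ** 2 for g in genotypes)
    sxy = sum((g - mg) * (p - mp) for g, p in zip(genotypes, phenotype))
    beta = sxy / sxx
    alpha = mp - beta * mg
    rss = sum((p - alpha - beta * g) ** 2 for g, p in zip(genotypes, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # normal approximation
    return beta, se, p_value


# Hypothetical toy data: minor allele counts and covariate-adjusted wall
# thickness at two positions, one with and one without a genotype effect.
genotypes = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
noise = [0.1, -0.2, 0.05, -0.1, 0.2, -0.05, 0.15, -0.15, 0.0, -0.05, 0.1, -0.1]
positions = {
    "affected": [0.5 * g + e for g, e in zip(genotypes, noise)],
    "unaffected": noise,
}
results = {name: snp_association(genotypes, y) for name, y in positions.items()}
```

Running the same function over all 27,623 positions yields one p-value per position, which can then be mapped back onto the 3D coordinates of the left ventricle.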
Figure 7.13 shows these associations with the SNP in relation to their location on the left ventricle. Importantly, although none of these associations would be likely to survive the large multiple testing burden if used for discovery, they do localise to a specific region of the left ventricle that is affected by this SNP.

Figure 7.13: Association of rs139971383 with left ventricular wall thickness. The 27,623 covariate-adjusted wall thickness measurements in the left ventricle were used as the response variable in a simple linear model with the genotype of rs139971383 as the explanatory variable. The −log₁₀(p-value) of the association of each model is projected onto its corresponding 3D position. Darker colours indicate stronger associations. Generated with ParaView.

7.4. Successful imaging genetics of cardiac phenotypes

In this chapter, I have described the step-wise feature extraction from high-dimensional cardiac magnetic resonance images of 140,000 voxels to a low-dimensional representation comprising 19 components from linear and non-linear dimensionality reduction methods. The initial, atlas-based image segmentation of the original cardiac magnetic resonance images yielded reliable cardiac phenotypes at more than 27,000 positions in the left ventricle. One such phenotype was the left ventricular wall thickness, which I transformed into a substantially lower-dimensional component space by applying a variety of dimensionality reduction methods with different properties and consequently different low-dimensional representations. Using the three measures I introduced in chapter 6, I was able to make a principled decision about which low-dimensional features to retain for further investigations into the genetics.
Combining all stable and trustworthy components from different dimensionality reduction methods provided a robustness to the phenotypes which allowed for qualitatively different, latent cardiac structures to be represented in the final phenotype. I successfully mapped genotypes to these 19 phenotypes in a mtGWAS that detected three significantly associated loci. In order to link these genetic associations back to the observed wall thickness phenotypes, I tested the most strongly associated SNP against each wall thickness measurement and discovered a region highly associated with this SNP.

These results are promising for genetic association studies of very high-dimensional and correlated phenotypes, as well as for this specific study on cardiac morphology. In the emerging field of imaging genetics [Ge & al., 2014], the phenotype space ranges from simple photographs of face morphology [Liu & al., 2012; Shaffer & al., 2016] to functional MRI scans of brain activity [Stein & al., 2010; Hibar & al., 2015]. While each of these phenotypes is generated by different methods and will be subject to different challenges in acquisition and quality control, the ultimate challenge lies in handling the high dimensionality of the phenotypes. The dimensionality reduction methods tested on the simulated data in chapter 6 and the 3D heart dataset in this chapter are all publicly available and can be readily applied to any fully phenotyped dataset.

As well as providing a practical example of the dimensionality reduction methods, the results of this specific combination of dimensionality reduction with GWAS are of great interest to my cardiac biology collaborators. The pre-existing cardiac-related phenotypes of SKI and PRDM16 and their interaction in experimental rodent systems are very reassuring.
However, before committing to further studies and publication, I will need to undertake additional manual quality control of the genotypes and ideally would formulate additional ways to ensure the soundness of the result. Although a stringent quality control has been applied both to the actual genotypes and the imputations, poor genotype calling can lead to faulty imputations [Morris & al., 2010]. I have already manually checked the genotype calling quality of 11,377 genotypes of the Sanger12 batch, but manual quality control of the other datasets, re-imputation and potential direct genotyping of the associated SNPs should be conducted. To ensure the soundness of the result, dimensionality reduction and GWAS of 3D heart phenotypes from an independent dataset would be the ideal scenario. Unfortunately, high-resolution MRI scans are not routine and even the UK Biobank MRI scans are not directly equivalent. Other possibilities include investigating the specific biology behind these loci or the specific molecular biology of the regulatory elements to provide additional evidence for the biological correctness of these associations. We are also planning to investigate these genetic loci and the spatially confined association signal in the left ventricle (figure 7.13) in cohorts of patients suffering from cardiomyopathies (section 2.4).

As a first step towards understanding the spatial association signal, we investigated additional image analysis approaches for phenotype extraction of left ventricular phenotypes. The extraction of ventricular trabeculation phenotypes and their genetic associations are described in the following chapter.

8 GWAS of left ventricular trabeculation

In addition to the unsupervised phenotype selection through dimensionality reduction, the raw cardiac magnetic resonance images also provide the opportunity for a guided phenotype extraction.
From a combined clinical and research point of view, phenotypes which are implicated in diseases but which also show strong natural variation are of special interest. Trabeculation phenotypes of the left ventricle fit this description.

8.1. Left ventricular trabeculation

Trabeculation is the formation of small irregular muscle protrusions from the inside of the heart wall and has its origin in early heart development. As described in section 2.3, the chambers of the human heart develop through the looping of the early cardiac tube. During this process, the compartmentalisation of the heart begins and the composition of the cardiac tissue changes, especially in the ventricles. At this stage, the ventricular myocardium can be described as a loose, “spongy” network of myocardial fibres that form sheet-like protrusions (trabeculae) towards the cardiac lumen. The formation of these structures supports the oxygen and nutrient exchange in the heart [Chen & al., 2009] by blood flowing through the intertrabecular spaces [Zambrano & al., 2002]. Later in development, the myocardium starts to become more compact and thicker and the large protrusions into the heart lumen flatten or disappear [Yousef & al., 2009]. This compaction process progresses from the base of the heart towards the apex and from epicardium to endocardium [Zambrano & al., 2002].

Failure of the myocardial compaction process leads to persistence of ventricular hypertrabeculation. Clinically, the majority of hypertrabeculation phenotypes are observed in the left ventricle and are referred to as left ventricular non-compaction (LVNC) [Zambrano & al., 2002]. It is still unknown if LVNC constitutes a distinct disease or is a shared characteristic of different cardiomyopathies [Captur & al., 2013].
Linkage studies and targeted sequencing of associated regions have revealed a number of genes implicated in familial cases of LVNC [Bleyl & al., 1997; Klaassen & al., 2008; Moric-Janiszewska & Markiewicz-Łoskot, 2008], with a wide range of functions of the encoded proteins. These include cardiac muscle 𝛼-actin [Monserrat & al., 2007], 𝛽-myosin heavy chain [Budde & al., 2007] as well as cytoskeleton-associated proteins like 𝛼-dystrobrevin [Ichida & al., 2001] and Cypher/ZASP [Vatta & al., 2003]. Knock-out studies of genes regulating cardiovascular development have contributed to a molecular understanding of clinically relevant LVNC phenotypes [Chen & al., 2009; Mysliwiec & al., 2011]. However, the genetics of sporadic LVNC remain largely unknown [Zambrano & al., 2002].

In addition to LVNC as a clinical phenotype, variation in trabeculation pattern and strength has also been observed in healthy volunteers. Several studies have analysed the range of natural and diseased non-compaction phenotypes with respect to clinical and demographic parameters [Petersen & al., 2005; Captur & al., 2014]. In particular, two independent studies have observed an increase in the ratio of non-compacted to compacted myocardium (NC:C) in individuals of African-American and Hispanic descent compared to Caucasian individuals. The lowest NC:C ratios were observed for individuals of Chinese descent [Kawel & al., 2012; Captur & al., 2015]. The genetics of this natural variation and of clinically observed sporadic phenotypes in humans are still poorly understood.

In this chapter, I analyse natural genetic variation driving left ventricular trabeculation phenotypes in healthy volunteers. Trabeculation phenotypes were extracted automatically via fractal analysis from the cardiac magnetic resonance images of the healthy volunteers by my collaborators. Based on these phenotypes, I conducted a GWAS of left ventricular trabeculation.

8.2. Image acquisition and phenotyping

The cohort used for the genetic association study of left ventricular trabeculation consists of the samples with European ancestry that passed the genotyping and imputation quality control described in section 7.1.1. Since there is no reason to suspect confounding of the phenotype processing by the relatedness of samples, as was the case in the previous chapter (chapter 7), related samples were included in the cohort. For each of the 1,207 samples, the level of trabeculation was measured at six to ten positions in the heart. Trabeculation was quantified via fractal analysis, a technique for measuring the complexity of patterns [Eke & al., 2002]. Fractal analysis yields a unit-less measure, the fractal dimension (FD), which quantifies the complexity of the analysed structure. The higher the FD, the higher the complexity of the structure, i.e. the more trabeculation is observed in the left ventricular wall.

The pipeline for the automatic detection and quantification of trabeculation in the left ventricle was developed by Jiashen Cai and Pawel Tokarczuk. In the following paragraphs, I briefly describe the image acquisition and phenotype extraction procedure.

2D cardiac magnetic resonance imaging was conducted at the Hammersmith Hospital, London. The fractal dimensions were derived from standard left ventricular short axis 2D cardiac magnetic resonance images in the plane from base to apex. Each section had a thickness of 8 mm with a 2 mm gap between sections. A more detailed description of the imaging parameters can be found in [de Marvao & al., 2014]. Fractal analysis was automated according to the protocol proposed by Captur and colleagues [2013]. First, the images (figure 8.1, 1) were binarised into blood pool and myocardium (figure 8.1, 2) and the endomyocardial border was extracted via edge detection (figure 8.1, 3).
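Once the border has been extracted, the FD is estimated by box counting (figure 8.1, steps 4 and 5). The following Python sketch illustrates this step under stated assumptions; it is not the pipeline developed by my collaborators. The function names are hypothetical, the border is represented as a list of pixel coordinates, and the regression is written against ln(1/scale) so that the estimate is positive, which is a sign convention chosen for this sketch.

```python
import math


def box_count(border_pixels, box_size):
    """Number of grid boxes of side `box_size` containing at least one border pixel."""
    return len({(x // box_size, y // box_size) for x, y in border_pixels})


def fractal_dimension(scales, counts):
    """Least-squares slope of ln(count) against ln(1 / scale)."""
    xs = [math.log(1.0 / s) for s in scales]
    ys = [math.log(c) for c in counts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Sanity check on a toy border: a straight line of 64 pixels has dimension 1.
line = [(x, 0) for x in range(64)]
box_sizes = [1, 2, 4, 8, 16]
line_counts = [box_count(line, size) for size in box_sizes]
line_fd = fractal_dimension(box_sizes, line_counts)

# Scales and counts read from the example shown in figure 8.1.
figure_scales = [0.38, 0.25, 0.13, 0.08, 0.05]
figure_counts = [9, 19, 40, 81, 150]
figure_fd = fractal_dimension(figure_scales, figure_counts)
```

The straight line serves as a check that a simple, non-trabeculated border yields an FD of one. For the example scales and counts shown in figure 8.1, the sketch gives a value close to the FD of 1.34 reported there; a small difference is expected because the scales in the figure are rounded.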
The FD was determined by placing grids with known spacing (scale) of increasing size (i.e. increasing number of edges) on the image and subsequently counting the number of boxes with non-zero pixels, i.e. how many boxes contain at least one pixel of the border (figure 8.1, 4). The slope of the linear regression of the ln-transformed scale versus the ln-transformed counts corresponds to the FD (figure 8.1, 5) [Captur & al., 2013].

Figure 8.1: FD phenotyping scheme. FD is determined for each of the left ventricular short axis slices derived from standard 2D cardiac magnetic resonance images. 1. An example left ventricular short axis slice is depicted on the left; its location in the heart is indicated by the dashed line in the heart image on the right. 2. The image is binarised into blood pool (white) and other structures (black). 3. The border between the white and the black background is the endocardial wall, which can be extracted via edge detection. 4. A standard box-counting method is applied to the image of the extracted border, where grids of known spacing (scale) are placed on top of the image and boxes containing at least one pixel of endocardial border are counted. 5. The slope of the regression of the ln-transformed scale versus the ln-transformed count is the FD. Adapted from [Captur & al., 2013]. Example values shown in the figure (scale, count): 0.38, 9; 0.25, 19; 0.13, 40; 0.08, 81; 0.05, 150; resulting FD = 1.34.

8.3. The complexity of trabeculation shows a consistent base to apex pattern

For 1,192 out of the 1,207 genotyped samples, FD measurements could successfully be computed at each slice. Their distribution from base to apex is depicted in figure 8.2. Both at the tip of the apex and the end of the basal zone, FD is generally lowest and increases towards the mid-section of the heart. Similar results have been observed by [Kawel & al., 2012] and [Captur & al., 2014].
Captur and colleagues [2014] have shown that most variation between healthy and diseased individuals exists in FD measurements derived from the apical slices of the heart (figure 8.2A) and used the maximal FD value observed in these slices as their final phenotype. I followed the strategy of dividing the measurements into apical and basal (figure 8.2B) and used the maximum FD observed in each group as the final phenotypes. For individuals with an uneven number of slices, the centre slice was not considered in the computation of the maximum values.

[Figure 8.2: panel A, location of the slices, classified as apical or basal; panel B, FD (y-axis, 1.0 to 1.4) by slice (x-axis, 1 to 10), grouped by location.]

Figure 8.2: FD measurements from base to apex. A. Location of the 2D cardiac magnetic resonance image slices and their classification into apical and basal. B. FD measurements for all samples were interpolated via a cubic spline function to the maximum number of 10 slices for easier visualisation. Subsequent analyses were based on the original, non-interpolated FD measurements.

[Figure 8.3: pairwise plot matrix of Sex [f/m], Age [years], Height [cm], Weight [kg], LVM [g], Slices, FD basal and FD apical.]

Figure 8.3: Relationship between FD measures and covariates. Continuous variables: the univariate distribution of each variable is depicted on the diagonal. The upper triangular matrix shows the bivariate distribution, while the lower triangular matrix shows the regression line of their linear fit. Categorical variables (sex): distribution (first row) and counts (first column) are depicted.

8.4.
Relationship between trabeculation phenotypes and covariates

I analysed the distribution of the two FD measurements $\mathrm{FD}^{\mathrm{basal}}_{\mathrm{max}}$ and $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$ in relation to biological and cardiac covariates (figure 8.3). Both FD measurements are correlated with age, weight and left ventricular mass (LVM). $\mathrm{FD}^{\mathrm{basal}}_{\mathrm{max}}$ is additionally associated with height, while $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$ is also correlated with sex (table 8.1). The association of LVM and $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$ confirms the findings of Captur and colleagues [Captur & al., 2014], who found increased FD measures for individuals with increased LVM. However, the causality of the relationship has not been determined yet. All associated covariates were used as covariates in the GWAS, except for LVM, whose causal relationship to the FD measurements is unclear.

Table 8.1: Association of $\mathrm{FD}^{\mathrm{basal}}_{\mathrm{max}}$ and $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$ with covariates. Association was determined based on a simple linear model for each FD measurement with all covariates as explanatory variables without interaction effects.

Covariate   FD basal max              FD apical max
Sex         $5.47 \times 10^{-1}$     $4.96 \times 10^{-3}$
Age         $3.04 \times 10^{-8}$     $2.87 \times 10^{-4}$
Height      $4.33 \times 10^{-2}$     $4.43 \times 10^{-1}$
Weight      $1.25 \times 10^{-4}$     $4.55 \times 10^{-5}$
LVM         $1.21 \times 10^{-12}$    $2.62 \times 10^{-3}$
Slices      $8.02 \times 10^{-1}$     $3.89 \times 10^{-1}$

8.5. Left ventricular trabeculation is associated with two genomic loci

The extraction of FD measurements from the 2D cardiac magnetic resonance images yields quantitative phenotypes capturing the complexity of trabeculation in the left ventricle. I used the two summary measures $\mathrm{FD}^{\mathrm{basal}}_{\mathrm{max}}$ and $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$ described above as the response variables in a mtGWAS with the genetic marker and sex, age, height and weight as covariates. Since the dataset contained related individuals, I extended the model used in section 7.3 to a LMM by including an additional random genetic effect based on the RRM of the samples. The RRM was estimated from the samples' genotypes as described in section 1.7.6.
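To make the role of the RRM concrete, the following sketch computes one commonly used genotype-based relationship estimate: the average product of standardised allele counts across SNPs. This is an assumption for illustration. The estimator of section 1.7.6 is not restated in this chapter and may differ, for instance in SNP filtering or allele-frequency weighting. The function name and the toy genotypes are hypothetical.

```python
def realised_relationship_matrix(genotypes):
    """Estimate a sample-by-sample relationship matrix from allele counts.

    genotypes: one list per sample, holding 0/1/2 allele counts per SNP.
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    standardised_columns = []
    for s in range(n_snps):
        column = [genotypes[i][s] for i in range(n_samples)]
        mean = sum(column) / n_samples
        variance = sum((g - mean) ** 2 for g in column) / n_samples
        if variance == 0:
            continue  # a monomorphic SNP carries no information on relatedness
        sd = variance ** 0.5
        standardised_columns.append([(g - mean) / sd for g in column])
    if not standardised_columns:
        raise ValueError("no polymorphic SNPs available")
    n_used = len(standardised_columns)
    # K[i][j] is the average product of the standardised genotypes of i and j.
    return [
        [sum(col[i] * col[j] for col in standardised_columns) / n_used
         for j in range(n_samples)]
        for i in range(n_samples)
    ]


# Hypothetical toy data: 4 samples, 3 SNPs.
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 2, 1],
]
K = realised_relationship_matrix(genotypes)
```

Under this standardisation the matrix is symmetric and its diagonal averages to one, so that it can serve as the sample covariance of the random genetic effect in a LMM.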
The Manhattan and quantile-quantile plots for the joint analysis of $\mathrm{FD}^{\mathrm{basal}}_{\mathrm{max}}$ and $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$ are depicted in figure 8.4 and figure 8.5, showing two loci that reach genome-wide significance. As a comparison, stGWAS of $\mathrm{FD}^{\mathrm{basal}}_{\mathrm{max}}$ and $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$ only discovered the association on chromosome 2 (with response variable $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$; figure B.9), demonstrating the power of the multi-trait approach. A summary of the two loci that reach genome-wide significance is shown in table 8.2 and figure 8.6. The locus on chromosome 2 lies within an intron of a long intergenic noncoding RNA (lincRNA) of unknown function (figure 8.6, upper panel). The second associated locus is positioned in intron 24 of the ADAMTSL1 gene (figure 8.6, lower panel). ADAMTSL1 is also known as Punctin, and two variants in or near the gene, one intronic and one intergenic (rs7869627: intron 17; rs1411242: intergenic between SH3GL2 and ADAMTSL1), have been found associated with blood pressure phenotypes [Sabatti & al., 2009]. rs7855681 is in weak LD with rs7869627 ($r^2 = 0.119$).

Figure 8.4: Manhattan plot of multi-trait GWAS on left ventricular trabeculation. The maximal apical and basal FD were modelled jointly in an any effect mtGWAS. The p-values of all genome-wide SNPs are depicted. The horizontal grey line is drawn at the level of genome-wide significance: $p = 5 \times 10^{-8}$.

Table 8.2: SNPs with strongest association in left ventricular trabeculation GWAS. For each locus, the p-values for SNPs in LD with an $r^2 > 0.8$ in a 50 kb window were compared and only the SNP with the smallest p-value per locus is listed below. M allele: major allele, m allele: minor allele, MAF: minor allele frequency.

SNP         Chr   Position     P-value                  M/m allele   MAF
rs7603133   2     3,103,708    $3.23 \times 10^{-8}$    A/G          0.09
rs7855681   9     18,855,498   $3.46 \times 10^{-8}$    A/C          0.32

Punctin is a secreted glycoprotein that can be detected in contacts between cells and components of the extra-cellular matrix, but that has not been observed in cell-cell contacts [Hirohata & al., 2002].
It is part of the ADAMTS-like protein family, whose members lack the proteolytic activity of the metalloprotease protein family they are named after. While other proteins of the ADAMTS-like family have been shown to be associated with connective tissue disorders and to affect the formation of the extra-cellular matrix [Ahram & al., 2009; Hubmacher & Apte, 2015], the function of punctin remains unknown. However, progress has been made in understanding the regulation of its secretion through post-translational modification of its tryptophan 42 residue [Wang & al., 2009]. A recently published study shows a strong systemic phenotype for the mutation of this tryptophan residue, which inhibits the secretion of the protein. However, no further advances in understanding the mechanisms or finding binding partners of ADAMTSL1 could be made [Hendee & al., 2017].

Figure 8.5: Quantile-quantile plot of multi-trait GWAS on left ventricular trabeculation. The observed genome-wide p-values of the multi-trait FD GWAS are plotted against equally spaced values in [0, 1] of the same sample size (expected p-values). The diagonal line starts at the origin and has slope one.

Figure 8.6: Genomic context of loci associated with left ventricular trabeculation. The p-values and genomic location of the two loci reaching genome-wide significance are shown in relation to the p-values of surrounding genotypic markers. Markers are coloured by the level of LD they share with the SNP of interest. For both loci, all associated SNPs were imputed. For the locus on chromosome 9 (lower panel), an additional SNP which was directly genotyped but did not pass the genome-wide significance level is marked in red. Generated with LocusZoom [Pruim & al., 2010].

The locus on chromosome 1 (SNP: rs113719231) discovered in section 7.3 is located near the PRDM16 gene, which has been associated with LVNC [Arndt & al., 2013].
A linear model with the rs113719231 genotypes, sex, age, height and weight as explanatory variables and $\mathrm{FD}^{\mathrm{basal}}_{\mathrm{max}}$ or $\mathrm{FD}^{\mathrm{apical}}_{\mathrm{max}}$ as response variable did not show any association, even without the burden of the genome-wide significance threshold ($p = 0.78$ and $p = 0.77$, respectively).

The clinical phenotype of left ventricular non-compaction has been found associated with a number of other cardiac and cardiovascular phenotypes such as conduction abnormalities [Yousef & al., 2009], arrhythmias [Ritter & al., 1997; Oechslin & al., 2000; Yousef & al., 2009], coronary artery disease [Ritter & al., 1997; Junga & al., 1999; Jenni & al., 2002; Soler & al., 2002] and myocardial infarction [Swinkels & al., 2007; Toufan & al., 2012; Güvenç & al., 2012]. In addition, a study on population variation of left ventricular trabeculation found associations between the increase in left ventricular trabeculation and prevalence of hypertension, left ventricular mass and wall thickness [Captur & al., 2015]. For the majority of these phenotypes, original GWAS and meta-analyses of GWAS have been conducted, including atrial fibrillation [Gudbjartsson & al., 2007; Christophersen & al., 2017], atrioventricular conduction [Denny & al., 2010], coronary heart disease [Schunkert & al., 2011; Lee & al., 2013; Nikpay & al., 2015], myocardial infarction [Kathiresan & al., 2009; Hirokawa & al., 2015; Nikpay & al., 2015; Dehghan & al., 2016] and blood pressure phenotypes [Ehret & al., 2011; Wain & al., 2011]. For studies where the summary statistics of the genome-wide associations were made publicly available (blood pressure phenotypes [Ehret & al., 2011; Wain & al., 2011], coronary artery disease [Schunkert & al., 2011] and myocardial infarction [Nikpay & al., 2015]), I collected the effect size estimates (continuous traits) and odds ratios (case-control setting) for the associated loci on chromosomes 2 and 9.
The SNP with the highest association on chromosome 9 (rs7855681) was contained in all available studies. For the locus on chromosome 2, the SNP with the highest association was not contained in any of the studies; however, rs6758505, which is in strong LD with the discovered SNP in Europeans ($r^2 = 1$), was found in one of the studies. Figure 8.7 depicts the effect size estimates and odds ratios for both SNPs estimated for different blood pressure measurements, coronary artery disease and myocardial infarction. For all phenotypes, the confidence intervals of the effect size and odds ratio estimates contain zero and one, respectively, and thus provide no evidence for an effect of the SNPs on these phenotypes. A database search of the GWAS catalogue [MacArthur & al., 2017] for associated SNPs and SNPs in LD (based on entries in the GWAS catalogue, 07.08.2018) and the Global Biobank engine [GBE, 2017] did not yield associations with any other phenotype.

8.6. Summary

In this chapter, I used phenotypes derived from a guided feature extraction method to map naturally occurring genetic variation in healthy individuals to a clinically relevant phenotype. The association analysis of the FD phenotypes as a quantification of left ventricular trabeculation detected two loci at the genome-wide significance level. Both loci lie in intronic regions and have no direct protein-coding consequences. Loci in proximity to the association detected within the ADAMTSL1 gene have been implicated in cardiac phenotypes such as blood pressure. However, the absence of any effect for this locus in well-powered published GWAS of blood pressure phenotypes points towards a blood pressure-independent effect on left ventricular trabeculation.

For quantitative, continuous phenotypes and additive genotype effects, understanding naturally occurring variation can give insights into the genetic architecture of the traits and might help to understand more extreme disease phenotypes.
In order to extend this study and confirm the results in a larger cohort, we applied for access to the UK Biobank, a “large, population-based prospective study, established to allow detailed investigations of the genetic and non-genetic determinants of the diseases of middle and old age” [Sudlow & al., 2015]. Within this project, 500,000 individuals have been genotyped and phenotyped for a wide array of traits, including 2D cardiac magnetic resonance imaging scans on an expected 100,000 individuals. In contrast to the 3D heart phenotypes investigated in chapter 7, the FD phenotypes can be extracted automatically from these images. Upon access to the data, phenotype extraction and a mtGWAS with the same model and parameters as described in this chapter will be conducted.

[Figure 8.7: panel A, effect size estimates (x-axis, -0.4 to 0.4) for diastolic BP, mean arterial BP, pulse pressure and systolic BP (ICBP); panel B, odds ratios (x-axis, 0.95 to 1.05) for coronary artery disease (CARDIoGRAM, CARDIoGRAMplusC4D); loci on chromosome 2 and chromosome 9.]

Figure 8.7: Effect estimates of associated FD SNPs with other cardiovascular phenotypes. Effect size estimates and odds ratios for the SNPs associated with FD were derived from previously published studies on blood pressure (BP) phenotypes and risk for coronary artery disease and myocardial infarction. The diamond indicates the effect estimate, the error bars its confidence interval. The size of the diamond represents the sample size of the study and is normalised to the largest study size (pulse pressure: $N = 71{,}663$). All studies were conducted as meta-analyses in the scope of large consortia (faceting labels). The dashed vertical line indicates the value of no effect.
In addition to this replication study, investigating the genetic variation driving the healthy phenotype differences in individuals of different ethnicities [Kawel & al., 2012; Captur & al., 2014] will be of great interest. While the cohort in this study only contained a minority of non-European samples, a more diverse cohort structure might be observed in the UK Biobank cohort, enabling this analysis.

9 Concluding remarks

Initially, GWAS used seemingly simple case-control designs to map genotypes to a variety of disease phenotypes. In subsequent years, existing models were adopted for application in GWAS [Korte & al., 2012] and novel techniques were developed, enabling the analysis of cohorts with complex structure [Yu & al., 2006; Kang & al., 2010], the effect estimation of sets of genotypes [Wu & al., 2010; Casale & al., 2015] or of gene-environment interactions in the context of GWAS [Casale & al., 2017]. While sophisticated methods for the analysis of multiple traits existed [Korte & al., 2012; Zhou & Stephens, 2012; Casale & al., 2015], they were mainly limited to moderate trait numbers due to their computational complexity. LiMMBo (chapter 4) fills this gap by enabling the joint analysis of hundreds of phenotypes. Its performance on simulated data demonstrated its power even when only a moderate number of the observed phenotypes is governed by the same genetic factors. The application to a dataset of yeast growth traits not only showed its usefulness on real data, but also demonstrated its value for investigating and generating biologically relevant hypotheses such as pleiotropy of traits and complex trait structures.
I provide the phenotype simulation framework (chapter 3) and LiMMBo as open-source software packages: PhenotypeSimulator (chapter 3) is accessible via the Comprehensive R Archive Network [Meyer, 2017] and LiMMBo is implemented as a Python module which can be used in combination with the publicly available LIMIX suite for flexible linear mixed model designs [Lippert & al., 2014].

For very high-dimensional datasets, one is often interested in applying a dimensionality reduction method to the data a priori, to extract the information relevant for the biological question of interest. In the biological literature, PCA is the standard choice for this task [Avery & al., 2011; Liu & al., 2012; Zhang & al., 2012]. However, there exists a growing number of dimensionality reduction techniques based on different statistical methods and assumptions about the hidden data structures. Twelve of these publicly available dimensionality reduction techniques were explored for their ability to find a robust representation of the input data (chapter 6). I used PhenotypeSimulator to generate datasets of different sizes and underlying structures and introduced stability as a new measure to determine the dimensions of a robust low-dimensional representation. I was able to show that dimensionality reduction techniques are valuable for genotype-phenotype mapping studies of very high-dimensional datasets, as the simulated genetic effects could be discovered in genetic association studies with the stable low-dimensional representations as phenotypes.

I directly applied these insights to a clinically interesting dataset of spatially-resolved three-dimensional human heart phenotypes. Based on the hypothesis that there are genetic factors that influence the heart morphology in a spatially-confined manner, I extracted low-dimensional representations of the left-ventricular wall thickness measurements and used these in a genome-wide association study.
The associated SNPs not only showed a regionally confined effect but have also been implicated in cardiac phenotypes in model organisms. While further studies are needed to confirm these findings, the results demonstrate the power of this approach to investigate biologically and clinically relevant questions.

In the feature extraction approach used for this GWAS, I combined the stable low-dimensional representations from a variety of different dimensionality reduction approaches, with the underlying hypothesis that different methods capture different aspects of the morphology and that a combination of the methods will yield a comprehensive representation. Alternatively, models which are more tailored to the specific structure of the dataset could be employed. The spatially-resolved heart wall thickness measurements in this study are part of a larger class of data structures, where measurements on a two-dimensional surface are embedded in a three-dimensional space. Similar data has been observed for 3D structural MRI or 4D functional MRI studies in the brain [Van Essen & al., 2012; Glasser & al., 2013]. Novel feature extraction methods for neuroscience data can take a priori knowledge about the spatial correlation of the input data into account. For instance, functional PCA combines approaches from PCA and DRR and incorporates additional sparsity priors into the model, which act on the underlying three-dimensional model of the data [Lila & al., 2016]. Similar extensions could be envisaged for the Bayesian factor analysis model PEER [Stegle & al., 2012], where the spatial coordinates could be built into the model as priors.

In addition to the wall thickness measurements, the phenotyping approach developed by my collaborators also provides spatially-resolved measurements of heart wall curvature and fractional wall thickness, i.e. wall thickness changes between diastole and systole.
In molecular phenotyping of different tissues or conditions, the simple, albeit high-dimensional genotype-phenotype mapping is extended from the two-dimensional “sample by phenotype” space into the higher-dimensional “sample by phenotype by condition/tissue/etc.” space. Novel methods have been developed for the task of jointly analysing such datasets [Hore & al., 2016]. These approaches could be applied to extend this study and find stable phenotype components representing a more comprehensive cardiac phenotype based on wall thickness, curvature or fractional wall thickening.

In a second genetic association study with heart morphology, I discovered SNP associations with a trabeculation phenotype from a supervised feature extraction approach on the raw MRI data. The implicated SNPs are located in proximity to a gene important in the developmental process of this trabeculation, and follow-up studies are underway to confirm these results.

Improved diagnosis and interventional strategies in the past two decades have contributed to the general improvements in fighting cardiovascular diseases. While these improvements were mainly based on large-scale clinical trials, there is now a call for more personalised approaches to further improve the management of cardiovascular diseases [Meder & al., 2016]. The proposed strategies ask for a stronger interaction between clinical, molecular and statistical expertise to enhance the characterisation of these diseases. Studies such as the GWAS on cardiac morphology show the feasibility of this proposal, with a strong collaboration between clinical and bioinformatics expertise to investigate the genetic basis of cardiac phenotypes. Follow-up studies and further exploration of the data as outlined above can contribute to further characterising the genetics of cardiac structure and function.

Appendix

A Supplementary tables

A.1.
Additional information chapter 2

Table A.1: GWAS catalogue trait descriptions relating to cardiovascular diseases. Out of the 4,148 studies in the GWAS catalogue (accessed 11.08.2017), 159 contain phenotype descriptions related to cardiovascular diseases. For a summary of the studies conducted, they were broadly summarised into eight groups (Summary name). A graphical overview is shown in figure 2.3.

Summary name; GWAS catalogue trait:
Congenital heart disease; Congenital heart disease; Congenital left-sided heart lesions (maternal effect); Congenital left-sided heart lesions; Conotruncal heart defects; Coronary heart disease; Coronary heart disease; Myocardial infarction; Myocardial infarction (early onset); Coronary artery disease; Coronary heart disease event reduction in response to statin therapy (interaction); Coronary restenosis; Myocardial infarction in coronary artery disease; Blood pressure; Hypertension; Systolic blood pressure; Diastolic blood pressure; Hypertension (young onset); Systolic blood pressure in sickle cell anemia; Blood pressure (smoking interaction); Blood pressure measurement (cold pressor test); Blood pressure; Blood pressure measurement (high sodium and potassium intervention); Blood pressure measurement (low sodium intervention); Blood pressure measurement (high sodium intervention); Systolic blood pressure (alcohol consumption interaction); Diastolic blood pressure (alcohol consumption interaction); Mean arterial pressure (alcohol consumption interaction); Pulse pressure (alcohol consumption interaction); Pulse pressure in young-onset hypertension; Blood pressure (anthropometric measures interaction); Blood pressure (age interaction); Ejection fraction in Trypanosoma cruzi seropositivity; Electrocardiographic traits; Atrial fibrillation; Echocardiographic traits; Atrial fibrillation/atrial flutter; QT interval; Electrocardiographic conduction measures; Atrioventricular conduction; QRS duration; Cardiac repolarization; QT interval (interaction); P wave duration; PR segment; PR interval in Trypanosoma cruzi seropositivity; QT interval in Trypanosoma cruzi seropositivity; QRS duration in Trypanosoma cruzi seropositivity; Heart rate variability traits; PR interval; Resting heart rate; RR interval (heart rate); Left ventricular mass; Cardiac structure and function; Cardiac muscle measurement; Morphological traits; Cardiac hypertrophy; Dilated cardiomyopathy; Chagas cardiomyopathy in Trypanosoma cruzi seropositivity; Heart failure; Heart failure; Sudden cardiac arrest; Mortality in heart failure; Others; Cardiac Troponin-T levels; Cardiovascular disease risk factors.

A.2. Additional results chapter 7

Table A.2: Number of SNPs after imputation, imputation QC and filtering for deviation from HWE and low MAF. Every batch was imputed independently (columns “SNPs after imputation”). SNPs that had an IMPUTE2 “info” metric of > 0.4 in all of the batches were combined and subsequently filtered to remove SNPs deviating from Hardy-Weinberg equilibrium ($p < 0.001$) and SNPs with low MAF (< 0.008), corresponding to a minor allele count of less than 20.
         SNPs after imputation
Chr      Sanger12     Duke-NUS12   Duke-NUS3    INFO > 0.4   HWE and MAF
1        3,196,692    3,197,145    3,196,563    1,251,157    719,882
2        3,515,670    3,515,861    3,515,602    1,360,182    780,152
3        2,941,265    2,941,468    2,941,223    1,156,243    665,038
4        2,900,679    2,900,786    2,900,634    1,154,742    684,602
5        2,688,219    2,688,348    2,688,174    1,049,671    606,951
6        2,581,500    2,581,851    2,581,410    1,058,844    635,257
7        2,359,370    2,359,598    2,359,319    932,726      551,744
8        2,323,181    2,323,290    2,323,144    890,407      514,803
9        1,752,242    1,752,363    1,752,199    698,510      398,777
10       2,003,743    2,003,881    2,003,694    812,616      474,686
11       2,013,331    2,013,535    2,013,273    794,587      481,479
12       1,947,915    1,948,107    1,947,865    767,854      452,193
13       1,458,325    1,458,401    1,458,308    590,863      348,525
14       1,333,919    1,333,973    1,333,901    524,391      309,825
15       1,194,294    1,194,406    1,194,264    458,617      266,813
16       1,289,127    1,289,335    1,289,074    497,688      286,620
17       1,118,587    1,118,772    1,118,528    434,724      252,227
18       1,153,963    1,154,034    1,153,942    457,454      268,986
19       877,689      877,866      877,645      361,419      222,264
20       912,602      912,721      912,574      357,156      210,128
21       546,390      546,414      546,381      216,911      131,079
22       531,437      531,528      531,416      215,547      129,771
genome   42,989,377   42,993,178   42,988,308   16,042,309   9,391,802

B Supplementary Figures

B.1. Additional results chapter 4

[Figure B.1: grid of panels by h2 (0.2, 0.5, 0.8) and by a second simulation parameter (0.04, 0.1, 0.2, 0.4, 0.6, 0.8, 1); x-axis: number of traits (10, 50, 100); y-axis: % detected true SNPs; models: lmm_mt, lmm_st.]

Figure B.1: All parameter combinations of power comparison for multivariate and univariate LMMs of high-dimensional phenotypes. Each panel shows the influence of two simulation parameters on the power to detect the causal SNPs.

B.2. Additional results chapter 5

Figure B.2: Manhattan plot of traits with strong single-trait associations. Single-trait GWAS of A. magnesium sulfate and B. hydroquinone.
The loci marked with a grey star are only found for these two traits and cannot be detected in the mtGWAS (figure 5.6), pointing to a purely single-trait association that is burdened by the multi-trait testing based on 41 degrees of freedom. The p-values were adjusted for multiple testing by the effective number of tests ($M_{\mathrm{eff}} = 33$). The significance line is drawn at the empirical $\mathrm{FDR}_{\mathrm{stGWAS}} = 8.6 \times 10^{-6}$.

B.3. Additional results chapter 6

[Figure B.3: panels A. Isomap, B. DRR, C. MDS, D. nMDS.]

Figure B.3: Additional scatterplots for visual assessment of low-dimensional components derived from left-ventricular wall thickness. Pairwise scatter plots of the components (lower triangle) and density plots (upper triangle) are depicted. The diagonal of each plot shows the distribution of the respective component. Row and column labels specify the rank of the component out of the 100 low-dimensional components. Before plotting, each component was mean-centred and divided by its standard deviation in order to have comparable axis dimensions. Given the normalised scale of the data and the purpose of qualitative comparison, axis ticks were omitted for a cleaner visualisation.

B.4. Additional results chapter 7

Figure B.4: Number of DNA probes on the different genotyping chips and their overlap. For the genotyping of the individuals in the Digital Heart project, three different Illumina HumanOmniExpress genotyping chips were used (24v1-1_A, 12v1-1_A, 24v1-0), differing in the number of probes on the chip (numbers inside Venn diagram) and the number of samples that can be genotyped (12 and 24; indicated in name of chip).
[Figure B.5: for each batch (A to C), three panels: sex check (chrX inbreeding by reported sex, male or female), heterozygosity by missingness (heterozygosity against proportion of missing SNPs) and estimated IBD (number of pairs against estimated pairwise IBD).]

Figure B.5: Genotyping quality control per sample. A. Sanger12. B. Duke-NUS12. C. Duke-NUS3. Supplementary plots for genotyping QC described in section 7.1.1.

[Figure B.6: for each batch (A to C), number of SNPs by minor allele frequency and by HWE p-value (-log(p-value)).]

Figure B.6: Genotyping quality control per SNP. A. Sanger12. B. Duke-NUS12. C. Duke-NUS3. Supplementary plots for genotyping QC described in section 7.1.1.

[Figure B.7: panels A to C, each with the ancestry regions 1 to 5 marked.]

Figure B.7: Ethnicity of samples within the Digital Heart project. A. Sanger12. B. Duke-NUS12. C. Duke-NUS3. PCA was conducted on the SNP genotypes of the samples within the Digital Heart project (gencall) and genotypes of four greater ethnicities of the HapMap project (black: African, orange: Mexican/Native American, grey: European, yellow: Asian) [The International HapMap Consortium, 2005; The International HapMap Consortium, 2007]. The clustering of the samples based on the first and second PCs is depicted. Red dotted lines indicate borders considered to separate ancestries: 1. European, 2. African, 3. Mexican/Native American, 4. Asian, 5. Mixed ancestry. Gencall samples within the first group were used in chapters 7 and 8.
The analysis is described in section 7.1.1.

Figure B.8: Manhattan plots for GWAS on stable components from a single dimensionality reduction method. The five stable components derived from Laplacian Eigenmaps (A), four from Isomap (B) and ten from PCA (C) were used as the response variables in three independent any effect mtGWAS. Their p-values were adjusted for the effective number of tests conducted, estimated via equation (5.4) based on the correlation across their components (figure 7.7): $M_{\mathrm{eff}} = 2.04$. The horizontal grey line is drawn at the level of genome-wide significance: $p = 5 \times 10^{-8}$. Only the locus on chromosome 1, which was detected in the combined analyses (figure 7.8), could also be detected via components from Laplacian Eigenmaps alone.

B.5. Additional results chapter 8

Figure B.9: Manhattan plot of two single-trait GWAS on left ventricular trabeculation. The maximal apical (A) and basal FD (B) were used as the response variable in a stGWAS. Their p-values were adjusted for the effective number of tests conducted, estimated via equation (5.4) based on their correlation: $M_{\mathrm{eff}} = 1.86$. The p-values of all genome-wide SNPs are depicted. The horizontal grey line is drawn at the level of genome-wide significance: $p = 5 \times 10^{-8}$.

C Derivations

The following section describes the derivation of the simulation scheme for the infinitesimal genetic effects in section 3.2. A suitable model for simulating the infinitesimal genetic effect $\mathbf{G} \in \mathcal{R}^{N, P}$ with known $N \times N$ sample (row) covariance is a matrix-normally distributed random variable, defined by its mean $\mathbf{M} \in \mathcal{R}^{N, P}$, its row covariance $\mathbf{D} \in \mathcal{R}^{N, N}$ and its column covariance $\mathbf{C} \in \mathcal{R}^{P, P}$:

$$\mathbf{G} \sim \mathcal{MN}_{N,P}(\mathbf{M}, \mathbf{D}, \mathbf{C}). \tag{C.1}$$

With the $N \times N$ sample-by-sample covariance captured in $\mathbf{R}$ and $\mathbf{M} = \mathbf{0}$, the component of $\mathbf{G}$ which has to be simulated is the trait-by-trait covariance $\mathbf{C}$:

$$\mathbf{G} \sim \mathcal{MN}_{N,P}(\mathbf{0}, \mathbf{R}, \mathbf{C}). \tag{C.2}$$

The structure of $\mathbf{C}$ depends on the design of the covariance effect.
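As a numerical companion to this derivation, the following sketch draws $\mathbf{G}$ as $\mathbf{B}\mathbf{Y}\mathbf{A}^T$, the form the derivation arrives at in equation (C.8), and checks on a toy example that this agrees with $(\mathbf{A} \otimes \mathbf{B})\mathrm{vec}(\mathbf{Y})$ and that the implied covariance equals $\mathbf{C} \otimes \mathbf{R}$. It is an illustration only and not the PhenotypeSimulator implementation: the matrices R and C, the helper functions and the seed are assumptions chosen for the example.

```python
import random


def cholesky(m):
    """Lower-triangular L with L L^T = m (m symmetric positive definite)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = (m[i][i] - s) ** 0.5
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def vec(a):
    """Stack the columns of a into one list."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]


def kron(a, b):
    """Kronecker product of a and b."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


# Hypothetical covariances: N = 3 samples (R), P = 2 traits (C).
R = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
C = [[1.0, 0.3], [0.3, 2.0]]
B, A = cholesky(R), cholesky(C)

# Y holds independent standard normal entries (N x P).
rng = random.Random(0)
Y = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(3)]

# Simulated effect G = B Y A^T and the equivalent vectorised form.
G = matmul(matmul(B, Y), transpose(A))
AB = kron(A, B)
lhs = [row[0] for row in matmul(AB, [[v] for v in vec(Y)])]
max_diff = max(abs(x - y) for x, y in zip(lhs, vec(G)))

# Covariance implied for vec(G): (A kron B)(A kron B)^T versus C kron R.
implied = matmul(AB, transpose(AB))
target = kron(C, R)
cov_diff = max(abs(implied[i][j] - target[i][j])
               for i in range(6) for j in range(6))
```

Both differences are zero up to floating-point error, so drawing $\mathbf{Y}$ and forming $\mathbf{B}\mathbf{Y}\mathbf{A}^T$ avoids ever building the $NP \times NP$ Kronecker product in practice.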
In order to simulate $\mathbf{C}$, $\mathbf{G}$ is first expressed in terms of a multivariate normal distribution

$$\mathrm{vec}(\mathbf{G}) \sim \mathcal{N}_{N \times P}(\mathbf{0}, \mathbf{C} \otimes \mathbf{R}). \tag{C.3}$$

With the Cholesky decomposition of $\mathbf{R}$ and $\mathbf{C}$ into $\mathbf{R} = \mathbf{B}\mathbf{B}^T$ and $\mathbf{C} = \mathbf{A}\mathbf{A}^T$,

$$\mathrm{vec}(\mathbf{G}) \sim \mathcal{N}_{N \times P}(\mathbf{0}, \mathbf{A}\mathbf{A}^T \otimes \mathbf{B}\mathbf{B}^T), \tag{C.4}$$

which can be rearranged as

$$\mathrm{vec}(\mathbf{G}) \sim \mathcal{N}_{N \times P}(\mathbf{0}, (\mathbf{A} \otimes \mathbf{B})\,\mathbf{I}\,(\mathbf{A} \otimes \mathbf{B})^T). \tag{C.5}$$

$\mathbf{I}$ is the identity matrix. Using the property of a normally distributed random variable $\mathbf{Y}$ with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$

$$w\mathbf{Y} \sim \mathcal{N}(w\boldsymbol{\mu}, w\boldsymbol{\Sigma}w^T), \tag{C.6}$$

we can let $\mathrm{vec}(\mathbf{G}) = (\mathbf{A} \otimes \mathbf{B})\mathrm{vec}(\mathbf{Y})$ and $\mathrm{vec}(\mathbf{Y}) \sim \mathcal{N}_{N \times P}(\mathbf{0}, \mathbf{I})$ such that

$$(\mathbf{A} \otimes \mathbf{B})\mathrm{vec}(\mathbf{Y}) \sim \mathcal{N}_{N \times P}(\mathbf{0}, (\mathbf{A} \otimes \mathbf{B})\,\mathbf{I}\,(\mathbf{A} \otimes \mathbf{B})^T). \tag{C.7}$$

Using [Horn & Johnson, 1985]: Lemma 4.3.1, we get

$$(\mathbf{A} \otimes \mathbf{B})\mathrm{vec}(\mathbf{Y}) = \mathrm{vec}(\mathbf{B}\mathbf{Y}\mathbf{A}^T) = \mathrm{vec}(\mathbf{G}). \tag{C.8}$$

References

1000 Genomes Project Consortium (2011) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073 (cit. on p. 36).

1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422): 56–65 (cit. on pp. 36, 82).

1000 Genomes Project Consortium (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526: 75–81 (cit. on pp. 36, 47, 73, 157, 159).

Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., Amanatides, P. G., Scherer, S. E., Li, P. W., Hoskins, R. A., Galle, R. F., George, R. A., Lewis, S. E., Richards, S., Ashburner, M., Henderson, S. N., Sutton, G. G., Wortman, J. R., Yandell, M. D., Zhang, Q., Chen, L. X., & al. (2000) The genome sequence of Drosophila melanogaster. Science 287(5461): 2185–95 (cit. on p. 33).

Adhikari, K., Fontanil, T., Cal, S., Mendoza-Revilla, J., Fuentes-Guajardo, M., Chacón-Duque, J.-C., Al-Saadi, F., Johansson, J. A., Quinto-Sanchez, M., Acuña-Alonzo, V., Jaramillo, C., Arias, W., Lozano, R. B., Pérez, G. M., Gómez-Valdés, J., Villamil-Ramírez, H., Hunemeier, T., Ramallo, V., Silva De Cerqueira, C. C., Hurtado, M., & al.
(2016) A genome-wide association scan in admixed Latin Americans identifies loci influencing facial and scalp hair features. Nature Communications 7: 1–12 (cit. on p. 38).
Agopian, A. J., Mitchell, L. E., Glessner, J., Bhalla, A. D., Sewda, A., Hakonarson, H., & Goldmuntz, E. (2014) Genome-Wide Association Study of Maternal and Inherited Loci for Conotruncal Heart Defects. PLoS ONE 9(5): e96057 (cit. on p. 69).
Ahram, D., Sato, T. S., Kohilan, A., Tayeh, M., Chen, S., Leal, S., Al-Salem, M., & El-Shanti, H. (2009) A homozygous mutation in ADAMTSL4 causes autosomal-recessive isolated ectopia lentis. The American Journal of Human Genetics 84(2): 274–8 (cit. on p. 189).
Eu-Ahsunthornwattana, J., Miller, N. E., Fakiola, M., Wellcome Trust Case Control Consortium, Jeronimo, S. M. B., Blackwell, J. M., & Cordell, H. J. (2014) Comparison of Methods to Account for Relatedness in Genome-Wide Association Studies with Family-Based Data. PLoS Genetics 10(7): e1004445 (cit. on pp. 52, 54).
Aird, I., Bentall, H. H., Mehigan, J. A., & Roberts, J. A. F. (1954) The blood groups in relation to peptic ulceration and carcinoma of colon, rectum, breast, and bronchus; an association between the ABO groups and peptic ulceration. British Medical Journal 2(4883): 315–21 (cit. on p. 34).
Aird, I., Bentall, H. H., & Roberts, J. A. F. (1953) A relationship between cancer of stomach and the ABO blood groups. British Medical Journal 1(4814): 799–801 (cit. on p. 34).
Aken, B. L., Ayling, S., Barrell, D., Clarke, L., Curwen, V., Fairley, S., Fernandez Banet, J., Billis, K., García Girón, C., Hourlier, T., Howe, K., Kähäri, A., Kokocinski, F., Martin, F. J., Murphy, D. N., Nag, R., Ruffier, M., Schuster, M., Tang, Y. A., Vogel, J.-H., & al. (2016) The Ensembl gene annotation system. Database 2016 (cit. on pp. 119, 171).
Allen, G. E. (1968) Thomas Hunt Morgan and the Problem of Natural Selection. Journal of the History of Biology 1: 113–139 (cit. on p. 29).
Anderson, C. A., Pettersson, F.
H., Clarke, G. M., Cardon, L. R., Morris, A. P., & Zondervan, K. T. (2010) Data quality control in genetic case-control association studies. Nature Protocols 5(9): 1564–73 (cit. on pp. 158, 159).
Anderson, R. H., Yanni, J., Boyett, M. R., Chandler, N. J., & Dobrzynski, H. (2009) The Anatomy of the Cardiac Conduction System. Clinical Anatomy 22: 99–113 (cit. on p. 64).
Anderson, S. (1981) Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Research 9(13): 3015–27 (cit. on p. 32).
Anttila, V., Stefansson, H., Kallela, M., Todt, U., Terwindt, G. M., Calafato, M. S., Nyholt, D. R., Dimas, A. S., Freilinger, T., Müller-Myhsok, B., Artto, V., Inouye, M., Alakurtti, K., Kaunisto, M. A., Hämäläinen, E., de Vries, B., Stam, A. H., Weller, C. M., Heinze, A., Heinze-Kuhn, K., & al. (2010) Genome-wide association study of migraine implicates a common susceptibility variant on 8q22.1. Nature Genetics 42(10): 869–873 (cit. on p. 43).
Arndt, A.-K., Schafer, S., Drenckhahn, J.-D., Sabeh, M. K., Plovie, E. R., Caliebe, A., Klopocki, E., Musso, G., Werdich, A. A., Kalwa, H., Heinig, M., Padera, R. F., Wassilew, K., Bluhm, J., Harnack, C., Martitz, J., Barton, P. J., Greutmann, M., Berger, F., Hubner, N., & al. (2013) Fine Mapping of the 1p36 Deletion Syndrome Identifies Mutation of PRDM16 as a Cause of Cardiomyopathy. The American Journal of Human Genetics 93(1): 67–77 (cit. on pp. 171, 189).
Arnett, D. K., Meyers, K. J., Devereux, R. B., Tiwari, H. K., Gu, C. C., Vaughan, L. K., Perry, R. T., Patki, A., Claas, S. A., Sun, Y. V., Broeckel, U., & Kardia, S. L. (2011) Genetic Variation in NCAM1 Contributes to Left Ventricular Wall Thickness in Hypertensive Families. Circulation Research 108(3): 279–283 (cit. on p. 69).
Arnett, D. K., Li, N., Tang, W., Rao, D. C., Devereux, R. B., Claas, S. A., Kraemer, R., & Broeckel, U.
(2009) Genome-wide association study identifies single-nucleotide polymorphism in KCNB1 associated with left ventricular mass in humans: the HyperGEN Study. BMC Medical Genetics 10: 43 (cit. on p. 155).
Aschard, H., Vilhjálmsson, B. J., Greliche, N., Morange, P.-E., Trégouët, D.-A., & Kraft, P. (2014) Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. The American Journal of Human Genetics 94(5): 662–76 (cit. on pp. 54, 146).
Astle, W. J., Elding, H., Jiang, T., Allen, D., Ruklisa, D., Mann, A. L., Mead, D., Bouman, H., Riveros-Mckay, F., Kostadima, M. A., Lambourne, J. J., Sivapalaratnam, S., Downes, K., Kundu, K., Bomba, L., Berentsen, K., Bradley, J. R., Daugherty, L. C., Delaneau, O., Freson, K., & al. (2016) The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease. Cell 167(5): 1415–1429.e19 (cit. on p. 147).
Astle, W. & Balding, D. J. (2009) Population Structure and Cryptic Relatedness in Genetic Association Studies. Statistical Science 24(4): 451–471 (cit. on p. 54).
Atwell, S., Huang, Y. S., Vilhjálmsson, B. J., Willems, G., Horton, M., Li, Y., Meng, D., Platt, A., Tarone, A. M., Hu, T. T., Jiang, R., Muliyati, N. W., Zhang, X., Amer, M. A., Baxter, I., Brachi, B., Chory, J., Dean, C., Debieu, M., de Meaux, J., & al. (2010) Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465(7298): 627–31 (cit. on pp. 54, 104, 105).
Avery, C. L., He, Q., North, K. E., Ambite, J. L., Boerwinkle, E., Fornage, M., Hindorff, L. A., Kooperberg, C., Meigs, J. B., Pankow, J. S., Pendergrass, S. A., Psaty, B. M., Ritchie, M. D., Rotter, J. I., Taylor, K. D., Wilkens, L. R., Heiss, G., & Lin, D. Y. (2011) A Phenomics-Based Strategy Identifies Loci on APOC1, BRAP, and PLCG1 Associated with Metabolic Syndrome Phenotype Domains. PLoS Genetics 7(10): e1002322 (cit. on pp. 144, 194).
Avery, O. T., Macleod, C. M., & McCarty, M.
(1944) Studies on the chemical nature of the substance inducing transformation of Pneumococcal types. The Journal of Experimental Medicine 79(2): 137–58 (cit. on p. 31).
Babisch, W. (2014) Updated exposure-response relationship between road traffic noise and coronary heart diseases: A meta-analysis. Noise and Health 16(68): 1 (cit. on p. 67).
Bacanu, S.-A., Devlin, B., & Roeder, K. (2002) Association studies for quantitative traits in structured populations. Genetic Epidemiology 22(1): 78–93 (cit. on p. 48).
Balding, D. J. (2006) A tutorial on statistical methods for population association studies. Nature Reviews Genetics 7(10): 781–791 (cit. on pp. 44, 47).
Baron, M., Risch, N., Hamburger, R., Mandel, B., Kushner, S., Newman, M., Drumer, D., & Belmaker, R. H. (1987) Genetic linkage between X-chromosome markers and bipolar affective illness. Nature 326(6110): 289–292 (cit. on p. 34).
Bateson, W. (1902) Mendel’s Principles of Heredity: A Defense. Ed. by C. J. Clay and Sons. Cambridge, UK: Cambridge University Press: 1–212 (cit. on pp. 26, 29).
Bateson, W. (1909) Mendel’s Principles of Heredity. Cambridge: Cambridge University Press (cit. on p. 26).
Bateson, W., Saunders, E. R., & Punnett, R. C. (1905) Experimental studies in the physiology of heredity. Reports to the Evolution Committee of the Royal Society 2: 1–55, 80–99 (cit. on p. 28).
Beaujean, A. A. (2015) R Package: BaylorEdPsych (cit. on p. 110).
Belkin, M. & Niyogi, P. (2003) Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15: 1373–1396 (cit. on pp. 130, 134).
Benjamini, Y. & Hochberg, Y. (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57(1): 289–300 (cit. on pp. 46, 115).
Bernstein, F. (1930) Ueber die Erblichkeit der Blutgruppen. Zeitschrift für induktive Abstammungs- und Vererbungslehre 54: 400–426 (cit. on pp. 33, 34).
Betts, G.
J., Desaix, P., Johnson, E., Johnson, J. E., Korol, O., Kruse, D., Poe, B., Wise, J. A., Womble, M., & Young, K. A. (2013) Anatomy and Physiology. Houston: OpenStax: 1420 (cit. on pp. 61, 63).
Bhatnagar, A. (2004) Cardiovascular pathophysiology of environmental pollutants. American Journal of Physiology. Heart and Circulatory Physiology 286(2): H479–85 (cit. on p. 67).
Bickel, P. J. & Levina, E. (2008) Regularized estimation of large covariance matrices. The Annals of Statistics 36(1): 199–227 (cit. on p. 104).
Biffi, C., de Marvao, A., Attard, M. I., Dawes, T. J., Whiffin, N., Bai, W., Shi, W., Francis, C., Meyer, H., Buchan, R., Cook, S. A., Rueckert, D., & O’Regan, D. P. (2017) Three-dimensional Cardiovascular Imaging-Genetics: A Mass Univariate Framework. Bioinformatics (cit. on pp. 70, 157).
Bird, T. D. (1993) Are linkage studies boring? Nature Genetics 4(3): 213–214 (cit. on p. 34).
Bleyl, S. B., Mumford, B. R., Thompson, V., Carey, J. C., Pysher, T. J., Chin, T. K., & Ward, K. (1997) Neonatal, Lethal Noncompaction of the Left Ventricular Myocardium Is Allelic with Barth Syndrome. The American Journal of Human Genetics 61: 868–872 (cit. on p. 180).
Bloom, J. S., Ehrenreich, I. M., Loo, W. T., Lite, T.-L. V., & Kruglyak, L. (2013) Finding the sources of missing heritability in a yeast cross. Nature 494(7436): 234–7 (cit. on pp. 47, 54, 105, 108, 109).
Bolormaa, S., Pryce, J. E., Reverter, A., Zhang, Y., Barendse, W., Kemper, K., Tier, B., Savin, K., Hayes, B. J., & Goddard, M. E. (2014) A multi-trait, meta-analysis for detecting pleiotropic polymorphisms for stature, fatness and reproduction in beef cattle. PLoS Genetics 10(3): e1004198 (cit. on pp. 55, 104).
Bonne, G., Carrier, L., Bercovici, J., Cruaud, C., Richard, P., Hainque, B., Gautel, M., Labeit, S., James, M., Beckmann, J., Weissenbach, J., Vosberg, H.-P., Fiszman, M., Komajda, M., & Schwartz, K.
(1995) Cardiac myosin binding protein–C gene splice acceptor site mutation is associated with familial hypertrophic cardiomyopathy. Nature Genetics 11(4): 438–440 (cit. on p. 68).
Bonnet, C. (1779) Oeuvres d’histoire naturelle et de philosophie de Charles Bonnet ... Neuchatel: Chez S. Fauche: 1–444 (cit. on p. 24).
Botstein, D., White, R. L., Skolnick, M., & Davis, R. W. (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. The American Journal of Human Genetics 32(3): 314–31 (cit. on p. 32).
Bottolo, L., Chadeau-Hyam, M., Hastie, D. I., Zeller, T., Liquet, B., Newcombe, P., Yengo, L., Wild, P. S., Schillert, A., Ziegler, A., Nielsen, S. F., Butterworth, A. S., Ho, W. K., Castagné, R., Munzel, T., Tregouet, D., Falchi, M., Cambien, F., Nordestgaard, B. G., Fumeron, F., & al. (2013) GUESS-ing polygenic associations with multiple phenotypes using a GPU-based evolutionary stochastic search algorithm. PLoS Genetics 9(8): e1003657 (cit. on p. 38).
Boveri, T. (1902) Über mehrpolige Mitosen als Mittel zur Analyse des Zellkerns: 67–90 (cit. on p. 28).
Boyden, L. M., Choi, M., Choate, K. A., Nelson-Williams, C. J., Farhi, A., Toka, H. R., Tikhonova, I. R., Bjornson, R., Mane, S. M., Colussi, G., Lebel, M., Gordon, R. D., Semmekrot, B. A., Poujol, A., Välimäki, M. J., De Ferrari, M. E., Sanjad, S. A., Gutkin, M., Karet, F. E., Tucci, J. R., & al. (2012) Mutations in kelch-like 3 and cullin 3 cause hypertension and electrolyte abnormalities. Nature 482(7383): 98–102 (cit. on p. 68).
Brem, R. B., Yvert, G., Clinton, R., & Kruglyak, L. (2002) Genetic Dissection of Transcriptional Regulation in Budding Yeast. Science 296(5568) (cit. on pp. 47, 118).
Brem, R. B. & Kruglyak, L. (2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America 102(5): 1572–7 (cit. on p. 118).
Brent, R. P.
(1971) An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal 14(4): 422–425 (cit. on p. 59).
Brook, R. D., Rajagopalan, S., Pope, C. A., Brook, J. R., Bhatnagar, A., Diez-Roux, A. V., Holguin, F., Hong, Y., Luepker, R. V., Mittleman, M. A., Peters, A., Siscovick, D., Smith, S. C., Whitsel, L., Kaufman, J. D., & American Heart Association Council on Epidemiology and Prevention, Council on the Kidney in Cardiovascular Disease, and Council on Nutrition, Physical Activity and Metabolism (2010) Particulate Matter Air Pollution and Cardiovascular Disease: An Update to the Scientific Statement From the American Heart Association. Circulation 121(21): 2331–2378 (cit. on p. 67).
Brown, T. A. (2002) Genomes. 2nd. Wiley-Liss: 600 (cit. on p. 30).
Browning, S. R. (2008) Estimation of Pairwise Identity by Descent From Dense Genetic Marker Data in a Population Sample of Haplotypes. Genetics 178(4) (cit. on p. 54).
Browning, S. R. & Browning, B. L. (2010) High-resolution detection of identity by descent in unrelated individuals. The American Journal of Human Genetics 86(4): 526–39 (cit. on p. 52).
Browning, S. R., Browning, B. L., Todd, J., Clayton, D., Liu, G., Hubbell, E., Law, J., Berntsen, T., Chadha, M., Hui, H., & al. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics 81(5): 1084–97 (cit. on p. 37).
Broyden, C. G. (1965) A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation 19: 577–593 (cit. on pp. 51, 59).
Budde, B. S., Binner, P., Waldmüller, S., Höhne, W., Blankenfeldt, W., Hassfeld, S., Brömsen, J., Dermintzoglou, A., Wieczorek, M., May, E., Kirst, E., Selignow, C., Rackebrandt, K., Müller, M., Goody, R. S., Vosberg, H.-P., Nürnberg, P., & Scheffold, T.
(2007) Noncompaction of the Ventricular Myocardium Is Associated with a De Novo Mutation in the β-Myosin Heavy Chain Gene. PLoS ONE 2(12). Ed. by I. Schrijver: e1362 (cit. on p. 180).
Bulik-Sullivan, B. K., Loh, P.-R., Finucane, H. K., Ripke, S., Yang, J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson, N., Daly, M. J., Price, A. L., & Neale, B. M. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics 47(3): 291–295 (cit. on p. 48).
Bulmer, M. G. (2003) Francis Galton: Pioneer of heredity and biometry. Baltimore: Johns Hopkins University Press: 357 (cit. on p. 27).
Burton, P. R., Clayton, D. G., Cardon, L. R., Craddock, N., Deloukas, P., Duncanson, A., Kwiatkowski, D. P., McCarthy, M. I., Ouwehand, W. H., Samani, N. J., Todd, J. A., Donnelly, P., Barrett, J. C., Burton, P. R., Davison, D., Donnelly, P., Easton, D., Evans, D., Leung, H.-T., Marchini, J. L., & al. (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145): 661–678 (cit. on p. 37).
Bush, W. S. & Moore, J. H. (2012) Chapter 11: Genome-wide association studies. PLoS Computational Biology 8(12): e1002822 (cit. on p. 42).
Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995) A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific Computing 16(5): 1190–1208 (cit. on p. 91).
C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282(5396): 2012–8 (cit. on p. 33).
Cameron, J. R., Loh, E. Y., & Davis, R. W. (1979) Evidence for transposition of dispersed repetitive DNA families in yeast. Cell 16(4): 739–751 (cit. on p. 32).
Campbell, C. D., Ogburn, E. L., Lunetta, K. L., Lyon, H. N., Freedman, M. L., Groop, L. C., Altshuler, D., Ardlie, K. G., & Hirschhorn, J. N. (2005) Demonstrating stratification in a European American population.
Nature Genetics 37(8): 868–872 (cit. on p. 48).
Candille, S. I., Absher, D. M., Beleza, S., Bauchet, M., McEvoy, B., Garrison, N. A., Li, J. Z., Myers, R. M., Barsh, G. S., Tang, H., & Shriver, M. D. (2012) Genome-wide association studies of quantitatively measured skin, hair, and eye pigmentation in four European populations. PLoS ONE 7(10): e48294 (cit. on p. 38).
Cannavò, E., Koelling, N., Harnett, D., Garfield, D., Casale, F. P., Ciglar, L., Gustafson, H. E., Viales, R. R., Marco-Ferreres, R., Degner, J. F., Zhao, B., Stegle, O., Birney, E., & Furlong, E. E. M. (2016) Genetic variants regulating expression levels and isoform diversity during embryogenesis. Nature 541(7637): 402–406 (cit. on p. 90).
Captur, G., Lopes, L. R., Patel, V., Li, C., Bassett, P., Syrris, P., Sado, D. M., Maestrini, V., Mohun, T. J., McKenna, W. J., Muthurangu, V., Elliott, P. M., & Moon, J. C. (2014) Abnormal cardiac formation in hypertrophic cardiomyopathy: fractal analysis of trabeculae and preclinical gene expression. Circulation: Cardiovascular Genetics 7(3): 241–8 (cit. on pp. 180, 183, 185, 192).
Captur, G., Muthurangu, V., Cook, C., Flett, A. S., Wilson, R., Barison, A., Sado, D. M., Anderson, S., McKenna, W. J., Mohun, T. J., Elliott, P. M., & Moon, J. C. (2013) Quantification of left ventricular trabeculae using fractal analysis. Journal of Cardiovascular Magnetic Resonance 15(1): 36 (cit. on pp. 180–182).
Captur, G., Zemrak, F., Muthurangu, V., Petersen, S. E., Li, C., Bassett, P., Kawel-Boehm, N., McKenna, W. J., Elliott, P. M., Lima, J. A. C., Bluemke, D. A., & Moon, J. C. (2015) Fractal Analysis of Myocardial Trabeculations in 2547 Study Participants: Multi-Ethnic Study of Atherosclerosis. Radiology 277(3): 707–15 (cit. on pp. 180, 189).
Carrier, L., Hengstenberg, C., Beckmann, J.
S., Guicheney, P., Dufour, C., Bercovici, J., Dausse, E., Berebbi-Bertrand, I., Wisnewsky, C., Pulvenis, D., Fetler, L., Vignal, A., Weissenbach, J., Hillaire, D., Feingold, J., Bouhour, J.-B., Hagege, A., Desnos, M., Isnard, R., Dubourg, O., & al. (1993) Mapping of a novel gene for familial hypertrophic cardiomyopathy to chromosome 11. Nature Genetics 4(3): 311 (cit. on p. 68).
Carvajal-Rodríguez, A. (2008) GENOMEPOP: A program to simulate genomes in populations. BMC Bioinformatics 9(1): 223 (cit. on p. 72).
Casale, F. P., Horta, D., Rakitsch, B., & Stegle, O. (2017) Joint genetic analysis using variant sets reveals polygenic gene-context interactions. PLoS Genetics 13(4). Ed. by M. P. Epstein: e1006693 (cit. on p. 193).
Casale, F. P., Rakitsch, B., Lippert, C., & Stegle, O. (2015) Efficient set tests for the genetic analysis of correlated traits. Nature Methods 12: 755–758 (cit. on pp. 38, 42, 43, 50, 51, 53, 57, 59, 72, 79, 80, 89, 90, 92, 99, 193).
Chakravarti, A. (1999) Population genetics--making sense out of sequence. Nature Genetics 21(1 Suppl): 56–60 (cit. on p. 36).
Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4(1): 7 (cit. on pp. 76, 117, 158).
Chargaff, E., Lipshitz, R., & Green, C. (1952) Composition of the desoxypentose nucleic acids of four genera of sea-urchin. The Journal of Biological Chemistry 195(1): 155–60 (cit. on p. 31).
Chargaff, E., Vischer, E., Doninger, R., Green, C., & Fernanda, M. (1949) The composition of the desoxypentose nucleic acids of thymus and spleen. The Journal of Biological Chemistry 177(1): 405–16 (cit. on p. 31).
Chen, H., Zhang, W., Li, D., Cordes, T. M., Mark Payne, R., & Shou, W. (2009) Analysis of Ventricular Hypertrabeculation and Noncompaction Using Genetically Engineered Mouse Models. Pediatric Cardiology 30(5): 626–634 (cit. on pp. 179, 180).
Chen, J. & Chien, K. R.
(1999) Complexity in simplicity: monogenic disorders and complex cardiomyopathies. The Journal of Clinical Investigation 103(11): 1483–5 (cit. on pp. 155, 156).
Christoffels, V. M., Habets, P. E., Franco, D., Campione, M., de Jong, F., Lamers, W. H., Bao, Z. Z., Palmer, S., Biben, C., Harvey, R. P., & Moorman, A. F. (2000) Chamber formation and morphogenesis in the developing mammalian heart. Developmental Biology 223(2): 266–78 (cit. on p. 64).
Christoffels, V. M. & Moorman, A. F. M. (2009) Development of the cardiac conduction system: why are some regions of the heart more arrhythmogenic than others? Circulation: Arrhythmia and Electrophysiology 2(2): 195–207 (cit. on p. 66).
Christophersen, I. E., Rienstra, M., Roselli, C., Yin, X., Geelhoed, B., Barnard, J., Lin, H., Arking, D. E., Smith, A. V., Albert, C. M., Chaffin, M., Tucker, N. R., Li, M., Klarin, D., Bihlmeyer, N. A., Low, S.-K., Weeke, P. E., Müller-Nurasyid, M., Gustav Smith, J., Brody, J. A., & al. (2017) Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation (cit. on p. 189).
Claes, P., Liberton, D. K., Daniels, K., Rosana, K. M., Quillen, E. E., Pearson, L. N., McEvoy, B., Bauchet, M., Zaidi, A. A., Yao, W., Tang, H., Barsh, G. S., Absher, D. M., Puts, D. A., Rocha, J., Beleza, S., Pereira, R. W., Baynam, G., Suetens, P., Vandermeulen, D., & al. (2014) Modeling 3D Facial Shape from DNA. PLoS Genetics 10(3). Ed. by D. Luquetti: e1004224 (cit. on p. 54).
Cohen, J. (1992) A power primer. Psychological Bulletin 112(1): 155–9 (cit. on pp. 83, 147).
Cohen, S. N. (1973) Recircularization and Autonomous Replication of a Sheared R-Factor DNA Segment in Escherichia coli Transformants. Proceedings of the National Academy of Sciences of the United States of America 70(5): 1293–1297 (cit. on p. 32).
Coifman, R. R. & Lafon, S. (2006) Diffusion maps. Applied Computational Harmonic Analysis 21: 5–30 (cit. on pp. 130, 131).
Coifman, R. R., Lafon, S., Lee, A.
B., Maggioni, M., Nadler, B., Warner, F., & Zucker, S. W. (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps (cit. on p. 130).
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001) A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods 6(4): 330–51 (cit. on p. 114).
Comon, P. (1994) Independent component analysis. Signal Processing Comon 36(36): 28–314 (cit. on p. 128).
Comte de Buffon, G.-L. L. (1749) Oeuvres d’Histoire Naturelle. Volume 8. Imprimerie royale (cit. on p. 24).
Cook, S. & O’Regan, D. (2010) Digital Heart Project (cit. on p. 156).
Corces, M. R., Buenrostro, J. D., Wu, B., Greenside, P. G., Chan, S. M., Koenig, J. L., Snyder, M. P., Pritchard, J. K., Kundaje, A., Greenleaf, W. J., Majeti, R., & Chang, H. Y. (2016) Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nature Genetics 48(10): 1193–1203 (cit. on p. 132).
Correns, C. (1900) G. Mendel’s Regel über das Verhalten der Nachkommenschaft der Rassenbastarde. Berichte der deutschen botanischen Gesellschaft 18: 158–168 (cit. on p. 26).
Cosselman, K. E., Navas-Acien, A., & Kaufman, J. D. (2015) Environmental factors in cardiovascular disease. Nature Reviews Cardiology 12(11): 627–642 (cit. on p. 67).
Crick, F. H. (1958) On protein synthesis. Symposia of the Society for Experimental Biology 12: 138–63 (cit. on p. 32).
Crick, F. H., Barnett, L., Brenner, S., & Watts-Tobin, R. J. (1961) General nature of the genetic code for proteins. Nature 192: 1227–32 (cit. on p. 32).
Crowley, J. J., Zhabotynsky, V., Sun, W., Huang, S., Pakatci, I. K., Kim, Y., Wang, J. R., Morgan, A. P., Calaway, J. D., Aylor, D. L., Yun, Z., Bell, T. A., Buus, R. J., Calaway, M. E., Didion, J. P., Gooch, T. J., Hansen, S. D., Robinson, N. N., Shaw, G. D., Spence, J. S., & al.
(2015) Analyses of allele-specific gene expression in highly divergent mouse crosses identifies pervasive allelic imbalance. Nature Genetics 47(4): 353–360 (cit. on p. 132).
Cunnington, R. H., Northcott, J. M., Ghavami, S., Filomeno, K. L., Jahan, F., Kavosh, M. S., Davies, J. J. L., Wigle, J. T., & Dixon, I. M. C. (2014) The Ski–Zeb2–Meox2 pathway provides a novel mechanism for regulation of the cardiac myofibroblast phenotype. Journal of Cell Science 127(1) (cit. on p. 170).
Cunnington, R. H., Wang, B., Ghavami, S., Bathe, K. L., Rattan, S. G., & Dixon, I. M. C. (2010) Antifibrotic properties of c-Ski and its regulation of cardiac myofibroblast phenotype and contractility. American Journal of Physiology - Cell Physiology 300(1) (cit. on p. 170).
Cupples, L. A., Arruda, H. T., Benjamin, E. J., D’Agostino, R. B., Demissie, S., DeStefano, A. L., Dupuis, J., Falls, K. M., Fox, C. S., Gottlieb, D. J., Govindaraju, D. R., Guo, C.-Y., Heard-Costa, N. L., Hwang, S.-J., Kathiresan, S., Kiel, D. P., Laramie, J. M., Larson, M. G., Levy, D., Liu, C.-Y., & al. (2007) The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Medical Genetics 8 Suppl 1(Suppl 1): S1 (cit. on pp. 38, 69).
Darwin, C. R. (1859) On the Origin of Species by Means of Natural Selection, Or, The Preservation of Favoured Races in the Struggle for Life. London: John Murray: 556 (cit. on p. 24).
Darwin, C. R. (1868) The variation of animals and plants under domestication. 1st ed. London: John Murray (cit. on p. 24).
Davies, M. & McKenna, W. (1995) Hypertrophic cardiomyopathy: pathology and pathogenesis. Histopathology 26(6): 493–500 (cit. on pp. 155, 156).
Dawber, T. R., Meadors, G. F., & Moore, F. E. J. (1951) Epidemiological approaches to heart disease: the Framingham Study. American Journal of Public Health and the Nation’s Health 41(3): 279–81 (cit. on p. 68).
De Vries, H. (1900) Sur la loi de disjonction des hybrides.
Comptes Rendus de l’Academie des Sciences (Paris). Comptes Rendus de l’Academie des Sciences 130: 845–847 (cit. on p. 26).
De Jong, F., Opthof, T., Wilde, A. A., Janse, M. J., Charles, R., Lamers, W. H., & Moorman, A. F. (1992) Persisting zones of slow impulse conduction in developing chicken hearts. Circulation Research 71(2) (cit. on p. 64).
De Marvao, A., Dawes, T. J., Shi, W., Minas, C., Keenan, N. G., Diamond, T., Durighel, G., Montana, G., Rueckert, D., Cook, S. A., & O’Regan, D. P. (2014) Population-based studies of myocardial hypertrophy: high resolution cardiovascular magnetic resonance atlases improve statistical power. Journal of Cardiovascular Magnetic Resonance 16(1): 16 (cit. on pp. 156, 160, 162, 181).
De Ridder, D. & Duin, R. (2002) “Locally linear embedding for classification”. Technical Report PH-2002-01. Delft University of Technology. Delft (cit. on p. 134).
Dehghan, A., Bis, J. C., White, C. C., Smith, A. V., Morrison, A. C., Cupples, L. A., Trompet, S., Chasman, D. I., Lumley, T., Völker, U., Buckley, B. M., Ding, J., Jensen, M. K., Folsom, A. R., Kritchevsky, S. B., Girman, C. J., Ford, I., Dörr, M., Salomaa, V., Uitterlinden, A. G., & al. (2016) Genome-Wide Association Study for Incident Myocardial Infarction and Coronary Heart Disease in Prospective Cohort Studies: The CHARGE Consortium. PLoS ONE 11(3). Ed. by M.-P. Dubé: e0144997 (cit. on p. 189).
Delaneau, O., Marchini, J., & Zagury, J.-F. (2012) A linear complexity phasing method for thousands of genomes. Nature Methods 9(2): 179–81 (cit. on p. 158).
Delaneau, O., Zagury, J.-F., & Marchini, J. (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods 10(1): 5–6 (cit. on p. 158).
Deng, Q., Ramsköld, D., Reinius, B., & Sandberg, R. (2014) Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343(6167): 193–6 (cit. on p. 132).
Denny, J. C., Ritchie, M. D., Crawford, D. C., Schildcrout, J. S., Ramirez, A.
H., Pulley, J. M., Basford, M. A., Masys, D. R., Haines, J. L., & Roden, D. M. (2010) Identification of genomic predictors of atrioventricular conduction: Using electronic medical records as a tool for genome science. Circulation 122(20): 2016–2021 (cit. on p. 189).
Devlin, B., Bacanu, S.-A., & Roeder, K. (2004) Genomic Control to the extreme. Nature Genetics 36(11): 1129–1130, author reply 1131 (cit. on p. 48).
Devlin, B. & Roeder, K. (1999) Genomic control for association studies. Biometrics 55(4): 997–1004 (cit. on p. 48).
Donis-Keller, H., Green, P., Helms, C., Cartinhour, S., Weiffenbach, B., Stephens, K., Keith, T. P., Bowden, D. W., Smith, D. F., Lander, E. S., Botstein, D., Powers, J. A., Watt, D. E., Kauffman, E. R., Bricker, A., Phipps, P., Muller-Kahle, H., Fulton, T. R., Ng, S., Schumm, J. W., & al. (1987) A Genetic Linkage Map of the Human Genome. Cell 51(0): 319–337 (cit. on p. 32).
Donoho, D. & Jin, J. (2006) Asymptotic minimaxity of false discovery rate thresholding for sparse exponential data. The Annals of Statistics 34(6): 2980–3018 (cit. on p. 46).
Doyle, A. J., Doyle, J. J., Bessling, S. L., Maragh, S., Lindsay, M. E., Schepers, D., Gillis, E., Mortier, G., Homfray, T., Sauls, K., Norris, R. A., Huso, N. D., Leahy, D., Mohr, D. W., Caulfield, M. J., Scott, A. F., Destrée, A., Hennekam, R. C., Arn, P. H., Curry, C. J., & al. (2012) Mutations in the TGF-β repressor SKI cause Shprintzen-Goldberg syndrome with aortic aneurysm. Nature Genetics 44(11): 1249–1254 (cit. on p. 170).
Dunn, O. J. (1961) Multiple comparisons among means. Journal of the American Statistical Association 56(293): 52–64 (cit. on pp. 46, 118).
Dunwell, J. M. (2007) 100 years on: a century of genetics. Nature Reviews Genetics 8(3): 231–235 (cit. on p. 29).
Durham, D. & Worthley, L. I. G. (2002) Cardiac arrhythmias: diagnosis and management. The tachycardias. Critical Care and Resuscitation 4(1): 35–53 (cit. on p. 66).
East, E. M.
(1910) A Mendelian Interpretation of Variation that is Apparently Continuous (cit. on p. 29).
Eden, T. & Fisher, R. A. (1929) Studies in crop variation: VI. Experiments on the response of the potato to potash and nitrogen. The Journal of Agricultural Science 19(02): 201 (cit. on p. 31).
Edwards, A. W. F. (2013) Robert Heath Lock and his textbook of genetics, 1906. Genetics 194(3): 529–37 (cit. on pp. 26, 28).
Ehrenreich, I. M., Torabi, N., Jia, Y., Kent, J., Martis, S., Shapiro, J. A., Gresham, D., Caudy, A. A., & Kruglyak, L. (2010) Dissection of genetically complex traits with extremely large pools of yeast segregants. Nature 464(7291): 1039–1042 (cit. on pp. 47, 118).
Ehret, G. B., Munroe, P. B., Rice, K. M., Bochud, M., Johnson, A. D., Chasman, D. I., Smith, A. V., Tobin, M. D., Verwoert, G. C., Hwang, S.-J., Pihur, V., Vollenweider, P., O’Reilly, P. F., Amin, N., Bragg-Gresham, J. L., Teumer, A., Glazer, N. L., Launer, L., Zhao, J. H., Aulchenko, Y., & al. (2011) Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478(7367): 103–9 (cit. on pp. 69, 189).
Eke, A., Herman, P., Kocsis, L., & Kozak, L. R. (2002) Fractal characterization of complexity in temporal physiological signals. Physiological Measurement 23(1): R1–38 (cit. on p. 181).
Elledge, S. J. & Davis, R. W. (1990) Two genes differentially regulated in the cell cycle and by DNA-damaging agents encode alternative regulatory subunits of ribonucleotide reductase. Genes & Development 4(5): 740–51 (cit. on p. 120).
Eriksson, N., Macpherson, J. M., Tung, J. Y., Hon, L. S., Naughton, B., Saxonov, S., Avey, L., Wojcicki, A., Pe’er, I., & Mountain, J. (2010) Web-Based, Participant-Driven Studies Yield Novel Genetic Associations for Common Traits. PLoS Genetics 6(6). Ed. by G. Gibson: e1000993 (cit. on p. 38).
Etzel, C. J., Shete, S., Beasley, T. M., Fernandez, J. R., Allison, D. B., & Amos, C. I.
(2003) Effect of Box-Cox transformation on power of Haseman-Elston and maximum-likelihood variance components tests to detect quantitative trait Loci. Human Heredity 55(2-3): 108–16 (cit. on p. 43).
Ewens, W. J. & Spielman, R. S. (1995) The Transmission/Disequilibrium Test: History, Subdivision, and Admixture. The American Journal of Human Genetics 57: 455–464 (cit. on p. 48).
Ewing, G. & Hermisson, J. (2010) MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26(16): 2064–5 (cit. on p. 72).
Fadista, J., Manning, A. K., Florez, J. C., & Groop, L. (2016) The (in)famous GWAS P-value threshold revisited and updated for low-frequency variants. European Journal of Human Genetics 24: 1202–1205 (cit. on pp. 47, 83).
Ferreira, M. A. R. & Purcell, S. M. (2009) A multivariate test of association. Bioinformatics 25(1): 132–3 (cit. on p. 55).
Filippini, N., MacIntosh, B. J., Hough, M. G., Goodwin, G. M., Frisoni, G. B., Smith, S. M., Matthews, P. M., Beckmann, C. F., & Mackay, C. E. (2009) Distinct patterns of brain activity in young carriers of the APOE-epsilon4 allele. Proceedings of the National Academy of Sciences of the United States of America 106(17): 7209–14 (cit. on p. 156).
Fisher, R. A. (1912) On an Absolute Criterion for Fitting Frequency Curves. Messenger of Mathematics 41: 155–160 (cit. on p. 30).
Fisher, R. A. (1918) The Correlation between Relatives on the Supposition of Mendelian Inheritance. Philosophical transactions of the Royal Society of Edinburgh 52: 399–433 (cit. on p. 31).
Fisher, R. A. (1921) Studies in Crop Variation. I. An examination of the yield of dressed grain from Broadbalk. The Journal of Agricultural Science 11(02): 107 (cit. on p. 31).
Fisher, R. A. (1922a) On the Interpretation of $\chi^2$ from Contingency Tables, and the Calculation of P. Journal of the Royal Statistical Society 85(1): 87 (cit. on p. 30).
Fisher, R. A.
(1922b) On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London A 222(594-604) (cit. on p. 30).
Fisher, R. A. (1922c) The Goodness of Fit of Regression Formulae, and the Distribution of Regression Coefficients. Journal of the Royal Statistical Society 85(4): 597 (cit. on p. 30).
Fisher, R. A. (1922d) The Systematic Location of Genes by Means of Crossover Observations. The American Naturalist 56: 406–411 (cit. on p. 30).
Fisher, R. A. (1924a) “On a distribution yielding the error functions of several well known statistics”. Proceedings of the International Congress of Mathematicians. Toronto: 10 (cit. on p. 31).
Fisher, R. A. (1924b) The Distribution of the Partial Correlation Coefficient. Metron 3: 329–332 (cit. on p. 30).
Fisher, R. A. (1928) The General Sampling Distribution of the Multiple Correlation Coefficient. Proceedings of the Royal Society of London. Series A 121(788): 654–673 (cit. on p. 30).
Fisher, R. A. (1930) The Genetical Theory of Natural Selection. 2nd ed. Oxford: Clarendon Press (cit. on p. 31).
Fisher, R. A. & Mackenzie, W. A. (1923) Studies in crop variation. II. The manurial response of different potato varieties. The Journal of Agricultural Science 13(03): 311 (cit. on p. 31).
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., & Merrick, J. M. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269(5223): 496–512 (cit. on p. 33).
Flemming, W. (1878) Zur Kenntniss der Zelle und ihrer Theilungs-Erscheinungen. Schriften des Naturwissenschaftlichen Vereins für Schleswig-Holstein 3: 23–27 (cit. on p. 27).
Florian, A., Masci, P. G., De Buck, S., Aquaro, G. D., Claus, P., Todiere, G., Van Cleemput, J., Lombardi, M., & Bogaert, J. (2012) Geometric Assessment of Asymmetric Septal Hypertrophic Cardiomyopathy by CMR.
JACC: Cardiovascular Imaging 5(7): 702–711 (cit. on p. 155).
Foiani, M., Cigan, A. M., Paddon, C. J., Harashima, S., Hinnebusch, A. G., Pavitt, G., Ashe, M., Grant, C., Cyert, M., Hughes, T., Boone, C., Andrews, B., Chua, G., Friesen, H., Goldberg, D., Haynes, J., Humphries, C., He, G., Hussein, S., Ke, L., & al. (1991) GCD2, a translational repressor of the GCN4 gene, has a general function in the initiation of protein synthesis in Saccharomyces cerevisiae. Molecular and Cellular Biology 11(6): 3203–3216 (cit. on p. 120).
Ford, E. S., Ajani, U. A., Croft, J. B., Critchley, J. A., Labarthe, D. R., Kottke, T. E., Giles, W. H., & Capewell, S. (2007) Explaining the Decrease in U.S. Deaths from Coronary Disease, 1980–2000. New England Journal of Medicine 356(23): 2388–2398 (cit. on p. 67).
Fox, E. R., Musani, S. K., Barbalic, M., Lin, H., Yu, B., Ogunyankin, K. O., Smith, N. L., Kutlar, A., Glazer, N. L., Post, W. S., Paltoo, D. N., Dries, D. L., Farlow, D. N., Duarte, C. W., Kardia, S. L., Meyers, K. J., Sun, Y. V., Arnett, D. K., Patki, A. A., Sha, J., & al. (2013) Genome-wide association study of cardiac structure and systolic function in African Americans: the Candidate Gene Association Resource (CARe) study. Circulation. Cardiovascular genetics 6(1): 37–46 (cit. on p. 155).
Franceschini, N., Fox, E., Zhang, Z., Edwards, T. L., Nalls, M. A., Sung, Y. J., Tayo, B. O., Sun, Y. V., Gottesman, O., Adeyemo, A., Johnson, A. D., Young, J. H., Rice, K., Duan, Q., Chen, F., Li, Y., Tang, H., Fornage, M., Keene, K. L., Andrews, J. S., & al. (2013) Genome-wide Association Analysis of Blood-Pressure Traits in African-Ancestry Individuals Reveals Common Associated Genes in African and Non-African Populations. The American Journal of Human Genetics 93(3): 545–554 (cit. on p. 38).
Frayling, T. M., Timpson, N. J., Weedon, M. N., Zeggini, E., Freathy, R. M., Lindgren, C. M., Perry, J. R. B., Elliott, K. S., Lango, H., Rayner, N. W., Shields, B., Harries, L. W., Barrett, J.
C., Ellard, S., Groves, C. J., Knight, B., Patch, A.-M., Ness, A. R., Ebrahim, S., Lawlor, D. A., & al. (2007) A Common Variant in the FTO Gene Is Associated with Body Mass Index and Predisposes to Childhood and Adult Obesity. Science 316(5826) (cit. on p. 37).
Furrer, R. & Bengtsson, T. (2007) Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. Journal of Multivariate Analysis 98(2): 227–255 (cit. on p. 104).
Galton, F. (1886) Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–263 (cit. on p. 27).
Galton, F. (1889) Natural inheritance. London: Macmillan Publishers Limited: 282 (cit. on p. 27).
Galton, F. (1901) Biometry. Biometrika 1(1) (cit. on p. 27).
Galwey, N. W. (2009) A new measure of the effective number of tests, a practical tool for comparing families of non-independent significance tests. Genetic Epidemiology 33(7): 559–68 (cit. on p. 117).
Garg, V., Kathiriya, I. S., Barnes, R., Schluterman, M. K., King, I. N., Butler, C. A., Rothrock, C. R., Eapen, R. S., Hirayama-Yamada, K., Joo, K., Matsuoka, R., Cohen, J. C., & Srivastava, D. (2003) GATA4 mutations cause human congenital heart defects and reveal an interaction with TBX5. Nature 424(6947): 443–447 (cit. on p. 68).
Garson, D. G. (2015) Missing values analysis and data imputation. 2nd ed. Asheboro, NC: Statistical Associates Publishing: 113 (cit. on pp. 107, 110, 113).
GBE (2017) Global Biobank Engine. Stanford, CA (cit. on pp. 171, 190).
Ge, T., Schumann, G., & Feng, J. (2014) Imaging genetics – towards discovery neuroscience. Quantitative Biology 1(4): 227–245 (cit. on pp. 156, 177).
Geisterfer-Lowrance, A. T., Kass, S., Tanigawa, G., Vosberg, H.-P., McKenna, W., Seidman, C. E., & Seidman, J. G. (1990) A Molecular Basis for Familial Hypertrophic Cardiomyopathy: A β Cardiac Myosin Heavy Chain Gene Missense Mutation. Cell 62: 999–1006 (cit. on p. 68).
Gilmour, A.
R., Thompson, R., & Cullis, B. R. (1995) Average Information REML: An Efficient Algorithm for Variance Parameter Estimation in Linear Mixed Models. Biometrics 51(4): 1440 (cit. on p. 59).
Gittenberger-de Groot, A. C., Bartelings, M. M., Deruiter, M. C., & Poelmann, R. E. (2005) Basics of cardiac development for the understanding of congenital heart malformations. Pediatric Research 57(2): 169–76 (cit. on p. 65).
Glasser, M. F., Sotiropoulos, S. N., Wilson, J. A., Coalson, T. S., Fischl, B., Andersson, J. L., Xu, J., Jbabdi, S., Webster, M., Polimeni, J. R., Van Essen, D. C., Jenkinson, M., & WU-Minn HCP Consortium (2013) The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage 80: 105–124 (cit. on p. 194).
Glazner, C. & Thompson, E. A. (2012) Improving pedigree-based linkage analysis by estimating coancestry among families. Statistical Applications in Genetics and Molecular Biology 11(2) (cit. on p. 52).
Glover, M., Ware, J. S., Henry, A., Wolley, M., Walsh, R., Wain, L. V., Xu, S., Van ’, W. G., Hoff, T., Tobin, M. D., Hall, I. P., Cook, S., Gordon, R. D., Stowasser, M., & O’Shaughnessy, K. M. (2014) Detection of mutations in KLHL3 and CUL3 in families with FHHt (familial hyperkalaemic hypertension or Gordon’s syndrome). Clinical Science 126: 721–726 (cit. on p. 68).
Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin, H., & Oliver, S. G. (1996) Life with 6000 genes. Science 274(5287): 546, 563–7 (cit. on p. 33).
Goodman, H. M., Olson, M. V., & Hall, B. D. (1977) Nucleotide sequence of a mutant eukaryotic gene: the yeast tyrosine-inserting ochre suppressor SUP4-o. Proceedings of the National Academy of Sciences of the United States of America 74(12): 5453–7 (cit. on p. 32).
Gormley, P., Anttila, V., Winsvold, B. S., Palta, P., Esko, T., Pers, T.
H., Farh, K.-H., Cuenca-Leon, E., Muona, M., Furlotte, N. A., Kurth, T., Ingason, A., McMahon, G., Ligthart, L., Terwindt, G. M., Kallela, M., Freilinger, T. M., Ran, C., Gordon, S. G., Stam, A. H., & al. (2016) Meta-analysis of 375,000 individuals identifies 38 susceptibility loci for migraine. Nature Genetics 48(8): 856–866 (cit. on p. 38).
Goss, C. M. (1938) The first contractions of the heart in rat embryos. The Anatomical Record 70(5): 505–524 (cit. on p. 64).
Gower, J. C. (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325 (cit. on pp. 125, 126, 134).
Greally, M. T. (1993) Shprintzen-Goldberg Syndrome. University of Washington, Seattle (cit. on p. 170).
Grodzicker, T., Williams, J., Sharp, P., & Sambrook, J. G. (1974) Physical mapping of temperature-sensitive mutations of adenoviruses. Cold Spring Harbor Symp Quant Biol 39: 439–446 (cit. on p. 32).
Guan, Y. & Stephens, M. (2008) Practical issues in imputation-based association mapping. PLoS Genetics 4(12): e1000279 (cit. on p. 43).
Gudbjartsson, D. F., Arnar, D. O., Helgadottir, A., Gretarsdottir, S., Holm, H., Sigurdsson, A., Jonasdottir, A., Baker, A., Thorleifsson, G., Kristjansson, K., Palsson, A., Blondal, T., Sulem, P., Backman, V. M., Hardarson, G. A., Palsdottir, E., Helgason, A., Sigurjonsdottir, R., Sverrisson, J. T., Kostulas, K., & al. (2007) Variants conferring risk of atrial fibrillation on chromosome 4q25. Nature 448(7151): 353–7 (cit. on p. 189).
Gusella, J. F., Wexler, N. S., Conneally, P. M., Naylor, S. L., Anderson, M. A., Tanzi, R. E., Watkins, P. C., Ottina, K., Wallace, M. R., Sakaguchi, A. Y., Young, A. B., Shoulson, I., Bonilla, E., & Martin, J. B. (1983) A polymorphic DNA marker genetically linked to Huntington’s disease. Nature 306(5940): 234–238 (cit. on p. 34).
Güvenç, T. S., Erer, H. B., Altay, S., Ilhan, E., Sayar, N., & Eren, M.
(2012) ’Idiopathic’ acute myocardial infarction in a young patient with noncompaction cardiomyopathy. Cardiology Journal 19(4): 429–33 (cit. on p. 189).
Haider, A. W., Larson, M. G., Benjamin, E. J., & Levy, D. (1998) Increased left ventricular mass and hypertrophy are associated with increased risk for sudden death. Journal of the American College of Cardiology 32(5): 1454–9 (cit. on p. 155).
Hald, A. (1999) On the History of Maximum Likelihood in Relation to Inverse Probability and Least Squares. Statistical Science 14: 214–222 (cit. on p. 30).
Haldane, J. B. S. (1934) Methods for the detection of autosomal linkage in man. Annals of Eugenics 6(1): 26–65 (cit. on p. 34).
Haldane, J. B. S. & Smith, C. A. B. (1947) A new estimate of the linkage between the genes for colour-blindness and haemophilia in man. Annals of Eugenics 14(1): 10–31 (cit. on p. 34).
Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015) The fickle P value generates irreproducible results. Nature Methods 12(3): 179–185 (cit. on pp. 83, 147).
Hanchard, N. A., Swaminathan, S., Bucasas, K., Furthner, D., Fernbach, S., Azamian, M. S., Wang, X., Lewin, M., Towbin, J. A., D’Alessandro, L. C., Morris, S. A., Dreyer, W., Denfield, S., Ayres, N. A., Franklin, W. J., Justino, H., Lantin-Hermoso, M. R., Ocampo, E. C., Santos, A. B., Parekh, D., & al. (2016) A genome-wide association study of congenital cardiovascular left-sided lesions shows association with a locus on chromosome 20. Human Molecular Genetics 25(11): 2331–2341 (cit. on p. 69).
Hannah, A. ( & De Vries, H. ( (1950) Concerning the law of segregation in hybrids. Genetics 35(5): 30–32 (cit. on p. 26).
Hansson, J. H., Nelson-Williams, C., Suzuki, H., Schild, L., Shimkets, R., Lu, Y., Canessa, C., Iwasaki, T., Rossier, B., & Lifton, R. P. (1995) Hypertension caused by a truncated epithelial sodium channel γ subunit: genetic heterogeneity of Liddle syndrome. Nature Genetics 11(1): 76–82 (cit. on p. 68).
Hästbacka, J., de la Chapelle, A., Kaitila, I., Sistonen, P., Weaver, A., & Lander, E. (1992) Linkage disequilibrium mapping in isolated founder populations: diastrophic dysplasia in Finland. Nature Genetics 2(3): 204–211 (cit. on p. 34).
Hayes, B. J., Visscher, P. M., & Goddard, M. E. (2009) Increased accuracy of artificial selection by using the realized relationship matrix. Genetics Research 91: 47–60 (cit. on p. 52).
He, L.-N., Liu, Y.-J., Xiao, P., Zhang, L., Guo, Y., Yang, T.-L., Zhao, L.-J., Drees, B., Hamilton, J., Deng, H.-Y., Recker, R. R., & Deng, H.-W. (2008) Genomewide Linkage Scan for Combined Obesity Phenotypes using Principal Component Analysis. Annals of Human Genetics 72(3): 319–326 (cit. on p. 54).
Heather, J. M. & Chain, B. (2016) The sequence of sequencers: The history of sequencing DNA. Genomics 107: 1–8 (cit. on p. 33).
Heid, I. M., Jackson, A. U., Randall, J. C., Winkler, T. W., Qi, L., Steinthorsdottir, V., Thorleifsson, G., Zillikens, M. C., Speliotes, E. K., Mägi, R., Workalemahu, T., White, C. C., Bouatia-Naji, N., Harris, T. B., Berndt, S. I., Ingelsson, E., Willer, C. J., Weedon, M. N., Luan, J., Vedantam, S., & al. (2010) Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nature Genetics 42(11): 949–960 (cit. on p. 38).
Hendee, K., Wang, L. W., Reis, L. M., Rice, G. M., Apte, S. S., & Semina, E. V. (2017) Identification and functional analysis of an ADAMTSL1 variant associated with a complex phenotype including congenital glaucoma, craniofacial, and other systemic features in a three-generation human pedigree. Human Mutation (cit. on p. 189).
Henderson, C. R. & Quaas, R. L. (1976) Multiple Trait Evaluation Using Relatives’ Records. Journal of Animal Science 43(6): 1188 (cit. on p. 57).
Herault, J. & Jutten, C. (1986) “Space or time adaptive signal processing by neural network models”. AIP Conference Proceedings. Vol. 151. AIP: 206–211 (cit.
on p. 128).
Hibar, D. P., Stein, J. L., Renteria, M. E., Arias-Vasquez, A., Desrivières, S., Jahanshad, N., Toro, R., Wittfeld, K., Abramovic, L., Andersson, M., Aribisala, B. S., Armstrong, N. J., Bernard, M., Bohlken, M. M., Boks, M. P., Bralten, J., Brown, A. A., Mallar Chakravarty, M., Chen, Q., Ching, C. R. K., & al. (2015) Common genetic variants influence human subcortical brain structures. Nature 520(7546): 224–229 (cit. on pp. 156, 177).
Hirohata, S., Wang, L. W., Miyagi, M., Yan, L., Seldin, M. F., Keene, D. R., Crabb, J. W., & Apte, S. S. (2002) Punctin, a novel ADAMTS-like molecule, ADAMTSL-1, in extracellular matrix. The Journal of Biological Chemistry 277(14): 12182–9 (cit. on p. 186).
Hirokawa, M., Morita, H., Tajima, T., Takahashi, A., Ashikawa, K., Miya, F., Shigemizu, D., Ozaki, K., Sakata, Y., Nakatani, D., Suna, S., Imai, Y., Tanaka, T., Tsunoda, T., Matsuda, K., Kadowaki, T., Nakamura, Y., Nagai, R., Komuro, I., & Kubo, M. (2015) A genome-wide association study identifies PLCL2 and AP3D1-DOT1L-SF3A2 as new susceptibility loci for myocardial infarction in Japanese. European Journal of Human Genetics 23(3): 374–380 (cit. on p. 189).
Hirschhorn, J. N., Lohmueller, K., Byrne, E., & Hirschhorn, K. (2002) A comprehensive review of genetic association studies. Genetics in Medicine 4(2): 45–61 (cit. on p. 35).
Ho, A. J., Stein, J. L., Hua, X., Lee, S., Hibar, D. P., Leow, A. D., Dinov, I. D., Toga, A. W., Saykin, A. J., Shen, L., Foroud, T., Pankratz, N., Huentelman, M. J., Craig, D. W., Gerber, J. D., Allen, A. N., Corneveaux, J. J., Stephan, D. A., DeCarli, C. S., DeChairo, B. M., & al. (2010) A commonly carried allele of the obesity-related FTO gene is associated with reduced brain volume in the healthy elderly. Proceedings of the National Academy of Sciences of the United States of America 107(18): 8404–9 (cit. on p. 156).
Hoffman, J. I. E. (2005) “Congenital Heart Disease”. Essential Cardiology. Totowa, NJ: Humana Press: 393–406 (cit. on p.
67).
Hoggart, C. J., Chadeau-Hyam, M., Clark, T. G., Lampariello, R., Whittaker, J. C., De Iorio, M., & Balding, D. J. (2007) Sequence-Level Population Simulations Over Large Genomic Regions. Genetics 177(3): 1725–1731 (cit. on p. 72).
Hore, V., Viñuela, A., Buil, A., Knight, J., McCarthy, M. I., Small, K., & Marchini, J. (2016) Tensor decomposition for multiple-tissue gene expression experiments. Nature Genetics 48(9): 1094–1100 (cit. on p. 195).
Horn, R. A. & Johnson, C. R. (1985) Matrix analysis. 23rd ed. New York: Cambridge University Press: 561 (cit. on p. 214).
Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24(6): 417–441 (cit. on pp. 125, 134).
Howie, B. N., Donnelly, P., & Marchini, J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6). Ed. by N. J. Schork: e1000529 (cit. on pp. 37, 159).
Howie, B., Marchini, J., & Stephens, M. (2011) Genotype imputation with thousands of genomes. G3 1(6): 457–70 (cit. on p. 160).
Hua, S. S. T., Hernlem, B. J., Yokoyama, W., & Sarreal, S. B. L. (2015) Intracellular trehalose and sorbitol synergistically promoting cell viability of a biocontrol yeast, Pichia anomala, for aflatoxin reduction. World Journal of Microbiology and Biotechnology 31(5): 729–734 (cit. on p. 120).
Hubmacher, D. & Apte, S. S. (2015) ADAMTS proteins as modulators of microfibril formation and function. Matrix Biology 47: 34–43 (cit. on p. 189).
Hudson, R. R. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18(2): 337–338 (cit. on p. 72).
Hughes, S. E. (2004) The pathology of hypertrophic cardiomyopathy. Histopathology 44(5): 412–427 (cit. on p. 155).
Huisman, S. M. H., van Lew, B., Mahfouz, A., Pezzotti, N., Höllt, T., Michielsen, L., Vilanova, A., Reinders, M. J., & Lelieveldt, B. P. F.
(2017) BrainScope: interactive visual exploration of the spatial and temporal human brain transcriptome. Nucleic Acids Research 45(10): e83 (cit. on p. 132).
Hunkapiller, T., Kaiser, R., Koop, B., & Hood, L. (1991) Large-scale and automated DNA sequence determination. Science 254(5028) (cit. on p. 32).
Hunter, D. J. (2005) Gene–environment interactions in human diseases. Nature Reviews Genetics 6(4): 287–298 (cit. on p. 35).
Hyvärinen, A. & Oja, E. (2000) Independent component analysis: algorithms and applications. Neural Networks 13(4): 411–430 (cit. on pp. 128, 134).
Ichida, F., Tsubata, S., Bowles, K. R., Haneda, N., Uese, K., Miyawaki, T., Dreyer, W. J., Messina, J., Li, H., Bowles, N. E., & Towbin, J. A. (2001) Novel Gene Mutations in Patients With Left Ventricular Noncompaction or Barth Syndrome. Circulation 103(9) (cit. on p. 180).
Ingram, V. M. & Stretton, A. O. W. (1959) Genetic Basis of the Thalassæmia Diseases. Nature 184(4703): 1903–1909 (cit. on p. 34).
International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409(6822): 860–921 (cit. on p. 33).
Jackson, D. A., Symons, R. H., & Berg, P. (1972) Biochemical Method for Inserting New Genetic Information into DNA of Simian Virus 40: Circular SV40 DNA Molecules Containing Lambda Phage Genes and the Galactose Operon of Escherichia coli. Proceedings of the National Academy of Sciences of the United States of America 69(10): 2904–2909 (cit. on p. 32).
Jahanshad, N., Kochunov, P. V., Sprooten, E., Mandl, R. C., Nichols, T. E., Almasy, L., Blangero, J., Brouwer, R. M., Curran, J. E., de Zubicaray, G. I., Duggirala, R., Fox, P. T., Hong, L. E., Landman, B. A., Martin, N. G., McMahon, K. L., Medland, S. E., Mitchell, B. D., Olvera, R. L., Peterson, C. P., & al. (2013) Multi-site genetic analysis of diffusion images and voxelwise heritability analysis: A pilot project of the ENIGMA–DTI working group. NeuroImage 81: 455–469 (cit. on p. 156).
Jeffreys, A. J.
(1979) DNA sequence variants in the G gamma-, A gamma-, delta- and beta-globin genes of man. Cell 18(1): 1–10 (cit. on p. 32).
Jenni, R., Wyss, C. A., Oechslin, E. N., & Kaufmann, P. A. (2002) Isolated Ventricular Noncompaction Is Associated With Coronary Microcirculatory Dysfunction. Journal of the American College of Cardiology 39: 450–454 (cit. on p. 189).
Jiang, C. & Zeng, Z. B. (1995) Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140(3): 1111–27 (cit. on pp. 35, 56, 59, 97).
Johannsen, W. (1911) The Genotype Conception of Heredity. The American Naturalist 45: 129–159 (cit. on p. 29).
Junga, G., Kneifel, S., Smekal, A. V., Steinert, H., & Bauersfeld, U. (1999) Myocardial ischaemia in children with isolated ventricular non-compaction. European Heart Journal 20: 910–916 (cit. on p. 189).
Kan, Y. W. & Dozy, A. M. (1978) Antenatal diagnosis of sickle-cell anaemia by D.N.A. analysis of amniotic-fluid cells. Lancet 2(8096): 910–2 (cit. on p. 32).
Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S.-Y., Freimer, N. B., Sabatti, C., & Eskin, E. (2010) Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42(4): 348–54 (cit. on pp. 38, 50, 88, 89, 103, 193).
Kang, H. M., Zaitlen, N. A., Wade, C. M., Kirby, A., Heckerman, D., Daly, M. J., & Eskin, E. (2008) Efficient control of population structure in model organism association mapping. Genetics 178(3): 1709–23 (cit. on pp. 52, 53).
Kannel, W. B. & McGee, D. L. (1979) Diabetes and cardiovascular disease. The Framingham study. JAMA 241(19): 2035–8 (cit. on p. 68).
Kaski, S., Nikkilä, J., Oja, M., Venna, J., Törönen, P., & Castrén, E. (2003) Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinformatics 4(1): 48 (cit. on p. 137).
Kathiresan, S. & Srivastava, D. (2012) Genetics of Human Cardiovascular Disease. Cell 148(6): 1242–1257 (cit. on p. 68).
Kathiresan, S., Willer, C.
J., Peloso, G. M., Demissie, S., Musunuru, K., Schadt, E. E., Kaplan, L., Bennett, D., Li, Y., Tanaka, T., Voight, B. F., Bonnycastle, L. L., Jackson, A. U., Crawford, G., Surti, A., Guiducci, C., Burtt, N. P., Parish, S., Clarke, R., Zelenika, D., & al. (2009) Common variants at 30 loci contribute to polygenic dyslipidemia. Nature Genetics 41(1): 56–65 (cit. on p. 189).
Kato, N., Takeuchi, F., Tabara, Y., Kelly, T. N., Go, M. J., Sim, X., Tay, W. T., Chen, C.-H., Zhang, Y., Yamamoto, K., Katsuya, T., Yokota, M., Kim, Y. J., Ong, R. T. H., Nabika, T., Gu, D., Chang, L.-c., Kokubo, Y., Huang, W., Ohnaka, K., & al. (2011) Meta-analysis of genome-wide association studies identifies common variants associated with blood pressure variation in east Asians. Nature Genetics 43(6): 531–538 (cit. on p. 38).
Kawel, N., Nacif, M., Arai, A. E., Gomes, A. S., Hundley, W. G., Johnson, W. C., Prince, M. R., Stacey, R. B., Lima, J. A. C., & Bluemke, D. A. (2012) Trabeculated (Noncompacted) and Compact Myocardium in AdultsClinical Perspective. Circulation: Cardiovascular Imaging 5(3) (cit. on pp. 180, 183, 192).
Kayo, O. (2006) Locally Linear Embedding Algorithm Extensions and Applications (cit. on pp. 133, 142).
Kelleher, J., Etheridge, A. M., & McVean, G. (2016) Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Computational Biology 12(5). Ed. by Y. S. Song: e1004842 (cit. on p. 72).
Kerem, B., Rommens, J. M., Buchanan, J. A., Markiewicz, D., Cox, T. K., Chakravarti, A., Buchwald, M., & Tsui, L. C. (1989) Identification of the cystic fibrosis gene: genetic analysis. Science 245(4922): 1073–80 (cit. on p. 34).
Keynes, M. & Cox, T. M. (2008) William Bateson, the rediscoverer of Mendel. Journal of the Royal Society of Medicine 101(3): 104 (cit. on p. 26).
Kimura, A., Harada, H., Park, J.-E., Nishi, H., Satoh, M., Takahashi, M., Hiroi, S., Sasaoka, T., Ohbuchi, N., Nakamura, T., Koyanagi, T., Hwang, T.-H., Choo, J.-A., Chung, K.-S., Hasegawa, A., Nagai, R., Okazaki, O., Nakamura, H., Matsuzaki, M., Sakamoto, T., & al. (1997) Mutations in the cardiac troponin I gene associated with hypertrophic cardiomyopathy. Nature Genetics 16(4): 379–382 (cit. on p. 68).
Klaassen, S., Probst, S., Oechslin, E., Gerull, B., Krings, G., Schuler, P., Greutmann, M., Hürlimann, D., Yegitbasi, M., Pons, L., Gramlich, M., Drenckhahn, J., Heuser, A., Berger, F., Jenni, R., & Thierfelder, L. (2008) Mutations in Sarcomere Protein Genes in Left Ventricular Noncompaction. Circulation 117(22) (cit. on p. 180).
Klein, R. J., Zeiss, C., Chew, E. Y., Tsai, J.-Y., Sackler, R. S., Haynes, C., Henning, A. K., SanGiovanni, J. P., Mane, S. M., Mayne, S. T., Bracken, M. B., Ferris, F. L., Ott, J., Barnstable, C., & Hoh, J. (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308(5720) (cit. on p. 37).
Knott, S. A. & Haley, C. S. (2000) Multitrait Least Squares for Quantitative Trait Loci Detection. Genetics 156: 899–911 (cit. on p. 56).
Korol, A. B., Ronin, Y. I., Itskovich, A. M., Peng, J., & Nevo, E. (2001) Enhanced Efficiency of Quantitative Trait Loci Mapping Analysis Based on Multivariate Complexes of Quantitative Traits. Genetics 157: 1789–1803 (cit. on p. 56).
Korte, A., Vilhjálmsson, B. J., Segura, V., Platt, A., Long, Q., & Nordborg, M. (2012) A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature Genetics 44(9): 1066–71 (cit. on pp. 38, 56, 59, 89, 97, 118, 193).
Krol, K., Brozda, I., Skoneczny, M., Bretne, M., Skoneczna, A., & Yoshida, S. (2015) A Genomic Screen Revealing the Importance of Vesicular Trafficking Pathways in Genome Maintenance and Protection against Genotoxic Stress in Diploid Saccharomyces cerevisiae Cells. PLoS ONE 10(3). Ed. by M.
S.-Y. Huen: e0120702 (cit. on p. 120).
Kruskal, J. B. (1964a) Multidimensional scaling by optimizing goodness of fit to a non-metric hypothesis. Psychometrika 29(1): 1–27 (cit. on p. 128).
Kruskal, J. B. (1964b) Nonmetric multidimensional scaling: A numerical method. Psychometrika 29(2): 115–129 (cit. on p. 128).
Krzywinski, M. & Altman, N. (2013a) Points of significance: Power and sample size. Nature Methods 10(12): 1139–1140 (cit. on pp. 43, 44).
Krzywinski, M. & Altman, N. (2013b) Points of significance: Significance, P values and t-tests. Nature Methods 10(11): 1041–1042 (cit. on p. 44).
Kulbrock, M., Lehner, S., Metzger, J., Ohnesorge, B., & Distl, O. (2013) A genome-wide association study identifies risk loci to equine recurrent uveitis in German warmblood horses. PLoS ONE 8(8): e71619 (cit. on p. 49).
Kullback, S. & Leibler, R. A. (1951) On Information and Sufficiency. The Annals of Mathematical Statistics 22(1): 79–86 (cit. on p. 130).
Lafon, S. & Lee, A. (2006) Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization. IEEE Trans. Pattern Anal. and Mach. Intel 28: 1393–1403 (cit. on p. 134).
Lander, E. S. (1996) The new genomics: global views of biology. Science 274(5287): 536–9 (cit. on p. 36).
Lander, E. S. & Schork, N. J. (1994) Genetic dissection of complex traits. Science 265(5181): 2037–48 (cit. on pp. 44, 48).
Lango, A. H., Estrada, K., Lettre, G., Berndt, S. I., Weedon, M. N., Rivadeneira, F., Willer, C. J., Jackson, A. U., Vedantam, S., Raychaudhuri, S., Ferreira, T., Wood, A. R., Weyant, R. J., Segrè, A. V., Speliotes, E. K., Wheeler, E., Soranzo, N., Park, J.-H., Yang, J., Gudbjartsson, D., & al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467(7317): 832–838 (cit. on p. 38).
Laparra, V., Malo, J., & Camps-Valls, G. (2015) Dimensionality Reduction via Regression in Hyperspectral Imagery.
IEEE Journal of Selected Topics in Signal Processing 9(6): 1026–1036 (cit. on pp. 128, 129, 134).
Laske, T. G. & Iaizzo, P. A. (2005) The Cardiac Conduction System. Handbook of Cardiac Anatomy, Physiology, and Devices. Ed. by P. A. Iaizzo. Totowa, NJ: Humana Press: 123–136 (cit. on p. 64).
Ledoit, O. & Wolf, M. (2004) A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88(2): 365–411 (cit. on p. 104).
Lee, J. A. & Verleysen, M. (2009) Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing 72(7): 1431–1443 (cit. on p. 137).
Lee, S., Yang, J., Goddard, M., Visscher, P., & Wray, N. (2012) Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics 28(19): 2540–2542 (cit. on p. 88).
Lee, S., Goddard, M. E., Visscher, P. M., & van der Werf, J. H. (2010) Using the realized relationship matrix to disentangle confounding factors for the estimation of genetic variance components of complex traits. Genetics Selection Evolution 42(1): 22 (cit. on p. 52).
Lee, J.-Y., Lee, B.-S., Shin, D.-J., Woo Park, K., Shin, Y.-A., Joong Kim, K., Heo, L., Young Lee, J., Kyoung Kim, Y., Jin Kim, Y., Bum Hong, C., Lee, S.-H., Yoon, D., Jung Ku, H., Oh, I.-Y., Kim, B.-J., Lee, J., Park, S.-J., Kim, J., Kawk, H.-k., & al. (2013) A genome-wide association study of a coronary artery disease risk variant. Journal of Human Genetics 58(3): 120–126 (cit. on p. 189).
Lettre, G., Lange, C., & Hirschhorn, J. N. (2007) Genetic model testing and statistical power in population-based association studies of quantitative traits. Genetic Epidemiology 31(4): 358–362 (cit. on p. 42).
Levy, D., Larson, M. G., Benjamin, E. J., Newton-Cheh, C., Wang, T. J., Hwang, S.-J., Vasan, R. S., & Mitchell, G. F. (2007) Framingham Heart Study 100K Project: genome-wide associations for blood pressure and arterial stiffness.
BMC Medical Genetics 8(Suppl 1): S3 (cit. on p. 69).
Lewontin, R. C. (1970) The Units of Selection. Annual Review of Ecology and Systematics 1: 1–18 (cit. on p. 24).
Lewontin, R. C. & Kojima, K.-i. (1960) The Evolutionary Dynamics of Complex Polymorphisms. Evolution 14(4): 458 (cit. on p. 36).
Li, D., Deogun, J., Spaulding, W., & Shuart, B. (2004) “Towards Missing Data Imputation: A Study of Fuzzy K-means Clustering Method”. Rough Sets and Current Trends in Computing. Springer, Berlin, Heidelberg: 573–579 (cit. on pp. 49, 107).
Li, Y., Willer, C. J., Ding, J., Scheet, P., & Abecasis, G. R. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology 34(8): 816–834 (cit. on p. 37).
Lila, E., Aston, J. A. D., & Sangalli, L. M. (2016) Smooth Principal Component Analysis over two-dimensional manifolds with an application to Neuroimaging. arXiv: 1601.03670 (cit. on p. 195).
Lindgren, C. M., Heid, I. M., Randall, J. C., Lamina, C., Steinthorsdottir, V., Qi, L., Speliotes, E. K., Thorleifsson, G., Willer, C. J., Herrera, B. M., Jackson, A. U., Lim, N., Scheet, P., Soranzo, N., Amin, N., Aulchenko, Y. S., Chambers, J. C., Drong, A., Luan, J., Lyon, H. N., & al. (2009) Genome-Wide Association Scan Meta-Analysis Identifies Three Loci Influencing Adiposity and Fat Distribution. PLoS Genetics 5(6). Ed. by D. B. Allison: e1000508 (cit. on p. 38).
Lippert, C., Casale, F. P., Rakitsch, B., & Stegle, O. (2014) LIMIX: genetic analysis of multiple traits. bioRxiv: 003905 (cit. on pp. 57, 90, 103, 194).
Lippert, C., Listgarten, J., Liu, Y., Kadie, C. M., Davidson, R. I., & Heckerman, D. (2011) FaST linear mixed models for genome-wide association studies. Nature Methods 8(10): 833–837 (cit. on pp. 38, 51, 59, 89).
Lippert, C., Quon, G., Kang, E. Y., Kadie, C. M., Listgarten, J., & Heckerman, D. (2013) The benefits of selecting phenotype-specific variants for applications of mixed models in genomics.
Scientific Reports 3: 1815 (cit. on pp. 53, 72).
Liti, G., Carter, D. M., Moses, A. M., Warringer, J., Parts, L., James, S. A., Davey, R. P., Roberts, I. N., Burt, A., Koufopanou, V., Tsai, I. J., Bergman, C. M., Bensasson, D., O’Kelly, M. J. T., van Oudenaarden, A., Barton, D. B. H., Bailes, E., Nguyen, A. N., Jones, M., Quail, M. A., & al. (2009) Population genomics of domestic and wild yeasts. Nature 458(7236): 337 (cit. on p. 117).
Little, R. J. A. (1988) A Test of Missing Completely at Random for Multivariate Data with Missing Values. Source Journal of the American Statistical Association 83(404): 1198–1202 (cit. on pp. 107, 110, 113, 114).
Little, R. J. A. & Rubin, D. B. (2002) Statistical analysis with missing data. Ed. by D. J. Balding, P. Bloomfield, N. A. C. Cressie, N. I. Fisher, I. M. Johnstone, J. B. Kadane, L. M. Ryan, D. W. Scott, A. F. M. Smith, & J. L. Teugels. 2nd. New Jersey: John Wiley & Sons, Inc: 408 (cit. on pp. 106–108).
Liu, F., van der Lijn, F., Schurmann, C., Zhu, G., Chakravarty, M. M., Hysi, P. G., Wollstein, A., Lao, O., de Bruijne, M., Ikram, M. A., van der Lugt, A., Rivadeneira, F., Uitterlinden, A. G., Hofman, A., Niessen, W. J., Homuth, G., de Zubicaray, G., McMahon, K. L., Thompson, P. M., Daboul, A., & al. (2012) A Genome-Wide Association Study Identifies Five Loci Influencing Facial Morphology in Europeans. PLoS Genetics 8(9). Ed. by G. Gibson: e1002932 (cit. on pp. 54, 141, 144, 177, 194).
Liu, Y., Athanasiadis, G., & Weale, M. E. (2008) A survey of genetic simulation software for population and epidemiological studies. Human Genomics 3(1): 79 (cit. on p. 72).
Lock, R. H. (1906) Recent progress in the study of variation, heredity, and evolution. London: J. Murray: 352 (cit. on pp. 26, 28).
Loh, P.-r., Tucker, G., Bulik-sullivan, B. K., & Vilhj, B. J. (2014) Efficient Bayesian mixed model analysis increases association power in large cohorts. Nature Genetics 47(3): 1–79 (cit. on pp. 71, 72, 78, 89).
Lohmueller, K.
E., Pearce, C. L., Pike, M., Lander, E. S., & Hirschhorn, J. N. (2003) Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nature Genetics 33(2): 177–182 (cit. on p. 35). Lopes, M. S., Silva, F. F., Harlizius, B., Duijvesteijn, N., Lopes, P. S., Guimarães, S. E., & Knol, E. F. (2013) Improved estimation of inbreeding and kinship in pigs using optimized SNP panels. BMC Genetics 14(1): 92 (cit. on p. 52). Lorell, B. H. & Carabello, B. A. (2000) Left ventricular hypertrophy: pathogenesis, detection, and prognosis. Circulation 102(4) (cit. on pp. 67, 155). Lu, X., Wang, L., Chen, S., He, L., Yang, X., Shi, Y., Cheng, J., Zhang, L., Gu, C. C., Huang, J., Wu, T., Ma, Y., Li, J., Cao, J., Chen, J., Ge, D., Fan, Z., Li, Y., Zhao, L., Li, H., & al. (2012) Genome-wide association study in Han Chinese identifies four new susceptibility loci for coronary artery disease. Nature Genetics 44(8): 890–894 (cit. on p. 38). Lynch, M. & Ritland, K. (1999) Estimation of pairwise relatedness with molecular markers. Genetics 152(4): 1753–66 (cit. on p. 49). Maaten, L. V. D. &Hinton, G. (2008) VisualizingData using t-SNE. Journal ofMachine Learning Research 9: 2579–2605 (cit. on pp. 131, 134). MacArthur, J., Bowler, E., Cerezo,M.,Gil, L.,Hall, P.,Hastings, E., Junkins,H.,McMa- hon,A.,Milano,A.,Morales, J., Pendlington, Z.M.,Welter, D., Burdett, T.,Hindorff, L., Flicek, P., Cunningham, F., & Parkinson, H. (2017) The new NHGRI-EBI Cata- log of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research 45(D1): D896–D901 (cit. on pp. 38, 68, 171, 190). Mackay, J., Mensah, G., & A, O. (2004) The Atlas of Heart Disease And Stroke. Ed. by A. Haley. 1st ed. World Health Organisation (cit. on p. 66). 243 Malosetti, M., van der Linden, C. G., Vosman, B., & van Eeuwijk, F. A. 
(2007) A Mixed-ModelApproach toAssociationMappingUsingPedigree InformationWith an Illustration of Resistance to Phytophthora infestans in Potato. Genetics 175(2): 879–889 (cit. on p. 50). Marchini, J., Cardon, L. R., Phillips, M. S., & Donnelly, P. (2004) The effects of human population structure on large genetic association studies. Nature Genetics 36(5): 512–7 (cit. on pp. 44, 47). Marchini, J. & Howie, B. (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11(7): 499–511 (cit. on pp. 37, 159). Marchini, J., Howie, B., Myers, S., McVean, G., & Donnelly, P. (2007) A new multi- point method for genome-wide association studies by imputation of genotypes. Nature Genetics 39(7): 906–13 (cit. on pp. 76, 158). Marigorta, U. M. & Gibson, G. (2014) A simulation study of gene-by-environment interactions in GWAS implies ample hidden effects. Frontiers in Genetics 5(July): 225 (cit. on pp. 71, 78). Martinez-Jimenez, C. P., Eling, N., Chen, H.-C., Vallejos, C. A., Kolodziejczyk, A. A., Connor, F., Stojic, L., Rayner, T. F., Stubbington, M. J. T., Teichmann, S. A., de la Roche, M., Marioni, J. C., & Odom, D. T. (2017) Aging increases cell-to-cell tran- scriptional variability upon immune stimulation. Science 355(6332): 1433–1436 (cit. on p. 132). Matthaei, H. J., Jones, O. W., Martig, R. G., & W, N. M. (1962) Characteristics and composition of RNA coding units. Proceedings of the National Academy of Sciences of the United States of America 48: 666–677 (cit. on p. 32). Maxam, A. M. & Gilbert, W. (1977) A new method for sequencing DNA. Proceedings of the National Academy of Sciences of the United States of America 74(2): 560–4 (cit. on p. 32). Meder, B., Katus, H. A., & Keller, A. (2016) Computational Cardiology - A New Dis- cipline of Translational Research. Genomics, Proteomics & Bioinformatics 14(4): 177– 8 (cit. on p. 195). Mendel, G. (1866) Versuche über Plflanzenhybriden.Verhandlungen des naturforschen- den Vereines in Brünn, Bd. 
IV für das Jahr 1865: 3–47 (cit. on p. 26). Mendel, G. (1869)Ueber einige aus künstlicher BefruchtunggewonnenenHieracium- Bastarde.Verhandlungen des naturforschendenVereines in Brünn 8: 26–31 (cit. on p. 26). 244 Meyer, H. V. (2017) R Package: PhenotypeSimulator: Flexible Phenotype Simulation from Different Genetic and Noise Models. Cambridge (cit. on pp. 72, 193). Meyer, H. V. & Birney, E. (2018) PhenotypeSimulator: A comprehensive framework for simulatingmulti-trait, multi-locus genotype to phenotype relationships. Bioin- formatics (cit. on p. 69). Meyer, H. V., Casale, F. P., Stegle, O., & Birney, E. (2018) LiMMBo: a simple, scalable approach for linearmixedmodels in high-dimensional genetic association studies. bioRxiv: 255497 (cit. on pp. 69, 89). Miescher, F. (1871) Ueber die chemische Zusammensetzung der Eiterzellen.Medizi- nisch-chemische Untersuchungen 4: 441–460 (cit. on p. 27). Minchin, P. R. (1987) An evaluation of the relative robustness of techniques for eco- logical ordination. Vegetatio 69: 89–107 (cit. on p. 128). Mitchell, L. E., Agopian, a. J., Bhalla, a., Glessner, J. T., Kim, C. E., Swartz, M. D., Hakonarson, H., & Goldmuntz, E. (2015) Genome-wide association study of ma- ternal and inherited effects on left-sided cardiac malformations.Human Molecular Genetics 24(1): 265–273 (cit. on p. 69). Mitchell, S. C., Korones, S. B., & Berendes, H. W. (1971) Congenital heart disease in 56,109 births incidence and natural history. Circulation 43(3) (cit. on p. 67). Monaghan, F. V. & Corcos, A. F. (1986) Tschermak: a non-discoverer of Mendelism: I. An historical note. Journal of Heredity 77(6): 468–469 (cit. on p. 26). Monaghan, F. V. & Corcos, A. F. (1987) Tschermak: a non-discoverer of Mendelism II. A critique. Journal of Heredity 78(3): 208–210 (cit. on p. 26). Monserrat, L., Hermida-Prieto, M., Fernandez, X., Rodriguez, I., Dumont, C., Cazon, L., Cuesta,M. G., Gonzalez-Juanatey, C., Peteiro, J., Alvarez, N., Penas-Lado,M., & Castro-Beiras, A. 
(2007) Mutation in the alpha-cardiac actin gene associated with apical hypertrophic cardiomyopathy, left ventricular non-compaction, and septal defects. European Heart Journal 28(16): 1953–1961 (cit. on p. 180). Moorman, A. F. M. & Lamers, W. H. (1994) Molecular anatomy of the developing heart. Trends in Cardiovascular Medicine 4(6): 257–264 (cit. on p. 64). Morgan, T. H. (1910) Sex limited inheritance in Drosophila. Science 32(812): 120–2 (cit. on p. 29). Morgan, T. H. (1911a) An attempt to analyze the constitution of the chromosomes on the basis of sex-limited inheritance in Drosophila. Journal of Experimental Zoology 11(4): 365–413 (cit. on p. 29). 245 Morgan, T. H. (1911b) Random segregation versus coupling in Mendelian inherit- ance. Science 34(873) (cit. on p. 29). Morgan, T. H., Sturtevant, A. H., Muller, H. J., & Bridges, C. B. (1915) The mechanism of Mendelian heredity. New York: H. Holt & company: 288 (cit. on p. 30). Moric-Janiszewska, E.&Markiewicz-Łoskot, G. (2008)GeneticHeterogeneity of Left- ventricularNoncompactionCardiomyopathy.Clinical Cardiology 31(5): 201–204 (cit. on p. 180). Morris, J. A., Randall, J. C., Maller, J. B., & Barrett, J. C. (2010) Evoker: A visualization tool for genotype intensity data. Bioinformatics 26(14): 1786–1787 (cit. on p. 178). Morrow, J. F. & Berg, P. (1972) Cleavage of Simian virus 40 DNA at a unique site by a bacterial restriction enzyme. Proceedings of the National Academy of Sciences of the United States of America 69(11): 3365–9 (cit. on p. 32). Morton,N. E. (1955) Sequential tests for the detection of linkage.The American Journal of Human Genetics 7(3): 277–318 (cit. on p. 34). Mysliwiec, M. R., Bresnick, E. H., & Lee, Y. (2011) Endothelial Jarid2/Jumonji is re- quired for normal cardiac development and proper Notch1 expression. The Journal of Biological Chemistry 286(19): 17193–204 (cit. on p. 180). Nakayama, M., Nakajima, D., Nagase, T., Nomura, N., Seki, N., & Ohara, O. 
(1998) Identification of high-molecular-weight proteins withmultiple EGF-likemotifs by motif-trap screening. Genomics 51(1): 27–34 (cit. on p. 170). Naylor, M. G., Lin, X., Weiss, S. T., Raby, B. A., & Lange, C. (2010) Using Canonical Correlation Analysis to Discover Genetic Regulatory Variants. PLoS ONE 5(5). Ed. by A. C. Goldberg: e10395 (cit. on p. 55). Nejati-Javaremi, A., Smith, C., & Gibson, J. P. (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. Journal of Animal Science 75(7): 1738–45 (cit. on p. 52). Newton-Cheh, C., Guo, C.-Y., Wang, T. J., O’donnell, C. J., Levy, D., & Larson, M. G. (2007)Genome-wide association study of electrocardiographic andheart rate vari- ability traits: the FraminghamHeart Study. BMCMedical Genetics 8 Suppl 1: S7 (cit. on p. 69). Nikpay, M., Goel, A., Won, H.-H., Hall, L. M., Willenborg, C., Kanoni, S., Saleheen, D., Kyriakou, T., Nelson, C. P., Hopewell, J. C., Webb, T. R., Zeng, L., Dehghan, A., Alver, M., Armasu, S. M., Auro, K., Bjonnes, A., Chasman, D. I., Chen, S., Ford, I., & al. (2015) A comprehensive 1,000 Genomes-based genome-wide association 246 meta-analysis of coronary artery disease. Nature Genetics 47(10): 1121–30 (cit. on pp. 69, 189). Nirenberg, M. W. & Matthaei, J. H. (1961) The dependence of cell-free protein syn- thesis in E. coli upon naturally occurring or synthetic polyribonucleotides.Proceed- ings of the National Academy of Sciences of the United States of America 47(10): 1588– 602 (cit. on p. 32). Noguchi, E., Sakamoto, H., Hirota, T., Ochiai, K., Imoto, Y., Sakashita, M., Kurosaka, F., Akasawa, A., Yoshihara, S., Kanno, N., Yamada, Y., Shimojo, N., Kohno, Y., Su- zuki, Y., Kang, M.-J., Kwon, J.-W., Hong, S.-J., Inoue, K., Goto, Y.-i., Yamashita, F., & al. (2011) Genome-Wide Association Study Identifies HLA-DP as a Susceptib- ility Gene for Pediatric Asthma in Asian Populations. PLoS Genetics 7(7). Ed. by M. I. McCarthy: e1002170 (cit. on p. 38). O’Brien, P. 
C. (1984) Procedures for Comparing Samples with Multiple Endpoints. Biometrics 40(40): 1079–1087 (cit. on p. 55). O’Reilly, P. F., Hoggart, C. J., Pomyen, Y., Calboli, F. C. F., Elliott, P., Jarvelin, M.-R., & Coin, L. J. M. (2012) MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PloS one 7(5): e34861 (cit. on p. 76). O’Toole, T. E., Conklin, D. J., & Bhatnagar, A. (2008) Environmental risk factors for heart disease. Reviews on Environmental Health 23(3): 167–202 (cit. on pp. 67, 155). Oechslin, E. N., Attenhofer Jost, C. H., Rojas, J. R., Kaufmann, P. A., & Jenni, R. (2000) Long-term follow-up of 34 adults with isolated left ventricular noncompaction: a distinct cardiomyopathy with poor prognosis. Journal of the American College of Cardiology 36(2): 493–500 (cit. on p. 189). Oliveira, A. & Seijas-Macias, A. (2012) AnApproach to Distribution of the Product of Two Normal Variables. Discussiones Mathematicae: Probability and Statistics 32(1-2): 87 (cit. on p. 79). Ott, J., Wang, J., & Leal, S. M. (2015) Genetic linkage analysis in the age of whole- genome sequencing. Nature Reviews Genetics 16(5): 275–284 (cit. on p. 36). Paige, S. L., Plonowska, K., Xu, A., &Wu, S. M. (2015) Molecular regulation of cardi- omyocyte differentiation. Circulation Research 116(2): 341–53 (cit. on p. 66). Park, D. & Fishman, G. (2017) Development and Function of the Cardiac Conduction System in Health and Disease. Journal of Cardiovascular Development and Disease 4(2): 7 (cit. on p. 66). 247 Parkhomenko, E., Tritchler, D., & Beyene, J. (2009) Sparse Canonical Correlation Analysis with Application to Genomic Data Integration. Statistical Applications in Genetics and Molecular Biology 8(1) (cit. on p. 55). Paternoster, L., Zhurov, A. I., Toma, A. M., Kemp, J. P., St. Pourcain, B., Timpson, N. J., McMahon, G.,McArdle,W., Ring, S.M., Smith, G. D., Richmond, S., & Evans, D. M. 
(2012) Genome-wide Association Study of Three-Dimensional Facial Mor- phology Identifies a Variant in PAX3 Associated with Nasion Position. The Amer- ican Journal of Human Genetics 90(3): 478–485 (cit. on p. 38). Patterson, H. D. & Thompson, R. (1971) Recovery of inter-block information when block sizes are unequal. Biometrika 58(3): 545–554 (cit. on p. 41). Patterson,N., Price, A. L., Reich, D., Reich, D., &Daly,M. (2006) Population Structure and Eigenanalysis. PLoS Genetics 2(12): e190 (cit. on pp. 53, 169). Pausova, Z., Paus, T., Abrahamowicz, M., Almerigi, J., Arbour, N., Bernard, M., Gau- det, D., Hanzalek, P., Hamet, P., Evans, A. C., Kramer, M., Laberge, L., Leal, S. M., Leonard, G., Lerner, J., Lerner, R. M., Mathieu, J., Perron, M., Pike, B., Pitiot, A., & al. (2007) Genes, maternal smoking, and the offspring brain and body during adolescence: Design of the Saguenay Youth Study. Human Brain Mapping 28(6): 502–518 (cit. on p. 141). Payne, R. M., Johnson, M. C., Grant, J. W., & Strauss, A. W. (1995) Toward a Molecu- lar Understanding of Congenital Heart Disease. Circulation 91(2): 494–504 (cit. on p. 155). Pearson, K. (1900) On the criterion that a given system of deviations from the prob- able in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine 50(302): 157–175 (cit. on p. 27). Pearson, K. (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559–572 (cit. on pp. 27, 125). Peng, B., Amos, C. I., & Kimmel, M. (2007) Forward-Time Simulations of Human Populations with Complex Diseases. PLoS Genetics 3(3): e47 (cit. on p. 72). Penrose, L. S. (1935) The detection of autosomal linkage in data which consist of airs of brothers and sisters of unspecified parentage. Annals of Eugenics 6(2): 133–138 (cit. on pp. 33, 34). 248 Petersen, S. E., Selvanayagam, J. B., Wiesmann, F., Robson, M. D., Francis, J. 
M., An- derson, R.H.,Watkins,H.,&Neubauer, S. (2005) Left VentricularNon-Compaction. Journal of the American College of Cardiology 46(1): 101–105 (cit. on p. 180). Pickrell, J. K., Berisa, T., Liu, J. Z., Ségurel, L., Tung, J. Y., & Hinds, D. A. (2016) De- tection and interpretation of shared genetic influences on 42 human traits. Nature Genetics advance on (cit. on p. 38). Piernick, L. K. ( & Correns, C. ( (1950) G. Mendel’s law concerning behavior of pro- geny of varietal hybrids. Genetics 35(5): 33–41 (cit. on p. 26). Pigott, T. D. (2001) A Review of Methods for Missing Data. Educational Research and Evaluation 7(4): 353–383 (cit. on p. 113). Plate, C. (1910) “Die Schwanzknickblastovariation”. Festschrift für Richard Hertwig, Zweiter Band. Jena: Gustav Fischer. Chap. Vererbungs: 537–610 (cit. on p. 29). Porter, H. F. & O’Reilly, P. F. (2017) Multivariate simulation framework reveals per- formance of multi-trait GWAS methods. Scientific Reports 7: 38837 (cit. on pp. 76, 78). Post, W. S., Larson, M. G., Myers, R. H., Galderisi, M., & Levy, D. (1997) Heritability of Left Ventricular Mass : The FraminghamHeart Study.Hypertension 30(5): 1025– 1028 (cit. on p. 155). Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8): 904–909 (cit. on pp. 49, 169). Pritchard, J. K., Stephens, M., Rosenberg, N. A., & Donnelly, P. (2000) Association mapping in structured populations. The American Journal of Human Genetics 67(1): 170–81 (cit. on p. 49). Pruim,R. J.,Welch, R. P., Sanna, S., Teslovich, T.M., Chines, P. S., Gliedt, T. P., Boehnke, M., Abecasis, G. R., &Willer, C. J. (2010) LocusZoom: regional visualization of ge- nome-wide association scan results.Bioinformatics 26(18): 2336–2337 (cit. on pp. 172, 188). Pulst, S. M. (1999) Genetic Linkage Analysis. Archives of Neurology 56(6): 667 (cit. on p. 34). 
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., Maller, J., Sklar, P., de Bakker, P. I. W., Daly, M. J., & Sham, P. C. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3): 559–75 (cit. on p. 158). 249 Rakitsch, B., Lippert, C., Borgwardt, K., & Stegle, O. (2013) It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals (cit. on p. 90). Reich, D. E. & Goldstein, D. B. (2001) Detecting association in a case-control study while correcting for population stratification. Genetic Epidemiology 20(1): 4–16 (cit. on pp. 35, 36, 48, 104). Ripley, B. D. (1996) Pattern recognition and neural networks. 7th. Cambridge: Cam- bridge University Press: 416 (cit. on p. 134). Risch, N. & Merikangas, K. (1996) The future of genetic studies of complex human diseases. Science 273(5281): 1516–7 (cit. on p. 35). Ritland, K. (2000) Marker-inferred relatedness as a tool for detecting heritability in nature.Molecular Ecology 9(9): 1195–204 (cit. on p. 49). Ritter,M., Oechslin, E., Sütsch, G., Attenhofer, C., Schneider, J., & Jenni, R. (1997) Isol- ated noncompaction of the myocardium in adults. Mayo Clinic proceedings 72(1): 26–31 (cit. on p. 189). Rosenberg, N. A., Mahajan, S., Ramachandran, S., Zhao, C., Pritchard, J. K., & Feld- man,M.W. (2005) Clines, Clusters, and the Effect of StudyDesign on the Inference of Human Population Structure. PLoS Genetics 1(6): e70 (cit. on p. 49). Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H.M., Kidd, K. K., Zhivotovsky, L. A., & Feldman, M. W. (2002) Genetic Structure of Human Populations. Science 298(5602) (cit. on p. 49). Roweis, S. T.& Saul, L. K. (2000)Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500): 2323–2326 (cit. on pp. 129, 142). Rubin,D. B. (1976) Inference andmissingdata.Biometrika 63(3): 581–92 (cit. on pp. 106, 110). Rubin, D. B. 
(1987)Multiple Imputation for nonresponse in surveys. 2nd. NewYork: John Wiley & Sons, Inc (cit. on pp. 108, 113). Sabatti, C., Service, S. K., Hartikainen, A.-L., Pouta, A., Ripatti, S., Brodsky, J., Jones, C. G., Zaitlen, N. A., Varilo, T., Kaakinen, M., Sovio, U., Ruokonen, A., Laitinen, J., Jakkula, E., Coin, L., Hoggart, C., Collins, A., Turunen, H., Gabriel, S., Elliot, P., & al. (2009) Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics 41(1): 35–46 (cit. on p. 186). Sanger, F., Nicklen, S., & Coulson, A. R. (1977) DNA sequencing with chain-termi- nating inhibitors. Proceedings of the National Academy of Sciences of the United States of America 74(12): 5463–5467 (cit. on p. 32). 250 Sankaranarayanan, K. (1998) Ionizing radiation and genetic risks: IX. Estimates of the frequencies of mendelian diseases and spontaneous mutation rates in human populations: a 1998 perspective.Mutation Research 411(2): 129–178 (cit. on p. 35). Sano,M., Kamitsuji, S., Kamatani,N., Tabara, Y., Kawaguchi, T.,Matsuda, F., Yamagishi, H., Fukuda, K., & (JPDSC), J. P. D. S. C. (2016) Genome-Wide Association Study of Absolute QRS Voltage Identifies Common Variants of TBX3 as Genetic Determin- ants of Left Ventricular Mass in a Healthy Japanese Population. PLoS ONE 11(5). Ed. by T. Minamino: e0155550 (cit. on p. 155). Sanoudou,D., Vafiadaki, E., Arvanitis, D. a., Kranias, E., &Kontrogianni-Konstantopoulos, A. (2005) Array lessons from the heart: focus on the genome and transcriptome of cardiomyopathies. Physiological genomics 21(2): 131–143 (cit. on p. 155). Sarkar, S. K. (2007) Stepup procedures and controlling generalized FWER and gen- eralized FDR. The Annals of Statistics 35(6): 2405–2420. arXiv: arXiv:0803.2934v1 (cit. on p. 46). Schafer, J. L. (1997)Analysis of incomplete multivariate data. Chapman&Hall/CRC (cit. on pp. 108, 114). Schafer, J. L. & Graham, J. W. (2002) Missing data: Our view of the state of the art. 
Psychological Methods 7(2): 147–177 (cit. on p. 110). Schäfer, J. & Strimmer, K. (2005) A shrinkage approach to large-scale covariancemat- rix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4: Article32 (cit. on pp. 55, 104). Schoelkopf, B., Smola, A., & Uller, K.-R. (1998) Nonlinear Component Analysis as a Kernel Eigenvalue Problem.Neural Computation 10: 1299–1319 (cit. on pp. 126, 134). Schor, I. E., Degner, J. F., Harnett, D., Cannavò, E., Casale, F. P., Shim, H., Garfield, D. A., Birney, E., Stephens,M., Stegle, O., &MFurlong, E. E. (2017) Promoter shape varies across populations and affects promoter evolution and expression noise. Nature Publishing Group 49 (cit. on p. 90). Schott, J.-J., Benson, D. W., Basson, C. T., Pease, W., Silberbach, G. M., Moak, J. P., Maron, B. J., Seidman, C. E., & Seidman, J. G. (1998) Congenital Heart Disease Caused by Mutations in the Transcription Factor NKX2-5. Science 281(5373) (cit. on p. 68). Schunkert, H., König, I. R., Kathiresan, S., Reilly, M. P., Assimes, T. L., Holm, H., Preuss, M., Stewart, A. F. R., Barbalic, M., Gieger, C., Absher, D., Aherrahrou, Z., Allayee, H., Altshuler, D., Anand, S. S., Andersen, K., Anderson, J. L., Ardissino, 251 D., Ball, S. G., Balmforth, A. J., & al. (2011) Large-scale association analysis iden- tifies 13 new susceptibility loci for coronary artery disease. Nature Genetics 43(4): 333–338 (cit. on p. 189). Scuteri, A., Sanna, S., Chen, W.-M., Uda, M., Albai, G., Strait, J., Najjar, S., Nagaraja, R., Orrú, M., Usala, G., Dei, M., Lai, S., Maschio, A., Busonero, F., Mulas, A., Ehret, G. B., Fink, A. A., Weder, A. B., Cooper, R. S., Galan, P., & al. (2007) Genome-Wide Association Scan Shows Genetic Variants in the FTO Gene Are Associated with Obesity-Related Traits. PLoS Genetics 3(7): e115 (cit. on pp. 37, 43). Seidman, J. G.& Seidman,C. (2001) The genetic basis for cardiomyopathy.Cell 104(4): 557–567 (cit. on p. 67). Serre, D. 
& Pääbo, S. (2004) Evidence for Gradients of Human Genetic Diversity Within and Among Continents. Genome Research 14(9): 1679–1685 (cit. on p. 49). Shaffer, J. R., Orlova, E., Lee, M. K., Leslie, E. J., Raffensperger, Z. D., Heike, C. L., Cunningham, M. L., Hecht, J. T., Kau, C. H., Nidey, N. L., Moreno, L. M., Wehby, G. L., Murray, J. C., Laurie, C. A., Laurie, C. C., Cole, J., Ferrara, T., Santorico, S., Klein, O., Mio, W., & al. (2016) Genome-Wide Association Study Reveals Multiple Loci Influencing Normal Human Facial Morphology. PLoS Genetics 12(8). Ed. by G. S. Barsh: e1006149 (cit. on pp. 54, 177). Shaffer, J. P. (1995) Multiple Hypothesis Testing.Annual Review of Psychology 46: 561– 84 (cit. on pp. 44, 46). Shendure, J. & Ji, H. (2008) Next-generation DNA sequencing. Nature Biotechnology 26(10): 1135–1145 (cit. on p. 33). Shin, S.-Y., Fauman, E. B., Petersen, A.-K., Krumsiek, J., Santos, R., Huang, J., Arnold, M., Erte, I., Forgetta, V., Yang, T.-P., Walter, K., Menni, C., Chen, L., Vasquez, L., Valdes, A. M., Hyde, C. L., Wang, V., Ziemek, D., Roberts, P., Xi, L., & al. (2014) An atlas of genetic influences on human blood metabolites. Nature Genetics 46(6): 543–550 (cit. on p. 105). Shriner, D. (2012) Moving toward System Genetics through Multiple Trait Analysis in Genome-Wide Association Studies. Frontiers in Genetics 3: 1 (cit. on pp. 54, 56). Sigg, D. C., Iaizzo, P. A., Xiao, Y.-F., & Bin, H. (2010) Cardiac ElectrophysiologyMethods and Models. New York: Springer US: 492 (cit. on p. 64). Smith, H. O. & Welcox, K. W. (1970) A Restriction enzyme from Hemophilus influ- enzae. Journal of Molecular Biology 51(2): 379–391 (cit. on p. 32). 252 Soler, R., Rodríguez, E., Monserrat, L., & Alvarez, N. (2002) MRI of subendocardial perfusion deficits in isolated left ventricular noncompaction. Journal of Computer Assisted Tomography 26(3): 373–5 (cit. on p. 189). Song, W. T. (2005) Relationships among some univariate distributions. IIE Transac- tions 37: 651–656 (cit. 
on p. 79). Southern, E.M. (1975) Detection of specific sequences amongDNA fragments separ- ated by gel electrophoresis. Journal of Molecular Biology 98(3): 503–17 (cit. on p. 32). Speed, D., Cai, N., Johnson, M. R., Nejentsev, S., Balding, D. J., & Balding, D. J. (2017) Reevaluation of SNP heritability in complex human traits. Nature Genetics 49(7): 986–992 (cit. on p. 54). Speliotes, E. K., Willer, C. J., Berndt, S. I., Monda, K. L., Thorleifsson, G., Jackson, A. U., Lango Allen, H., Lindgren, C. M., Luan, J., Mägi, R., Randall, J. C., Vedan- tam, S., Winkler, T. W., Qi, L., Workalemahu, T., Heid, I. M., Steinthorsdottir, V., Stringham, H. M., Weedon, M. N., Wheeler, E., & al. (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with bodymass index.Nature Genetics 42(11): 937–48 (cit. on p. 38). Spielman, R. S., McGinnis, R. E., & Ewens, W. J. (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). The American Journal of Human Genetics 52(3): 506–16 (cit. on pp. 35, 44, 47). Staden, R. (1979) A strategy of DNA sequencing employing computer programs. Nucleic Acids Research 6(7): 2601–10 (cit. on p. 32). Stegle, O., Parts, L., Durbin, R., & Winn, J. (2010) A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. en. PLoS Computational Biology 6(5): e1000770 (cit. on pp. 127, 134). Stegle, O., Parts, L., Piipari, M., Winn, J., & Durbin, R. (2012) Using probabilistic estimation of expression residuals (PEER) to obtain increased power and inter- pretability of gene expression analyses.Nature Protocols 7(3): 500–7 (cit. on pp. 127, 133, 195). Stein, J. L., Hua, X., Lee, S., Ho, A. J., Leow, A. D., Toga, A. W., Saykin, A. J., Shen, L., Foroud, T., Pankratz, N., Huentelman, M. J., Craig, D. W., Gerber, J. D., Allen, A. N., Corneveaux, J. J., Dechairo, B.M., Potkin, S. G.,Weiner, M.W., & Thompson, 253 P. 
(2010) Voxelwise genome-wide association study (vGWAS). NeuroImage 53(3): 1160–74 (cit. on pp. 54, 177). Stein, M. B., Campbell-Sills, L., & Gelernter, J. (2009) Genetic variation in 5HTTLPR is associated with emotional resilience. American journal of medical genetics. Part B, Neuropsychiatric genetics 150B(7): 900–6 (cit. on p. 49). Stephens, M. (2013) A unified framework for association analysis with multiple re- lated phenotypes. PloS one 8(7): e65245 (cit. on pp. 71, 78). Stephens, M., Smith, N. J., & Donnelly, P. (2001) ANew Statistical Method for Haplo- type Reconstruction from Population Data. The American Journal of Human Genetics 68(4): 978–989 (cit. on p. 36). Sterne, J. A., Smith, G. D., & Cox, D. R. (2001) Sifting the evidence—what’s wrong with significance tests? British Medical Journal 322(7280): 226 (cit. on p. 43). Storey, J. D. (2002) A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B 64(3): 479–498 (cit. on p. 46). Strittmatter, W. J. & Roses, A. D. (1996) Apolipoprotein E and Alzheimer’s Disease. Annual Review of Neuroscience 19(1): 53–77 (cit. on p. 35). Sturtevant, A. H. (1913) The linear arrangement of six sex-linked factors in Droso- phila, as shown by their mode of association. Journal of Experimental Zoology 14(1): 43–59 (cit. on p. 30). Su, Z., Marchini, J., & Donnelly, P. (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27(16): 2304–2305 (cit. on pp. 72, 73, 82). Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., El- liott, P., Green, J., Landray, M., Liu, B., Matthews, P., Ong, G., Pell, J., Silman, A., Young, A., Sprosen, T., Peakman, T., & Collins, R. (2015) UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoSMedicine 12(3): e1001779 (cit. on pp. 37, 38, 104, 190). Sulem, P., Gudbjartsson, D. F., Stacey, S. 
N., Helgason, A., Rafnar, T., Jakobsdottir, M., Steinberg, S., Gudjonsson, S. A., Palsson, A., Thorleifsson, G., Pálsson, S., Sig- urgeirsson, B., Thorisdottir, K., Ragnarsson, R., Benediktsdottir, K. R., Aben, K. K., Vermeulen, S. H., Goldstein, A. M., Tucker, M. A., Kiemeney, L. A., & al. (2008) Two newly identified genetic determinants of pigmentation in Europeans. Nature Genetics 40(7): 835–837 (cit. on p. 38). Sutton, W. S. (1903) The chromosomes in heredity. Biological Bulletin 4: 231–251 (cit. on p. 28). 254 Suzuki, R. & Shimodaira, H. (2006) Pvclust: An R package for assessing the uncer- tainty in hierarchical clustering. Bioinformatics 22(12): 1540–1542 (cit. on p. 120). Svishcheva,G. R., Axenovich, T. I., Belonogova,N.M., vanDuijn, C.M.,&Aulchenko, Y. S. (2012) Rapid variance components-based method for whole-genome associ- ation analysis. Nature Genetics 44(10): 1166–1170 (cit. on pp. 38, 59, 89). Swinkels, B. M., Boersma, L. V. A., Rensing, B. J., & Jaarsma, W. (2007) Isolated left ventricular noncompaction in a patient presenting with a subacute myocardial infarction. Netherlands Heart Journal 15(3): 109–11 (cit. on p. 189). Tachmazidou, I., Dedoussis, G., Southam, L., Farmaki, A.-E., Ritchie, G. R. S., Xi- fara, D. K., Matchan, A., Hatzikotoulas, K., Rayner, N. W., Chen, Y., Pollin, T. I., O’Connell, J. R., Yerges-Armstrong, L. M., Kiagiadaki, C., Panoutsopoulou, K., Schwartzentruber, J., Moutsianas, L., UK10K consortium, E., Tsafantakis, E., Tyler- Smith, C., & al. (2013) A rare functional cardioprotective APOC3 variant has risen in frequency in distinct population isolates.Nature Communications 4: 2872 (cit. on p. 104). Takeuchi, F., Yokota,M., Yamamoto, K., Nakashima, E., Katsuya, T., Asano,H., Isono, M., Nabika, T., Sugiyama, T., Fujioka, A., Awata, N., Ohnaka, K., Nakatochi, M., Kitajima,H., Rakugi,H.,Nakamura, J., Ohkubo, T., Imai, Y., Shimamoto,K., Yamori, Y., & al. 
(2012) Genome-wide association study of coronary artery disease in the Japanese. European Journal of Human Genetics 20(3): 333–340 (cit. on p. 38). Tang,H.,Quertermous, T., Rodriguez, B., Kardia, S. L. R., Zhu, X., Brown,A., Pankow, J. S., Province, M. A., Hunt, S. C., Boerwinkle, E., Schork, N. J., & Risch, N. J. (2005) Genetic structure, self-identified race/ethnicity, and confounding in case-control association studies. The American Journal of Human Genetics 76(2): 268–75 (cit. on p. 49). Templ, M., Alfons, A., & Filzmoser, P. (2012) Exploring incomplete data using visu- alization techniques. Advances in Data Analysis and Classification 6(1): 29–47 (cit. on pp. 107, 110, 111). Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000) A Global Geometric Frame- work for Nonlinear Dimensionality Reduction. Science 290(5500): 2319–2323 (cit. on pp. 129, 134). Teng, S. L. & Huang, H. (2009) A Statistical Framework to Infer Functional Gene Relationships From Biologically Interrelated Microarray Experiments. Journal of the American Statistical Association 104(486): 465–473 (cit. on p. 104). 255 Teo, Y. Y., Inouye, M., Small, K. S., Gwilliam, R., Deloukas, P., Kwiatkowski, D. P., & Clark, T. G. (2007) A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics 23(20): 2741–6 (cit. on p. 157). The International HapMap Consortium (2005) A haplotype map of the human gen- ome. Nature 437(7063): 1299–320 (cit. on pp. 36, 47, 158, 210). The International HapMap Consortium (2007) A second generation human haplo- type map of over 3.1 million SNPs. Nature 449(18): 851– (cit. on pp. 36, 210). The International HapMap Consortium (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467 (cit. on p. 36). Thierfelder, L., Watkins, H., MacRae, C., Lamas, R., McKenna, W., Vosberg, H. P., Seidman, J. G., & Seidman, C. E. 
(1994) Alpha-tropomyosin and cardiac troponin T mutations cause familial hypertrophic cardiomyopathy: a disease of the sar- comere. Cell 77(5): 701–12 (cit. on p. 68). Thomas, S. C. (2005) The estimation of genetic relationships usingmolecularmarkers and their efficiency in estimating heritability in natural populations. Philosophical transactions of the Royal Society of London. Series B, Biological sciences 360(1459): 1457– 67 (cit. on p. 49). Tian, C., Gregersen, P. K., & Seldin, M. F. (2008a) Accounting for ancestry: popula- tion substructure and genome-wide association studies.HumanMolecular Genetics 17(R2): R143–R150 (cit. on p. 48). Tian, C., Plenge, R. M., Ransom, M., Lee, A., Villoslada, P., Selmi, C., Klareskog, L., Pulver, A. E., Qi, L., Gregersen, P. K., & Seldin, M. F. (2008b) Analysis and Ap- plication of European Genetic Substructure Using 300 K SNP Information. PLoS Genetics 4(1): e4 (cit. on p. 48). Toufan,M., Shahvalizadeh, R., & Khalili, M. (2012)Myocardial infarction in a patient with left ventricular noncompaction: a case report. International Journal of General Medicine 5: 661–5 (cit. on p. 189). Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Bot- stein, D., & Altman, R. B. (2001) Missing value estimation methods for DNA mi- croarrays. Bioinformatics 17(6): 520–525 (cit. on p. 107). Tschermak, E. (1900) Ueber künstliche Kreuzung bei Pisum sativum. Plant Biology 18(6): 232–239 (cit. on p. 26). 256 Tuan, D., Biro, P. A., DeRiel, J. K., Lazarus, H., & Forget, B. G. (1979) Restriction en- donuclease mapping of the human gamma globin gene loci.Nucleic Acids Research 6(7): 2519–44 (cit. on p. 32). UK10K Consortium (2015) The UK10K project identifies rare variants in health and disease. en. Nature 526: 82–90 (cit. on pp. 36, 157, 159). Unal, B., Critchley, J. A., & Capewell, S. (2004) Explaining the Decline in Coronary Heart Disease Mortality in England andWales Between 1981 and 2000. Circulation 109(9) (cit. on p. 
67).

Van Buuren, S. & Oudshoorn, K. (1999) Flexible multivariate imputation by MICE (cit. on p. 113).

Van der Merwe, L., Cloete, R., Revera, M., Heradien, M., Goosen, A., Corfield, V. A., Brink, P. A., & Moolman-Smook, J. C. (2008) Genetic variation in angiotensin-converting enzyme 2 gene is associated with extent of left ventricular hypertrophy in hypertrophic cardiomyopathy. Human Genetics 124(1): 57–61 (cit. on pp. 155, 156).

Van Essen, D. C., Ugurbil, K., Auerbach, E., Barch, D., Behrens, T. E. J., Bucholz, R., Chang, A., Chen, L., Corbetta, M., Curtiss, S. W., Della Penna, S., Feinberg, D., Glasser, M. F., Harel, N., Heath, A. C., Larson-Prior, L., Marcus, D., Michalareas, G., Moeller, S., Oostenveld, R., & al. (2012) The Human Connectome Project: A data acquisition perspective. NeuroImage 62(4): 2222–2231 (cit. on p. 194).

Van Buuren, S. & Groothuis-Oudshoorn, K. (2011) mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45(3): 1–67 (cit. on pp. 107, 113).

Vanovschi, V. (2017) Parallel Python Software (cit. on p. 103).

Vasan, R. S., Glazer, N. L., Felix, J. F., Lieb, W., Wild, P. S., Felix, S. B., Watzinger, N., Larson, M. G., Smith, N. L., Dehghan, A., Grosshennig, A., Schillert, A., Teumer, A., Schmidt, R., Kathiresan, S., Lumley, T., Aulchenko, Y. S., König, I. R., Zeller, T., Homuth, G., & al. (2009) Genetic variants associated with cardiac structure and function: a meta-analysis and replication of genome-wide association data. The Journal of the American Medical Association 302(2): 168–78 (cit. on pp. 69, 155).

Vasan, R. S., Larson, M. G., Aragam, J., Wang, T. J., Mitchell, G. F., Kathiresan, S., Newton-Cheh, C., Vita, J. A., Keyes, M. J., O’Donnell, C. J., Levy, D., & Benjamin, E. J. (2007) Genome-wide association of echocardiographic dimensions, brachial artery endothelial function and treadmill exercise responses in the Framingham Heart Study. BMC Medical Genetics 8 Suppl 1: S2 (cit. on pp. 69, 155).
Vatta, M., Mohapatra, B., Jimenez, S., Sanchez, X., Faulkner, G., Perles, Z., Sinagra, G., Lin, J.-H., Vu, T. M., Zhou, Q., Bowles, K. R., Di Lenarda, A., Schimmenti, L., Fox, M., Chrisco, M. A., Murphy, R. T., McKenna, W., Elliott, P., Bowles, N. E., Chen, J., & al. (2003) Mutations in Cypher/ZASP in patients with dilated cardiomyopathy and left ventricular non-compaction. Journal of the American College of Cardiology 42(11): 2014–2027 (cit. on p. 180).

Villanueva, B., Pong-Wong, R., Fernández, J., & Toro, M. A. (2005) Benefits from marker-assisted selection under an additive polygenic genetic model. Journal of Animal Science 83(8): 1747 (cit. on p. 52).

Villard, E., Perret, C., Gary, F., Proust, C., Dilanian, G., Hengstenberg, C., Ruppert, V., Arbustini, E., Wichter, T., Germain, M., Dubourg, O., Tavazzi, L., Aumont, M. C., De Groote, P., Fauchier, L., Trochu, J. N., Gibelin, P., Aupetit, J. F., Stark, K., Erdmann, J., & al. (2011) A genome-wide association study identifies two loci associated with heart failure due to dilated cardiomyopathy. European Heart Journal 32(9): 1065–1076 (cit. on p. 69).

Vischer, E. & Chargaff, E. (1948) The composition of the pentose nucleic acids of yeast and pancreas. The Journal of Biological Chemistry 176(2): 715–34 (cit. on p. 31).

Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017) 10 Years of GWAS Discovery: Biology, Function, and Translation. American Journal of Human Genetics 101(1): 5–22 (cit. on p. 38).

Wain, L. V., Verwoert, G. C., O’Reilly, P. F., Shi, G., Johnson, T., Johnson, A. D., Bochud, M., Rice, K. M., Henneman, P., Smith, A. V., Ehret, G. B., Amin, N., Larson, M. G., Mooser, V., Hadley, D., Dörr, M., Bis, J. C., Aspelund, T., Esko, T., Janssens, A. C. J. W., & al. (2011) Genome-wide association study identifies six new loci influencing pulse pressure and mean arterial pressure. Nature Genetics 43(10): 1005–1012 (cit. on p. 189).

Wang, D. G., Fan, J. B., Siao, C.
J., Berno, A., Young, P., Sapolsky, R., Ghandour, G., Perkins, N., Winchester, E., Spencer, J., Kruglyak, L., Stein, L., Hsie, L., Topaloglou, T., Hubbell, E., Robinson, E., Mittmann, M., Morris, M. S., Shen, N., Kilburn, D., & al. (1998) Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280(5366): 1077–82 (cit. on pp. 33, 37).

Wang, L. W., Leonhard-Melief, C., Haltiwanger, R. S., & Apte, S. S. (2009) Post-translational modification of thrombospondin type-1 repeats in ADAMTS-like 1/punctin-1 by C-mannosylation of tryptophan. The Journal of Biological Chemistry 284(44): 30004–15 (cit. on p. 189).

Watson, J. D. & Crick, F. H. C. (1953) Genetical Implications of the Structure of Deoxyribonucleic Acid. Nature 171(4361): 964–967 (cit. on p. 31).

Weldon, W. F. R. (1890) The variations occurring in certain Decapod Crustacea. Proceedings of the Royal Society 47: 445–453 (cit. on p. 27).

Weldon, W. F. R. (1892) Certain correlated variations in Crangon vulgaris. Proceedings of the Royal Society 51: 2–21 (cit. on p. 27).

Wild, P. S., Zeller, T., Schillert, A., Szymczak, S., Sinning, C. R., Deiseroth, A., Schnabel, R. B., Lubos, E., Keller, T., Eleftheriadis, M. S., Bickel, C., Rupprecht, H. J., Wilde, S., Rossmann, H., Diemert, P., Cupples, L. A., Perret, C., Erdmann, J., Stark, K., Kleber, M. E., & al. (2011) A Genome-Wide Association Study Identifies LIPA as a Susceptibility Gene for Coronary Artery Disease. Circulation: Cardiovascular Genetics 4(4): 403–412 (cit. on p. 38).

Wilks, S. S. (1938) The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. The Annals of Mathematical Statistics 9(1): 60–62 (cit. on pp. 43, 99).

Willer, C. J., Speliotes, E. K., Loos, R. J. F., Li, S., Lindgren, C. M., Heid, I. M., Berndt, S. I., Elliott, A. L., Jackson, A. U., Lamina, C., Lettre, G., Lim, N., Lyon, H. N., McCarroll, S. A., Papadakis, K., Qi, L., Randall, J. C., Roccasecca, R.
M., Sanna, S., Scheet, P., & al. (2009) Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nature Genetics 41(1): 25–34 (cit. on p. 38).

Wong, N. D. (2014) Epidemiological studies of CHD and the evolution of preventive cardiology. Nature Reviews Cardiology 11(5): 276–289 (cit. on p. 66).

Wood, A. R., Esko, T., Yang, J., Vedantam, S., Pers, T. H., Gustafsson, S., Chu, A. Y., Estrada, K., Luan, J., Kutalik, Z., Amin, N., Buchkovich, M. L., Croteau-Chonka, D. C., Day, F. R., Duan, Y., Fall, T., Fehrmann, R., Ferreira, T., Jackson, A. U., Karjalainen, J., & al. (2014) Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genetics 46(11): 1173–1186 (cit. on p. 38).

World Health Organisation (2016) International Statistical Classification of Diseases and Related Health Problems 10th Revision. 5th. Geneva (cit. on p. 66).

Wright, F. A., Huang, H., Guan, X., Gamiel, K., Jeffries, C., Barry, W. T., Pardo-Manuel de Villena, F., Sullivan, P. F., Wilhelmsen, K. C., & Zou, F. (2007) Simulating association studies: a data-based resampling method for candidate regions or whole genome scans. Bioinformatics 23(19): 2581–2588 (cit. on pp. 72, 73).

Wu, M. C., Kraft, P., Epstein, M. P., Taylor, D. M., Chanock, S. J., Hunter, D. J., & Lin, X. (2010) Powerful SNP-Set Analysis for Case-Control Genome-wide Association Studies. The American Journal of Human Genetics 86(6): 929–942 (cit. on pp. 38, 193).

Xu, C., Tachmazidou, I., Walter, K., Ciampi, A., Zeggini, E., Greenwood, C. M. T., & UK10K Consortium (2014) Estimating genome-wide significance for whole-genome sequencing studies. Genetic Epidemiology 38(4): 281–90 (cit. on p. 47).

Xu, X., Tian, L., & Wei, L. J. (2003) Combining dependent tests for linkage or association across multiple phenotypic traits. Biostatistics 4(2): 223–229 (cit. on p. 55).

Yang, J., Lee, S. H., Goddard, M. E. M., Visscher, P. M.
P., Hindorff, L., Sethupathy, P., Junkins, H., Ramos, E., Mehta, J., Collins, F., Manolio, T., Manolio, T., Collins, F., Cox, N., Goldstein, D., Hindorff, L., Hunter, D., McCarthy, M., Ramos, E., Cardon, L., & al. (2011) GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics 88(1): 76–82 (cit. on pp. 38, 48, 53, 59, 88, 117).

Yang, J., Loos, R. J. F., Powell, J. E., Medland, S. E., Speliotes, E. K., Chasman, D. I., Rose, L. M., Thorleifsson, G., Steinthorsdottir, V., Mägi, R., Waite, L., Smith, A. V., Yerges-Armstrong, L. M., Monda, K. L., Hadley, D., Mahajan, A., Li, G., Kapur, K., Vitart, V., Huffman, J. E., & al. (2012) FTO genotype is associated with phenotypic variability of body mass index. Nature 490(7419): 267–72 (cit. on p. 38).

Yang, Q. & Wang, Y. (2012) Methods for Analyzing Multivariate Phenotypes in Genetic Association Studies. Journal of Probability and Statistics 2012: 1–13 (cit. on pp. 54–56).

Yang, Q., Wu, H., Guo, C.-Y., & Fox, C. S. (2010) Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genetic Epidemiology 34(5): 444–454 (cit. on p. 55).

Yang, R., Yi, N., & Xu, S. (2006) Box–Cox transformation for QTL mapping. Genetica 128(1-3): 133–143 (cit. on p. 43).

Yang, W., Guo, Z., Huang, C., Duan, L., Chen, G., Jiang, N., Fang, W., Feng, H., Xie, W., Lian, X., Wang, G., Luo, Q., Zhang, Q., Liu, Q., & Xiong, L. (2014) Combining high-throughput phenotyping and genome-wide association studies to reveal natural genetic variation in rice. Nature Communications 5: 5087 (cit. on pp. 56, 88, 104).

Yousef, Z. R., Foley, P. W., Khadjooi, K., Chalil, S., Sandman, H., Mohammed, N. U., & Leyva, F. (2009) Left ventricular non-compaction: clinical features and cardiovascular magnetic resonance imaging. BMC Cardiovascular Disorders 9(37) (cit. on pp. 180, 189).

Yu, J., Pressoir, G., Briggs, W. H., Vroh Bi, I., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B.
S., Nielsen, D. M., Holland, J. B., Kresovich, S., & Buckler, E. S. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38(2): 203–208 (cit. on pp. 50, 52, 193).

Yuan, X., Miller, D. J., Zhang, J., Herrington, D., & Wang, Y. (2012) An overview of population genetic data simulation. Journal of Computational Biology 19(1): 42–54 (cit. on p. 72).

Zambrano, E., Marshalko, S. J., Jaffe, C. C., Hui, P., Sandman, H., Mohammed, N. U., Leyva, F., Zambrano, E., Marshalko, S., Jaffe, E., Hui, P., Jenni, R., Oechslin, E., Jost, C. A., Kaufmann, P., Ritter, M., Oechslin, E., Siutsch, G., Attenhofer, C., Schneider, J., & al. (2002) Isolated Noncompaction of the Ventricular Myocardium: Clinical and Molecular Aspects of a Rare Cardiomyopathy. Laboratory Investigation 82(2): 117–122 (cit. on pp. 64, 179, 180).

Zeglinski, M. R., Davies, J. J. L., Ghavami, S., Rattan, S. G., Halayko, A. J., & Dixon, I. M. C. (2016) Chronic expression of Ski induces apoptosis and represses autophagy in cardiac myofibroblasts. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research 1863(6): 1261–1268 (cit. on p. 170).

Zhang, F., Guo, X., Wu, S., Han, J., Liu, Y., Shen, H., & Deng, H.-W. (2012) Genome-Wide Pathway Association Studies of Multiple Correlated Quantitative Phenotypes Using Principle Component Analyses. PLoS ONE 7(12). Ed. by M. Xiong: e53320 (cit. on pp. 144, 194).

Zhang, M., Song, F., Liang, L., Nan, H., Zhang, J., Liu, H., Wang, L.-E., Wei, Q., Lee, J. E., Amos, C. I., Kraft, P., Qureshi, A. A., & Han, J. (2013) Genome-wide association studies identify several new loci associated with pigmentation traits and skin cancer risk in European Americans. Human Molecular Genetics 22(14): 2948–2959 (cit. on p. 38).

Zhang, Z., Ersoz, E., Lai, C.-Q., Todhunter, R. J., Tiwari, H. K., Gore, M. A., Bradbury, P. J., Yu, J., Arnett, D. K., Ordovas, J. M., & Buckler, E. S.
(2010) Mixed linear model approach adapted for genome-wide association studies. Nature Genetics 42(4): 355–360 (cit. on pp. 38, 88, 89).

Zhao, K., Aranzana, M. J., Kim, S., Lister, C., Shindo, C., Tang, C., Toomajian, C., Zheng, H., Dean, C., Marjoram, P., & Nordborg, M. (2007) An Arabidopsis example of association mapping in structured samples. PLoS Genetics 3(1): e4 (cit. on pp. 50, 53).

Zhou, X. & Stephens, M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44(7) (cit. on p. 193).

Zhou, X. & Stephens, M. (2014) Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature Methods 11(4): 407–9 (cit. on pp. 57, 71, 76, 78, 79, 81, 82, 89, 90).

Zirkle, C. (1935) The Inheritance of Acquired Characters and the Provisional Hypothesis of Pangenesis (cit. on p. 24).