DNA methylation: the “stable” epigenetic mark

Amir Daniel Hay
St John’s College

This dissertation is submitted for the degree of Doctor of Philosophy

Department of Genetics
University of Cambridge
September 2022

Declaration

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the text. The use of “we” reflects my own work, unless specifically stated otherwise. It is not substantially the same as any work that has already been submitted for any degree to any university or institution. It does not exceed the prescribed word limit for the School of Biological Sciences Degree Committee.

Amir Daniel Hay
September 2022

Acknowledgements

I would like to thank my supervisor Anne Ferguson-Smith for taking me on as a student, supporting me throughout my time in Cambridge, and pushing me to become a better and more complete scientist. Under your guidance, the scope of what I have learned ranges from the very technical to the highly conceptual; with your trust, I have progressed to conducting independent research with the knowledge that I can always do better. The opportunity to do research in such an enriching environment has been an immense privilege for which I cannot thank you enough – a gratitude that I would also like to express to all the past and present members of the AFS lab.

I am especially grateful to fellow PhD student Noah Kessler who has been a peer, mentor, and friend, by showing me the ropes of bioinformatics, challenging me to go beyond my perceived limits, and knowing when it is a good time to take a coffee break. Another special thank you goes to Jessica Elmer for her collaborative ethos that made working together both fun and interesting. Thank you to Tessa Bertozzi for welcoming me to the lab, introducing me to the pyrosequencer, teaching me how to work with mice, and making me feel at home in Cambridge.
Thank you to Carol Edwards and Fran Dearden for help with 4C; to Mitsuteru Ito for teaching me how to make MEFs and providing the much-needed day-to-day equipment maintenance that allowed me to work in the lab relatively unimpeded; to Nozomi Takahashi for teaching me how to make mESCs, managing the Dnmt3a/3b mutant mice, and for always providing me with feedback on my project; to Geula Hanin, Shrina Patel, Stephanie Telerman, and Boshra Alsulaiti for support with western blotting; to Jessi Becker for taking on the CTCF project; to Hugo Tavares for supervising my RNA-seq analyses; to Chrysante Iliakis for her assistance at the bench; and to the students whom I had the opportunity to supervise, William Xie, Eve Ainscough, Gloria Jansen, and William Saunter, for all their hard work.

Outside the AFS lab, the most influential figure on my work has been Felipe Teixeira. The concepts behind Chapter 3 of this thesis arose from our conversations in the department hallways, and simple offhand drawings on the whiteboard that would eventually become serious experiments. These humble beginnings spurred a full-fledged collaboration regarding the direction and scope of the project for which I am extremely appreciative. I would also like to thank Daniel Gebert, postdoc in the Teixeira group, for working with me to elucidate the relationship between DNA methylation fidelity and transposable elements. Our other collaborators include Jamie Hackett, whom I would like to thank for sharing piRNA mutant mouse material with us. I am looking forward to our continued collaboration with Ben Simons, Steffen Rulands, and Matteo Ciarchi to model the inheritance of methylation – an exciting project from which I have learned about the intricacy of interdisciplinary science, as well as the rewards.

I have received a fair amount of technical advice and guidance, as well as general goodwill, that has proven to be critical for the completion of this thesis.
Thank you to my advisor Julie Ahringer and her research associate Alex Appert for giving me access to a sequencer to test my first ChIP-seq libraries; to Michael Imbeault for helping me optimise my ChIP protocol; to Ben Harvey from Agilent for supervising me the first time I performed target capture bisulphite sequencing; to Rahia Mashoodh for advice on statistical analyses; to the mouse facility for providing care to our mice; to Novogene for sequencing our 4C libraries and to the CRUK genomics facility for sequencing the rest (and majority) of our libraries; to Dan Holland and Ian Henderson for lending me a crucial component of the Covaris sonicator; to Yoach Rais from the Weizmann Institute for sending me purified TAT-CRE recombinase to perform my induced knockout experiment. Thank you to the Cambridge Trust, the Department of Genetics, and St. John’s College for supporting me financially. Lastly, I want to thank my parents for their endless support physically, mentally, and scientifically – I am lucky to have you.

Summary

DNA methylation is regarded as a stable epigenetic mark given its faithful maintenance across successive cell divisions. Methylation occurs at most CpG sites in mammalian genomes and is generally associated with transcriptional repression. An accepted evolutionary role for DNA methylation is to prevent the mobility of transposable elements (TEs). This thesis investigates the stability of DNA methylation in two separate contexts, particularly relating to intermediate levels of methylation. First, I characterise the properties of variably methylated TEs (VM-TEs) between individual mice. Second, I assess the fidelity of DNA methylation inheritance across cellular generations at VM-TEs, and more widely at the genome-scale, to ascertain the heritability, mechanism, and function of intermediate methylation states.
My findings show that variable methylation extends beyond the boundaries of the TEs, and that all VM-TEs are enriched for binding of the transcription factor CTCF, which is inversely correlated with DNA methylation. I propose that molecular antagonism between CTCF and DNA methylation machinery influences the formation of variably methylated states in the early embryo.

Within an individual mouse, VM-TEs are intermediately methylated between 10% and 90%, representing the cell population average of methylation states. The prevailing hypothesis supports the notion that methylation is established de novo by DNA methyltransferases DNMT3A/3B and then faithfully maintained by DNMT1. Hence, intermediate methylation levels likely represent stochastic de novo establishment (DNMT3A/3B) and clonal maintenance (DNMT1) within the cell population. To test this, I subcloned single cells from both mouse embryonic fibroblasts (MEFs) and embryonic stem cells (mESCs), growing them into multiple subclonal populations to assess methylation fidelity through cell divisions. This allowed me to address the degree to which a particular locus acquires intermediate methylation in the clonal population, as well as the properties and mechanism of that state. If methylation is indeed propagated faithfully, one would expect that the single-cell derived populations always exhibit one of the three symmetric methylation states: 0%, 50% or 100%. At VM-TEs, we find that the subcloned cell lines attain intermediate methylation levels that reflect the level of the parent population, which implies that the original single-cell methylation state is not faithfully maintained at these loci. Expanding the analysis genome-wide, I use a target capture bisulphite sequencing method to evaluate methylation fidelity in the subclonal cell lines more globally.
I find that CpGs exhibiting intermediate methylation at the cell population level are generally unfaithfully inherited between cell divisions and attain methylation independently of neighbouring CpGs. While faithful hypo- and hypermethylation associate with transcriptional activity, unfaithful intermediate methylation associates with transcriptionally inactive genes or intergenic regions of the genome. Finally, in DNMT3A/3B mutants, methylation is not depleted consistently at any CpG, regardless of its methylation state in the control. Therefore, DNMT1 has two functions: 1) canonical maintenance of faithful methylation and 2) as shown here, the acquisition of intermediate methylation states that are unfaithfully inherited between cell divisions.

Contents

Chapter 1 Introduction
1.1 DNA methylation
1.1.1 Discovering DNA methylation: from prokaryotes to eukaryotes
1.1.2 Measuring DNA methylation
1.1.3 Principles and patterns of DNA methylation
1.1.4 The maintenance and establishment of DNA methylation
1.1.5 Crosstalk between DNA methylation and histone tail modifications
1.1.6 DNA methylation fidelity
1.2 Transposable elements
1.2.1 Retrotransposons
1.2.2 Intracisternal A-particles (IAPs)
1.2.3 Mechanisms for silencing transposable elements in mammals
1.2.4 Metastable epialleles
1.3 CTCF: transcription factor and regulator of genomic architecture
1.3.1 CTCF and DNA methylation
1.4 Research aims and thesis overview
1.4.1 Aims
1.4.2 Structure and overview
Chapter 2 Genomic properties of variably methylated retrotransposons in mouse
2.1 Introduction and objectives
2.2 Results
2.2.1 Characterising VM-IAPs by CpG density
2.2.2 Inter-individual methylation variability is not confined to the LTR boundaries of VM-IAPs
2.2.3 CTCF and its motif are enriched at VM-IAPs
2.2.4 CTCF binding and DNA methylation have an inverse relationship at VM-IAPs
2.2.5 Chromatin interactions with VM-IAPs
2.2.6 DNA methylation at VM-IAPs is regulated independently of somatic and maternally derived piRNAs
2.3 Discussion
Chapter 3 Stochastic and faithful inheritance define DNA methylation patterns through cell divisions
3.1 Introduction and objectives
3.2 Methodology
3.3 Results
3.3.1 Methylation fidelity of VM-IAPs
3.3.2 Evaluating methylation fidelity at the genome-scale
3.3.3 Methylation fidelity at transposable elements
3.3.4 Investigating stochastic methylation inheritance
3.4 Discussion
3.4.1 Variable methylation at VM-IAPs is both stochastically established and maintained
3.4.2 Genome-wide, intermediate methylation is unfaithfully inherited between cell divisions yet can be stochastically retained in a cell population
3.4.3 Methylation fidelity is associated with transcription
3.4.4 Methylation fidelity at transposable elements is determined by genomic location
3.4.5 Stochastic methylation associates with repressive histone marks
3.4.6 Both stochastic and non-stochastic methylation deposition is mediated by DNMT1
3.4.7 Future work on stochastic methylation regulation in mutant mice
Chapter 4 Discussion
4.1 Somatic DNA methylation: “accident” or design?
4.1.1 What is the function of DNA methylation in mammals?
4.1.2 Stochastic inheritance is an inherent characteristic of DNA methylation regulation in the genome
4.2 Modelling methylation inheritance
4.3 Implications for DNA methylation as a biomarker
4.4 Beyond DNA methylation: understanding intermediate levels of transcription
4.5 Concluding remarks
Chapter 5 Materials and methods
5.1 Mouse procedures
5.1.1 PiwiL2 knockouts
5.1.2 Dnmt3a/3b inducible double knockout
5.1.3 Tissue collection and DNA/RNA extraction
5.2 Cell line generation, maintenance, and subcloning
5.2.1 Mouse embryonic fibroblasts (MEFs)
5.2.2 Mouse embryonic stem cells (mESCs)
5.2.3 Dnmt3a/3b inducible double knockout MEFs
5.2.4 5-azacytidine treatment
5.3 Bisulphite pyrosequencing
5.4 Sequencing-based techniques and analyses
5.4.1 Chromatin immunoprecipitation (ChIP)
5.4.2 Circularised chromatin conformation capture sequencing (4C-seq)
5.4.3 Target capture bisulphite sequencing (tcBS-seq)
5.4.4 Total RNA sequencing
5.4.5 Publicly available sequencing datasets
5.5 Western blotting
References
Appendix A Complementary information and data
Appendix B Related publications

List of Figures

Figure 1.1: 5-methylcytosine represents the addition of a methyl group to the fifth atom of the cytosine ring of DNA and in mammals is predominately found in a CpG dinucleotide context.
Figure 1.2: Bisulphite treatment followed by PCR and DNA sequencing allows for quantitative nucleotide-level measurements of methylation levels.
Figure 1.3: Genomic methylation patterns and prevalence differ between species.
Figure 1.4: CpG islands are present in vertebrate genomes and absent from genomes lacking DNA methylation.
Figure 1.5: DNA methylation is established by DNMT3A/3B and maintained by DNMT1.
Figure 1.6: Protein structures and catalytic mechanism of mammalian DNA methyltransferases.
Figure 1.7: Mechanisms of passive and active demethylation.
Figure 1.8: Dynamics of methylation throughout mouse development.
Figure 1.9: Histone tail modifications have distinct functions at specific regions of the genome.
Figure 1.10: Genomic transposable element content varies between species.
Figure 1.11: Genetic structures of LINEs, SINEs, and LTR retrotransposons.
Figure 1.12: Model of KRAB-ZFP-mediated heterochromatin formation.
Figure 1.13: Agouti viable yellow phenotypic range and IAP regulation.
Figure 1.14: Loop extrusion model for how CTCF and cohesin jointly mediate 3D chromatin interactions.
Figure 2.1: IAP elements of the LTR1_Mm–Ez-int (fully-structured) and the LTR2_Mm (solo LTR) types are over-represented in cVM-IAPs.
Figure 2.2: Identification of variable methylation at transposable elements.
Figure 2.3: Increased CpG density in IAPLTR2_Mm cVM-IAPs.
Figure 2.4: Inter-individual methylation variability is not confined to the LTRs of VM-IAPs.
Figure 2.5: CTCF preferentially binds at a subset of IAP LTR types.
Figure 2.6: CTCF binding is enriched at VM-IAPs relative to other IAPs in the mouse genome.
Figure 2.7: CTCF binding site motif at non-variable IAPs and VM-IAPs is similar.
Figure 2.8: Methylation at six out of seven tested VM-IAPs correlates inversely with CTCF binding assessed by ChIP-sequencing.
Figure 2.9: Confirming that DNA methylation and CTCF binding are inversely correlated at VM-IAPs by ChIP-qPCR.
Figure 2.10: VM-IAP methylation is regulated independently of somatic piRNAs.
Figure 3.1: Evaluating intermediate methylation inheritance in cell culture.
Figure 3.2: Intermediate methylation at VM-IAPs is not faithfully inherited between cellular generations.
Figure 3.3: Memory of intermediate methylation states at VM-IAPs is better retained in MEFs compared to mESCs.
Figure 3.4: Intermediate methylation levels at VM-IAPs do not recover consistently after recovery from methylation inhibition.
Figure 3.5: Filtering and thresholding MEF methylation data.
Figure 3.6: Validation of MEF target capture bisulphite sequencing.
Figure 3.7: Filtering and thresholding mESC methylation data.
Figure 3.8: Validation of mESC target capture bisulphite sequencing.
Figure 3.9: Classifying and evaluating methylation states in MEFs.
Figure 3.10: Classifying expression levels in MEFs.
Figure 3.11: CpGs that exhibit intermediate methylation associate with transcriptional inactivity in MEFs.
Figure 3.12: Methylation fidelity associates with transcription in MEFs.
Figure 3.13: Intermediate methylation is generally unfaithful in MEFs.
Figure 3.14: Classifying and evaluating methylation states in mESCs.
Figure 3.15: Classifying expression levels in mESCs using publicly available data.
Figure 3.16: Methylation in mESCs associates with transcriptional inactivity.
Figure 3.17: Methylation remains low across protein-coding genes in mESCs.
Figure 3.18: Methylation is generally low and unfaithful in mESCs.
Figure 3.19: Methylation fidelity at transposable elements.
Figure 3.20: Intermediately methylated CpGs are prone to stochastic inheritance between cell divisions in MEF-1 cell lines.
Figure 3.21: Stochastically methylated CpGs associate with repressive histone tail modifications H3K27me3 and H3K9me3.
Figure 3.22: Confirmation of induced Dnmt3a/3b DKO in primary MEFs.
Figure 3.23: Conditional loss of DNMT3A/3B shows no consistent changes in methylation.
Figure 3.24: Intermediately methylated CpGs are prone to continue losing methylation following a genome-wide depletion by 5-aza.
Figure 4.1: DNA methylation is by default stochastically maintained.
Figure 4.2: Local coordination of DNA methylation can largely explain inheritance dynamics between cell divisions.
Figure 4.3: Intermediate transcription levels reflect variable transcription between clonal cell lines.
Figure A.1: Methylation variability exists beyond the edges of VM-IAPs.
Figure A.2: VM-IAPs exist in a variety of genomic methylation contexts.
Figure A.3: Confirmation of methylation landscapes surrounding VM-IAPs.
Figure A.4: Genomic interactions of VM-IAPs.
Figure A.5: Methylation levels and methylation fidelity at protein-coding genes of varying expression in MEFs.
Figure A.6: Methylation levels and methylation fidelity at protein-coding genes of varying expression in mESCs.
Figure A.7: Intermediately methylated CpGs are prone to stochastic inheritance between cell divisions in MEF-2 cell lines.

List of Tables

Table 2.1: Read counts, mapping efficiency, and peak counts for eight individual CTCF ChIP-seq and Input libraries.
Table 3.1: Coverage of genic regions and transposable elements by target capture bisulphite sequencing.
Table 3.2: Publicly available bisulfite-sequencing and RNA-sequencing datasets used in this chapter.
Table 3.3: Number of genes and genic regions represented in methylation data.
Table 3.4: Counts and ratios of non-stochastic and stochastic CpG overlap with various histone tail modification peaks.
Table 5.1: Summary of all generated sequencing datasets.
Table A.1: Bisulphite pyrosequencing primers for VM-IAPs, non-variable IAPs, and imprinting control regions.
Table A.2: Bisulphite pyrosequencing primers for regions surrounding VM-IAPs.
Table A.3: Primers used for 4C-sequencing.
Abbreviations

2i    Two inhibitors (PD0325901 and CHIR99021)
4C    Circularised chromatin conformation capture
5-aza    5-azacytidine
5hmC    5-hydroxymethylcytosine
5mC    5-methylcytosine
A    Adenine
Avy    Agouti viable yellow
AxinFu    Axin fused
BER    Base excision repair
bp    Base pair
C    Cytosine
CGI    CpG island
ChIP    Chromatin immunoprecipitation
CpG    Cytosine-guanine dinucleotide
CTCF    CCCTC-binding factor
cVM-IAP    Constitutive VM-IAP
DKO    Double knockout
DNA    Deoxyribonucleic acid
DNMT    DNA methyltransferase
E    Embryonic day
ERV    Endogenous retrovirus
G    Guanine
Gb    Gigabase
H    A, C, or T (nucleotides)
H3K27me3    Histone 3 lysine 27 trimethylation
H3K36me3    Histone 3 lysine 36 trimethylation
H3K4me3    Histone 3 lysine 4 trimethylation
H3K9me3    Histone 3 lysine 9 trimethylation
HDAC    Histone deacetylase
HP1    Heterochromatin protein 1
I    Intermediately methylated
IAP    Intracisternal A-particle
ICM    Inner cell mass
ICR    Imprinting control region
KAP1    KRAB-associated protein 1
kb    Kilobase
KO    Knockout
KRAB    Krüppel-associated box
KZFP    KRAB zinc finger protein
LIF    Leukemia inhibitory factor
LINE    Long interspersed element
LTR    Long terminal repeat
M    Hypermethylated
Mb    Megabase
ME    Metastable epiallele
MEF    Mouse embryonic fibroblast
mESC    Mouse embryonic stem cell
MI    Hypermethylated and intermediately methylated
nt    Nucleotide
NuRD    Nucleosome remodelling deacetylase complex
ORF    Open reading frame
PBS    Primer binding site
PCR    Polymerase chain reaction
PGC    Primordial germ cell
piRNA    Piwi-interacting RNA
qPCR    Quantitative PCR
RNA    Ribonucleic acid
RRBS    Reduced representation bisulphite sequencing
rRNA    Ribosomal RNA
SAM    S-Adenosyl methionine
seq    Sequencing
SINE    Short interspersed element
siRNAs    Short interfering RNAs
T    Thymine
tcBS-seq    Target capture bisulphite sequencing
TDG    Thymine DNA glycosylase
TE    Transposable element
TET    Ten eleven translocation proteins
tRFs    tRNA fragments
tRNA    Transfer RNA
tsVM-IAP    Tissue-specific VM-IAP
TTKO    Triple TET knockout
U    Hypomethylated
UCSC    University of California, Santa Cruz
UHRF1    Ubiquitin-like, with PHD and RING finger domains 1
UI    Hypomethylated and intermediately methylated
VM-IAP    Variably methylated IAP
VM-TE    Variably methylated TE
WGBS    Whole genome bisulphite sequencing

Chapter 1 Introduction

In the 1940s, Conrad Waddington proposed the term epigenetics to describe a field of study that focuses on the causal relationship between genotype and phenotype during development (Waddington 1942). However, by the 1990s, the definition of epigenetics was adapted and narrowed to denote heritable, mitotic or meiotic, changes in gene function that cannot be explained by changes in DNA sequence (Russo et al. 1996). This shift in definition, to one that focuses on inheritance of non-DNA information, was inspired by findings on DNA methylation. As a modification present on – but distinct from – the DNA, DNA methylation is important for development and can be inherited between cellular divisions, and sometimes even organismal generations (Holliday 1990). More recently, the scope of epigenetics has broadened so that heritability is no longer a defining requirement, and it now includes RNA-mediated transcriptional repression and chromatin features, such as histone variants and tail modifications, as well as many other factors and pathways (Jablonka and Lamb 2002; Bird 2007). Yet, the heritability of non-DNA information is intriguing and continues to be an avenue of study in the now more expansive field of epigenetics.
This thesis is divided into two relatively distinct projects that are both related to the stability of DNA methylation as an epigenetic mark between either organismal or cellular generations. The first examines a subset of variably methylated transposable elements (VM-TEs) and the way in which their methylation status informs interactions with other genomic factors (Chapter 2). The second evaluates the extent to which methylation is mitotically inherited at VM-TEs and genome-wide (Chapter 3). Therefore, the first half of this introduction contains a thorough evaluation of the literature surrounding DNA methylation, from how it was discovered to recent findings that call into question long-standing assumptions regarding its basic regulation. Following this extensive review of DNA methylation, we switch gears to provide the basis for the current understanding of TEs in the mouse genome, and more specifically VM-TEs termed “metastable epialleles”, which serve as a hallmark for transgenerational epigenetic studies in mice.

1.1 DNA methylation

Despite being the subject of intensive research for the past 50 years, the function of DNA methylation in eukaryotic systems is still debated. Since its discovery, a multitude of functions across many species have been proposed, from repressing transposable elements to regulating gene expression, as well as acting as an agent for transgenerational epigenetic inheritance. However, a unifying function of DNA methylation in eukaryotes does not seem to exist, as patterns throughout the genome can differ greatly between species. This is further complicated by the fact that DNA methylation is not conserved in all eukaryotes, notably being absent from three model organisms: Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae (Mattei et al. 2022). In contrast, another group of epigenetic marks called histone tail modifications are found in all eukaryotes¹. Nevertheless, DNA methylation is essential for mammalian development and is unique in its status as a stable DNA modification that can be preserved between cell divisions, and in rare cases, organismal generations.

¹ However, not all histone tail modifications are found in all eukaryotes.

Extensive research on DNA methylation has been conducted in plants using Arabidopsis as a model organism, which has provided a wealth of knowledge for the field. However, DNA methylation in plants differs from that in mammals in ways that prevent findings from being transferred indiscriminately between them. For example, DNA methylation is found at CpG dinucleotides in both plants and mammals, and although mammals have non-CpG methylation, only in plants does DNA methylation occur frequently, and with known functional relevance, at CHG and CHH sites (where H represents A, C, or T) (Henderson and Jacobsen 2007). In plants, the non-CpG methylation machinery is plant-specific and distinct from the CpG methylation machinery, which is largely conserved across eukaryotes (Chan et al. 2005; Stroud et al. 2014). Meanwhile, in mammals, the same enzymes are responsible for depositing both non-CpG and CpG methylation (Ziller et al. 2011). Methylation of mammalian genomes is extensive, with ~70-80% of CpG dinucleotides being methylated. In contrast, <1%-7% of mammalian non-CpG dinucleotides are methylated, a phenomenon that occurs in relative abundance in pluripotent cells, as well as in the brain, where it accumulates after birth (Ziller et al. 2011; Lister et al. 2013; Mo et al. 2015; de Mendoza et al. 2021). The functional role of non-CpG methylation in mammals is debated, but it is known to be recognised in the brain by the highly expressed transcription factor MeCP2, which is important for normal neurological development (Guy et al. 2011; Gabel et al. 2015).
Mammals experience a genome-wide loss of methylation during both embryogenesis and gametogenesis, which does not easily allow for methylation states to be passed between generations (Morgan et al. 2005). This is not the case in plants or in non-mammalian vertebrates such as Xenopus laevis and Danio rerio [footnote 2] (two more model organisms that have DNA methylation in their genomes), where methylation is not globally lost during embryogenesis (Veenstra and Wolffe 2001; Stancheva et al. 2002; Mhanni and McGowan 2004; Hsieh et al. 2009; Jullien et al. 2012; Potok et al. 2013; Bogdanovic et al. 2016). Interestingly, the extra-embryonic tissue in plants, the endosperm, undergoes extensive demethylation (on the maternal alleles) – a process that is thought to have evolved to establish parental allele specific gene expression, also known as genomic imprinting (Gehring et al. 2009; Hsieh et al. 2009). In fact, mammals harbour genomic imprints in both the extra-embryonic tissue and the embryo, which both experience global loss of methylation, suggesting that this process also occurs to re-establish imprints (Reik and Walter 2001; Feil and Berger 2007). This is all to say that the dynamics of DNA methylation, and thereby possibly the function, in mammalian systems vary drastically between early development, gametogenesis, and adulthood. For this introduction, we will focus on the features of DNA methylation in mouse to extrapolate its functions, using relevant insights obtained from studies in other organisms. As an overview for this section, we will first address the discovery of DNA methylation in prokaryotes and then highlight how these findings led to early methods in measuring DNA methylation levels, as well as a brief description of the methods more commonly used today and in this thesis. Next, we outline the patterns and principles of DNA methylation as understood from comparative genomic analyses. This is followed by a detailed mechanistic review of how methylation is established and mitotically maintained in the genome, as well as the dynamics of methylation throughout development. We address the complex relationships between histone tail modifications and DNA methylation, and finally summarise findings that suggest DNA methylation is not always stably inherited between cell divisions.

Footnote 2: In Danio rerio, or zebrafish, the oocyte is hypomethylated compared to the sperm. The embryo gains methylation on the maternal allele by the 16-cell stage so that the DNA methylation profile is like that of the paternal allele - a state that is largely maintained throughout the development of the fish.

1.1.1 Discovering DNA methylation: from prokaryotes to eukaryotes

The identification of DNA methylation took place before the confirmation that DNA, as opposed to protein, is the genetic material that is inherited between generations. Throughout this thesis, the terms “DNA methylation” and “methylation” refer to 5-methylcytosine - the addition of a methyl group (-CH3) to the fifth atom of the cytosine ring of DNA (Figure 1.1A). 5-methylcytosine was synthesised for the first time in 1904 (Wheeler and Johnson 1904; Hitchings et al. 1949) and discovered in nature 20 years later (Johnson and Coghill 1925) in the nucleic acid of Mycobacterium tuberculosis. It took another 20 years for DNA methylation to be detected in mammalian DNA, although it was not identified as such initially. In a study independent of, but similar to, the one that led Erwin Chargaff to discover that there is a 1:1 stoichiometric ratio of purine and pyrimidine bases, Rollin Hotchkiss showed in 1948 that he was able to purify bases from calf thymus and identified what he called “epi-cytosine” – and what we now know is 5-methylcytosine (Hotchkiss 1948; Witkin 2005). The first “discovery” of 5-methylcytosine in non-bacterial nucleic acid (with the knowledge of it being so) was published in 1950 by Gerald R.
Wyatt at the Molteno Institute, whose building is now part of the University of Cambridge’s Pathology Department (Wyatt 1950). Wyatt confirmed the presence and amount of 5-methylcytosine (5mC) in different tissues from various organisms [footnote 3]. In a follow-up paper, Wyatt wrote “in the present state of knowledge as to the structure and function of nucleic acids nothing can be said as to the possible function of 5-methylcytosine. The amounts in which it occurs, however, varying with the source but constant from a given source, suggest that it is an essential constituent of certain DNA’s, and no accident of enzyme action” (Wyatt 1951). Today many things can be said about the function of DNA methylation, but the idea that it can exist due to an enzymatic “accident” is an unusual concept that we will return to in the Discussion section (4.1) of this thesis.

Footnote 3: Curiously, he did not detect 5mC in M. tuberculosis, the organism in which 5mC was first observed.

Figure 1.1: 5-methylcytosine represents the addition of a methyl group to the fifth atom of the cytosine ring of DNA and in mammals is predominantly found in a CpG dinucleotide context. (A) Molecular structures of an unmethylated and methylated cytosine (methyl group shown in orange). (B) The difference between a CpG dinucleotide (in red), of which 70-80% are cytosine methylated in mammals, and a CG base pair.

Early investigations into the function of DNA methylation largely took place in E. coli bacteria, in which not only cytosine, but also adenine can be methylated as N6-methyladenosine. Adenine DNA methylation is a rare occurrence in animal genomes (Vanyushin et al. 1970; Wu et al. 2016) and methylation of the remaining two deoxynucleotides, guanine or thymine, has not been detected in vivo [footnote 4]. 5-methylcytosine and N6-methyladenosine are differentially regulated in bacteria, yet they have overlapping functions (Borek and Srinivasan 1966).
The major role of both 5-methylcytosine and N6-methyladenosine in bacteria is as part of the “restriction modification system” that is used to defend the host bacterial cell against foreign DNA, such as that from a bacteriophage. The “restriction” part of this system refers to the restraint of the foreign DNA, which is accomplished by what are called restriction enzymes that cut DNA at specific sites and come in many varieties. The “modification” part of the system refers to DNA methylation – the bacterial DNA is actively methylated whereas the foreign DNA is not. Therefore, there exist methylation-sensitive restriction enzymes that cannot cut methylated DNA, but can cut DNA lacking methylation (Meselson et al. 1972). The function of DNA methylation in bacteria, which is to distinguish host DNA from non-host DNA that is targeted for degradation, in addition to the structural homology between prokaryotic and eukaryotic methylation deposition machinery (Cheng 1995), has led to the hypothesis that the last eukaryotic common ancestor used DNA methylation as a genome defence system (Chan et al. 2005).

Footnote 4: Aside from O6-methylguanine, which is mutagenic and cytotoxic. Bignami M, O'Driscoll M, Aquilina G, Karran P. 2000. Unmasking a killer: DNA O(6)-methylguanine and the cytotoxicity of methylating agents. Mutat Res 462: 71-82.

1.1.2 Measuring DNA methylation

The key to understanding the function of DNA methylation lies in the ability to accurately measure its presence at distinct loci in the genome. Biologists used the tools provided by the bacterial restriction modification system to interrogate DNA methylation in other organisms. HpaII and MspI are restriction enzymes and isoschizomers, meaning that they both recognise and make cuts at the same sequence of DNA: CCGG (Mann and Smith 1977; Waalwijk and Flavell 1978). However, HpaII cannot cleave the sequence when the internal cytosine residue is methylated, while MspI can.
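The contrast between the two isoschizomers can be illustrated with a minimal sketch. It assumes a toy single-stranded sequence, models only methylation of the internal cytosine of CCGG, and places the cut after the first cytosine of the site; the function names and the input format (a set of methylated cytosine indices) are hypothetical and serve only as an illustration.

```python
import re

SITE = "CCGG"  # recognition sequence shared by HpaII and MspI


def cleaved_sites(seq, methylated_c, enzyme):
    """Return 0-based start positions of the CCGG sites that the enzyme cleaves.

    methylated_c is a set of 0-based indices of methylated cytosines
    (a hypothetical input format chosen for this sketch).
    """
    if enzyme not in ("HpaII", "MspI"):
        raise ValueError("enzyme must be 'HpaII' or 'MspI'")
    sites = []
    for match in re.finditer(SITE, seq):
        internal_c = match.start() + 1  # second C of CCGG, i.e. the CpG cytosine
        # HpaII is blocked by methylation of the internal cytosine; MspI is not.
        if enzyme == "MspI" or internal_c not in methylated_c:
            sites.append(match.start())
    return sites


def fragment_lengths(seq, site_starts):
    """Fragment lengths after digestion, assuming the cut falls after the first C."""
    bounds = [0] + [start + 1 for start in site_starts] + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]
```

For the toy sequence AAACCGGTTTTCCGGAA with the internal cytosine of the second site methylated (index 12), the MspI digest yields fragments of 4, 8 and 5 bases, whereas the HpaII digest yields fragments of 4 and 13 bases. A difference of this kind in fragment sizes is what is read out on a Southern blot.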
These enzymes can be used to digest DNA followed by gel electrophoresis and hybridisation on Southern blots – one can then compare the samples treated with HpaII versus MspI and determine the relative presence or absence of DNA methylation on the specific DNA fragment of interest. Although this method cannot resolve methylation levels at individual CpG sites (the context in which methylation is commonly found in mammals; Figure 1.1B), it was powerful enough to fuel many of the early discoveries regarding DNA methylation, like the identification of CpG islands, which will be introduced shortly (Bird 1980). In 1970, a group at the University of Tokyo (Hayatsu et al. 1970) and another at New York University (Shapiro et al. 1970) independently showed that bisulphite can deaminate cytosine residues – they additionally found that 5-methylcytosine is largely resistant to bisulphite-mediated deamination (Wang et al. 1980). It was not until the early 1990s that this technique was used in conjunction with polymerase chain reaction (PCR) and DNA sequencing to deduce methylation levels by comparing the number of Ts (from deaminated cytosines) to Cs at a particular CpG site (Frommer et al. 1992; Clark et al. 1994). This allowed for another surge in DNA methylation studies due to the newfound ability to measure DNA methylation levels at individual CpG sites (Figure 1.2). For site-specific quantitative analyses of DNA methylation in this thesis, we use bisulphite conversion followed by PCR coupled with pyrosequencing. Pyrosequencing is a “sequencing-by-synthesis” method that uses the single-stranded DNA to enzymatically synthesise the complementary strand, detecting nucleotides as they are incorporated (Tost and Gut 2007). The technique is limited to sequencing DNA that is a few hundred base pairs long, but is remarkably accurate when compared to other locus-specific DNA methylation assays (BLUEPRINT 2016).
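The principle of reading methylation levels from bisulphite-converted reads can be sketched as follows. This is a toy illustration and not the analysis pipeline used in this thesis: it assumes forward-strand reads that are already aligned to the reference and of equal length, and the function names are hypothetical.

```python
from collections import Counter


def bisulphite_convert(seq, methylated):
    """Simulate bisulphite conversion plus PCR on one strand.

    Unmethylated C is read as T (via U); methylated C stays C.
    methylated is a set of 0-based indices of methylated cytosines.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def methylation_levels(reference, reads):
    """Fraction of C calls among C + T calls at each CpG cytosine.

    Assumes every read is aligned to the reference from position 0
    and spans its full length (a simplification for this sketch).
    """
    cpg_positions = [
        i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"
    ]
    levels = {}
    for pos in cpg_positions:
        calls = Counter(read[pos] for read in reads)
        c, t = calls["C"], calls["T"]
        # Positions with no informative calls are reported as None.
        levels[pos] = c / (c + t) if (c + t) else None
    return levels
```

As a usage example, for the reference ACGTTCGA and the four reads ACGTTCGA, ACGTTTGA, ATGTTCGA and ACGTTTGA, the first CpG (index 1) has three C calls and one T call, giving a level of 0.75, and the second CpG (index 5) has two of each, giving 0.5.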
The advent of next-generation sequencing technologies in the 2000s coupled with bisulphite conversion allowed for a deeper understanding of DNA methylation patterns on the genome-wide level, as well as how its function may vary between organisms. The power of whole genome bisulphite sequencing (WGBS) was immediately evident as it both confirmed and clarified many of the previous hypotheses established by locus-specific analyses of methylation. It also provided the scope to establish the fundamental rules and patterns of DNA methylation through comparative analyses between genomes of highly divergent species. The recent advancements in single molecule real-time (SMRT from PacBio) and nanopore (from Oxford Nanopore Technologies) sequencing allow for increasingly accurate measurements of DNA methylation without chemically modifying the DNA, at the same time as generating reads long enough to cover repetitive regions of the genome (Flusberg et al. 2010; Gigante et al. 2019; Amarasinghe et al. 2020). As these technologies continue to improve, they will almost certainly lead us into a new era of DNA methylation understanding due to their ability to natively measure methylation across multi-kilobase regions of the genome, as well as to interrogate loci like centromeres, which were previously unmappable (Naish et al. 2021; Gershman et al. 2022).

Figure 1.2: Bisulphite treatment followed by PCR and DNA sequencing allows for quantitative nucleotide-level measurements of methylation levels. (A) Diagram of bisulphite treatment of DNA, which results in the conversion of unmethylated cytosines (C) to uracil (U) but does not affect methylated cytosines. Following PCR amplification, using either locus-specific or indexing primers (for a genome-wide approach), the uracil bases are converted to thymine (T).
(B) Schematic example of how methylation levels are calculated at a base-level resolution using sequencing reads from bisulphite-converted and PCR amplified DNA derived from a sequence containing CpGs with unknown methylation levels. Methylation levels are calculated by measuring the number of Cs versus Ts sequenced at a particular CpG site.

[Figure 1.2 panel labels: (A) bisulphite conversion; PCR amplification; pyrosequencing for locus-specific methylation analyses; genome sequencing to assess global methylation levels. (B) DNA sequence with unknown methylation levels; sequencing reads of bisulphite-converted and PCR-amplified DNA, with example methylation levels of 50%, 10% and 90%. Methylation levels (%) = # C / (# C + # T).]

1.1.3 Principles and patterns of DNA methylation

The first WGBS experiments were published in 2008 for Arabidopsis thaliana and established the different sequence contexts and proportions in which cytosine methylation can exist in plants (Cokus et al. 2008; Lister et al. 2008). In plants, methylation is found at 6.7% of CHG and 1.7% of CHH contexts, while methylation more frequently occurs at CpG dinucleotides (24% are methylated) (Law and Jacobsen 2010). These studies found that methylation is enriched at the pericentromeric regions, and that CpG methylation is especially enriched at gene bodies and transposable elements compared to the other kinds of DNA in plants. Additionally, non-CpG methylation is found at transposable elements, but not within gene bodies. Shortly after the publication of the plant WGBS studies came two genome-wide methylation studies for pluripotent and differentiated mammalian cells (human and mouse). They affirmed two fundamental principles regarding DNA methylation in mammals: 1) the mammalian genome is generally depleted of non-CpG methylation and 2) 70-80% of CpGs are methylated (Meissner et al. 2008; Lister et al. 2009).
The prevalence of methylation in mammals supports the hypothesis that the genome is by default methylated and regulated to be unmethylated wherever necessary (Edwards et al. 2010). Two additional studies generated WGBS datasets for many more species (20 additional species including the first full mouse methylome) (Feng et al. 2010a; Zemach et al. 2010). These studies found that methylation patterns and prevalence vary greatly amongst divergent species (Figure 1.3), and that gene body methylation is a universally conserved feature of eukaryotic DNA methylation [footnote 5]. In the two following sections, we outline the distinctive patterns of DNA methylation throughout the mammalian genome: more specifically, hypomethylation of CpG islands and hypermethylation of gene bodies and transposable elements.

Footnote 5: To clarify, this does not mean that all gene bodies of eukaryotic organisms are methylated, but that they are more likely to be methylated compared to gene promoters.

Figure 1.3: Genomic methylation patterns and prevalence differ between species. Methylation from whole genome bisulphite sequencing (WGBS) data of eight different species plotted across (A) gene bodies and (B) repetitive regions, which represent both transposable elements and satellite DNA. Figure from Feng et al. 2010a.

1.1.3.1 CpG islands

Originally identified in the 1980s, CpG islands (CGIs) are features unique to vertebrate genomes and can be characterised by their high CpG dinucleotide density and lack of methylation (Bird 1980; Feng et al. 2010a). CGIs are generally around 200-1000 base pairs in length and associate with ~60-70% of annotated gene promoters in both human and mouse, especially those of housekeeping genes (Gardiner-Garden and Frommer 1987; Antequera 2003; Saxonov et al. 2006). It is important to note that only ~10% of CpGs in the mammalian genome are present in CGIs (Bird et al. 1985; Illingworth et al. 2008).
Although they are typically unmethylated, there are some CGIs that can become methylated during normal development. Two well-studied examples of biological processes that involve CGI methylation are genomic imprinting and X chromosome inactivation. However, due to the distinctive nature of the mechanisms involved in these two specific cases, they may not necessarily be reflective of other instances of CGI methylation regulation throughout the genome. Genomic imprinting is a phenomenon by which genes are mono-allelically expressed in a parent-of-origin specific manner. There are ~100 genes regulated in this way, and many are in clusters that are under the control of a single imprinting control region (ICR), of which almost all can be classified as CGIs (Ferguson-Smith 2011; Suzuki et al. 2011). It is the establishment during gametogenesis, and subsequent maintenance throughout development, of differential methylation between parental alleles at ICRs, which allows for monoallelic expression of imprinted genes. Aberrant methylation levels at these ICRs can result in a range of detrimental phenotypes from developmental and growth defects to lethality. Thereby, the regulation of genomic imprinting highlights one of the most clearly defined and essential functions of DNA methylation in mammals. The inactivation of an X chromosome in females occurs to mediate gene dosage (due to the presence of two X chromosomes) and transcriptional repression is stably maintained at CGIs by DNA methylation, as well as other heterochromatin-associated factors (Chaligne and Heard 2014). In X inactivation, CGI methylation is not the primary silencing mechanism, but is necessary for long-term transcriptional repression (Lock et al. 1987). 
In addition to genomic imprints and the X chromosome, there are other CGI promoters in the genome that get methylated during normal development (~10%), such as germline specific gene promoters (Weber…Schubeler 2007), as well as in specific biological contexts such as cancer or in vitro culture (Antequera et al. 1990; Weber et al. 2005; Jones and Baylin 2007; Illingworth and Bird 2009; Maunakea et al. 2010). However, it is still unclear to what extent DNA methylation at CGIs is involved in initiating or maintaining transcriptional repression, or if it is just a molecular marker of silenced genes (Suzuki and Bird 2008). It has been proposed that selection acts to preserve CGIs due to their conserved lack of methylation across different species, combined with the potentially mutagenic properties of methylation (Bird 1980; Antequera 2003; Cohen et al. 2011). Methylation at a CpG dinucleotide is thought to be mutagenic because it can allow for spontaneous deamination from the 5-methylcytosine to a thymine nucleotide base in vitro, as well as in bacteria (Duncan and Miller 1980; Shen et al. 1992; Shen et al. 1994). If left uncorrected by the base excision repair (BER) pathway, this could result in the mutation of a cytosine to a thymine after cellular replication. The mutagenic properties of DNA methylation have been difficult to prove experimentally with mouse pluripotent stem cells (Spada et al. 2020). However, in species that have prevalent CpG DNA methylation, such as mouse and human, the CpG dinucleotide occurs at one quarter of the expected frequency compared to other dinucleotides in the genome (Bird 1980). This reduced frequency of CpGs in the genome is much less pronounced in species that do not have DNA methylation, such as Drosophila melanogaster, which also lacks CGIs (Schorderet and Gartler 1992; Jabbari and Bernardi 2004; Vinson and Chatterjee 2012) (Figure 1.4).
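The depletion of CpG dinucleotides described above is commonly expressed as an observed-to-expected ratio. The sketch below uses the widely used definition (number of CpGs × sequence length) / (number of Cs × number of Gs); the function name is hypothetical, and the code is an illustration rather than the calculation used in the cited studies.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for one strand of a DNA sequence.

    Expected CpG count is (number of C * number of G) / length, so the
    ratio is (CpG count * length) / (C count * G count).
    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return n_cpg * len(seq) / (n_c * n_g)
```

A ratio close to 1 indicates that CpGs occur about as often as the base composition predicts, whereas a value of around 0.25 corresponds to the roughly four-fold depletion described for the bulk mouse and human genomes.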
Figure 1.4: CpG islands are present in vertebrate genomes and absent from genomes lacking DNA methylation. Density of CpG dinucleotides at the chromosome scale across the mouse (left) and Drosophila (right) genomes. The mouse genome is punctuated by the presence of intermittent CpG dense regions, called CpG islands (CGIs), a distinct feature of vertebrate genomes. On the other hand, the Drosophila genome is devoid of methylation and has relatively consistent CpG density throughout, with no CGIs. CpG density is calculated for a 1000-bp window. Figure adapted from Vinson et al. 2012.

1.1.3.2 Gene bodies and transposable elements

Unlike CGI-associated promoters, gene bodies and transposable elements are generally hypermethylated. In the case of gene bodies, housekeeping genes are particularly enriched for methylation, with slight preference at exons compared to introns. As mentioned earlier, gene body methylation is a highly conserved feature that seems to be universal amongst almost all species with DNA methylation in their genomes (Feng et al. 2010a; Zemach et al. 2010), yet it does not have a known function. In mammals, gene body methylation has been hypothesised to prevent spurious intragenic transcriptional activation (Neri et al. 2017) and to regulate alternative splicing of genes (Gelfman et al. 2013). However, these phenomena are quite rare and the findings that implied these functions have proven difficult to reproduce (Teissandier and Bourc'his 2017). In plants, the loss of gene body methylation, through mutations of essential DNA methylation machinery, does not result in major changes in transcription (Roudier et al. 2009), suggesting it is not necessary for gene expression. Further work needs to be done to understand whether gene body methylation is indeed functionally relevant. On the other hand, hypermethylation at transposable elements does seem to have a function to repress transcriptional events.
Transposable elements (TEs) are mobile genetic units that have the potential to “jump around” and integrate into the genome indiscriminately when transcribed or excised, thereby threatening genomic stability. There are many mechanisms that have evolved to suppress TE mobilisation through either transcriptional or post-transcriptional interference, and these can vary between species. In both plant and vertebrate genomes, TEs are hypermethylated, which has led to the hypothesis that DNA methylation originally evolved to silence TEs. This is supported by the fact that when the primary component for preserving CpG methylation profiles between cell divisions is removed in either plants (met1) or mouse (Dnmt1; discussed in the following section), TE expression is increased (Lippman et al. 2003). However, TEs are not homogeneous and are represented by various distinct families that are targeted by different silencing mechanisms. In the case of mice, DNA methylation appears to be especially important for the silencing of the mouse-specific and evolutionarily young TE family of intracisternal A-particles (IAPs) (Walsh et al. 1998). With regards to the mutagenic properties of DNA methylation, the presence of methylation at both TEs and gene bodies raises an interesting paradox. Of all genomic features, housekeeping genes are the most conserved throughout evolution, so it is surprising that they are also inundated with the potential for mutation due to the enrichment of methylation. In contrast, hypermethylation at TEs could provide an evolutionary role for DNA methylation to inactivate TEs through the accumulation of mutations (Goll and Bestor 2005).
There are two separate enzymes, both acting as part of the base excision repair (BER) pathway, that are thought to be specifically responsible for correcting the T-G mismatch that results from the spontaneous deamination of DNA methylation: methyl-CpG binding domain protein 4 (MBD4) and thymine DNA glycosylase (TDG) (Bellacosa and Drohat 2015). In fact, mice lacking MBD4 accumulate more mutations at CpG sites on a reporter sequence compared with WT mice (Millar et al. 2002). Yet it is still unclear how and whether these proteins behave differentially at gene bodies versus TEs with regards to repairing mismatches that arise due to the spontaneous deamination of DNA methylation. A possible explanation is that the enzyme action of MBD4 and/or TDG is coupled with transcription, thereby more easily dealing with mismatches at transcriptionally active gene bodies compared with TEs, which are generally transcriptionally silent. In the following sections, we will take a step back and introduce the factors and co-factors necessary for the placement and removal of DNA methylation throughout the genome, as well as the dynamics of these processes during development.

1.1.4 The maintenance and establishment of DNA methylation

In 1975, two seminal review articles about the potential roles of DNA methylation and its underlying mechanisms were published and subsequently shaped the field (Holliday and Pugh 1975; Riggs 1975). Both papers hypothesised models for how eukaryotic methylation is established and maintained throughout development. Namely, that there are enzymes responsible for depositing methylation at unmethylated sites, and enzymes that preserve methylation during replication. In the current literature, these enzymes are referred to as the de novo and maintenance DNA methyltransferases (DNMTs) respectively. Here, we will first introduce the various DNMTs in the mouse genome and their different functions and structures, as well as an essential co-factor called UHRF1.
A capability of DNA methylation that was not anticipated in the two 1975 review articles (Holliday and Pugh 1975; Riggs 1975) is that it can also be actively removed from the genome – a process that is especially important during embryonic development. Following the in-depth overview of DNMTs, we introduce TET enzymes that are involved in the active removal of methylation and describe the extensive dynamics of DNA methylation that occur during mammalian development.

1.1.4.1 DNA methyltransferases (DNMTs)

1.1.4.1.1 DNMT function and activity

In the mouse genome, there are six annotated DNMTs: DNMT1, DNMT2, DNMT3A, DNMT3B, DNMT3C, and DNMT3L (Lyko 2018). DNMT1 is canonically referred to as the maintenance methyltransferase due to its preference for acting on hemimethylated DNA following the synthesis of new DNA strands during replication (Okano et al. 1998; Goyal et al. 2006). Meanwhile, DNMT3A and 3B are known as the de novo methyltransferases for establishing methylation patterns during early development, and do not show a preference for hemimethylated versus unmethylated DNA (Figure 1.5). DNMT3C is in the same family of methyltransferases as DNMT3A/3B but is mouse specific and exclusively expressed in the male germline (Barau et al. 2016).

Figure 1.5: DNA methylation is established by DNMT3A/3B and maintained by DNMT1. The prevailing hypothesis posits that DNA methylation is established at unmethylated CpG sites by DNMT3A/3B, which are often referred to as the de novo methyltransferases. On the other hand, during replication, DNA methylation is placed on the newly replicated strand (in green) by DNMT1, which is known as the maintenance methyltransferase. Figure inspired by Jones and Liang 2009.

Unlike the rest of the DNMTs, DNMT2 and DNMT3L are not directly involved in the catalytic process of adding a methyl group to the fifth carbon (C5) of a cytosine ring in the DNA.
Originally discovered due to its highly conserved catalytic 5-methylcytosine domain (Okano et al. 1998), DNMT2 does not modify DNA, but is an RNA methyltransferase that specifically mediates methylation at tRNAs (Goll et al. 2006). On the other hand, DNMT3L is specific to eutherian mammals and has no catalytic function, but acts as an important co-factor of DNMT3A to facilitate methylation patterns at genomic imprints and TEs in the germline (Bourc'his et al. 2001; Bourc'his and Bestor 2004; Yokomine et al. 2006; Jia et al. 2007). Genetic studies of these different factors have elucidated the essential nature of DNA methylation in the mammalian genome. In mice, genetic knockouts of Dnmt1 or Dnmt3b are embryonic lethal (~E9.5), Dnmt3a knockouts are lethal postnatally (~4 weeks old), while knockouts of Dnmt3l or Dnmt3c result in male sterility (Li et al. 1992; Okano et al. 1999; Bourc'his et al. 2001; Barau et al. 2016). Although Dnmt3l-deficient females are fertile, their offspring are not viable – a phenotype recapitulated by a germline conditional knockout of Dnmt3a, which also results in male sterility (Bourc'his et al. 2001; Hata et al. 2002; Kaneda et al. 2004; Dura et al. 2022). Meanwhile, the germline-specific deletion of Dnmt3b does not have a discernible phenotype, further highlighting the differential developmental roles between DNMT3A and 3B (Kaneda et al. 2004). Genomes of mouse embryos lacking Dnmt1 are almost completely devoid of methylation (Grosswendt et al. 2020). Knockout embryos of either Dnmt3a or Dnmt3b show only partial loss of methylation (Auclair et al. 2014), but the double knockout (DKO) of both genes results in a similar amount of loss to Dnmt1-deficient embryos (Dahlet et al. 2020). In vitro studies have shown that DNMT1 functions in a processive manner, methylating long stretches of CpGs without dissociating from the single-stranded DNA (Hermann et al. 2004; Vilkaitis et al. 2005).
It is thought that the processive nature of DNMT1 is, in part, what allows for the efficient and stable inheritance of DNA methylation during replication. In support of this, DNMT1 misses a CpG site as it moves along the DNA less than 0.3% of the time (Goyal et al. 2006). In contrast, the extent to which DNMT3A/3B methylate DNA in a processive manner is debated (Jeltsch and Jurkowska 2016). In some studies, DNMT3A was observed to act with processivity (Holz-Schietinger and Reich 2010), while in others this phenomenon was not detected, and instead it was found that it binds in a cooperative manner (with additional DNMT3A proteins) across the DNA (Emperle et al. 2014). Recently, it was shown that DNMT3B does not behave in a cooperative manner like DNMT3A, and can act with processivity, but it is unclear how this compares to DNMT1’s ability to do the same (Norvil et al. 2018; Lin et al. 2020). The processivity of DNA methylation deposition by DNMT1, and perhaps DNMT3B, suggests that DNA methylation states are coordinated along the DNA.

1.1.4.1.2 DNMT structure

The basic structures of DNMTs are conserved throughout evolution. Dnmt1 and Dnmt3a were present in the common ancestor to metazoans, which suggests that species such as D. melanogaster and C. elegans lost these genes throughout evolution. On the other hand, Dnmt3b and Dnmt3l are thought to have resulted from a duplication event of Dnmt3a near to the origin of tetrapods and eutherian mammals respectively, while a duplication of Dnmt3b gave rise to Dnmt3c in mice (Molaro et al. 2020). The shared and unique protein structures of the different DNMTs inform their specific functions (Figure 1.6A). The catalytic domains of DNMTs involved in CpG methylation are conserved between mammals and bacteria (Bestor et al. 1988) and include two key motifs: PC and ENV.
The mechanism by which DNMTs catalyse the addition of a methyl group is initiated via a nucleophilic substitution between the PC motif and the sixth carbon (C6) of the cytosine ring (Figure 1.6B). This reaction is facilitated by a protonation of the third atom (N3) of the cytosine ring by the ENV motif, which in turn allows for the transfer of the methyl group from S-adenosylmethionine (SAM) to the C5 of the cytosine ring (Jeltsch 2002). DNMT1 and DNMT3A/3B both have fully intact methyltransferase domains (which include the PC and ENV motifs), while DNMT3L is truncated at the C-terminal portion and divergent in amino acid sequence at the PC and ENV motifs (Aapola et al. 2000), which is the presumed reason for its catalytic inactivity. Besides the catalytic domain at the C-terminal region of DNMTs, DNMT1 and DNMT3A/3B both harbour differential domains that are necessary for their distinct functions. The N-terminal region of DNMT1 is composed of the following domains: a DNMT1-associated protein 1 (DMAP1) binding domain, a proliferating cell nuclear antigen (PCNA) binding domain, a replication foci-targeting sequence (RFTS) domain, a CXXC domain, and two bromo-adjacent homology (BAH) domains (Chen and Zhang 2020). The RFTS, PCNA, and BAH domains are all necessary for targeting DNMT1 to replication foci to ensure the maintenance of the parent strand methylation state onto the newly synthesised daughter strand (Leonhardt et al. 1992; Chuang et al. 1997; Yarychkivska et al. 2018). The CXXC and BAH domains, in concert with RFTS, are involved in an autoinhibitory mechanism of DNMT1 that is thought to be resolved by a crucial co-factor, UHRF1 (Song et al. 2012; Bashtrykov et al. 2014; Berkyurek et al. 2014). The autoinhibitory function of DNMT1 has been hypothesised to reduce its ability to catalyse the addition of methylation independently of replication (Garvilles et al. 2015) or at sites that are completely unmethylated (Song et al. 2011).
Finally, the DMAP1 binding domain has been proposed to interact with histone deacetylase 2 (HDAC2) and the transcriptional repressor DMAP1 (Rountree et al. 2000). Dnmt3a and Dnmt3b are highly homologous, and both have Pro-Trp-Trp-Pro (PWWP) and ATRX-DNMT3-DNMT3L (ADD) domains, the latter of which is also present in Dnmt3l. The PWWP domain binds to H3K36me3 (Dhayalan et al. 2010), a histone tail modification normally found on the gene bodies of expressed genes, and may be involved in the recruitment of DNMT3A or 3B to these sites (Du et al. 2015). The ADD domain recognises unmodified histone 3 lysine 4 (H3K4) (Otani et al. 2009) and is inhibited by histone methylation at this site (H3K4me3) (Ooi et al. 2007), a mechanism that may be crucial for the maintenance of unmethylated states at CpG islands (Edwards et al. 2010).

Figure 1.6: Protein structures and catalytic mechanism of mammalian DNA methyltransferases. (A) There are four families of DNMTs in mammalian genomes and the conserved domains are shown in different colours. The catalytic domain responsible for the deposition of methylation is shown in red. DNMT2 is in fact an RNA methyltransferase, a functional difference indicated structurally by the black stripe within the red catalytic domain. As shown by the truncation of the catalytic domain, DNMT3L is not capable of actively depositing methylation, but is still essential for methylation establishment during early development through its interactions with other DNMT3s. (B) Schematic showing the catalytic mechanism of DNA methylation deposition. The ENV and PC motifs from the catalytic domain of a methyltransferase (in green) facilitate the addition of a methyl group (in red) onto a cytosine base. S-adenosylmethionine (SAM), the substrate that provides the methyl group, is highlighted in purple. Figure adapted from Lyko 2018.

1.1.4.1.3 UHRF1

UHRF1 is an indispensable co-factor of DNMT1.
Besides its potential role, mentioned above, in releasing DNMT1 from its autoinhibited state, UHRF1 recognises hemimethylated DNA via its SET- and RING-associated (SRA) domain, which is critical for the recruitment of DNMT1 to replication foci (Arita et al. 2008; Avvakumov et al. 2008; Hashimoto et al. 2008). UHRF1 is therefore essential for embryonic development and the maintenance of methylation, as supported by findings that Uhrf1 knockout leads to embryonic lethality and genome-wide depletion of methylation similar to that of Dnmt1-deficient embryos (Bostick et al. 2007; Sharif et al. 2007). UHRF1 is also thought to play a role in the crosstalk between histone modifications and DNA methylation. This is because the Tandem Tudor and PHD domains of UHRF1 specifically recognise the histone tail modification H3K9me3; however, the extent to which this crosstalk is important for DNA methylation genome-wide is unclear (Du et al. 2015).

1.1.4.2 TET proteins and active demethylation

The loss of a methylated state can occur either through the passive loss of methylation due to errors in maintenance by DNMT1, or through the active oxidation of methylation by ten-eleven translocation (TET) proteins (Hill et al. 2014). So far, three mammalian TET proteins have been identified (TET1, TET2, and TET3), each of which can catalyse the oxidation of 5-methylcytosine (5mC) to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), or 5-carboxycytosine (5caC) (Figure 1.7A) (Tahiliani et al. 2009; He et al. 2011; Ito et al. 2011). Despite all having the same functional capacity, the three TET enzymes have relatively distinct biological functions due to differential regulation throughout development and between cell types (Melamed et al. 2018). Notably, the double knockout of Tet1 and Tet2 can produce viable mice (Dawlaty et al. 2013), but Tet3-deficient mice die at birth (Gu et al. 2011).
TET-mediated demethylation has been proposed to occur by two distinct mechanisms (Figure 1.7B) whereby oxidation of 5mC either results in 1) passive dilution of the methylation state during replication because DNMT1 and UHRF1 cannot recognise oxidised forms of 5mC, or 2) targeting for excision by thymine DNA glycosylase (TDG) followed by replacement with an unmethylated cytosine via base excision repair (BER) (Kohli and Zhang 2013). In support of the passive dilution mechanism, in vitro biochemical assays have shown that the efficiency with which DNMT1 catalyses the addition of methyl groups on a newly synthesised strand is greatly reduced when positioned opposite a hydroxymethylated CpG (5hmC) (Valinluck and Sowers 2007; Ji et al. 2014; Seiler et al. 2018). Despite this, it was recently shown that the TET-TDG-BER pathway, as opposed to the passive dilution of 5hmC, is the major contributor to demethylation events during induced pluripotent stem cell (iPSC) reprogramming from somatic cells (Caldwell et al. 2021).

Figure 1.7: Mechanisms of passive and active demethylation. (A) Diagram of cytosine methylation (5mC; black) and iterative oxidation reactions mediated by TET enzymes to produce 5hmC (orange), 5fC (blue), and 5caC (purple). (B) Mechanistic differences between TET-mediated passive and active demethylation. Passive demethylation occurs because DNMT1/UHRF1 do not recognise the oxidised forms of 5mC for maintenance during DNA replication. Meanwhile, active demethylation is the targeting of 5hmC, 5fC, or 5caC by thymine DNA glycosylase (TDG) for excision, followed by replacement with an unmethylated cytosine (C; white) via the base excision repair (BER) pathway. Figure adapted from Lio and Rao 2019.
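The passive dilution mechanism can be made concrete with a short calculation. The sketch below is an illustration only and is not taken from any of the cited studies. It assumes the limiting case in which maintenance at oxidised sites fails completely and no de novo methylation occurs, so that each replication pairs a modified parental strand with an unmodified nascent strand and the per-strand modified fraction halves. The function name `passive_dilution` and the starting value are hypothetical.

```python
def passive_dilution(initial_fraction, divisions):
    """Per-strand modified fraction over successive replications.

    Assumption (idealised): the mark is never copied onto the nascent
    strand and there is no de novo methylation, so the fraction of
    modified strands halves with every round of replication.
    """
    if not 0.0 <= initial_fraction <= 1.0:
        raise ValueError("initial_fraction must lie between 0 and 1")
    if divisions < 0:
        raise ValueError("divisions must be non-negative")
    # Entry n of the list is the fraction remaining after n replications.
    return [initial_fraction * 0.5 ** n for n in range(divisions + 1)]


if __name__ == "__main__":
    # Hypothetical starting level of 80% modified strands.
    for n, fraction in enumerate(passive_dilution(0.8, 4)):
        print(f"after {n} replication(s): {fraction:.3f}")
```

Under these assumptions the decline depends only on the number of replications, which is what distinguishes passive dilution from the replication-independent TDG-BER route described above.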
1.1.4.3 Dynamics of DNA methylation throughout mammalian development

Patterns of DNA methylation are highly dynamic throughout mammalian development, during which the genome undergoes two distinct rounds of epigenetic reprogramming: 1) in the early embryo during preimplantation development, and 2) in the developing germline post-implantation (Figure 1.8) (Greenberg and Bourc'his 2019). Both reprogramming events result in almost complete erasure of DNA methylation, as well as a near complete reset of histone tail modifications (Feng et al. 2010b). In the zygote, immediately following fertilisation, methylation is both passively lost due to the exclusion of DNMT1 from the nucleus (Carlson et al. 1992; Mertineit et al. 1998) and actively removed via TET3-mediated demethylation (Gu et al. 2011; Guo et al. 2014; Shen et al. 2014). By the blastocyst stage of development (E3.5 in mice), genome-wide methylation levels dip to a nadir of ~20% (Wang et al. 2014). Following implantation of the mouse embryo at E4.5, the genome is rapidly methylated by DNMT3A/3B (Santos et al. 2002; Dahlet et al. 2020) to roughly somatic levels of global methylation, ~70%, by E6.5 (Seisenberger et al. 2012).

Figure 1.8: Dynamics of methylation throughout mouse development. DNA methylation goes through two distinct rounds of epigenetic reprogramming during early mammalian development. Immediately following fertilisation, methylation is passively and actively lost, due to the exclusion of DNMT1 from the nucleus and TET3-enzymatic activity, respectively. By the blastocyst stage (E3.5), global methylation levels drop to ~20%, but after embryo implantation (E4.5) the genome is rapidly methylated by DNMT3s until E6.5 when methylation levels reach ~70%, a state that is globally maintained throughout the rest of somatic development.
The second round of reprogramming occurs during germline development starting at E7.25, when a subset of stem cells passively and actively (by TET1/2) lose methylation to form primordial germ cells (PGCs) by E13.5 with global methylation levels of ~7%. The formation of female and male gametes differs in both methylation reacquisition (although both require DNMT3L and DNMT3A) and developmental timing. Male germ cells are remethylated before birth to global levels of ~80% and undergo additional methylation regulation at a subset of transposable elements by DNMT3C. Female germ cells are not completely remethylated until after birth, during ovulation, after which global methylation levels reach ~50%. Figure from Greenberg and Bourc’his 2019.

In the post-implantation epiblast (E7.25), a subset of stem cells undergoes germline specification to primordial germ cells (PGCs) (Ginsburg et al. 1990). These PGCs are epigenetically reprogrammed through both passive and TET1/2-mediated active DNA demethylation mechanisms (Yamaguchi et al. 2012; Hackett et al. 2013; Yamaguchi et al. 2013a; Yamaguchi et al. 2013b). By E13.5, global methylation levels of PGCs are depleted to ~7% (Wang et al. 2014), after which there is a global gain in methylation largely mediated by DNMT3L and DNMT3A (Bourc'his and Bestor 2004; Kaneda et al. 2004; Kato et al. 2007; Smallwood et al. 2011). This process of remethylating the germline genome differs between male and female gametes. With additional methylation deposition provided by DNMT3C, male germ cells are remethylated before birth to ~80% global methylation levels (Barau et al. 2016). On the other hand, female gametes are not completely remethylated until after birth during ovulation (Sasaki and Matsui 2008), after which the levels of methylation reach ~50% genome-wide, which is notably lower than in sperm (Wang et al. 2014).
Of interest is what remains methylated following the global depletion in the developing embryo and gametes, with global methylation levels of ~20% and ~7%, respectively. It has been proposed that DNA methylation can be inherited inter- or even trans-generationally at these sites (Skvortsova et al. 2018). In both the developing embryo and the gametes, methylation is partially retained at young TEs, such as intracisternal A-particles (IAPs) (Hajkova et al. 2002; Lane et al. 2003; Lees-Murdock et al. 2003), though the precise loci that retain methylation have been difficult to ascertain. In the germline, DNMT3L-mediated methylation at these elements is thought to be regulated by the conserved PIWI-associated RNA (piRNA) pathway, which protects genomes against TE mobilisation (Aravin et al. 2008; Tam et al. 2008; Molaro et al. 2014). Besides IAPs, imprinting control regions (ICRs) are also resistant to the global erasure of methylation in the developing embryo, but not the germline (Ferguson-Smith 2011). Genomic imprints are initially established in the PGCs, a process that requires both DNMT3A and DNMT3L. In the developing embryo, parent-of-origin-specific methylation states are retained through the action of two Krüppel-associated box (KRAB)-containing zinc finger (ZF) proteins (KZFPs), ZFP57 and ZFP445 (Li et al. 2008; Strogantsev et al. 2015; Takahashi et al. 2019). Moreover, it was shown in mouse embryonic stem cells (mESCs; a cell culture model of the pluripotent blastocyst) that KZFP-mediated targeting induces DNA methylation, required for the deposition of the repressive histone mark H3K9me3, at ICRs (Quenneville et al. 2011; Quenneville et al. 2012). This supports a model by which KZFPs are involved in maintaining DNA methylation at a subset of sites during global demethylation in the developing embryo.
1.1.5 Crosstalk between DNA methylation and histone tail modifications

As epigenetic marks that are involved in transcriptional regulation, histone tail modifications and DNA methylation share many connections and correlations, but these are complex and difficult to untangle. Compared to DNA methylation, histone tail modifications are distinct with regard to their function and genomic location, so it is well understood how different marks regulate different regions of the genome. Here we dissect the relationships between DNA methylation and four different histone tail modifications: histone 3 lysine 9 trimethylation (H3K9me3), histone 3 lysine 27 trimethylation (H3K27me3), histone 3 lysine 4 trimethylation (H3K4me3), and histone 3 lysine 36 trimethylation (H3K36me3).

Figure 1.9: Histone tail modifications have distinct functions at specific regions of the genome. Histone tail modifications are either associated with transcriptional activation or repression. H3K4me3 (in green) is found at the promoters and around the transcription start sites of active genes, while H3K36me3 (in yellow) is associated with active gene bodies. On the other hand, H3K9me3 and H3K27me3 are two repressive histone marks. H3K9me3 (in red) is generally targeted to gene-poor regions of the genome where it is involved in the transcriptional repression of transposable elements. H3K27me3 (in blue) is located in gene-rich regions of the genome to silence genes. Figure adapted from Stefano 2022.

Genome organisation is facilitated by the wrapping of DNA around histones to form nucleosomes (Luger et al. 1997). Histone proteins have N-terminal amino acid tails that can be modified by a range of chemical modifications, including methylation and acetylation. Histone tail modifications are involved in transcriptional regulation of the genome with some associating with repression and others with activation (Figure 1.9).
Like DNA methylation, repressive modifications such as H3K9me3 and H3K27me3 can be stably maintained between cell divisions by histone retention at replication foci (Margueron et al. 2009). Conversely, histone marks associated with transcriptional activation, such as H3K4me3, are re-established following replication (Escobar et al. 2019). In this section, we dissect the relationships between DNA methylation and both transcriptionally repressive and active histone modifications.

1.1.5.1 Transcriptionally repressive histone marks

H3K9 methylation is enriched in heterochromatic gene-poor regions of the genome, such as TEs and pericentromeric repetitive satellite elements, where DNA methylation can also be found (Pauler et al. 2009). There are six methyltransferases that catalyse H3K9 methylation in the mouse genome: SUV39H1, SUV39H2, G9A, GLP, SETDB1, and SETDB2 (Padeken et al. 2022). The three canonical DNMTs, DNMT1 and DNMT3A/3B, as well as UHRF1, have been shown to directly interact with this H3K9 methylation enzymatic machinery (Fuks et al. 2003; Esteve et al. 2006; Li et al. 2006; Chang et al. 2011), but the extent to which these interactions influence global DNA methylation levels is limited and dependent on developmental and genomic contexts. In Suv39h1-/-Suv39h2-/- mice, DNA methylation is lost at major satellite repeats in pericentromeric regions, while Dnmt1-/- or Dnmt3a-/-Dnmt3b-/- mouse embryos do not show any changes in H3K9 methylation at these sites (Lehnertz et al. 2003). Additionally, mESCs deficient for the H3K9 methyltransferase G9a show reduced levels of DNA methylation at retrotransposons, as well as ICRs (Dong et al. 2008; Zhang et al. 2016). However, this second point is controversial as it has been shown that at ICRs and ZFP57-bound regions, H3K9 methylation is lost in Dnmt1-/-Dnmt3a-/-Dnmt3b-/- triple knockout mESCs, indicating that H3K9me3 is secondary to DNA methylation at these loci (Quenneville et al. 2011; Shi et al. 2019).
Overall, these findings suggest a model in which H3K9 methylation can precede DNA methylation to silence most regions of the genome (Padeken et al. 2022), but the opposite is likely to be the case at genomic imprints where DNA methylation precedes H3K9me3. It has been proposed that the direct binding of UHRF1, via its Tandem Tudor domain (TDD), to H3K9 methylation is essential for DNA methylation maintenance mediated by DNMT1 (Rothbart et al. 2012; Rothbart et al. 2013). But when this interaction is abrogated by mutating the TDD of UHRF1, global DNA methylation levels are only reduced by ~10% (Zhao et al. 2016). Additionally, when all six H3K9 methyltransferases are knocked out in mouse embryonic fibroblasts (MEFs), a somatic context in which DNA methylation is globally maintained by DNMT1, DNA methylation is only modestly reduced by ~20% genome-wide (Montavon et al. 2021). Therefore, H3K9 methylation may precede and promote de novo DNA methylation in the early embryo at most genomic loci, but H3K9me3 and DNA methylation at sites catalysed by DNMT1 appear to be largely independently regulated.

The histone tail modification H3K27me3 is enriched at inactive genes and maintains their transcriptional repression at promoters via the polycomb repressive complex 2 (PRC2) (Pauler et al. 2009). Loss of H3K27me3 results in little change to DNA methylation, while loss of DNA methylation results in acquisition of H3K27me3 genome-wide, which suggests that DNA methylation and H3K27me3 are molecular antagonists (Brinkman et al. 2012; Hagarman et al. 2013).

1.1.5.2 Transcription-associated histone marks

Another histone modification that has a proposed antagonistic relationship with DNA methylation is H3K4me3, which is generally found at the promoters of transcribed genes.
In fact, enrichment of H3K4me3 is mutually exclusive with DNA methylation and, whereas DNA methylation seems to block H3K27 methylation, H3K4 methylation seems to block DNA methylation, particularly at CGIs (Ooi et al. 2007; Weber and Schubeler 2007). As a final example of how DNA methylation can crosstalk with histone tail modifications, histone 3 lysine 36 trimethylation (H3K36me3) and DNA methylation are both enriched at exons and introns of actively transcribed genes (Rose and Klose 2014). The PWWP domains of DNMT3A/3B can recognise H3K36me3 (Rondelet et al. 2016), and when they are disrupted by mutations in mESCs, gene body DNA methylation is reduced (Baubec et al. 2015). Gene body DNA methylation is also reduced when the enzyme responsible for catalysing H3K36 methylation, SETD2, is depleted (Morselli et al. 2015). However, there is still strong enrichment of DNA methylation at gene bodies in somatic contexts when DNMT3A/3B are absent, suggesting that there may be an as yet undiscovered relationship between H3K36me3 and DNMT1.

1.1.6 DNA methylation fidelity

So far, we have presented DNA methylation as an epigenetic mark that is established during early development by DNMT3s and then stably propagated through mitosis by DNMT1. However, there is accumulating evidence that DNA methylation is not always stably inherited between cellular divisions. In fact, two studies from the early 1980s were among the first to observe this instability (Pollack et al. 1980; Wigler et al. 1981). After transfecting viral plasmids containing unmethylated or artificially methylated DNA into mouse fibroblasts, Pollack and colleagues used methylation-sensitive restriction enzymes to examine the presence or absence of DNA methylation at the transfected sequences in clonal cell lines after 25 to 30 divisions (Pollack et al. 1980).
They found that methylation was only preserved in 2 out of the 8 clonal lines derived from methylated DNA transfections, and that the unmethylated state was maintained in 13 out of 14 clonal lines derived from the unmethylated DNA transfections – meaning that one line gained methylation. The findings were generally inconclusive about whether DNA methylation is replicated via a semiconservative process. Wigler and colleagues, using almost the same techniques, found that of 9 clonal lines derived from methylated DNA transfections, all appeared to retain inconsistent levels of DNA methylation, and of 10 clonal lines derived from unmethylated DNA transfections, none exhibited methylation (Wigler et al. 1981). They concluded that somatic cells can mediate the faithful inheritance of methylation, but not that they always do. The two studies reviewed above in detail analysed methylation levels through cell divisions at sequences not originally present in the genome, which leaves open the possibility that DNA methylation is inherited differently at transfected loci. However, other groups that used subcloning approaches to infer methylation fidelity at endogenous sequences in the genome came to similar conclusions. The use of largely qualitative approaches to assess methylation levels, together with a restriction to specific loci that were of interest for early functional methylation studies, led to varied and sometimes confounding interpretations and conclusions regarding methylation fidelity (Shmookler Reis and Goldstein 1982a; Shmookler Reis and Goldstein 1982b; Turker et al. 1989; Pfeifer et al. 1990; Shmookler Reis et al. 1990). Nevertheless, across all the studies, DNA methylation levels at a particular CpG site, or group of sites, were not always consistent between subclonal cell lines – implying imperfect mitotic inheritance of methylation at the tested genomic loci.
More recent quantitative and genome-scale approaches have observed the existence of unfaithful methylation (also referred to as “stochastic” methylation) and have shown that it is more prevalent in pluripotent compared to somatic cells (Landan et al. 2012; Shipony et al. 2014). Another way to approximate methylation fidelity is to measure the presence of hemimethylation. This is done by hairpin-bisulphite PCR (or sequencing), which allows strand-specific methylation patterns to be measured. The logic for why hemimethylation can be used as a proxy for methylation fidelity stems from the widely accepted mechanism for how methylation patterns are inherited during cell replication. Through UHRF1, DNMT1 recognises hemimethylated CpG sites after replication and deposits methylation onto the newly synthesised strand (Hermann et al. 2004); therefore, the presence of homeostatic hemimethylation is indicative of this process not coming to completion. Using hairpin-bisulphite PCR in a locus-specific manner revealed that hemimethylation, and by proxy methylation infidelity, can exist in varying amounts between different genomic loci (Laird et al. 2004; Arand et al. 2012). This concept promoted the idea that hemimethylation induced by imperfect DNMT1-mediated methylation maintenance will lead to loss of methylation unless accompanied by de novo methylation (Riggs and Xiong 2004). When assessed at the global scale by genomic sequencing, more than half of CpGs with 50-90% methylation levels exhibited hemimethylation (Zhao et al. 2014). Additionally, the prevalence of hemimethylation genome-wide was shown to decrease during mESC differentiation – consistent with the finding mentioned above that unfaithful methylation inferred by subcloning is more prevalent in pluripotent cells compared to somatic ones.
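The idea that imperfect maintenance leads to loss of methylation unless it is balanced by de novo methylation can be illustrated with a toy recursion. The sketch below is not the model of any cited study, and all parameter values are hypothetical. It assumes a single CpG site tracked as the fraction m of methylated strands in a population, a probability `maintenance` that the nascent strand is methylated opposite a methylated parental CpG, and a probability `de_novo` that it is methylated opposite an unmethylated one. The parental strand is assumed to be unchanged.

```python
def next_fraction(m, maintenance, de_novo):
    """Per-strand methylated fraction after one round of replication."""
    # Nascent strand: copied from methylated templates, or methylated de novo.
    nascent = m * maintenance + (1.0 - m) * de_novo
    # The next generation of templates is half parental, half nascent strands.
    return 0.5 * (m + nascent)


def simulate(m0, maintenance, de_novo, divisions):
    """Trajectory of the methylated fraction over a number of divisions."""
    trajectory = [m0]
    for _ in range(divisions):
        trajectory.append(next_fraction(trajectory[-1], maintenance, de_novo))
    return trajectory


def steady_state(maintenance, de_novo):
    """Fixed point of next_fraction: de_novo / (1 - maintenance + de_novo)."""
    denominator = 1.0 - maintenance + de_novo
    if denominator == 0.0:
        return None  # perfect maintenance, no de novo: every level is stable
    return de_novo / denominator


def hemimethylated_fraction(m, maintenance, de_novo):
    """Fraction of newly replicated CpG dyads methylated on one strand only."""
    return m * (1.0 - maintenance) + (1.0 - m) * de_novo


if __name__ == "__main__":
    # Hypothetical rates: 95% maintenance efficiency, with and without de novo.
    print(simulate(1.0, 0.95, 0.0, 50)[-1])   # decays towards 0
    print(simulate(1.0, 0.95, 0.05, 50)[-1])  # approaches steady_state(...)
    print(steady_state(0.95, 0.05))
```

In this sketch, a site with imperfect maintenance and no de novo activity drifts towards the unmethylated state, whereas any non-zero de novo rate yields an intermediate steady state at which some hemimethylated dyads are present after each replication.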
From these studies, ideas have begun to emerge for how DNA methylation may be maintained through cellular divisions unfaithfully or stochastically (Riggs and Xiong 2004; Jones and Liang 2009; Jeltsch and Jurkowska 2014). However, the extent and principles dictating this instability have yet to be fully determined – in Chapter 3 of this thesis, we decipher more precisely how methylation is propagated between cell divisions and to what degree it is stochastically, rather than clonally, inherited.

1.2 Transposable elements

Repressing the transcription of transposable elements (TEs) is an established example of a distinct biological role for DNA methylation in the genome. For the remaining sections of the introduction, we will describe TEs in greater depth and focus on a subset of variably methylated TEs called “metastable epialleles”. Lastly, we will introduce the transcription factor CTCF, which has many binding sites within young TEs, is enriched at metastable epialleles, and can exhibit methylation sensitivity with regard to binding the DNA. TEs are DNA sequences that can change location within a genome (Bourque et al. 2018). Since TEs were originally identified in maize by Barbara McClintock (McClintock 1950), it has become clear that they, and their evolutionarily inactive remnants, often account for large proportions of many eukaryotic genomes. For example, 40% of the mouse genome is thought to be made up of TE genetic material (Mouse Genome Sequencing et al. 2002). This proportion is comparable to the TE content of other mammals (Figure 1.10) and very likely underestimates the true proportion due to the limitations of short-read Illumina sequencing (Platt et al. 2018), which is unable to read sequences long enough for unique mapping to the genome of identical (or similar) TEs. The recent advent of long-read sequencing will allow for more accurate estimates of TE content in the future (Shahid and Slotkin 2020).
TEs are classified by their mechanism of transposition, although most identified TEs in the mouse genome no longer have the ability to actively mobilise (Huang et al. 2012). The overwhelming majority of TEs in the mouse genome (96%) are classified as retrotransposons, which can mobilise via an RNA intermediate prior to reintegration into the genome (Mouse Genome Sequencing et al. 2002; Nellaker et al. 2012). The remaining 4% of TEs are classified as DNA transposons, which can be excised as double-stranded DNA and reintegrate into the genome without being transcribed as an RNA intermediate. Although DNA transposons are still actively mobile in many species, in the mouse genome there is no evidence for transposition of a DNA transposon for the last 40 million years (Feschotte and Pritham 2007) – some retrotransposons, on the other hand, are still active.

Figure 1.10: Genomic transposable element content varies between species. Transposable elements (TEs) make up ~40% of the mouse genome, which is comparable to the human genome in terms of amount and composition, both of which can differ quite radically from other species. For example, more than 50% of the zebrafish (Danio rerio) genome is composed of TEs, most of which are DNA transposons (in purple), which are present in comparatively low amounts in both mouse and human. Helitrons (in orange), a specific family of DNA transposons, are found in many genomes throughout the tree of life, but the only mammalian genomes in which they have been documented are those of bats. Figure from Huang, Burns, and Boeke 2012.

1.2.1 Retrotransposons

In mammals, there are three major classes of retrotransposons: long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and long terminal repeat (LTR) elements (Figure 1.11) (Bourque et al. 2018).
The DNA sequence of LINEs and LTR retrotransposons can encode proteins that allow for autonomous retrotransposition, whereas SINEs require LINE-derived proteins to mobilise (Dewannieux and Heidmann 2005). A full-length LINE is around 7 kilobases (kb) long and contains two open reading frames that encode ORF1 and ORF2 (Boissinot and Sookdeo 2016). ORF1 is an RNA-binding protein that functions as a chaperone to facilitate the reverse-transcriptase and endonuclease functions of ORF2. The process of reverse transcription converts the RNA of LINEs or SINEs into DNA for integration into the genome, which is initiated by endonuclease activity. Most LINEs in the mouse genome are truncated and do not have the ability to mobilise – it has been estimated that there are ~3000 active LINE elements (Goodier et al. 2001). SINEs are much shorter than LINEs (< 700 base pairs (bp) in length) and contain an RNA polymerase III promoter – with A and B block regions – thought to be derived from either tRNAs or rRNAs, as well as an internal region homologous to LINE elements, which may allow for the facilitation of retrotransposition via the LINE machinery (Ferrigno et al. 2001).

Figure 1.11: Genetic structures of LINEs, SINEs, and LTR retrotransposons. The three major classes of retrotransposons in mammals (LINEs, SINEs, and LTR retrotransposons) are fundamentally diverse in genetic structure, which informs their different mechanisms of mobilisation. LINEs contain two open reading frames (ORF1/2) that encode proteins that facilitate reverse transcription and re-integration into the genome. SINEs utilise the LINE-derived proteins, as they do not encode their own. However, they do contain RNA polymerase III promoters (with A and B block regions) to independently induce transcription. The internal portions of LTR retrotransposons are flanked by LTRs and contain gag, pol, and env genes.
Gag and pol encode a fusion protein necessary for retrotransposition, while env encodes an envelope protein that allows the retrotransposon to exit the cell to infect others. Although the LTRs of a full-length element are identical, the 5' LTR typically acts as a promoter, while the 3' LTR behaves as a transcription termination site. Figure adapted from Fueyo et al. 2022.

LTR retrotransposons are characterised by two identical non-coding long terminal repeats that flank a set of genes encoding proteins that are required for retrotransposition. LTRs are roughly 200-600 bp in length, while the internal portion of the elements can span between 5 and 7 kb. In mammals, all LTR retrotransposons are derived from a superfamily of endogenous retroviruses (ERVs) that likely arose from retroviruses inserting into the germline genome and then being inherited as proviruses through subsequent generations (Gifford et al. 2018). The structure of an ERV is therefore very similar to that of a retrovirus. Each LTR is subdivided into two unique regions (U3 and U5) that flank a regulatory region. Directly downstream of the U5 portion of the 5' LTR is the primer binding site (PBS), which is a highly conserved sequence that is essential for reverse transcription of the ERV (Havecker et al. 2004). Meanwhile, the internal portion of the element, between the two LTRs, contains gag, pol, and env genes that encode various polyproteins required for the propagation of the TE either intra- or extracellularly (Havecker et al. 2004). The gag gene encodes core structural proteins that form the virus-like particle in which the retrotransposon RNA will undergo reverse transcription.
The pol gene encodes: 1) a reverse transcriptase that transcribes the ERV RNA into double-stranded DNA (dsDNA); 2) an integrase that processes the 3' end of the ERV dsDNA to produce 3' hydroxyl groups, cleaves the host DNA, and facilitates the ligation between the processed ERV dsDNA and host DNA; and 3) a protease that processes the ERV polyproteins into functional units. Therefore, gag and pol harbour the machinery required for the ERV to retrotranspose intracellularly. Finally, the env gene encodes the viral envelope, which protects the ERV and allows it to exit its host cell to infect other cells. Most annotated ERVs in mammalian genomes exist as immobile solo LTRs that arise due to inter-LTR homologous recombination – for example, in the human genome, solo LTRs represent ~90% of all ERV insertions (Stoye 2001; Jern and Coffin 2008; Friedli and Trono 2015). Of the full-length elements, most have l