Investigating normal human gene expression in tissues with high-throughput transcriptomic and proteomic data. Mitra Parissa Barzine July 2020 Hughes Hall, University of Cambridge EMBL’s European Bioinformatics Institute (EMBL-EBI) This dissertation is submitted for the degree of Doctor of Philosophy The thesis source code is available at https://github.com/barzine/thesis or https://barzine.net/~mitra. Licensed under Creative Commons Attribution International (CC BY) 4.0 DECLARAT ION I hereby declare that this thesis • is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specified in the text; • is not substantially the same as any that I have submitted, or, is being concurrently submitted for a degree or diploma or other qualification at the University of Cambridge or any other University or similar institution; I further state that no substantial part of my thesis has already been submitted, or, is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution. • contains fewer than the prescribed limit of 60,000 words exclusive of tables, footnotes, bibliography, and appendices and has fewer than 150 figures. Mitra Parissa Barzine July 2020 SUMMARY Investigating normal human gene expression in tissues with high-throughput transcriptomic and proteomic data. — By Mitra Parissa Barzine. With the improvement of high-throughput technologies during the last decade, several studies exploring the normal gene expression in human tissues have been published. Many studies examine the transcriptome with RNA sequencing (RNA-Seq), and others probe the proteome with unlabelled bottom-up Mass Spectrometry. As the sampling of undiseased tissues is difficult, the community often refers to expression atlases, which are collating these studies, to support or validate new findings. Despite many overlapping tissues between the studies, few atlases attempt to integrate all the data. In this thesis, I investigate the consistency of gene expression across tissues and studies in human with the help of transcriptomics captured with high-throughput sequencing (RNA-Seq) and proteomics generated with label-free bottom-up Mass Spectrometry (MS). After describing the transcriptomic and proteomic data and their state-of-art processing (Chapter 2), I review several identified sources of biases and my approaches to limit their effects (Chapter 3). The integration of the various transcriptomic datasets (Chapter 4) shows that the biological signal dominates the technical noise for RNA-Seq data. Tissue samples display higher levels of correlation for identical tissues in other studies than for other tissues in the same datasets. In other words, interstudy correlations for identical tissues are higher than correlations between different tissues within the same study. Globally, genes show similar expression profiles across studies for a given set of tissues. All genes categories are involved, including the tissue-specific genes and the ubiquitously expressed ones. After briefly discussing comparisons of proteomic data, I introduce a new proteomic quantification method, PPKM (Chapter 5). The PPKM method allows me to quantify about twice as many proteins compared to usual methods. Limited numbers of previous studies have shown various correlation levels between the expression of protein and mRNA in studies combining high-throughput transcriptomics and proteomics. I show that, for most tissues, we can observe quite good correlation levels (i.e. significantly better than expected by chance), even when the samples have different biological and technical backgrounds as they have been independently sourced. Many genes share similar patterns of expression between the two biological layers, e.g. genes that have a protein detected in a single tissue are more likely to have their mRNA showing specificity for the same tissue. Additionally, three groups of genes present functional enrichments of biological processes. Genes having highly correlated protein and mRNA expressions across tissues are enriched in catabolic processes. Genes having the most anticorrelated expressions are enriched for ribosomes and ncRNAs regulation. Genes with a protein detected in a single tissue are enriched in signalling processes. Overall, this thesis describes a global picture of the current consolidated knowledge we can extract from the joint study of public transcriptomic and proteomic data. Beyond confirming or improving observations reported in the literature, this work provides new insights into the ubiquitous and tissue-specific genes. To the best of my knowledge, this work has also established the most extensive list of genes with robust transcriptomic and proteomic expression across tissues and studies. Furthermore, it shows that joint study approaches can help the development of newmethods, like the new proteomic PPKM quantification method. Finally, the highlighting of distinct functional enrichment profiles for groups of genes across tissues and studies lays a framework for further research. v PREFACE The old saying where there’s a will, there’s a way fails to mention that proper support and means are must-haves in the journey that one takes for a PhD. Many institutions and people have a direct hand in the successful completion of this thesis. I thank the EMBL predoc program, the University of Cambridge and Hughes Hall college to have provided me with the working and living environment for one of the most formative periods of my life. I am extremely grateful tomy supervisor DrAlvis Brazmawithoutwhom this adventurewouldn’t have happened. Thank you, Alvis, for accepting me into the Functional Genomic group, which comprises many exceptional people that contributed to my day-to-day life and enriched me as a person. Thank you for your trust and support through this roller coaster that was my doctorate. I want to express my sincere thanks to Prof. Kathryn Lilley and Prof. Jürg Bähler for making the viva an outstanding experience. I will forever remember with the utmost joy our in-depth and intellectually stimulating discussion. I deeply appreciate your brilliant comments and expert suggestions, which have further broadened my understanding of the biological world. I am very thankful to Dr Jyoti Choudhary and Dr JamesWright for their collaboration. I wouldn’t have ever delved into the proteomics world if it wasn’t for their guidance and discussions. I also would like to thank my thesis advisory committee, Dr Sarah Teichman, Dr Gos Micklem and Dr Wolfgang Huber, for their expert feedback and discussion. Many thanks to Lynn French, Lorraine McAlister, Anna Alasalmi and Clare Impey for helping with many administrative tasks and paperwork involved in my doctorate. I can’t thank my teammates enough for all the welcoming, exciting and fun environment. Thank you for all your help and the lunch conversations about anything from top-notch science to absurd elephant jokes. Thanks to Mar, Nuno, Johan, Wanseon, Sérgio, Claudia, Aida (Fatemeh) for their discussions and close friendship. Special thanks to Nuno and Claudia, who provided me with the support I needed at crucial times. Nuno, thank you for your invaluable work and help with processing the data. You are like a bigger brother with whom I could discuss anything, from the inner hell circle of R to SciFi, in the most in-depth tiny details. Obrigada Nuno for all the years we share at the EBI. I am most grateful to the EBI community and particularly to the EBI predocs, who are amazing people. Thanks to our morning coffee discussions and debates, I learnt so much about bioinformatic fields that were not directly involved in my work. I am truly fortunate to have met all of you. Furthermore, I want to thank more specifically the fantastic people who have taken time to read and give me feedback on this thesis. Mar, Nuno, Juan, Myrto, Aida, Sarah, Julian, Konrad, Nils, Hannah, Atoussa, Isabelle and Steve, I thank all of you for your invaluable inputs and corrections. Steve, thank you also so much for all the discussions about the fine points in British culture and language. vii Thank you Isabelle for our hour-long conversations, which, for the most part, consisted of me figuring out the best ways to explain my doctorate works to a layperson. Thank you, Philippe, for your unfailing support and invaluable assistance, particularly regarding the logistics. Finally, I want to thank my sister, Atoussa, and mymother, Faranak, to always be by my side, despite any possible physical distance. Thank you both for helping me to grow and develop, if not formalise, my ideas. Thank you for sharing with me your own scientific and life interests, which, beyond being exciting, sustain my joy and creativity. viii CONTEN TS Summary v Preface vii List of Terms and Abbreviations xiii List of Figures xxii List of Tables xxiii Introduction 1 1 biological and technological context of this thesis 3 1.1 Universality and diversity of Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Transcriptome exploration with RNA sequencing . . . . . . . . . . . . . . . . . . . 7 1.2.1 Library preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1.1 RNA extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1.2 RNA enrichment . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1.3 RNA fragmentation . . . . . . . . . . . . . . . . . . . . . . . . 11 1.2.1.4 Double-stranded cDNA synthesis . . . . . . . . . . . . . . . . . 11 1.2.1.5 Adapter ligation, PCR amplification and size selection . . . . . 11 1.2.1.6 An example of alternative preparation strategy . . . . . . . . . 12 1.2.2 Clustering: Hybridisation and Bridge amplification . . . . . . . . . . . . . 13 1.2.2.1 Hybridisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.2.2 Bridge amplification . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.3 Sequencing-by-synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.3.1 Sequencing specificities for the paired-end protocol . . . . . . . 14 1.2.4 From analogous input to digital output . . . . . . . . . . . . . . . . . . . . 15 1.2.5 A typical bioinformatic workflow for RNA-Seq study . . . . . . . . . . 15 1.2.5.1 Quality check, trimming and filtering . . . . . . . . . . . . . . 16 1.2.5.2 Reconstruction strategies . . . . . . . . . . . . . . . . . . . . . 17 1.2.5.3 Quantification of features . . . . . . . . . . . . . . . . . . . . . 20 1.2.5.4 Normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.3 Proteome exploration with Mass spectrometry . . . . . . . . . . . . . . . . . . . . 25 1.3.1 Sample preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.3.1.1 Sample collection and conservation . . . . . . . . . . . . . . . . 27 1.3.1.2 Protein extraction and contaminant removal . . . . . . . . . . 28 1.3.2 Reducing samples’ complexity . . . . . . . . . . . . . . . . . . . . . . . . . 28 1.3.2.1 Denaturation, Reduction and alkylation . . . . . . . . . . . . . 29 1.3.2.2 Depletion of highly abundant proteins . . . . . . . . . . . . . . 29 1.3.2.3 Proteolytic digestion . . . . . . . . . . . . . . . . . . . . . . . . 30 1.3.2.4 Separation methods (fractioning) . . . . . . . . . . . . . . . . . 31 1.3.3 Characterisation through fragmentation profiles . . . . . . . . . . . . . . 33 1.3.3.1 General principle . . . . . . . . . . . . . . . . . . . . . . . . . . 33 1.3.3.2 Ionisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 ix Contents 1.3.3.3 Mass analyser . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 1.3.3.4 Fragmentation techniques [Z. Zhang et al., 2014] . . . . . . . . 34 1.3.3.5 Acquisition modes . . . . . . . . . . . . . . . . . . . . . . . . . 35 1.3.4 Bioinformatic strategies for proteomics studies . . . . . . . . . . . . . . . 36 1.3.4.1 Signal processing and peak-picking . . . . . . . . . . . . . . . . 37 1.3.4.2 Peptide identification and validation . . . . . . . . . . . . . . . 37 1.3.4.3 Protein inference . . . . . . . . . . . . . . . . . . . . . . . . . . 44 1.3.4.4 Protein quantification (label-free) . . . . . . . . . . . . . . . . . 49 1.4 Possible downstream analyses for expression data . . . . . . . . . . . . . . . . . . 51 1.4.1 GO analysis (GOA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 1.5 Reproducibility and Experimental design . . . . . . . . . . . . . . . . . . . . . . . . 52 1.5.1 Batch effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 1.5.2 Technical replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 1.5.3 Biological replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 1.5.4 Study design example: meta-analyses . . . . . . . . . . . . . . . . . . . . . 54 1.6 Discussion and conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2 available high-throughput human datasets 55 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 2.2 Transcriptome RNA-Seq studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.2.1 Castle et al. dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.2.2 Brawand et al. dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.2.3 Illumina Body Map 2.0 (IBM) . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.2.4 Uhlén et al. dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.2.5 GTEx dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.3 Proteome Mass Spectrometry bottom-up studies . . . . . . . . . . . . . . . . . . . 60 2.3.1 Pandey Lab dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 2.3.2 Kuster Lab dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 2.3.3 Cutler Lab dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 2.4 Consistent processing pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 2.4.1 RNA-Seq raw data processing . . . . . . . . . . . . . . . . . . . . . . . . . 63 2.4.1.1 Data retrieval and preparation . . . . . . . . . . . . . . . . . . 63 2.4.1.2 Genome and annotation reference . . . . . . . . . . . . . . . . 65 2.4.1.3 Data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 2.4.2 MS data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 2.4.2.1 Spectral processing . . . . . . . . . . . . . . . . . . . . . . . . . 68 2.4.2.2 Sequence database creation and searching preparation . . . . . 68 2.4.2.3 Spectral identification and database search pipeline . . . . . . . 70 2.4.2.4 Results processing and filtering . . . . . . . . . . . . . . . . . . 71 2.5 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3 about expression, visualisation, correlation and clustering 73 3.1 Visualisation of expression data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.1.1 Distribution plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.1.2 Scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.2 Main statistical approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.2.1 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.2.2 Clustering analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.3 Reducing sources of bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 x Contents 3.3.1 Mitochondria issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.3.2 Protein-coding genes only . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.3.3 Expressed or not expressed . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.3.4 Aggregating tissue expression . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4 integrating gene expression data from undiseased tissues across rna-seq studies 89 4.1 Meta-analyses’ combined datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.1.1 Tissue overlaps across available normal human RNA-Seq studies . . . . 91 4.1.2 Common measured genes for each of the main shared-tissue sets . . . . 92 4.1.3 Combined datasets summary . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.2 Prevalence of biological signal over technical variabilities at tissue-level . . . . . . 94 4.3 Global stability of gene expression profiles across studies . . . . . . . . . . . . . . 99 4.3.1 Genes with tissue-specific (TS) expression . . . . . . . . . . . . . . . . . . 99 4.3.1.1 Use of prior knowledge: TiGER database . . . . . . . . . . . . . 100 4.3.1.2 Fold change method . . . . . . . . . . . . . . . . . . . . . . . . 101 4.3.1.3 Hampel’s test: detection of atypical expression . . . . . . . . . 104 4.3.2 Uhlén categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 4.3.3 Similar expression variability of the genes across studies . . . . . . . . 106 4.3.4 Curated sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5 human ms-based protein expression landscape 117 5.1 An overall fragmented and disparate universe to explore . . . . . . . . . . . . . . . 117 5.1.1 MS proteomic data has high detection variability . . . . . . . . . . . . . . 118 5.1.2 Overall about half of the proteins identified in each study for any given tissue are validated in a different study. . . . . . . . . . . . . . . . . . . . . 119 5.1.3 Technical variability prevails over biological signal: intrastudy correlations of different tissues are globally stronger than same-tissue interstudy correlations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.2 New quantification method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.3 Ubiquitous and TS proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 5.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6 integration of transcriptomic with proteomic data 131 6.1 Data and principal analytical approaches . . . . . . . . . . . . . . . . . . . . . . . . 134 6.1.1 Overlapping set of tissues for the three datasets . . . . . . . . . . . . . . . 135 6.1.2 Matching pairs of mRNAs and proteins . . . . . . . . . . . . . . . . . . . . 135 6.1.3 Tissue-centric and gene-centric approaches . . . . . . . . . . . . . . . . . 140 6.2 Fair correlations between independently sourced proteomics and transcriptomics of human tissues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.2.1 Mixed biological signal between the proteome and transcriptome across the tissues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.2.2 Influence of the expression breadth on the tissue mRNAs/proteins correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.2.3 Tissue-specific mRNAs have significant overlap with tissue-specific proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 xi Contents 6.2.4 Proteins and mRNAs tissue trees present partial concordant results . . . 153 6.3 Wide correlation range for protein/mRNA pairs . . . . . . . . . . . . . . . . . . . . 156 6.3.1 TS protein enrichment for the most correlated pairs . . . . . . . . . . . . . 159 6.3.2 Gene expression profiles clue about biological and technical differences 160 6.3.3 Distinct functional enrichment profiles for pairs with a TS protein, and for the best correlated and most anticorrelated ones . . . . . . . . . . . . . 162 6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 7 concluding remarks 173 APPENDIX 183 a supplementary material for chapter 1 183 a.1 Amino acids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 a.2 Original material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 a.3 EST-sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 a.4 Microarrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 a.5 FASTQ format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 a.6 Phred score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 a.7 Mass analysers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 a.8 Isotopes of common elements and their natural frequency . . . . . . . . . . . . . . 190 a.9 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 a.10 Target Decoy search database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 a.11 PSM validation with q-value and PEP . . . . . . . . . . . . . . . . . . . . . . . . . . 192 a.12 Protein inference: computational challenging step . . . . . . . . . . . . . . . . . . 193 a.13 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 b supplementary material for chapter 3 197 b.1 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 b.2 Other common mathematical definitions . . . . . . . . . . . . . . . . . . . . . . . . 199 b.3 Data visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 c supplementary material for chapter 4 203 c.1 Highest expressed genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 c.2 Most Variable genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 c.3 Tissue specific (TS) genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 c.4 List of publications based on RNA-Seq and covering at least partially its robustness 225 d supplementary material for chapter 5 229 e supplementary material for chapter 6 243 e.1 Hypergeometric test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 e.2 TS protein percent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256 f list of r packages 257 g list of publications 259 references 261 xii TERMS AND ABBREV IAT IONS Notation Description 2D-DIGE 2D-differential in-gel electrophoresis 26 aa amino acid xxiii, 5, 25, 31, 39–41, 184, 192 ADP adenosine diphosphate 38 Ampholyte molecule which can both gives or accepts a proton (H+). 32 ArrayExpress EBI archive of Functional Genomics Data. It stores data from high-throughput functional genomics experiments, and provides these data for reuse to the research community. 7, 10, 56, 58–60, 63, 128, 169 AUC area under the curve 27 BAM BGZF-compressed binary file that can be converted into SAM format SAM 17, 22 Bioconductor Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. 24, 162 Bottom-up approach In proteomics, bottom-up approaches (in contrast to top- down approaches) involve a step of proteolytic digestion prior to the Mass Spectrometry analysis. 28 bp base pair; unit of length for double-stranded nucleic acids 58 CAGE cap analysis gene expression 111 cDNA complementary DNA 8, 11–13, 24, 58, 185 CDS coding DNA sequence 68, 71 CID collision-induced dissociation 30, 34, 38, 61, 62, 70, 188 Computer cluster set of connected computers that work together to improve performance over a single computer 65 cv coefficient of variation 106, 108 Da Dalton is a unified atomic mass unit. It may be also annotated as u 25, 31, 40, 70, 71 dbGaP The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans. 60, 63 DC direct current 188 DDA data-dependant acquisition 25–27, 30, 35 DEA differential expression analysis 24, 51, 52 DGEA differential gene expression analysis 22, 113 DIA data-independant acquisition 25, 35, 37, 49 DNA deoxyribonucleic acid xiii, xiv, xvi, 3, 5–8, 13, 14, 27, 28, 54, 58, 63, 178, 190 xiii List of Terms and Abbreviations Notation Description dNTP deoxynucleoside triphosphate. It can be any nucleotides (usually either adenosine (A), cytosine (C), guanine (G) or thymine (T)) 14 EBI European bioinformatics institute 58, 62, 63, 65, 67, 72, 91, 101, 112, 114, 127, 128, 167, 175 ECD electron capture dissociation 39 EM expectation-maximisation 21, 47 ENA The European Nucleotide Archive provides a comprehensive record of the world’s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. 7, 56, 58, 63 ENCODE The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the national human genome research institute 70 Ensembl database that is the joint project between EMBL-EBI and the Wellcome Trust Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes. 10, 65, 70–72, 100, 124, 125 eQTL expression quantitative trait locus 59 ESI electron spray ionisation 33, 189 EST expressed sequence tag 7, 100, 185 ETD electron transfer dissociation 35, 39 EThcD electron-transfer and higher-energy collision dissociation 38 FANTOM5 FANTOM5 is a consortium that systematically investigated the genes expressed in all cell types the human body and the genomic regions that contains the transcription starting site. 111 FASP filter-aided sample preparation 31 FASTQ text-based file format. For each cluster read, it records a unique identifier, a nucleotide sequence and the call accuracy for each base (Phred score). Optionally, there can be more information, e.g. the spatial position of the cluster on the flow cell. 15, 63 FDR false discovery rate 43, 44, 47, 48, 50, 51, 63, 69, 71, 85, 192, 193 FF fresh-frozen 27 FFPE formalin-fixed paraffin-embedded 8, 27 Flow cell support of Illumina sequencing. It enables the parallelisation of the sequencing of millions of DNA fragments together which are kept spatially separated in clusters. It is a glass slide with lanes and each lane is coated with short nucleotide sequences that are used to hybridise by complementarity adapters on the DNA that will be sequenced. 11, 15, 23 xiv List of Terms and Abbreviations Notation Description FPKM fragments per kilobase of a feature (i.e. transcript in most cases) per million mapped fragments xix, xxi, xxiii, 22–24, 64, 68, 74, 77, 85, 92, 93, 95–97, 99, 101, 107, 108, 110, 112– 114, 143, 144, 146, 147, 149, 154–156, 166, 178, 188, 205, 206, 208, 209, 213, 214, 223, 226 FT Fourier transform 34, 189 FTMS Fourier transform mass spectrometer 61, 189 Fusion gene gene which is the fusion product of the parts of two (or more) different genes. 14 GENCODE project that produces high quality reference gene annotation for human and mouse genomes. 68, 70, 71 GEO gene expression omnibus 58 GFF general feature format. Tab-delimited file format that records gene information or other features as DNA, RNA or protein sequences. xv, 22 GO gene ontology xiv, 52, 162–164, 167, 169–171, 174 GOA GO enrichment analysis 52, 162, 163 GRCh The Genome reference consortium human genome build. It is always followed by a version number. 10, 65, 66, 68, 70, 72, 100, 110–112, 114, 177 GSEA gene set enrichment analysis 51, 52 GTEx The Genotype-Tissue Expression project establishes a resource database and associated tissue bank for the scientific community to study the relationship between genetic variation and gene expression in human tissues. 57, 59, 62, 63, 65, 74, 78, 86, 87, 91, 92, 94–96, 98, 105, 108, 110, 112, 113, 131, 132, 134, 136, 137, 144, 154–157, 159, 160, 165, 168, 176, 177, 199, 211, 214, 218, 222, 256 GTF gene transfer format. Tab-delimeted file format based on the format and hold information about gene structure. A main feature of this file is that it can be validatable which increases the data reliability. 22 HBB hemoglobin subunit beta 143 HCD higher-energy collisional dissociation 34, 38, 61, 70 HLA human leukocyte antigen 70 HPA The human protein atlas is a Swedish-based programme aiming to map all the human proteins in cells, tissues and organs using integration of various omics technologies 59 HPLC high-performance liquid chromatography 32, 33 IBAQ intensity based absolute quantification 50, 123 IBM Illumina body map 2.0 xix, 63, 68, 70, 74, 86, 87, 94, 96, 109, 112, 113, 211, 218 ICAT isotope-coded affinity tag 26 ICR ion cyclotron 34 ID identification number 56, 58–62, 71, 128, 169 iTRAQ isobaric tag for relative and absolute quantification 26, 34 xv List of Terms and Abbreviations Notation Description laser light amplification by stimulated emission of radiation xv, 33 LC liquid chromatography xv, xvi, 30–32, 35 LC-MS liquid chromatography (LC) followed by MS 32 LC-MS/MS liquid chromatography (LC) followed by tandem MS 27, 28, 30, 31, 37, 41, 61, 71 LDS lithium dodecyl sulfate xv LDS-PAGE lithium dodecyl sulfate-polyacrylamide gel electrophoresis 32, 61 LECA last eukaryotic common ancestor 81 LIT linear ion trap 34, 188 lncRNA long non-coding RNA 10, 70 LTQ linear trap quadrupole 34, 62, 188, 189 MAD median absolute deviation 104, 224 MALDI matrix-assisted laser desorption ionisation 33, 34 miRNA microRNA 12 miRNA-Seq miRNA sequencing, which can also be called miRNA shotgun sequencing 11 MRM multiple reaction monitoring 25 mRNA messenger RNA xv–xvii, xxii, 3, 5, 6, 10, 54, 58, 77, 79, 81, 83, 85, 86, 94, 112, 123, 126, 127, 131, 133–137, 139–141, 143, 144, 146–158, 160–171, 173–176, 178, 179, 185, 194, 223, 243, 251, 256 MS mass spectrometry xv, 2, 25, 27, 28, 30–37, 43, 54, 55, 60, 68, 69, 72, 81, 85, 117, 118, 121, 128, 129, 131, 134, 136, 137, 161, 165, 170, 173, 174, 176, 194 MS/MS tandem MS 25–27, 33–43, 50, 61, 189, 193, 194 mt-mRNA mitochondrial mRNA 83 ncRNA non-coding RNA 10, 56, 164, 175 NHGRI national human genome research institute xiv NIH national institutes of health (USA) 59 NP nondeterministic polynomial time 45, 193 nt nucleotid; common unit of length for single-stranded nucleic acids 14, 56, 83 OR olfactory receptor 123, 127, 128 ORA over-representation analysis 51 PAGE polyacrylamide gel electrophoresis xv, xvii, 32 PCAWG pan-cancer analysis of whole genomes 112, 113 PCR polymerase chain reaction 7, 12, 23, 56 PEP posterior error probability 43, 44, 47, 69, 71, 192, 193 Perl The Perl language is an open-source interpreted programming language developed by Larry Wall and first released in 1987 to easily handle textual information. 65 pH potentiel Hydrogen 32 xvi List of Terms and Abbreviations Notation Description phenotype set of observable characteristics or traits of an individual. The traits can be inherited (genotype), due to the environment, or from the interaction of the environment with the genotype. xiii, 5, 7, 54 Phred A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. xiv, 15, 63, 186 PPKM PSMs per kilobase of gene per million v, 125, 126, 128, 137, 141–144, 147, 151, 152, 154–157, 160, 162, 165, 166, 169, 171, 174 ppm part per million 70 PRIDE The proteomics identifications (PRIDE) database is a centralized, standards compliant, public data repository for proteomics data, including protein and peptide identifications, post-translational modifications and supporting spectral evidence. PRIDE is a core member in the ProteomeXchange (PX) consortium, which provides a single point for submitting mass spectrometry based proteomics data to public-domain repositories. Datasets are submitted to PRIDE via ProteomeXchange and are handled by expert biocurators. 61, 68, 71, 169 PRM parallel reaction monitoring 26 ProteomicsDB ProteomicsDB is a joint effort of the Technische Universität München (TUM) and SAP SE. It is dedicated to expedite the identification of the human proteome and its use across the scientific community. 61, 62, 68 PSM peptide spectrummatch xvi, 42–44, 46–49, 71, 123–126, 137, 192–194 PTM post-translational modification 5, 39, 40 PTR protein-to-mRNA ratio 169 Python The Python language is a high-level and interpreted programming language developed by Guido van Rossum and first released in 1991. The main purpose of this language is to be multivalent while enhancing code readibility. 22 RefSeq The reference sequence database is an open access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products. 100 RF radio frequency 188, 189 RNA ribonucleic acid ix, xiv–xvii, 1, 3, 5–8, 10–12, 20, 22–24, 27, 28, 54–56, 59, 67, 77, 81, 83, 85, 101, 127, 135, 139, 146, 170, 173, 178, 185, 190 RNA-Seq RNA sequencing, which can also be called whole transcriptome shotgun sequencing ix, xix, xxiii, 1, 2, 7–9, 15, 16, 18, 20, 23, 27, 30, 36, 50, 55–60, 62–67, 70, 72, 78, 79, 81, 83, 85, 89, 91, 93–95, 99, 109–114, 124, 125, 129, 131, 132, 134, 136, 137, 167, 173, 174, 178, 199 xvii List of Terms and Abbreviations Notation Description RPKM reads per kilobase of a feature (i.e. transcript in most cases) per million mapped reads 23, 85, 125 RPLC reversed-phase LC 32, 61 rPTR relative protein-to-mRNA ratio 169 rRNA ribosomal RNA 3, 5, 10 RT-qPCR Reverse-transcription quantitative real-time polymerase chain reaction is a molecular biology technique to quantify the amount of ribonucleic acid in a given cell or sample. Cycles of monitored replications are used to robustly measure the gene expression. It is often considered to be the most powerful and sensitive of the quantitative assay for ribonucleic acid. However, this method requires to know in advance which are the genes of interest. 22 SAM sequence alignment/map. Text based file format. xiii, 17, 22 SDS sodium dodecyl sulfate xvii, 32 SDS-PAGE sodium dodecyl sulfate-polyacrylamide gel electrophoresis 32, 61 SILAC stable isotope labeling by amino acids in cell culture 26 SNP single nucleotide polymorphism 20, 59 SNR signal-to-noise ratio 37 SRM selected reaction monitoring 25, 26 SVM support vector machine 44, 48 TB terabyte 36 TCGA the cancer genome atlas 112 TDA target decoy search approach 43, 51, 192 TiGER tissue-specific gene expression and regulation database created by the Bioinformatics lab atWilmer Institute, Johns Hopkins University 100, 101, 174 TMM trimmed mean of M values (M: log expression ratios) 111 TMT™ tandem mass tags 26 TOF time-of-flight 34 TREP tissue reference expression profile 87, 92, 94–99, 105, 114, 121, 136, 141, 142, 146, 147, 213, 219, 222, 236, 237 tRNA transfer RNA 3, 5 tryptic adjective qualifying peptides that have been generated by trypsin digestion. 31 TS tissue-specific 99–101, 103–105, 110, 111, 113, 114, 121, 128, 146, 149–152, 157, 159, 160, 162, 164, 166, 167, 169–171, 174, 175, 179, 256 TSS transcription starting site xiv UniProt The Universal Protein Resource provides the scientific community with a comprehensive high-quality and freely accessible resource for protein sequence and functional annotation data. 68, 71 UPLC ultra performance liquid chromatography 32, 33 XIC extracted-ion current 27, 50 xviii L I ST OF F IGURES Figure 1.1 Transcription and Translation: an overview . . . . . . . . . . . . . . . . 4 Figure 1.2 DNA and RNA structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Figure 1.3 Central dogma of molecular biology proposed by F. Crick . . . . . . . . . 7 Figure 1.4 Overview of a RNA-Seq workflow: library preparation and sequencing . 9 Figure 1.5 de novo Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Figure 1.6 Overview of main alignment strategies for RNA-Seq transcriptome . . . 18 Figure 1.7 Abundance estimation of isoforms by Cufflinks . . . . . . . . . . . . . . 21 Figure 1.8 Abundance estimation of genes by HTSeq-count . . . . . . . . . . . . . . 22 Figure 1.9 Bottom-up quantification approaches . . . . . . . . . . . . . . . . . . . . 26 Figure 1.10 Overview of proteomic data generation . . . . . . . . . . . . . . . . . . . 29 Figure 1.11 Roepstorff–Fohlman–Biemann nomenclature . . . . . . . . . . . . . . . 39 Figure 1.12 MS/MS spectrum and peptide identification. . . . . . . . . . . . . . . . . 40 Figure 1.13 Methods for assigning statistical significance . . . . . . . . . . . . . . . . 44 Figure 1.14 Protein inference: the bipartite graph . . . . . . . . . . . . . . . . . . . . 46 Figure 2.1 General steps for processing the transcriptomic data . . . . . . . . . . . 64 Figure 2.2 Gene length is equal to the sum of the lengths of all its collapsed exons . 67 Figure 2.3 General steps for processing the proteome data . . . . . . . . . . . . . . 69 Figure 3.1 Untransformed (left) and Log2-transformed (null values removed) (right) profile of expression levels (FPKM, protein-coding genes only and all null values excluded) for the IBM dataset . . . . . . . . . . . . . . . . . . 74 Figure 3.2 Profile of expression levels across the transcriptomic (protein-coding genes only) studies (null values removed) . . . . . . . . . . . . . . . . . 75 Figure 3.3 Profile of expression levels across the proteomic studies (null values removed) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 3.4 Examples of scatter plot for replicates from Uhlén (transcriptome) . . . 77 Figure 3.5 Correlation coefficients between RNA-Seq replicates . . . . . . . . . . . 78 Figure 3.6 Comparison of two clustering methods on a subset of the Uhlén study . 80 Figure 3.7 Clustering of the biological samples of Uhlén dataset based on the Pearson correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Figure 3.8 Expressed or not: several cases illustrated . . . . . . . . . . . . . . . . . 84 Figure 4.1 Distribution of unique and shared tissues between the transcriptomic studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Figure 4.2 Unique and shared protein-coding genes expressed (≥1 FPKM) across RNA-Seq studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Figure 4.3 Heatmap of the 4 common tissues across the 5 studies . . . . . . . . . . . 96 Figure 4.4 Heatmap of 23 common tissues between Uhlén and GTEx studies . . . . 97 Figure 4.5 Distribution of the correlations of same tissue pairs for the 4 and 23 tissues combined datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . 98 Figure 4.6 Expression heatmap of the four tissues across the five datasets based on TiGER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 xix List of Figures Figure 4.7 Overview for the comparison of the genes across the five studies based on a ranked descriptor 5 studies . . . . . . . . . . . . . . . . . . . . . . . 102 Figure 4.8 Intersection size curve of𝒲1 genes based on their FC ratio rank in each tissue across the five studies . . . . . . . . . . . . . . . . . . . . . . . . . 103 Figure 4.9 Intersection size curve of𝒲2 genes based on their FC ratio rank in each tissue across the two studies . . . . . . . . . . . . . . . . . . . . . . . . . 104 Figure 4.10 Expression of the genes picked with Hampel method . . . . . . . . . . . 105 Figure 4.11 Coefficients of variation across the 5 studies for the set of common expressed genes and tissues . . . . . . . . . . . . . . . . . . . . . . . . . 108 Figure 4.12 Coefficients of variation across 𝒲2 2 studies and their set of common genes across the 23 tissues . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Figure 4.13 Example of EBI gene expression atlas gene centric heatmap . . . . . . . 115 Figure 5.1 Distribution of unique shared tissues between the 3MS-based proteomic studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 Figure 5.2 Proteins overlap between the common tissues of the 3 proteomic studies 118 Figure 5.3 Identified proteins across the 4 shared tissues for the 3 datasets . . . . . 119 Figure 5.4 Distribution of the proteins per tissue . . . . . . . . . . . . . . . . . . . . 120 Figure 5.5 Heatmap of the four common tissues between the three proteome datasets 122 Figure 5.6 Two quantification methods for Pandey Lab data . . . . . . . . . . . . . 124 Figure 5.7 Pandey Lab data protein expression distribution with two quantification methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Figure 5.8 Comparison of the impact of the quantification method on the protein distribution per tissue for Pandey Lab data . . . . . . . . . . . . . . . . . 127 Figure 6.1 Number of shared and unique tissues between the proteomic dataset from Pandey Lab and the transcriptomic datasets (Uhlén et al. and Gtex) 135 Figure 6.2 Distribution of the unique and shared proteins/mRNAs for the three datasets across twelve tissues . . . . . . . . . . . . . . . . . . . . . . . . 136 Figure 6.3 Distribution of the unique and shared proteins/mRNAs for Pandey Lab and Uhlén et al. across fifteen tissues. . . . . . . . . . . . . . . . . . . . . 137 Figure 6.4 Distribution of the unique and shared proteins/mRNAs across the three datasets and twelve tissues (new protein quantification method) . . . . . 138 Figure 6.5 Distribution of the unique and shared proteins/mRNAs across fifteen tissues between Pandey Lab (new quantification method) and Uhlén et al. data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Figure 6.6 Overview of different datasets combination . . . . . . . . . . . . . . . . . 139 Figure 6.7 Summary of the expression comparison approaches between the transcriptome and proteome . . . . . . . . . . . . . . . . . . . . . . . . . 140 Figure 6.8 Distribution of Pearson and Spearman correlation coefficients for same- tissue proteomic and transcriptomic pairs versus random tissue pairs . . 142 Figure 6.9 Scatterplot of protein (Pandey Lab data — PPKM quantification) and mRNA (Uhlén et al.) expression for Kidney . . . . . . . . . . . . . . . . . 143 Figure 6.10 Heatmap based on the Pearson correlation between protein and mRNAs expression (alphabetically ordered tissue) . . . . . . . . . . . . . . . . . . 145 Figure 6.11 Expression breadth of the proteins and mRNAs . . . . . . . . . . . . . . 147 Figure 6.12 Unique proteins or mRNAs fractions across tissues . . . . . . . . . . . . 148 Figure 6.13 Comparison of proteins expression breadth to corresponding mRNA breadth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 Figure 6.14 Determination process of the specific mRNAs . . . . . . . . . . . . . . . 150 Figure 6.15 Example of overlap of TS proteins and TS mRNAs for Heart . . . . . . . 151 xx List of Figures Figure 6.16 Heatmap of Jaccard indices across 15 tissues . . . . . . . . . . . . . . . . 152 Figure 6.17 p-values associated with the Jaccard indices . . . . . . . . . . . . . . . . 153 Figure 6.18 Tissues hierachical clustering for Pandey Lab and Uhlén et al. data . . . 154 Figure 6.19 Tissues hierachical clustering for Pandey Lab and Uhlén et al. data . . . 155 Figure 6.20 Pearson correlation coefficients of gene expression levels between studies in descending order . . . . . . . . . . . . . . . . . . . . . . . . . . 157 Figure 6.21 Different cases of correlation for protein/mRNA pairs . . . . . . . . . . . 158 Figure 6.22 TS proteins percentage as a function of the considered number of genes (ranked by Pearson correlation) . . . . . . . . . . . . . . . . . . . . . . . 159 Figure 6.23 Possible mRNA/protein expression profiles due to biological reasons. . . 161 Figure 6.24 Enriched GO categories for the genes with a TS protein and the three hundred with the highest correlations and anticorrelations . . . . . . . . 163 Figure 7.1 Application preview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 Figure A.1 Amino acids formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Figure A.2 FASTQ format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 Figure A.3 The available Phred score quality score encoding formats . . . . . . . . . 187 Figure A.4 Overlap resolution effects for each HTSeq-count mode . . . . . . . . . . 187 Figure A.5 Peptide assembly models classification . . . . . . . . . . . . . . . . . . . 195 Figure B.1 Anscombe quartet — why data should always be visualy checked . . . . 200 Figure B.2 Profile of expression across the transcriptome (protein coding genes only) and proteome datasets . . . . . . . . . . . . . . . . . . . . . . . . . 201 Figure C.1 Number of protein-coding genes expressed per tissue . . . . . . . . . . . 203 Figure C.2 Breadth of expression of the protein-coding genes expressed above 1 FPKM 204 Figure C.3 Unique and shared protein coding genes expressed in the common tissues 205 Figure C.4 Comparison of profiles across the 5 studies for their 4 common tissues — including the 37 mitochondrial genes included . . . . . . . . . . . . . 207 Figure C.5 Heatmap including all the replicates of the 4 common tissues across the 5 studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Figure C.6 Heatmap including all the replicates of the 23 common tissues between Uhlén and GTEx studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 Figure C.7 Distribution of the correlation of matched and unmatched tissues pairs across the two working sets. . . . . . . . . . . . . . . . . . . . . . . . . . 210 Figure C.8 Pearson correlation coefficient trend based on the expression levels of the genes considered for each of the 23 common tissues . . . . . . . . . . 211 Figure C.9 Pearson correlation coefficient trends based on the expression levels of the genes considered for𝒲1 . . . . . . . . . . . . . . . . . . . . . . . . . 212 Figure C.10 Example on how correlation may change to cut-offs . . . . . . . . . . . . 213 Figure C.11 Cumulative shared set of genes ranked by expression across the 5 studies 214 Figure C.12 Cumulative shared set of genes, sorted by their expression, between Uhlen and GTEx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 Figure C.13 Overlap of the most variables genes across the 5 studies for the set of four common tissues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 Figure C.14 Mean expression of genes compared to their coefficient of variation . . . 217 Figure C.15 Clustering of the four common tissues across the five studies for the most common variable genes . . . . . . . . . . . . . . . . . . . . . . . . . 218 Figure C.16 Clustering of the the four common tissues across the five studies (excluding the most common variable genes) . . . . . . . . . . . . . . . . 219 Figure C.17 Expression of the most common variable genes . . . . . . . . . . . . . . 220 Figure C.18 Maximum of expression / Sum of expression for the most variable genes 220 xxi List of Figures Figure C.19 Intersection size of𝒲1 genes (ranked by cv) . . . . . . . . . . . . . . . . 221 Figure C.20 Intersection size of𝒲2 genes (ranked by cv) . . . . . . . . . . . . . . . . 222 Figure C.21 Breadth of expression (≥1 FPKM) for the most variable mRNAs . . . . . . 223 Figure C.22 Most specific genes highlighted in EBI gene expression atlas . . . . . . . 227 Figure D.1 Unique and shared proteins across the proteomic studies . . . . . . . . . 229 Figure D.2 Proteins overlap between the common tissues of Pandey and Kuster data 229 Figure D.3 Unique and shared proteins across the other 10 common tissues between Pandey and Kuster proteomic studies . . . . . . . . . . . . . . . . . . . . 230 Figure D.4 Heatmap of the 4 common tissues between the three proteome datasets (Pearson correlation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Figure D.5 Heart: Cutler vs Pandey . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 Figure D.6 Heart: Kuster vs Pandey . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Figure D.7 Heatmap of the 14 common tissues between Pandey and Kuster datasets (Spearman correlation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 Figure D.8 Heatmap of the 14 common tissues between Pandey and Kuster datasets (Pearson correlation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 Figure D.9 Placenta: Kuster vs Pandey . . . . . . . . . . . . . . . . . . . . . . . . . . 238 Figure D.10 Kuster: Pancreas vs Adrenal . . . . . . . . . . . . . . . . . . . . . . . . . 239 Figure D.11 Kuster Pancreas vs Pandey Adrenal . . . . . . . . . . . . . . . . . . . . . 240 Figure D.12 Heatmap of the 14 common tissues between Pandey and Kuster (Spearman correlation — PPKM) . . . . . . . . . . . . . . . . . . . . . . . 241 Figure D.13 Identified proteins across the 14 shared tissues for 2 datasets (PPKM) . . 242 Figure D.14 Identified proteins across the 14 shared tissues for 2 datasets (first method) 242 Figure E.1 Scatterplot of protein (Pandey et al. — Top3 quantification) and messenger RNA (mRNA) (Uhlén et al.) expression for Kidney . . . . . . 243 Figure E.2 Overview of the tissue scatterplots between Uhlén and Pandey data . . . 244 Figure E.3 STAU2 definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 Figure E.4 Distribution of Pearson and Spearman correlation coefficients for same- tissue proteomic and transcriptomic pairs versus random tissue pairs (untransformed data) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252 Figure E.5 Rank comparison chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 xxii L I ST OF TABLES Table 2.1 General description of the 5 transcriptomic datasets (RNA-Seq) used for this study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Table 2.2 Technical description of the 5 transcriptomic datasets . . . . . . . . . . . 65 Table 4.1 Gene classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Table 4.2 Uhlén et al. gene categories . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Table A.1 Molecular weight of the most common aas and their residues (from Lide, 2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 Table A.2 Phred quality score to accuracy significance . . . . . . . . . . . . . . . . 186 Table A.3 FPKM are unsuitable for differential expression analysis . . . . . . . . . 188 Table A.4 Most common elements and their stable isotopes . . . . . . . . . . . . . 190 Table B.1 Correlation coefficients between RNA-Seq replicates . . . . . . . . . . . 198 Table C.1 Expressed protein-coding genes . . . . . . . . . . . . . . . . . . . . . . . 206 Table C.2 Example of gene subsets for a two studies (A and B) for a tissue . . . . . 213 Table C.3 Uhlén et al. gene categories for all genes (i.e. unrestricted to protein- coding genes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 Table D.1 Proteins found in every tissue in all three datasets . . . . . . . . . . . . . 230 Table D.2 Proteins found in every tissue in Pandey and Kuster datasets . . . . . . . 231 Table D.3 Tissue specific proteins found both in Pandey et al. and Kuster et al. datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 Table E.1 Found proteins without a counterpart in the transcriptomic data . . . . . 245 Table E.2 Summary of Pearson and Spearman correlations between proteomics and transcriptomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 xxiii I am not accustomed to saying anything with certainty after only one or two observations. Andreas Vesalius [O’Malley, 1964] IN TRODUCT ION Today, we have comprehensive knowledge about the structure and functioning of the human body at the macroscopic level¹. At the microscopic level, however, the identification and mapping of macromolecules (e.g. RNAs, proteins), their function and whereabouts still need to be refined. Beyond the invaluable addition to our knowledge, other more practical reasons are also sustaining the effort for human expression atlases. Genes with specific behaviour in particular conditions are a convenient starting point for designing new diagnostic tests and discovering new effective drug targets. Besides, a robust atlas for non-diseased tissue baseline expression will allow a better understanding of unperturbed physiology. It can also serve as a reference in studies where controls are unavailable or hard to sample, which is generally the case in cancer research. The completed and annotated human genome and technological advances in high-throughput expression studies have opened the way towards this future new milestone. Evidence of the community shared interest is the recent explosion of high-throughput transcriptomic atlases in the literature. Examples include expression atlases of mouse [C. Wu, Orozco, et al., 2009; Ringwald et al., 2012], pig [Freeman et al., 2012], sheep [Clark et al., 2017], plants as maize [Stelpflug et al., 2016], vigna [S. Yao et al., 2016], pigeon pea [Pazhamala et al., 2017], parasites, e.g. Schistosoma mansoni. Many focus on mapping the human gene expression either as a whole, see e.g. [Krupp et al., 2012; Jiménez-Lozano et al., 2012; Uhlén, Fagerberg, et al., 2015; GTEx Consortium, 2013] or for specific aspects, e.g. the organogenesis in human embryos [Gerrard et al., 2016]. There are large incentives to develop these atlases with transcriptome shotgun sequencing (i.e. RNA-Seq). The technology involved in similar older projects² has demonstrated to generate highly variable data [Rung et al., 2013] that is challenging to integrate (even when produced by the same laboratory on the same platform) [Walsh et al., 2015]. When I started my doctorate, little was known about the interstudy robustness of RNA-Seq. However, it had already shown less background noise and amore extensive dynamic range of detection than the array-based technology (microarray) used in previous expression studies [Z. Wang et al., 2009]. RNA-Seq also has the added advantage to discover new transcripts as it does not rely on previous knowledge [Z. Wang et al., 2009]. 1 Even though human anatomy can still be refined, and new findings can happen [Kumar et al., 2019]. 2 E.g. Gene Expression Atlas for Human Embryogenesis [Yi, Xue, et al., 2010], Atlas of human primary cells [Mabbott et al., 2013], Gene atlas of mouse and human protein-encoding transcriptomes (now hosted by BioGPS) [Su et al., 2004], Allen Brain Atlas [Hawrylycz et al., 2012]. 1 introduction Considering the growing number of studies referring to these atlases³ or the numerous efforts to compile them into new resources for the community — e.g. TISSUES⁴ [Santos et al., 2015], Harmonizome⁵ [Rouillard et al., 2016] and, after reprocessing the raw data, Expression Atlas⁶ [Petryszak, Keays, et al., 2015]— assessing the consistency of the results from one study to another has become paramount. The recently reported reproducibility crisis in science [Begley et al., 2012; Fatovich et al., 2017; Lindner et al., 2018; Lyu et al., 2018] has only further underlined this need. Given the previous context, the first aim of my doctorate was to examine the consistency of the (non-diseased) human tissues landscape of expression in independent large-scale transcriptomics. Then, with the publication of the first drafts of the human proteome [M.-S. Kim et al., 2014; Wilhelm et al., 2014], my aims have expanded to the integration of human tissue expression data across different datasets and biological layers. outlook of this thesis First, I review the biological, chemical, experimental and computational background of my doctorate works in Chapter 1. Then, in Chapter 2, I present the five transcriptomic (RNA-Seq) and the three proteomic (MS) datasets that I have preselected for my analysis. Then I describe the bioinformatic pipelines that have automated the processing of these large-scale datasets. After considering several possible sources of bias and strategies to limit them inChapter 3, I compare and integrate the independent transcriptomic datasets in Chapter 4. Following an assessment of the findings’ consistency in proteomics in Chapter 5, I then employ different approaches for integrating transcriptomics and proteomics in Chapter 6. Finally, I close this thesis with a few remarks in Chapter 7. 3 More than 2,800 papers for the five human primary studies (presented in Section 2.2) on 15 September 2019. 4 TISSUES — https://tissues.jensenlab.org/Search 5 Harmonizome — https://amp.pharm.mssm.edu/Harmonizome 6 Expression Atlas — https://www.ebi.ac.uk/gxa/home 2 It would probably be oversimplifying the matter, but I am strongly tempted to say, ‘All life is nucleic acid; the rest is commentary’. Asimov (1989) 1 B IOLOG ICAL AND TECHNOLOG ICALCONTEXT OF TH I S THES I S The following pages (pp. 3–54) present a summary of facts and techniques that form the biological and technological context of the work presented in this thesis. The different sections may be read on their own or skipped by informed readers without understandability issues. 1.1 universality and diversity of life Every known form of life depends on a common set of molecule types, within which the DNAs, RNAs and proteins are arguably the most specific ones and have the widest variety. The other molecules are either inorganic (water and salts), small, or simple organic ones (sugars, organic or amino acids, nitrogenous bases, lipids or their precursors). [Callen, 2005] The entirety of the DNA (protein-coding and non-coding) of a living organism constitutes its genetic material (or genome). The coding sections of the DNA, i.e. the protein-coding genes, contain the instructions for making (via mRNAs) the proteins, which are the main effectors supporting life functions. When a gene is switched on, it triggers a process, illustrated in Figure 1.1, in which the first step is called transcription and the second one translation. [Callen, 2005; Pierce, 2005] The transcription initiation happens when an RNA polymerase attaches to the start of the gene and uses the DNA strand as a template to create a corresponding RNA from free RNA nucleotides [Alberts et al., 2002]. The transcription is a directional process and always happens from the gene 5’ end towards its 3’ end [Callen, 2005]. (For 5’ and 3’, see Figure 1.2 and related section.) Messenger RNAs (mRNAs) are RNAs that are produced from a gene and used as a template for translation to synthesise proteins. There are other kinds of RNAs, e.g. ribosomal RNAs (rRNAs) (which are the most numerous RNAs in the cell), transfer RNAs (tRNAs) and many others [Callen, 2005]. The entire repertoire of transcripts (i.e. RNA molecules) expressed in a cell or group of cells (such as in a tissue) is called transcriptome [Velculescu et al., 1997; Piétu et al., 1999; Z. Wang et al., 2009]. Unlike the genome which is roughly identical regardless of which cell of a particular individual is considered, the transcriptome may vary, sometimes dramatically, according to the biological context (different organs or tissues, or different conditions, e.g. healthy or in a reaction to a disease) and through time and life stages. [Alberts et al., 2002] 3 biological and technological context of this thesis Figure 1.1. Transcription and Translation: an overview. This work, ‘Transcription and Translation: an overview’, is a derivative work of ‘Simplified diagram of mRNA synthesis and processing. Enzymes not shown.’ and ‘Protein synthesis’ both by Kelvinsong, used under [CC BY] (see Appendix A.2). ‘Transcription and Translation: an overview’ is licensed under [CC BY] by Mitra P. Barzine.4 1.1 universality and diversity of life Before the mRNA is in turn used as a template to make the protein, a few modifications may occur. Among the typical post-transcriptional regulations, there are the capping and the addition of a polyadenylated (polyA) tail (both increasing the half-life of the mRNA), splicing (where internal parts of the mRNA — either non coding (i.e. introns) or coding (i.e. exons) — are removed), or RNA editing [Darnell, 2013]. For eukaryotic species (e.g. Human), the mRNAs have to be exported from the nucleus to the cytoplasm for the next step to happen [Callen, 2005]. The mRNAs initiate the translation, i.e. creation of new proteins, by binding to protein factories called ribosomes (complexes formed from rRNAs and ribosomal proteins). The ribosomes read the mRNA codons (i.e. groups of three consecutive bases) and produce the corresponding protein by synthesising a polypeptide chain from the free amino acids (aas) carried by tRNAs with the corresponding anticodons (i.e. complementary sequence to the mRNA codons). The aas are linked by peptide bonds formed through the reaction of functional groups on their primary chain: the carboxyl group (–COOH) of the first amino acid (aa) with the amine group (NH2–) of the next one. This succession of primary chains and peptide bonds constitutes the protein backbone. This sequence is by convention described from the free amino group of the first aa (i.e. N-terminal) to the free carboxyl group of the last aa (i.e. C-terminal). While the polypeptide chain is completed, it also folds into a three-dimensional structure, which is essential for fulfilling its role [Morris et al., 2016]. Many proteins need to undergo post-translational modifications (PTMs) before being functional. PTMs allow the regulation of the proteins’ activity (activation and deactivation). They can involve proteolytic cleavages or the creation of new covalent bonds, as many proteins comprise more than one polypeptide chain to be functional. Other frequent PTMs include the phosphorylation, acetylation, or glycosylation of the aas (also called residues) [Alberts et al., 2002; Morris et al., 2016]. Besides the variety of possible PTMs and their combination, proteins can comprise twenty different aas (Appendix A.1), which have a vast range of physicochemical properties. Hence, in turn, the proteins have a wide physicochemical range too. [Morris et al., 2016; Callen, 2005]. Although the diversity of the proteome (i.e. entire repertoire of expressed proteins) is the primary contributor to the final phenotype¹ and functions of cells and tissues, its exhaustive study remains particularly challenging. Most proteins are quite stable, but their physicochemical diversity prevents the use of uniform and straightforward protocols that would encompass all the proteins in a given cell or tissue type. [Bruce et al., 2013] On the other hand, all DNA and RNAs have very similar chemical properties, as their structures are very close as shown in Figure 1.2. They are polymeric molecules that are made by a chain of nucleotides. Nucleotides have three distinct chemical subunits: a phosphate group, a pentose (either ribose for RNA or deoxyribose for DNA) and a nitrogenous base, which can be a purine (adenine (A) or guanine (G)) or a pyrimidine 1 Phenotype: set of observable characteristics or traits of an individual. The traits can be inherited (genotype), due to the environment, or from the interaction of the environment with the genotype. 5 biological and technological context of this thesis Figure 1.2. DNA andRNA structures© 2014 Nature Education —Adapted from Pierce (2005). (cytosine (C) and either thymine (T) for DNA or uracil (U) for RNA). The alternation of phosphate groups and the pentoses create the biomolecules backbone, while the information is encoded into the nitrogenous bases sequence. DNA and RNAs all share the same directional reading frame, i.e. 5’ end to 3’ end, or in other words, from the phosphate group that is linked to the pentose carbon annotated 5’ to the phosphate group that is linked to the pentose carbon annotated 3’². [Morris et al., 2016; Alberts et al., 2002; Callen, 2005] The difference in physical properties between DNA and RNA molecules is primarily due to the predominant particular arrangement of DNA (double-stranded) and RNA (single-stranded). The double-stranded configuration of the DNA dramatically improves the stability of the DNA by involving many mechanisms, which includes many hydrogen bonds (three for each couple of G/C and two for each A/T). Besides, in eukaryotic cells, while the genome (DNA) is protected in organelles such as the nucleus, most of the mRNAs life is spent in the cytoplasm, which contains many enzymes (e.g. endonucleases and exonucleases) that cleave the nucleotide sequences and can ultimately degrade the mRNAs. Thus, even though mRNA half-lives are quite variable, mRNAs have generally the lowest stability when compared to the DNAs and proteins. [Alberts et al., 2002; Pierce, 2005] 2 Pentoses are monosaccharides containing a chain of five carbons. When the pentose is part of a nucleotide, these carbons are annotated from 1’ to 5’ to avoid confusion with the carbons of nitrogenous cycles. 6 1.2 transcriptome exploration with rna sequencing replication translationtranscription DNA RNA Protein Proven Special UnestablishedTransfer type: Figure 1.3. The central dogma of molecular biology proposed by F. Crick. The proven modes of information transfers are for the general ones in solid lines and in dashed lines for the special transfers (RNA to RNA or DNA) and the transfers that have yet to be established (DNA to protein). This overall process, which allows the creation of the proteins and based on a flow of information initiated from the DNA, had been predicted by Francis Crick [Crick, 1958; Crick, 1970]. He stated what is now known as the core of the central dogma of molecular biology (shown in Figure 1.3): ‘Once information has got into a protein it can’t get out again’. In other words, the genome contains all the information needed to produce functional proteins, and, in theory, if we reach a total understanding of the information encoded into the DNA, we will be able to predict the phenotype due to the proteome. As DNA is static while the coding portion (about 2%) of the human genome [Venter et al., 2001] varies in expression (both in concentration and composition) depending on the tissue or cell type, genome studies are more established than transcriptomic or proteomic ones, but the latter ones are more phenotypically insightful. 1.2 transcriptome exploration with rna sequencing With the recent era of short-sequencing technology and the completion of the Human genome [Venter et al., 2001; Lander et al., 2001; International Human Genome Sequencing Consortium, 2004], understanding the genome expression is increasingly a more reachable aim. From the early 1980s, technologies involved in transcriptome studies have substantially improved through many successive innovations [Lowe et al., 2017; Parkinson et al., 2009] that include Sanger sequencing [Sanger et al., 1975] or PCR (reviewed by VanGuilder et al. (2008)). Among the various transcriptomic study approaches, there are three key methods [Lowe et al., 2017]: EST sequencing (for gene discovery — see Appendix A.3), microarrays (for gene quantification — see Appendix A.4) and the one with which all the transcriptomic data used in this thesis have been generated: RNA-Seq (which is both used for gene discovery and gene quantification). In the following section, I introduce the typical steps of the required workflow to study the transcriptome through sequencing on an Illumina platform. While not by conscious design, all the transcriptomes analysed in this thesis are the product of Illumina sequencing (see Sections 2.2.1 to 2.2.5). This is unsurprising as Illumina is by far the most popular platform for the last decade [McPherson, 2014] and has been used to generate most of the data in ENA and ArrayExpress. Indeed, Illumina sequencing offers a very good balance 7 biological and technological context of this thesis between accuracy and achieving the highest throughput for the lowest per-base cost [van Dijk et al., 2014]. I will emphasise the approaches and the tools I used to estimate the gene expression levels from raw nucleotide sequences. Figure 1.4 presents an overview of the typical steps of an RNA-Seq workflow from the libraries preparation to the sequencing. Experimental protocols for other platforms may need various and specific modifications that are outside of the realm of this thesis and thus will not be covered here³. Although the collection and the conservation⁴ of the samples before the RNA extraction most definitely affects the final estimations, I will set aside these steps from my review. 1.2.1 Library preparation While there are sequencing technologies that can directly sequence RNAs (see Garalde et al., 2016), most of the technologies handle only DNA. Hence, the first step of a typical RNA-Seq workflow is the preparation of complementary DNA (cDNA) libraries from the starting material. This step and the sequencing itself are the most platform dependent parts of the overall protocol. Indeed, contingent upon which sequencing principle they rely, the sequencers need the libraries to be fixed and loaded differently. 1.2.1.1 RNA extraction There are many methods to extract RNAs from the primary samples, and they are commonly standardised. Indeed, depending on the type of biological samples, the RNAs of interest, the aim of the study and the sequencing platform used, there is one (or more) available commercial kits. These are designed in a way to not interfere with any of the later steps of the library preparation or with the sequencing itself. Unsurprisingly, the choice of one kit (and hence its method of extraction) over another can impact the final RNA-Seq data. The main difference between the most widespread methods being the quantity of non-mature RNAs (i.e. with longer intronic regions) detected according to which kit has been used. However, the relative gene expression levels are similar from one extraction protocol to another. [Sultan et al., 2014] 1.2.1.2 RNA enrichment After extracting the RNA from the cells or tissues, the next step is to enrich the content of the samples with the RNAs of interest (i.e. the concentration of the RNAs of interest is increased either by specifically selecting it or by removing other RNAs). Indeed, the 3 More details on the other main sequencing platforms and their relevant protocols may be found in Goodwin et al. (2016) review paper or at the online resource ‘RNA-seqlopedia’ (http://rnaseq.uoregon.edu/) [Cresko Lab, 2017]. 4 E.g. between fresh-frozen samples and FFPE samples see Esteve-Codina et al. (2017). 8 1.2 transcriptome exploration with rna sequencing Figure 1.4. Overview of a typical RNA-Seq workflow: library preparation and sequencing 9 biological and technological context of this thesis rRNAs are the most abundant type of RNA in any cell. Even though they account for a very small part of the genome⁵, they represent by their number 70% or more of the total population of RNA [Davidson et al., 1999]. Although there are interests to study rRNAs (e.g. Pootakham et al., 2017), mRNAs studies are more popular, and they only constitute about 3 to 5% of the whole RNA population [Alberts et al., 2002]. Other studies research even scarcer kinds of RNA.⁶ There are typically three strategies to achieve RNA enrichment: either by polyA-selection, by ribodepletion or (more complex) by targeted amplification. While these approaches are insufficiently specific to select one particular kind of RNA or remove all rRNAs, it eases and improves the downstream analyses. PolyA-selection This strategy essentially targets the mRNAs. It exploits the polyadenylated tail at the 3’ end of the mRNAs⁷ that is added post-transcriptionally. Magnetic beads, supporting short strings of thymine (oligo-dT), capture thesemRNAs efficiently while the others arewashed away [Mortazavi et al., 2008]. This protocol is probably the most widespread one as it is the easiest and cheapest to set up. A dataset produced following this protocol is known as a polyA-selected dataset. Ribodepletion This strategy is preferred for the study of any non-coding RNA (ncRNA) or when researching the interaction of mRNAs with other RNAs [Morlan et al., 2012]. This strategy is in a way the reverse of the previous one as its also uses magnetic beads, but this time to efficiently⁸ target the unwanted rRNAs as to remove them from the sample. The ribodepletion can also be achieved through ribonucleases. These enzymes specifically digest rRNAs, and then RNAs of interest can be retrieved through size selection. Datasets produced following a ribodepletion protocol are usually calledwhole RNA or total RNA in contrast to the polyA-selected ones. Castle et al. (2010) created a total RNA dataset, but they use another approach where they amplify every other RNA with the help of specially designed probes (see Section 2.2.1). Hence, the protocol they have used is closer to the following one. 5 For example, Homo sapiens, there are 568 genes (<1%) that are described as rRNA out of the 63,898 annotated genes of the Ensembl database (GRCh38.p10, Ensembl 89). 6 Out of the 10,081 experiments tagged as ‘rna assay’ and ‘sequencing assay’ within ArrayExpress, 7,981 were also tagged as ‘RNA-seq of coding RNA’, 1,829 as ‘RNA-seq of non coding RNA’ and 366 have both tags. 4 of them are only described as ‘microRNA profiling by high-throughput sequencing’ — Query date: 22 June 2017. 7 And a few other kinds of RNA, e.g. long non-coding RNAs (lncRNAs) [Cheng et al., 2005] 8 ThermoFisher claims that its RiboMinus protocol can remove up to 99.99% of the rRNAs. 10 1.2 transcriptome exploration with rna sequencing Targeted amplification Targeted amplifications rely on primers that would be designed to target (or avoid as for Castle et al. (2010)) specific sequence motifs of the genome. Most studies based on this kind of approach are referred with a name based on the studied RNA type (e.g.miRNA-Seq) or emphasising the variation of the method (e.g. Capture-Seq [Bussotti et al., 2016]). Often, additional steps are required to prepare the libraries in comparison with a polyA-selected or ribodepleted dataset. 1.2.1.3 RNA fragmentation Most sequencing platforms⁹ require relatively short (i.e. 200 to 500 nt) length to sequence. Concomitantly, it also ensures a more uniform sampling along the RNA. This fragmentation can be carried out via divalent cations hydrolysis or nebulisation. This step is performed on occasions after the cDNA synthesis (see next section). In those cases, the cDNAs are fragmented mostly by digestion with DNase I or by sonication. 1.2.1.4 Double-stranded cDNA synthesis The RNA molecules are used as a template for a retro-transcription involving oligo-dTs or random hexamer primers, respectively only for polyA-selected datasets or any dataset (polyA-selected included). The set of random hexamer has been designed to cover the whole transcriptome. Unfortunately, these random hexamer primers have been proven to lack full randomness [Hansen, Brenner, et al., 2010]. At the end of the most common protocol, the order of synthesis of each cDNA strands is lost, i.e. it is impossible to distinguish which of the cDNA strands has the same sequence as the original RNA. Several techniques, called strand-specific, have been developed to compensate for this [Levin et al., 2010; Parkhomchuk et al., 2009]. 1.2.1.5 Adapter ligation, PCR amplification and size selection After generating blunt edges by restriction digest of the cDNAs, adapters (small known sequences of oligonucleotides) are ligated to both their ends. These adapters are constituted from several parts. A subset of them are later ensuring the hybridisation of the cDNAs with the flow cells¹⁰ (based on sequence complementary), and another set of them are sequence binding sites that are used as primers for the following cluster amplification step occurring in situ. These adapters are also used to introduce additional motifs such as indexes. 9 As for the Illumina platforms that have produced the transcriptomic datasets studied in this thesis. 10 Flow cell: see Section 1.2.2. 11 biological and technological context of this thesis The next two steps can be interchanged depending on the amount of starting material at disposition. PCR amplifies all the molecules before (or after) a size-selection is performed (per gel electrophoresis) to extract length-complying fragments (about 200 to 500 bp) to the sequencer machine requirements¹¹. Unfortunately, the size-selection means that any transcript with an original length below the threshold used for the selection will be missed¹². For example, microRNAs (miRNAs) are shorter than the general requirement of Illumina sequencers. Alternative protocols are addressing this issue [Zhuang et al., 2012]. 1.2.1.6 An example of alternative preparation strategy Along with the targeted, the strand-specific and small RNAs protocols, there are a few other variations to this typical protocol to handle other concerns. For example, it is occasionally necessary to sequence simultaneously (in a single run) multiple samples. This can either be motivated by practical reasons (to lower the experimental costs or hasten the overall processing time) [Hou et al., 2015] or be critical to the experimental design as a way to experimentally handle the batch effects¹³ [Auer et al., 2010]. However, it is crucial to later extricate the several pooled samples from each other as a requirement to many downstream analyses. Multiplexed protocols easily achieve the distinction between the multiple samples as they incorporate barcodes before ligating the adapters. These barcodes are also small sequences of nucleotides, and each sample has its unique associated barcode. In practice, each sample is prepared separately with the added extra step (before the adapters ligation) where the barcode is incorporated; then all the samples are pooled together before the next step, which consists of hybridising the cDNAs to the flow cell. Other extra steps occur just after the sequencing and before any other data analysis: all the reads¹⁴ are separated in files based on their barcodes and the barcode, along with the adapters, is trimmed from all the reads. The main inconvenient of the multiplexing protocol is that the original sequenced length of the cDNAs are then shorter as the barcodes are also (and have to be) sequenced as well. 11 Indeed, the previous fragmentation step creates a great length range of fragments. 12 There is no problem for the greater length as they will statistically present fragments in the correct range. 13 Batch effect: see Section 1.5.1 14 Reads: see Section 1.2.3 12 1.2 transcriptome exploration with rna sequencing 1.2.2 Clustering: Hybridisation and Bridge amplification [Illumina, 2016] Once the libraries are ready, they are loaded onto a flow cell¹⁵. The clustering step comprises two phases: hybridisation and bridge amplification of the cDNA fragments. 1.2.2.1 Hybridisation The double-strand cDNAs are denatured, and then each fragment randomly hybridises across the flow cell surface with one of its small oligonucleotides. These are used as primers for polymerases which create a first complementary strand to the hybridised DNA fragments. The new double-strand molecule is denatured, and the original first template is washed away. 1.2.2.2 Bridge amplification The strands then fold over and their (second) adapter hybridises with a complementary oligonucleotide sequence of the flow cell and thus creating a bridge. The flow cell complementary fragment is then used as the primer for a new strand. The new double-stranded DNA is then denatured (which dismantles the bridge). Each of the two tethered molecules creates a new bridge by hybridisation which are the templates for a new strand each. This process happens many times and simultaneously for millions of fragments. It creates clusters of clonal amplification of the original fragments of the library. After the bridge amplification, the reverse strands are cleaved and washed away. The 3’ end primers are also blocked to avoid any unwanted priming. 1.2.3 Sequencing-by-synthesis Illumina sequencers propose two approaches: single-end and paired-end. In single-end sequencing¹⁶, the sequencing begins at one (and only one) of the fragment ends and progresses towards the second. In paired-end sequencing, once the first end has been sequenced, a bridge replication occurs, then the other end of the original fragment is also sequenced. Thus, in paired-end sequencing, the sequencing occurs at both ends of each original fragment. 15 Flow cells are the support of Illumina sequencing. They parallelise through supported Chemistry the sequencing of millions of DNA fragments together which are kept spatially separated in clusters. Each flow cell is a glass slide with lanes. Each lane is coated with two short nucleotide sequences. One of these oligonucleotides is complementary to a region contained in the ligated adapters. 16 Chronologically the oldest method 13 biological and technological context of this thesis Thoughmore expensive andmore programmatically challenging, the paired-end approach facilitate the detection of genomic rearrangements¹⁷ and repetitive sequence elements. It also helps to distinguish between a gene isoforms and provides greater support to identify novel transcripts (new isoform or gene) and fusion genes¹⁸. In both cases, Illumina’s sequencing process, sequencing-by-synthesis [Bentley et al., 2008], is the same. It uses the DNA replication mechanism with modified deoxynucleoside triphosphates (dNTPs). Reversible fluorescent tagged dNTPs, which are protected at their 3’end to block any further elongation, allow a step-by-step incorporation. The product of this synthesis is called a read, and it supports the base calling¹⁹. The sequencing co-occurs on every identical fragment of every cluster of the flow cell. It begins with the hybridisation of a complementary 5’ primer onto the 3’ binding site of the tethered DNA template. This primer is then extended by replication through several sequencing cycles to create a new read. A sequencing cycle starts with the addition of one complementary fluorescent dNTP to the new growing read, which stops the replication process as the dNTPs 3’ end is blocked. A wash discards all the unlinked dNTPs away. Then, the clusters are excited by a light source, and the signal intensity and (characteristic) wavelength of each dNTPs are recorded since they allow the identification of the new nucleotide incorporated by each cluster and measure the accuracy of the base calling. Finally, the fluorescent tags and the 3’ caps are cleaved and washed away before a new cycle begins. The number of cycles determines the final read length. Unfortunately, as the sequencing proceeds, the error rate of the sequencers increases. This is due to the incomplete removal of the fluorescent signal which increases the background noise and thus reduces the signal-to-noise ratio. Once the programmed read length is achieved (typically between 25 to 200 nt), the reads are washed away (after denaturation). 1.2.3.1 Sequencing specificities for the paired-end protocol The paired-end protocol uses an additional primer. The first run is initiated by a single primer and follows the same steps of the single-end sequencing cycle. Once completed, the complimentary read is washed off, and the 3’ end primer is deprotected. Then, the DNA fragment bends over and hybridises to a complementary oligonucleotide at the surface of the flow cell. Next, the second primer initiates a new sequencing cycle at the end of which the newly synthesised read is washed away. A single new bridge replication follows. The new double-stranded DNA fragment is denatured, and the 3’ primers are protected before 17 As indels or inversions 18 A fusion gene is a gene that is the product of the fusion of parts of two (or more) different genes. 19 Base calling: identification of the nucleosides in a sequence by assigning chromatogram peaks or another kind of signal variations to (nucleo)bases. 14 1.2 transcriptome exploration with rna sequencing the original strand (that has been already read) is cleaved and washed away. Finally, the remaining strand is sequenced following the previously described sequencing cycle. Once the same number of sequencing cycles as the first strand is reached, the read product of the remaining strand is washed away. By convention, the first primer allows to read the forward strand and the second primer the reverse strand. Note that these forward and reverse concepts used in paired-end protocol have no relevance to the biological concepts of forward and reverse for genes. 1.2.4 From analogous input to digital output At the end of the sequencing process, a set of images across the flow cell (one per sequencing cycle) is produced from the detected wavelengths. While it is possible to work with the images themselves, in most cases, the sequencing facilities will perform the base calling and other intermediate steps before providing the end-user with text files. These files are usually distributed in the FASTQ format [Cock et al., 2010] which record for each cluster (read) a unique identifier, a nucleotide sequence and a Phred quality score for each base of the sequence. A few optional information can also be provided, e.g. the position of the read on the flow cell (See Appendix A.5 for a random read example). The Phred quality score (𝑄) measures the accuracy of the identification of the nucleobase to which it refers. These scores are set by the base calling program and are defined as 𝑄 = −10 log10(𝑃 ) with 𝑃 the probability of the base being called wrongly. There are several possible encoding formats (see Appendix A.6). In single-end sequencing, there is one file per sample. In paired-end sequencing, the reads are usually separated based on their associated indexes into two ordered files: all the reads from the forward strands are grouped in one file, and the ones from the reverse strands in a second one. 1.2.5 A typical bioinformatic workflow for RNA-Seq study From the reconstruction of the transcriptome to the normalisation of expression in each sample, the various steps may be addressed through many different algorithmic approaches. Often, the choice of a method at one stage implies a more limited number of alternatives from which to pick at later points in the pipeline. The choice is frequently driven by the kind of downstream analyses planned for the study. More than the practical format of the data for these, it is the assumptions and the methods used upstream that are critical for a rigorous investigation and, later, for an accurate interpretation of the results. 15 biological and technological context of this thesis Figure 2.1 presents an example of the overall in silico process of raw RNA-Seq data. It summarises the steps and highlights the tools I used to process the data within this thesis. Before any downstream analysis, for each read, the genomic region (or locus), from which it has been expressed initially, needs to be identified. Indeed, RNA-Seq main objective is to quantify the expression of genomic features²⁰. In other words, the transcriptome needs to be reconstructed from the short reads and annotated (i.e. identify which features have been expressed in each library). Two different main strategies (see further ‘Reconstruction strategies’ segment) manage to accomplish this identification step. Independently of the approach, this step is the most challenging and time-consuming of the workflow. Tools, which tackle the reconstruction, usually provide many tunable heuristic parameters (e.g. maximum number of allowed mismatches or indels per read before discarding a possible identification²¹) to speed up the task. Unfortunately, as on Illumina platforms, the base calling accuracy decreases along the read length, this may lead to an information loss [Minoche et al., 2011]. To prevent informative reads to be discarded, it is opportune to perform a quality check of the raw data before the identification step. Thus, reads with a drop of accuracy in their 3’end may be shortened (i.e. trimmed) and rescued for the next reconstruction step. Similarly, low- quality reads may be discarded hence lowering the complexity of the reconstruction task and hasten its accomplishment. 1.2.5.1 Quality check, trimming and filtering The quality assessment allows removing any read (or part of it) that would increase the complexity of the reconstruction step or skew the downstream analyses. It is wise to discard uninformative reads, i.e. reads with a low sequence complexity (e.g. poly-T or poly-A tails) or with ambiguous sequences (in other words with uncalled bases — reported as N ). Indeed, these reads will hamper the processing time as they usually map to several parts of the genome while also decreasing the accuracy of the global gene expression estimations. For similar reasons, it is judicious to remove reads with a low overall quality score²². It is also prudent to check and remove any read that may map to possible contamination sources²³. Indeed, as these reads are ambiguous, it is safer to discard them than skew the expression estimations. Finally, as many tools (mappers in particular) require all the input reads to have the same length, the purity-length balance requires optimisation. Indeed, the trimming has to compromise between an approach too lenient (where the mappers discard many unfit 20 These genomic features could be genes, isoforms, exons, novel genes, … In short, any genomic region with an annotated function. 21 Indeed, many reads will have many identifications; these reads are defined as ambiguous reads. 22 It may vary based on the complete set of reads to analyse. 23 For example, for eukaryotes, by aligning (see next segment) every read to the Escherichia coli genome. 16 1.2 transcriptome exploration with rna sequencing reads at a later step), and a too stringent one (where either the reads are shorter, which increases the overall complexity and therefore hinder the mapping both on time and accuracy [Williams et al., 2016], or too few are left for pertinent analyses). When the tools allow it, avoiding quality-based trimmings is probably a better practice. Generally, after the sequencer calls the reads, a first trimming removes all the adaptors and barcodes needed by the sequencing protocols. Thus, in principle, they are not to be found in the ‘raw data’. However, to avert any latter contingency, a search against a list of the most common adaptors and an over-representation assessment of small sequences (k-mers²⁴) at each end of the reads is good practice. 1.2.5.2 Reconstruction strategies Two main approaches can be used for the very computationally expensive step of identification. I will present them in decreasing order of complexity: the de novo assembly of the reads and then the reads alignment approach (to a genome reference or a transcriptome one). Regardless of the approaches, the reconstructed transcriptome is usually reported as a SAM file [H. Li et al., 2009] (or one of its derivative formats: either BGZF-compressed binary file that can be converted into SAM (BAM) or more recently CRAM). de novo Assembly This approach is favoured when the reference genome of the species of interest is unavailable or of poor quality (e.g. many non-model organisms) or inadequate (e.g. cancer samples) for the samples of interest. However, if a reference already exists this strategy is avoided to the utmost. It allows the unbiased discovery of novel exon-exon junctions [Robertson et al., 2010]. As none of the datasets I use in this thesis has been reconstructed through this approach, I briefly summarise the main points below as more in-depth reviews cover this strategy (see J. A. Martin et al., 2011). In de novo assembly, the reconstruction of the transcriptome happens with the construction of the longest possible contigs (i.e. contiguously expressed regions) based on sets of overlapping reads (see also Figure 1.5). Shorter reads add to the overall complexity of this approach. While paired-end reads may help to solve many genomic regions, lowly expressed or repetitive regions remain challenging to determine. There are several algorithmic approaches for de novo transcriptome assembly [Wajid et al., 2012], though the most prevalent one is the de Bruijn representation [Robertson et al., 2010]. 24 k-mer : In the present context, all possible subsequences of length k of a read. 17 biological and technological context of this thesis Figure 1.5. de novo Assembly. From overlapping regions of raw reads, contigs are created by integrating the reads sequences together. Read alignment This approach exploits prior knowledge. The reads are aligned to a reference to hasten the reconstruction process. The reference may be a genome or a transcriptome (provided that a good annotation is available). (a) Alignment to the genome (b) Alignment to the transcriptome Figure 1.6. Overview of main reconstruction strategies for an RNA-Seq transcriptome by alignment to a reference • Genome reference Aligning to the genome allows discovering new genes or isoforms. However, it requires splice-aware algorithms, i.e. they need to align the reads across the splice-junctions (which is possible but non-trivial). As illustrated in Section 1.2.5.2, the reads might span many discontinued regions of the reference. While on the one hand, aligning to the genome 18 1.2 transcriptome exploration with rna sequencing avoids multiple mapping issues²⁵ for the same exon, this also implies that the genome needs to provide the coordinates for the different isoforms which will then require further analyses for accurate quantifications at that isomeric level. Indeed, irrespectively of the number of isoforms including a specific exon, the sequence of this exon is transcribed only once in the reference. • Transcriptome reference Using a transcriptome for reference instead of a genome reduces the complexity of the aligning step due to the lack of intronic sequences. However, it also limits the potential downstream analyses, e.g. any new (or unannotated) gene or isoform will be missed. This approach is the easiest, but a pre-existing accurate and well-annotated gene model is required. Section 1.2.5.2 shows in fact that this approach is simpler to the previous one as a direct read alignment is done against the transcriptome of reference. This enables the accurate gene isoforms expression quantification, provided that the gene model is correct and the reads may be attributed unambiguously to a single isoform for each gene. However, in practice, this approach produces many multimapped reads, particularly for shorter reads as many isoforms are very similar in sequence and vary only in the exons they retain. If the difference in the exon compositions is towards the end of the gene, reads from two isoforms may be indistinguishable. Paired-end sequencing helps to resolve part of the ambiguity encountered with single-end protocols. To mitigate very computational greedy approaches and more constraining ones, several tools complement the previous strategies. Hybrid approach between de novo and alignment There are tools like TopHat2²⁶ (2.0.12) [D. Kim, Pertea, et al., 2013], that use a hybrid approach between a reference alignment and a de novo assembly. Reads and fragments In the case of paired-end data, each read of the pair is first processed separately. Then, in the final evaluation phase and with the help of additional information sources, they are used as a pair to infer among the many possibilities which are the most credible ones. Both parts of a paired-end data once aligned to a concordant region of the genome is then called a fragment instead of a read. Today, there is conflation between the ‘read’ and the ‘fragment’ terms. Even though the term fragment is more accurate and may be used in any situation (as it equals to one read for single-end data and a pair of related reads for paired-end data), it is frequent to see the term read instead (even for paired-end data). 25 Due to sequence similarity, a same read or subpart of a read may be attributed to many different loci in the genome. As it is impossible to attribute the read to its original locus of expression directly, distribution models have to be pondered to avoid unnecessary skewness during the quantification step. 26 TopHat2 — https://ccb.jhu.edu/software/tophat/index.shtml 19 biological and technological context of this thesis TopHat2 — along with STAR²⁷ [Dobin et al., 2013] — is the most popular splice-aware mapper for genomes with a near-complete annotation (e.g. for Homo sapiens) [Engström et al., 2013] despite being slower than the latter [D. Kim, Langmead, et al., 2015]. As many concepts or terms in Science, ‘read mapping’ can have different (however very closely related) meanings. Hence, while for many people, read mapping will encompass any transcriptome reconstruction strategy (including de novo assembly) since the main point is to map the features to functional annotations, for others the term will only refer to ‘read alignment’ strategies specifically [Pachter, 2015]. Tools such as TopHat2 contributes to this confusion. 1.2.5.3 Quantification of features When working with RNA-Seq data, the typical next step after mapping the reads or fragments to the reference is to quantify the expression of the feature²⁸ of interest.²⁹ In the context of this thesis, I only consider gene expression (either as RNA or as protein). Hence, many subtleties required for isoforms or exons studies are here irrelevant and are left out of my overall review. Several tools and algorithmic approaches are available. Indeed, for larger genomes, many regions may present high sequence similarity which results in many ambiguous (and challenging) reads as they mapped to many potential genomic sites. These reads are also called multireads. One early strategy to solve multireads is to discard them from later analyses; another one is to attribute them to the most credible locus based on the overall distribution of the reads for a given sample. [Mortazavi et al., 2008] Paired-end data help in many cases to discriminate between possible genomic original sites, thus decreasing the overall number of multimapped fragments. Two popular tools were used in this thesis, Cufflinks2³⁰ (2.2.1) [Trapnell et al., 2010] and HTSeq-count³¹ (0.6.1p1) [Anders et al., 2015], which are both compatible with TopHat2 but rely on different concepts. I briefly present them below in their chronological release. Cufflinks Cufflinks2 is part of a collection of tools called Tuxedo suite³² which also includes TopHat2 and Bowtie. Cufflinks2 can assemble de novo novel transcripts and isoforms following the same principles than TopHat2. Likewise, using good references is faster and more useful. 27 STAR — https://github.com/alexdobin/STAR 28 E.g. genes, isoforms, exons, splicing events. 29 In fact, while genotyping, heredity studies and other genetically focused studies are possible in principle, the common main focus is centred on expression estimation. For example, instead of reporting a specific SNP, in an RNA-Seq study, the core interest is currently more about the specific allelic expression. Moreover, most of the RNA-Seq studies fail to provide the necessary sequencing depth and coverage for other kinds of study. 30 Cufflinks2 — http://cole-trapnell-lab.github.io/cufflinks/manual/ 31 HTSeq-count — http://www-huber.embl.de/HTSeq/doc/index.html# 32 Tuxedo suite user group: https://groups.google.com/forum/#!forum/tuxedo-tools-users 20 1.2 transcriptome exploration with rna sequencing Reference transcriptome Splice junctions Initialisation Isoform A Isoform B Isoform C Convergence Isoform A Isoform B Isoform C Iterations Isoform B Isoform A Isoform C Figure 1.7. Abundance estimation of isoforms by Cufflinks2 following an EM algorithmic approach. [Adapted from Turner (2015)] In both cases, Cufflinks2 infers the most parsimonious³³ and credible set of transcripts (and their isoforms) that can explain the complete set of observed fragments. This task is challenging as many isoforms are sharing a common set of exons. While most genes present a dominant isoform for a specific condition, there are often a few other isoforms expressed along, even though their amount may be very limited [Gonzàlez-Porta et al., 2013]. Furthermore, Cufflinks2 tries assigning the multimapped reads to one isoform only. First, sets of fragments are separated into sets of isoforms (all fragments that are likely produced from the same set of isoforms are regrouped together). Then to estimate the abundance of each isoform of one set, Cufflinks2 integrates many information sources together. For example, the overall distribution of fragments (or reads), if these are spanning over known (or novel) splice-junctions. Particular attention is drawn to the fragments that map unambiguously to one unique isoform. When available, paired-end fragments are critical: as they cover longer regions, the probability that they span multiple adjacent exons is increased which helps to resolve the possible structure of the original isoforms. [Roberts et al., 2011] The abundance is finally estimated through an EM algorithm [Do et al., 2008; Dempster et al., 1977], with the following main steps (see also Figure 1.7): 1. Initialisation: For each fragment, a rough estimation of the probability to be expressed from each isoform is computed based on the different piece of information cited previously. 2. Iteration till convergence with the observed distribution of fragments: a) Isoforms abundance are recomputed based on the updated fragment-to-isoform assignment b) Fragment-to-isoform assignment re-updated based on the isoforms abundance. 33 As a requirement to Occam’s razor, which may be a debatable strategy [Westerhoff et al., 2009] 21 biological and technological context of this thesis Finally, to compute the gene expression levels, Cufflinks2 aggregates per gene all the isoforms expression abundances together. Cufflinks2 provides by default FPKM³⁴ normalised data. HTSeq-count The Python library HTSeq provides a stand-alone script (HTSeq-count) performing the feature quantification with a more conservative strategy. It discards all ambiguous reads from a SAM/BAM file and then only counts the unambiguous reads that overlap with the features of interest for a given gene model³⁵. Reference Gene 1 Gene 2 Gene 3 3 4 8 Countsraw Figure 1.8. Abundance estimation of genes by HTSeq-count. The unambiguous reads (or fragments for paired-end data) overlapping locus annotated as gene are directly counted. [Adapted from Gonzàlez-Porta (2014)] HTSeq-count deems as ambiguous any multiread or read that overlaps more than one annotation for the considered feature.³⁶ HTSeq-count provides three modes for fine-tuning the overlap definition. For this thesis, I used the ‘intersection non-empty’ mode (see Figure A.4). This mode avoids discarding too many reads due to a too tolerant annotation (i.e. the annotation itself presents many overlapping definitions for a given pair of feature and chromosome region). Initially, HTSeq-count [Anders et al., 2015] was designed for differential gene expression analysis (DGEA). This type of analysis compares expression profiles to highlight the genes (or transcripts) for which the expression is significantly different depending on the considered condition. As multireads are irrelevant for those studies, including or excluding them from the downstream analysis is insignificant. Many papers (e.g. Fonseca, J. Marioni, et al. (2014), Robert et al. (2015), and Everaert et al. (2017)) have since shown that the gene expression estimation by HTSeq-count, while underestimated, is overall well-correlated with other RNA quantification methods (e.g. microarrays or RT-qPCR). Moreover, quantifications with HTSeq-count are highly correlated with Cufflinks2 quantifications for most of the genes after proper 34 See Section 1.2.5.4. 35 Gene models are distributed as annotation file (usually either as GTF or GFF format) and refer to a specific reference (genome or transcriptome). 36 In fact, reads might be discarded for one feature but kept for another one. For example, to quantify the expression of a given gene, HTSeq-count considers every read that unambiguously overlaps with any of its annotated exons — indeed, HTSeq-count defines a gene as the union of all its exons. However, while quantifying exon expression, many of these same reads may be discarded as they overlap several exon annotations with overlaying definitions. 22 1.2 transcriptome exploration with rna sequencing normalisation [Everaert et al., 2017] as HTSeq-count provides raw counts (i.e. unnormalised counts). 1.2.5.4 Normalisation Regardless of the quantification method, a normalisation is usually necessary to avoid a few statistical biases (mainly due to the sampling). The normalisation method, though, is generally determined based on the quantification (method or tool³⁷) and they have to be suitable for the planned downstream analyses. As RNA-Seq fails to assess the absolute concentration of each expressed gene (or transcript) in a sample, each normalisation method is based on a specific set of assumptions that may be incompatible to the ones required by many investigative approaches. Many papers review or compare normalisation methods (See Dillies et al. (2013), Zwiener et al. (2014), Zyprych-Walczak et al. (2015), and Peixoto et al. (2015)). RPKM and FPKM The first evident source of sampling bias is the total number ofmapped reads (or fragments) between two RNA-Seq libraries (shortened as ‘libraries’ from now on). Indeed, there may be considerable discrepancies in their respective amount of starting material loaded on a flow cell³⁸ and, more importantly, the number of mapped reads (or fragments) to a reference³⁹. The second source of bias arises when two genes have their expression level compared. Indeed, as a longer gene produces more reads (or fragments), it has a greater statistical chance to be sampled. Figure 1.8 illustrates this sample bias: Gene 3 is twice as long as Gene 2, and their raw counts also include this scaling factor. However, with proper normalisation, Gene 2 and Gene 3 are expressed in equal proportions. To correct for these two biases, Mortazavi et al. (2008) introduced a new unit ‘RPKM’which they first defined as reads per kilobase of exon model per million mapped reads. Since then, this unit has been redefined and replaced to avoid ambiguities in the case of paired-end data by another unit, ‘FPKM’, which stands for fragments per kilobase of transcript per million mapped reads⁴⁰. The canonical formula for FPKM (or RPKM) is: 37 Many quantification tools (e.g. Cufflinks2) perform the normalisation step automatically as well. They may also (or not) provide raw counts. 38 In fact, this would involve the monitoring of many parameters or assessments of the samples before the library preparation. Moreover, while it may be possible to weight each sample before extracting the RNA, the many steps (involving the fragmentation of the RNAs, PCR syntheses or size-selection) and their associated biases overburden the tracking of the final amounts used for the sequencing. 39 The quantification disregards the unmapped reads (or fragments) and so does the normalisation. 40 As mentioned before, despite the inaccuracy, read and fragment are often used interchangeably; this is also the case for RPKM and FPKM. 23 biological and technological context of this thesis ̂𝜇𝑖𝑗 = 𝑓𝑖 𝐹𝑗 ⋅ 10−6 ⋅ ℓ𝑖 ⋅ 10−3 = 𝑓𝑖𝐹𝑗 ⋅ ℓ𝑖 ⋅ 109 (Canonical F/RPKM formula) where: ̂𝜇𝑖𝑗 is the normalised expression for the feature (e.g. gene or transcript) 𝑖 in sample 𝑗, 𝑓𝑖 is the count number of the fragments (or reads) mapped to feature 𝑖 in sample 𝑗, 𝐹𝑗 is the total count number of all the fragments (or reads) mapped in sample 𝑗, ℓ𝑖 is the length of feature 𝑖. The scaling factor was introduced such as in most cases 1 FPKM is crudely equivalent to a single RNA molecule in the cell [Mortazavi et al., 2008]. This has been observed in other papers (see for example Hebenstreit et al., 2011) and also explains why 1 FPKM is a commonly used threshold. This normalisation is quite intuitive and still largely used today. In fact, I use this normalisation through the thesis. Meanwhile, it is also unsuitable for a popular type of analysis, differential expression analysis (DEA), which seeks to highlight genes which expression varies between different biological conditions. However, if for any biological or technical reason, any set of genes detected in a specific condition and undetected in another, will affect the FPKM estimation of every RNA in both conditions (see Table A.3) and will entangle the interpretation. Other normalisation approaches Differential expression analyses (DEA) have led to the development of distinct models and methods. They generally involve a model where for most of the genes, the expression is assumed to be stable between conditions⁴¹ (e.g. edgeR⁴² [Robinson et al., 2010] or DESeq2⁴² [Love et al., 2014]). Also, many normalisation methods applied first to microarrays are used, e.g. the most common ones include a quantile normalisation method or a simple scaling normalisation. Other normalisation methods try to correct a priori or a posteriori biases⁴³. A few of them may correct batch effect (see Section 1.5) or other confounding factors. RUVSeq⁴² [Risso et al., 2014] is one example. Although FPKMnormalisation is generally avoided in favour of another (more appropriate) method, this thesis aims to explore the baseline expression of the genes between andwithin tissues, and in this context, despite its biases, FPKM normalisation is better suited than any normalisationmethod designed for DEA. For this reason, thework based on transcriptomic data presented in this thesis is based on FPKMs (see Chapter 2). 41 The comparison is usually between diseased (or treated) samples to control (healthy). 42 Bioconductor package 43 The Bioconductor package CQN [Hansen, Irizarry, et al., 2012] for example corrects the expression levels according to their length and their GC content before applying a quantile normalisation (as cDNAs enriched in GC bases are more stable and tend to a more optimal amplification). 24 1.3 proteome exploration with mass spectrometry 1.3 proteome exploration with mass spectrometry Through the last decade, the proteomics field has shifted from technical research on instruments and methods to the extensive and routine use of mass spectrometry (MS) as an analytical tool [Aebersold and Mann, 2016]. Many possible workflows for MS-based proteomic studies exist as MS is very versatile and supports many proteomic investigation approaches, such as protein characterisation, modification sites, structures, mechanism-oriented (interaction) studies [Aebersold and Mann, 2016]. In this regard, high-throughput protein identification and quantification have thoroughly developed when MS became the primary choice method since it allies a good dynamic range with high sensitivity and specificity [Aebersold and Mann, 2003; Brosh, 2009; Cox and Mann, 2011]. Depending on the study purpose, available time and money, the number of samples, the available instruments and the needed sensitivity and specificity, the choice will be based on one strategy rather than another. Although, top-down and, more recently,middle-down approaches exist, bottom-up approaches are the most favoured in the field by far. Top-down approaches [Catherman et al., 2014] study the intact proteins, (i.e. as a whole without digesting them in smaller peptides). They are appealing, but still very challenging both experimentally and computationally [Aebersold and Mann, 2016]. They are more suitable for highly purified samples as the MS and fragmentation (tandem MS (MS/MS)) spectra are highly complex. On the other hand, digesting the proteins in smaller molecules allows them to ionise better and facilitates the spectra interpretation. Middle-down approaches [C. Wu, J. C. Tran, et al., 2012] produce large fragments (up to 20kDa). Bottom-up approaches generally use enzymatic digestion with trypsin to produce small peptides (about twelve aas on average). The cheapness, ease and reproducibility of the trypsin digestion⁴⁴ are the reason for the bottom-up approach popularity⁴⁵. Bottom-up methods fall into two main types of strategies: targeted and untargeted. Another layer of complexity is added by the selected MS acquisition mode: data-dependent acquisition (DDA) or data-independent acquisition (DIA) (see Section 1.3.3.5). Since all the proteomic data presented in this thesis have been generated as part of global discovery studies through DDA bottom-up label free approaches (see Chapter 2), in the following section, I mainly focus on the obtention and processing of these types of proteomic data. Figure 1.9 shows a summary of possible bottom-up approaches based on DDA methods. DDA targeted approaches [Domon et al., 2006; Shi et al., 2016] allow the absolute or relative quantification of a small preselected set of proteins. This strategy is favoured for example to validate possible biomarker candidates. Selected reaction monitoring (SRM) [Picotti et al., 2012] (also known as multiple reaction monitoring (MRM) [A. Hu 44 See section 1.3.2.3 45 The experimental spectra are compared to databases that collect theoretical spectra or experimental ones from prior studies to reconstruct the proteins from the peptidic fragments (hence bottom-up). 25 biological and technological context of this thesis Bottom-up proteomic quantification with MS Untargeted (Global approach, discovery) Targeted (Validation) SRM PRM Metabolic Chemical Enzymatic Tag Label free MS1 (XIC) MS2 (Spectral counting) Figure 1.9. Bottom-up quantification approaches. The work presented in this thesis relies on proteomic data that has been acquired through a DDA bottom-up label-free MS/MS approach (dashed frame). et al., 2016; Shi et al., 2016]) and its variant parallel reaction monitoring (PRM) [Gallien et al., 2012] are more sensitive and specific, and give more accurate quantification than untargeted methods: as the set of proteins and peptides to fragment is known, interpreting the spectra is much easier. However, SRM is the most sensitive, while PRM is the most specific and accurate [Benhaïm, 2017]. Among DDA methods, untargeted strategies are the most suitable for global approaches or discovery projects. Unless samples comprise a subset of proteins (or spike-ins) with known concentrations, these methods provide only relative protein quantification. Tagged strategies [Zhou et al., 2014] label the proteins (before or after extraction) with stable isotopes (see Table A.4). For each condition, a specific isotope is used. Thus, the proteins and peptides have exactly identical physicochemical properties, have the same behaviour through the protocols, and only their mass can differentiate them. The labelling can be either enzymatic (e.g. 18O-labelling [X. Ye et al., 2009]), chemical (e.g. isotope- coded affinity tag (ICAT) [Gygi, Rist, et al., 1999], isobaric tag for relative and absolute quantification (iTRAQ) [Ross et al., 2004], tandem mass tags (TMT™) [Thompson et al., 2003], dimethyl labelling [Hsu et al., 2003], or 2D-differential in-gel electrophoresis (2D- DIGE) [Unlü et al., 1997]), or metabolic (e.g. stable isotope labeling by amino acids in cell culture (SILAC) [X. Chen et al., 2015]). Some tagged approaches have multiplexing protocols. Note that other mass tags (e.g. metal coded affinity tag) attached to peptides or proteins are an alternative to isotopic labelling. Label-free strategies [Hsu et al., 2003; Neilson et al., 2011; Sandin et al., 2014] analyse the proteins after their digestion with the trypsin. Since there is no marking in these 26 1.3 proteome exploration with mass spectrometry methods, multiplexing is impossible. Two different methods can quantify the relative abundance of the proteins. Spectral counting (also called MS2 quantification) quantifies proteins based on the assumption that the more abundant a protein is, the more this protein is selected to be fragmented; thus the total number of MS/MS spectra that can be mapped back to the protein can be used to estimate it. Unfortunately, this technique is lacking accuracy and is highly criticised compared to the secondmethod which is intensity based. The secondmethod, extracted-ion current (XIC) [Higgs et al., 2013], quantifies each peptide by first extracting its ion currents, its molecular mass and retention time (MS1) (or the ones of its fragments (MS2)), and then by integrating the area under the curve (AUC) of the peptide monoisotopic molecular mass (p), and the following isotopic molecular mass peaks (p+1) and (p+2). The XIC assumption is as follows: the more concentrated a peptide is, the greater is the AUC. Note that since there are many possible MS approaches and protocols [Bantscheff et al., 2012], in the following section, I focus on the one that initially generated the proteomic datasets I am reusing for this thesis (see Section 2.3). Thus, I describe one widespread bottom-up approach, i.e. the label-free Liquid Chromatography (LC) followed by tandem Mass Spectrometry (LC-MS/MS) protocol (also known as shotgun proteomics) [Cox and Mann, 2011; Y. Zhang, Fonslow, et al., 2013]. The following segments may be suitable for other methods as well. 1.3.1 Sample preparation DDA label free protocols to prepare samples for discovery proteomic analyses are generally much simpler to implement than the ones for RNA-Seq. However, as proteins are alsomore complex and heterogeneous than DNA and RNA [Bruce et al., 2013], there is a broader choice of them in order to adapt to any requirement [Feist et al., 2015]. 1.3.1.1 Sample collection and conservation Collection Feist et al. (2015) report that traditional dissection, biopsies, blood draws and other methods can deliver adequate samples for proteome analysis. Conservation Recent developments have significantly improved proteome analysis from formalin-fixed paraffin-embedded (FFPE) samples [Steiner et al., 2014]. As they are still evolving, they may compare soon to fresh-frozen (FF) samples. However, for now, fresh or FF samples are remaining the best primary sources. 27 biological and technological context of this thesis 1.3.1.2 Protein extraction and contaminant removal Protein extraction Contrariwise to the collection step, Feist et al. (2015) explain that the crucial consideration is the cell lysis and the extraction approaches used on the protein as they may interact and disrupt the later characterisation step and thus require appropriate picking. Besides, the physicochemical properties of the proteins are far more heterogeneous than for DNA or RNA molecules [Bruce et al., 2013] and may be incompatible with many extraction protocols. Contaminant removal Gutstein et al. (2008), Bodzon-Kulakowska et al. (2007), Visser et al. (2005), and Hilbrig et al. (2003) review various examples of mechanical and chemical extraction and contaminant removal methods. Indeed, contaminants and detergents need to be eliminated from the samples before analysis as, in the typical bottom-up approach, they interfere with the digestion, the separation and fragmentations steps; precipitation and filtering (based on molecular weight cut-off) strategies are the best in the context of bottom-up MS analysis [Feist et al., 2015]. Indeed, many extraction solvents are inappropriate for MS. Precipitations may denature the proteins, but this is usually irrelevant in this situation. Feist et al. (2015) review different precipitation protocols and emphasise that caution is needed for the next re-suspension step to avoid missing a substantial part of the sample proteome. They list several techniques and approaches in this regard. 1.3.2 Reducing samples’ complexity High-throughput bottom-up workflows (including LC-MS/MS) generally aim to decrease the sample complexity to analyse while increasing the depth of proteomic coverage [Z. Zhang et al., 2014; Bruce et al., 2013; Cox and Mann, 2011]. Indeed, many aspects may impede the characterisation of the proteins. For example, the broad physicochemical scope of the proteins will require various protocols to be handled efficiently. Their concentrations in a sample may saturate the MS characterisation capacity as the very abundant proteins are easily detected and quantified while rarer proteins may be missed entirely without specific precautions [K. Liu et al., 2009; Cappadona et al., 2012] or without unbearably increasing the analysis time per sample [Nilsson et al., 2010]. On the other hand, strategies to decrease saturation effects can also be used instead [Z. Zhang et al., 2014]. Hence, the usual workflow will usually involve the denaturation, the reducing, the alkylation and the digestion of the proteins. The peptide mixture products are then separated in smaller fractions before subjection to MS so as to increase the coverage 28 1.3 proteome exploration with mass spectrometry Figure 1.10. Overview of proteomic data generation. depth while keeping a reasonable analysis time-frame [Aebersold and Mann, 2003; Cox and Mann, 2011; Y. Zhang, Fonslow, et al., 2013]. Figure 1.10 presents a general overview of one possible workflow. Indeed, the various complexity reducing steps may happen in a different order than I report hereinafter; they may also happen concomitantly. Additionally, protocols may skip or, conversely, perform the same type of complexity reducing step several times. For example, protocols may present protein fractionation as the first step and then peptide fractionation as a later one. Moreover, almost all protocols will only involve liquid chromatography for their fractioning steps. 1.3.2.1 Denaturation, Reduction and alkylation These steps help the separation of the protein complexes, the relative linearisation of the proteins and, to some extent, the homogenisation of the crude mixtures [Feist et al., 2015; Bruce et al., 2013]. They may happen simultaneously with other steps. For example, they may occur during the extraction, the depletion or the digestion (where they facilitate the trypsin cleavage). 1.3.2.2 Depletion of highly abundant proteins Regrettably, the proteomic field lacks amplification methods, and there are very few strategies to remove the highly abundant proteins which removal is required to capture 29 biological and technological context of this thesis scarcer proteins in untargeted DDA bottom-up protocols. Fortunately, these are very limited in number and can be precisely targeted. Z. Zhang et al. (2014) report two useful strategies. The first one aims to remove them entirely from the sample, either by selective precipitation or (more expensive) by affinity. The second approach aims to the equalisation of the proteome. It may be based either on combinatorial ligand libraries involving bead-supported ligands (on a similar model to RNA-Seq protocols) or on specific protease mixtures. In general, the equalisation of the proteome improves the characterisation of the scarcer proteins [Z. Zhang et al., 2014] while they may introduce skewness to the analyses (due to the relative protein proportions). 1.3.2.3 Proteolytic digestion It may seem contradictory to digest the proteins into peptide mixtures as a means of reducing the complexity. Nevertheless, while the main drawback is the inability to distinguish between proteoforms [Bruce et al., 2013], it improves the proteins characterisation on several points. • It helps to homogenise the sample: peptides present closer physicochemical properties to each other than proteins [Z. Zhang et al., 2014]. Also, peptide separation (by gel or liquid chromatography (LC)) is easier than for proteins (see Section 1.3.2.4). • MS is also more sensitive to peptides than to proteins due to being more sensitive towards lower molecular-weight molecules [Vitek, 2009; Cox and Mann, 2011]. Proteins may also be too large to be fragmented (e.g. with CID) [Bruce et al., 2013]. • It is also easier to accurately characterise (identification and quantification) smaller molecules. Large proteins with similar compositions present very similar molecular mass and may be impossible to discriminate. On the other hand, the sequence- specific enzymatic digestion gives hints on the protein sequence. • Finally, it increases the coverage of the less abundant proteins [Z. Zhang et al., 2014]. Indeed, each protein is represented by multiple peptides hence increasing the sampling probability. Often, one or a small number of LC-MS/MS-characterised peptides is enough to identify the parent protein [Bruce et al., 2013]. The digestion may happen in-gel or in-solution. Although less commonly used as before in bottom-up approaches based on LC-MS/MS, in-gel digestion provides directly the fractioning and deals better with the more complex samples. Often, the gel will contain, in addition to the protease, a few chemical reagents that handle other steps (e.g. chaotropic reagents to denature the proteins at the same time). However, it requires greater time and amount of starting material than for the in-solution digestion. In-solution digestion, on the other hand, is the simplest and the most popular digestion 30 1.3 proteome exploration with mass spectrometry approach among all the proteome studies in general. It usually precedes filtering and fractioning steps. An hybrid method combining both methods, filter-aided sample preparation (FASP) [Manza et al., 2005; Wiśniewski, Zougman, et al., 2009] exists. Although there are other enzymes for restrictive digestion [Giansanti et al., 2016; Tsiatsiani et al., 2015], trypsin is the gold standard protease [Z. Zhang et al., 2014]. Trypsin is a serine protease that has a high proteolytic and highly specific activity: it cleaves the proteins at the carboxyl side of an arginine (R) or lysine (K) aas, when they are not followed by a proline (P). Trypsin’s cheapness and ease of use can also explain its popularity. Besides, trypsin creates peptides that are in the ideal range for MS studies as it produces a large number of short (600 to 1, 000 Dalton (Da)) peptides that can be efficiently fragmented and identified, although inferring the parent proteins remains challenging [Laskay et al., 2013]. Circularity may also contribute to this trend: as more studies are using it, more data are available for comparison; hence more studies employ it. Peptides produced by trypsin digestion are referred to as tryptic peptides. 1.3.2.4 Separation methods (fractioning) Fractioning (or fractionation) is the principal method to simplify sample complexity. It also allows focusing selectively on subcellular fractions when needed. Indeed, one may want to study particular organelles, cell compartments or other kinds of the proteome (e.g. phosphorylated or glycosylated proteins) [Cox and Mann, 2011]. Besides, fractioning is also a good strategy to reduce the impact of undersampling and increase repeatability between analyses. With MS being prone to undersampling, repeated analyses may fail to yield the same protein identifications, as different sets of peptides may get sampled. Several methods may fraction peptide mixture. Protocols may involve many of their various combinations. For example, M.-S. Kim et al. (2014) and Wilhelm et al. (2014) have used a gel and LC sequentially before MS analyses. Precipitation It is very easy to perform, and they usually involve solvent gradients. While their use is frequent for desalting crude mixtures, it is also quite limited for protein mixtures as other separation methods (based either on gel or capillarity such as LC) are more performant. Gel electrophoresis based separation Protocols may involve a first gel-based separation method before the liquid-based separation and fragmentation of LC-MS/MS [Feist et al., 2015]. The gel separation may comprise one step or two, which are respectively named one- dimension (1D) and two-dimensions (2D) gels. 31 biological and technological context of this thesis 1-D gel approach comprehends a denaturing alkylating gel (usually Sodium Dodecyl Sulphate-PAGE (SDS-PAGE) but others are possible, e.g. Lithium Dodecyl Sulphate-PAGE (LDS-PAGE)) [Shevchenko et al., 2006]. Thus, proteins lose all their quaternary, tertiary and secondary structures. The separation relies then on the length of the proteins. Indeed, SDS molecules carry negative charges, and they bind to the proteins proportionally to their length, and as an electric field is applied to the gel, the proteins migrate towards the positive side of the gel at different speeds due to their difference in their mass-charge ratio. The 1D protocol is faster than the 2D one, while the latter is more selective as it relies on very similar principles but in two separate steps⁴⁶. Liquid chromatography (LC) Chromatography is a technique of choice for the separation of mixtures into their components or — at least — in simpler mixtures. LC, as any chromatography, involves a mobile and a stationary phase. The mobile phase comprises an eluent (i.e. a solvent) and the mixture to be separated. A column plays the stationary phase part. High pressure is applied (e.g. HPLC (high-performance liquid chromatography) or UPLC (ultra performance liquid chromatography)) to improve and accelerate the process. The separation relies on the difference of affinity of the mixture components between the mobile and stationary phases. Many combinations are possible. However, any poor choice may precipitate the mixture on the (extremely) expensive column which means that both the column and the mixture will be lost. Hence, in discovery mode and for complex mixtures (as for proteins extracts), instead of using a normal phase, i.e. crude silica-gel, column (which interacts tightly with polar molecules such as peptides and may prevent them to interact with the mobile phase afterwards), a reversed phase column is more common. In these columns, the silica-gel is modified and has been attached to long hydrocarbon chain. Thus, the column is unable to fix anything permanently. Along with a Reversed-phase LC (RPLC) column, polar eluants are used. These interact strongly with charged proteins and peptides. Hence, the separation occurs on the polarity of the molecules and the more hydrophobic a molecule, the longer it will remain in the column as it will present fewer interactions with the eluent. The ease of coupling this powerful method to the (also powerful) MS explains why Liquid Chromatography (LC) followed by Mass Spectrometry (LC-MS) is so widespread today. Their combined use also allows high repeatability and optimised running time. 46 Usually, the proteins are first separated based on their isoelectric point (or pI) in a native gel (as opposed to a denaturing one). An ampholyte reagent is added to the gel which ensures a stable gradient of pH through the gel. The proteins migrate until they reached a pH region where their overall charge is neutral. In a second time, the proteins are separated perpendicularly this time on their mass after the addition of SDS or an alike reagent. 32 1.3 proteome exploration with mass spectrometry 1.3.3 Characterisation through fragmentation profiles The first reported use of the principle underlying MS happened in 1913 when Sir Joseph John Thomson channelled a stream of neon ions through an electromagnetic field and captured its deflection on a photographic plate [Thomson, 1913]. Arthur Dempster and FrancisW. Aston created the first mass spectrometers in 1918 and 1919 respectively [Aston, 1919]. In this context, the use of MS is quite recent in biology for proteomics as it has been developing from the 1980s onwards particularly with the development of soft ionisation methods [Papachristodoulou et al., 2014]. 1.3.3.1 General principle MS relays on the following principle: molecules of interest are ionised into charged particles, then the mass analyser separates them in the gas phase based on their total mass (𝑚) to charge (𝑧) ratio, i.e. 𝑚/𝑧. A detector collects all these ions and translates their signal (intensity versus 𝑚/𝑧) to an electric one which is the output serving as raw data for the later analyses. In the simplest case, the molecules are singly charged (𝑧 = 1), and only their molecular mass is recorded. However, it is quite common for the molecules to carry more than one electric charge and that if the energy used for the ionisation is substantial, they may also forego fragmentation and internal reactions that will result to the production of many spectra. The collection of spectra obtained from a single peptide ultimately increases the accuracy of the mass measurement and the identification of the peptide [Papachristodoulou et al., 2014]. Recent developments have shown that the use of two mass spectrometers in tandem (MS/MS) reduces the number of missed proteins significantly while keeping reasonable running times and it is more potent than excessive fractioning as is the improvement of HPLC [Cox and Mann, 2011]. 1.3.3.2 Ionisation While there are other methods (more adapted to small organic molecules) two soft ionisation methods are routinely used for proteomic samples: matrix-assisted laser desorption ionisation (MALDI) and electron spray ionisation (ESI). MALDI is very useful for large molecules. It allows creating ions with a minimum of fragmentation (if any at all). The molecules are fixed onto a matrix and then a pulsed laser irradiates the sample which provokes the ablation and then the desorption from the matrix and ionisation of the molecules. However, it requires mass spectrometers compatible with this ionisation technique [Z. Zhang et al., 2014; Walther et al., 2010]. ESI is currently the most popular technique of ionisation, mainly because it may be used with a large panel of different analysers and it interfaces efficiently with HPLC or UPLC. An electrostatic method performs the ionisation. The application of a high electric potential on a needle through which a liquid (containing the dissolved analyte peptides) 33 biological and technological context of this thesis is passing provokes its dispersion into small and highly charged droplets. These droplets start to evaporate to the point where the charge on their surface is so high that the desorption of the analytes occurs. The analytes are then in an ionised form (often carrying many charges, contrary to MALDI). They are then released into the mass spectrometer. [Walther et al., 2010] 1.3.3.3 Mass analyser Many different mass analyser designs are available. They all share two properties: they accelerate the ions (in a vacuum), so all the ions share the same kinetic energy, and then they deflect (and thus resolve) the ions based on their various𝑚/𝑧. Many are also capable of trapping and storing specific ranges of ions for more in-depth analyses until they release them based on their 𝑚/𝑧. The most common commercial analysers include quadrupole (Q), linear ion trap (LIT), time-of-flight (TOF), ion traps and most recent Fourier transform (FT) analysers such as ion cyclotron (ICR) and Orbitrap™, which has become the most popular. The choice of one analyser over others or any combination of them depends on many factors (including availability). [Haag, 2016] In Appendix A.7, I review the analysers involved in the production of the raw proteomic data used in this thesis. Indeed, all the datasets have been produced by a combination of linear trap quadrupole (LTQ) and Orbitrap™. 1.3.3.4 Fragmentation techniques [Z. Zhang et al., 2014] The overall quality and success of peptide (and then protein) identification depends largely on the quality of ion fragmentation. Tandem-MS (MS/MS) improves the identification of the peptides by cleverly increasing the number of fragmentations. Indeed, the first MS is used to select specific ions (hence known 𝑚/𝑧) to pass to a second MS. Thus for each MS1 spectrum, a collection of fragmentation mass spectra can be gathered. The peptide sequences are then more likely to be accurate, and the risk of false positive is significantly decreased. While the experimenter may want to limit fragmentation, they may also often use dedicated techniques to increase it. There are schematically three different classes of fragmentation techniques: collisional, electron-based or photon-based. The collisional category, CID (see Appendix A.7), is widespread. The chosen ion is introduced in the collision cell, and its collision with an inert gas particle produces kinetic energy, which transforms into internal energy. The fragmentation happens when this internal energy is sufficient to activate the dissociation of the ion. Another related and slightly more effective method for higher charge state ions is the higher-energy collisional dissociation (HCD). This latest method (developed by ThermoFisher for Orbitrap™ analysers), has gained popularity with the development of quantitative proteomics (e.g. iTRAQ) as it provides in parallel peptide identification and quantitation. In the electron-based 34 1.3 proteome exploration with mass spectrometry category, the most common is probably the electron transfer dissociation (ETD) technique: an anion donates an electron to a cationic peptide, and this transfer initiates the fragmentation of the peptide backbone [Syka et al., 2004]. For the other categories, see the review from Z. Zhang et al. (2014) and the included literature. 1.3.3.5 Acquisition modes There are two possible acquisition modes: data-dependent acquisition (DDA) and data- independent acquisition (DIA). The common DDA bottom-up label free protocol has many advantages [Aebersold and Mann, 2016]. It is untargeted and free from any hypothesis, and hence a great tool for global discovery study. It surveys the proteome all at once, and prior knowledge is unrequired. Its popularity increased concomitantly to the increased availability of high-quality genome and gene sequence databases and more recent technical advances in MS (development of new protein/peptide ionisation and fragmentation methods) [Aebersold and Mann, 2003; Cox and Mann, 2011; Z. Zhang et al., 2014]. The main DDA limitation [Guillaumot, 2017] is its inability to select all the existing peptidic ions for fragmentation by MS/MS. The stochastic selection is biased both by each peptide ionisation efficiency and the possible peptides coelution during the LC before being introduced in the MS. In practice, the instrument first quickly scans the current eluting peptides (MS1) before a few (the “top N”) precursors are selected one after the other for fragmentation (MS2). One MS1 spectrum with N MS2 spectra constitute one duty cycle, which usually lasts about 1s. Peptides elute over 30 to 40s, and software tools that drive the instruments can predict peptidic peaks and thus will increase the chance of high-quality MS2 spectra by selecting the peptides at their maximum abundance. A dynamic exclusion window avoids the same peptides being repeatedly selected and new targets to be fragmented. MS1 spectra allow selecting potential peptides to fragment, and their intensity. MS2 spectra are used to identify the peptides later. DIA approaches are more complicated as no precursor is selected and all the coeluted peptides are fragmented at the same time. The sustained interest in DIA methods is because they provide a less biased overview of the proteomes, although generally more restricted. They can generate comprehensive fragment-ion maps for specific proteoforms [Chapman et al., 2014]. Often DIA is performed after a short DDA survey that establishes a small reference spectrum library to help with the analysis of the MS2 spectra. In the past decade, the number of DIA studies has expanded, particularly with methods such as SWATH-MS [Gillet et al., 2012] (for proteome quantification). DIA methods are likely to gain even more popularity as many efforts are put into their development [A. Hu et al., 2016]. 35 biological and technological context of this thesis 1.3.4 Bioinformatic strategies for proteomics studies MS-based proteomics is quite challenging, and, in most cases, the major bottleneck of proteomics pipelines remains the data analysis [Y. Chen et al., 2016; Tyanova, Temu, Sinitcyn, et al., 2016]. Since mass spectrometers’ raw data output for proteomics is directly uninterpretable, it needs processing before being meaningful. Because the sheer amount of produced data can reach the terabyte (TB) range, it prohibits any manual handling and requires automation [Codrea et al., 2016], particularly for the peptide and protein identification steps [Nilsson et al., 2010]. The protein identification process leads to three computational challenges: peptide identification, protein inference and result validation [T. Huang et al., 2012]. Besides, shotgun proteomics produce highly redundant data: peptide subsets that ionise better than the rest are repeatedly and preferentially selected for fragmentation and thus will produce more MS/MS spectra [Eriksson et al., 2007; Koziol et al., 2013]. On the other hand, certain subsets of peptides can be undetectable with the currently available technology; they hardly ionise, or have a weak signal that is masked by other more abundant or more ionisable peptides. Thus, shotgun experiments are plagued by missing data [Stead et al., 2008; Lazar, Gatto, et al., 2016]. As for RNA-Seq, for each proteomics analysis step, there are many tools. Many integrate a few (if not all) of the steps described in the following pages, e.g. MaxQuant⁴⁷ [Tyanova, Temu, and Cox, 2016], OpenMS⁴⁸ [Pfeuffer, Sachsenberg, Alka, et al., 2017], Skyline⁴⁹ [Pino et al., 2020] or Crux⁵⁰ [Park et al., 2008; McIlwain et al., 2014]. From the raw data acquisition to downstream analyses (or any of the intermediate steps), pipelines integrate many different combinations of tools [Vitek, 2009]. Pipelines vary depending on experimental design, data type (i.e. expression, modification state or interactions map) and the data creation practicalities (e.g. method of separation, mass analyser kind, acquisition mode). Many factors are interconnected in MS and while individual effects have been extensively studied, designing a sound pipeline may be overlooked easily [Sun et al., 2012; Maes et al., 2016]. However, this risk is reduced as many tools imply working upstream or downstream with specific tools, e.g. the output of MaxQuant (raw data to protein quantification) is ready for use by Perseus⁵¹ [Tyanova, Temu, Sinitcyn, et al., 2016] (for analyses such as pattern recognition, time-series analysis, cross-omics comparisons and multiple hypothesis testing). Besides, as files from mass spectrometers are usually encoded in proprietary (commercial) formats, there may be a limited choice of available tools or software for a specific combination of analysis and data. A few open file formats (and their appropriate converters) exist (e.g. mzML 47 MaxQuant — http://www.maxquant.org 48 OpenMS — http://www.openms.de/ 49 Skyline — https://skyline.ms/ 50 Crux — https://crux.ms/ 51 Perseus — http://www.perseus-framework.org 36 1.3 proteome exploration with mass spectrometry format [Martens et al., 2011] and ProteoWizard⁵² [Chambers et al., 2012]), however, they may fail to record a few critical points of the raw data, and, so, they may also be unfit for particular analyses. Spectronaut⁵³ [Bernhardt et al., 2014] is one example of specialised pipelines. Spectronaut handles the targeted analysis of DIA⁵⁴ data. As bottom-up approaches are the most common, their associated tools also tend to dominate the MS-based proteomic bioinformatics [L. H. Lee, 2015]. In the following pages, I review the steps involved in the MS/MS pipeline, presented in Chapter 2, that has processed all the proteomic data presented in this thesis. 1.3.4.1 Signal processing and peak-picking The signal processing step comprises the (noise) filtering (or denoising), the baseline correction (which eliminates systematic trends), the signal normalisation (or centroiding), and the peak picking (i.e. mass peaks detection) [Codrea et al., 2016; Nahnsen et al., 2013]. These steps are highly automatised and happen almost simultaneously to the signal acquisition. Peak picking requires automation for proteomic experiments, notably as the number of spectra for one sample is in the order of 100,000. Even though instrument resolution has significantly improved over the last decade, peaks can be smooth (i.e. with a large signal- to-noise ratio (SNR)) and easy to pick, or they can still be noisy (i.e. barely distinct from the background) and trickier to identify. Peak width is related to the peptidic ion mass-to- charge (𝑚/𝑧), the mass analyser resolution and acquisition parameters. Algorithms may use this relation to detect peaks by scanning the mass spectra for local maxima of expected widths. [Bauer et al., 2011]. They can also rely on the associated isotopic peak clusters of the molecular mass peak (see Appendix A.8). These molecular peaks clusters (and their relative intensity) also provide information that can resolve the atomic composition of the peptides [Renard et al., 2008]. Optionally, a final molecular mass correction step can be applied to remove modifications due to the ionisation process. 1.3.4.2 Peptide identification and validation An essential step of proteomic data processing is to identify and then validate a peptide sequence for each molecular mass that has been detected. Typically in LC-MS/MS, spectra from a first mass spectrometer (MS1) allow the selection of the ionised peptides (i.e. precursor ions) to be fragmented, while the spectra from the second one (MS2) allow 52 ProteoWizard: http://proteowizard.sourceforge.net/ 53 Spectronaut — https://biognosys.com/spectronaut 54 See Section 1.3.3.5. 37 biological and technological context of this thesis their identification [X. Wang et al., 2019a; Codrea et al., 2016]. Identification robustness increases with the number of spectra associated with each peptide (or protein). Matching spectra to peptide sequences In MS/MS experiments, various fragments (product ions) may be produced through the cleavage of covalent bonds or internal rearrangements (e.g. loss of carbonyl group, a water or ammoniummolecule) [Macias et al., 2020; Wysocki et al., 2005; Yagüe et al., 2003; Bythell et al., 2010]. While different protocols and parameters can produce similar fragments, the peptide MS/MS profiles (spectra) recorded by a given instrument with different fragmentation modes are distinct. Only fragments carrying at least one charge can be detected and measured. The product ions’ nature and relative abundances depend on the initial peptide sequence, and the chosen fragmentation method⁵⁵ and energy of the fragmentation event [Révész et al., 2021]. Different fragmentation methods may produce different sets of diagnostic fragments, and hence it can be beneficial to use different methods in combination [Révész et al., 2021; Dupree et al., 2020; Tu, J. Li, Shen, et al., 2016; Diedrich et al., 2013]. The two most popular fragmentation approaches are collision-induced dissociations (CID or HCD) and produce many fragments that result from the peptide backbone fragmentation, of which peptide bonds (see Section 1.1) represent the predominant breakage pathway as they have the lowest energy [Dupree et al., 2020; Medzihradszky et al., 2015]. The Roepstorff–Fohlman–Biemann nomenclature [Roepstorff et al., 1984; Biemann, 1988], presented in Figure 1.11, has been widely accepted to designate the product ions generated by the backbone fragmentation [Medzihradszky et al., 2015]. When a peptide bond (in purple in Figure 1.11) break occurs, the precursor ion generates two complementary fragments. The products are named as: • b-ions, for the fragments comprising the amino (or N-terminus) end of the precursor’s sequence, • y-ions, for the ones comprising the carboxyl (or C-terminus) end. Before possible internal rearrangements⁵⁶, the sum of the mass of complementary ions equals the molecular mass of their precursor ion. Other fragment types also exist, and they assist with a better peptide characterisation⁵⁷ [Noor et al., 2020; Steen et al., 2004; Wysocki et al., 2005]. For instance, one may use EThcD to generate more informative fragments to study peptides’ phosphorylation or ADP-ribosylation [Penkert, Yates, et al., 2017; Penkert, Hauser, et al., 2019; Bilan et al., 2017]. Furthermore, other protocols fragment the side chain (R group — 55 See Section 1.3.3.4. 56 e.g., water loss for fragments with Ser, Thr, Glu, or Asp [Medzihradszky et al., 2015] 57 i.e. peptide identification 38 1.3 proteome exploration with mass spectrometry see Appendix A.1) and produce satellite ions (d-, v- and w-ions) that can help to differentiate between isomers (e.g. Leu and Ile) [Han et al., 2007; R. S. Johnson et al., 1987]. However, the PTMs’ characterisation remains challenging [Dupree et al., 2020]. The predictability of peptide fragmentation⁵⁸ [Dupree et al., 2020; Medzihradszky et al., 2015] and the aas mass differences (see Table A.1) enable (at least partially) the peptide’s sequence resolution. Overlaps of partial sequences obtained from different ion types allow stitching them together in the peptide sequence. Figure 1.11. The Roepstorff–Fohlman–Biemann nomenclature [Roepstorff et al., 1984; Biemann, 1988] unambiguously designates the different ions generated from the peptide backbone fragmentation. Here is an illustration for a peptide formed by eight aas (one green box represents one aa or its residue — see Table A.1). The most common breakage occurring by collision is the cleavage of the peptide bond (amide bond, in purple), which can produce a y-ion (when the C-terminus part of the precursor ion has a charge) or a b-ion (the N-terminus part is charged). The peptide backbone fragmentation can generate other series of ion types, e.g. ETD and ECD can produce c- and z-ions (cleavage of the N-C𝛼 bonds) [Han et al., 2007]. Like b-ions, a- and c-ions are produced when the N-terminal part of the peptide holds a charge, while, like the y-ions, x- and z-ions are created when the C-terminal part remains charged. An index notation distinguishes the ions from the same series: the index is the number of residues comprised (even partially) by the ions. When considering a couple of complementary ions, e.g., b2-ion and y6-ion, the indices sum equals to the residues number of the precursor ion; in other terms, if 𝑛 is the number of aas in the peptide, b𝑚-ion and y(𝑛−𝑚)-ion are complementary. Figure 1.12 shows an MS/MS spectrum, also called fragmentation spectrum or MS2 spectrum. When considering a series of either b- or y-ions, the mass difference between two (single charged) consecutive ions corresponds to the mass of the residue at the end⁵⁹ of the lengthiest (hence the heaviest) ion of the pair. Although based on a real example, Figure 1.12 shows a simplified MS/MS spectrum to ease its interpretation. In practice, assigning peaks is more challenging, including assigning to one series over another one. The lower the mass accuracy, the more difficult it is to discriminate between possible isobaric combinations and different ion types [Medzihradszky et al., 2015]. 58 Both in the ions’ nature and their relative intensities [Frank, 2009]. 59 either at the N-terminal end for a-,b- or c-ions or at the C-terminal end for the x-, y- and z-ions 39 biological and technological context of this thesis (a) Simplified representation of an MS/MS spectrum (or MS2 spectrum) for the peptide IYEVEGMR — adapted from Sadygov et al. (2004) (b) Peptide sequence solved from the MS/MS spectrum. Figure 1.12. MS/MS spectrum and peptide identification. In (a), two consecutive ions of the same series have a mass difference that corresponds to the mass of one aa contained in one ion but missing in the other. The supplementary aa is found at the C-terminus of the heaviest ion. The respective EVEGM and YEVEG partial sequences can directly be solved from the b- and y-ion series. By combining the MS/MS spectrum presented in (a) and the precursor molecular mass given by the MS1 spectrum (997.16 Da — highlighted in red in (b)), one can deduce the remaining peptide sequence. Mass ions directly extracted from (a) are in orange for the b-ions and in blue for y- ions. Amino acids with supporting evidence in (a) are highlighted in (b) in orange when backed by b-ions and in blue by y-ions. The sum of the masses of complementary b- and y-ions equals the precursor’s molecular mass. The complementary ion of y7 has a mass of 114.16 Da, which corresponds to Leu’s or Ile’s acylium ion. Leu and Ile share the same mass and are undistinguishable from their b- or y-ions only. It may happen in the literature that a place holder is used to symbolise either of them, e.g. J, which refers to none of the standard aas. At the C-terminal end of the sequence, the complementary ion to b7 has a mass of 175.20 Da that corresponds to Arg or the couple Val/Gly, an isobaric combination to Arg. However, the precursor peptide has been produced by trypsin digestion that specifically cleaves at the carboxyl side of Arg and Lys (when not followed by a Pro). Thus, it is more likely that the remaining sequence comprises Arg only. Theoretically, peptide sequences can be solved by collecting and identifying all the ions from the same series and correlating the mass differences between them with the residue masses of the amino acids — including any possible PTM or other modification. However, even when excluding Leu’s and Ile’s ambiguity, a direct resolution is hard to achieve in practice as series overlap and ions may not be detected or recognised (e.g. Ser and Glu can lose a water molecule through internal rearrangement [Medzihradszky et al., 2015]). 40 1.3 proteome exploration with mass spectrometry A complete ion series of a fragmented peptide is ideally required, but not all peptide bonds may break or be intense enough to be detected. In addition, water and ammonium losses and other modifications can shift the mass of the residues. However, there are also many rules of thumb (see Steen et al. (2004)) providing quick guidance about reasonable sequences. Furthermore, possible immonium ions (+H2N = CHR) may help determine the aas composition of the precursor peptide, and past studies have highlighted many fragmentation rules that contribute to removing ambiguities. See Medzihradszky et al. (2015) for a more detailed discussion on fragmentation rules and MS/MS spectra interpretation refinements. As modern LC-MS/MS experiments can produce tens of thousands of MS/MS spectra [O’Bryon et al., 2020], manual approaches are neglected in favour of algorithms to automate the peptide spectra matching. There are three main possibilities for protein/peptide spectra matching: the sequence- based searching approach, the spectral library searching one and the de novo sequencing: • The sequence-based approach matches the experimental spectra to theoretical ones. These theoretical spectra are created by adequate in silico digestion and fragmentation of proteins found in protein sequence databases such as UniProtKB/Swiss-Prot from UniProt⁶⁰ [The UniProt Consortium, 2017], which is manually annotated and reviewed, UniProtKB/TrEMBL (automatically annotated, unreviewed), or from genomic sequence databases (e.g. NCBI ⁶¹) after in silico translation of the sequences. This type of matching requires database-dependent search engines, e.g. Mascot⁶² [Perkins et al., 1999], Sequest [Eng, McCormack, et al., 1994; Tabb, 2015], Andromeda [Cox, Neuhauser, et al., 2011] from MaxQuant, X!Tandem [MacLean et al., 2006], MS-GF+ [S. Kim et al., 2014], MS Amanda [Dorfer et al., 2014], Open-pFind [Chi et al., 2018], MSFragger [Kong et al., 2017], Morpheus [Wenger et al., 2013] or Pulsar included in Spectronaut®. • The spectral library based approach matches the experimental spectra to other previously recorded experimental spectra. This approach relies on spectral matching engines, e.g. HMMatch [X. Wu et al., 2007], SpectraST [Lam et al., 2008] or BiblioSpec [Frewen et al., 2007] from Skyline, M-Split [J. Wang et al., 2010], Pepitome [Dasari et al., 2012], QuickMod [Ahrné, Nikitin, et al., 2011], pMatch [D. Ye et al., 2010], COSS [Shiferaw et al., 2020] or ANN-SoLo [Bittremieux et al., 2018]. • The de novo sequencing approach consists of comparing the observed mass data to the theoretical mass of every possible peptide sequence. This method is the closest to the manual approach described above. For example, see PepNovo⁶³ [Frank, 2009], PEAKS⁶⁴ [B. Ma et al., 2003], pNovo 3⁶⁵ [H. Yang et al., 2019], DeepNovo [N. H. Tran et 60 UniProt — https://www.uniprot.org/ 61 NCBI — https://www.ncbi.nlm.nih.gov/protein/ 62 Mascot — http://www.matrixscience.com/ 63 PepNovo — http://proteomics.ucsd.edu/Software/PepNovo/ 64 PEAKS — https://www.bioinfor.com/peaks-studio/ 65 pNovo 3 — http://pfind.ict.ac.cn/software/pNovo/ 41 biological and technological context of this thesis al., 2017],MetaSPS⁶⁶ [Guthals et al., 2013],Novor⁶⁷ [B.Ma, 2015], Lutefisk [Taylor et al., 1997], NovoHMM [Fischer et al., 2005], ANTILOPE from OpenMS, Twister [Vyatkina et al., 2015] for topdown data, UniNovo⁶⁸ [Jeong et al., 2013], UVNovo⁶⁹ [Robotham et al., 2016]. The first two methods are favoured over de novo peptide sequencing, as the latter is very cumbersome and time-consuming [Codrea et al., 2016] and requires high quality data for best performance [Muth et al., 2018]. Note that there are also hybrid approaches based on de novo sequencing and database matching, such as GutenTag [Tabb, Saraf, et al., 2003], InsPecT [Tanner et al., 2005], DirecTag [Tabb, Z.-Q. Ma, et al., 2008], ByOnic [Bern et al., 2012], or PEAKS DB [J. Zhang et al., 2012]. Many search engines exist for matching MS/MS spectra, see Noor et al. (2020), C. Chen et al. (2020), Griss (2016), Shteynberg, A. I. Nesvizhskii, et al. (2013), and Eng, Searle, et al. (2011) for reviews. Furthermore, combining the results of several search algorithms when possible improves the final outcomes [Noor et al., 2020; C. Chen et al., 2020; Sadygov et al., 2004; Eng, Searle, et al., 2011; Shteynberg, A. I. Nesvizhskii, et al., 2013; Griss, 2016]. Sequence based algorithms are by far the most popular and many (e.g. SEQUEST and Mascot) rely on the same information and parameters choices to match (and score) the spectra to peptide sequences. • Fragmentation mode • Digestion enzyme and the number of possible omitted cleavages • Mass tolerance for𝑚/𝑧 ratio of peptidic and fragmented ions. • Possible charges for the peptidic and fragmented ions. • Knowledge database (the more complete, accurate and adequate a database is the better and more robust is the peptide and protein identification). Scoring functions While algorithms assign MS/MS spectra to peptide sequences, they simultaneously compute a corresponding score for each peptide-spectrum match (PSM). This score summarises the quality of the assignment and allows the selection of the best candidate-peptide for each spectrum. Only the matches with the best PSM scores (if not the very best one only) are reported and used in the next steps of the quantification. Sadygov et al. (2004) classify the scoring functions in four classes: descriptive (e.g. SEQUEST ⁷⁰ [Eng, McCormack, et al., 1994]) interpretative, stochastic and probability-based modelling (e.g. Mascot). While many of these functions return statistical scores (e.g. Mascot and SEQUEST ), many other return non-statistical ones. For 66 MetaSPS — http://proteomics.ucsd.edu/software-tools/metasps/ 67 Novor — rapidnovor.com 68 UniNovo — http://proteomics.ucsd.edu/Software/UniNovo 69 UVNovo — https://github.com/marcottelab/UVnovo 70 SEQUEST — http://fields.scripps.edu/yates/wp/ 42 1.3 proteome exploration with mass spectrometry these latter ones, tools such as Percolator⁷¹ [Käll, Canterbury, et al., 2007; Spivak, Weston, Bottou, et al., 2009] or PeptideProphet⁷² [Keller et al., 2002; K. Ma et al., 2012] allow transforming their scores into probabilities, and ease the application of threshold to remove unreliable matches [J. S. Cottrell, 2011]. Regardless of whether the score is either statistical (including probability-based) or not, searching the database to match spectra to peptide sequence is a statistical process. Most MS/MS spectra only partially cover a peptide sequence, which leads to many ambiguities. With the development of high-throughput MS-based proteomics, researchers dropped manual interpretation and validation and moved towards empirical score thresholds [Brosh, 2009]. Thresholds are a compromise between sensitivity and accuracy (or error rate), i.e. true positive (or correct) identifications proportion versus false positive (or incorrect) identifications proportion. A high threshold for the scores reduces the error rate, but decreases sensitivity. A low threshold accepts more PSMs, but also more incorrect matches. Peptide validation Today, among the many methods that validate the peptide assignment and estimate its error rate, the gold standard is the target-decoy search approach (TDA) [Perkins et al., 1999; Elias et al., 2007; Savitski et al., 2015]. Many search engines compute a p-value (see Appendix A.9.2) for each of the PSMs, but it is insufficient to determine if multiple PSMs are true matches. A low p-value PSM has a low probability of being incorrect. Because of the large number of PSMs per experiment, statistically, a portion of these are incorrect. Thus, other statistical measures adjusting for multi-testing (see Appendix A.9.3) are required. Corrections (such as Bonferroni [Shaffer, 1995]) can be applied though they are stringent and discard many correct PSMs. False discovery rate (FDR) [Benjamini and Hochberg (1995)] estimates the proportion of incorrect PSMs among all accepted PSMs [A. I. Nesvizhskii, 2010; Aggarwal et al., 2015], see Equation (FDR). Different approaches exist to estimate it. The most favoured is TDA as it is non-parametric, easy to apply, and it also has the advantage of working with search engines that have non-statistical scores. Instead of figuring out which the correct and incorrect PSMs are, the target decoy search approach (TDA) aims to estimate the overall FDR associated with a specific collection of PSMs. In turn, this enables assessing the likelihood of each of the PSMs in the collection [Elias et al., 2007] (see following segments on q-values and PEP). The critical elements of this method are the creation of a decoy database with sequences that are incorrect but similar (while non-overlapping) to the target (i.e. true) ones. Consequently, any PSM found for a decoy sequence is by definition spurious. The known proportion of decoy versus target sequences in the search space allows the computation of the FDR [Elias et al., 2007; Elias et al., 2010] as shown in Equation (FDR). In Appendix A.10, I briefly discuss the decoy database. 71 Percolator — http://percolator.ms/ 72 PeptideProphet — http://peptideprophet.sourceforge.net/ 43 biological and technological context of this thesis FDR = Number of accepted PSMsdecoy Number of all accepted PSMs = Number of accepted PSMsdecoy Number of accepted PSMsdecoy +Number of accepted PSMstarget (FDR) Most of the search engines return multiple scores. Hence, defining a proper threshold for each of them can be challenging or cumbersome. A possible solution is Percolator, which trains a support vector machine (SVM) [Boser et al., 1992] to distinguish the correct and incorrect PSMs [Käll, Canterbury, et al., 2007]. This machine learning based algorithm has several advantages. It can exploit many scores and other specific data features to automatically determine the best threshold without overfitting to a particular collection of PSMs. Thus, comparisons of results between studies and laboratories are facilitated. Moreover, many studies have reported that the use of Percolator improves the results in terms of both accuracy and sensitivity and increases the overall number of identified peptides [Granholm et al., 2014; Xu et al., 2013; Tu, Sheng, et al., 2015; The, MacCoss, et al., 2016; Wright, Collins, et al., 2012]. Percolator can either be used directly on the collection of target and decoy PSMs or as a post-processing step. Protein inferring algorithms use two other significance measures for PSM validation: q- values and PEP, which are described in Appendix A.11. Anti-conservativeConservative adjusted p-value (eg. Bonferroni correction) PEP FDR, q-value (Unadjusted) p-value. . . . optimal for most studies Figure 1.13. Methods for assigning statistical significance to a collection of PSMs — Adapted from [Käll, Storey, et al., 2008]. Methods based on PEP instead of q-values are more conservative, as PEPs are always greater than q-values [Käll, Storey, et al., 2008], see Figure 1.13. For an individual validation, e.g. checking the presence of a specific peptide in a particular condition, PEPs are more indicative. On the other hand, q-values are favoured where the whole collection of PSMs is considered, e.g. for an overview of the proteome landscape. [Käll, Storey, et al., 2008] 1.3.4.3 Protein inference While identified as the key problem more than fifteen years ago, protein inference (i.e. identification) remains the main challenging issue in shotgun proteomics [A. I. Nesvizhskii 44 1.3 proteome exploration with mass spectrometry and Aebersold, 2005; He et al., 2016]. Note that in the literature, protein inference often also encompasses the peptide identification as an intermediate step and the results validation as they influence the protein identification quality. Protein inference consists in assembling peptides into sets of reliable proteins. Most assembly algorithms model the relationship between identified peptides and protein sequences as a bipartite graph [T. Huang et al., 2012], illustrated in Figure 1.14. Proteins identified by two or more (unique) peptides represent the best case: their identification is reliable and computationally lighter. However, the existence of ‘degenerate peptides’ and ‘one-hit wonder’ proteins [T. Huang et al., 2012; He et al., 2016] makes this task challenging and computationally intensive. Degenerate peptides are ambiguous peptides that are shared by multiple protein sequence definitions. They create a computational challenge because it is difficult to resolve from which proteins they are derived and to select which of the two following options is true: either all the related proteins are expressed in the sample or only some of them. Some workflows cluster together proteins with homologous sequences to ease the process. More often, to lessen the computational and interpretation burden, these ambiguous peptides are discarded, and the inference relies solely on ‘unique peptides’, i.e. peptides that are attributable to one protein sequence only. ‘One-hit wonders’ are proteins that are identified by a single peptide only. They require careful handling regardless of their peptide uniqueness status, since if the peptide identification is a false positive (i.e. an artefact), then the protein is also a false positive. Shorter proteins are generally harder to identify (and quantify) for this reason. As shown in Appendix A.12, the peptide assembly can be formulated as a set covering problem [Cormen et al., 2009; Hochbaum, 1997]. This problem is known to be NP-complete [van Leeuwen et al., 1990], and for which it is in practice impossible to calculate an optimal solution. Inference algorithms seek a compromise between the minimal and exhaustive lists of possible proteins. Usually, algorithms approximate this solution through a parsimonious approach (see the below example based on Maxquant). Regardless of the approach, all algorithms involve a bipartite graph (see Figure 1.14), even if they may also include other supplementary information to build their models (see Figure A.5, p. 195). Figure 1.14 shows how degenerate peptides, one-hit wonder proteins and the validation quality of the peptide identification complicate the inference. Note that if an edge connects a peptide 𝑖 and a protein 𝑗, the peptide 𝑖 is said to be covered by the protein 𝑗 [T. Huang et al., 2012]. Protein 1 and protein 2 share peptide 1 and peptide 2. As both protein 1 and protein 2 are also covering another peptide (peptide 3 for protein 1 and peptide 4 for protein 2), it seems 45 biological and technological context of this thesis 1 2 3 n 1 32 4 5 6 m7 Spectra Peptides Proteins4 Bipartite graph Supplementary information model (one example)PSM Figure 1.14. Protein inference: the bipartite graph. In order to infer proteins, algorithms attribute each validated peptide to possible proteins of origins. Peptides 1 and 2 are both included in the definition of proteins 1 and 2; it is impossible to determine if both proteins are present in the sample or not based on these two peptides only. Peptide 3 backs the existence of protein 1 and Peptide 4 backs the existence of protein 2; if either peptide 3 or peptide 4 is missed in detection, it is easy to conclude that only one of proteins 1 and 2 is only present. On the other hand, protein 3 is only identified by peptide 5. If the latter is an artefact, then protein 3 is also an artefact. In order to achieve the inference, peptide assembly algorithms can rely on the bipartite graph as sole input (solid blue box), or they can include other data types as well, e.g. the score associated to each PSM (dashed blue circle) or the raw spectra itself (solid teal box). reasonable to assume that both proteins are expressedwithoutmore information. If peptide 4 is actually a false positive, would it mean then that only protein 1 is expressed? Now, what happens if peptide 4 is hard to detect and is missing from the validate list of peptides? Peptide 5 is the only one to identify protein 3 while peptide 6 and peptide 7 are both backing the existence of protein 4. If one of the two latter peptides is an artefact, protein 4 is still most likely a true positive. However, if peptide 5 is a false positive, so is protein 3. Many different approaches, algorithms and their related search engines for protein inference have been reviewed in the literature [T. Huang et al., 2012; Serang and Noble, 2012]. Despite the partial or lack of overlap between sets of confidently identified peptides, combining several search engines to infer the proteins has proven to yield better results than a single search [Searle et al., 2008]. At worst, it improves the confidence of the identification as more peptides are characterised per protein [T. Huang et al., 2012; Audain et al., 2017]. One possible approach for protein inference is a parsimonious approach, as implemented for example by Maxquant [Tyanova, Temu, and Cox, 2016; Cox and Mann, 2008]. 46 1.3 proteome exploration with mass spectrometry After identifying all proteins covering a given peptide, Maxquant joins the proteins with the same set or subset of peptides in the same protein group. In other words, if the peptides set 𝑆𝑎, defining a protein 𝑃𝑎, is equal to or strictly included in the peptides set 𝑆𝑏, defining the protein 𝑃𝑏, then 𝑃𝑎 and 𝑃𝑏 are joined in the same group 𝐺1. Then, in each group, the proteins are ordered by the decreasing number of peptides they cover. Hence, the protein sequence at the top of the group can explain all the group’s peptides. Maxquant refers to a peptide as ‘unique’ when found in only one group of proteins in contrast to degenerated peptides shared between two distinct protein groups that cannot be combined because their other peptides are unique to each group. Shared peptides are called ‘razor’ (referencing Occam’s razor) in the protein group with the highest number of peptides since it is considered the simplest explanation. The user can choose which peptides are included in the quantification: unique peptides only, both unique and razor peptides (default option⁷³) or all the peptides (of each group). The following steps enable results filtering based on the protein groups PEPs (a protein group PEP equals the multiplication of its peptide PEPs), the spectra quality, the unique peptides number and the FDR threshold. Many tools implement other approaches. DTASelect [Tabb, McDonald, et al., 2002] sorts peptides by their identified locus (i.e. gene or protein identifier) and then by sequences. Next, DTASelect filters the results based on various criteria, including the spectra quality and user’s inputs, to finally keep proteins supported by enough different peptides or by at least one peptide identified several times. The algorithm adopts an optimistic approach instead of a parsimonious one and only groups together proteins that have a strict identical sequence coverage. ProteinProphet [A. I. Nesvizhskii, Keller, et al., 2003] (part of the Trans-Proteomics Pipeline) is one of the most widely used methods in the literature [Sikdar et al., 2016]. The tool computes in each sample the presence probability of a protein by combining the probabilities of its different identified peptides through an iterative process. An EM algorithm derives a mixture model of correct and incorrect peptide identifications from the observed data⁷⁴. The following steps can summarise the inference⁷⁵. ProteinProphet keeps the best spectrum matching any given assigned peptide. It retrieves all proteins that cover any identified peptide. The tool groups the peptides by proteins and computes the protein’s presence probability based on the available peptide evidence. It readjusts the peptides’ negative and positive distributions based on their sibling counts: proteins that have more than one assigned peptide are rewarded, while ‘one-hit wonders’ are penalised. ProteinProphet apportions the degenerate peptides across all their covering proteins before using a parsimonious approach to readjust their distributions through an EM algorithm. As a starting point, the sum of all the peptides’ weights of one protein 73 Considered by the authors as the best compromise between most accurate protein quantification and unequivocal peptide assignment [Cox and Mann, 2008]. 74 including the PSM results, their associated score or the properties of the peptide matched to the spectrum 75 Based on A. Nesvizhskii (2006) and Serang and Noble (2012) 47 biological and technological context of this thesis equals one. Through successive iterations, these weights are refined such that proteins with a higher number of peptides are rewarded. All redundant and indistinguishable protein entries are finally collapsed together. Themodel learns directly from the observed data, which increases its robustness [T. Huang et al., 2012]. However, one identified issue with this award/penalty system is that it creates cases where proteins covering many low-scoring peptides can outrank proteins covering a smaller set of higher-scoring peptides [Serang andNoble, 2012]. As the size of the dataset to study grows, the problem worsens. A complimentary tool in the Trans-Proteomics Pipeline developed by the authors, iProphet [Shteynberg, Deutsch, et al., 2011], addresses this issue by considering other information levels and refining the computation of the posterior probabilities and protein FDR estimates. Percolator⁷⁶ [Käll, Canterbury, et al., 2007; The, MacCoss, et al., 2016; Halloran et al., 2019] (part of both The OpenMS proteomics pipeline (TOPP) and theCrux suite⁷⁷) is based on SVMs. It implements a generalised semi-supervised⁷⁸ learning approach distinguishing between target and decoy⁷⁹ PSMs for any shotgun dataset. First, a classifier is trained with a subset of the data, i.e. the algorithm knows the data labels. Second, after ranking targets and decoys according to a selection of features, the algorithm selects the target PSMs with a 1% FDR and then trains an SVM to discriminate between the kept targets and the full set of decoy PSMs, and induces a new ranking. These steps iterate until convergence of the ranking (i.e. the ranking remains the same from one iteration to the next). Fido⁸⁰ [Serang, MacCoss, et al., 2010] (implemented first as a standalone tool, but now distributed as a part of Percolator) is based on Bayesian inference (see Appendix A.13). From a set of simple assumptions, the authors have developed a Bayesian model. This model estimates three parameters directly from the data: the probability of a present protein to generate peptides, the error probability of peptides to be detected from noise and the proteins’ a priori probabilities to be present in a sample. These parameters are respectively annotated 𝛼, 𝛽 and 𝛾. Fido explicitly allows high ranking spurious PSMs. It uses their presence likelihood⁸¹ and rewards proteins which include strong independent supporting evidence besides their degenerate peptides. Fido automatically apportions information from degenerate peptides while ensuring that each protein presence likelihood is unrestricted to their degenerate peptides. Compared to the initial approach of ProteinProphet, this method’s accuracy is resistant to the dataset size. A related method to Fido is EPIPHANY [Pfeuffer, Sachsenberg, Dijkstra, et al., 2020]. 76 Percolator — http://percolator.ms 77 Crux suite — http://crux.ms/ 78 semi-supervised since only decoy PSMs are labelled as ‘incorrect’ while target PSMs are an unlabelled mixture of correct and incorrect PSMs. [Halloran et al., 2019] 79 shuffled or reverse peptide sequences 80 Fido — https://noble.gs.washington.edu/proj/fido/ 81 By converting the PSMs prior probability estimates back into discriminant score-based likelihoods when needed 48 1.3 proteome exploration with mass spectrometry Another inference tool from Percolator’s authors is Barista [Spivak, Weston, Tomazela, et al., 2012], which is also part of the Crux suite. Barista combines the verification of the PSMs and the protein inference where Percolator handles only the second task. The authors advocate for a topdown approach to optimise the protein inference problem instead of subdividing the workflow into independent modules. Their tool builds a tripartite graph (spectra-peptide-protein) based on the results of a database search (target and decoy sequences) and a protein database. A learning algorithm infers the protein presence likelihood in each sample. This algorithm involves a similar set of features to Percolator. Barista iteratively refines the peptide identification ranks to better discriminate between correct and incorrect identification. In contrast to Percolator, the PSMs are not filtered in Barista and contribute to the results optimisations. The authors themselves state that assessing which approach is the best in practice can require more study [McIlwain et al., 2014]. Many other inference algorithms exist, including PIA [Uszkoreit et al., 2015], MIPGEM [Gerster et al., 2010], PAnalyzer [Prieto et al., 2012], DBParser [X. Yang et al., 2004] and PANORAMICS [Feng et al., 2007]. The field is steadily improving and adapting in response to the development of the measurement instruments and acquisition protocols (e.g. IPF [Rosenberger et al., 2017] for DIA data), preparation protocols (e.g. multiple proteolytic digestions in parallel, see [Miller et al., 2019]), increase of computing resources (for example Percolator optimisations [Halloran et al., 2019]), optimisation or implementation of new mathematical approaches and concepts (e.g. DeepPep [M. Kim et al., 2017] based on a deep-convolutional neural network framework or gpGrouper [Saltzman et al., 2018] and MIPGEM [Gerster et al., 2010] that implement a gene centric approach). Furthermore, the rise of metaproteomic, proteogenomic, metabolomic and multiomic studies has brought new perspectives and challenges [Gonnelli et al., 2015; Starr et al., 2018; Rechenberger et al., 2019; Menschaert et al., 2017; Liebal et al., 2020]. The, Edfors, et al. (2018) have designed an experiment simulating homologous proteins that can evaluatemost protein inference algorithms. They observe better concordance between inferred proteins and the ground truth when excluding degenerate peptides rather than adopting a parsimonious approach. However, they report that results are different when considering protein groups instead of individually. 1.3.4.4 Protein quantification (label-free) Label-free quantification methods will probably be more refined in the future. Over the last decade, along with the instruments’ improvement and the increasing number of label- free bottom-up proteomics, there have already been many developments in quantification methods and tools. Blein-Nicolas et al. (2016), Y. Chen et al. (2016), and Lindemann et al. (2017) review many of the currently available ones. Note that methods created for label-free relative quantification experiment designs can be adapted to methods that allow 49 biological and technological context of this thesis absolute quantification [Pappireddi et al., 2019; Sinitcyn et al., 2018; Y. Chen et al., 2016]. Many normalisation methods have also been reviewed by Välikangas et al. (2018b). As mentioned above, label-free quantification methods are either based on the number of identified peptides or MS2 (i.e. MS/MS) spectra matched to a protein or on the precursors’ ion extracted current peak intensities (i.e. XIC) from the MS1 spectra. Spectral/peptide counting is a widespread method in the literature. H. Liu, Sadygov, et al. (2004) present a linear correlation over two orders of magnitude between the relative protein abundance and the acquired spectra number. Spectral counting is simple but requires proper normalisation. Many other methods (e.g. APEX [Braisted et al., 2008], emPAI [Ishihama et al., 2005] or MAI/PLGEM-STN/SC [H. Y. Lee et al., 2019]) are derived from it. Cozzolino et al. (2020) have realised a comparative study between their method and two other spectral counting ones. There are also peak intensity based methods such as intensity based absolute quantification (IBAQ), which are the most favoured today. Arike et al. (2012) report that this latter method is better than the previous ones based on spectral counting as the estimated quantification correlates better to the absolute abundance. Ahrné, Molzahn, et al. (2013) refine this statement as IBAQ has biases and quantification errors that make it unfit for direct use for a proportional assessment of the complete protein set; IBAQ dramatically underestimates low-abundant proteins. Instead, to improve the general quantification estimation, the authors advocate a Top3 approach [Silva et al., 2006], which represents the abundance of each protein by the average or sum intensity of its three best ionised (unique) peptides. Top3 allows better quantification of the proteome landscape in general, even if it is less accurate for the shortest proteins and saturates for the largest ones. MS1-based quantifications have been reviewed by X. Wang et al. (2019b). Many normalisations include the protein sequences length (longer proteins are more likely to produce a greater number of sampled peptides) and the sum of ion intensity or the spectra number (for spectral count methods) that acquired for each protein within a sample [Blein-Nicolas et al., 2016]. Similar approaches to RNA-Seq can be used to allow protein comparison across different samples. Most analyses benefit from validating the inferred proteins. Unfortunately, there is a lack of consensual methodology to determine a protein FDR and strong opinion divergences on its conceptual validity in the field [Savitski et al., 2015]. A few are even questioning its validity more generally [J. Cottrell, 2013]. Beyond the differences due to frequentist and Bayesian definitions and approaches, part of the pointed out discrepancies [The, Tasnim, et al., 2016] is caused by overlooking that protein inference tools are simultaneously using two different null hypotheses (ℋ0 — see Appendix A.9.1). Two commonℋ0 statements testing the validation of a protein identification are: ℋ′0 The best scoring peptide is incorrectly matched to the protein. (Often, the protein FDR is derived from its best scoring peptide.) 50 1.4 possible downstream analyses for expression data ℋ″0 The protein is absent from the sample. Nonetheless, the most widespread method is based on the TDA and the protein FDR is computed with Equation (FDR) (p. 44). Savitski et al. (2015) demonstrate that the classic TDA largely overestimates the FDR for large datasets. To overcome this, Savitski et al. (2015) propose a new ‘picked’ TDA. This approach pairs together target and decoy sequences of each protein instead of treating them individually. For each pair, the target’s and the decoy’s protein scores are compared, and the highest is kept and the other one discarded. The, Tasnim, et al. (2016) encourage the use of the ‘picked’ TDA when one’s analysis is based on ℋ′0, but recommend to keep the classical TDA for ℋ″0 as there is a lack of a better method to date (and ‘picked’ TDA actually underestimates FDR in that case). Several benchmarking studies [Välikangas et al., 2018a; Al Shweiki et al., 2017; Bubis et al., 2017; Navarro et al., 2016] have been published recently that may help choose between the many existing quantification methods and the tools implementing them. 1.4 possible downstream analyses for expression data Over-representation analyses (ORAs) are a standard final stage analysis on expression data. They can provide biological insights and hint about mechanisms. ORAs highlight sets of gene (or protein or metabolite) categories that are overrepresented in a selected subset of the data compared to the expectation of a random category. Many expression studies aim to produce ‘a list of “interesting” biomolecules’ [Tipney et al., 2010], ORAs help to determine the most pertinent ones by providing biological context and increasing statistical power. Besides, such lists are often substantial so extracting common functional information from subsets of genes/proteins eases the interpretation of the data. Various tools and algorithms have been developed to overcome the daunting task of individually checking the genes/proteins. See Shi Jing et al. (2015) for some examples. While, the field has been reviewed a few times (see Khatri and Drăghici (2005), D. W. Huang et al. (2009), and Khatri, Sirota, et al. (2012)), it is still unfortunately lacking a gold standard and systematic comparative studies [Mathur et al., 2018]. Depending on the experimental design and study, the list of biomolecules can be associated with a rank or another form of a score. As differential expression analyses (DEAs) are the most widespread type of analyses for expression data, many tools and algorithms work solely with the corresponding outputs. Hence, those are unfit for other types of studies (such as the ones in this thesis). Available gene set enrichment analysis (GSEA) [Subramanian et al., 2005] tools are commonly inadequate for any other purpose than DEA studies. See Tamayo et al. (2012) and Irizarry, C. Wang, et al. (2009) and the included references for some examples of GSEA tools. 51 biological and technological context of this thesis On the other hand, while still devised for DEA, other tools handle any rank or score and thus are more flexible; they can analyse data outside of their original scope. Many gene ontology (GO) analysis tools fall into this latter category. 1.4.1 GO analysis (GOA) The GO is a collaborative and curated classification that describes the gene products following three hierarchically structured and controlled vocabularies (also known as ontologies): either based on the biological processes (‘BP’) to which they contribute, their position (when active) in the cell (cellular component (‘CC’)) or their biochemical activity, i.e. molecular function (‘MF’). [Ashburner et al., 2000] Generally, GO enrichment analyses (GOAs) compare a selected list (with the genes/proteins of interest) to a background list (e.g. all observed genes in the experiment or all existing genes in the annotation). For each GO term, this method computes the enrichment of the selected set based on the real fraction of the set for the considered GO term and its likelihood (computed on the background list). For example, if a GO term 𝒜 is associated with 0.1% of all the background (list) genes, but then over 70% of the genes from the selected list are associated with GO term 𝒜, one can safely accept that the selected genes list is enriched for this term. Various tools rely on different statistical tests to determine if the enrichments are significant. Investigating only a few GO terms of interest is common. While ranked lists are unnecessary for GOAs, one can apply a cut-off before running this type of analysis. However, in those cases, GSEA will be generally favoured to a GOA. 1.5 reproducibility and experimental design Science develops based on reproducible facts, as they help with drawing relevant and accurate conclusions. To increase reproducibility, it is essential that observations and measurements be (as much as possible) unbiased towards any parameter outside of the study focus. To this end, one of the critical issues that need to be tightly monitored is the presence of batch effects. Including adequately designed replicates in the study is the most effective way to control for the unwanted batch effects. Another issue is due to high-throughput transcriptomics and proteomics still being evolving fields. For each new identified problem, researchers create new tools and algorithms. However, these are often aimed at one study only and their reuse in another study can be difficult, or even impossible (e.g. discontinued proprietary software). On the other hand, well established tools allow fine tuning many parameters, the impact of which can be overlooked while the reporting. Thus, it is unsurprising that result 52 1.5 reproducibility and experimental design agreement between different tools is unsatisfactory [Conesa et al., 2016]. 1.5.1 Batch effects Batch effects are artefactual and due to all the variables that the investigator can not control (either by lack of technology, design or knowledge), for example, environmental conditions, reagent or sample lots, genetic population background or experimenters. They are often the source of complication for many studies (including high-throughput genomic ones) [Leek et al., 2010]. The danger lies in overlooking them and then confusing these artefacts with biological results, which will lead to flawed interpretation and conclusions. Several studies have been refuted in the past because of unaccounted batch effects. Usually, issues in the initial results are detected by others laboratories, which notice high correlations between the ‘biological’ findings of the original study and the running dates or processing groups. Thus, questions about the biological validity arise. The first step to address them is through well-designed experiments [Leek et al., 2010], which include technical and biological replicates. These replicates are usually created and randomly processed as to avoid creating any artefactual link between them. A replicate is a set of measurements done in the same condition. Correction can be applied and may help resolve batch effects in some cases. For examples, see Oytam et al. (2016), Gagnon-Bartsch et al. (2012), and Peixoto et al. (2015). 1.5.2 Technical replicates Technical replicates are initially from the same sample, which has been tested multiple times through a given experimental protocol. It allows testing for the variability of the protocol itself. While using the same sample to test different protocols may also be referred to as ‘technical’, it is better to avoid this terminology as this creates confusion. 1.5.3 Biological replicates Biological replicates are testing the same cells or tissues from different individuals through the same protocol. They allow assessing the biological variability (which is higher than the technical variability, since it also encompasses it). 53 biological and technological context of this thesis 1.5.4 Study design example: meta-analyses Meta-analyses are studies that combine (or aggregate) the results of multiple analyses. One of their main weaknesses is that meta-analyses cannot control bias sources and correct for bad designs [Slavin, 1986]. Because results from smaller studies are more prone to ‘play of chance’, they are usually weighted less than bigger studies when they are directly combined across many studies [Egger et al., 1997], particularly when combining statistical results. However, it is not always true and bigger studies may be subject to greater uncontrolled variations. Egger et al. (1997) recommend testing the heterogeneity across the studies to be combined. Examining the studies outcomes’ similarity degree allows figuring out if the variation between the studies is only due to sampling or to a distribution of different effects. In order to compare studies together, individual results are expressed in a standardised fashion (e.g. means, confidence interval). 1.6 discussion and conclusion Over the years, the central dogma of molecular biology has been being refined and better understood. However, as the first apparent linearity of the theory tends to persist, so are many assumptions. For example, all (or almost all) information necessary to explain the phenotype is in the cell; in its genome and edits provided by its transcription and translation regulatory mechanisms that may be triggered by environmental stimuli. Another assumption is that mRNA and protein levels should share strong correlates. Alternatively, as our genomes are so similar, our transcriptomes and proteomes should also share many similarities. Although the truth seems more intricate, these assumptions remain as we still lack the technical means to test them to their fullest. Due to the intrinsic nature of DNA, mRNAs and proteins, high-throughput DNA and RNA studies are more well-established and standardised than proteomic studies. Although, the study of the genome is the most mature, it fails to contextualise the phenotype, especially for non-disease cases (often referred to as healthy or normal conditions in this thesis). Proteins are in theory the best candidates to study the phenotype. Unfortunately, high- throughput protein studies such as shotgun MS are particularly challenging. The proteins physicochemical diversity and the lack of technology to amplify proteomes mean that in order to optimise their studies many different protocols and experimental designs had to be developed to reach the different proteins in a sample. Since the transcriptome has a dynamic dimension like the proteome and is technically easier to explore and quantify, its study has emerged as a reasonable trade-off strategy. 54 Data! Data! Data! I can’t make bricks without clay! Sherlock Homes [Doyle, 1892] 2 AVA ILABLE H IGH -THROUGHP U THUMAN DATASETS In the past few years, many laboratories have studied the expression of human genes at the transcriptome and at the proteome levels by taking advantage of high-throughput techniques (e.g. Krupp et al. (2012), Brawand et al. (2011), Ramsköld et al. (2009), Fagerberg et al. (2014), Uhlén, Fagerberg, et al. (2015), Gremel et al. (2015), Melé et al. (2015), Desiere et al. (2006), M.-S. Kim et al. (2014), and Wilhelm et al. (2014)). In this chapter, I review the openly available (for research) data I use within my thesis to explore the gene expression in (undiseased) tissues and explain how they have been reprocessed. Besides the results I report in the subsequent chapters, the present chapter also provides the basis for work I have co-authored¹. Unless otherwise stated, all the computational processing of the RNA-Seq part described here have been performed by myself under the supervision of Dr Alvis Brazma. I also received general feedback from Dr Mar Gonzàlez-Porta, Dr Johan Rung and Dr Nuno Fonseca. The proteome data has been processed by Dr James Wright. 2.1 introduction All the datasets were selected to fit three main criteria. Firstly, they comprise normal (i.e. reported as disease-free) human samples from at least three different tissue types. Secondly, gene expression quantifications are based on RNA-Seq for the transcriptome and on label-free MS for the proteome². Finally, the raw data is openly available and reusable³. In the next section, I first describe the RNA-Seq and the MS data I use in my thesis; then I detail how these data have been processed to be employed in the next chapters for various analyses that explore the transcriptome, the proteome and finally, the comparison and integration of these two biological layers. 1 J. C. Wright, J. Mudge, et al. (2016). ‘Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow’. Nat. Commun. 7, p. 11778. 2 These technologies are non-targeted high-throughput and allow one, in theory, to study the whole repertoire of RNAs or proteins in a sample. 3 In this context, reusable means that the data can be processed as accurately by third-party researchers than the original authors, and this without the need to access additional information that have been not openly released 55 available high-throughput human datasets 2.2 transcriptome rna-seq studies I describe hereinafter the five transcriptomic datasets I used in the chronological order of their first public release. Table 2.1 summarises the main characteristics of these datasets. 2.2.1 Castle et al. dataset Castle et al. (2010) released this dataset along with their study: ‘Digital Genome-Wide ncRNA Expression, Including SnoRNAs, across 11 Human Tissues Using PolyA-Neutral Amplification’. The authors were interested in exploring the whole RNA repertoire with sequencing-based technology and they primarily focused their study on the non-coding part. Purchased RNA extracts were used to create multiple-donors pooled samples for 11 tissues from which total RNA libraries were prepared following a total transcriptomic protocol where nonribosomal RNA transcripts are amplified specifically by PCR [Armour et al., 2009]. For each library (tissue), an average of 50 million sequence reads were sequenced using an Illumina Genome Analyser-II sequencer (single-end). The original reads were trimmed to 28 nt before being released through EMBL archives (ENA ID: ERP000257 and ArrayExpress ID: E-MTAB-305). Despite several limitations, such as the lack of replicates, the old technology and the short reads, I have included this dataset for two main reasons. Firstly, it is the oldest available RNA-Seq data I found that was performed on normal human tissues. Thus, the congruence of results for this dataset with the following ones gives a rough idea on the extent of RNA- Seq datasets that may be integrated together. Secondly, as RNA-Seq studies are prepared mainly with polyA-selected protocols today, I was interested in gauging how the library preparation protocols — and the presence of ncRNAs — can affect the quantifications and then any final observation. 2.2.2 Brawand et al. dataset In the article entitled ‘The evolution of gene expression levels in mammalian organs’, Brawand et al. (2011) focused their interest on the evolution of the mammalian transcriptomes⁴. They collected 6 organs from 10 different vertebrates: 9 mammalians (including human) and a bird. There are no technical replicates, but two biological replicates per tissue: one male and one female for every tissue except the testis (two males). The 131 libraries (including 23 for Homo sapiens) were prepared with a polyA-selected protocol. Hence, 4 While there were existing studies on the matter, the sequencing approach was then creating new perspectives. 56 2.2 transcriptome rna-seq studies Ta ble 2.1 .G en er al de scr ip tio no ft he fiv et ra ns cri pt om ic da tas ets (R NA -Se q) us ed fo rt hi ss tu dy Illu mi na Bo dy Ma p( IBM )h as no ‘re gu lar ’te ch nic al rep lic ate sa st he ‘re pli cat es’ are the pro du ct of dif fer en tp rot oc ols ,th us are un fit to est im ate the sp eci fic no ise of eit he rp rot oc ol (si ng le- en do rp air ed -en d). N. B. : Th ep rot oc ols us ed for GT Ex an dC ast le da tas ets are no tt he sam e: GT Ex is fol low ing the mo st co mm on rib od ep let ion pro toc ol, wh ile Ca stl ei sb ase do na tar ge ted am pli fic ati on pro toc ol. Ar ray Ex pre ss ID Da ta ID Lib rar y Pr ep ara tio n Se qu en cin g Re pli cat es Nu mb er of Tis su e Ty pe s Mu lti- sam pli ng fro m the sam ei nd ivi du al To tal RN A Po lyA sel ect ed Sin gle en d Pa ire d en d Bio log ica l Te ch nic al E- MT AB -30 5 Ca stl e ✔ ✔ 11 E- GE OD -30 35 2 Br aw an d ✔ ✔ ✔ 8 E- MT AB -51 3 IBM ✔ ✔ ✔ (✔) 16 E- MT AB -28 36 (an dE -M TA B- 17 33 ) Uh lén ✔ ✔ ✔ ✔ 32 E- MT AB -29 19 Gt ex (v4 ) ✔ ✔ ✔ 54 ✔ ✔i nd ica tes tha tth ed ata set pre sen ts the ch ara cte ris tic ,a nd (✔) tha to ne (or mo re) of the req uir ed cri ter ia of the ch ara cte ris tic is lac kin g. 57 available high-throughput human datasets they are largely enriched in protein-coding genes. An average of 3.2 billion 76 bp-long single-end reads were generated per sample using an Illumina Genome Analyser IIx (single-end) and they released them through GEO (accession number: GSE30352). I personally retrieved the human data from ArrayExpress ID: E-GEOD-30352⁵. 2.2.3 Illumina Body Map 2.0 (IBM) This dataset, created in 2010, has been released in 2011⁶ by Illumina and it used its most recent technology at that time: the paired-end sequencing. Until then, all the sequencing was done from only one end of the DNA or cDNA fragments. From that date, most of the following transcriptome studies based on RNA-Seq use paired-end sequencing. The dataset covers 16 tissues (one donor per tissue) and the libraries were prepared following a polyA-selected and are enriched in protein coding genes. Although each sample has been sequenced twice and despite having in principle technical replicates, these are “non-regular” technical replicates. Technical replicates, by contrast to biological replicates, usually imply that their processing uses the same sample source and protocols. Thus, the error and noise due to a specific technique could be determined. Here, however, each tissue has been sequenced once with a single-end protocol and once with a paired-end one to compare their ability to discriminate between mRNA isoforms. Indeed, Illumina’s main incentive to develop its paired-end technology was to improve the accurate identification of spliced mRNAs. The sequencing was performed with an Illumina HiSeq 2000, and the reads were released through ArrayExpress ID: E-MTAB-503 (ENA ID: ERP000546), fromwhere I have retrieved both the single-end and paired-end mono-tissue samples (the original dataset includes raw data files for mixtures that have been created with the tissue samples). Despite the lack of biological replicates, it was for an extended time the most extensive freely available RNA-Seq dataset of human tissues. Hence, it has been referenced many times (e.g. Asmann et al. (2012), Barbosa-Morais et al. (2012), Smith et al. (2012), Derrien et al. (2012), Florea et al. (2013), D. Kim, Pertea, et al. (2013), Kechavarzi et al. (2014), Zhao (2014), Pasquali et al. (2014), Corpas et al. (2014), Petryszak, Burdett, et al. (2014), Brown et al. (2015), Jänes et al. (2015), De Simone et al. (2016), Kern et al. (2016), Iwakiri et al. (2016), L. Yao et al. (2017), and Akers et al. (2018)) in the literature since its release. In fact, this dataset is the most viewed one in ArrayExpress (with 68,020 views on 31 May 2018 — the second most viewed dataset (46,247 views) being ArrayExpress ID: E-MTAB-62[Lukk et al., 2010]). 5 ArrayExpress was routinely importing datasets from GEO on a weekly basis until very recently. While not automatically, GEO data are still included in EBI Gene Expression Atlas. 6 See Human BodyMap 2.0 data from Illumina - Ensembl Blog, 2011 58 2.2 transcriptome rna-seq studies 2.2.4 Uhlén et al. dataset Uhlén et al. have created the Human Protein Atlas⁷ (often referred as HPA in the literature). This atlas revolves mostly around the spatial distribution of the proteins through the human body. Using diverse approaches and techniques, including RNA-Seq, they first released RNA-Seq data for 27 normal tissues as part of their study: ‘Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics’ [Fagerberg et al., 2014]. Later, they extended the dataset with new samples and 5 new tissues. The latest version was published within ‘Tissue-based map of the human proteome’ [Uhlén, Fagerberg, et al., 2015] in Science. For each of the 32 tissues, there are (at least two) biological replicates. With a few exceptions, the tissues have both male and female donors. Many of the tissues present also technical replicates. The total set comprises 200 samples, which have been picked by pathologists based on the screening of frozen biopsy samples. The polyA-selected libraries were paired-end sequenced with an Illumina HiSeq 2000 or 2500. I first started to work with the early version of this dataset (ArrayExpress ID: E- MTAB-1733 — 171 samples for 27 tissues), and then I upgraded mywork with the extended more recent version (ArrayExpress ID: E-MTAB-2836). At the preparation time of this thesis, this normal human dataset is the most comprehensive, freely and publicly available dataset: either regarding the number of tissues (see Table 2.1) or the number of samples (see Table 2.2). Therefore, its growing number of references is unsurprising. 2.2.5 GTEx dataset The Genotype-Tissue Expression (GTEx) project is funded by the NIH Common Fund and aims to establish, in its authors’ own words, ‘a resource database and associated tissue bank for the study of the relationship between genetic variation and gene expression and othermolecular phenotypes inmultiple reference tissues’. The project was first introduced in GTEx Consortium (2013). It aims to quickly collect various tissues from postmortem donors for genotype-tissue expression analyses (notably eQTL studies, which study the function of SNPs in the modulation of RNA expression). The results of the analyses are released through the GTEx portal⁸. As the project is quite ambitious and the collection and sequencing of the samples spreads over a long period of time, several intermediate data ‘freezes’ have been released⁹. My analyses include samples up to the fourth release of the pilot phase (v4). This release 7 Human Protein Atlas — https://www.proteinatlas.org/ 8 GTEx portal — https://gtexportal.org 9 Many groups are involved in collecting, producing or processing the data. To ease the communication and work coordination, many time points are used to reference each a specific state (version) of the data. Each version of the data is called a freeze. 59 available high-throughput human datasets covers 54 tissues/cell types (53 normal and 1 tumoral) collected on from individuals for a total of 3,276 samples. The RNA-Seq libraries were prepared following a polyA-selected protocols and have been paired-end sequenced on an Illumina HiSeq 2000/2500. There is an average of 80 million reads per sample. For privacy reasons, the raw data is available only through controlled access via dbGaP ID: phs000424.v4.p1 (access number specific to the version of the data I used in my study). Unfortunately, this translates to a slow access time to the raw data. During the data selection process, I had to disregard a few studies as the raw data was not fitting the reusability criterion [E. T. Wang et al., 2008; Pan et al., 2008]. Many times I came across studies with ambiguous encoding format for the raw data such as the ArrayExpress ID: E-GEOD-41637 dataset [E. T. Wang et al., 2008]. Despite my best efforts, I was unable to resolve this issue by contacting the respective authors. E. T. Wang et al. (2008) (ArrayExpress ID: E-GEOD-41637) is one example of study that I unfortunately had to dismiss. 2.3 proteome mass spectrometry bottom-up studies As mentioned earlier, the proteomic data have been selected and handled by Dr Jyoti Choudhary and Dr James Wright. Until recently, compared to the transcriptome, the proteome world was lacking in normal human tissues expression quantification experiments. In fact, while there were human protein maps available (e.g. the Human Protein Atlas¹⁰), these are mostly reporting the spatial expression of proteins (as they are based on immunohistochemistry or other means of identification) than quantifying their (non-targeted) abundance in each tissue. In 2014, two independent groups of authors [M.-S. Kim et al., 2014; Wilhelm et al., 2014] published (in Nature, issue 7502) their own ‘draft of the human proteome’ based on the study of tissues with MS. These two datasets complement a previous smaller one that was publicly released but was never the object of a publication. Hereinafter, I present these three datasets that I use in my thesis. See Figure 2.3A (p. 69) for a short summary. 10 Human Protein Atlas — https://www.proteinatlas.org 60 2.3 proteome mass spectrometry bottom-up studies 2.3.1 Pandey Lab dataset The Pandey Lab [M.-S. Kim et al., 2014] created the Human Proteome Map¹¹ which they released alongside ‘A draft map of the human proteome’ in Nature. For their study, they processed 30 kinds of histological normal human tissues and cell line samples (17 adult tissues, 7 foetal tissues and 6 haematopoietic cell types). Each samplewas created from pooling samples of three individuals (generally two males and one female). Their proteomic libraries were prepared with a label-free method to quantify as many proteins as possible. The samples were fractionated to protein level through SDS-PAGE, and then at peptide level after trypsin digestion by RPLC to create 85 experimental samples. Finally, state-of-art MS/MS protocols (with high-resolution and high accuracy FTMSs Thermo Scientific Orbitrap™ instruments) was used to generate about 25 million of (HCD) high-resolution mass spectra which account for 2,212 LC-MS/MS profiles. The raw spectra were retrieved from ProteomeXchange via the repository PRIDE ID: PXD000561. While the authors’ effort to generate technical high quality raw data was highly appraised by the scientific community, their processing (identification and quantification) methods were criticised (see Ezkurdia et al. (2014) and Deutsch et al. (2015)). Thus, for this thesis I have relied only on quantifications provided by Dr James Wright who reprocessed the raw spectra. 2.3.2 Kuster Lab dataset In their approach of the human proteome map, the Kuster Lab [Wilhelm et al., 2014] combined newly generated LC-MS/MS spectrum data (about 40% of their complete working set) with already publicly available data (either from their colleagues or accessible through repositories — for the remaining 60%). They reprocessed the whole collection of spectra to maximise proteome coverage and make it available through their own repository: ProteomicsDB¹². The subset of data considered in my thesis is also known as the [protein] Human BodyMap which is the part that the Kuster Lab primary generated for their own study. They collected 48 experiments covering 36 tissues (adult and foetal) and cell lines. After LDS-PAGE fractionation and digestion into peptides with trypsin, they processed the samples with LC-MS/MS to create 1,087 profiles. Overall, that represents about 14 million of HCD/CID spectra from Thermo Scientific instruments (including an Orbitrap™). This specific raw data subpart was downloaded from ProteomicsDB ID: PRDB000042. 11 Human Proteome Map — http://www.humanproteomemap.org 12 ProteomicsDB — https://www.proteomicsdb.org/ 61 available high-throughput human datasets 2.3.3 Cutler Lab dataset This dataset was generated prior to the Pandey Lab and the Kuster Lab data as it was released in 2011 through PeptideAtlas¹³ [Desiere et al., 2006], IDs: [PAe001768 — PAe001778]. It was created by Paul Cutler at Roche Pharmaceuticals. It comprises 10 different tissues (and one sample per tissue) that after trypsin digestion, were analysed through Thermo Scientific LTQ-Orbitrap™. In total, there are 1,618 CID profiles which accounts for 13 million raw CID spectra from a LTQ-Orbitrap™ instrument. While this dataset was never published on its own, it has been used in different studies (e.g.Wilhelm et al., 2014). The raw fileswere accessed and downloaded fromProteomicsDB ID: PRDB000012. 2.4 consistent processing pipelines The authors of these five transcriptomic and three proteomic studies have, in most cases, released the quantification of the expression values either directly (e.g. Krupp et al., 2012) or upon requests (e.g.M.-S. Kim et al., 2014). Third-parties also distribute quantification for these studies either retrieved from the original studies, such as BioGPS¹⁴ [C. Wu, Macleod, et al., 2013] or Harmonizome¹⁵ [Rouillard et al., 2016], or after reprocessing the raw data as the EBI Gene Expression Atlas¹⁶ [Petryszak, Keays, et al., 2015] does. To primarily reduce for avoidable technical variability, and despite readily available quantifications for most of the datasets, I only used data reprocessed from raw files by myself or Dr Nuno Fonseca (GTEx dataset) for the transcriptomic data and by Dr James Wright for the proteomic data as already mentioned. In fact, each study has been originally processed with different protocols, e.g. GTEx [Melé et al., 2015] and Castle [Krupp et al., 2012]. While the EBI Gene Expression Atlas reprocesses raw data through the same methods and has quantification for most of the aforementioned datasets (it is still lacking the Castle et al. one.), these were still the products of different protocols when I started my work. Indeed, the datasets were processed with different versions of reference (Human genome build and annotation) and tools. Intuitively, we expect that different processing protocols produce different results. As I started to work with RNA-Seq data, I noticed many potential analysis variables that impact at various levels the resulting gene expression values. Indeed, many of these have 13 PeptideAtlas — http://www.peptideatlas.org/ 14 BioGPS — http://biogps.org/ 15 Harmonizome — https://amp.pharm.mssm.edu/Harmonizome 16 EBI Gene Expression Atlas — https://www.ebi.ac.uk/gxa/home 62 2.4 consistent processing pipelines been reported in the literature since then; in fact, annotation versions [Frankish et al., 2015], contamination (from viruses or bacteria DNA) [Cantalupo et al., 2015], quality controls (and subsequent reads filtering choices) [Kroll et al., 2014], mapping and quantifications pipelines [Fonseca, J. Marioni, et al., 2014] have considerable effects on the final quantification. Lastly, normalisation methods also greatly impact the final expression figures [Dillies et al., 2013; Zwiener et al., 2014]. For all these reasons, I decided to reprocess all transcriptomic datasets with the same exact protocol as the first step of my study. Recently, Danielsson et al. (2015) compare results based on prepublished data and reprocessed ones and conclude that using a single processing pipeline ensures better results. Likewise, there are various tools and many parameters for each processing step needed to quantify the proteomic data that may impact the final expression values [Aebersold, 2011]. For example, various search engines allow detecting different sets of peptides [Griss, 2016]. Mackay (2015) has reviewed and analysed the impact of many of these variables more specifically for label-free proteomics, such as the effect of FDR, protein inference tools or normalisation methods. Therefore, the three datasets were reprocessed uniformly from the raw spectra up to the normalisation of the protein expression values. 2.4.1 RNA-Seq raw data processing As presented in Section 1.2.5, there are many steps from the raw data files to the quantification matrices on which this thesis’ analyses are based. Figure 2.1 presents a general overview of the RNA-Seq processing protocol I used. I downloaded and entirely processed four of the transcriptomic datasets myself (Castle, Brawand, IBM and Uhlén data) and Dr Nuno Fonseca retrieved and processed the GTEx dataset¹⁷. In this thesis, I present results computed on the quantification of these five datasets which have been processed through the same identical pipeline. 2.4.1.1 Data retrieval and preparation I retrieved the human raw data of each dataset from ArrayExpress and ENA through their identifier (see section 2.2) (p. 56). After we received our access approval, Dr Nuno Fonseca retrieved GTEx data from dbGaP. While most of the raw files can be used as they are, an additional step is needed for the Castle files. Indeed these files are using an older FASTQ format that is non-compliant to the most accurate and recent tools used for this thesis. As it is a simple matter of changing the quality score scale (see appendix A.6), I converted these files to Phred+33 FASTQ files 17 As the GTEx data is involved in many projects within the EBI and due to its huge amount of files (number and size — see Table 2.2), it was agreed that this would be processed centrally by one person and then redistributed to all the other interested parties. Dr Nuno Fonseca had this tremendous task. 63 available high-throughput human datasets Normalisation GRCh38 annotation Quantification Filtered data Filter Internal normalisation function FPKM data Raw count data iRAP HTSeq-count TopHat2 Cufflinks2 .fastq Convert to Phred 33 (custom script) .fastq Bowtie (Contaminants removal) FastX-toolkitPreprocessing Mapping Castle Brawand, IBM, Uhlen and GTEx GRCh38 reference Figure 2.1. General steps for processing the transcriptome. The pipeline iRAP integrates all the tools needed for the state-of-art processing of RNA-Seq data. The quality of the reads is checked and they are trimmed if needed. After removal of possible contaminant reads (such as E. coli), the reads are aligned with TopHat2. The gene expression is then quantified with two different approaches: based on the aggregation of isomers for each gene or simply based on the number of aligned fragment on the gene locus defined in the reference. Cufflinks2 provides directly FPKM values. HTSeq-count provides raw counts which were normalised by an iRAP function into FPKM. 64 2.4 consistent processing pipelines Table 2.2. Technical description of the five transcriptomic datasets I processed all the datasets except the one in italic. For the Brawand dataset, I only included and processed the Homo sapiens part. Dataset Participantnumber Library number File number Total size of the fastq raw files (GB) Mean number of biologic samples per tissues [min;max] Castle 10 11 11 58 10 (mixture) Brawand 18 21 23 111 2.8 [2;3] Illumina Body Map 16 36 48 1,004 1 Uhlén 122 200 400 1,851 3.81 [2;11] GTEx (v4) 551 3,276 6,552 ∼ 50,000 60.67 [4;214] with a Perl script (provided digitally as supplementary data). 2.4.1.2 Genome and annotation reference I collected and processed the datasets through an extended period of time. Hence, for a subset of them, I produced many intermediate sets of results based on the GRCh37.p12 (and later GRCh37.p13) human reference genome and the latest available Ensembl gene set annotation (73, 74 or 75) at that time. In fact, the quality of each new annotation update is generally greater than its predecessor¹⁸. As the GTEx data was processed with GRCh38.p1 and Ensembl 76, that led me to reprocess all the other four RNA-Seq datasets for the sake of consistency and to avoid more biases [Guo et al., 2017]. Thus, unless indicated otherwise, the results presented in the current work are based on the GRCh38.p1 human genome reference and the Ensembl 76 gene set annotation. 2.4.1.3 Data processing In the early stages of my research, I was processing each of the different steps sequentially and semi-manually with the help of custom made scripts. While the EBI computer cluster greatly facilitated the handling of the numerous files, the task remained quite tedious. Additionally, the scripts I wrote would need a fair amount of work to achieve general reproducibility on other platforms. Fortunately, Dr Nuno Fonseca developed an ‘integrated RNA-seq analysis Pipeline’: iRAP¹⁹ [Fonseca, Petryszak, et al., 2014]. This tool allows the automation of the typical state-of-the-art and optimised workflow to study RNA-Seq. It takes full advantages of the capacities provided by computer clusters. Thus, I switched from my original set of 18 Although, it is not unusual to have gene or transcript additions based on new studies that are then removed (or fused to another) in a later version. 19 iRAP — https://nunofonseca.github.io/irap/ 65 available high-throughput human datasets scripts to iRAP to improve my workflow without changing any step or parameter. Besides the usual input files (raw RNA-Seq files and genome/annotation references), iRAP needs a configuration file that precisely describes the dataset (its design and technical features) and, if needed, specific parameters to use. To provide full reproducibility, each version of iRAP is shipped with its own set of third-party version-defined tools and default parameters. Thus, apart from remarkably speeding up the data processing, iRAP also ensures the protocol integrity across the five transcriptomic datasets I use in my thesis regardless of who runs the pipeline. Each of the transcriptomic datasets is the product of the same version of the iRAP pipeline (development version 0.6.3b) and set of parameters. As the default parameters of iRAP are tuned for human Illumina paired-end data, I only have to define a few of them. Hence, the quality and contamination checks, and the filtering and trimming of the reads are done following the default options of iRAP. Quality assessment, trimming and filtering iRAP uses internally FastX toolkit²⁰ (0.0.13) to perform the assessment and the trimming. The usual uninformative and ambiguous reads (see Section 1.2.5.1: Quality check, trimming and filtering (p. 16)) have been discarded as were any with an overall quality score below a threshold of 10. The quality of the call decreases while the base calling progresses — see Section 1.2.3: Sequencing-by-synthesis (p. 13). On another note, some tools (mappers in particular) need all the reads to be trimmed to the same length. iRAP optimises the compromise between the purity and the length of the reads to avoid more errors or biases due to smaller reads [Williams et al., 2016] by trimming atmost 15% of the original lengthwhile discarding more reads if necessary to maximise the length. Reads that could be assigned to a likely contamination source, here Escherichia coli (as I work with Homo sapiens), are also discarded. A non-splice aware mapper, Bowtie²¹ (1.1.1) [Langmead et al., 2009] maps all the reads to the contaminant genome and all the reads mapping perfectly and unambiguously are discarded. Mapping I mapped the reads to the genome (GRCh38.p1) and the transcriptome (Ensembl 76 gene set annotation) with iRAP’s (0.6.3b) proposed default splice-aware mapper TopHat2²² (2.0.12) [D. Kim, Pertea, et al., 2013] with its set of default predefined arguments. Indeed, TopHat2 can handle reads from many organisms by fine-tuning the parameters (e.g. number of mismatches or indels to tolerate), but the default parameters are adjusted for 20 FastX toolkit — http://hannonlab.cshl.edu/fastx_toolkit/ 21 Bowtie — http://bowtie-bio.sourceforge.net/ 22 TopHat2 — https://ccb.jhu.edu/software/tophat/index.shtml 66 2.4 consistent processing pipelines Gene length Transcript 1 Transcript 2 Transcript 3 Transcript 4 Collapsed exons + + Figure 2.2. Gene length is equal to the sum of the lengths of all its collapsed exons. Though, this method lacks complete accuracy, it provides a sufficient estimation of the gene length for an efficient normalisation regarding the length bias. The coordinates for the 5’ and 3’ ends of each exon is extracted from the annotation and they are collapsed together. This gene length is unaffected by incorrect attribution of a fragment to a specific transcript when there are many possible options. normal human. Quantification and Normalisation While RNA-Seq can be used to identify (and discover) RNA isoforms, I have focused my thesis on the gene level expression. Indeed, current annotations and knowledge are still lacking in the reasons and external conditions that impact the expression of a specific isoform over the others. In addition, criticisms have been raised on the accuracy of distinction between them [Engström et al., 2013; Jänes et al., 2015; Dapas et al., 2017]. However, normalising gene expression presents more challenges than specific transcript expression. For instance, the definition of the gene length may be different from one laboratory to another. In this thesis’ framework, when I have to use a gene length for a computation, I use the identical gene length definition as found in iRAP and EBI Gene Expression Atlas. Thus, as shown on Figure 2.2 the gene length is defined as the sum of the lengths of all its collapsed exons. As mentioned in Section 1.2.5.4: Normalisation (p. 23), I used two different popular tools based on different strategies to estimate gene expression levels: Cufflinks2²³ (2.2.1) [Trapnell et al., 2010] and HTSeq-count²⁴ (0.6.1p1) [Anders et al., 2015] (with the intersection non-empty mode). These tools are also integrated in iRAP. For Cufflinks2, I used the mode where the multi-mapped reads are probabilistically assigned depending on the coverage of each mapped locus. In addition, Cufflinks2 provides normalised gene expression levels by aggregating their corresponding normalised isoform expression levels. Cufflinks2 uses the equation (Canonical F/RPKM formula) to normalise isoform expression levels. The length of the isoforms are extracted 23 Cufflinks2 — https://cole-trapnell-lab.github.io/cufflinks/manual/ 24 HTSeq-count — https://htseq.readthedocs.io/ 67 available high-throughput human datasets from the reference. On the other hand, HTSeq-count provides only raw counts for the feature of interest. iRAP provides an internal FPKM normalisation function that is an implementation of the equation (Canonical F/RPKM formula). As I requested HTSeq-count to work at gene level, this formula requires gene lengths which are computed with the aforementioned method. All the configuration files I created for this thesis may be found at my personal Github repository²⁵. As the paired-end set of the IBM data was presenting an overall better quality than its single-end counterpart, I only include IBM’s paired-end data for the remaining of the thesis. 2.4.2 MS data processing After retrieval of the data from PRIDE and ProteomicsDB, Dr James Wright reprocessed the three proteomeMS-based datasets. Figure 2.3 illustrates the pipeline that processed the three datasets in a consistent and optimal manner. I summarise this protocol in Figure 2.3 (p. 69) and in the following sections 2.4.2.1 to 2.4.2.4. See Wright, Mudge, et al. (2016) and Weisser et al. (2016) for more details. 2.4.2.1 Spectral processing Themsconvert module of ProteoWizard²⁶ (v3.0.6485) [Holman et al., 2014] converted all the files to the standard format mzML. TOPP [Kohlbacher et al., 2007] from OpenMS²⁷ (pre-v2.0 development build) [Röst et al., 2016], processed the raw spectra. Notably, PeakPickerHiRes which centroids them and FileMerger that merges the ones from the same fractionated experiments. 2.4.2.2 Sequence database creation and searching preparation The target sequence database is a critical element of the MS pipeline and thus, Dr James Wright has carefully designed it. It combines six different parts, three based on known protein sequences and three other covering possible new protein candidates. The known sources include the complete human GRCh38 (v.20) coding DNA sequence (CDS) translated sequences from GENCODE; the human reference proteome from UniProt²⁸ [The UniProt Consortium, 2017] (in its May 2014 version); common 25 https://github.com/barzine/phd-analyses/tree/master/chapter2/irap-configuration-files 26 ProteoWizard — http://proteowizard.sourceforge.net/ 27 OpenMS — https://www.openms.de/ 28 UniProt — http://www.uniprot.org/ 68 2.4 consistent processing pipelines Cu tle r La b - 10 T iss ue s - 10 E xp er im en ts - 13 M ill io n CI D Sp ec tra Pa nd ey L ab - 24 T iss ue s a nd 6 C ell li ne s - 85 E xp er im en ts - 25 M ill io n H CD S pe ct ra K us te r L ab - 36 T iss ue s a nd C ell li ne s - 48 E xp er im en ts - 14 M ill io n H CD /C ID Sp ec tra A C Op en M S Ce nt ro id in g Fi le M er ge M as co t S ea rc h M S- GF +S ea rc h M as co tP er co lat or Pe rc ol at or .ra w .m zM L Pr ot eo W iza rd m sc on ve rt .m zT AB Pr ot eo m e Di sc ov er Pe rc ol at or SE QU ES T Fi lte r S ig ni fic an t P SM s Pr ot ein a nd G en e In fe re nc e Qu an tif ica tio n pe r e xp er im en t No rm ali sa tio n pe r e xp er im en t Av er ag e Qu an tif ica tio n fo r e ac h tis su e B Control M od els fr om R NA -S eq AU GU ST US Ps eu do ge ne s.o rg Ps eu do ge ne s, 5’ U TR , ln cR NA G RC h3 8 (v .2 0) Possible new candidates Co nt am in at io n se qu en ce s H LA s eq ue nc es Un ip ro tH um an R ef er en ce (M ay 2 01 4) CD S fro m G EN CO DE G RC h3 8 ( v.2 0) Known candidates Ra nd om D ec oy se qu en ce s 4,999,422 sequences 787,587 sequences 4,211,835 sequences Fig ur e2 .3. Ge ne ra ls tep sf or pr oc es sin gt he pr ot eo m e. [A da pta tio no fc ou rte sy ma ter ial sf rom Dr Jam es W rig ht] . (A )T he thr ee da tas ets ha ve be en pro ces sed thr ou gh the sam ep ipe lin e. In thi st he sis ,I on ly us et he sam ple sf rom ad ult tis su es. (B ) Ex ten siv es ou rce so fp rot ein seq ue nc es we re us ed for the sea rch da tab ase ,in clu din gp red ict ion of no ve lp rot ein s. Co nta mi na tio na nd de co ys eq ue nc es we re als oi nc lud ed to all ow for FD Re sti ma tio n. (C )S tat eo fth ea rt wo rkf low wa su sed to pro ces st he MS da ta fro m raw file s. Th is wo rkf low co mb ine sm ult ipl eM Ss ea rch en gin es an dp ost -se arc he va lua tio nt oo ls. Re su lts we re filt ere db yp ep tid e len gth ,F DR ,P EP an da gre em en tb etw ee nt he mu ltip le sea rch alg ori thm s. No te tha tth ere is no rel ati on be tw ee nt he rea ls ize of the da tab ase pa rts an dt he irr ep res en tat ion ;th ed eco ys eq ue nc es are as nu me rou sa st he su m of the kn ow na nd po ssi ble can did ate s. 69 available high-throughput human datasets contamination protein sequences²⁹ and HLA sequences³⁰. This known portion of the target sequence database represents 787,587 tryptic peptide sequences. The sources for potential novel proteins included a selection of non-coding gene sequences (including pseudogenes, lncRNA and untranslated regions) from GENCODE GRCh38 (v.20); prediction of novel sequences with AUGUSTUS³¹ [Stanke et al., 2004]; a set of two-consensus predictions (December 2013) from Pseudogene.org³² [Karro et al., 2007] and three-frame translated RNA-Seq transcript sequences. These translated sequences include models built on IBM by Ensembl and by the Kellis lab in addition with models built on different ENCODE cell lines by Caltech and CSHL. This novel portion of the target sequence database provides an addition of 4,211,835 tryptic peptide sequences. Mimic³³ generated 4,999,422 (787, 587 + 4, 211, 835) randomised decoy sequences, i.e. the decoy database and the target database have an equal size of peptide sequences. The different databases were then merged together. It is represented on Figure 2.3B. To account for the isobaric peptides, all isoleucine (I) residues were converted to leucine (L) before the search and then after the search all leucine (L) residues were converted to (J)³⁴ to avoid later misconceptions. 2.4.2.3 Spectral identification and database search pipeline Figure 2.3C describes the overall workflow used by Dr James Wright to quantify the protein abundance in each tissue. As mentioned in Section 1.3.4.2, workflows involving several algorithms produce better results. Mascot Server (v 2.4— Matrix Science) cluster produced a first search on the mzML files submitted through MascotAdapterOnline (part of TOPP). In parallel, Dr James Wright also used MS-GF + Search, which involves the run of MS-GF + (v. 10089) [S. Kim et al., 2014]. MascotPercolator (v 2.08) [Brosch et al., 2009; Wright, Collins, et al., 2012] optimised and rescored the results from Mascot and msgf2pin/Percolator (v 2.08−1) [Granholm et al., 2014] optimised the results from MS-GF +. Finally, SEQUEST [Eng, McCormack, et al., 1994] and Percolator [Spivak, Weston, Bottou, et al., 2009] performed a search in a Proteome Discoverer (v 1.4 — Thermo Scientific) workflow. The different workflows used common stringent parameters for all the database searches: the precursor tolerance was set to 10 ppm; fragment tolerance for HCD spectra to 0.02 Da and to 0.5 Da for CID spectra; the allowed missed cleavages were limited to 3. As described in Wright, Mudge, et al. (2016), the research also accounted for several amino acid modifications by including (known) mass tolerances. The fixed modification 29 Contamination sequences — http://maxquant.org/contaminants.zip 30 HLA sequences — https://www.ebi.ac.uk/ipd/imgt/hla/download.html 31 AUGUSTUS — http://bioinf.uni-greifswald.de/augustus/ 32 Pseudogene.org — http://pseudogene.org/ 33 Mimic — https://github.com/percolator/mimic 34 As J is one of the letter from the Latin alphabet that do not map to any amino acid. 70 2.4 consistent processing pipelines carbamidomethyl (+57.0214 Da) was specified for all cysteine residues. The searches also comprised the following variable modifications: N-terminal acetylation (+42.01056 Da), N-terminal carbamidomethyl (+57.0214 Da), deamidation of asparagine and glutamine residues (+0.984 Da), oxidation of methionine residues (+15.9949 Da), and the possible N-terminal conversion to pyro-glutamine of glutamine (−17.0265 Da) and glutamic acid (−18.0106 Da) residues. The search results were converted into mzTab formatted files and uploaded along with the mzML spectra and FASTA search database to PRIDE ID: PXD002967. 2.4.2.4 Results processing and filtering Custom Perl scripts parsed, merged and filtered the results of each search engine so that every PSM had the same identification in at least two of the three search engines. In each case, the least confident PEP (i.e. the highest, see Appendix A.11) was retained. The PSMs were then filtered to keep matches only to the three following criteria: q-value (see Appendix A.11 and Appendix A.9.3) less than or equal to 0.01 (i.e. 1% FDR); a PEP inferior or equal to 0.05 and a peptide length superior or equal to seven amino acids. PSMs matching contaminant or decoy sequences were also removed. The resulting list of peptides was then used to infer the proteins with a simple approach. Protein clusters were created based on the common matching non-null set of peptides, i.e. each protein cluster has at least one unique peptide. Then, the GENCODE CDS and UniProt accession were mapped back to Ensembl identifiers. Proteins with a gene (or gene clusters) definition matching at least three unique peptides were kept for the remaining of the analysis while the others were discarded. The quantification of the retained proteins was computed for each experiment with an approach close to the Top3 method [Silva et al., 2006]. The precursor intensities of the three most intense unique peptides per gene identifier (or for gene cluster) were summed, before being divided by the total summed quantification of all proteins in each sample to provide the ‘within sample abundance’. Then, these abundance values were normalised by the ten genes displaying the lowest coefficient of variation across all tissues. When there was more than one experiment per tissue, the final quantification values are the median value across all the replicates of each tissue. Protein clusters matching several Ensembl gene identifiers or failing the unique peptide rule are discarded from the presented further analyses. The list of the discarded clusters is different for each of the proteome datasets. Compared to the original Pandey Lab study [M.-S. Kim et al., 2014], fewer proteins were quantified, but the results are congruent to other previous studies on the range of detection and quantification of LC-MS/MS. The quantifications were released along with our paper [Wright, Mudge, et al., 2016] and the reanalysis of the Pandey data was also 71 available high-throughput human datasets released through EBI Gene Expression Atlas under the accession: E−PROT−1 and described in Petryszak, Keays, et al. (2015) and Wright, Mudge, et al. (2016). 2.5 discussion and conclusion In this chapter, I introduced the five transcriptomic and three proteomic normal human tissues datasets on which I based my thesis. I described how both the transcriptomic and the proteomic datasets have been reprocessed from raw files with state-of-the-art unified pipelines which are also using the same genome build and annotation references in the final processed version. As mentioned before, I have produced a subset of the transcriptomic datasets with the previous human reference genome (GRCh37) and three different Ensembl gene set annotations (73, 74 and 75). I have run many of the analyses of Chapters 3 and 4 on these data. While the results may vary for individual genes, the overall outcomes are congruent hence supporting the robustness of the findings presented in this thesis. In addition, all the products of the RNA-Seq pipeline are in agreement with the original studies findings. The MS pipeline also produces similar results to the original studies — except for M.-S. Kim et al. (2014), which original processing and results raised many criticisms [Ezkurdia et al., 2014]. While we are in the era of data deluge and big data, the number of tissue overlaps for independent normal human studies is surprisingly small — see Figure 4.1. (p. 92) Most of these datasets have been (and will be) referenced through many papers for comparison (or as control) purposes; hence, it is essential to assess the soundness of these practices by assessing the consistency between these datasets. 72 See first, think later, then test. But always see first. Otherwise you will only see what you were expecting. Douglas Adams 3 ABOUT EXPRESS ION , V I SUAL I SAT ION , CORRELAT ION AND CLUSTER ING As a first step towards the different meta-analyses presented in this thesis, I have opted for a largely empirical approach to determine a consensus set of methods and parameters on each individual study before applying them across all the datasets in the further chapters. This strategy has also allowed me to estimate the overall data quality per dataset and to structure them appropriately for the upcoming analyses. I mentioned in Chapter 2 quality checks that happened before the processing of the data. Those quality assessments are rather technical¹. In the present chapter, I describe post- processing quality (or sanity) checks that examine higher (and fuzzier) aspects, e.g.: • Possible outliers in the data • Systematic and unsystematic batch effect within each study • Adequacy of data, concepts and statistical models Even if every of these aspects may not be addressed or corrected, the final results and interpretations of this study are then more solid. 3.1 visualisation of expression data Data visualisation is a simple, but very effective method towards adequate analyses and thus more pertinent results. It allows uncovering the detection of underlying structures and possible unwanted artefacts. 3.1.1 Distribution plots In the literature, expression values are frequently visualised on a log-scale (log2(𝑥)). Figure 3.1 illustrates how this scaling improves the readability of the plot. To overcome the lack of definition of log2(0), I have added a common pseudocount (equal to 1) to all the observations. However, in a few cases and only for visualisation purposes, I have removed the null values to avoid misinterpretations; for examples the expression distribution (per tissue) plots Figures 3.1 to 3.3. When I remove the null values I clearly state it in the plot legend as the norm is the pseudocount addition. 1 Is it true signal or noise? Are all the nucleotides called? Is it a true identification or a false positive? … 73 about expression, visualisation, correlation and clustering Figure 3.1. Untransformed (left) and Log2-transformed (null values removed) (right) profile of expression levels (FPKM, protein-coding genes only and all null values excluded) for the IBM dataset Figure 3.2 shows all the remaining transcriptome datasets. Overall, on this log2(𝑥) scale (and with all null values excluded): all the samples present a similar shape; a peak near 0 for the lowly expressed and undetected genes and a long-trailing tail. The bulk of the expressed genes on this scale is below 6 (i.e. below 63 FPKM). In Figure 3.2c, we can observe that the general expression of the pancreas is shifted towards the left in comparison to the other tissues. This may be an artefact as this shift of the values distribution is absent in the pancreas of the other transcriptomic studies (Figure 3.1b and Figure 3.2d). Moreover, as highlighted by the next chapter analyses, Uhlén’s and GTEx’s pancreas are strongly correlated (𝑟 = 0.83;𝜌 = 0.96). Aside from the Pandey data (Figure 3.3c), the expression of the proteins is more heterogeneous (in particular Cutler data, see Figure 3.3a). This is concordant to the more disparate and variable techniques involved in the proteomic sample preparation (see Section 1.3.1: Sample preparation). 3.1.2 Scatter plots Anscombe (1973) created four datasets (see Figure B.1) which share similar descriptive statistics to show the importance of data visualisation even through a simple scatter plot. He demonstrated that checking the datasets graphically with scatter plots allows one to quickly detect outliers and roughly estimate the relationship between two variables. Even a non-linear but strong relationship is promptly highlighted (e.g. top right corner of Figure B.1). 74 3.1 visualisation of expression data 0.00 0.05 0.10 0.15 0.20 -10 0 10 Log2(FPKM) de ns ity 0.00 0.05 0.10 0.15 0.20 -10 0 10 Log2(FPKM) de ns ity 0.00 0.05 0.10 0.15 0.20 -10 0 10 Log2(FPKM) de ns ity 0.00 0.05 0.10 0.15 0.20 -10 0 10 Log2(FPKM) de ns ity Pancreas (c) Uhlén et al. (d) GTEx (a) Castle et al. (b) Brawand et al. Figure 3.2. Profile of expression levels across the transcriptomic (protein-coding genes only) studies (null values removed) 75 about expression, visualisation, correlation and clustering Figure 3.3. Profile of expression levels across the proteomic studies (null values removed) 76 3.2 main statistical approaches (a) Technical replicates (Heart) (b) Biological replicates (Kidney) Figure 3.4. Examples of scatter plot for replicates from Uhlén (transcriptome) Technical replicates present very strong correlations particularly for higher expressed mRNAs (≥ 32 FPKM). Biological replicates present lower but still strong correlations within the same dataset. Figure 3.4a illustrates how very lowly detected RNAs diverge even in technical replicates. Biological replicates, within a same dataset, may present very close profiles even if the spread for the lowly detected genes is even greater as showed in Figure 3.4b. 3.2 main statistical approaches As the general normal distribution shape of the gene expression levels on log-scale are similar, I have also computed Pearson correlation coefficients (in addition to Spearman ones) to assess the similarity of the replicates within (intra) and between (inter) studies. 3.2.1 Correlation Correlation coefficients are a measure of the statistic dependence between two continuous variables² (e.g. 𝑋 and 𝑌 ) and always ranges within [−1, 1] (see also appendix B.1: Correlation). 2 In the context of this study, the variables are either expression levels of a given gene across samples/tissues or expression levels of all genes between two samples or tissues 77 about expression, visualisation, correlation and clustering The correlation coefficient is computed by the pairwise comparison of observations between two variables. Most implementation methods will manage an unbalanced number of observations by excluding the incomplete pairs. To ease the interpretation I filtered the data a priori; I only kept expression values effectively observed in all the datasets (as I explain in section 3.3.3: Expressed or not expressed). From the several methods available to compute the correlation coefficient, I chose both the Spearman and the Pearson correlations. As Spearman correlations are computed on ranks, they report any kind of relationship, while Pearson correlations are computed on the values and report only linear relationships. However, Pearson correlations are easier to interpret and can be used with one of the variable to predict the other one. (See also Appendix B.1.1: Spearman correlation and Appendix B.1.2: Pearson correlation). Figure 3.5. Correlation coefficients between RNA-Seq replicates. The correlation means and medians are high across the studies replicates. However, the range of the correlations are quite extreme in a few case. Spearman correlations are higher than the Pearson ones. See also Appendix B.1.3. Figure 3.5 (and Table B.1) presents the Pearson and Spearman correlation coefficients for the technical replicates for Uhlén study and the biological replicates within Brawand, GTEx and Uhlén studies. On average the correlation coefficients are high both for the technical or the biological replicates. The GTEx study presents the same average correlation coefficients but a more extreme range. This may be explained by a strong batch effect as the samples were collected and sequenced at different times by different laboratories. 78 3.2 main statistical approaches 3.2.2 Clustering analysis Aswe know the tissue type for each sample of each dataset, wemay debate that supervised analyses can be more informative than unsupervised ones. However, they would involve proper corrections for batch effects and other technical biases for each dataset. This is challenging as it often requires more knowledge than what is available through the repositories. In Chapter 2, we have seen that is also unwise to rely solely on the normalised data provided by the original authors when working with various datasets that are non- uniformly processed³. To assess the consistency of the quantification across the different datasets, in particular for RNA-Seq, I picked a widely used unsupervised method for exploratory analysis in gene expression studies: clustering analysis. There are many available approaches and algorithms from which to pick; I chose a (bottom-up) hierarchical clustering (a.k.a. connectivity-based clustering). This sort of clustering is widely used in gene expression studies. Broadly speaking, this method groups samples by similarity in an extensive hierarchy, which allows uncovering possible hidden structures within the data; thus establishing if samples are more alike biologically or by study origin for instance. In general, we expect biology to be a better predictor when we only consider data from either transcriptome or proteome. Even more so if the identification technology and the quantification workflows are consistent. Yet, a technical predictor can not be directly excluded. Indeed, most transcripts (in particular mRNAs) are expressed in many tissues. Two tissues chosen at random share about 60 to 90% of their pool of mRNAs [Ramsköld et al., 2009; Gremel et al., 2015]. Parallelly, on the proteome side, M.-S. Kim et al. (2014) estimate that 75% of themass of a cell is due to ubiquitous proteins andWilhelm et al. (2014) estimate that about 10,000 to 12,000 proteins are ubiquitously detected, which represent about 60 to 75% of the proteins that they identified per tissue. Thus, if the variation of expression are too subtle from one tissue to another, a strong sample collection or data processing bias may hide any relevant biological signal. In practice, each sample starts in its own cluster and then iteratively, each cluster is merged with its nearest one. Themethod has two parameters: the distance and the linkagemethod. Debate is still going on how to pick these parameters among the many possible choices (for more details see [Jaskowiak et al., 2014; Guinand et al., 2002]). The distance measures the dissimilarity between two samples and one common approach is to calculate the subtraction result of the correlation coefficient from 1 (hence, a greater similarity between the two samples means a smaller distance). In analogy to previous analyses, I have also used both Spearman and Pearson correlation methods. The linkage parameter specifies which part of each cluster is used as reference for computing the distance between the clusters. There are many methods and after trying 3 Eventual bias corrections in RNA-Seq vary according to planed downstream analyses and proteomic data is hard to handle and two processing pipelines may rather give quite different results (see Section 5.2). 79 about expression, visualisation, correlation and clustering (a) Clustering based on Ward’s method (b) Clustering directly extracted from the original study [Fagerberg et al., 2014] Figure 3.6. Comparison of two clusteringmethods on a subset of theUhlén study. ((a)) includes all the samples and we observe that only a few of them are mixed with other samples from other tissues. This mixture is only observed between Small intestine and Duodenum.80 3.3 reducing sources of bias several, I have arbitrarily selected the one that divides most accurately the samples by their tissue source across the different datasets. In fact, I noticed that Ward’s method [Ward, 1963] was the best for this task and was outperforming the complete-linkage method⁴. Indeed, this latter method was used in [Fagerberg et al., 2014] (first release of the Uhlén dataset) where the authors have discarded a few samples as they were clustering incoherently in regards of their biological nature. Figure 3.6b presents the effect of the different clustering methods. Notice that all the tissue clusters are better defined when Ward’s method is used: this method allows conserving all the samples for the analysis as long as other bias sources are corrected (see Section 3.3.1: Mitochondria issue). 3.3 reducing sources of bias Many non-trivial methods correct for the skewness present in RNA-Seq and MS-based proteome global expression distributions and for other possible bias sources. For some examples, see Leek et al. (2010), Leek (2014), Yi, Raman, et al. (2018), S. Li et al. (2014), and Stegle et al. (2012). However, it may be complex to assess the biological relevance of those corrections. Moreover, many require more metadata that is often available in the public repositories. In the context of this thesis, I am interested in consistent traits across the datasets that may be consolidated into a reference. Thus, biases that are common to all the different included studies are in practice negligible. On the other hand, I have adjusted for a few easily avertible biases that I describe hereinafter. 3.3.1 Mitochondria issue Mitochondria are organelles that can be found in eukaryotic cells. They have a central role in many essential processes [Kotrys et al., 2019]. While mitochondria share many similarities with bacteria⁵, substantial divergences, notably for mammals, have been discussed in many reviews such as Boguszewska et al. (2020), Barshad et al. (2018), Hillen et al. (2018), Al-Faresi et al. (2019), Ladoukakis et al. (2017), and Shokolenko et al. (2017). One remarkable key difference is the polyadenylation of the RNAs. In bacteria, the polyadenylation of mRNAs prompts their degradation [Hajnsdorf et al., 2018; Rorbach et al., 2014]. On the other hand, the entire range of the polyadenylation effects is still 4 Ward’s method minimises at each step the variance within each cluster; the complete-linkage method (or farthest neighbour clustering) uses the maximum distance between the two farthest elements of each pair of clusters and merges the pair with the smaller inter-cluster distance. 5 Despite the many debates and investigation going on about the lineages and mechanisms, current research accepts that mitochondria have evolved from an α-proteobacterial endosymbiont of a host cell prior to the last eukaryotic common ancestor (LECA) [W. F. Martin et al., 2015; Stairs et al., 2015]. 81 about expression, visualisation, correlation and clustering (a) With the 37 mitochondrial genes included (b) Without the mitochondrial genes Figure 3.7. Clustering of the biological samples of Uhlén study based on the Pearson correlation — all expressed genes are included. 82 3.3 reducing sources of bias elusive for the human mitochondrial RNAs (mt-mRNAs) [Al-Faresi et al., 2019]. The polyadenylated tail has only a relative effect on the mt-mRNAs’ stability [Bratic et al., 2016]. Besides, it is highly transcript-specific and can either stabilise them or flag them for degradation [Kotrys et al., 2019]. Most likely, its prime roles are the creation of a functional stop codon and the protection of the 3’ side of the mt-mRNA against degradation [Bratic et al., 2016]. Except for MT-ND6, a polyadenylated tail has been observed for the twelve other mt-mRNAs. The extent of polyadenylation can vary across cell types [Kotrys et al., 2019]. However, the polyadenylated tail has an average length of 45 nt [Rorbach et al., 2014], which explains its possible captures by RNA-Seq (see Section 1.2.1). Gene expression levels of mitochondria can report very useful information, e.g. the stress level of a cell in a single cell experiment [Ilicic et al., 2016]. However, it is unwise to keep them for a bulk analysis, particularly when comparing different biological sources. Indeed, it is very hard to properly normalise their expression; it involves knowing the amount of mitochondrial genome copies in the studied samples, while the mitochondria are from an unknown polyploidy and RNA-Seq protocols are badly suited for polyploid organisms [Pearce et al., 2015]. Thus, I have decided to remove them from the analysis as they skew anything relying on correlation. In fact, there are always mitochondrial genes among the highest expressed genes and they usually dominate manifolds the expression of the other genes. Removing the (37) mitochondrial genes from the bulk of expressed genes (more than 10,000 protein-coding genes) produces more defined clusters as showed on Figure 3.7 (see also, in the next chapter analysis, Figure 4.3 in comparison of Figure C.4, where the simple exclusion of the mitochondrial genes allows all tissues to cluster by biological origin rather than the mixtures observed when they are kept). Figure C.14 illustrates furthermore the distinctiveness of the mitochondrial genes expression levels. 3.3.2 Protein-coding genes only I have focused my analyses on the mRNAs (i.e. RNAs that have a biotype described as protein-coding in Ensembl 76). In addition to the obvious reason to match with the proteomic data, most of the transcriptomic data is the product of poly-A selected protocols (see Section 1.2.1.2: RNA enrichment). Thus, aside from the mRNAs, all the other RNAs are off-target and, for many of them, their expression levels estimations may be highly imprecise. 3.3.3 Expressed or not expressed While it can seem a trivial concept and might be overlooked, whether a specific molecule is truly expressed — or not — in a given condition, can actually have an extensive impact on the results of the analyses, particularly when integrating proteome and transcriptome 83 about expression, visualisation, correlation and clustering Ti ssu e 1 Ti ssu e 2 Ti ssu e 3 Ti ssu e 4 Ti ssu e N Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene x 0 0 0 00 0 0 0 0 0 0 0 00 0 000 0 Ti ssu e 1 Ti ssu e 2 Ti ssu e 3 Ti ssu e 4 Ti ssu e N Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene x ? Figure 3.8. Expressed or not: several cases illustrated. Genes like Gene 1 are unequivocal: they have been detected in all the different tissues. Genes that have been quantified in some of the conditions are, in principle, detectable with the protocol of sampling and quantification used for the assay. For these genes, when no signal is collected, I assume this is a true 0 signal. In contrast, genes without any quantification in any tissue, e.g. Gene 4, are discarded from the remaining analysis as it is impossible to state whether they are truly absent from the biological sample or if it is due to the protocol used; they are undefined. The same approach is used for the transcriptome and the proteome. together. For example, the Pearson correlation coefficient is very sensitive to outliers and null values. If for both samples, a vast number of null values are recorded, this will lead to a greater similarity. Hence, it is important that the data used for the analysis is meaningful in its entirety, i.e. a null value is a truly an observation and translates to a lack of expression, rather than a lack of observation. • The undefined: If a protein or transcript is never found in any of the samples of a dataset, then I considered that we can not determine if the protein or transcript was either truly not expressed or, for any reason, was not captured during the library preparation or the identification/quantification steps. Hence, those are excluded from the analyses as I can not resolve precisely if this is a technical artefact or a biological truth. An example is illustrated by the row circled in red in Figure 3.8. • Expression in a dataset: By contrast, if a protein or a transcript is expressed in some samples of the dataset, then, whenever no expression was recorded in the other samples, I consider that the expression of the considered macromolecule is truly null for those samples. 84 3.3 reducing sources of bias • Expression within a sample: Due to the technical (and biological) differences between proteomics and transcriptomics, I use different thresholds to classify the presence of a protein or a transcript. – Expressed protein: On the proteomic side, I consider that a protein is expressed if it has been identified and quantified. In other words, a protein is expressed if the expression value is greater than zero in a sample. As described in Section 1.3.4.3, a protein identification and expression are inferred on a set of selected peptides. Thus, if the peptide selection changes, the identified proteins and their level of expression as well. – Expressed transcript: On the transcriptomic side, while the identification is direct, we have to account for technical noise, but also for ‘transcriptional noise’ [Z. Wang et al., 2009; Dar et al., 2015]. Indeed, SEQC/MAQC-III Consortium (2014) reports that excluding low-expression measurements reduce the FDR of RNAs considerably. While we can empirically evaluate it for each RNA-Seq dataset [Ramsköld et al., 2009], there is a widespread threshold used in the literature: 1 FPKM (or RPKM) — e.g. Fagerberg et al. (2014) and Uhlén, Fagerberg, et al. (2015). In fact, Hebenstreit et al. (2011) showed in their study ‘RNA sequencing reveals two major classes of gene expression levels in metazoan cells’, that to be detected and quantified at protein level, an mRNA should at least present an expression equals to 1 RPKM. As an important part of my thesis focuses on the comparison of proteomic and transcriptomic data (see Chapter 6), I have conducted all the analyses at least once with this threshold of 1 FPKM (I have also used 0 (i.e. using the same threshold for mRNAs for the proteins) and 5 FPKM as other thresholds for a few specific analyses). Limitations of the study While I have chosen to define proteins as expressed if only they present a non-null value, the truth is more complex. One major challenge of bottom-up proteomics is the high rate of missing values. The detection of 10 to 50% of the expressed proteins can fail in a given study, and the proportion of a peptide/protein exhibiting a missing value at least once within the same study can reach 90% [Lazar, Gatto, et al., 2016]. As presented in Section 1.3.3, the detection of a protein is affected by the expression ranges of all the other proteins in the mixture. Thus, due to its relative abundance to other proteins in two given samples or tissues, the same protein can be detected in one andmissed in another onewhile present in both; it can reach the MS detection limit in the first case but not in the second. Imputing the missing values is a widespread handling approach, for which, many algorithms have been developed and reviewed [Webb-Robertson et al., 2015; Välikangas et al., 2018a; Gardner et al., 2020]. 85 about expression, visualisation, correlation and clustering For this thesis, I have chosen to not impute the missing data and exclude any mRNA or protein that is not expressed in at least one sample in each and every dataset used for the analyses. This led me to define different working sets for the following chapters analyses to limit the number of omitted mRNAs/proteins. I have compared the list of undefined, expressed and unexpressedmolecules. However, the bulk of the analyses has been done on the common expressed genes across the datasets. 3.3.4 Aggregating tissue expression To deal with an unbalanced number of biological replicates across the datasets (see Section 1.5: Reproducibility and Experimental design), I computed a ‘virtual’ reference for each tissue within the datasets that present more than one biological sample per tissue, i.e. Brawand, Uhlén and GTEx datasets. Note that, Castle, Cutler, Kuster and Pandey datasets present by design only one measure of expression per gene per tissue. To compute the ‘virtual’ references for Brawand, Uhlén and GTEx datasets, I have taken the median value of each gene across all the biological replicates for each of their tissues. Notes: • For IBM dataset, I have discarded the single-end sequenced samples (as already mentioned in Chapter 2). • The Uhlén dataset required an extra a priori step to the averaging of the biological replicates for some of the tissues as they present technical replicates. For these, I have first averaged the gene expression levels for each subject-tissue pair before computing the gene expression level medians of each tissue. • The GTEx dataset required another post-processing step after the averaging of the biological replicates. Indeed, the samples are described based on their body site sources while the other datasets describe their samples only based on their tissue origin. Thus, while there is only Heart samples in Castle, Brawand, IBM, Uhlén, Cutler, Kuster and Pandey, GTEx has samples from the left ventricle of the Heart and from the Atrial appendage of the Heart. For this case and other similar ones, I have average the virtual reference of the body sites in GTEx that I considered relevant for comparison with tissues found in the other datasets. • While I have detected many samples that seem outliers to their biological replicates within the same datasets, I have decided to keep all the samples for the tissues expression averaging step. On this matter, in many scientific exchanges, I was repetitively asked about the inclusion of all the GTEx dataset samples related to Oesophagus for the averaging step. Indeed, while two of the three GTEx’s body sites (i.e. Gastro oesophageal junction and Oesophageal muscularis) present great similarities together (𝑟 = 0.99) and more modest correlations to Uhlén’s Oesophagus samples (𝑟 < 0.80), the last body site (i.e. Oesophageal muscularis) expression, very dissimilar to the two former ones (𝑟 < 0.64), presents higher correlation to Uhlén’s (𝑟 = 0.94). Thus, while only considering this latter body site 86 3.4 discussion and conclusion significantly improves the overall Oesophagus correlation between GTEx and Uhlén, I have decided to include all three body sites (as one tissue) in my study. Indeed, there are no suitable reasons or information that allows excluding any of the body site prior to the analyses; gene expression scatter plots of the Uhlén’s samples versus any of these three body sites present two trends which suggest that Uhlén’s Oesophagus samples are composite. The meta-analyses of the following chapters use a TREP (tissue reference expression profile) for each tissue of each study, i.e. there is only one measure per gene for each tissue. These measures are either the primary sample expressions for Castle and IBM studies or these ‘virtual’ constructed references for Brawand, Uhlén and GTEx. 3.4 discussion and conclusion In this chapter, I have reviewed many fine quality control points that may be (and often are) overlooked. These details have critical impact on the results of the analyses I perform and more gravely on their interpretations⁶. Aside the datasets selection criteria, this phase is by far the most subjective one of the whole thesis. Hence, to avoid excessive data cleaning and possible cognitive biases, I have formulated the aforementioned filtering rules that are important but simple. Overall, I am quite stringent and I have preferred to keep more data unless there is a strong rationale to discard them. Therefore, there are sharper filters and samples exclusion options that may easily improve the results I present in this thesis. 6 For more, see ‘The devil in the details of RNA-seq’ [Kratz et al., 2014] and the included references. 87 So far, I think it’s been working. But who knows? Mark Watney [Weir, 2014] 4 I N TEGRAT ING GENE EXPRESS ION DATA FROM UND I SEASED T I S SUES ACROSS RNA- SEQ ST UD IES To pave the way towards a generalised baseline expression reference for the normal (i.e. non-disease) human, I assess in this chapter the similarity of the tissues sourced from different RNA-Seq studies and the general profiles of their expressed genes. When I started this project in 2013, little was known of either the robustness or the shortcomings and pitfalls of RNA-Seq and its related processing of output data. Since then, several studies were published assessing RNA-Seq robustness (See Appendix C.4). A few are close in scope to my own investigations. Thus whenever relevant, I introduce and discuss my results in relation to the published ones. In the first part of this chapter, I introduce the datasets based on the transcriptome studies (described in Chapter 2) that I use in the different meta-analyses. In the second part, I appraise the congruence of the interstudy tissue expression profiles. Then, I examine different components that may contribute the most to (and thus explain) the overall strong biological correlations that are observed between the studies’ tissues. Finally, I explore the interstudy consistency of the gene expression profiles. All the work presented in this chapter was performed by myself under the supervision of Dr Alvis Brazma. I received invaluable advice and help frommy discussions with Dr Nuno Fonseca. I also received general feedback and comments from Dr Mar Gonzàlez-Porta, Dr Johan Rung, Dr Sarah Teichmann, Dr Gos Micklem and Dr Wolfgang Huber. 89 integrating gene expression data from undiseased tissues across rna-seq studies Communication to the community derived from this chapter • (paper) R. Petryszak, M. Keays, et al. (2015). ‘Expression Atlas update—an integrated database of gene and protein expression in humans, animals and plants’. Nucleic Acids Res. 44 (D1), pp. D746–52 • (short talk) Quantitative Genomics 2015 — Integration of independent human RNA- seq datasets: a feasibility study • (poster) ECCB 2014 — A feasibility study: Integration of independent RNAseq datasets • (poster) SymBLS 2014 — Integration of independent human RNAseq datasets, a feasibility study • (invited short talk) GM2 2013 — Baseline Gene expression Atlas • (flash talk) CSAMA 2013 — How quantitative is RNA-seq? 90 4.1 meta-analyses’ combined datasets In the past years, RNA-Seq rapidly gained popularity for human gene expression studies due to a broader dynamic range than previous technologies and the promise to enable quantitative profiling [J. C. Marioni et al., 2008]. That technology was an advancement with respect to microarray assays that are semiquantitative [M.-L. Lee, 2006] and very prone to batch effects [Irizarry, Warren, et al., 2005]. However, RNA-Seq studies had shown variation in their conclusions on various occasions for similar research topics [SEQC/MAQC-III Consortium, 2014]. At the time that I started my Ph.D., it appeared that RNA-Seq might share at least partially the problems encountered with microarray assays. In fact, batch effects restrain the use of direct approaches for the comparison of independent microarray data, and the resulting insights are usually limited [Walsh et al., 2015; Chrominski et al., 2015; Rung et al., 2013; Lazar, Meganck, et al., 2013]. The following meta-analyses attempt to provide more insights into the interstudy RNA- Seq robustness for tissues expression as a supporting exploratory study to the EBI Gene Expression Atlas [Petryszak, Keays, et al., 2015]. 4.1 meta-analyses’ combined datasets Through this chapter meta-analyses, I use two sets based on combined subsets of the transcriptomic studies introduced in Chapter 2. The following Sections 4.1.1 and 4.1.2 illustrate the construction of these sets. While many approaches exist, I usually consider the most stringent routes, i.e. I rather exclude part of the data to infer conclusions than keep wider datasets and more partial, biased or ambiguous results. Thus, I identified the identical core of explored tissues and expressed genes across the studies. From this base, I created two more robust combined datasets (𝒲1 and𝒲2) for the meta-analyses. 4.1.1 Tissue overlaps across available normal human RNA-Seq studies Figure 4.1 presents the tissue overlap between the five studies. All studies share at least four tissues: Heart, Kidney, Liver and Testis. This 4-tissue set is the base of the first combined dataset (𝒲1). The greatest number of shared tissues occurs between the two most recent studies, Uhlén [Uhlén, Fagerberg, et al., 2015] and GTEx [Melé et al., 2015]. This 23-tissue set is the base of the second combined dataset (𝒲2) and includes Adipose tissue, Adrenal gland, Bladder¹, Cerebral cortex, Colon, Oesophagus, Fallopian tube, Heart, Kidney, Liver, Lung, Ovary, Pancreas, Prostate, Salivary gland, Skeletal muscle, Skin, Small intestine, Spleen, Stomach, Testis, Thyroid and Uterus. 1 May also be referred to as Urinarybladder 91 integrating gene expression data from undiseased tissues across rna-seq studies 0 2 2 8 201 10 0 2 0 0 1 0 10 3 010 0 0 0 0 0 00 50 0 0 4 Castle Brawand IBM Uhlen Gtex Figure 4.1. Distribution of unique and shared tissues between the transcriptomic studies. The five studies share four common tissues: Heart, Kidney, Liver and Testis. The most prominent overlap of tissues (23) is between Uhlén and GTEx. 4.1.2 Common measured genes for each of the main shared-tissue sets In the following sections of the thesis, I only present the results based on the HTSeq-count quantification. As shown in Table 2.1 (p. 57), many of the transcriptomic studies I use were produced through polyA-selected library protocols. Hence, to avoid unnecessary biases², I have limited my analyses to protein-coding genes (Ensembl 76). All mitochondrial genes have been filtered out before any TREPs analysis (as specified in Section 3.3.1). The Venn diagram presented in Figure 4.2a only includes protein-coding genes that are observed³ at least once at 1 FPKM for one of the four shared tissues. As mentioned in the previous chapter (Section 3.3.4 p. 87), the bulk of expressed genes at this threshold is common to all five studies. While each study presents a tiny portion of genes that are unique, overall most genes are detected in at least two studies. The most considerable contingent of shared gene expression is observed between Uhlén and GTEx. Figure 4.2b presents a similar Venn diagram which focuses on the set of twenty-three tissues (𝒲2) betweenUhlén andGTEx studies. The uniquely expressed genes in each study are negligible compared to the overlap. They represent at most 0.03% of the measured 2 See Section 3.3.2: Protein-coding genes only. 3 See Section 3.3.3: Expressed or not expressed. 92 4.1 meta-analyses’ combined datasets 221 131 75 96 23424 3146 19 249 20 47 40 29 187 86 296717 18 93 28 77 15 476 1792 403 111 32 29 12268 Castle Brawand IBM Uhlen Gtex (a) Four common tissues across the five tissues (𝒲1) 460 283 17554 Gtex Uhlen (b) Twenty-three common tissues between Uhlén et al. and GTEx studies (𝒲2) Figure 4.2. Unique and shared protein-coding genes expressed (≥ 1 FPKM) across the RNA-Seq studies for𝒲1 and𝒲2 93 integrating gene expression data from undiseased tissues across rna-seq studies genes in each of the studies. I have also analysed all the other subgroups of genes (i.e. unique to each study or shared only between two to four studies) for any functional annotation enrichment (see Section 1.4). No analysis provided any conclusive result. 4.1.3 Combined datasets summary I have based all the meta-analyses of this chapter on the two𝒲1 and𝒲2 datasets, which are defined as follow: 𝒲1 ∶ 𝒟Trans1 ×𝒢protein coding1 ×𝒯1 and 𝒲2 ∶ 𝒟Trans2 ×𝒢protein coding2 ×𝒯2 where: •𝒟Trans𝑖 is a set of mRNA expression studies (presented in Section 2.2). With𝒟Trans1 = { Castle, Brawand, IBM, Uhlén, GTEx} and 𝒟Trans2 = { Uhlén, GTEx} • 𝒢protein coding𝑖 is a set of genes 𝑔𝑝𝑐 that are shared by all elements of𝒟Trans𝑖 andhave a biotype described as protein coding in Ensembl 76. 𝒢protein coding1 comprises 12,268 protein-coding genes. 𝒢protein coding2 comprises 17,551 protein-coding genes.• 𝒯𝑖 is a set of tissues that are shared by all elements of𝒟Trans𝑖 . 𝒯1 includes four tissues and 𝒯2 twenty-three tissues. Note that as stated in Section 3.3.4, to avoid an unbalanced number of samples per tissues across studies, I aggregate into a single virtual reference, i.e. TREP, the gene expression measured for each gene for each tissue in each study, regardless of the number of replicates. Thus,𝒲1 comprises 20 TREPs, and𝒲2 46 TREPs. 4.2 prevalence of biological signal over technical variabilities at tissue-level As shown in Chapter 3, the expression levels of biological replicates (i.e. identical tissue samples) are highly correlated within the same study and allow one to group the samples based on their biological source. Thus, clustering the samples across studies should offer a rough assessment of the underlying driving forces for the observed gene expression levels. A clustering by study would mean that technical variabilities are stronger than any biological expression signature (which is an actual recurrent observation with microarray studies due to their strong batch effects [Sudmant et al., 2015]). On the other hand, an interstudy sample clustering by tissue would imply that RNA-Seq 94 4.2 prevalence of biological signal over technical variabilities at tissue-level measurements demonstrate a good (biological) signal over (technical) noise ratio. In other words, RNA-Seq would be then less prone to batch effects and more robust than microarray assays [Taminau et al., 2014; Walsh et al., 2015]. The heatmaps of the hierarchical clustering of the TREPs⁴ for𝒲1 and𝒲2 are respectively presented in Figures 4.3 and 4.4. They are based on clustering (Ward’s method linkage [Ward, 1963]) the TREPs’ Pearson correlation coefficients (protein-coding genes expressed at least at 1 FPKM). Both heatmaps show that the overall biological signal measured in the tissues is stronger than the noise generated by any technical variation or batch effect. In Figure 4.3, each cluster corresponds to a tissue. The clustering signal by tissue dominates over the signal from the dataset. It highlights a greater biological similarity of the TREPs due to their sampling origins rather than any possible technical similarity due to laboratory protocol variations. One may object that the very different gene expression landscapes of Heart, Kidney, Liver and Testis [Ramsköld et al., 2009; Lukk et al., 2010; Danielsson et al., 2015; Sudmant et al., 2015; Melé et al., 2015; Uhlén, Fagerberg, et al., 2015] may drive this result and lesser differentiated tissues may exhibit more mitigated correlations. Figure 4.4 (𝒲2) confirms that the biological origin of the tissues is the dominant criterion for the clustering of the TREPs. Moreover, in many cases, TREPs mixtures occur in close biologically related tissues, e.g. Fallopian tube and Ovary or Salivary gland with Oesophagus and Stomach TREPs. The functional proximity of these tissues is likely supported by an overall similarity in their gene expression. Thus, even though there are clear biological substructures emerging like for Heart and Skeletal muscle, without correction, the biological signal to technical noise ratio for close tissues may be insufficient to discriminate them accurately in every case. The general observed biological prevalence holds when I extend the analysis to include all the available tissue samples (see Figure C.5 and Figure C.6). See also Section 2.2: Transcriptome RNA-Seq studies (p. 56) and Table B.1: Correlation coefficients between RNA-Seq replicates (p. 198). Figure 4.5 shows the distribution of the Pearson correlation coefficients for the pairs of the profiles (TREPs) of tissues with the same name sourced from the different studies for both of the combined datasets𝒲1 (4-tissue set) and𝒲2 (23-tissue set). Even with the lack of any batch effect correction, most of the Pearson correlations are above 0.5. There are two exceptions: the correlation between the Testis TREPs of Castle and Brawand (𝑟 = 0.42) from 𝒲1 and Salivary gland TREPs of Uhlén and GTEx (𝑟 = 0.2) from 𝒲2. The median correlation for 𝒲1 is about 0.7 and 0.84 for 𝒲2. Spearman correlation gives even better 4 Tissue reference expression profile. See Section 3.3.4: Aggregating tissue expression. 95 integrating gene expression data from undiseased tissues across rna-seq studies Liv er (Uh len ) Liv er (Br aw an d) Liv er (G tex ) Liv er (Ca stle ) Liv er (IB M) Te stis (Ca stle ) Te stis (IB M) Te stis (Uh len ) Te stis (G tex ) Te stis (Br aw an d) He art (Ca stle ) He art (Br aw an d) He art (IB M) He art (Uh len ) He art (G tex ) Kid ne y ( Ca stle ) Kid ne y ( IBM ) Kid ne y ( Gt ex) Kid ne y ( Bra wa nd ) Kid ne y ( Uh len ) Liver (Uhlen) Liver (Brawand) Liver (Gtex) Liver (Castle) Liver (IBM) Testis (Castle) Testis (IBM) Testis (Uhlen) Testis (Gtex) Testis (Brawand) Heart (Castle) Heart (Brawand) Heart (IBM) Heart (Uhlen) Heart (Gtex) Kidney (Castle) Kidney (IBM) Kidney (Gtex) Kidney (Brawand) Kidney (Uhlen) 0.2 0.4 0.6 0.8 1 Pearson Correlation 0 50 10 0 15 0 Co un t Figure 4.3. Heatmap of the four common tissues across the five studies. All protein-coding genes (except the mitochondrial ones) at least expressed at 1 FPKM are included. All the different TREPs cluster by tissue of origin regardless of their study sources. Each of the colours on the top bar following the x-axis is associated to one of the study (purple for Uhlén, blue for Brawand, green for GTEx, orange for Castle and red for IBM), and the colours on the side bar following the y-axis are associated to the tissues (green for Kidney, red for Heart, blue for Testis and orange Liver). 96 4.2 prevalence of biological signal over technical variabilities at tissue-level Sk ele tal m usc le ( Uh len ) Sk ele tal m usc le ( Gt ex) He art (U hle n) He art (G tex ) Te sti s (G tex ) Te sti s (U hle n) Pa nc rea s (U hle n) Pa nc rea s (G tex ) Sk in (Uh len ) Sk in (G tex ) Liv er (Uh len ) Liv er (G tex ) Sal iva ry gla nd (U hle n) Sto ma ch (G tex ) Sto ma ch (Uh len ) Sal iva ry gla nd (G tex ) Oe sop ha gu s (U hle n) Oe sop ha gu s (G tex ) Sm all in tes tin e ( Uh len ) Sm all in tes tin e ( Gt ex) Co lon (U hle n) Co lon (G tex ) Ur ina ryb lad der (U hle n) Sp lee n ( Gt ex) Sp lee n ( Uh len ) Pro sta te (Uh len ) Pro sta te (G tex ) Fal lop ian tu be (Uh len ) Ov ary (U hle n) Ov ary (G tex ) Ut eru s (U hle n) Ur ina ryb lad der (G tex ) Ut eru s (G tex ) Fal lop ian tu be (G tex ) Co rte x ( Uh len ) Co rte x ( Gt ex) Ad ren al (G tex ) Ad ren al (Uh len ) Lu ng (U hle n) Lu ng (G tex ) Th yro id (Uh len ) Th yro id (G tex ) Kid ne y ( Uh len ) Kid ne y ( Gt ex) Ad ipo se (Uh len ) Ad ipo se (G tex ) Skeletal muscle (Uhlen) Skeletal muscle (Gtex) Heart (Uhlen) Heart (Gtex) Testis (Gtex) Testis (Uhlen) Pancreas (Uhlen) Pancreas (Gtex) Skin (Uhlen) Skin (Gtex) Liver (Uhlen) Liver (Gtex) Salivary gland (Uhlen) Stomach (Gtex) Stomach (Uhlen) Salivary gland (Gtex) Oesophagus (Uhlen) Oesophagus (Gtex) Small intestine (Uhlen) Small intestine (Gtex) Colon (Uhlen) Colon (Gtex) Urinarybladder (Uhlen) Spleen (Gtex) Spleen (Uhlen) Prostate (Uhlen) Prostate (Gtex) Fallopian tube (Uhlen) Ovary (Uhlen) Ovary (Gtex) Uterus (Uhlen) Urinarybladder (Gtex) Uterus (Gtex) Fallopian tube (Gtex) Cortex (Uhlen) Cortex (Gtex) Adrenal (Gtex) Adrenal (Uhlen) Lung (Uhlen) Lung (Gtex) Thyroid (Uhlen) Thyroid (Gtex) Kidney (Uhlen) Kidney (Gtex) Adipose (Uhlen) Adipose (Gtex) 0.2 0.4 0.6 0.8 1 Pearson Correlation 0 20 0 40 0 60 0 Co un t Figure 4.4. Heatmap of twenty-three common tissues between Uhlén and GTEx studies. All protein-coding genes (≥ 1 FPKM with the exclusion of the mitochondrial genes) are included. Most TREPs cluster by tissues (y-axis colour bar) rather than by study of origin (x-axis colour bar) with a few exceptions: there is a mixture of the Fallopian tube and Ovary TREPs. In addition, Salivary gland TREPs is more correlated to Oesophagus or Stomach regarding the original study. Urinarybladder TREPs seem to cluster randomly with the others. However, these TREPs are in singleton groups. 97 integrating gene expression data from undiseased tissues across rna-seq studies results: average correlation coefficients are 𝜌 = 0.49 for 𝒲1 and 𝜌𝒲2 = 0.9; the median correlations are 𝜌𝒲1 = 0.88 and 𝜌𝒲2 = 0.93. Both the Pearson⁵ and the Spearman correlation coefficients for the more exhaustive𝒲2 set, which comprises the twomost recent studies, are higher than the observed correlations for 𝒲1. Three main reasons may explain this situation as they contribute to lower the technical variations: • In addition to using paired-end sequencing, the library preparation protocols were better established for these two more recent studies; • The instrument used for the sequencing were from the same series (HiSeq 2000 and HiSeq 2500); and • These studies present a higher number of replicates per tissue. 0.00 0.25 0.50 0.75 1.00 4 common Tissues 23 common Tissues Combined dataset Pe ar so n co rr el at io n Figure 4.5. Distribution of the Pearson correlation coefficients of same tissues pairs for the four and the twenty-three tissues combined datasets. In general, the Pearson correlations are high when we are directly comparing TREPs from different studies. The same-tissue pairs in the 23-tissues combined dataset (𝒲2) present a higher median correlation (0.85) and narrower distribution than in the 4- tissues combined dataset (𝒲1) (median= 0.74). However, 𝒲2 displays one outlier with a very low Pearson correlation (0.2: Salivary gland tissue). Sampling, processing differences or biological reasons may just as well explain this outlier. On the other hand, the pairs comprising different tissues are very lowly correlated in general (see Figure C.7). Although, in a few cases of 𝒲2 (23-tissues combined dataset), high correlations are also observed for different-tissues pairs, e.g. Fallopian tube and Uterus from GTEx study (see also Figure 4.4). It is rather hard to decipher if this may be due to a technical issue (e.g. at the collection or library preparation stage) or because these tissues are biologically very close. As the exclusion of the undefined⁶ genes from the analyses handles one possible source of spurious Pearson correlation (due to null values), I have then checked, and, discarded, 5 Despite one major outlier in the second combined dataset (Salivary gland — Pearson correlation: 𝑟=0.2) 6 I.e. unobserved — See also Section 3.3.3: Expressed or not expressed 98 4.3 global stability of gene expression profiles across studies another possible source that is the skewed distributions; highest expressed genes may be technical artefacts, but they fail to show any significant correlation (see Appendix C.1). The high correlations are the results of the overall similarity of genes expression patterns across tissues and studies. 4.3 global stability of gene expression profiles across studies After validating that RNA-Seq allows distinguishing the shared biological origin of most tissues (TREPs) across different studies, the question then arises as to how consistent is the expression of each gene for a given tissue between studies. To this aim, I first assess the expression variability of the genes across the studies. Then, I explore the interstudy coherence of several gene categories. I focus on the tissue-specific (TS) genes, and on a larger number of categories developed upon classifications from Uhlén’s laboratory. 4.3.1 Genes with tissue-specific (TS) expression Tissue-specific (TS) genes are arguably the genes that ought to present a robust expression profile across studies. Tissue specificity definition The definition of tissue specificity varies from one study to another. See also Santos et al. (2015). Liang et al. (2006) that define tissue specificity only for genes expressed solely in one tissue, and then tissue selectivity for genes expressed in more than one tissue with an expression enriched in one or a subset of tissues. Other studies have a broader definition of tissue specificity. They identify genes above a given threshold of tissue selectivity (or enrichment) as tissue-specific genes (e.g. Fagerberg et al. (2014) and C. Jiang et al. (2016)). In this second case, genes with a single-tissue expression are an extreme case of Tissue- Specific (TS) genes. Within this thesis, I use the second definition, i.e. I consider genes as TS as long as they display a higher tissue selectivity than a preset threshold, regardless of how many tissues express them. Indeed, every study presents a subset of genes that are expressed above 1 FPKM in a sole tissue (see Figure C.2). However, the decreasing number of these genes when increasing the number of considered tissues highlights the arbitrary relativity introduced by the study design. For this definition, there are many methods to characterise genes tissue-specificity (e.g. Cavalli et al. (2011), Xiao et al. (2010), Karthik et al. (2016), P. Kim et al. (2017), Kryuchkova- Mostacci et al. (2017), Kadota et al. (2006), X. Yu et al. (2006), and Martínez et al. (2008)). 99 integrating gene expression data from undiseased tissues across rna-seq studies There are also databases that record previously identified TS genes, for normal conditions, e.g. TiGER⁷ [X. Liu et al., 2008] or TiSGeD⁸ [Xiao et al., 2010] and more specialised ones, e.g. for cancer TissGDB⁹ [P. Kim et al., 2017]. TS genes characterisation approaches used in this thesis From the possible approaches to characterise the TS protein-coding genes, I detail three that I used in the following subsections. First, I have queried TiGER to capitalise on previous knowledge. Then, to derive the TS genes directly from𝒲1 and𝒲2, I have used a published method, that uses the gene expression fold change (FC) ratio across the tissues. Finally, I have employed a robust method designed to detect outliers in data, i.e. Hampel’s test [Hampel, 1974], to identify genes which present an unusual expression level in a single tissue. In fact, as gene tissue selectivity and tissue specificity definitions are relative to a context, if the latter changes, the genes attributes may change as well (e.g. one gene that is non-specific in𝒲2 may be Heart-specific in𝒲1). 4.3.1.1 Use of prior knowledge: TiGER database TiGER [X. Liu et al., 2008] reports TS genes for thirty independent tissues (based on ESTs experiments). After retrieving the list of genes for all reported tissues, I have mapped the RefSeq identifiers provided by TiGER to Ensembl gene identifiers (GRCh38, Ensembl 76). Then, I removed all duplicates due to the identifier translation within each tissue, and I also filtered out all the genes identifiers that I found in more than one tissue: TiGER lists a subset of the same genes in many tissues, but the modification in the annotation may also explain part of the repetitive genes. Thus, for each tissue, I have a list of identifiers that are specific to that tissue only. Figure 4.6 is an expression heatmap based on a subset (i.e. 916) of protein-coding genes that are present in this final list of translated TiGER genes for the Heart, Kidney, Liver and Testis. There are three main types of genes. The largest group comprises the genes with a corroborating profile between the TiGER definition and the real data. Then, a second smaller group encompasses genes listed as TS in TiGER, but fails to demonstrate expression specificity towards any tissue in the real data. Finally, the third group includes a very tiny subset of genes which are more specific to another tissue than the initially stated one. Thus, without any additional knowledge, it is difficult to predict beforehandwhich original TiGER definitions will be confirmed or rejected by expression data. Remarkably, most of the genes present the same expression pattern through the tissues across each of the studies and thus regardless of their TiGER category. Once again, Castle expression data is 7 TiGER — http://bioinfo.wilmer.jhu.edu/tiger/ 8 TiSGeD — http://bioinf.xmu.edu.cn:8080/databases/TiSGeD/index.html 9 TissGDB — https://bioinfo.uth.edu/TissGDB/index.html 100 4.3 global stability of gene expression profiles across studies Testis (Castle) Testis (IBM) Testis (Brawand) Testis (Uhlen) Testis (Gtex) Liver (Castle) Liver (IBM) Liver (Gtex) Liver (Brawand) Liver (Uhlen) Heart (Castle) Heart (IBM) Heart (Uhlen) Heart (Brawand) Heart (Gtex) Kidney (Castle) Kidney (IBM) Kidney (Uhlen) Kidney (Gtex) Kidney (Brawand) Heart Kidney Liver Testis (TiGER annotation) 0 5 10 15 log2(FPKM+1) Figure 4.6. Expression heatmap of the four tissues across the five datasets based on TiGER information. This heatmap illustrates three subsets of genes: genes for which real expression data confirm their TiGER definition; genes failing to show any TS profile in their expression data; and genes with mismatching tissue specificity between TiGER definition and the expression data. The colourbar above the heatmap is representing the tissue for which TiGER annotates the genes (presented as columns) as TS (red forHeart, green for Kidney, light orange for Liver and blue for Testis). TiGER definitions are of variable accuracy. exhibiting the only few observed discrepancies¹⁰. 4.3.1.2 Fold change method As Love et al. (2014) noted the most common approach for detecting a gene expression difference between two conditions is to study the expression fold change (FC) ratio between these conditions. This method is still broadly present in the literature, especially for studies other than differential gene expression analyses¹¹; as examples, see Uhlén, Fagerberg, et al. (2015), Zhu et al. (2016), and N. Y.-L. Yu et al. (2015). Besides, EBI Gene 10 Reminder: the FPKM quantification (used here) is sensitive to the number of identified genes (see Equation (Canonical F/RPKM formula) equation (Canonical F/RPKM formula) on page 24) and Castle study identifies and quantifies many more RNAs than the other studies as it uses a whole RNA protocol while the others are using polyA-enrichment (see Chapter 2). 11 Studies comparing gene expression of a treated or diseased condition to control samples 101 integrating gene expression data from undiseased tissues across rna-seq studies Figure 4.7. Overview for the comparison of the genes across the five studies based on a ranked descriptor. The first step applies individually to each of the studies within the combined dataset (i.e. here 𝒲1). It consists in extracting a single value per gene (e.g. a statistic or any other quantitative descriptor) either for the entire dataset (referred thereafter as D-approach) or for each tissue in each dataset (referred as T-approach). The next steps include computing (cumulatively) the intersection size number for each rank and plotting this number divided by the rank as a function of the number of considered genes (i.e. rank). Expression Atlas [Petryszak, Keays, et al., 2015] relies on this method to select the most specific genes for baseline studies¹² (see Figure C.22). There are also a few variations on how to compute this ratio; see Zhu et al. (2016) and Uhlén, Fagerberg, et al. (2015). In this thesis, I compute the FC ratio by dividing the expression of each gene in a given tissue by the average expression of this gene across all the other tissues of that study in the combined dataset. ℱ𝒞𝑔,𝑡,𝑑 = 𝑥𝑔,𝑡,𝑑 1 𝑛 𝑛 ∑ 𝑖=1 𝑥𝑔,𝑖,𝑑 (Fold change (FC) ratio) where: • 𝑥 is the expression value of the gene 𝑔 in the tissue 𝑡 in a study 𝑑 • 𝑛 is the number of tissues 𝑡 12 In contrast to differential gene studies, the baseline studies focus on depicting the expression landscape of each covered condition instead of focusing on the gene expression through these conditions. 102 4.3 global stability of gene expression profiles across studies Studies usually pick arbitrary cut-offs to characterise the specific genes. Uhlén, Fagerberg, et al. (2015) uses two-fold and five-fold cut-offs to determine enriched and enhanced tissue genes. Zhu et al. (2016) set their cut-off at 2 to characterise TS protein-coding and noncoding transcripts. However, I avoid arbitrary cut-offs, and I use the FC ratio to rank the protein-coding genes of my combined datasets according to their specificity within each tissue: higher FC ratios indicate genes with higher specificity. I then assess the consistency of the tissue specificity of the genes through the various studies. For that, I have followed the T-approach overviewed in Figure 4.7. Figure 4.8. Intersection size curve of𝒲1 genes based on their specificity (FC ratio rank) in each tissue across the five studies. When ranked by specificity in Heart, Kidney and Liver, one fortieth of 𝒲1 protein-coding genes are commonly shared between the five studies. For Testis, the shared amount of genes reaches more than one-tenth of𝒲1 whole set of genes. Compared to the most variable genes (Figure C.19), the most specific genes seem to be more consistent across the studies. Figure 4.8 presents the shifts in the intersection size curves of the four tissues of𝒲1. The most specific genes in each tissue of 𝒲1 are shared between the five studies. Indeed, the slopes are very sharp before reaching a peak and dropping as sharply for the first fortieth genes in Heart, Kidney and Liver. The intersection of the most specific genes is even greater for Testis as it concerns more than a tenth of𝒲1 genes. Figure 4.9 shows that 103 integrating gene expression data from undiseased tissues across rna-seq studies this observation holds true for𝒲2 when the number of tissues and genes is increased. Figure 4.9. Intersection size curve of𝒲2 genes based on their specificity (FC ratio rank) in each of the twenty-three tissues across the two studies. 4.3.1.3 Hampel’s test: detection of atypical expression The last method I used to characterise the TS genes is the Hampel’s test. This test is a robust method for detecting outliers [Davies et al., 1993; Pearson, 2002] in data that are identically and independently distributed (i.i.d.) [H. Liu, Shah, et al., 2004], while easy to implement and use [Linsinger et al., 1998]. Much interlaboratory or interstudy research in the literature e.g. Linsinger et al. (1998), Lewczuk et al. (2006), Rocke (1983), and Apfalter et al. (1999) use the Hampel’s test to detect outliers. The method uses the median and the MAD¹³ to estimate the location and the spread, and a cut-off to define the observations that stand apart. For this thesis, I have derived this test to identify conditions (i.e. tissues) where the gene expression is atypical. I rely on the two facts that most genes are expressed everywhere [Ramsköld et al., 2009; Uhlén, Fagerberg, et al., 2015; Melé et al., 2015] (see also Figure C.2), and they mostly present a limited variation in their expression through the various tissues (see Figures 4.11 and 4.12). There is a tissue specificity for a genewhen its expression to this tissue is atypical, i.e. the expression in this tissue is an outlier to the average expression profile across the other tissues. Besides, this test allows detecting genes that are over- or under-expressed in specific tissues, whereas the other methods are only detecting the 13 MAD: median absolute deviation 104 4.3 global stability of gene expression profiles across studies Testis (Castle) Testis (IBM) Testis (Brawand) Testis (Uhlen) Testis (GTEx) Liver (Castle) Liver (Brawand) Liver (Uhlen) Liver (IBM) Liver (GTEx) Heart (Castle) Heart (IBM) Heart (Uhlen) Heart (Brawand) Heart (GTEx) Kidney (Brawand) Kidney (Uhlen) Kidney (Castle) Kidney (IBM) Kidney (GTEx) Heart Kidney Liver Testis (Hampel's test) 0 5 10 15 log2(FPKM+1) Figure 4.10. Expression of the genes picked consistently with the Hampel method in each study solely in one tissue. overexpressed genes. After implementing themethod (see Algorithm 1, p. 224) with a (widespread) adimensional cut-off of 5.2, I have applied it to𝒲2 and the whole original datasets. 𝒲1 comprises too few tissues to allow detecting atypical expression. Overall, there are always more than 60% of congruence between the genes tagged as atypical in a specific tissue for 𝒲2 and the whole dataset. The proportion of agreement between the partial and whole datasets increases when I filter the results to keep only the genes that are recurrently picked for both Uhlén et al. and GTEx data. Figure 4.10 regroups the genes that the Hampel test detects as outliers for the five studies for their four shared tissues. All corresponding-tissue TREPs present similar patterns of expression regardless of their original study, although the Castle TREPs have overall lower expression values than the others. Other filters may improve the results. Overall, the TS genes, identified separately in each dataset and with several methods, are showing a cleaner biological interstudy signal over possible technical intrastudy noise and are contributing to the high interstudy tissue correlations presented above. 105 integrating gene expression data from undiseased tissues across rna-seq studies 4.3.2 Uhlén categories Uhlén laboratory publications [Fagerberg et al., 2014; Uhlén, Fagerberg, et al., 2015; Uhlén, Hallström, et al., 2016] use different categories of genes to describe the normal human transcriptome. As their classification changes between these related papers, I have redefined a classification based on them (presented in Table 4.1) before applying it to𝒲1 and𝒲2 (Table 4.2). The following classification considers the breadth¹⁴, the level and the specificity of the gene expression. Table 4.1. Gene classification adaptation of Uhlén et al. classification [Fagerberg et al., 2014; Uhlén, Fagerberg, et al., 2015; Uhlén, Hallström, et al., 2016] Category Definition Not detected Never detected above 0 FPKM Not expressed Never detected above 1 FPKM Mixed high Expressed in a subset of tissues and always ≥ 10 FPKM Mixed Low Expressed in a subset of tissues and always < 10 FPKM Ubiquitous High Expressed in all the tissues and always ≥ 10 FPKM Ubiquitous Low Expressed in all the tissues and always < 10 FPKM Group enhanced Expressed in a subset of tissues with an expression ≥ 5 ∗meanall the tissues Tissue enhanced Expressed in a single tissue with an expression ≥ 5 ∗meanall the tissues Tissue enriched Expressed in a single tissue with an expression ≥ 5 ∗Maxall the other tissues Table 4.2 shows that for many of these categories, the number of shared protein-coding genes is high between the different studies of the two combined datasets𝒲1 and𝒲2. It supports the previous results that protein-coding genes present in general a similar gene expression profile for a same set of tissues across studies. 4.3.3 Similar expression variability of the genes across studies To further appraise the robustness of gene expression, I have studied more globally their variability across studies. There are several available estimators to describe the gene expression variability, e.g. the standard deviation (sd) the variance (𝑠𝑑2) or the coefficient of variation ( 𝑠𝑑𝑚𝑒𝑎𝑛 ). I only report here the results based on the coefficients of variation. The coefficient of variation (cv) allows assessing the dispersion of the gene expression values across the tissues within each dataset. As it adjusts for the mean, it is a more straightforward estimator to interpret 14 The breadth of expression of a gene is the number of tissues (or cell lines) in which it is expressed. 106 4.3 global stability of gene expression profiles across studies Ta ble 4.2 .U hl én et al. ge ne ca teg or ies Ap art fro m the un de tec ted ge ne sa nd the on es ex pre sse db elo w 1F PK M, ag en em ay be ref ere nc ed in sev era lc ate go rie s. En sem bl 76 (∼ 22 ,50 0p rot ein co din gg en es) No t de tec ted No te xp res sed at 1F PK M cu t-o ff Mi xe de xp res sio n Ub iqu ito us ex pre ssi on Gr ou p En ha nc ed Tis su e En ha nc ed Tis su e En ric he d Lo w (< 10 FP KM ) Hi gh (≥ 10 FP KM ) Lo w (< 10 FP KM ) Hi gh (≥ 10 FP KM ) Wholedataset Ca stl e 3,4 03 3,2 68 8,7 73 1,0 33 1,3 99 63 4 11 3,6 64 1,9 75 Br aw an d 2,9 64 3,0 95 8,0 34 1,7 88 1,7 60 95 8 0 2,7 29 2,5 48 IBM 2,6 93 2,6 05 7,3 25 1,4 06 1,1 35 85 8 32 2 5,2 48 2,4 53 Uh lén 2,6 62 1,7 47 5,7 69 1,0 53 45 6 40 6 2,5 11 5,2 01 2,3 33 GT Ex 2,1 97 1,8 86 5,5 56 1,1 17 68 7 69 8 3,8 59 4,3 56 1,9 19 Co ns en su s 2,1 97 48 6 1,7 49 22 1 33 16 1 0 67 7 53 1 Common 4-tissues combined datasets Ca stl e 19 ,06 6 2,9 94 8,5 89 1,5 13 2,9 94 10 94 — — 2,1 85 Br aw an d 19 ,50 5 2,9 62 8,6 26 2,2 28 2,9 62 12 51 — — 3,6 72 IBM 19 ,77 6 2,9 89 8,5 34 1,9 54 2,9 89 12 12 — — 2,8 24 Uh lén 19 ,80 7 2,9 17 8,3 67 2,2 27 2,9 17 11 90 — — 3,7 30 GT Ex 20 ,27 2 3,8 70 8,9 88 2,3 12 3,8 70 14 27 — — 3,5 54 Co ns en su s 1,9 73 55 0 3,3 51 64 9 55 0 43 9 — — 1,4 12 Common 23-tissues combined datasets Uh lén 2,6 62 1,9 70 6,1 60 1,1 35 59 4 42 7 1,2 85 5,7 76 2,5 18 GT Ex 2,1 97 2,2 58 6,9 66 1,5 40 1,8 22 99 7 1,0 48 5,4 96 2,4 60 Co ns en su s 2,1 97 1,5 44 4,9 36 79 1 42 3 41 7 55 8 4,2 23 1,8 85 107 integrating gene expression data from undiseased tissues across rna-seq studies than the variance itself, in particular for interstudy comparisons. As depicted in Figure 4.11 (and Figure 4.12 for 𝒲2), the distribution of gene expression coefficients of variation presents a similar pattern across the five studies of𝒲1. The five datasets present two peaks. One at approximately 0.5 which characterises genes that are quite invariant in their expression across the four tissues within each study. Another subset of genes forms a peak for coefficients of variation equal to 2. This last group of genes are the most variable ones in each dataset. There is an overlap of the most variable genes between the five datasets (as shown in Figure C.19). While Figure 4.4 has already established that expression profiles for each tissue are very similar across the studies, Figure 4.12 highlights that most genes seem to share the same intertissue expression profile variability as the distribution of the coefficients of variation across Uhlén et al. and GTEx studies are very alike. Uhlen Gtex Castle Brawand IBM 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 0.0 0.5 1.0 Coefficient of variation (cv) D en sit y Figure 4.11. Distribution of the coefficients of variation (cv) across𝒲1 (commonset of expressed protein-coding genes across the four common tissues): {Heart, Kidney, Liver, Testis} across the five studies. The coefficients of variation (cv) of the protein-coding genes (12,268) of the four tissues present the same bimodal distribution profile across the five studies. These profiles present two peaks: at 0.5 and 2. Genes with a cv less than or equal to 0.5 have a similar expression profile to a left-truncated version of the complete gene set ones (due to the 1 FPKM cut-off) as in Figure 3.1. On the other hand, the protein-coding genes with a coefficient of variation equal to or greater than 1.5 have two kinds of distinct profiles: • The gene expression is low across the four tissues, and it is above the cut-off of 1 FPKM only once (see Figure C.21); or • The gene expression is specifically high for one single tissue relative to the three others (See Figure C.18). 108 4.4 discussion Uhlen Gtex 0 1 2 3 4 5 0 1 2 3 4 5 0.0 0.3 0.6 0.9 Coefficient of variation (cv) D en sit y Figure 4.12. Distribution of the coefficients of variation across𝒲2. The bimodal distribution is more unbalanced than in Figure 4.11. Indeed, as more tissues are included for the calculation of the coefficients of variation, the second peak is found around 5. This peak has a smaller amplitude than the peaks at 2 in Figure 4.12. There are still many genes that have a coefficient of variation around 2. However, the overall distribution of the higher coefficients of variation is smoother than for 𝒲1. Hence, most genes present a similar profile of expression through the various tissues. More in-depth analyses confirmed that overall the genes present an equivalent coefficient of variation from one study to another for the same tissue set. See Figures C.19 and C.20. 4.3.4 Curated sets Together the results from the previous sections indicate that many genes categories (if not all of them) have an equivalent (i.e. stable) expression profile across studies for the same tissues. Protein-coding genes that were characterised consistently as any of the aforementioned categories across the five datasets of 𝒲1 or the two of 𝒲2 are provided digitally as supplementary data. They can be found at http://www.barzine.net/~mitra/thesis. The code required to produce these results can be found at https://github.com/barzine/phd-analyses. 4.4 discussion In this chapter, I have directly compared and integrated the human tissue transcriptome from five RNA-Seq studies. The meta-analyses are based on the largest number of undiseased human tissue studies to date. I have constructed two combined datasets of protein-coding genes. The first one (𝒲1) comprises four tissues and 12,268 shared genes extracted from five independent studies (Krupp et al., 2012; Brawand et al., 2011; Uhlén, Fagerberg, et al., 2015; Melé et al., 2015 and IBM) and the second one (𝒲2) comprises twenty-three tissues and 17,554 shared genes from two studies (Uhlén, Fagerberg, et al., 109 integrating gene expression data from undiseased tissues across rna-seq studies 2015; Melé et al., 2015). Clustering analyses (and Welch’s Two Sample t-test) of these two sets confirm that RNA-Seq technical noise is lower than relevant biological signals present in the data. Indeed, Sudmant et al., 2015; Danielsson et al., 2015; N. Y.-L. Yu et al., 2015 and Uhlén, Hallström, et al., 2016 also observe that interstudy corresponding tissues pairs are more related than intrastudy non-corresponding tissue ones (average correlation for corresponding tissue-pairs 𝑟 = 0.75, 𝜌 = 0.88; average for non-corresponding tissue-pairs 𝑟 = 0.20, 𝜌 = 0.75). I have then shown that overall genes present similar interstudy expression variability profiles for the same tissue set. I have considered different gene groups to examine their coherence of expression profiles more closely. Since there is no generally accepted definition of a TS gene (see Section 4.3.1), I have relied on three different methods to study them, including extracting TS gene definitions from an existent resource TiGER [X. Liu et al., 2008] that I have updated to the current human genome build (GRCh38). Mining the experimental data with this updated list highlights the need for caution when dealing with older resources. While the congruence of the three methods is partial, the TS genes show distinct expression profiles across tissues that are rather consistent through the different studies. I have also explored the congruence of several other genes categories across the studies following a classification inspired by Uhlén et al. publications [Fagerberg et al., 2014; Uhlén, Fagerberg, et al., 2015; Uhlén, Hallström, et al., 2016]. These gene categories are based on the level and breadth of expression (‘Not detected’, ‘Not expressed at 1 FPKM’, ‘Ubiquitous low expression (< 10 FPKM)’, ‘Ubiquitous high expression (≥ 10 FPKM)’, ‘Mixed low expression (when expressed,< 10 FPKM)’, ‘Mixed high expression (when expressed,≥ 10 FPKM)’, ‘Group Enhanced’, ‘Tissue Enhanced’, and ‘Tissue Enriched’). Finally, I have compiled all the genes showing a consistent pattern of expression through the meta-analyses across the studies into curated sets. Since I started this project, other research groups have published similar studies. However, at the time of writing this thesis, all the other studies (including the aforementioned ones) were still based on the human genome build GRCh37 (or hg19), while I am using the more recent GRCh38 one. These studies either have different focus, aims, approaches or more limited scopes. Below, I outline how my work expands or completes theirs. • M. Uhlén, B. M. Hallström, et al. (2016). ‘Transcriptomics resources of human tissues and organs’. Mol. Syst. Biol. 12 (4) This review presents the results of N. Y.-L. Yu et al. (2015) and Danielsson et al. (2015) discussed in the following points. It also compares data released by the GTEx consortium [Bahcall, 2015; GTEx Consortium, 2015; Gibson, 2015] to its authors’ own dataset (Uhlén data [Fagerberg et al., 2014; Uhlén, Fagerberg, et al., 2015]). As many findings from other studies (that I discuss hereafter) are reviewed, the comparison of the two 110 4.4 discussion studies is limited to the examination of the proportion of genes in each category of a simplified classification for nineteen tissues. We reach the same conclusions: overall, there are significant overlaps across the datasets for each of the categories they have considered in their study, i.e. ‘Expressed in all’, ‘Not detected’, ‘Tissue enriched’, ‘Group enriched’, ‘Enhanced’ and ‘Mixed’. • P. H. Sudmant et al. (2015). ‘Meta-analysis of RNA-seq expression data across species, tissues and studies’. Genome Biol. 16, p. 287. The authors focus on interspecies, intertissue and interstudy comparisons. The major issue of the study is its use of TMM as a gene expression unit. TMM normalisation has the assumption that genes have a stable expression across conditions while the different analyses I have presented indicate that many protein-coding genes expression profiles show tissue specificity while they have a stable expression across the studies. They also limit their scope to very specific orthologs since they explore RNA-Seq expression data across species, tissues and studies. They confirm that with RNA-Seq expression profiling, the interstudy technical variation is generally lower than the intrastudy biological one, i.e. interstudy homologous tissues of the same species are usually closer in similarity than different tissues of the same species (or matched tissues of different species) of the same study. They found that interstudy comparisons are more variable for human than other species. They finally note that this kind of meta- analysis is dependent on the choice of tissues to be studied. • N. Y.-L. Yu et al. (2015). ‘Complementing tissue characterization by integrating transcriptome profiling from the Human Protein Atlas and from the FANTOM5 consortium’. Nucleic Acids Res. 43 (14), pp. 6787–6798, integrate Uhlén et al. data [Fagerberg et al., 2014] with CAGE peak expression data from the FANTOM5 consortium [FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014]. Overall, their analyses are very similar to the ones I have presented in this chapter. We also reach similar conclusions as well: – Overall gene expression is comparable through their two datasets. – Tissue expression signatures are independent of the data set (and profiling method). – Global comparison of ubiquitously expressed and TS genes are comparable across the two studies – They compare the two datasets at gene levels because of the lack of accuracy of the current RNA-Seq protocols and algorithms and focus on the protein- coding genes as the level of agreement between the two studies is low (which they attribute to the polyA-selected protocol of Uhlén et al. data). Besides the choice of the original studies, the few differences are (1) the version of the annotation (they use the previous human genome build (GRCh37)) and (2) they apply more simplified classification for ubiquitous and TS genes. In addition, they are also more lenient to assess the congruence (e.g. expressed in all tissues in one dataset and 95% of the tissues in the other datasets is considered as concordant). Strikingly, they also found a discrepancy with Salivary gland compared to the other 111 integrating gene expression data from undiseased tissues across rna-seq studies tissues which I have noticed in this study¹⁵. • F. Danielsson et al. (2015). ‘Assessing the consistency of public human tissue RNA-seq data sets’. Briefings Bioinf. 16 (6), pp. 941–949, covers three different tissues (Brain¹⁶, Heart, Kidney) extracted from five projects (E. T. Wang et al., 2008; Brawand et al., 2011; Uhlén, Fagerberg, et al., 2015; Krupp et al., 2012 and IBM). Their study is limited to the comparison of precomputed data (from the original laboratories) versus uniformly reprocessed data (by themselves). They are exploring experimental variation factors and possible correction strategies. They reveal that original precomputed data have considerable study-specific biases. Their results on interstudy tissue similarities are superficial. One of their most ‘fine-grained’ results is the very low number of shared genes amongst the hundred most expressed genes for the three tissues across their five datasets. • A. Santos et al. (2015). ‘Comprehensive comparison of large-scale tissue expression datasets’. PeerJ 3, e1054, focuses on the congruence of gene—tissue association through different types of expression data (transcriptome and proteome) and resources. They report that most genes are either expressed in every considered tissue or in small subsets in their constructed working dataset. They also find that tissue specificity trends are globally similar, even though there are many differences in the identified genes set across studies. They have integrated together five tissues (Heart, Kidney, Liver, Nervous system and Small intestine) from five transcriptome datasets that they use ‘as-is’¹⁷, before refining a complementary set that comprises only the three highest-quality datasets: UniGene database¹⁸ [Wheeler et al., 2003; Pontius, Joan U. and Wagner, Lukas and Schuler, Gregory D., 2002], Uhlén et al. data (HPA RNA-Seq) [Fagerberg et al., 2014; Uhlén, Fagerberg, et al., 2015] and Castle et al. (RNA-Seq atlas) [Krupp et al., 2012] for which they provide association data for 14,722 genes. They forsake the direct quantitative exploration and comparison of gene expression between the studies, but examine the tissue association enrichment through the gene expression fold change (FC) for a qualitative cross-study. The main issue with their transcriptomic study is their assumption that higher expression¹⁹ means more robust gene—tissue association which may often (but wrongly) be translated to a greater tissue-specificity. • Q. Wang et al. (2017). ‘Enabling cross-study analysis of RNA-Sequencing data’. bioRxiv (110734), have used subsets of GTEx and the cancer genome atlas (TCGA) raw data that they have quantified mRNAs at transcript levels with GRCh37 (hg19). They have corrected for the study effect with ComBat [W. E. Johnson et al., 2007] and have released the normalised data to the community. Note that EBI Gene Expression Atlas provides more recent versions of the GTEx and TCGA (as a part of the pan- 15 Salivary gland is the only tissue for which Uhlén and GTEx show a Pearson correlation coefficient 𝑟 < 0.65. 16 Either Cerebral cortex or Hypothalamus 17 Even in the follow-up paper, where they reprocess all the RNA-Seq data formouse, rat, pig, Palasca et al. (2018) do not mention any improvement for the human RNA-Seq data (either in the results or methods and data). 18 UniGene database — https://www.ncbi.nlm.nih.gov/unigene 19 FPKM values are directly used as score for true presence and selectivity to a tissue 112 4.4 discussion cancer analysis of whole genomes (PCAWG) project²⁰) data. • SEQC/MAQC-III Consortium (2014). ‘A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium’. Nat. Biotechnol. 32 (9), pp. 903–914, have found that relative expression measurements by RNA-Seq are accurate and reproducible across sites. The authors also showed that the overlap of identified and characterised genes is imperfect (91%) even when the design includes the same two well-characterised reference RNA samples across all sites. Also, this specific design prevents inferring how the biological signal may compare to the individual variations and the possible noise introduced by collection, storage and extraction protocols. • Several papers (Khang et al., 2015; Peixoto et al., 2015; Rau et al., 2014) explore the reliability of RNA-Seq in the context of DGEA which is outside the scope of this thesis. • Zhuo et al., 2016 explore the stable expressed genes across multiple (24) RNA-Seq studies for Arabidopsis which is why I will not discuss it further. In summary, while the expression levels are hard to translate directly from one study to another [N. Y.-L. Yu et al., 2015; Santos et al., 2015], many facts have been highlighted in this thesis with a subset of them confirmed by the studies mentioned before: • Tissues are clustering preferably with corresponding (or closely biologically related) tissues even across studies rather than clustering with different tissues from the same study (i.e. biological signal >>> technical noise due to RNA-Seq protocols). • More recent transcriptome studies are more congruent than previous ones. • Testis presents the highest number of TS protein-coding genes (see Figure C.1). It also presents the most variety of expressed protein-coding genes (≥ 1 FPKM) Castle, Brawand, IBM and Uhlén studies. This extends to GTEx study if all detected genes (i.e. above 0 FPKM) are considered. • Liver has the most robust TS protein-coding genes. It may be explained either by a robuster gene expression, its greater homogeneity than most tissues, or a greater knowledge and better annotation than the other tissues. • Most genes are ubiquitously expressed while a small proportion are expressed in a very limited set of tissues. • Well-differentiated tissues have specific expression profiles that allow using processed data ‘as-is’ for rough comparisons such as sample swap checks or quality controls, e.g. Salivary gland in Uhlén et al. data which presents low correlation with GTEx data (see Figure 4.5, p. 98) and with FANTOM5 data [N. Y.-L. Yu et al., 2015]. • Annotations have an essential effect on the final results. Thus, whenever possible, we ought to keep the resources up-to-date. 20 PCAWG project analyses conjointly all available kind of research data related to cancer 113 integrating gene expression data from undiseased tissues across rna-seq studies Unsurprisingly, updating the human genome version from GRCh37 to GRCh38 for the reconstruction step²¹ enhances the results significantly. Besides, as my analyses were incorporatingmore studies, the results were supportingmore similarity in the gene expression levels across the tissues and studies. Indeed, as I focus my analyses on the common set of genes across the studies, I remove the most interstudy variant genes (i.e. that are probably more sensitive to technical factors), and I bias the analyses towards the genes for which RNA-Seq is more robust to quantify their expression profiles. Hence, it may be interesting to relax the filters by including genes that are found in any two or more datasets as a follow-up study. 4.5 conclusion The meta-analyses show that RNA-Seq captures a strong biological signal for tissue gene expression despite any noise created by batch effect or technical variations. Tissues reference expression profiles (TREPs) are well correlated across independent studies and are the sum of the overall genes contributions. While highest expressed genes fail to show significant correlation between the different studies, the analyses show that most gene expression profiles are comparable from one study to another. The gene centred heatmap (Figure 4.13) available now in EBI Gene Expression Atlas²² [Petryszak, Keays, et al., 2015] and its corresponding widget are a direct translation of this observation. To assist further research, I provide extensive curated and consolidated gene sets for the different categories reviewed in the above analyses, i.e. for the TS genes, and the categories derived from Uhlén publications: ‘Not detected’, ‘Not expressed at 1 FPKM’, ‘Ubiquitous low expression (< 10 FPKM)’, ‘Ubiquitous high expression (≥ 10 FPKM)’, ‘Mixed low expression (when expressed,< 10 FPKM)’, ‘Mixed high expression (when expressed,≥ 10 FPKM)’, ‘Group Enhanced’, ‘Tissue Enhanced’, and ‘Tissue Enriched’. There is a need for more multi-tissue studies with biological replicates to refine and complete the above findings for other tissues and extend them to the transcript isoform level. New strategies, notably normalisation methods, have to be also developed to allow the easy reuse of uniformly processed and quantified data by the community. Ideally, the final aim should be to provide a general human transcriptome build as it already exists for the genome. Finally, as long as the annotations are redefined and refined, it also means that periodic resources reprocessing may be inevitable. 21 See Section 1.2.5.2: Reconstruction strategies 22 EBI Gene Expression Atlas — https://www.ebi.ac.uk/gxa/ 114 4.5 conclusion Fig ur e4 .13 .E xa m pl eo fE BI ge ne ex pr es sio n atl as ge ne ce nt ric he atm ap .T his he atm ap sh ow st he rel ati ve ex pre ssi on of the Al bu mi ne (EN SG 00 00 01 63 63 1) acr oss the tis su es an ds tud ies .N ote tha tt he ex pre ssi on is cal cu lat ed wi thi ne ach stu dy lib rar yb efo re be ing ag gre ga ted by ide nti cal co nd itio no rt iss ue . 115 I was taught that the way of progress is neither swift nor easy. Marie Curie 5 HUMAN MS -BASED PROTE INEXPRESS ION LANDSCAPE After exploring the high-throughput human transcriptomic studies in Chapter 4 and before integrating themwith proteomic data in Chapter 6, I present in this chapter the comparison of the three proteomic datasets introduced in Chapter 2. Ezkurdia et al. (2014) and Deutsch et al. (2015) have partially reviewed these data. However, we have reprocessed them starting from the raw data for this thesis. In this context, reassessing the quantified processed proteomic data before any integration is pertinent. The work presented in this chapter was done in collaboration with Dr James Wright who has implemented the new protein quantification method (presented in Section 5.2). I have received general feedback from Dr Alvis Brazma, Dr Mar Gonzàlez-Porta, Dr Sarah Teichmann. 5.1 an overall fragmented and disparate universe to explore As I have described in Chapter 1, proteins present a wide range of physicochemical properties (see Section 1.1 and Appendix A.1) and are challenging to identify and quantify in high-throughput studies (see Section 1.3). Thus, it comes as no real surprise that while the use of MS for proteomics has been developing since the 1980s [Papachristodoulou et al., 2014], the first notable attempts to draft the human proteome occurred only recently in 2014 [M.-S. Kim et al., 2014; Wilhelm et al., 2014], or that the oldest (unpublished) available multi-tissue Cutler dataset is from 2010 (see Section 2.3.3). Till early 2019, Cutler Lab, Kuster Lab and Pandey Lab datasets are the only ones that explore concurrently several non-diseased human tissues. See Section 2.3 for more details and the processing pipeline designed and implemented by Dr James Wright to handle them. As presented in Figure 5.1, they share four tissues: Heart, Lung, Ovary and Pancreas. The protein overlap of this four-tissues set between these three datasets is rather narrow as shown in Figure 5.2. The Cutler Lab dataset shares the smallest number of tissues with the two other studies; Pandey Lab and Kuster Lab datasets share over twice as many proteins that they sharewith Cutler (3,338 instead of 1,384). The number of shared proteins between Pandey Lab and Kuster Lab data rise to 4,172 when all their fourteen common tissues are considered. 117 human ms-based protein expression landscape 5 221 4 10 15Cutler Kuster Pandey Figure 5.1. Distribution of unique and shared tissues between the threeMS-based proteomic studies. The three datasets share together: Heart, Lung, Ovary and Pancreas. The two most recent studies share fourteen tissues in total; the additional ten tissues are: Adrenal gland, Colon, Gallbladder, Oesophagus, Kidney, Liver, Placenta, Prostate, Rectum and Testis. 286 76 373 444 1384 1510 1685 Cutler Kuster Pandey Figure 5.2. Proteins overlap between the common four tissues of the three proteomic studies. Unique and shared proteins detected and quantified across the three MS studies for their four shared tissues: Heart, Lung, Ovary and Pancreas. 5.1.1 MS proteomic data has high detection variability Figure 5.3 illustrates the number of proteins identified in each of these four tissues. The colours indicate in which dataset (or group of datasets) the proteins have been identified. See Figure D.1 for the precise numbers of each set. The tissue with the highest number of identified proteins, regardless of which dataset, is the Ovary. The highest number of proteins identified in all three datasets at once (600) is in the Heart. The other tissues have the Kuster and Pandey set as their largest protein group formed by 118 5.1 an overall fragmented and disparate universe to explore Heart Lung Ovary Pancreas 0 1000 2000 3000 4000 Protein count Ti ss ue Present in All 3 datasetsKuster & Pandey Cutler & Pandey Cutler & Kuster Pandey only Kuster only Cutler only Figure 5.3. Number of identified proteins in each of the four common tissues for the three proteomic data. Proteins found in more than one dataset are most likely true (in red, light and darker green or purple — the most validated ones in red as they are found in all three datasets). See Figure D.1 for the precise numbers in each set. more than one dataset. The largest set of identified proteins in Pancreas and the second one in Ovary are proteins only found in the Pandey Lab dataset (Pandey only). As shown in Figure 5.4, our state- of-the-art pipeline (see Section 2.4.2) has identified the highest number of proteins in the Pandey Lab dataset. Thus, it is coherent that Pandey Lab proteins represent a large part of the identified proteins in each tissue (either as Pandey only set or in agreement with the other datasets: All 3 datasets, Kuster & Pandey or Cutler & Pandey). More surprising is that the Cutler dataset comprises a notable amount of proteins in Lung that are missing in the other two. A few of these proteins (82) are missing altogether, but a subset of them (410) is still found in (at least) another tissue of the other datasets. While proteins found in more than one dataset are more likely true positives, it is impossible to exclude without risks the ones that are identified in one dataset only. Whether an identified protein in one dataset is an artefact (i.e. false positive) or a miss (false negative) in the other datasets is a challenging question; the diversified nature of proteins involves many sample preparation and simplification methods (see Sections 1.3.1 and 1.3.2). 5.1.2 Overall about half of the proteins identified in each study for any given tissue are validated in a different study. As shown in Figures 5.3, D.1 and D.3, besides a few exceptions (Oesophagus, Gallbladder and Testis), more than half of the proteins are identified in the same tissue in more than one 119 human ms-based protein expression landscape 0 1000 2000 3000 4000 5000 Pla tel ets ly sat e Lu ng Pa nc rea s Br eas t He art Bo ne Ad ipo se Pla tel ets CS F Ov ary N um be r o f p ro te in s Cutler 0 1000 2000 3000 4000 5000 Te sti s Ly mp h n od e An us Sal iva ry gla nd Sto ma ch Oe sop ha gu s Ad ren al Sp lee n To nsi l Ov ary Co lon Pro sta te Lu ng Ut eru s Tu be Pa nc rea s Pla cen ta Liv er Re ctu m Sk in Kid ne y Sem ina l v esi cle Th yro id Ce rvi x Or al cav ity He art Ga llb lad der Co rte x Vu lva As cit es Na sop ha ryn x Mi lk Sal iva Ea rw ax Ha ir f oll icl e N um be r o f p ro te in s Kuster 0 1000 2000 3000 4000 5000 Te sti s Ov ary Re tin a CD 8T ce lls B c ell s Fet al Ov ary Pa nc rea s Pro sta te Fro nta lco rte x Fet al He art Fet al Te sti s NK ce lls CD 4T ce lls Liv er Fet al Liv er Fet al Gu t Co lon Sp ina lco rd Fet al Br ain Ur ina ryb lad der Mo no cyt es Ad ren al Re ctu m Ga llb lad der Kid ne y Pla cen ta Lu ng Pla tel ets He art Oe sop ha gu s Tissue or cell type N um be r o f p ro te in s Pandey Protein is Tissue/cell specific Unspecific to tissue or cell Figure 5.4. Distribution of the proteins per tissue across the three datasets. Cutler Lab dataset has the smallest and Pandey Lab the highest number of proteins per tissue. Coloured in red are the proteins that are specific to one tissue (or cell type); these proteins have been identified in one tissue solely within each dataset. Proteins in turquoise have been identified in several tissues of the dataset. 120 5.1 an overall fragmented and disparate universe to explore dataset. Three proteins are found in every tissue of every dataset: ALB, KRT9, KRT10. This number rises to forty when only the Pandey Lab and the Kuster Lab data are considered (see Table D.2). I have also investigated tissue-specific (TS) proteins (in red in Figure 5.4) that are also identified inmore than one dataset. While the three datasets lacked to identify any TS protein at once, Pandey Lab and Kuster Lab datasets share a few (44 across eight tissues) — see Table D.3 for the complete list. TS proteins are more difficult to confirm through different datasets, but one needs to be careful with the ubiquitous proteins as well. The latter may be present in the samples due to contamination: none of the three ubiquitous proteins (ALB, KRT9 and KRT10) is detected in Heart by the Human Protein Atlas¹ [Uhlén, Fagerberg, et al., 2015] while they are found in epithelial cells. One hypothesis is that contamination occurredwhen sampling from the donor or during the preparation or MS analysis. ALB found in the tissues is more likely coming from the blood supply (where ALB is abundant); KRT9 and KRT10 are environmental contaminants. Lists for ubiquitous and TS proteins of each dataset separately and across the three (when consistent) are given as digital supporting data. 5.1.3 Technical variability prevails over biological signal: intrastudy correlations of different tissues are globally stronger than same-tissue interstudy correlations. After defining the (1,384) protein set that is consistently detected in the four common tissues of the three datasets, I have assessed how consistent is their expression quantification across tissues and studies. Following a similar approach to Figure 4.3 (p. 96), I cluster the twelve proteomic TREPs². I have used Ward’s method to link the TREPs based on their similarity that I have computed by subtracting from 1 the pairwise Spearman correlation of the expression levels of the 1,384 common proteins. As shown in Figure 5.5, the technical variability overcomes the biological signal as the proteomic TREPs cluster according to their original laboratory/study rather than their biological source, except for Cutler Lab and Pandey Lab Heart. However, Cutler Lab and Pandey Lab share the same organisation of their remaining tissues: Pancreas and Lung are the most correlated, and their pair is in turn most correlated to Ovary. Kuster Lab TREPs display the greatest amount of study bias (probably due to stronger batch effects — see Section 1.5.1). Removing the proteins translated from the mitochondrial genes slightly improves the results but excluding the three ubiquitous (likely contaminants) proteins, presented in Section 5.1.2, is impactless. 1 Human Protein Atlas — https://www.proteinatlas.org/ 2 Tissue reference expression profile. See Section 3.3.4 on page 87 for more details on TREPs. 121 human ms-based protein expression landscape Lu ng (C utl er) Pa nc rea s ( Cu tle r) Ov ary (C utl er) He art (C utl er) He art (P an de y) Ov ary (P an de y) Lu ng (P an de y) Pa nc rea s ( Pa nd ey ) Lu ng (K ust er) Ov ary (K ust er) He art (K ust er) Pa nc rea s ( Ku ste r) Lung (Cutler) Pancreas (Cutler) Ovary (Cutler) Heart (Cutler) Heart (Pandey) Ovary (Pandey) Lung (Pandey) Pancreas (Pandey) Lung (Kuster) Ovary (Kuster) Heart (Kuster) Pancreas (Kuster) -1 -0.5 0 0.5 1 Spearman Correlation 0 10 20 30 Co un t Figure 5.5. Heatmap of the four common tissues between the three proteome datasets based on the pairwise Spearman correlations clustering of the expression levels of 1,384 shared proteins. The samples mostly cluster by laboratory. Only Heart from Cutler Lab and Pandey Lab have a stronger intratissue correlation than an intrastudy one. Both for Pandey Lab and Cutler Lab, Pancreas and Lung are more correlated to each other and their pair toOvary. The heatmap (Figure D.4) based on Pearson correlation instead prompts globally identical observations. See also Figure D.5 that shows (as a scatterplot) the relationship for Heart between Pandey Lab and Cutler Lab, and then, Figure D.9 between Pandey Lab and Kuster Lab. 122 5.2 new quantification method Neither applying quantile normalisation or widespread scaling methods on top of this quantification allowed correcting for the technical variability. Because of the limited number of proteins included in this analysis, it is unwise to draw definite conclusions except that there is an extensive need for new quantification normalisation methods that can help with protein expression meta-analyses and, as for now, one has to be very cautious when comparing proteome samples. I attempted to expand this analysis by comparing the fourteen common tissues of Pandey Lab and Kuster Lab datasets, but the results are as inconclusive (see Figures D.7 and D.8). 5.2 new quantification method In this thesis context where I aim to integrate the mRNA expression levels to the protein ones (presented in Chapter 6), I have developed with the help of Dr James Wright a new method to infer and quantify the proteins. Our original processing workflow (detailed in Section 2.4.2) intends to provide a reliable, state-of-the-art, protein quantification. Our method seems more rigorous than M.-S. Kim et al. (2014) original paper; Ezkurdia et al. (2014) challenge the correctness of their quantification by highlighting the disputable presence of olfactory receptors (ORs) in many tissues. On the other hand, our workflow lacks to detect or quantify any possible OR in the Pandey Lab data. The left part of Figure 5.6 summarises our first method of quantification. While probably more accurate, our state-of-the-art quantification method is also more stringent than the original authors’ one. Thus, the total number of quantified proteins is more limited. As presented in Chapter 1, protein quantification methods rely on PSMs identification (see Section 1.3.4.2) and a chosen approach for protein inference (see Section 1.3.4.3). I realised that our main limiting factor is that we get quantification only for proteins that have at least three unique peptides since this first method is based on the Top3 approach [Silva et al., 2006] and uses the three unique³ most expressed peptides of a protein to estimate its overall expression (see Equation (Top3 IBAQ) on p. 123 and Section 1.3.4.4). ̂𝜇𝑖𝑗 = ∑ Intensity of Top 3 unique peptides𝑖𝑗 Total Intensity in experiment 𝑗 (Top3 IBAQ) where: • ̂𝜇𝑖𝑗 is the normalised expression for the protein 𝑖 in experiment 𝑗 (normalised PSM), • ∑ Intensity of Top 3 unique peptides𝑖𝑗 is the sum of the intensity of the three most intense unique peptides of the protein 𝑖, • Total Intensity in experiment 𝑗 is the total sum of the intensity of all the peptides identified in the experiment 𝑗. 3 I.e. exclusive to a single protein 123 human ms-based protein expression landscape 56 Million raw MS spectra Search pipeline (see Figure 2.3 – Chapter 2) 48 Million PSMs assigned Filtering (High confidence : 0.001% FDR) Filtering (1 % FDR) 7.2 Million PSMs 200,771 peptides 17.5 Million PSMs 3.3 Million peptides Mapping to Ensembl Protein Coding genes (≥ 3 unique peptides) Mapping to Ensembl Protein Coding genes (> 1 unique peptide) Top3 quantification & Normalisation per experiment PPKM quantification & Normalisation per experiment First quantification method New quantification method Average quantification per tissue Average quantification per tissue 12,290 proteins (no cluster)6,436 proteins (no cluster) Figure 5.6. Two quantification methods applied to Pandey Lab data. For both approaches, the search pipeline assigning PSMs remains identical (see Section 2.4.2). The first quantification method, which follows a robust inference method involving at least three unique peptides, relies on the intensity of the three most intense unique peptides. The new quantification method that I have devised allows more relaxed inference parameters as it also uses the non-unique peptides for the quantification. See Equation (PPKMdefinition), which is similar to Equation (Canonical F/RPKM formula) on page 24. After averaging per tissue and removing the clusters, the number of quantified proteins with the new method is close to twice the number provided by the first described method. The new quantification method was designed for the analyses in Chapter 6 in particular; this is why the filtering is also less strict than for the first quantificationmethod sincemy main focus is the integration of the proteomic data with the RNA-Seq data. Clusters are protein groups that can be mapped to more than one Ensembl gene identifier. Note that the final proteins numbers include only the fifteen tissues used for the integration in the following Chapter 6. 124 5.2 new quantification method For the following chapter analyses, I requested Dr James Wright to provide me a new quantification for the Pandey Lab data where both unique and the degenerate (i.e. non- unique) peptides are involved in the estimation of the protein expression. The new method I have devised allocates the degenerate peptides in proportion to the distribution of unique peptides per protein following a similar approach to Cufflinks2 for RNA-Seq data (see Section 1.2.5.3). As three unique peptides are no longer required for the quantification, it allows relaxing the inference parameters to two unique peptides (in order to still avoid one-hit wonders, see Section 1.3.4.3) for the identification of the proteins. Once the identification is done, all the unique and degenerate peptides are mapped to the identified proteins. For the unique peptides, their quantification is directly linked to one and only protein. However, for the degenerate peptides, it is necessary to gauge the likely amount provided by each of their matching proteins. For this purpose, the distribution coefficient of the degenerate peptide 𝑑 to the protein 𝑝, 𝐶𝑑,𝑝, is defined as below: 𝐶𝑑,𝑝 = 𝑁𝑢𝑝 ∑ 𝑖=1 Number of PSM(𝑢𝑝,𝑖) 𝑁𝑢𝑝 ∑ 𝑞∈𝑃𝑑 ( 𝑁𝑢𝑞 ∑ 𝑗=1 Number of PSM(𝑢𝑞,𝑗) 𝑁𝑢𝑞 ) (Distrib. coeff. of the degenerate peptide) where: • 𝑃𝑑 is the set of identified proteins that include the degenerate peptide 𝑑 • 𝑁𝑢𝑝 is the number of unique peptides of the protein 𝑝, ∀𝑝 ∈ 𝑃𝑑 • 𝑢𝑝,𝑖 is the 𝑖th unique peptide of the protein 𝑝, ∀𝑝 ∈ 𝑃𝑑 • Number of PSM(𝑢) is the number of PSMs of the peptide 𝑢 Then, the contribution of the degenerate peptide 𝑑 to the quantification of a protein 𝑝 is computed as: Q𝑑,𝑝 = 𝐶𝑑,𝑝 ⋅ Q𝑑 (Distribution of the degenerate peptide quantification to a protein) where: • 𝐶𝑑,𝑝 is the distribution coefficient defined above • Q𝑑 is the total quantification of the degenerate peptide 𝑑 The new quantification follows a similar approach to the F/RPKM normalisation — see Equation (Canonical F/RPKM formula). Protein expression levels are expressed in PPKMs, which stands for PSMs Per Kilobase of gene per Million. As shown in Equation (PPKM definition), this method counts the number of PSMs that can be mapped to the corresponding Ensembl gene identifier of a protein. Then, this raw count is normalised by dividing it by the product of the longest transcript length and the total number of PSMs assigned in that experiment; the result is finally multiplied by a factor (106) to facilitate reading. Using the longest transcript length instead of the longest protein isomer allows avoiding issues due to annotation differences between gene and 125 human ms-based protein expression landscape protein levels in the analyses of the following Chapter 6. ̂𝜇𝑖𝑗 = Number of PSMs matching 𝐺𝑖 ⋅ 10−6 ℓ𝐺𝑖 ⋅Number of PSM𝑗 (PPKM definition) where: • ̂𝜇𝑖𝑗 is the normalised expression of the protein 𝑖 in the experiment 𝑗, • 𝐺𝑖 is the gene that corresponds to the protein 𝑖, • Number of PSMs matching 𝐺𝑖 is the number of PSMs mapped to 𝐺𝑖, • ℓ𝐺𝑖 is the length of the longest transcript of 𝐺𝑖, • Number of PSM𝑗 is the total number of PSMs identified in the experiment 𝑗. Dr James Wright has implemented this new method and provided me PPKM quantifications for the Pandey Lab and Kuster Lab data. Although smaller, another limiting factor is the filtering threshold of the selected PSMs to be inferred in proteins. We have chosen a conservative (and state-of-the-art) threshold prior to the first quantification filtering and a less strict one for the new method. As my aim is to compare and integrate proteomic and transcriptomic data together, the primary purpose of the new quantification is to provide a number of proteins that is roughly similar to the number of mRNAs species’. Thus, we also have had to relax parameters and allow an increased number of false positives among the identified proteins (see Section 1.3.4.2). I have included proteins quantified with both methods in the analyses of the next chapter (Chapter 6). 0.0 0.1 0.2 -20 -15 -10 -5 Log2(Normalised Top3 PSM) de ns ity First quantification (Q1) 0.00 0.05 0.10 0.15 0.20 0 5 10 15 Log2(PPKM) de ns ity Tissue Adrenal Colon Oesophagus Gall bladder Heart Kidney Liver Lung Ovary Pancreas Prostate Rectum Testis Urinarybladder Placenta New quantification (Q2) Figure 5.7. Distribution of the protein expression levels with two different methods. On the left, the protein expression levels distribution for the adult tissues of the Pandey Lab datawith our first quantificationmethod (described in Section 2.4.2). This figure is similar to Figure 3.3(c) but without the fetal tissues. On the right are the protein expression levels of the same tissues that have been computed with the new quantification method. The overall shape of the density plots is very similar between the two approaches. 126 5.2 new quantification method Before moving on to the integration of the proteomic and transcriptomic data in the next chapter, I have carried out a few comparisons between the first and the new quantification methods. As shown in Figure 5.7, the densities of distribution of protein expression levels per tissues have similar shapes between the two methods. Figure 5.8 presents the number of quantified proteins per tissue. The new method allows quantifying for some tissues up to more than twice the number quantified by the first method. This proportion is consistent with the total number of proteins identified across the fifteen adult tissues by the first method (6,436) and the new one (12,290). 0 2000 4000 6000 8000 Te stis - Q 1 Te stis - Q 2 Ov ary - Q 1 Ov ary - Q 2 Pa ncr eas - Q 1 Pa ncr eas - Q 2 Pro sta te - Q 1 Pro sta te - Q 2 Liv er - Q 1 Liv er - Q 2 Co lon - Q 1 Co lon - Q 2 Ur ina ryb lad der - Q 1 Ur ina ryb lad der - Q 2 Ad ren al - Q1 Ad ren al - Q2 Re ctu m - Q 1 Re ctu m - Q 2 Ga llb lad der - Q 1 Ga llb lad der - Q 2 Kid ney - Q 1 Kid ney - Q 2 Pla cen ta - Q 1 Pla cen ta - Q 2 Lu ng - Q 1 Lu ng - Q 2 He art - Q 1 He art - Q 2 Oe sop ha gu s - Q1 Oe sop ha gu s - Q2 Tissue N um be r o f p ro te in s Protein is Tissue specific (first quantification method - Q1) Unspecific (first quantification method - Q1) Tissue specific (new PPKM quantification method - Q2) Unspecific (new PPKM quantification method - Q2) Figure 5.8. Comparison of the impact of the quantification method on the protein distribution per tissue for the Pandey Lab data. This figure partly reproduces the Pandey Lab part of Figure 5.4. Indeed, the turquoise/red 𝑄1 bars are the adult tissues of the Pandey Lab data quantified with the first quantification method (described in Figure 5.4). The purple/green 𝑄2 bars are their equivalent to the new quantification method. The new method allows a considerable increase of the identified and quantified proteins number, sometimes more than twice as many as the first quantification. On the other hand, the rank orders of the total and tissue-specific protein numbers per tissue are quite similar between the two methods. I have also checked the presence of OR in the Pandey Lab data with the new quantification method; only two ORs are present: OR1M1 in Kidney and OR13C4 in Liver. Their presence may be artefactual, or it may be an issue with the annotation. Indeed, while these two proteins are missing from all tissues — either at RNA or protein levels — in The Human Protein Atlas⁴ [Uhlén, Fagerberg, et al., 2015], their mRNAs are present in the Baseline expression of EBI Gene Expression Atlas⁵ [Petryszak, Keays, et al., 2015] in the human 4 The Human Protein Atlas — https://www.proteinatlas.org/ 5 EBI Gene Expression Atlas — https://www.ebi.ac.uk/gxa/ 127 human ms-based protein expression landscape Chloroid plexus at 10 post-conception weeks (HDBR developing brain — ArrayExpress ID: E-MTAB-4840) and in one sheep Testis sample (ArrayExpress ID: E-MTAB-3838). In addition, OR1M1 seems to be expressed in the Blood of the green monkey (Chlorocebus sabaeus—ArrayExpress ID: E-MTAB-4404). Both proteins are also found to be up or down regulated at transcript level in different tumoral samples (Differential expression tab of EBI Gene Expression Atlas). See also the digital supporting data. I have also assessed the consistency of expression measurements across Pandey Lab and Kuster Lab data and their common tissues as I have done for the three datasets with the first quantification (presented in Section 5.1.3). The new quantification gives similar results as shown in Figure D.12. Besides, as shown in Figure D.13, more than half of the proteins quantified in each dataset within a given tissue is also quantified in the other dataset. 5.3 ubiquitous and ts proteins Previously, W. Liu et al. (2014) had compiled a list of 627 TS and 1,093 housekeeping proteins from the expression data released originally by Pandey Lab [M.-S. Kim et al., 2014]. The data processing has a significant impact on protein identification; in our first version of Pandey Lab data, I have found 534 housekeeping (ubiquitous) proteins and 1,491 TS proteins, and for the PPKM quantification: 2,057 ubiquitous and 2,640 TS proteins. I provide as digital supporting data the list of the TS and ubiquitous proteins. 5.4 discussion and conclusion In this chapter, I have reviewed human proteome data from three projects presented in Chapter 2, that have been reprocessed by Dr James Wright with two pipelines for the two largest studies Pandey Lab data [M.-S. Kim et al., 2014] and Kuster Lab data [Wilhelm et al., 2014]. Currently, state-of-art bottom-up label-free MS proteomics captures human tissues expression as a fragmented and disparate universe. Our first processing pipeline, based on the Top3 quantificationmethod presented in Section 2.4.2, appearsmore reliable than some of the original authors’. The original Pandey Lab data was disputably quantifying ORs in many tissues [Ezkurdia et al., 2014]. No trace of ORs was found in any of our reprocessed data. I have also described our new quantification method (PPKM), which allows us to estimate the expression of nearly twice as many proteins than the first one we used. For both quantification methods, the technical variability is generally stronger than the biological interstudy signal even for similar tissues. Besides, across the different tissues, about half of the proteins are consistently observed in the same tissues at least in two datasets. Even when limited to the protein identification only, the general lack of repeatability and reproducibility has been well reported and described for technical and biological 128 5.4 discussion and conclusion replicates in MS proteomics (e.g. Tu, J. Li, Sheng, et al. (2014) and Tabb, Vega-Montoto, et al. (2010)). Canterbury et al. (2014) report that the intra-assay variation between two technical replicates for complex mixtures can be at least 50%; different runs of the same sample or experiment are often produced to raise the interstudy results repeatability and confidence. Thus, beyond the quantification method I have developed with the help of Dr James Wright by drawing on RNA-Seq ones, there is a definite need for new MS protocols and quantification methods for baseline expression⁶ to correct the extreme variability and ease the integration of proteomics data across studies. 6 Normalisationmethods for differential expression analysis are usually unsuited for baseline expression studies. See Välikangas et al. (2018b) for possible differential expression quantification methods. 129 Scientists like ripping problems apart, collecting as much data as possible and then assembling the parts back together to make a decision. Shirley M. Tilghman 6 I N TEGRAT ION OF TRANSCR I PTOMICWI TH PROTEOMIC DATA After assessing the similarity of the human gene expression profiles across various tissues at transcriptomic level (with RNA-Seq studies in Chapter 4) and proteomic level (with bottom-upMS studies in Chapter 5), my next step is to examine how these gene expression profiles compare between these two different biological layers. One major aim of this study is to assess how the correlations between the transcriptome and proteome described in the literature, mostly measured in cells, hold at the tissue level. Moreover, good correlations may potentially lead to the development of new strategies. These may use the expression levels of mRNA as proxies to estimate protein expression, which is generally difficult to measure directly (see Section 1.3). I have performed the integration and all the analyses presented in this chapter under the supervision of Dr Alvis Brazma and Dr Jyoti Choudhary. A few closely related studies [Kosti et al., 2016; Franks et al., 2017; D. Wang et al., 2019] have been published while I was working on the integration of the non-diseased human transcriptome and proteome. As their analyses rely on the same data sets (i.e.Uhlén, GTEx, Pandey Lab data) that I include in my work, I describe and discuss together my results and theirs whenever relevant. 131 integration of transcriptomic with proteomic data Communication to the community derived from this chapter • (paper) Mitra P. Barzine, Kārlis Freivalds, James Wright et al. (2020). ‘Using Deep Learning to Extrapolate Protein Expression Measurements’. Proteomics 20 (21–22), e2000009 • (submitted paper) Andrew F. Jarnuczak; Hanna Najgebauer; Mitra Barzine; Deepti J. Kundu; Fatemeh Ghavidel; Yasset Perez-Riverol; Irene Papatheodorou; Alvis Brazma; Juan Antonio Vizcaíno An integrated landscape of protein expression in human cancer • (poster) CSHL Biology of Genomes 2015 — A feasibility study: Integration of independent human RNA-Seq and proteomic datasets • (talk) GTEx meeting 2017 — A. Brazma Correlating transcriptome and proteome in human tissues • (poster) HUPO 2018 — Jarnuczak et al. An integrated atlas of protein expression in human cancer derived from publicly available • (poster) ECCB 2018 — Viksna et al. An integrated approach to missing data imputation in quantitative proteomics experiments • (poster) RECOMB 2018 — Viksna et al. Deep learning for protein abundance prediction using Gene Ontology and RNA abundance information 132 integration of transcriptomic with proteomic data An on-going debate in the literature is whether good correlations of expression levels prevail between mRNAs and proteins [Uhlén, Hallström, et al., 2016]. The implicit assumption of a proportional relationship is persisting as the many remaining technological limitations prevent rigorous testing [Vogel and Marcotte, 2012]. To date, the existence or concentration of a given mRNA transcript is usually insufficient to ensure detection of the protein in a sample. On the one hand, Ramakrishnan et al. (2009) report that mRNAs abundance are roughly sufficient to predict the protein presence or absence from a sample and Vogel, Abreu, et al. (2010) that mRNA level estimations and sequence features are enough to predict two-thirds of the human protein abundance variation. On the other hand, the literature fails to report any high correlation between the transcriptome and the proteome for any organism. Previous investigations found low or no correlation between the measured expression profiles of the mRNAs and proteins in human [Anderson et al., 1997; G. Chen, Gharib, et al., 2002; Tian et al., 2004; Pascal et al., 2008; Gry et al., 2009; Lundberg et al., 2010], other mammals [Ghazalpour et al., 2011], and across many other species [Gygi, Rochon, et al., 1999; Maier, Güell, et al., 2009; Maier, Schmidt, et al., 2011; Yeung, 2011; Palmblad et al., 2013; Freiberg et al., 2016]. In their encompassing reference experiment, Schwanhäusser et al. [Schwanhäusser et al., 2011; Schwanhäusser et al., 2013] present rather moderate correlations (𝑟2 ≤ 0.41, i.e. 𝑟 < 0.64) and highlight that mRNA levels explain only about 40% of protein variations they have observed. Other studies have explored the mRNAs and proteins relationship in answer to stimuli [Marguerat et al., 2012] or with an increased focus to post-transcriptional regulations (including degradation rates) [Jovanovic et al., 2015]. While many other regulatory processes may occur (e.g. translation rates), post-transcriptional modifications and technical noise are (still) perceived as the probable primary sources of mRNA/protein concentration discrepancies [Vogel and Marcotte, 2012; Plotkin, 2010]. Joint studies of transcriptome and proteome have already helped to highlight links between genotype and phenotype [Vogel and Marcotte, 2012]. However, the mitigated results reported above may explain the focus shift of many subsequent studies. While previous efforts were about linking the actual expression levels, more recent studies primarily have mostly compared qualitative attributes of given proteins and related mRNAs. Examples include the comparison of the presence or absence of mRNAs and their proteins in specific conditions or tissues [Santos et al., 2015; Freiberg et al., 2016; Uhlén, Fagerberg, et al., 2015] or the comparison of their differential expression profiles across identical sets of conditions [Väremo et al., 2015]. All (or almost all) aforementioned studies have turned to cells for their joint analyses of transcriptome and proteome. In contrast, the analyses and integration I present in this 133 integration of transcriptomic with proteomic data chapter are based on tissue studies. 6.1 data and principal analytical approaches Since the human proteome drafts [M.-S. Kim et al., 2014; Wilhelm et al., 2014] in 2014, we have an unparalleled availability of large-scale tissue studies both at the transcriptomic and proteomic layers to explore and integrate together (see Chapter 2). While these data are independent (collected from various individuals, prepared, and characterised by different laboratories), their combined study may help to shed light on the relationship between the transcriptome and proteome at the tissue level. Using different sources for the transcriptome and proteome increases the overall technical noise, but it may also help to highlight relevant biological signals (as they need to be stronger than the noise and batch effects to be captured). In Chapter 4, I show that the transcriptome RNA-Seq datasets present high interstudy tissue correlations (median value for Pearson: 𝑟𝒲1 = 0.75; 𝑟𝒲2 = 0.85 — Spearman: 𝜌𝒲1 = 0.88; 𝜌𝒲2 = 0.93). For this chapter analyses, I only consider the datasets with the highest similarity (highest correlations) that incidentally comprise the greatest number of tissues and are the two most recent studies, i.e. Uhlén et al. [Uhlén, Fagerberg, et al., 2015] and GTEx [Melé et al., 2015] data. To compensate for the shortfalls in the study design implied by the reuse of published data¹, I use both Uhlén et al. and GTEx data to filter out mRNAs with high interstudy variability for identical tissues. Whether this variability is technical or biological is irrelevant; in both cases, interpreting the relationship between a highly variable mRNAs and its protein from another dataset remains hard to interpret. For these mRNAs, it is impossible to explain the observed variability between the two transcriptomic datasets. Indeed, any result is subject to the transcriptomic dataset chosen for the comparison with the proteomic one. Furthermore, the comparison of the two transcriptomic data may give a reference, i.e. an ideal case scenario, for the proteomic/transcriptomic one. On the other hand, as shown in Section 5.1.3, the technical variability prevails over the biological signal of same-tissue samples for the available high-throughput proteomics. With the current technological state, different tissues from the same proteomic study are more likely to present a higher correlation than the same tissues from two different studies. To avoid an overly restricted protein set for the following analyses, I only include one proteomic study: Pandey Lab [M.-S. Kim et al., 2014]. All its samples have been run through the same MS platform and with the same protocol. Moreover, it presents more homogeneous protein distributions (see Figure 3.3 and Figure 5.7) and quantifies more proteins per tissue (Figure 5.4) than the two other datasets. Since a currentmajor limitation of bottom-up MS proteomic studies is the possible lack of detection of proteins for various 1 Independent data also means different collection and sampling processing methods and lack of information on the samples population background. 134 6.1 data and principal analytical approaches reasons (see Section 1.3.2), the higher number of detected proteins in Pandey Lab data suggests that this dataset has a higher quality than the two others. Though I include one proteomic dataset only, as the literature reports that the proteome is more conserved than the transcriptome (across individuals and species) [Laurent et al., 2010; Y. Liu et al., 2016], this data collection ought to provide a crude estimate of the extent of observations that hold from cell to tissue level. This chapter integrates and analyses the matching pairs of mRNA/proteins of the common set of tissues between Pandey Lab and the two transcriptomic datasets. 6.1.1 Overlapping set of tissues for the three datasets 14 3 6 1 12 11 23 Pandey Uhlen GTEx Protein mRNA mRNA Figure 6.1. Number of shared and unique tissues between the proteomic (Pandey Lab) and the transcriptomic (Uhlén et al. and GTEx) data. All analyses include the twelve tissues shared between the three datasets (Adrenal gland, Urinarybladder², Colon, Oesophagus, Heart, Kidney, Liver, Lung, Ovary, Pancreas, Prostate and Testis). In a few cases, I have also extended the analyses to three additional tissues (i.e. Gallbladder, Placenta and Rectum) by including the Uhlén et al. data on the transcriptomic side only. 6.1.2 Matching pairs of mRNAs and proteins To avoid unnecessary biases (described in Section 3.3), I only consider the mRNAs (i.e. RNAs with a protein-coding biotype — Ensembl 76) for the following analyses. Moreover, since missing data is common for proteomics [Lazar, Gatto, et al., 2016], only proteins that are detected in each dataset in at least one of the included tissues are considered for further 2 May also be referred to as Urinary Bladder 135 integration of transcriptomic with proteomic data analyses. Besides, while in the transcriptomics studies biological replicates of each tissue have been processed as individual RNA-Seq libraries, in the proteomic one, the biological replicates have been pooled per tissue before any MS profiling. Thus, to prevent an unbalanced number of samples biasing the integration analyses (see Chapter 3), I use ‘virtual references’, i.e. TREPs³ that I computed for each tissue by taking the median values of each gene across the biological replicates (see Section 3.3.4). As exposed in Chapters 2 and 5, all the proteomic quantifications have been provided by Dr James Wright. The first quantification follows state-of-the-art practices with stringent parameters (described in Section 2.4.2) since accurate protein identification is paramount for reliable proteome exploration. The protein levels are the intensity of their top three unique peptides normalised within-sample. Figure 6.2 presents the genes overlap across twelve shared tissues between the Pandey Lab’s proteins quantified through this first method and Uhlén et al.’s and GTEx’s mRNAs quantified with HTSeq-count (see Section 2.4.1.3). Figure 6.3 is the same analysis across the fifteen shared tissues between Pandey Lab and Uhlén et al. data. 5 0 0 1 6357 13245 658 Pandey Uhlen GTEx Protein mRNA mRNA Figure 6.2. Distribution of the unique and shared proteins of Pandey Lab data and mRNAs from Uhlén et al. and GTEx ones across their twelve shared tissues. There are 6,357 matching gene products between the three datasets. Only 5 proteins have apparently no matching partners in the Uhlén et al. or GTEx data. This first proteomic quantification is following robust guidelines, and both figures show that almost all the genes with an observed protein also have an observed mRNA. However, only about 32% of the quantified mRNAs in the Uhlén et al. and GTEx data have a corresponding protein detected in the Pandey Lab data. 3 TREP: tissue reference expression profile 136 6.1 data and principal analytical approaches 13240 8 6428 Uhlen Pandey ProteinmRNA Figure 6.3. Distribution of the unique and shared proteins/mRNAs for Pandey Lab and Uhlén et al. across their fifteen shared tissues. The number of matching pairs (6,428) and proteins that lack a counterpart in the transcriptomic data (8) are similar regardless of how many different transcriptomic data is included (see Figure 6.2). Once I learned more about the bioinformatic challenges of bottom-up proteomics (described in Section 1.3.4), I chose to be more flexible with the identification and quantification methods to increase the number of proteins included in my analyses. As I aim to integrate independent proteomics with transcriptomics, I mostly focus on robust expression between the two biological layers since discrepancies in this study context are hard to interpret. While artefacts may persist, further analyses with targeted proteomics (see Section 1.3) can help prune or validate the results. I have drawn on RNA-Seq transcriptomic approaches to devise a new quantification method, which is described in Section 5.2 and implemented by Dr James Wright. The method takes advantage of the degenerate peptides⁴ that are distributed across possible protein parents in proportion to their unique peptides. The method produces normalised values of the protein expression levels (whose unit is the PPKM, i.e. PSMs Per Kilobase of gene per Million). As shown in Figures 6.4 and 6.5, while the number of quantified proteins with our new method covers about 62% of Uhlén et al.’s and GTEx’s quantified mRNAs, the number of proteins for which no mRNA was detected in the transcriptomic data remains marginal. Whether it reflects the biological reality or is solely due to RNA-Seq technology beingmore sensitive than bottom-up MS alone, current techniques detect more individual mRNAs than proteins as confirmed by Figures 5.4 and C.1. Thus, it may be surprising that a few proteins lack a match in the transcriptome data. Several possible explanations exist. 4 See Section 1.3.4.3. 137 integration of transcriptomic with proteomic data 35 0 0 33 12494 7108 626 Pandey Uhlen GTEx Protein mRNA mRNA Figure 6.4. Distribution of the unique and shared proteins/mRNAs across twelve shared tissues between Pandey Lab (new quantification method), Uhlén et al. and GTEx data. 6747 69 12921 Uhlen Pandey mRNA Protein Figure 6.5. Distribution of the unique and shared proteins/mRNAs across fifteen tissues between the Pandey Lab (new quantification method) and Uhlén et al. data. 138 6.1 data and principal analytical approaches Artefacts or technical issues are the most likely. For example, the annotation might miss the matching RNAs definitions or defines them with another biotype than protein-coding⁵. Or, peptides and mRNA reads may be assigned to different gene IDs. Alternatively, the mRNAs are present in the sample, but the library preparation has missed their capture (see Section 1.2.1). Or even, the presence of proteins in the sample is a false positive or the result of contamination. However, biological processes might also explain the mismatches. One example is the case of mRNAs with short half-lives while their proteins are very stable. Another possible explanation is that the original location of the proteins is different from the tissue in which they were detected (like hormones or cytokines). Lastly, as the transcriptomic and proteomic samples are independently sourced, a protein may be specific to an individual or a population. This last hypothesis is themost unlikely as there are several biological replicates on the transcriptomic side. Amixture of the previous causes is also plausible. Transcriptomics Proteomics Uhlén et al. Pandey Lab HTSeq-count (FPKM) Top3 (State-of-the-art) (PSM) Our new method GTEx Cufflinks (FPKM) (PPKM) 3 datasets 12 tissues (Figure 6.2) 2 datasets 15 tissues (Figure 6.3) 3 datasets 12 tissues (Figure 6.4) 2 datasets 15 tissues (Figure 6.5) See digital supplementary data (mRNA) Figure 6.6. Overview of different studied datasets combinations. 5 E.g. XXyac-YRM2039.2 annotated as unprocessed pseudogene and now known as WASH1 since Ensembl 77 (October 2014) or TRAJ61 which is annotated as TR J gene. 139 integration of transcriptomic with proteomic data I exclude the unmatched proteins and mRNAs from further analyses. Table E.1 provides the unmatched protein lists for the Ensembl 76 annotation. Unless otherwise stated, to avoid issues exposed in Section 3.3.1, I also remove all the proteins and mRNAs of the mitochondrial genome from the subsequent analyses. Note that Figure 6.6 presents an overview of the various datasets combinations presented in Figures 6.2 to 6.5. 6.1.3 Tissue-centric and gene-centric approaches Tis sue A mRNA a mRNA z Tis sue N … … mRNA x Tis sue Y Tis sue Y Tis sue A Protein a Protein z Tis sue N … … Protein x Comparison of the Transcriptome and the Proteome for each tissue (across all the common mRNA/protein pairs) Comparison of each mRNA with its protein (across all the common tissues) Transcriptome Proteome Tissue-centric Gene-centric Gene a Gene z Figure 6.7. Approaches summary of the expression comparison between the transcriptome and proteome. Tissue-centric analyses focus on how the transcriptome and proteome relate to each other within the same tissue. Gene-centric analyses study for each gene how its mRNA expression levels across all (or a subset of) the tissues may relate to the quantified expression levels of its corresponding protein. Figure 6.7 summarises the two analytical approaches I use to compare transcriptomic and proteomic data. The tissue-centric approach compares for each tissue the global expression of its transcriptomic landscape to its proteomic one. In contrast, the gene-centric approach compares for each gene its expression levels in mRNA and protein across all the tissues. Confusion can arise when integrating proteomics and transcriptomics. Hence, it is essential to define the taken approach clearly [Y. Liu et al., 2016]. 140 6.2 fair correlations between independent proteomics and transcriptomics 6.2 fair correlations between independently sourced proteomics and transcriptomics of human tissues For the first tissue-centric analysis, I assess for each tissue the relationship between the expression of its proteome and transcriptome through the correlation of the protein expression values with their corresponding mRNA ones. After scaling with log2(𝑥 + 1), I compare proteomic and transcriptomic TREPs from identical and random tissue pairs, which are similar and roughly correspond to Gaussian distributions as illustrated by Figures 3.2 and 5.7. Figure 6.8 presents the correlation distribution range of transcriptomic and proteomic TREPs from identical and random pairs of tissues, both with Spearman and Pearson correlation methods (see Appendix B.1). Although transcriptomics and proteomics have independent sources, the Spearman correlations of the same tissues TREPs are equivalent to correlations in cell studies [Lundberg et al., 2010; Schwanhäusser et al., 2011] where the same sample provides mRNAs and proteins. Regardless of the protein quantification method (Top3 [Silva et al., 2006] or PPKM — equation (PPKM definition) on page 126), the median Spearman correlation coefficients are above 0.5 for matched proteomic and transcriptomic TREPs (also referred to as same-tissue pairs). The unscaled data presents identical outcomes (see Table E.2 and Figure E.4). The Pearson correlation is closer to the literature for our new PPKM quantification than for the Top3 quantification. The PPKM Pearson correlation averages above 0.5 [min: 0.38 (Oesophagus) ; max: 0.61 (Liver)] (and is within [min: 0.45 (Oesophagus) ; max: 0.67 (Liver)] for the untransformed data). As tissue proteomic samples can present high correlation without being related in any manner (see Chapter 5 and figure D.10), a Welch t-test [Welch, 1951] allows assessing the significance of the correlation for the same-tissue pairs by comparison to random tissue pairs. The one-sided Welch’s Two Sample t-test⁶ allows rejecting the null hypothesis 𝐻0 (the means of the correlation coefficients for same-tissues pairs are identical or lower to random tissues pairs). Irrespective of the protein quantification or computational methods, all the same-tissue pairs correlations are significant (p-value< 5.10−5, except for Pearson correlation with Top3 quantification where p-value < 0.05 — see Table E.2). The previous correlation distribution may imply a modest relationship between these independent proteomics and transcriptomics, but the same-tissue pairs scatterplots (e.g. Figure 6.9) show tighter links than first suggested. Besides, these scatterplots share a coarse profile despite the wide correlation ranges. Figure 6.9 illustrates the comparison of expression for Kidney between transcriptomics (Uhlén et al.) on the x-axis and proteomics (Pandey Lab — PPKM) on the y-axis. Kidney’s 6 See Appendix C 141 integration of transcriptomic with proteomic data Figure 6.8. Distribution of Pearson and Spearman correlation coefficients for same-tissue proteomic and transcriptomic pairs versus random tissue pairs (log2-scaled data). Depending on the protein quantification method,there are two types of distribution ranges for the Pearson correlations. Top3 quantification method provides a lower correlation (mean ≈ 0.11). The PPKM method (Section 5.2) produces higher correlations (mean ≈ 0.5). All the Spearman correlation ranges between same-tissue proteomic and transcriptomic TREPs are quite similar, regardless of the method quantifying the proteins. The median of Spearman correlation is 0.52. With the Top3 quantification (i.e. pink countered boxes — Top3 x HTSeq), two outliers are noticeable, and they are common to the three comparisons, Pandey x Uhlén (12 tissues and 15 tissues) and Pandey x GTEx (12 tissues): the lowest Spearman correlation is Oesophagus (𝜌 = 0.39) and the highest Liver (𝜌 = 0.62). Both for the Pearson and Spearman correlations, even when the correlations are very low, same-tissue pairs always have higher correlations than different (random) tissues pairs (all p-values computedwithWelch t-test <0.05 — see Table E.2). Thus, even the lowest same-tissue correlations are significant. The green boxplots, comparing the two transcriptomic datasets, are only represented for reference purposes. 142 6.2 fair correlations between independent proteomics and transcriptomics Figure 6.9. Scatterplot of protein (Pandey Lab — PPKM quantification) and mRNAs (Uhlén et al.) expression for Kidney. Each point of this scatterplot represents a gene; it has the log2-transformed expression valueof the corresponding Uhlén et al. mRNA (FPKM) on the x-axis and the log2-transformed expression value of the Pandey Lab protein (PPKM) on the y- axis. Most of the mRNA/protein pairs are distributed in an area that can be fitted by a linear function with a positive slope, which indicates a high correlation between mRNAs and proteins expression levels. However, genes with lower expressed mRNAs have a less associated expression between their protein and mRNA, in particular, mRNAs that are expressed below 1 FPKM (i.e. below 0 on the x-axis). On the other side, genes with the highest expressed mRNAs may present a saturation effect (Section 1.3.2) in the quantification of the protein expression. The highest expressed protein is HBB (i.e. Hemoglobin Subunit Beta), which is also found in the five highest expressed proteins in all the other tissues. Possibly, its presence is due to remaining erythrocytes in the samples. On the outer parts of the scatterplot, there are the respective distribution densities of the proteins and the mRNAs. Whilst the correlation calculation includes every pair of mRNA and protein, the plot excludes any pair with an unexpressedmRNA or protein to optimise the visualisation. Figure E.2 presents an overview of the other tissue scatterplots. 143 integration of transcriptomic with proteomic data correlation coefficients stand in the middle of the range regardless of the considered studies, protein quantification or correlation methods involved in the comparison. A linear function with a positive slope (not drawn) can fit the bulk of the points. Indeed, the expression of most mRNAs and proteins in a tissue are highly associated with the exception of the lowest (< 1 FPKM) and a number of the highest measured mRNAs. Besides the mismatching sampling sources, other possible explanations for the observed divergences are technical limitations (such as protein saturation effect, see Section 1.3.2), translational noise (see Section 3.3.3) or a consequent half-life difference between the mRNA and its protein. Although the number of genes presenting lowly associated levels of mRNA/protein expression is rather limited, it is enough to impair the Pearson and Spearman correlation coefficients. Systematic exclusion of lowly associated pairs of mRNAs and proteins is impractical and arguable as they are inconsistent from one tissue to another. Case-by-case treatment will be necessary. Removing the lowly expressed mRNAs (< 1 FPKM) only marginally changes the correlation coefficients, e.g. for Kidney, when considering the PPKM quantification for the proteins, the Pearson correlation increases from 0.56 to 0.58, while the Spearman correlation is relatively unchanged (0.51 instead of 0.52). There are similar changes observed when considering the more conservative Top3 protein quantification. The Pearson correlation 𝑟 = 0.18 increases to 0.21. The Spearman correlation remains unchanged (𝜌 = 0.52). Both transcriptomic studies (Uhlén et al. or GTEx) providing alike results, I describe for most of the following analyses the data combination that provides the greatest number of tissues and genes to study, i.e. the fifteen-tissue set between Uhlén et al. and Pandey Lab data quantified with the PPKM method. The other combinations (provided in Appendix E or electronic format) may diverge for individual genes through the various combinations, but the general trends are identical. I focus on Pearson correlation over Spearman correlation in the following parts since the results for the PPKM quantification are globally similar for both. 6.2.1 Mixed biological signal between the proteome and transcriptome across the tissues As shown in Figure 6.10, for nine tissues (in yellow) transcriptomic and proteomic expressions correlate better in matching tissues. For four other tissues (Colon, Lung, Oesophagus and Urinarybladder — in dark pink), only the proteomics correlate the best with the matching transcriptomics, while the transcriptomics correlates better with other 144 6.2 fair correlations between independent proteomics and transcriptomics proteomics tissues. The remaining two tissues have their proteomics correlating as much (e.g. Gallbladder) to other tissues or more to transcriptomics from other tissues (Rectum). While the different correlation methods lead to similar result trends, individual differences persist. In a few cases, e.g. Heart, these slight differences may considerably change the Ad ren al Co lon Ga llb lad derHe art Kid ne y Liv er Lu ng Oe sop ha gu s Ov ary Pa ncr eas Pla cen ta Pro sta te Re ctu m Te sti s Ur ina ryb lad der Urinarybladder Testis Rectum Prostate Placenta Pancreas Ovary Oesophagus Lung Liver Kidney Heart Gallbladder Colon Adrenal Proteome Pandey lab log2(PPKM+1) Tr an sc rip to m e (m RN A s) Uh lé n et a l. lo g2 (F PK M +1 ) 0.52 0.5 0.54 0.54 0.52 0.51 0.57 0.55 0.55 0.55 0.55 0.56 0.6 0.53 0.5 0.47 0.47 0.51 0.47 0.52 0.5 0.5 0.5 0.52 0.5 0.5 0.66 0.56 0.57 0.5 0.5 0.54 0.51 0.5 0.56 0.57 0.57 0.58 0.3 0.4 0.5 0.6 Pearson correlation 0 20 40 60 Figure 6.10. Heatmap based on the Pearson correlation between protein and mRNAs expression (alphabetically ordered tissue). Correlations for same tissue pairs (diagonal) are highlighted in yellow when the highest observed correlations are between the matching proteomics and transcriptomics pairs; in dark pink, when the proteomics correlates the best with the matching transcriptomics. When other higher correlations are observed for a tissue proteomics or transcriptomics they are in given grey. 145 integration of transcriptomic with proteomic data relative correlation ranking order of the TREPs (see Figure E.5). In the following sections, I explore several avenues to identify possible factors that influence the association strength between the proteome and transcriptome. I first study the effect of tissue composition (in proteins and mRNAs) on the correlations. I begin with the assessment of the impact of the proteins and mRNAs that are found in one tissue only, before looking into the tissue-specific (TS) proteins and mRNAs. Then, in a more quantitative approach, I examine more closely how the mRNA expression profiles relate to their respective protein ones. 6.2.2 Influence of the expression breadth on the tissue mRNAs/proteins correlation In Chapter 5, I have shown that the protein expression of both different tissues and same- tissue pairs are sharing a similar correlation range (see Figure D.11). In this context, genes expressed in a small number of tissues (both as a protein andmRNA) can have a significant impact on the correlation and may explain the mitigated results. The expression breadth of a gene is the number of tissues and cell lines within which the gene is expressed at a given threshold⁷. Figure 6.11 allows visualising the distribution of the expression breadth of the mRNAs (Uhlén et al.) and the proteins (Pandey Lab data) across their fifteen common tissues. In the following sections, I may refer to a (TS) gene as a unique gene when it is only expressed in a single tissue. Figure 6.11a shows that the distribution of the protein expression breadth is bimodal. Either due to technical limitations or biological reasons, proteins detected in a sole tissue form the most numerous class and represent 20 % of the overall number. Proteins expressed in all tissues are the second most numerous class (about 16 %); the third largest class (12 %) comprises the proteins expressed in two tissues. On the other hand, almost all mRNAs are expressed in every tissue (Figure 6.11b). One hypothesis is that mRNAs levels have to exceed a sufficient threshold for their proteins to be detected. Thus, I also studied the effect of two additional minimum expression thresholds for the mRNAs on the expression breadth. The two new expression breadth profiles are more alike to the proteomic one. As shown in Figure 6.11c, the number of transcripts only found in one tissue increases at thewidespread 1 FPKM threshold, which roughly equates to one RNA in the cell [Mortazavi et al., 2008; Hebenstreit et al., 2011]. The expression breadth profile of the mRNAs expressed at or above 5 FPKM present a similar bimodal distribution (Figure 6.11d) to the protein one. While arbitrary, 5 FPKM 7 If a gene is expressed below the considered threshold in all the tissues, its expression breadth is null. 146 6.2 fair correlations between independent proteomics and transcriptomics is a threshold commonly found in the literature [Uhlén, Fagerberg, et al., 2015; Gonzàlez- Porta et al., 2013; J. Chen et al., 2018]. 2613 1560 1026 766 601 499 478 404 423 404 441 430 522 692 2050 0 1000 2000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tissues Pr ot ein co un t (a) Protein expression breadth (PPKM quantification) 130 82 71 62 76 84 86 88 79 111 114 160 227 371 11168 0 3000 6000 9000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tissues m RN A co un t (b) mRNA expression breadth (> 0 FPKM) 988 483 335 284 274 251 211 205 218 286 285 424 755 1376 6065 0 2000 4000 6000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tissues m RN A (≥1 F PK M ) c ou nt (c) mRNA expression breadth (≥1 FPKM) 1946 834 632 492 425 385 328 376 396 411 500 630 930 1167 1336 0 500 1000 1500 2000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tissues m RN A (≥5 F PK M ) c ou nt (d) mRNA expression breadth (≥5 FPKM) Figure 6.11. Expression breadth of the proteins and mRNAs. The expression breadth of the proteins has a bimodal distribution. Many proteins are detected either in a single tissue or in all of them. Almost every mRNA is detected in every tissue. Their breadth becomes bimodal when their expression threshold is increased to 5 FPKM. To ease the general visualisation, I have omitted to plot the mRNAs for which the expression remains below the threshold for all tissues (i.e. expression breadth=0 for the considered threshold). Figure 6.12 displays the fraction of unique genes (i.e. only expressed in a single tissue) detected as a protein or anmRNA at a considered threshold for each tissue as the analysis is seeking a possible link between the number of uniquely detected genes and the correlation strength between the proteomic and transcriptomic TREPs. Hence, these fractions are computed by dividing the number of unique genes (proteins or mRNAs) of each tissue by the total amount of uniquely detected genes across all tissues. The tissues are ordered by increasing order of their fraction. 147 integration of transcriptomic with proteomic data 0.00 0.25 0.50 0.75 1.00 ratio Pr ot ei ns (> 0 P PK M ) 0.00 0.25 0.50 0.75 1.00 ratio m RN A (> 0 F PK M ) 0.00 0.25 0.50 0.75 1.00 ratio m RN A (≥ 1 FP KM ) 0.00 0.25 0.50 0.75 1.00 ratio m RN A (≥ 5 FP KM ) Tissue Kidney Lung Rectum Colon Adrenal Placenta Oesophagus Gall bladder Prostate Pancreas Heart Ovary Urinarybladder Liver Testis Distribution of unique genes (proteins or mRNAs) across tissues Figure 6.12. Unique proteins or mRNAs fractions across tissues. Although their proportion varies from one tissue to another, all fifteen tissues have proteins that are specifically detected in each tissue solely, as shown in the top plot in Figure 6.12. In contrast, unique mRNAs are detected in a more limited number of tissues (see the three bottom plots of Figure 6.12). Besides, the unique proteins are more evenly distributed between the fifteen tissues than the unique mRNAs. Except for Testis and Liver, which are consistently expressing the highest number of uniquely detected genes, the other tissues fail to present any similarity between the available proteomic and transcriptomic data. Liver is the most correlated tissue (Figure 6.10) and comprises the second-highest number of unique genes. Testis is the third-best correlated tissue despite having the highest fractions of unique proteins and mRNAs regardless of any threshold. It may be tempting to hypothesise that the number of unique genes relate to correlation levels, but the other tissues fail to show any relationship between the number of unique mRNAs and proteins they expressed and the strength of the correlations. Put together, these results suggest that the number of proteins and mRNAs uniquely expressed in these tissues play a minor role at best in the mRNA/protein correlation computed for each tissue. The lack of relation between the proteomic and transcriptomic 148 6.2 fair correlations between independent proteomics and transcriptomics observations is confirmed by a more refined analysis of the expression breadth. 0 1000 2000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of tissues Pr ot ei n co un t Corresponding mRNA breadth Identical Similar Mixed Different Expression < 5 FPKM Figure 6.13. Comparison of proteins expression breadth to their corresponding mRNA. The proteins’ expression breadth (Figure 6.11a) is coloured according to their corresponding mRNA expression breadth at 5 FPKM (Figure 6.11d). About one-fifth of the uniquely detected proteins have their corresponding mRNA identically expressed once at or above 5 FPKM. The number of proteins classified as Identical decreases significantly through other breadths until it raises again from thirteen tissues to reach about one- third of the ubiquitous proteins. Proteins and mRNAs with mismatching expression breadth are split into several categories. Proteins and mRNAs that are both detected within four to twelve tissues are described as Mixed. If the expression breadths of the remaining pairs are close (± 2), they are identified as Similar otherwise as Different. Finally, many genes detected at least once as a protein have an mRNA expression that never reaches 5 FPKM (Expression < 5 FPKM). Figure 6.13 shows that the expression breadth of mRNAs (expressed ≥ 5 FPKM or even smaller threshold) concurs in very few cases to their corresponding protein breadth. Thus, the mRNA expression breadth is insufficient to predict the expression breadth of the corresponding protein. Even for the two extreme cases where the protein is unique to a tissue or ubiquitous (found in all fifteen tissues), there are differences between the expression breadths of the mRNA and the protein of the same gene. All the expression breadth analyses of the transcriptome rely on expression levels. However, Chapter 4 underlines that high mRNA expression levels are unrelated to high interstudy correlation of same-tissue pairs while TS mRNAs present a rather strong relation with it. For this reason, the following analysis examines the relationship 149 integration of transcriptomic with proteomic data between TS mRNAs and TS proteins. 6.2.3 Tissue-specific mRNAs have significant overlap with tissue-specific proteins Unlike mRNAs, many proteins are only expressed in one unique tissue. These are the ones I refer to as TS proteins in the remainder of this thesis. To enable the comparison of these TS proteins with possible transcript partners, I first need to define a set of TS mRNAs. To find the latter, I choose the 𝑛 mRNAs most specific to a tissue based on the Fold change method (Section 4.3.1.2) where 𝑛 is the number of TS proteins of that tissue. Then, as detailed in Figure 6.14, I examine for each tissue the overlap between its 𝑛 TS proteins with its 𝑛mRNAs with the highest tissue-specific ranks. Figure 6.15 illustrates the Heart example. Detected only in T Protein Sort the mRNA by their specificity in T RNA n For each tissue T: + - n x1 x2 n = | x1 ∪ y | = | x2 ∪ y | Specificity of mRNAi = ExpressionT𝑖sum(Expressions of mRNAi in all tissues) y x1 x2 y Figure 6.14. Overview of the comparison of the TS proteins and TS mRNAs. TS proteins are the 𝑛 proteins only expressed in one tissue. Once the mRNAs have been sorted by decreasing order of their relative specificity to a given tissue, the first 𝑛 mRNAs identities are compared to the ones of the 𝑛 TS proteins present in the same tissue. Each tissue has a different number of TS proteins. I thus refine this analysis by computing Jaccard similarity coefficients (or Jaccard indices) [Jaccard, 1901; Lin et al., 2008], see Equation (Jaccard similarity coefficient). The Jaccard indices allow assessing the relationship between TS proteins and mRNAs across all the tissues at the same time and ease the result interpretation in contrast to the raw overlap numbers. 150 6.2 fair correlations between independent proteomics and transcriptomics Heart Jaccard index = 0.075 p-value = 2.43e-18 154 15425 Pandey Uhlén Figure 6.15. Example of overlap of TS proteins and TS mRNAs. The Jaccard index is computed as follow: 𝐽(𝑥1, 𝑥2) = |𝑥1 ∩ 𝑥2| |𝑥1 ∪ 𝑥2| = |𝑥1 ∩ 𝑥2||𝑥1| + |𝑥2| − |𝑥1 ∩ 𝑥2| (Jaccard similarity coefficient) When applied specifically to Figure 6.14, we get: 𝐽( , ) = 𝑘2𝑛−𝑘 , with 𝑛 the number of proteins ( ) that are only found in a given tissue and 𝑘 is the number of common genes between these 𝑛 unique proteins and the 𝑛 most specific mRNAs of the tissue ( ). To measure the Jaccard indices significance, I use the hypergeometric test [Field et al., 2012] (see Appendix E.1). In the current analysis, I consider as ‘success’ when a TS mRNA is among the 𝑛 TS proteins and test if the number of observed successes is greater than the expected number for random sampling. The Jaccard indices for all pairs of the fifteen shared tissues between the Pandey Lab (PPKM quantification) and Uhlén et al. are summarised in Figure 6.16, while Figure 6.17 displays their respective p-values (hypergeometric test). I have rerun these analyses with different sets of parameters and I have consistently observed statistically significant overlaps except in rare cases, which include the comparison of TS genes for Urinarybladder between Pandey Lab (PPKM quantification) and Uhlén et al. (HTSeq-count) based on the fifteen-tissue set and where the TS mRNAs are selected with the fold change method. Overall, the Jaccard indices remain within the same ranges for different sets of parameters. If ranked by their Jaccard indices, several tissues fall within a range similar to their correlation coefficient, while others not — as shown by Figure E.5. For instance, Liver, 151 integration of transcriptomic with proteomic data Ad ren al Co lon Ga llb lad derHe art Kid ne y Liv er Lu ng Oe sop ha gu s Ov ary Pa ncr eas Pla cen ta Pro sta te Re ctu m Te stis Ur ina ryb lad der Urinarybladder Testis Rectum Prostate Placenta Pancreas Ovary Oesophagus Lung Liver Kidney Heart Gallbladder Colon Adrenal Proteome Pandey lab (PPKM) Tr an sc rip to m e (m RN A ) Uh lé n et a l. (F PK M ) 0.0037 0.024 0.0091 0.014 0.0063 0.011 0 0.0033 0.0026 0.014 0 0.021 0.02 0.0085 0.032 0.015 0.012 0.006 0.0085 0.0063 0.0044 0 0 0.021 0.0057 0.019 0.018 0 0.1 0.017 0.0037 0.028 0.015 0.0056 0.0063 0.016 0.006 0.0067 0.0026 0.0057 0.0037 0.006 0.025 0.013 0.012 0.011 0.012 0.003 0.0028 0.0063 0.02 0.006 0 0.0026 0.014 0.0074 0.053 0.015 0.014 0.017 0.0074 0.012 0.003 0.0085 0 0.0066 0 0.01 0.0026 0.0057 0.096 0.021 0.0099 0.0063 0.0096 0.015 0 0.003 0.011 0 0.0088 0.006 0 0.0052 0.1 0 0.015 0 0.012 0.012 0.011 0.012 0.003 0.0056 0 0.011 0 0.0067 0.029 0.0086 0.0037 0.006 0.0049 0.011 0.027 0.0037 0 0.018 0.0056 0.0063 0.011 0 0.086 0.01 0 0 0.015 0 0.0095 0.012 0 0 0.0091 0.02 0 0.0088 0.05 0.017 0.0052 0.0057 0.011 0.021 0.0049 0.0074 0.0072 0 0 0 0 0 0.13 0 0 0.0026 0 0 0 0 0.0011 0.0048 0 0.0079 0.015 0.0028 0.19 0.0088 0 0.01 0.0078 0.0057 0 0.009 0 0.0074 0.017 0.0074 0 0 0.091 0 0.0044 0 0.01 0.0078 0 0.015 0 0 0.0074 0.022 0.011 0.004 0.028 0.014 0.013 0.016 0.006 0.01 0.021 0.026 0.0037 0.012 0.015 0.0085 0.012 0.0074 0.033 0.012 0.0056 0 0.016 0 0.0067 0.0052 0 0.0074 0.012 0.036 0.011 0.012 0.075 0.016 0.0091 0.011 0.0063 0.013 0.012 0.0033 0.01 0.0057 0.015 0.015 0 0.012 0.012 Figure 6.16. Heatmap of Jaccard indices across the common fifteen tissues between Uhlén et al. and Pandey Lab data. For each tissue, the TS proteins are the proteins (quantified with PPKMmethod) that are expressed only in that tissue. The TS mRNAs are the mRNAs with the highest specific coefficients in that tissue. Testis and Pancreas have high ranks and Urinarybladder and Gallbladder low ones for both their correlation and their Jaccard indices. On the other hand, Kidney ranks first for the Jaccard index, but only reaches the seventh rank for Pearson correlation. While the Rectum has the smallest Jaccard index and thus ranks last (i.e. fifteenth), it gets ranking number nine for Pearson correlation. These results suggest that TS proteins and TS mRNAs are unrelated to the tissue correlation levels. Overall, the above direct approaches (based on the gene expression breadth across tissues and their tissue-specificity) fail to show a prominent (if any) contribution or association to the correlation levels between the proteome and transcriptome. These are most likely resulting frommultiple subtle similarities based on identical differential expression within 152 6.2 fair correlations between independent proteomics and transcriptomics Ad ren al Co lon Ga llb lad derHe art Kid ne y Liv er Lu ng Oe sop ha gu s Ov ary Pa ncr eas Pla cen ta Pro sta te Re ctu m Te stis Ur ina ryb lad der Urinarybladder Testis Rectum Prostate Placenta Pancreas Ovary Oesophagus Lung Liver Kidney Heart Gallbladder Colon Adrenal Proteome Pandey lab (PPKM) Tr an sc rip to m e (m RN A ) Uh lé n et a l. (F PK M ) 0.26 0.26 0.26 0.0079 0.058 0.86 0.67 0.26 7e-04 0.26 0.45 0.058 0.26 0.058 4.3e-05 0.97 0.98 1 1 1 1 1 0.99 0.98 0.97 1 0.9 0.94 7.1e-39 1 1 1.6e-05 0.047 1 1 1 0.56 1 0.56 1 0.19 0.047 0.0013 1 0.0086 0.068 0.18 0.18 1 0.37 1 0.0064 0.068 0.65 0.068 0.00645.9e-11 0.65 0.023 0.0064 0.058 0.43 0.77 0.058 1 1 0.18 1 0.77 1 6.9e-23 0.43 0.77 0.015 1 0.69 1 0.00063 1 0.69 1 0.69 1 0.42 2e-27 0.69 0.09 0.69 0.69 0.09 0.33 0.79 0.0085 0.55 0.55 0.95 0.79 0.33 0.00015 0.79 0.95 0.95 0.95 0.0085 0.95 0.83 0.53 0.26 0.26 0.26 1 0.032 8.7e-21 0.53 1 0.26 1 0.53 1 0.83 0.1 1 0.42 1 1 1 6.7e-08 1 1 0.42 1 0.42 0.42 1 1 0.22 0.11 0.11 0.92 0.58 3.9e-42 0.58 0.38 0.38 0.58 0.78 0.021 0.11 0.92 0.38 0.39 1 0.088 1 2.6e-39 1 1 0.39 1 1 1 0.39 0.39 0.39 0.39 0.24 0.71 0.1 2.9e-24 0.92 1 0.012 0.71 0.71 0.24 0.45 0.92 0.71 0.45 0.1 0.37 0.17 0.00032 1 0.066 1 0.37 0.021 0.89 0.89 0.89 0.89 0.066 0.64 0.37 0.037 3.5e-05 0.72 1 0.36 1 1 1 0.13 1 0.13 0.13 0.00025 0.13 0.0016 2e-16 0.42 0.17 0.42 1 1 1 0.76 0.17 0.056 0.42 0.17 0.76 0.056 0.76 Figure 6.17. p-values associated with the Jaccard indices of Figure 6.16. These p- values have been computed with the hypergeometric test. many clusters of the proteins and mRNAs. Given the mixed results of the direct approaches built on uniquely detected genes, I next examine whether or not an indirect method may be more appropriate. For this new analysis, I build hierarchical cluster trees that try to translate the tissues’ expression ‘closeness’ or differentiation distances. Then, I compare the proteins’ and transcripts’ trees. 6.2.4 Proteins and mRNAs tissue trees present partial concordant results As presented in Section 3.2.2, a hierarchical clustering analysis requires a linkage method and a distance between each element that is included in the analysis. 153 integration of transcriptomic with proteomic data I have built the tree of each dataset by linking the tissues withWard’s method [Ward, 1963] like in the previous analyses. I have performed an initial analysis that used the tissue correlations for the distance, but it did not highlight any similarity between the proteomics and transcriptomics trees of hierarchical clusters. The distance used in the following analysis reflects the difference in composition of gene populations expressed by the tissues. It is based on the Jaccard index (see Equation (Jaccard similarity coefficient)) of the tissues where I only consider genes (proteins or mRNAs) that are detected in two tissues strictly. Making the hypothesis that the closeness of two tissues increases with the number of genes they share, I compute the distance between two tissues, 𝑡1 and 𝑡2 using the following formula: distance(𝑡1, 𝑡2) = 1− 𝐽(𝑡1, 𝑡2). Note that for this analysis, I present the results for Pandey Lab (PPKM quantification) and Uhlén et al. for their fifteen shared tissues as for the other analyses, but I also present and compare with the GTEx data and their twelve shared tissues. Figure 6.18 shows the hierarchical clustering of the fifteen tissues for Pandey Lab (PPKM quantification) and Uhlén et al. (≥ 5 FPKM) studies. Adrenal Colon Gallbladder Heart Kidney Liver Lung Oesophagus Ovary Pancreas Placenta Prostate Rectum Testis Urinarybladder (a) Pandey Lab tissues (PPKM quantification) Adrenal Colon Gallbladder Heart Kidney Liver Lung Oesophagus Ovary Pancreas Placenta Prostate Rectum Testis Urinarybladder (b) Uhlén et al. tissues (≥5 FPKM) Figure 6.18. Hierarchical clustering for the fifteen tissues of Pandey Lab and Uhlén et al. studies. Both Pandey Lab and Uhlén et al. trees, respectively Figures 6.18a and 6.18b display the same four pairs of tissues the most closely related: Rectum and Colon, Placenta and Lung, 154 6.2 fair correlations between independent proteomics and transcriptomics Liver and Kidney, and Testis and Ovary. Comparing more than two hierarchical trees for possible finding congruence is cumbersome manually; thus, methods exist to create consensus trees [Felsenstein et al., 2004]. The methods can be strict or create a consensus based on the majority. Since there is a maximum of three trees (one for each dataset) to compare at a time, all the consensus trees within this thesis are strict, i.e. all the trees must be in agreement on a hierarchical organisation for the consensus tree to include it. I use one of the possible implementations of these methods from the R package ape (v5.3) [Paradis et al., 2019]. Adrenal Colon Gallbladder Heart Kidney Liver Lung Oesophagus Ovary Pancreas Placenta Prostate Rectum Testis Urinarybladder (a) Consensus tree of the fifteen shared tissues between Pandey Lab and Uhlén et al. (≥ 5 FPKM) data Adrenal Colon Heart Kidney Liver Lung Oesophagus Ovary Pancreas Prostate Testis Urinarybladder (b) Consensus tree of the twelve shared tissues between Pandey Lab data and Uhlén et al. and GTEx data (≥ 5 FPKM) Figure 6.19. Consensus of the hierarchical clustering of the tissues across the different studies. Figure 6.19 shows two consensus trees. Figure 6.19a is the consensus tree built on the previous trees of Pandey Lab and Uhlén et al. fifteen shared tissues. The tree groups are clearly featured. To assay if these groups may still be found beyond these two datasets, I extend the analysis with the GTEx data. Figure 6.19b relies on the set of twelve shared tissues between Pandey Lab data (quantified by the PPKM method) and the two transcriptomic datasets (≥ 5 FPKM): Uhlén et al. and GTEx data. Compared to Figure 6.19a, only two tissue-sets are consistently observed as most closely related: Liver and Kidney, and Testis and Ovary. Note that the results seem unaffected by the protein quantification method as PPKM and Top3 methods give identical results. However, the threshold, above which the mRNAs are considered, influences the analysis’ outcomes. As very few mRNAs are found uniquely in two tissues at 0 FPKM, this threshold is insufficient to identify any hierarchical 155 integration of transcriptomic with proteomic data organisation in the transcriptomic data. Increasing the threshold, for instance to 1 FPKM, allows exposing clusters of tissues, e.g. Liver and Kidney. However, different thresholds can also highlight different tissue clusters, including Rectum and Colon at 5 FPKM or Pancreas and Gallbladder at 1 FPKM. The influence of the thresholds on the results suggests that genes have their expression levels varying depending on their tissue context. In summary, even indirect analyses based on the proteins and mRNAs breadth expression have some degree of similarity in their results and seem to partially capture a biological signal. Furthermore in-depth (direct or indirect) analysesmay clarify the relationship between the proteome and transcriptome. For example, equivalent correlation levels between specific gene groups (gene co-expression correlation) may be highlighted for each tissue at both biological layers. However, this kind of analysis requiring a case-by-case approach will greatly benefit from well-established and proven mRNA and protein expression baselines for each tissue. The following section Section 6.3 analyses precisely whether genes have comparable expression profiles as a protein than as an mRNA. 6.3 wide correlation range for protein/mrna pairs As previously reported, the expression levels of mRNA/protein pairs can present a tight relationship in one tissue while being seemingly unrelated in another. The first gene- centric analysis explores for each gene the relationship between the expression levels of its mRNAs and proteins across all available tissues. This analysis helps one to determine whether any intrinsic trend structures the gene expression or if it is only subject to the environment. Figure 6.20 displays the Pearson correlation between the matching pairs of mRNAs from Uhlén et al. and the proteins from Pandey Lab (quantified with the PPKM method) across their fifteen common tissues. The observed levels of correlation (in pink) are higher than the expected levels computed by random permutation (the grey line is showing the average of 10,000 permutations). I also compare the Pearson correlation of the matching mRNAs/protein pairs with the ones (in green) of the mRNAs/mRNAs pairs from Uhlén et al. and GTEx data to provide more context. About two-thirds of the genes (8,550) present a strong correlation (𝑟 > 0.8) between the mRNA expression levels of Uhlén et al. and GTEx data, but only about 6% (775) of the mRNA/proteins pairs are above this limit (in dark blue). The Pearson correlation between the mRNA/protein pairs ranges rather widely from 1 (for 32 pairs, which are detected as mRNA and protein in the same single tissue only) to below 156 6.3 wide correlation range for protein/mrna pairs Figure 6.20. Pearson correlation of gene expression levels between studies across the shared tissues in descending order. For each mRNA and its corresponding protein, I have computed their Pearson correlation (in pink) across the fifteen common tissues between Pandey Lab (PPKM) and Uhlén et al. data and then ordered them in decreasing order. The x- axis shows the rank of each pair and the y-axis its correlation coefficient (computed with log2(level+ 1)). The grey line represents the mean of the10,000 randomisations (by pair composition permutation). The permutation confirms that the observed correlation coefficients are significantly higher than expected by chance. The green line serves as the most ideal comparison case: it represents the Pearson correlation of mRNAs pairs between Uhlén et al. and GTEx data. −0.5 (for 105 pairs, with 𝑟 = −0.83 for the most anticorrelated one). A closer look at both extremes (Figure 6.21) reveals several possible relationship profiles between the expression of the mRNAs and proteins. The negatively correlated genes can present overall anticorrelated (Figure 6.21a) or rather unrelated (Figure 6.21b) expression levels of mRNAs and proteins. On the other end, the highest correlated genes present genes with a tissue-specific (TS) protein or whose mRNA and protein expression levels are tightly related (Figures 6.21c and 6.21d). I only present herein (and in the following sections) the set of results based on the Pearson correlation of the gene expression levels between the Pandey Lab data quantified with the PPKM method and Uhlén et al. data across their fifteen shared tissues since the different sets share similar results trends. Furthermore, this data combination provides the highest 157 integration of transcriptomic with proteomic data Adrenal Colon Gall bladder Heart Kidney Liver LungOesophagus Ovary Pancreas Placenta Prostate RectumTestis Urinarybladder 3 4 5 4.8 5.2 5.6 Uhlén et al. (mRNA) log2(FPKM+1) Pa nd ey la b (p ro te in ) lo g2 (P PK M +1 ) SSR3 (ENSG00000114850) Spearman ρ = -0.79 • Pearson r = -0.74 (a) Anticorrelated Adrenal Colon Gall bladder Heart Kidney Liver Lung Oesophagus Ovary Pancreas Placenta Prostate Rectum Testis Urinarybladder 3 4 5 6 7 8 0.0 0.5 1.0 1.5 Uhlén et al. (mRNA) log2(FPKM+1) Pa nd ey la b (p ro te in ) lo g2 (P PK M +1 ) COL2A1 (ENSG00000139219) Spearman ρ = -0.59 • Pearson r = -0.42 (b) Uncorrelated Adrenal Colon Gall bladder Heart Kidney Liver Lung Oesophagus Ovary Pancreas Placenta ProstateRectum Testis Urinarybladder 4 5 6 7 8 9 2 4 6 8 Uhlén et al. (mRNA) log2(FPKM+1) Pa nd ey la b (p ro te in ) lo g2 (P PK M +1 ) TPM2 (ENSG00000198467) Spearman ρ = 0.69 • Pearson r = 0.74 (c) Well correlated Adrenal Colon Gall bladder Heart Kidney Liver Lung Oesophagus Ovary Pancreas Placenta Prostate Rectum Testis Urinarybladder0.0 2.5 5.0 7.5 0 2 4 6 Uhlén et al. (mRNA) log2(FPKM+1) Pa nd ey la b (p ro te in ) lo g2 (P PK M +1 ) HGD (ENSG00000113924) Spearman ρ = 0.91 • Pearson r = 0.97 (d) Highly correlated Figure 6.21. Different cases of correlation for protein/mRNA pairs. The scatterplots show the expression of the genes as mRNA in Uhlén et al. data on the x-axis and as a protein in Pandey Lab data on the y-axis across their common fifteen tissues. Figure 6.21a shows that SSR3 apparently has a protein expression anticorrelated to its mRNA’s one: when the mRNA expression is low (e.g. Pancreas, Liver), the protein expression is high and when the mRNA expression is high (e.g. Adrenal gland, Prostate), the protein expression is low. Figure 6.21b features COL2A1 and for which protein expression is observed for many tissues (e.g. Oesophagus, Kidney, Pancreas) while no or very low mRNA expression has been captured. Figure 6.21c shows that TPM2 is expressed in all tissues and the expression of the protein is well correlated to the mRNA’s one. Figure 6.21d shows that HGD has highly correlated protein and mRNA expression when found in a tissue. Note that true perfect correlation can be observed for pairs where both mRNA and protein are tissue-specific (TS).158 6.3 wide correlation range for protein/mrna pairs number of pairs to be studied across the highest number of tissues. It also allows continuity with the above tissue-centric analyses. Complementary results for Spearman correlation and sets that include GTEx data can be found in digital format at http://www.barzine.net/ ~mitra/thesis. 6.3.1 TS protein enrichment for the most correlated pairs Figure 6.22. TS proteins percent as a function of the considered number of genes (ranked by Pearson correlation). Before being plotted, the genes are first ranked in decreasing order of Pearson correlation between the expression levels of the two considered datasets across their shared tissues. The top plot, which is a reproduction of Figure 6.20, is for interpretive convenience only. The two parts of the figure have corresponding x-axes. The x-axis of the top plot presents the rank associated with the Pearson correlation coefficients of the genes on the y-axis. The x-axis of the bottom plot represents the upper limit rank up towhich are considered the genes (for the calculus). The lower limit rank is 1. The y-axis of the bottom plot displays the percentage TS protein for a set of (ordered) genes. 159 integration of transcriptomic with proteomic data Here, I investigate the incidence of the TS proteins on the level of correlation between mRNAs and their proteins. I first compute for each gene the correlation between the expression levels of Uhlén et al. data and Pandey Lab data. Then, I organise the genes in a sequence⁸ by ranking them by decreasing order of correlation. Thus, the first most correlated gene’s rank is 1. Finally, for each rank 𝑘, I calculate the number of genes with TS proteins among the first 𝑘 ranks before converting it into percentage. Equation (TS protein percentage) on page 256 presents the formula with which I compute the percentage of TS proteins. Figure 6.22 illustrates the percentage of TS proteins for a given number of considered genes, which have been ranked in decreasing order of Pearson correlation coefficients. Similarly to Figure 6.20, randomised protein/mRNA pairs (average for 10,000 permutations) in grey and mRNA/mRNA pairs (Uhlén et al. data and GTEx data) in green are providing some context. The TS protein percentage in pink is extremely high for the highest range of observed correlation between the expression of the Pandey Lab proteins (quantified with PPKM) and Uhlén et al. mRNAs pairs across their shared fifteen tissues. The TS protein percentage then decreases quickly before finally increasing slowly for the lower range of correlation. Thus, the genes identified as TS proteins in Pandey Lab data enrich the most correlated and most anticorrelated mRNA/protein pairs clearly. Most of the pairs have a Pearson correlation above 0.5, although they show a wide range of Pearson correlation, from 𝑟 = −0.77 (for ZNF770) to 𝑟 = 1 (for several genes). However, the correspondingmRNA/mRNApairs between Uhlén et al. and GTEx data show a more evenly distribution of these genes through the 10,000 highest gene correlations after an initial peak. The average of the 10,000 permutations (in grey) of the Pandey Lab with the Uhlén et al. data also shows an initial high TS protein percentage that drops to the global amount of TS proteins among the complete set of shared genes. Overall, TS proteins represent about one-fifth of all the genes. 6.3.2 Gene expression profiles clue about biological and technical differences As mentioned previously, either artefacts or biology may explain the observed low correlations. It is rather difficult to identify which artefacts are specifically impacting each protein/mRNA pair as there are many and can occur in combination. However, two major technical sources of artefacts, which have been reported, are dispersion for the lowly expressed mRNAs and saturation for the highly expressed proteins. See Sections 1.2 and 1.3 (from p. 7) for more details. Figure 6.23 shows possible profiles of relationships between the protein and mRNA pairs. Well-correlated pairs are in the grey area delimited by the green line. For these genes, 8 A sequence is an ordered set. 160 6.3 wide correlation range for protein/mrna pairs the expression levels of the protein observed in a tissue is tightly related to the levels of the mRNA. Although the data have been sampled from different sources, the stronger associations suggest a similar translation process across the tissues andmay also imply less post-translational modifications that can hinder protein identification and quantification than for other genes. Transcriptome Pr ot eo m e Figure 6.23. Possible mRNA/protein expression profiles due to biological reasons. Genes with well-correlated transcriptomic and proteomic expression are found in the grey area delimited by the green line. The genes in the yellow area, i.e.which present a high protein concentration and a low mRNA concentration, may have stable proteins and mRNAs with short half- lives. The genes in the blue area, which present a low protein concentration and high mRNA concentration, may have a highly regulated translation, a protein challenging to capture with MS or a misfit between the annotation definition of their mRNA and protein. Genes in the yellow area present a high protein concentration and low mRNA concentration. One possible cause may be that these genes have stable proteins and mRNAs with short half-lives. On the other hand, genes in the blue area present a low protein concentration high mRNA concentration. These latter genes may have a protein that is more challenging to capture (either because of unsuited protocols as described in Section 1.3 or annotation misfits, e.g. STAU2 as shown in Figure E.3), may forego through higher regulation through their translation or may be actively exported outside of the tissue that synthesises them. Anticorrelated genes are another category that will likely require further analysis for a better overall understanding. The observed anticorrelation between the proteins and mRNAs expression can be caused by various elements, which may include tissue-dependent isomers (either for the mRNA or protein) expression, self-regulation or a variable secretion rate of the protein. At the time of writing, the most accurate way to classify the protein/mRNA pairs remains empirical, i.e. human interpretation based on the joint visualisation of protein and mRNA expression across the different tissues (and dataset combination). 161 integration of transcriptomic with proteomic data As empirical approaches are better designed to examine a few genes of interest per study than to give a broader view of the expression landscape, in the next section I favour a more general strategy instead. I study the three gene groups of interest highlighted above with a GO enrichment analysis (see Section 1.4) to find possible biological factors that may differentiate them. As the pairs with a TS protein enrich both the most correlated and the most anticorrelated genes, I also choose to study them but separately, and thus, I remove them from the most correlated and anticorrelated gene lists. 6.3.3 Distinct functional enrichment profiles for pairs with a TS protein, and for the best correlated and most anticorrelated ones A GO enrichment analysis (GOA — see Section 1.4.1) uses gene ontologies that provide defined terms covering the gene product (i.e. mRNA and protein) properties. These terms are structured into categories, which helps to assess whether a gene set is associated with a biological process (BP), a molecular function (MF) or a cellular component (CC). The three ontologies are included in the Bioconductor package org.Hs.eg.db [Carlson, 2019] for analysis in R⁹ [R Core Team, 2019]. The enrichment computation requires the comparison between the GO terms set associated with each studied list of genes to a reference. I consider three gene lists: the three hundred best correlated and the three hundred most anticorrelated protein/mRNA pairs and all the pairs (2,613) with a TS protein. As a reference, I use the 12,921 matching protein/mRNA pairs between Pandey Lab (PPKM quantification) and Uhlén et al. data. The Bioconductor package clusterProfiler (v3.12) [G. Yu et al., 2012] provides a function enrichGO, which implements the GOA as the over-representation test described by Boyle et al. (2004) and handles all the required statistical testing and (Benjamini and Hochberg [Benjamini et al., 1995]) correction. The lists of the best correlated pairs and the ones with a TS protein present distinctive enrichment profiles through the three ontologies. However, the most anticorrelated pairs list presents an enrichment only for biological process (BP) terms. All the individual enrichment GO analyses along with the comparison of the enrichment for the CC and MF ontologies are provided as digital supplementary material. Figure 6.24 presents the results for the comparison of the enrichment of the three considered gene lists for the BP ontology. The left side of the figure is a heatmap that marks the pairs’ associations with the GO categories on the y-axis. It includes all protein/mRNA pairs of the three studied gene lists. They are sorted on the x-axis in decreasing order of their Pearson correlation. 9 R — https://cran.r-project.org/ 162 6.3 wide correlation range for protein/mrna pairs Fig ur e6 .24 .E nr ich ed GO ca teg or ies fo rt he ge ne sw ith aT Sp ro tei n, th et hr ee hu nd re dw ith th eh igh es tc or re lat ion sa nd th et hr ee hu nd re dw ith th eh igh es ta nt ico rre lat ion s. Th es ha red y-a xis of the tw op art si nc lud es the en ric he dG O cat eg ori es (fo ra ny of the thr ee gro up s). Th el eft pa rt of the fig ur es ho ws ah ea tm ap wh ere all the inc lud ed pro tei n/m RN A pa irs (i.e .3 ,21 3) are so rte d by the ir Pe ars on co rre lat ion on the x-a xis an dt ha te ach ass oc iat ion of ap air wi th aG O cat eg ory is ma rke d. Th er igh tp art sh ow s the res ult so ft he BP GO A an aly sis wi th clu ste rPr ofi ler (re fer en ce: the co mp let es et of 12 ,92 1g en es) ;th et hr ee gro up sa re sh ow ed on the x-a xis wi th the ir nu mb er of ge ne sa nn ota ted in the co ns ide red on tol og y. Fo re ach do t,t he siz er ep res en ts the rat io of pa irs wi thi ne ach gro up co ntr ibu tin gt oe ach cat eg ory en ric hm en t,a nd the co lou rin dic ate st he irs ign ific an ce. 163 integration of transcriptomic with proteomic data These GO categories, shared between both sides of the figure, combine the five most enriched categories for each of the three lists (TS proteins, best correlated and most anticorrelated pairs). The categories enrichment is provided by another clusterProfiler function, compareCluster (which internally invokes enrichGO), which has also produced the right side plot. No cross-enrichment of GO category exists between the three groups (ensured by the ‘includeAll=TRUE’ option). Notably, the GO categories associated with each gene list create coherent groups of similar biological processes. The genes with a TS protein are enriched in terms for specific signalling, either for its detection (‘detection of chemical stimulus’, ‘sensory perception of chemical stimulus’, ‘sensory perception’), as a response to a signal (‘G protein-coupled receptor signalling pathway’) or as a regulation (‘regulation of signalling receptor activity’). Concurrently, genes with the best correlated mRNA and protein pairs of expression are associated with catabolic processes¹⁰, and genes with the most anticorrelated pairs are related to ribosomes and ncRNAs regulation, thus by extension to the translation and its regulation. The GO terms enrichment analysis with the CC ontology shows that the best correlated genes are the most enriched for the following categories: the ‘postsynaptic membrane’, ‘apical plasma membrane’, ‘apical part of cell’, ‘cluster of actin-based cell projections’, ‘brush border’ and the ‘cornfield envelope’. On the other hand, the pairs presenting a TS protein show a slight enrichment for ‘ion channel complex’, ‘transmembrane transporter complex’, ‘transporter complex’, and ‘cation channel complex’. Whereas the categories associated with the TS proteins are more ubiquitous and can concern every cell type, the enriched categories for the best correlated genes are referring more specifically to subsets of cells. The localisation of the best correlated pairs probably suffers from their overall ubiquity. Thus, the results for the best correlated genes are probably an artefact of annotation even though they comparatively rely on more genes. The enrichment analysis with the MF ontology for the best correlated pairs points to different activities: oxidoreductase, cofactor and transmembrane transporter or signalling activities (‘anion transmembrane transporter activity’, ‘oxidoreductase activity, acting on CH- OH group of donors’, ‘oxidoreductase activity, acting on the CH-OH group of donors NAD or NADP as acceptor’, ‘cofactor binding’, and ‘coenzyme binding’). The pairs with a TS protein are also associated transporter and signalling activities (with the following five categories: ‘transmembrane signalling receptor activity’, ‘signalling receptor activity’, ‘molecular transducer activity’, ‘channel activity’, and ‘passive transmembrane transporter activity’). 10 Catabolic processes are an energy release source and depend onmolecules requiring to be break down [Alberts et al., 2002]. 164 6.4 discussion Put together, these results suggest that when there is a high correlation or anticorrelation between anmRNAand its protein, biological processes play amore likely role than possible technical confounding. The anticorrelated pairs fail to present any enrichment for a specific cell compartment or a molecular function. Thus, it implies that regardless of their localisation within the cell or their chemical properties (which relate to their molecular function), the bottom-up MS studies manage to capture most of the proteins with variable effectiveness. Nevertheless, bottom-up MS studies favour some proteins, and missing proteins are a primary source of ambiguities [Poverennaya et al., 2017]. Hence, comparing the relative expression levels of proteins within a tissue may lead to misinterpretations. On the other hand, while it requires caution, the relative expression levels of each protein across tissues ought to provide biological insights. Finally, running all the gene-centric analyses presented above with the other possible combinations of parameters mentioned for the tissue-centric analyses gives similar results. 6.4 discussion In this chapter, I describe the integration and comparison of independent large-scale proteomic and transcriptomic expression datasets of undiseased human tissues. After assessing the range of correlation between the two biological layers, I have tried to identify possible factors that may influence the association between the expression of the mRNAs and proteins. I have employed both tissue- and gene-centric approaches. Building on insights gained in previous chapters (particularly Chapters 4 and 5), I have restricted the integration study to the three following independent sources: Uhlén et al. [Uhlén, Fagerberg, et al., 2015] and GTEx [Melé et al., 2015] data for the transcriptomics and Pandey Lab data [M.-S. Kim et al., 2014] for the proteomics. The three datasets share twelve tissues, while the combined datasets based only on Uhlén et al. and Pandey Lab data present three additional tissues. The above analyses provide the comparison of mRNA and protein expression across fifteen tissues for 12, 921 pairs as they include Uhlén et al. data and Pandey Lab data quantified with our PPKM method (see Chapter 5). This new quantification allows encompassing about twice as many proteins than with the standard state-of-art method (see Chapter 2), which identifies 6,428 proteins only. The tissue-centric analyses show that even independently sourced proteomics and transcriptomics of similar tissues present reasonable correlation coefficients. For instance, the range of Spearman correlation (𝜌Oesophagus = 0.39 ≤ 𝜌𝑖 ≤ 𝜌Liver = 0.62) is consistent with the literature, either published before or during this study (see examples further below). 165 integration of transcriptomic with proteomic data Besides, the new PPKM quantification method for proteins provides similar Pearson correlation ranges (𝑟Oesophagus = 0.38 ≤ 𝑟𝑖 ≤ 𝑟Liver = 0.61) to ones previously described for cell studies specifically designed for the joint integration of same-sourced proteomics and transcriptomics (e.g., Marguerat et al. (2012), Schwanhäusser et al. (2011), Schwanhäusser et al. (2013), and J. J. Li et al. (2014)). Most remarkably, all the same-tissue pairs of transcriptome and proteome have a statistically significant correlation despite the tissue proteomic expression profiles closeness. I have based my following considerations and discussion on the Pearson correlation results even though Spearman correlation is often used in the literature when comparing independent sources of data. The use of the Spearman method is regularly motivated by the lack of data distribution normality. As shown in Figure 5.7, the PPKM quantification produces protein expression levels that share the same logit-normal profile of distribution observed for the mRNAs (see Chapter 3), thus allowing an appropriate use of the Pearson method. Next, I have considered several possible properties that might be factors influencing the mixed correlation levels. I have compared the expression breadth of the mRNAs and the proteins before comparing the most specific proteins and mRNAs of each tissue to one another. The mRNAs and proteins expression breadths (i.e. the number of tissues within which an mRNA or protein is expressed) share the same overall shape for their distribution, but are only partially concordant. For example, at 5 FPKM, only about 26% of the proteins that are expressed in one tissue and about 40% that are expressed in fifteen tissues have an mRNA with an identical breadth of expression. The analysis of expression breadth highlights noteworthy facts — including a few recently reported in the literature; Testis displays the most unique and diverse expression both at transcriptomic and proteomic levels (see also D.Wang et al. (2019) and Y. Zhang, Q. Li, et al. (2015)); Liver (the most correlated tissue) presents the second-highest number of mRNAs and proteins with a unique expression breadth. Besides, when the expression breadth of a protein is unique, the expression breadth of its related mRNA is more likely to be unique at the threshold of 1 or 5 FPKM as well. However, the expression breadth of mRNAs gives no indication of the proteins’ one. Nonetheless, mRNAs’ and proteins’ expression breadths convey part of the biological signal consistently. From the Jaccard indices of the proteins and mRNAs solely expressed in two tissues, I have built hierarchical trees, and their consensus tree outlines the Ovary/Testis and the Kidney/Liver clusters at both proteomic and transcriptomic layers. I have then compared the TS proteins with the TS mRNAs. The overlaps between the 𝑛 TS proteins (expressed in a single tissue only) and the 𝑛 most TS mRNAs of each tissue are non-empty, and except for one tissue (Urinarybladder), statistically significant with 166 6.4 discussion an 𝛼 level of 0.01. While most tissues have similar ranking trends for their overlaps of TS genes and correlation between their proteomics and transcriptomics, either high levels (e.g. Liver, Testis or Pancreas), medium (e.g. Prostate) or low ones (Urinarybladder, Lung or Gallbladder), other tissues have not. Regarding the gene-centric analyses, they show that about 6% of the mRNAs/proteins are highly correlated (𝑟 ≥ 0.8), about 18% of them are well correlated (0.8 > 𝑟 ≥ 0.5), about 75% of them are poorly correlated (0.5 > 𝑟 ≥ −0.5), and less than 1% of the pairs are anticorrelated (−0.5 > 𝑟 ≥ −0.83). A fourth of the genes included in this study have a TS protein. Most of them have a Pearson correlation above 0.5, and they considerably enrich the set of most correlated pairs of mRNAs and proteins. However, their correlation range remains rather wide (−0.77 ≤ 𝑟 ≤ 1). Finally, using a GO enrichment analysis, I have investigated whether the three groups of mRNA/protein pairs (the ones with a TS protein, the best correlated pairs and the most anticorrelated ones) are related to any biological or technical reason. While the most correlated pairs and the ones including a TS protein present an enrichment in biological process (BP), molecular function (MF) and cellular component (CC) terms, the anticorrelated pairs present an enrichment only in BP terms. Overall, the most correlated mRNA/protein pairs seem highly associated with catabolic processes; the pairs with a TS protein are most likely involved with specific signalling (its detection, transduction or answer) and more concentrated in the transmembrane area. The most anticorrelated pairs are enriched in regulation processes. On top of the true relationship between the expression of the mRNAs and proteins, correlations are dependent on the quantification of identified molecules. In general, while many biological reasons (e.g. protein degradation) can lead to midrange or low correlation levels, other technical artefacts may also be at play: saturation effect of more abundant proteins (see Section 1.3.2); degenerate peptides (see Section 1.3.4.3); quantification inconsistencies between methods; platforms and studies [Dapas et al., 2017; Aebersold, Agar, et al., 2018]; annotation, or other oft-neglected sources, e.g. gene or isoform length. One example where the annotation might be at fault is STAU2 (Pearson 𝑟 = −0.59; Spearman 𝜌 = −0.65), which appears to be known to have different annotations in the proteomic and transcriptomic communities. Thus, STAU2’s anticorrelation observed between its mRNA and protein expressions might predominantly result from artefacts rather than any biological cause. Normalisation methods for RNA-Seq require the genes or transcripts length (see Section 1.2.5.4). To keep congruity with the gene length employed by EBI Gene Expression Atlas¹¹ [Petryszak, Keays, et al., 2015], I use the sum of the lengths of all its 11 EBI Gene Expression Atlas — https://www.ebi.ac.uk/gxa/home/ 167 integration of transcriptomic with proteomic data collapsed exons, which is graciously provided by the metapipeline iRAP¹² [Fonseca, Petryszak, et al., 2014]. Although, each gene has various mRNA isoforms [Gonzàlez-Porta, 2014] and proteoforms [Aebersold, Agar, et al., 2018] I chose to perform all the analyses at gene-level expression (see p. 67) due to the many existing criticisms about current algorithms’ accuracy with isoforms [Engström et al., 2013; Jänes et al., 2015; Dapas et al., 2017]. Yet, most genes present one dominant transcript [Gonzàlez-Porta et al., 2013]. One possible method to improve the mRNA/protein correlations is to identify the most dominant transcript isoform of each gene and use their length for the normalisation. This method ought to be rather easy to implement, either with the help of resources like APPRIS¹³ [Rodriguez et al., 2018] or through a direct data analysis. More generally, besides improving current results, (even partial) better identification of the mRNAs and proteins isoforms will most likely unveil still undetected divergences. In terms of the true relationship between the expression of mRNAs and proteins, it is imperative to remember that the proteome and the transcriptome I use in the analyses are independently sourced and aggregated over several individuals. Therefore, some genes displaying mixed correlation in this thesis may present high correlation or anticorrelation in matched samples. The most affected genes are the most sensitive ones to inter-individual variation and batch effects. On the other hand, the highly correlated (or anticorrelated) pairs highlighted above are more likely having a robust expression across individuals and time, while being less subject to technical noise. Among the studies published during my thesis’ works, three are most notably pertinent to the integration and comparison analyses I have outlined above. A first published comparison [Kosti et al., 2016] of the expression data of GTEx [Melé et al., 2015] and the Pandey Lab [M.-S. Kim et al., 2014] provides weaker results than the above-presented ones as the authors have based their complete study on data provided as- is by the primary studies. Overall, the range of Spearman correlation they present for the tissues does not exceed 𝜌 = 0.5. Additionally, this study also fails to show any functional enrichment in their selection of matched pairs of mRNA and protein. Hence, although these primary data resources are useful for others to appraise the presence of a given gene in a particular tissue, more thorough uses need more considerations and probably preliminary treatments. Franks et al. (2017) integrate a subset of the Uhlén data included in this chapter analyses as they have extracted their transcriptomic data as-is from Fagerberg et al. (2014) to integrate it with the data they have reprocessed from Pandey Lab data [M.-S. Kim et al., 2014] and Wilhelm et al. (2014). The study’s primary aim is to quantify the post-transcriptional regulation of the genes through their translational efficiency. To this end, they compute 12 iRAP — https://nunofonseca.github.io/irap/ 13 APPRIS — http://appris-tools.org/ 168 6.4 discussion for each gene a protein-to-mRNA (PTR). PTRs ease the assessment of the gene expression variability across their set of tissues. This study underlines the utmost caution required when one considers the transcriptome as a possible proxy for the proteome. It also reminds that data quality and reliability are primary caveats for possible analyses. Franks et al. (2017) also show that genes from the same set of GO terms display similar relative protein- to-mRNA (rPTR) profile across tissues, i.e. they have higher and lower rPTR in the same tissue sets. This observation suggests concerted functional regulations. Overall, the results presented in this thesis are confirmed and unsurprisingly improved by D. Wang et al. (2019) since they have matching samples (same sources) for the proteomics and transcriptomics. Moreover, there are many biological replicates per tissue. In that follow-up study, the authors have generated proteomics data¹⁴ from the original samples they had used to produce transcriptomic data¹⁵ [Uhlén, Fagerberg, et al., 2015], which I am using in this thesis, including this chapter. While the (Spearman) correlation between the transcriptome and proteome of each tissue has a similar range globally, the expression of an mRNA and its protein across the tissues have a positive correlation in about 90% of all cases and half of them are statistically significant. They also report that there is a core set of ubiquitously expressed pairs of mRNA/protein and that key differences between tissues are more characterised by the level of expression of the molecules rather than their presence or absence. When they compare the highest expressed proteins and mRNAs, they observe a limited overlap that, in my opinion, further confirms that TS genes are more pertinent for integration than the highest expressed genes. They observed that while disease-associated genes are expressed more globally, G protein-coupled receptors are mostly restricted to specific tissues, and are often identified drug targets. Finally, the authors point out that part of the observed discrepancies in the proteomic and transcriptomic expression may be due to the difference in strategies of each community on how to handle the degenerate peptides or multireads. Our PPKM quantification presented in Section 5.2 is one solution to this issue, and it proves to improve the results, particularly for the Pearson correlation. A companion study, by Eraslan et al. (2019), reports that mRNA expression variation across a set of tissues is a better predictor for protein levels than considering all mRNAs expression levels in a tissue. The authors have thus computed a protein-to-mRNA (PTR) for over 11,500 genes. The study uses these PTR to model and probe possible regulatory mechanisms. However, while Franks et al. (2017)’s PTRs may be partially miscalculated because of the independent sampling sources, for Eraslan et al. (2019), there might be an overfitting problem. Further analyses based on other datasets are required to ensure the reproducibility of these results. They warn that for many genes, PTRs is only useful as a gauge of the protein abundance magnitude order. Considering the previous papers together with my results, I have similar correlation levels for same-tissue pairs of independent proteomics and transcriptomics to the ones 14 The proteomic data can be retrieved in PRIDE ID: PXD010154 — https://www.ebi.ac.uk/pride/archive/projects/PXD010154 15 ArrayExpress ID: E-MTAB-2836 — https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2836/. 169 integration of transcriptomic with proteomic data for matched samples (i.e. collected from the same source) as I have processed the raw data with consistency. Despite many deployed efforts, Franks et al. (2017) and Eraslan et al. (2019) suggest that predicting protein levels directly from mRNA levels is still unlikely at the moment. However, GO information can be useful to appraise protein expression as confirmed in the recent collaborative work I was involved. With the help of Dr Kārlis Freivalds, we have shown that deep learning algorithms that include GO information can reasonably predict the order of expression magnitude of missing proteins in a given label-free MS proteomic study from RNA expression data. While our approach, described in [Barzine, Freivalds, Wright et al. — Barzine et al., 2020], is not an imputation method, it can be used in complement or instead of one. Beyond any predictive interest, GO annotation, through enrichment analyses for instance, can provide biological or mechanism insights and help the design of new research avenues. Previous studies have reported similar results to the highest correlated pairs of mRNA/protein that are enriched with catabolic genes, and the most anticorrelated ones that are enriched for regulatory processes. Vogel, Abreu, et al. (2010) have observed the highest protein-per-mRNA ratios for mammalian metabolic genes. Several papers [Vogel and Marcotte, 2012; Schwanhäusser et al., 2011] observe higher expression stability for RNAs and proteins related to mammalian metabolism and a more rapid degradation tendency for proteins involved in transcriptional regulation (and chromatin organisation). Furthermore, organs appear to present different metabolic profiles [Berg et al., 2002], and, regardless of each individual’s particulars (e.g. sex, alimentary diet, age, physical level), the expression of many catabolic genes only varies according to the tissue. Accepting the premise that, regardless of the tissue, a similar sequence of regulatory steps apply to a gene (from its transcription to possible post-translational modifications), taken together the above facts may suggest that the catabolic genes may have a more straightforward modulation of their expression than the other genes. In organic chemistry, it is well-known that the more a molecule requires steps to be produced, lesser is its yield (i.e. final amount). One possible way to test this hypothesis is to explore if the amount of substrate can regulate these genes’ expression levels. An indirect approach can be the comparison across tissues of the co-expression profiles of these catabolic genes with others involved with the active or passive transport of the molecules to be degraded (e.g. transmembrane channel proteins). Another research avenue can be the exploration of possible links between the tissue-specific (TS) genes and the G protein-coupled receptors. 170 6.5 conclusion 6.5 conclusion Despite possible batch effects and technical noise, a part of the biological signal is strong enough to be consistently captured by the transcriptome and proteome through direct and indirect analyses. At tissue level, the independent transcriptomics and proteomics included in these analyses give similar Spearman or Pearson correlation to samples produced from the same biological sources. The signal appears stronger for more homogeneous tissues, e.g. Liver, or which expression is more distinctive, e.g. Testis. In any cases, a significant number of tissue-specific (TS) genes are consistently shared between proteome and transcriptome. Besides, even indirect analyses, such as hierarchical clustering trees created with genes that are only shared by two tissues, can highlight identical structures between the two biological layers. On the other hand, the gene-centric analyses have shown that while only 24% of the mRNA/protein pairs are well correlated (r>0.8), most (about 73%) have a positive correlation for their expression. While the highest correlated genes are enriched in TS proteins, many of them are expressed ubiquitously. The GO enrichment analysis highlights that genes presenting a TS protein are enriched in specific signalling, genes with the highest correlated mRNA/protein pairs in catabolic processes and genes with the most anticorrelated ones in regulatory processes. Providing proper care and consistent processing, onemay use these independent resources as part of their study to achieve lower but still significant results. Results can improve from better identification and quantification alone. A possible approach can be the standardisation of annotations between communities. Optimisation or new algorithmic strategies are other ones as illustrated by our PPKM quantification applied for this thesis’ works. 171 Has been done. Can be done. Must be done… Fandarel [McCaffrey, 1964] 7 CONCLUD ING REMARKS At the time I started my doctorate, an increasing number of gene expression datasets assaying undiseased human tissues for RNA expression were published. In addition, the first genome wide MS-proteomics studies were performed soon after and raw data made available. My primary aim was to integrate and compare these data, first, on RNA level, and second, to compare RNA and protein expression. I concluded that the published RNA measurements were robust and different datasets were highly consistent when processed uniformly. I also found that correlation between RNA and protein expressionwas typically higher than 0.5, though different groups of genes behaved differently and correlation in some tissues was better than others. Lately, the focus of RNA gene expression studies has been shifting towards single cell level, however genome wide proteomics studies are still in their infancy. Therefore, comparative studies of genome wide transcriptomics and proteomics data on whole tissue level are of significant interest. summary In Chapter 1, I reviewed the biological, chemical and bioinformatic aspects and challenges involved in expression studies based on the high-throughput technologies of RNA-Seq for mRNAs and MS for proteins. Then in Chapter 2, I presented the five transcriptomic and three proteomics studies that I have considered for this thesis. I also described the pipelines with which I have processed them. Since for each dataset the number and size of files are extremely large, automation is paramount to ensure consistency and minimise errors. I detailed in Chapter 3 various data quality controls and statistical approaches. I also discussed possible biases and how the contextual scope of the tissues and genes considered for analyses can influence results. To minimise errors due to context issues, I limited most of my further investigations to a common subset of tissues and expressed genes. Normalisation methods are still inadequate to treat mitochondria genes accurately. Therefore, I chose to remove them from most analyses, which led to better results as it allowed me to integrate samples that the original authors had to discard because of their lack of congruency with the other samples of the same tissue. 173 concluding remarks In Chapter 4, I integrated the five independent transcriptomic datasets and showed that the biological signal dominates the technical noise. All datasets have a higher interstudy correlation for the same tissues than any intrastudy correlation for different tissues. This trend is stronger for more recent studies. I tested various criteria as possible driving forces of the high interstudy tissue correlations. I found that the most variable genes and tissue-specific (TS) genes are more robustly identified across studies than the highest expressed ones. Besides, I also noted that the inclusion of external resources, especially when outdated like TiGER [X. Liu et al., 2008], requires caution. Many listed genes are either wrongly attributed to a tissue or lack to display any specificity and many TS genes highlighted by RNA-Seq are missing. Overall, the integration has revealed that genes present identical general profiles across the studies, even though direct comparisons of independent data may be still impossible. By repeatedly showing that genes show a similar expression profile for a tissue across studies, my analyses prompted the creation of the heatmap visualising expression data and its associated widget for baseline expression data in EBI Expression Atlas¹ [Petryszak, Keays, et al., 2015]. Finally, I provided a core set of genes that are expressed ubiquitously or as TS consistently across all studies. The comparison of three available proteomic datasets in Chapter 5 illustrates the fragmentation and the disparity of the high-throughput MS-based proteomics. MS detection variability induces considerable technological noise, which explains the intrastudy correlations I observed between different tissues are higher than the interstudy correlation for the same tissues. I provided curated sets of the TS and ubiquitous detected proteins. Finally, with the help of Dr James Wright, I have devised the PPKM quantification method, which quantifies more proteins by also accounting for degenerated peptides. Chapter 6 reported the integration of the independent proteomic and transcriptomic data. Regardless of the quantificationmethod, I found similar Spearman correlation levels for the included independent studies to those typically observed in the literature for same-sourced proteome and transcriptome source. While proteomic standard state-of-art processing leads to very low Pearson correlation: 0.04 ≤ 𝑟 ≤ 0.28, our PPKM quantification broadly improves this range (0.38 ≤ 𝑟 ≤ 0.61). Two tissues, Testis and Liver, have exhibited distinct characteristics across the various analyses that are supported by the literature. Testis has the most diverse and specific expression at both transcriptomic and proteomic levels. On the other hand, Liver has the most robust expression across studies and the highest correlation between its mRNAs and proteins expression levels. However, there are shared coherent gene signatures across tissues between the proteome and transcriptome. Furthermore, even indirect analyses can capture part of the biological signal. The tissue specificity of proteins and mRNAs are globally more relevant than their expression levels. In addition to the significant overlaps of TS proteins with the most TS mRNAs, most genes with a TS protein have a high correlation between their mRNA and protein expression across tissues. GO analyses show distinct profiles for three gene lists of interest. First, the genes with a TS protein are enriched for specific signalling, including signal detection, 1 EBI Expression Atlas — https://www.ebi.ac.uk/gxa/home 174 concluding remarks response pathway and regulation. Second, the geneswith the highest correlations between their mRNA and protein expression are enriched for catabolic processes. Thirdly, the genes with the highest anticorrelation between their mRNA and protein expression have shown enrichment for ribosome complexes and ncRNAs regulation. I provide (digitally) the complete set of mRNA/protein pairs with their correlation across the common set of tissues. I also supply the list of overlapping TS proteins and mRNAs. practical challenges Throughout this thesis’ analyses, I had to overcomemany practical challenges. While most of the difficulties encountered ordinarily pertain to Big data projects, one unexpected issue was the current global complexity state of the proteomic world. Proteomics: a hard field to grasp by newcomers While constituting a technical obstacle, the characterisation of proteins (or assimilated complexes) is a strong interest for many fields (e.g. molecular biology, medicine, drug design, green chemistry). As shown in Section 1.3, the physicochemical properties of the proteins make them intrinsically complex to study, which can partly explain the complexity of the theoretical approaches. Understanding high-throughput proteomics requires many prerequisites. However, there is a lack of a clearly identified entry-level document reviewing the field from the bench to normalised protein levels. The available teaching materials are mostly practical or experimental oriented [Y. Zhang, Fonslow, et al., 2013; Z. Zhang et al., 2014; Domon et al., 2010] or on particular steps, e.g. protein inference [He et al., 2016]. The scattered information hampers the ability of newcomers to achieve a global vision of high-throughput proteomics. Big data Challenges Big data is often characterised through, what was first defined by IBM², the 4 V’s: volume, variety, veracity and velocity. Each of them can entail issues at different project levels. Volume The volume of files (see Table 2.2) to handle and process just for the transcriptomics is overwhelming. It requires appropriately dimensioned infrastructure like in the EBI, which can provide high-throughput computing. It is in practice impossible to reproduce the complete work underlying this thesis in a personal computer within a reasonable time. Although (commercial) solutions are increasingly in use for academic projects, dedicated 2 https://www.ibmbigdatahub.com/infographic/four-vs-big-data 175 concluding remarks storage, and high computing facilities ease the analyses considerably and allow more in- depth testing. Even if best practices are continuously refined for transcriptomics, there are still many factors that can be improved and tuned. Besides the raw data, storage capacity is also required for the intermediate and final files. Organising such a large amount of data was time-consuming and challenging at times. Variety The variety of the type of input data files is kept to a minimum as the data was retrieved from public or academic repositories that follow community guidelines³. Issues still ensued from the matching of samples or tissues across the studies. For many tissues, I chose to mix several ‘body parts’ from the same tissue or organ (e.g. ‘Left Ventricle’ and ‘Atrial Appendage’ for Heart) in GTEx to match them to the other studies’ tissues. Perhaps, in some case, one ‘body part’ is perfectly matched to the samples from another study where the authors have only reported the tissue instead. Hence, for these cases, keeping only one of the ‘body parts’ would have been a better choice if additional data had been available. Another source of variety that I have limited in the above work is the diverse annotation versions. Even when mapped to the same genome and annotation versions, discrepancies persist between how transcriptomics and proteomics are defined and assigned to the genes. A possible improvement of the present work will be to develop new quantification tools that use the chromosome coordinates to which mRNAs and proteins map instead of using their gene identifiers. Veracity With all possible tunable parameters for the raw data processing, the data to be integrated can vary widely. Although I have tried a limited number of combinations, my study shows that the results’ trend remains the same regardless of the chosen settings. Moreover, as the number of datasets I included in my analyses grows, the results became increasingly more stable. Many of them are also confirmed by the literature, adding credibility and confidence to the findings, especially considering the recent discussions on reproducibility crisis [Morrison, 2014; Glenn Begley et al., 2015; Goodman et al., 2016; Fatovich et al., 2017; Coiera et al., 2018; Lindner et al., 2018]. However, results for individual genes can vary from one set of settings to another one, and need to be considered with more caution. Velocity Finally, the velocity of new data availability and the required preparation time made the prospect of including all the latest studies in my analyses impractical. Unfortunately, although new GTEx samples or other tissue studies (e.g. Oncobox Atlas of Normal Tissue Expression (ANTE) [Suntsova et al., 2019]) kept being released, I had to stop including 3 ENA for the sequencing data and ProteomeXchange for MS/MS proteomics. 176 concluding remarks them in my study. Likewise, I ceased updating the genome and annotation and settled for GRCh38.p1 and Ensembl 76. My resolving approach To minimise errors, assure consistency and ease future reiterations or extension of these analyses, I provide script files that can reproduce the whole study and its results. I have also automated and structured the analyses throughmodular functions asmuch as possible. I have avoided any manual change and have documented all the name change and sample pairings in the scripts. I provide the necessary code to replicate all of the above (and complementary) results as supplementary material⁴. I chose to develop the analyses with open-source software around the programming language R⁵ [R Core Team, 2019]. See Appendix F for the complete list of R packages involved in this work. This language provides statistical and visualisation functions and is easily expanded through packages developed by the community. While packages are allowing to easily built on previous work, they can be highly interdependent and may evolve rapidly. To draw from an extra package (or fix an identified bug), a comprehensive and time-consuming update of the working environment will often ensue. In turn, I had also to update or rewrite my analyses’ code on many occasions. Furthermore, new software installations or updates are more complicated for distributed computing facilities than on a personal computer. Today, new solutions are developed to facilitate these tasks (e.g. Packrat [Ushey et al., 2018] that create isolated environment) and, hopefully, this burden will be significantly lowered. Note that Dr Nuno Fonseca, who provided me with the quantification of the GTEx data, and Dr James Wright, who provided me with the proteomics ones, have also both developed their processing pipelines with open-source software. Hence, the entirety of the thesis can be repeated (conditionally upon access to GTEx data and high-throughput computing facilities). future works Many improvements are conceivable: • The inclusion of new samples and dataset of transcriptomics (preferably with biological replicates), e.g. extend to the last version of GTEx and the ANTE dataset [Suntsova et al., 2019]. • Add the matching proteomics of D. Wang et al. (2019) to the transcriptomic Uhlén data [Uhlén, Fagerberg, et al., 2015] and then compare the results to the unmatched samples. 4 https://github.com/barzine/BaselineAtlas/tree/thesis 5 R — https://cran.r-project.org 177 concluding remarks • Work on new models of annotation or build a consensus between the current transcriptomic and proteomic annotations. Today, it is difficult to determine for some genes if the observed anticorrelation or lack of correlation between the mRNA and protein expression levels is due to the biology, batch effects or divergences between transcriptomic and proteomic annotations. • Changing the quantification (parameters or methods) may also give better results. New quantification proposal for baseline studies Most normalised quantification methods are designed for differential expression studies, particularly in transcriptomics [Dillies et al., 2013]. Most of these methods are built with the premise that only a limited number of genes present a differential expression across conditions while most gene expressions remain unaffected by the context [P. Li et al., 2015]. As a result, most normalisations are ill-suited for comparing independent samples across multiple studies. Two normalisationmethods do not imply any preconception on the study design: FPKM [Mortazavi et al., 2008; Trapnell et al., 2010] (see Section 1.2.5.4), which I use in this thesis, and TPM (Transcript per Million)⁶ [Wagner et al., 2012]. Both normalisation methods account for the global library size of each sample. While the motivation is sound, the quantification is thus contextual to how many and how much RNAs are detected. Previous efforts to improve the quantification have focused on differential expression studies. Synthetic spike-in molecules [L. Jiang et al., 2011] ensure more reliable quality controls. However, while theoretically plausible, accurate absolute quantification for the whole dynamic expression range has yet to be reached [L. Jiang et al., 2011; SEQC/MAQC- III Consortium, 2014; Hardwick et al., 2017]. Furthermore, spike-ins fail to permit the absolute quantification of the other molecules in the sample. Incidentally, Rudnick et al. (2014) find that spike-in proteins and peptides lack effectiveness for proteomic studies. For proteomics, Wiśniewski, Hein, et al. (2014) propose to use the histone to create a ‘proteomic ruler’. They can assess through this proteomic ruler the amount of DNA in the sample, and thus, give an estimate on the cell number, which provides some context to interpret the quantification of the proteins. Histone genes are, however, ill-suited for bulk RNA-Seq studies [Zhao et al., 2018]. I firmly believe that using internal standards chosen within the naturally expressed population of the studied macromolecules is more appropriate than any external additive. I expect that giving each gene (RNA or protein) expression as a ratio of the expression of a reference gene will be more robust. It will free the expression from the influence of the presence of quantification of any other gene while it will still account for the difference of sequencing depth between samples. Then, the next question is: ‘Which gene (or set of genes) will possibly enable the best normalisation?’ 6 FPKM is easily converted to TPM by scaling with a constant to correct the sum of all values in a library to 1 million. 178 concluding remarks Through this thesis’ various analyses, I have highlighted genes that have a robust and ubiquitous expression across all the studied tissues both at transcriptomic and proteomic levels. Moreover, many genes present high correlation coefficients between the expression of their mRNA and protein. These genes are the best potential candidates as reference. With Dr Nuno Fonseca, we have performed preliminary analyses in this direction across other studies to reduce the list of these candidates. However, considering the best practices in analytical chemistry [Arvid, 1997], a set of standards that can cover the complete dynamic expression ranges of the considered molecules and adjust for the saturation effects, may resolve the abundances better, and thus, be better suited than a single reference. On the other hand, studies focusing on a single tissue may better benefit from a reference built on TS genes. The recent development of single-cell transcriptomics (scRNA-Seq) [G. Chen, Ning, et al., 2019], and the probable feasibility of single cells proteomics [Marx, 2019], may ease refining which genes are the most suitable as universal references, or for specific conditions. Ongoing implementation of an application Integrating proteomics and transcriptomics remains laborious, and results can be mixed. Nonetheless, even simple comparisons can help to improve our general knowledge and current biological models. For instance, in one of our papers [Wright, Mudge, et al., 2016], we have confirmed the existence of putative proteins by observing coverage of the genome both by transcriptomics and proteomics. To help further possible projects, I am currently compiling the different analyses into a set of interactive applications that can replicate all the results and figures presented in this thesis without requiring any programming skill as a prerequisite. Figure 7.1 covers Chapter 3⁷. Once completed, one will be able to analyse and compare their own data to the different datasets I have presented in this thesis. Furthermore, I share all my code (including for the application) under a creative commons license, Attribution 4.0 International (CC BY 4.0) ⁸. Anyone can thus use, adapt or build upon this work as they wish. 7 For a live demo, see http://barzine.net/shiny/mitra/thesis/chapter3/. 8 Attribution 4.0 International (CC BY 4.0) — https://creativecommons.org/licenses/by/4.0 179 concluding remarks Figure 7.1. Preview of the application developed with R and served through a shiny server [Chang et al., 2019]. 180 APPENDIX 181 A SUPPLEMENTARY MATER IAL FORCHAPTER 1 a.1 amino acids As shown in Figure A.1, the amino acids have different chemical properties. Their primary and side chains are respectively shown in black and green. The amino acids all share the same primary chain. Figure A.1. Amino acids formulas — from Morris et al. (2016) 183 supplementary material for chapter 1 Table A.1. Molecular weight of the most common aas and their residues (from Lide, 2005) Name Abbr. MolecularFormula Molecular Weight Residue Formula Residue Weight (-H2O) Alanine Ala A C3H7NO2 89.10 C3H5NO 71.08 Arginine Arg R C6H14N4O2 174.20 C6H12N4O 156.19 Asparagine Asn N C4H8N2O3 132.12 C4H6N2O2 114.11 Aspartic acid Asp D C4H7NO4 133.11 C4H5NO3 115.09 Cysteine Cys C C3H7NO2S 121.16 C3H5NOS 103.15 Glutamic acid Glu E C5H9NO4 147.13 C5H7NO3 129.12 Glutamine Gln Q C5H10N2O3 146.15 C5H8N2O2 128.13 Glycine Gly G C2H5NO2 75.07 C2H3NO 57.05 Histidine His H C6H9N3O2 155.16 C6H7N3O 137.14 Hydroxyproline Hyp O C5H9NO3 131.13 C5H7NO2 113.11 Isoleucine Ile I C6H13NO2 131.18 C6H11NO 113.16 Leucine Leu L C6H13NO2 131.18 C6H11NO 113.16 Lysine Lys K C6H14N2O2 146.19 C6H12N2O 128.18 Methionine Met M C5H11NO2S 149.21 C5H9NOS 131.20 Phenylalanine Phe F C9H11NO2 165.19 C9H9NO 147.18 Proline Pro P C5H9NO2 115.13 C5H7NO 97.12 Pyroglutamatic Glp U C5H7NO3 139.11 C5H5NO2 121.09 Serine Ser S C3H7NO3 105.09 C3H5NO2 87.08 Threonine Thr T C4H9NO3 119.12 C4H7NO2 101.11 Tryptophan Trp W C11H12N2O2 204.23 C11H10N2O 186.22 Tyrosine Tyr Y C9H11NO3 181.19 C9H9NO2 163.18 Valine Val V C5H11NO2 117.15 C5H9NO 99.13 184 A.2 original material a.2 original material To create Figure 1.1, I used original material by Kelvinsong (https://commons.wikimedia.org/wiki/User:Kelvinsong): ‘Simplified diagram of mRNA synthesis and processing. Enzymes not shown.’ (https://commons.wikimedia.org/wiki/File:MRNA.svg) and ‘Protein synthesis’ (https://commons.wikimedia.org/wiki/File:Protein_synthesis.svg). a.3 expressed sequence tag (est) sequencing ESTs are short nucleotide sequence generated from randomly selected RNA transcript [Parkinson et al., 2009]. mRNAs are reverse transcribed into double-stranded cDNAs (either from the 5’ or 3’ end of the transcript) [Lowe et al., 2017]. These cDNAs are cloned to create libraries [Harbers, 2008] and then sequenced either by Sanger method [Sanger et al., 1975] or a more high-throughput one such as the sequencing-by-synthesis (Section 1.2.3). Although this technique is subject to sampling bias [Nagaraj et al., 2007] and often account for only 60% of an organism expressed genes [Bonaldo et al., 1996], it remains a relatively low cost alternative approach to study the transcriptome (gene discovery). a.4 microarrays Microarrays require prior knowledge (e.g. annotated genome or ESTs libraries) of the organism of interest as they exploit it to design probes (short nucleotide oligomers) that are arrayed on a solid support (e.g. a glass or silicon thin film cell) [Lowe et al., 2017; Schena et al., 1995; Bumgarner, 2013]. For transcriptome profiling, the expressed RNAs are first reverse transcribed into cDNAs (also referred as targets) and then, after being fluorescently labelled, they are complimentary hybridised to the microarray probes; the relative abundance of the transcripts is assessed by measuring the intensity of the fluorescence after the excess of unhybridised cDNAs is washed away [Lowe et al., 2017]. This technology is extremely powerful and popular as it allows global and parallel analyses of cellular activity. Microarray technology also has many variations [Hoheisel, 2006] in addition to its original cDNA version for transcriptional profiling [Schena et al., 1995], e.g. for genotyping [D. G. Wang et al., 1998; Gunderson et al., 2006], protein profiling [Hall et al., 2007; Sutandy et al., 2013; Duarte et al., 2017], splice-variant analysis [Cuperlovic-Culf et al., 2006] or transcription factor binding [Bulyk et al., 2002; Bulyk, 2007] studies. 185 supplementary material for chapter 1 a.5 fastq format @ERR030856.1 HWI-BRUNOP16X_0001:1:1:2669:1073#0/1 AAAGGATTATGCAGANGTAGGGCGTGTNNNNNNNNNNNNNGGCTGGGGNNNNNNNNNNNNNNNNNNATNNNCTGACCANCTGAAGTATGTCANGCTGCCT + HHHHHHHIHHFFFFF#>>@>GGGFG########################################################################### @ERR030856.2 HWI-BRUNOP16X_0001:1:1:4476:1072#0/1 GATAGATTATCAGAANGACAGTTACTTNNNNNNNNNNNNNGGGCACTTNNNNNNNNNNNNNNNNNNATNNNTCATAAGNNCTGTTGCCAAATNAGTGATA + HHHHHHHHHHDDDDD#@@AAGGGGG########################################################################### Legend: • Read identifier • Optional information (here flow cell lane:tile number:x:y:z) • First member of pair (here) or single-end • Nucleotide sequence of the read • Separator (+ or any string of character) • Phred score (here Phred 33) Figure A.2. FASTQ format a.6 phred score Table A.2. Phred quality score to accuracy significance Phred quality score (𝑄) Probability of incorrect base call Base call accuracy 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1,000 99.9% 40 1 in 10,000 99.99% The Phred quality score can be encoded in several standards as shown in Figure A.3. 186 A.6 phred score SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#\$\%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 0........................26...31.......40 -5....0........9.............................40 0........9.............................40 3.....9.............................40 0........................26...31........41 S - Sanger Phred+33, raw reads scores between 0 and 40 X - Solexa Phred+64, raw reads scores between -5 and 40 I - Illumina 1.3+ Phred+64, raw reads scores between 0 and 40 J - Illumina 1.5+ Phred+64, raw reads scores between 3 and 40 with 0=unused, 1=unused, 2=Read segment Quality Control Indicator L - Illumina 1.8+ Phred+33, raw reads scores between 0 and 41 Figure A.3. The available Phred score quality score encoding formats Figure A.4. Overlap resolution effects for each HTSeq-count mode. Each mode resolves a number of overlap situations differently. The mode used in this thesis is the intersection non-empty mode. This specific mode resolves more situations than the two others. Hence, the loss of ambiguous reads is reduced in this mode. [Adaptated from HTseq documentation: http://www- huber.embl.de/HTSeq/doc/count.html] 187 supplementary material for chapter 1 Table A.3. FPKM are unsuitable for differential expression analysis Sample 1 Sample 2 raw counts normalised counts raw counts normalised counts 𝐺𝑒𝑛𝑒1 100 0.010 80 0.008 𝐺𝑒𝑛𝑒2 100 0.010 80 0.008 … … … … … 𝐺𝑒𝑛𝑒𝑖 100 0.010 80 0.008 𝐺𝑒𝑛𝑒𝑖+1 0 0 2000 0.2 Total number of fragments (𝐹 ) 10,000 1 10,000 1 a.7 mass analysers See Haag (2016) for more details on other types of analysers. Quadrupole analyser It is one of the most popular analysers as they are cheap compared to the others. They are also compact, durable and reliable. The quadrupole analyser can filter the ions based on their difference of 𝑚/𝑧. They are adequately named quadrupole as they comprise four cylindrical or hyperbolic rods in parallel to each other. Opposite rods are connected together electrically and radio frequency (RF) potential is applied. A direct current (DC) potential is superimposed on the RF one. These combinations of RF and DC potentials constrain the ions to oscillate between the rods as they pass through them. Hence, by tuning the RF and DC, it is easy to select for which range of 𝑚/𝑧 ions will have a stable trajectory and thus the only one detected. Indeed, the ions with unstable trajectory will collide with the rods and be ‘filtered’ out. If used in ‘RF-only’ mode (DC reduced to a minimum), the quadrupole may have other applications. For example, it can guide specific 𝑚/𝑧 ions to other areas (while the bulk of ions will remain trapped). It may also be used as collision cells for CID: by introducing an inert gas and tuning with the RF-energy, the amount of fragmentation undergone by the targeted ions can be precisely controlled. [Haag, 2016] The quadrupole analyser is also qualified as the mass filter. Linear trap quadrupole (LTQ) LTQ is a particular kind of linear ion trap (LIT) which is in principle a sort of a quadrupole mass analyser [Z. Zhang et al., 2014]. A LTQ uses a set of quadrupole rods and a two-dimensional RF field confines the ions radially. In addition, a static electrical 188 A.7 mass analysers potential is applied to end electrodes which forbid the ions to escape axially. However, the quadrupole is commonly segmented into three parts which ensure a perfect homogeneity of the electric field of the trap area and thus avoiding ion loss when the trapping is done. While they may be used as an ion trap, they may be also used as a simple mass filter. RF voltage is tuned to produce multi-frequency resonance ejection waveforms are applied as to eliminate all the undesirable ions in the trap before the fragmentation and mass analysis of the remaining ones. Frequently, these LTQs are used as a front-end to other mass analysers as they have high injection efficiencies and high ion storage capacities. They may be equipped then with two biased radial ejection slits and then be used with two detectors hence the signal-to-noise ratio may be doubled. Compared with other traps, linear ion traps provide an enhanced dynamic range with a reduced low mass cut-off as the ion cloud is spatially distributed on a linear axis and not a 3D centre which improves the sensitivity. And then, for example, the ions may then be accumulated before being released into another mass analyser [Madalinski et al., 2008]. Orbitrap™ It is a very recent analyser and it relies on FT. Recently, there is increasing use of FTMSs for proteomic studies. Indeed, these FTMSs are more precise than previous analysers and allow the detection of a greater range of ions in very short lapses of time [Scigelova et al., 2011]. In this kind of analyser, ions are trapped and both orbit around and oscillate in an electrostatic field between an inner and outer part of a central electrode shaped as a spindle. The ions can only move following the spindle long axis [Makarov, 2000]. While moving around the spindle the ions create a current. The outer part of the spindle records images of this current. Fourier transformation of these images allows obtaining very highly accurate and sensitivemass spectra for a greater dynamic range thanmost of the other analysers. [Q. Hu et al., 2005] LTQ-Orbitrap™ It is a hybrid (tandem) mass spectrometer that uses ESI for the ionisation step and has an LTQ as a first analyser (MS1) and an Orbitrap™ as a second one (MS2). This MS/MS enables multiple levels of fragmentation for the elucidation of a wide range of peptides and can be coupled with an ESI which is a continuous source of ionisation. This instrument allows analysing proteomic samples optimally both in terms of starting material, time [Scigelova et al., 2011] and provides ‘ultrahigh’ mass resolution, high mass accuracy and enhanced dynamic range with respect to mass accuracy [Madalinski et al., 2008]. 189 supplementary material for chapter 1 a.8 isotopes of common elements and their natural frequency Table A.4 lists the mass [Audi et al., 1993; Audi et al., 1995] and the percent natural abundance [Rosman et al., 1998] for stable nuclides (i.e. atom distinctly characterised by its number of protons (Z) and number of neutrons (N)) that may be found in DNAs, RNAs and proteins. Table A.4. Most common constitutive elements and their stable isotopes found in DNAs, RNAs and proteins. Asterisks (*) mark abundances that are not available. Adapted from [Audi et al., 1993; Audi et al., 1995; Rosman et al., 1998] z (Atomic number) Name Isotope Mass atomic (u) Natural frequency (%) 1 Hydrogen 1H 1.007825 99.9885 Deuterium 2H 2.014102 0.0115 Tritium 3H 3.016049 * 6 Carbon 12C 12.000000 98.93 13C 13.003355 1.07 14C 14.003242 * 7 Nitrogen 14N 14.003074 99.632 15N 15.000109 0.368 8 Oxygen 160 15.994915 99.757 170 16.999132 0.038 180 17.999160 0.205 15 Phosphorus 31P 30.973762 100 16 Sulphur 32S 31.972071 94.93 33S 32.971458 0.76 34S 33.967867 4.29 35S 35.967081 0.02 53 Iodine 127I 126.904468 100 a.9 hypothesis testing a.9.1 ℋ0 In statistical testing, the null hypothesis ℋ0 is an answer to the intrinsic nature of statistical calculation: the smaller a given interval is, the lower the probability of a simple random draw in that interval. The null hypothesis can be of different natures. It is generally formulated as an absence of difference between two objects to be compared, or as an absence of relationship between two variables of a same population; its purpose is 190 A.9 hypothesis testing to be rejected. It is always opposed to another alternative hypothesis (ℋ1), which is accepted whenℋ0 is rejected. To test an hypothesis, one needs to construct a statistical model that can represent an ideal form of the data if it were to be generated by random processes alone. This model is also referred as the distribution under the null hypothesis. Then, the likelihood of the collected (observed) data is computed. Finally, it is compared to the (random) probability determined by the model to either accept ℋ0 or reject it if the observed data is very unlikely under the null hypothesis. Usually a test statistic (i.e. quantity derived from the sample used for the hypothesis testing) that measures the apparent departure from the null hypothesis is compared to a value defined such as the probability of a ‘more extreme value’ is even smaller under the null hypothesis. Prior to the analysis, an arbitrary level of significance (or 𝛼) is set either to 0.1, 0.05, 0.01, 0.005 or 0.001, i.e. 10%, 5%, 1%, 0.5% or 0.1% risk to reject ℋ0 by mistake. Depending on whether the observed data is tested, case (1): in both direction, i.e. the data is either greater or equal to the critical value (𝑥) or lesser or equal to the additive inverse of the critical value (−𝑥), or, case (2) in one direction only, i.e., (for example) the data is (only) greater or equal to the critical value (or the data is (only) lesser or equal to the critical value), the statistical test is two-tailed (case 1) or one-tailed. a.9.2 p-value In statistical hypothesis testing, the p-value quantifies the statistical significance of results, under the null hypothesisℋ0 (see Appendix A.9.1). It allows rejecting (or not)ℋ0. The p-value is the probability for a given statistical model of obtaining an equal value or an even more extreme value than what has been observed when ℋ0 is true. Depending on the situation, the more extreme value can mean: One-tail event Left tail event Pr(𝑋 ≤ 𝑥 ∣ ℋ0) Right tail event Pr(𝑋 ≥ 𝑥 ∣ ℋ0) Two-tail event 2min(Pr(𝑋 ≤ 𝑥 ∣ ℋ0), Pr(𝑋 ≥ 𝑥 ∣ ℋ0)) The smaller is a p-value, the higher the significance, i.e. the stronger the evidence thatℋ0 has to be rejected. ℋ0 is rejected if the adequate probability is less than or equal to an arbitrary pre-defined (i.e. prior to the analysis) threshold value 𝛼. Underℋ0, the assumption is that the p-values are uniformly distributed. 191 supplementary material for chapter 1 a.9.3 q-value A q-value is an adjusted p-value (which the calculation may or may not be based on the p-value). In the context of multi-testing, i.e. when multiple simultaneous statistical tests occur, the likelihood of rejectingℋ0 due to a random sampling increases. To avoid accepting the alternative hypothesis by mistake, the whole p-values collection is tested and adjusted for false discovery rate. A q-value of 5% means that 5% of all the significant results are actually false positives. a.10 target decoy search database For best effectiveness, decoy and target peptide sequence databases are searched with the same parameters. Furthermore, to ensure that a wrong hit in the target database and a hit in the decoy one are equally likely, the decoy sequences have to be as similar as possible to the target ones (concerning aa frequencies and composition, length, mass, charges, assigned scores). There are different ways to design the decoy sequences. For example, by reversing the peptide or protein sequences, either with complete or pseudo- reversion (where the last aa is kept in place). Alternatively, by using stochastic methods on the target database such as the randomisation of the sequences or through the creation of new ones based on aa frequencies, their length distribution, and their number in the original database; Markovmodels [Gagniuc, 2017] are often used tomimic the closest target sequences. Many studies explore and compare the different decoy creation methods [Elias et al., 2007; G. Wang et al., 2009; Elias et al., 2010; Wright and Choudhary, 2016]. As for the target database, the decoy sequences are digested in silico before the search. While the search can be done independently on the target and the decoy databases [Blanco et al., 2009], Elias and Gygi (2007) report that searching their resulting concatenation gives better results. Besides, the TDA approach can guide the selection of sensitive PSM attributes (e.g., elution time, charge, peptide length, score) as filtering criteria to discern correct identifications [Elias et al., 2010]. a.11 psm validation with q-value and pep A possible definition of PSM’s q-value is the minimal FDR threshold for which the PSM is accepted as correct. As the q-values are derived from the FDR, which is specific to a PSM collection, they are also (solely) specific to this collection. For example, Percolator estimates the q-value by using the score distribution from the TDA. On the other hand, posterior error probability (PEP) (also known as local FDR [Efron et al., 2001]) is the probability of a PSM being incorrect; a PSM’s PEP is independent of the PSM collection. 192 A.12 protein inference: computational challenging step A classical approach to estimating PEPs uses training sets of target and decoy PSMs to learn the parameters of a probability model (indispensable to compute the PEPs). Thus, for each given score (of any collection), a specific PEP is associated. Choi et al. (2008) showed that for a given collection, the sum of the PEPs is equal to the expected number of incorrect PSMs, which allows calculating the (global) FDR. a.12 protein inference: computational challenging step To explain the computational challenges of the peptide assembly, T. Huang et al. (2012) propose to start with two assumptions: (1) all (𝑚) peptides are true positive, and (2) peptides have an equal probability of detectability. A first assumption corollary is the presence of many homologous proteins in the sample. Besides, one can derive from the first assumption that there are a minimal and a maximal value for the number 𝑛 of proteins that can be identified from the set of 𝑚 peptides. Returning the exhaustive list of proteins (i.e. 𝑛Max) that comprise all𝑚 peptides (e.g. Tabb, McDonald, et al. (2002)) is one possible solution, but it is much more difficult to calculate the minimal list (i.e. 𝑛min) that does the same. As all the peptides 𝑚 are supposed to be true, they have to be included in any of the final minimal list proteins. Therefore inferring this protein list can be formulated as a set covering problem [Cormen et al., 2009; Hochbaum, 1997]. The set covering problem is known to be NP-complete [van Leeuwen et al., 1990], and for which it is in practice impossible to calculate an optimal solution. Usually, algorithms approximate this solution through a parsimonious approach. Many inference algorithms seek a compromise between the minimal and exhaustive lists of possible proteins. While the minimal list probably excludes many true positives (but can still include false positives), the exhaustive list is indisputably comprising a large number of false positives as the parameters are set to maximise the number of peptide/proteins matches; statistically, a subset of these matches are random. In the sequence database, there are many proteins with the same peptidic sequence, e.g. a protein 𝐴, expressed in one set of cells, and another, 𝐵, expressed only in another non-overlapping set of cells. While a sample is only expressing 𝐴, an exhaustive solution will also report 𝐵 as one of the proteins expressed in the sample. Statistically, the greater the size of the reference database and the expression complexity of the sample, the greater is the number of false positives in the results because of degenerate peptides. On the other hand, if a peptide is associated to one unique protein in a database when this peptide is identified with high confidence in a sample, it is extremely probable that the protein is truly present. However, these one-hit wonders are also trickier because the protein presence is reduced to the probability of the peptide to be a true positive instead of an artefact. Even a greater number of MS/MS spectra supporting a peptide existence is only the reflection of a remarkably low probability of the protein being absent in the 193 supplementary material for chapter 1 sample. In this hypothetical setting where all peptides are true positives and equally likely to be detected, the inference is already challenging; it becomes even more complex with real data. The minimal list can be shorter than the theoretical one as the identified peptides can be false positives. On the other hand, proteomic platforms and pipelines tend to repeatedly and consistently detect and quantify particular sets of peptides (proteotypic peptides) [Mallick et al., 2007; Bergeron et al., 2007; Fusaro et al., 2009]. Thus, many peptides are difficult to capture with MS. This has two implications. First, many algorithms associate additional information to the bipartite peptide/protein graph (shown in Figure 1.14) to improve the identification coverage. The algorithms can exploit different data sources, e.g. raw and corrected PSM scores, single stage MS or raw MS/MS spectra, peptide expression profiles, mRNA expression data, protein-protein interaction network or gene model. T. Huang et al. (2012) propose that additional information can further extend the exhaustive list of possible proteins. Secondly, proteotypic peptides have led to the development of peptide detectability [Tang et al., 2006; Alves et al., 2007], which can help to deal with degenerate peptides by attributing probabilities to each peptide/protein assignment. Peptide detectability is considered as a intrinsic peptide property. It is only determined by the peptide primary sequence and its location within the protein. Many different algorithms tackle this peptide assembly key step. T. Huang et al. (2012) organise them in two categorisation frameworks: one based on the needed search engine for the list of PSMs, the second one (presented in Figure A.5) based on the underlying algorithmic technique. T. Huang et al. (2012) describe parametric approaches as those that request prior knowledge to estimate the peptides’ distribution. When there is no need for prior knowledge, they describe the approach as non-parametric, even when the tool assesses the peptide distribution by extracting information from the MS or MS/MS spectra. a.13 bayesian inference Bayesian inference is developed upon Bayes’ theorem, which allows the computation of conditional probabilities, i.e. computation of the likelihood of an event happening given prior known conditions, including the likelihood of another event being true. Bayesian statistics focuses on the credibility of events happening rather than their occurrence frequencies (as in frequentist statistics). See B. Li et al. (2019) for general mathematical definitions of Bayesian inference models and Kurt (2019) for a layman’s guide to Bayesian statistics. 194 A.13 bayesian inference Peptide assembly Bipartite graph model only Parsimonious model Optimistic model Statistical model Non-parametric model Parametric model Supplementary Information Model (+ bipartite graph) Data from other sources mRNA expression data Gene model Protein interaction network MS-generated data Single-stage MS data Raw MS/MS data Peptide expression profile Figure A.5. T. Huang et al. (2012) peptide assembly models classification. 195 B SUPPLEMENTARY MATER IAL FORCHAPTER 3 b.1 correlation Correlation can be considered as a scaled version of the covariance (Equation (Covariance)) of two random variables. Correlation coefficients are adimensional and varie in a restricted range [−1, 1]. While 1 and −1 mean a perfect correlation (either positive or negative), a value equal to 0 expresses that the two variables are not sharing any linear relationship. A value within (−1, 0) or (0, 1) needs more interpretation. In gene expression studies, if the coefficient is within [−0.5, 0.5], the variables are generally considered as independent. Spearman and Pearson are only two methods to compute correlations among other ones. b.1.1 Spearman correlation The Spearman correlation coefficient (usually noted as 𝜌) is more robust than the Pearson correlation. However, it only assesses the monotonic dependence between two variables. The Spearman correlation coefficient is defined as the Pearson correlation of the ranked values of two variables. Spearman correlations are widely used within the literature for biological studies [Brawand et al., 2011; Fagerberg et al., 2014; Danielsson et al., 2015; N. Y.-L. Yu et al., 2015]. b.1.2 Pearson correlation The Pearson correlation coefficient (usually noted as 𝑟) assesses the linear dependence between two variables. It is invariant to systematic addition of a constant or to simple scaling factors between the two variables. The correlation coefficients computed for this thesis rely on the Sample correlation coefficient (as opposed to the Population formula — see equation (Population correlation coefficient)). 197 supplementary material for chapter 3 The (sample) Pearson correlation coefficient can be defined as the following equation (indeed, many rearrangements are possible): 𝑟𝑥,𝑦 = ∑𝑛𝑖=1(𝑥𝑖 − ̄𝑥)(𝑦𝑖 − ̄𝑦) √∑𝑛𝑖=1 (𝑥𝑖 − ̄𝑥) 2√∑𝑛𝑖=1 (𝑦𝑖 − ̄𝑦) 2 = 𝑐𝑜𝑟𝑟(𝑥, 𝑦) ((Sample) Pearson correlation coefficient) where: • 𝑥, 𝑦 are observed values of two random variables 𝑋 and 𝑌 • 𝑛 is the sample size of 𝑥 and 𝑦 • 𝑖 is the index of the current observed value 𝑥 or 𝑦 • ̄𝑥, ̄𝑦 are respectively the sample (𝑥 and 𝑦) means (see Equation (Mean)) • 𝑐𝑜𝑟𝑟(𝑋, 𝑌 ) is another notation of 𝑟𝑥,𝑦 ̄𝑥 = 1𝑛 𝑛 ∑ 𝑖=1 𝑥𝑖 (Mean) where: • 𝑥 is the possible observed values of 𝑋 • 𝑛 is the sample size of 𝑥 • 𝑖 is the index of the current observed value of 𝑥 b.1.3 Different advantages of Pearson and Spearman correlations Pearson correlations are easier to understand, interpret and then to use as predictor while Spearman correlations are more robust and thus better fitted for interstudy comparisons. Computationally, correlations (Spearman in particular) can be challenging to compute, especially for large matrices such as gene expression matrices [S. Wang et al., 2014]. de Siqueira Santos et al. (2014) review Spearman and Pearson correlations along with six other statistical methods. They also summarise many use cases of each of these methods in the general context of gene expression study. Table B.1. Correlation coefficients between RNA-Seq replicates Numeric summary of Figure 3.5 — The correlation means are high across the studies replicates. However, the range of the correlations (in brackets) are quite extreme in a few case. Tissue Replicates type Pearson Correlation Spearman Correlation Brawand Biological 0.90 [0.45;1] 0.93 [0.80;0.99] GTEx Biological 0.75 [0.01;1] 0.93 [0.06;0.99] Uhlén Biological 0.81 [0.15;1] 0.95 [0.70;0.99] Technical 0.99 [0.68;1] 0.99 [0.92;1] 198 B.2 other common mathematical definitions b.2 other common mathematical definitions The population correlation coefficient (𝜌) of two random variable 𝑋 and 𝑌 is defined as: 𝜌𝑋,𝑌 = 𝑐𝑜𝑣(𝑋, 𝑌 ) 𝜎𝑋𝜎𝑌 = 𝑐𝑜𝑟𝑟(𝑋, 𝑌 ) (Population correlation coefficient) where: • 𝑋,𝑌 are two random variables • 𝑐𝑜𝑣(𝑋, 𝑌 ) is the covariance of 𝑋 and 𝑌 (see Equation (Covariance)) • 𝜎𝑋, 𝜎𝑌 are the standard deviations of𝑋 and 𝑌 (see Equation (Standard deviation)) • 𝑐𝑜𝑟𝑟(𝑋, 𝑌 ) is another notation of 𝜌𝑋,𝑌 The covariance is the measure of the joint variability of two random variables, e.g. 𝑋 and 𝑌 . Specifically, it allows quantifying the degree to which two variables are linearly associated. 𝑐𝑜𝑣(𝑋, 𝑌 ) = ∑(𝑥𝑖 − ̄𝑥)(𝑦𝑖 − ̄𝑦)𝑁 − 1 (Covariance) where: • 𝑋,𝑌 are random variables • 𝑥, 𝑦 are respectively one observation of 𝑋 and 𝑌 • ̄𝑥, ̄𝑦 are the means of all observed values of 𝑋 and 𝑌 • 𝑁 is the number of observations of 𝑋 and 𝑌 The standard deviation (𝑠𝑑 or 𝜎) measures the amount of dispersion of the possible values of a random variable around its expected value (𝐸) (theoretical average). 𝑠𝑑(𝑋) = √𝐸[𝑋2] − (𝐸[𝑋])2 = √𝑉 𝑎𝑟(𝑋) (Standard deviation) where: • 𝑋 is a random variable • 𝐸[𝑋],𝐸[𝑋2] are respectively the expected values (or theoretical averages) of𝑋 and𝑋2 (see equation (Expectation)) 𝐸[𝑋] = 𝑥1𝑝1 +𝑥2𝑝2 + ⋅+ 𝑥𝑘𝑝𝑘 = weighted average(𝑋) = 𝜇𝑋 (Expectation) 199 supplementary material for chapter 3 where: • 𝐸 is the expectation • 𝑋 is a random variable • 𝑥1, 𝑥2, …, 𝑥𝑘 are possible value of 𝑋 • 𝑝1, 𝑝2, …, 𝑝𝑘 are the probabilities of the different values of 𝑋 and their sum is equal to 1. • 𝜇𝑋 is the theoretical average of X 𝑉 𝑎𝑟(𝑋) = ∑(𝑥𝑖 − ̄𝑥) 2 𝑁 − 1 = 𝑠𝑑2(𝑋) (Variance) where: • 𝑋 is a random variable • 𝑥 is one observation of 𝑋 • ̄𝑥 is the mean of all observed values of 𝑋 • 𝑁 is the number of observations of 𝑋 • 𝑠𝑑2 is another notation of the variance as the standard deviation is equal to the square root of the variance. b.3 data visualisation Figure B.1. Anscombe quartet — why data should always visually checked. All the datasets, while presenting different distributions, have equal or very similar descriptive statistic indicators; the means and variances for both 𝑥 and 𝑦 variables and the Pearson correlation between 𝑥 and 𝑦, and their linear regressions are very similar. 200 B.3 data visualisation 0.0 0.5 1.0 1.5 0 5000 10000 15000 FPKM de ns ity Tissue Adipose Colon Heart Hypothalamus Kidney Liver Lung Ovary Skeletal.muscle Spleen Testis (a) Castle 0.00 0.25 0.50 0.75 1.00 0 10000 20000 30000 FPKM de ns ity Tissue Frontal.cortex Prefrontal.cortex Temporal.lobe Cerebellum Heart Kidney Liver Testis (b) Brawand 0.00 0.25 0.50 0.75 0 5000 10000 15000 FPKM de ns ity Tissue Tyroid Testis Ovary Leukocyte Skeletal muscle Prostate Lymph node Lung Adipose Adrenal Brain Breast Colon Kidney Heart Liver (c) Illumina Body Map 0.0 0.5 1.0 1.5 0 25000 50000 75000 100000 FPKM de ns ity Tissue Adipose Adrenal Appendix Urinarybladder Bone.marrow Cerebral.cortex Colon Duodenum Endometrium Esophagus Fallopian.tube Gallbladder Heart Kidney Liver Lung Lymph.node Ovary Pancreas Placenta Prostate Rectum Salivary.gland Skeletal.muscle Skin Small.intestine Smooth.muscle Spleen Stomach Testis Tyroid Tonsil (d) Uhlen 0.0 0.3 0.6 0.9 0 50000 100000 150000 FPKM de ns ity Tissue Liver Kidney C.Trans.Fibroblasts Adrenal Coronary Esophagus Testis Stomach Ovary Uterus Aorta Spleen Urinarybladder Colon Pancreas Prostate A.Tibial C.EBV.Trans.Lymph CML Adipose Breast Pituitary Vagina Fallopian.tube Skin Heart Skeletal.muscle Nucleus.accumbens Lung Frontal.cortex Whole.blood Ant.cingulate.cortex Cervix Nerve.tibial Putamen Cerebellar.Hemi Tyroid Small.intestine Salivary.gland Hyppocampus Amygdala Cerebellum Hypothalamus Caudate Spinal.cord Cortex Substancia.nigra (e) Gtex 0 50000 100000 150000 200000 0.0 0.1 0.2 0.3 0.4 PSM de ns ity Tissue CSF Adipose Bone Breast Heart Lung Ovary Pancreas Platelets.lysate Platelets.secreted (f) Cutler 0e+00 1e+05 2e+05 0.00 0.05 0.10 0.15 0.20 PSM de ns ity Tissue Adrenal Heart Cerebral.cortex Cervix Colon Esophagus Gallbladder Kidney Liver Lung Lymph.node Ovary Pancreas Placenta Prostate Rectum Salivary Skin Spleen Stomach Testis Tyroid Tonsil Uterus (g) Kuster 0e+00 2e+05 4e+05 6e+05 8e+05 0.00 0.02 0.04 0.06 PSM de ns ity Tissue Adrenal Colon Esophagus Gallbladder Heart Kidney Liver Lung Ovary Pancreas Placenta Prostate Rectum Testis Urinarybladder (h) Pandey Figure B.2. Profile of expression across the transcriptome (protein coding genes only) and proteome datasets 201 C SUPPLEMENTARY MATER IAL FORCHAPTER 4 0 5000 10000 15000 Te sti s Co lon Kid ne y Ov ary Lu ng Sp lee n Hy po tha lam us Ad ipo se He art Sk ele tal .m usc le Liv er Tissue N um be r o f e xp re ss ed g en es (a) Castle 0 5000 10000 15000 Te sti s Ce reb ell um Fro nta l.co rte x Kid ne y Pre fro nta l.co rte x He art Liv er Te mp ora l.lo be Tissue N um be r o f e xp re ss ed g en es (b) Brawand 0 5000 10000 15000 Te sti s Ad ren al Ov ary Br eas t Th yro id Br ain Pro sta te Kid ne y Lu ng Ly mp h.n od e Ad ipo se Co lon He art Liv er Le uk ocy te Sk ele tal .m usc le Tissue N um be r o f e xp re ss ed g en es (c) IBM 0 5000 10000 15000 Te sti s Re ctu m Ur ina ryb lad der Du od en um Sm all .in tes tin e Fal lop ian .tu be Pro sta te Sp lee n Pla cen ta To nsi l Co lon Ga llb lad der Sto ma ch Sm oo th. mu scl e Ap pe nd ix Th yro id En do me tri umSk in Ly mp h.n od e Kid ne y Eso ph ag us Ce reb ral .co rte x Ad ipo se Ov ary He art Sal iva ry. gla nd Pa nc rea s Lu ng Ad ren al Bo ne .m arr owLiv er Sk ele tal .m usc le Tissue N um be r o f e xp re ss ed g en es (d) Uhlen 0 5000 10000 15000 Eso ph ag us Br eas t Th yro id Co lon Te sti s Sto ma chSk in Ad ipo se Wh ole .bl oo d Lu ng C.E BV .Tr an s.L ym ph Co ron ary Pro sta te He art Ao rta Ne rve .tib ial Sk ele tal .m usc le A.T ibi al Sp lee n Ad ren al C.T ran s.F ibr ob las ts Pa nc rea s Co rte x Va gin a Ov ary Ut eru s Liv er Pit uit ary Ca ud ate Sm all .in tes tin e Nu cle us. acc um ben s Hy po tha lam us Ce rvi x Hy pp oca mp us Am yg da la An t.c ing ula te. cor tex Ur ina ryb lad der Su bst an cia .ni gra Ce reb ell umCM L Fro nta l.co rte x Ce reb ell ar. He mi Kid ne y Pu tam en Fal lop ian .tu be Sp ina l.co rd Sal iva ry. gla nd Tissue N um be r o f e xp re ss ed p ro te in co di ng g en es Gene is Unique NotUnique (e) Gtex Figure C.1. Number of protein-coding genes expressed per tissue 203 supplementary material for chapter 4 0 2000 4000 6000 0 1 2 3 4 5 6 7 8 9 10 11 Number of tissues Ge ne s c ou nt (a) Castle 0 2000 4000 6000 0 1 2 3 4 5 6 7 8 Number of tissues Ge ne s c ou nt (b) Brawand 0 2000 4000 6000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Number of tissues Ge ne s c ou nt (c) IBM 0 2000 4000 6000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 Number of tissues Ge ne s c ou nt (d) Uhlen 0 2000 4000 6000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 Number of tissues Ge ne s c ou nt (e) Gtex Figure C.2. Breadth of expression of the protein-coding genes expressed above 1 FPKM 204 supplementary material for chapter 4 0 0 3 0 43796 600 0 106 0 0 0 0 109 108 24250 20 52 0 0 0 97 760 59 57 68 0 18164 Castle Brawand IBM Uhlen Gtex (a) Four common tissues across the five studies (𝒲1) 19737 526 GtexUhlen (b) Twenty-three tissues across Uhlén et al. and GTEx studies. Figure C.3. Unique and shared protein coding genes expressed at any level (> 0 FPKM) in𝒲1 and𝒲2. 205 supplementary material for chapter 4 Table C.1. Expressed protein-coding genes. In Ensembl 76, there are 22,469 genes that have a biotype annotated as ‘protein- coding’. Dataset Number ofTissues Number of mRNAs expressed across all tissue Number of mRNAs expressed at least once 4 common tissues 23 common tissues ›0 FPKM ≥ 1FPKM ›0 FPKM ≥ 1FPKM ›0 FPKM ≥ 1FPKM Castle 11 19,066 15,798 18,477 13,443 — — Brawand 8 19,505 16,410 19,324 15,327 — — IBM 16 19,776 17,171 19,334 15,058 — — Uhlén 32 19,807 18,060 19,379 15,739 19,737 17,832 GTEx 47 20,272 18,386 20,242 16,100 20,263 18,013 Two-sample test Welch’s test [Welch, 1947], also known as the unequal variances t-test, is an adaptation of the Student t-test [Student (Gosset, 1908] and it is better fitted for groups that have different variance and sample sizes. Except in the case of (true) paired data (sampled on the same source), it gives better or at least equal results than the traditional Student t- test [Fagerland, 2012; Derrick et al., 2016; Delacre et al., 2017]. Student’s two-sample location test (i.e. t-test) is a test where the null hypothesis is defines as the means of the two populations which have been sampled are equal. Student’s t-test relies on the assumption that the variance of the population is also equal. 206 supplementary material for chapter 4 Liv er (Uh len ) Liv er (Br aw an d) Liv er (Ca stl e) Liv er (IB M) Liv er (G tex ) Te sti s (C ast le) He art (C ast le) Kid ne y ( Ca stl e) Te sti s (B raw an d) Te sti s (G tex ) Te sti s (U hle n) Te sti s (I BM ) Kid ne y ( Bra wa nd ) Kid ne y ( Gt ex) Kid ne y ( Uh len ) He art (U hle n) He art (B raw an d) Kid ne y ( IBM ) He art (IB M) He art (G tex ) Liver (Uhlen) Liver (Brawand) Liver (Castle) Liver (IBM) Liver (Gtex) Testis (Castle) Heart (Castle) Kidney (Castle) Testis (Brawand) Testis (Gtex) Testis (Uhlen) Testis (IBM) Kidney (Brawand) Kidney (Gtex) Kidney (Uhlen) Heart (Uhlen) Heart (Brawand) Kidney (IBM) Heart (IBM) Heart (Gtex) 0.4 0.6 0.8 1 Value 0 20 40 Color Key and Histogram Co un t Figure C.4. Comparison of profiles across the 5 studies for their 4 common tissues — including the 37 mitochondrial genes. 207 supplementary material for chapter 4 Figure C.5. Heatmap including all the replicates of the four common tissues across the five studies. All protein-coding genes (except themitochondrial ones) at least expressed at 1 FPKM are included. All the samples, except from the Castle study are clustering by their tissue of origin. Remarkably, while the replicates may cluster by their study in each of the tissue groups, many pairs with higher correlations are involving replicates from different studies. Castle study is not a polyA-selected study, hence, its samples clustering may be due entirely to the effect size bias of the FPKM normalisation method. 208 supplementary material for chapter 4 Figure C.6. Heatmap including all the replicates of the twenty-three common tissues between Uhlén et al. and GTEx studies. All protein-coding genes (except the mitochondrial ones) at least expressed at 1 FPKM are included. Most samples are clustering by their tissue of origin while we can observe than many single replicates may cluster less expectedly. Many small mixtures are observed; often they involve closely related tissues, i.e. Heart and Skeletal muscle or Ovary and Fallopian tube. 209 supplementary material for chapter 4 (a) Pearson correlation (b) Spearman correlation Figure C.7. Distribution of the correlation of matched and unmatched tissues pairs for the two working sets. The displayed p-valuesa have been computed with a Welch two-sample t-test. a Thresholds above which the 𝐻0 hypothesis is safe to be rejected. 𝐻0: The correlations of same tissue pairs and different tissues pairs are similar. 210 C.1 highest expressed genes c.1 highest expressed genes Note: the cut-off used in Figure C.9 and Figure C.8 is a range (step 10) of possible values (integers) of gene expression. Here below, the few exceptions grouped by tissue for𝒲1: Heart Uhlén-GTEx pair Kidney Uhlén-GTEx, Castle-Uhlén and Castle-GTEx pairs, Liver Brawand-IBM, IBM-Uhlén and IBM-Uhlén pairs; Testis IBM-Uhlén, Brawand-GTEx, Brawand-Uhlén and Uhlén-GTEx pairs Figure C.8. Pearson correlation coefficient trend based on the expression levels of the genes considered for each of the twenty-three common tissues betweenUhlén andGTEx. In almost every case the complete set of common expressed protein-coding genes of each tissue gives the highest correlations. 211 supplementary material for chapter 4 (a)Heart (b)Kidney (c)Liver (d)Testis Com parison Braw and & Gtex Braw and & IBM Braw and & Uhlen Castle & Braw and Castle & Gtex Castle & IBM Castle & Uhlen IBM & Gtex IBM & Uhlen Uhlen & Gtex FigureC.9.Pearsoncorrelationcoefficienttrendsbasedontheexpressionlevelsofthegenesconsideredforeachof𝒲 1 ’stissues. 212 C.1 highest expressed genes 𝒲2’s results should be interpreted more carefully as there are only two studies involved and there are no actual means to distinguish between an artefact or a true biological reason that may drive the higher correlations. Both for Figures C.8 and C.9, it seems that a couple of TREPs are perfectly anticorrelated for subsets of highest expressed genes, These are most likely mathematical artefacts. The correlation calculations are involve very few genes and as such the correlations are more sensitive to any change. As very few genes are involved, the slightest changes in their respective order ofmagnitude may imply reversed trends. Table C.2. Example of gene subsets for a two studies (A and B) for a tissue 𝐺𝑒𝑛𝑒𝑎 𝐺𝑒𝑛𝑒𝑏 𝐺𝑒𝑛𝑒𝑐 𝐺𝑒𝑛𝑒𝑑 𝐺𝑒𝑛𝑒𝑒 Study A, TREP 𝑇1 (𝑇𝑆𝑡𝑢𝑑𝑦𝐴1 ) 1000 2000 3000 4000 5000 Study B, TREP 𝑇1 (𝑇𝑆𝑡𝑢𝑑𝑦𝐵1 ) 500 2800 6000 5000 4000 For example, if we consider 𝑇𝑆𝑡𝑢𝑑𝑦𝐴1 and 𝑇𝑆𝑡𝑢𝑑𝑦𝐵1 for a set of genes 𝑎,… , 𝑒 (see Table C.2) and a cut-off at 3,000 FPKMs, then the correlation will only involve 𝐺𝑒𝑛𝑒𝑐, 𝐺𝑒𝑛𝑒𝑑 and 𝐺𝑒𝑛𝑒𝑒. Thus, while 𝑐𝑜𝑟𝑟𝑇𝑆𝑡𝑢𝑑𝑦𝐴1 ,𝑇𝑆𝑡𝑢𝑑𝑦𝐵1 (𝑐, 𝑑, 𝑒) = −1, 𝑐𝑜𝑟𝑟𝑇𝑆𝑡𝑢𝑑𝑦𝐴1 ,𝑇𝑆𝑡𝑢𝑑𝑦𝐵1 (𝑎, 𝑏, 𝑐, 𝑑, 𝑒) = 0.6836 2000 4000 6000 1000 2000 3000 4000 5000 Study A St ud y B Expression of TREP T1 Figure C.10. Example on how correlation may change to cut-offs c.1.1 Overlap of the top high expressed genes between the five datasets Figure C.11 shows the ratios of the number of common protein-coding genes for a given amount of highest expressed protein-coding genes to that very number across the studies and for each tissue. Among these (cumulative) proportions, a few present very high value for a minimal subset of genes (below 10 FPKM across all the tissues) which then drop 213 supplementary material for chapter 4 Figure C.11. Cumulative shared set of genes ranked by their decreasing order of expression across the five studies. Apart from a very small subset for the highest expressed protein coding genes, the overlap of the gene ranks across the 5 studies is rather small. The grey line presents the evolution of the ratios for the randomly permuted data which highlights that there is an underlying structure. dramatically to finally increase slowly to reach the expected ratio of 1 FPKM. Figure C.12 presents the same kind of ratios, however, limited to Uhlén and GTEx only. Here as well, aside from the expected perfect ratio for the complete set of protein-coding genes, only a tiny subset of the highest expressed genes produce high rates of highly expressed common genes to the number of considered ranked genes. These results are quite unsurprising as they involve only two studies. In addition to increasing the number of overlaps probabilistically, these two studies are also the two most recent ones and comprise a higher number of replicates per tissues; thus the measurements for each gene in each condition are likely more robust. Comparing the calculated ratios between the real (colour) and the randomly permuted data (grey) on Figures C.11 and C.12 plainly show that there are common (biological) structures that are nonfortuitous across studies. 214 C.2 most variable genes Figure C.12. Cumulative shared set of genes, ranked by their decreasing order of expression, between Uhlén et al. and GTEx studies. The highest expressed genes present greater ratios than when the 5 studies are considered (see Figure C.11). The fewer number of studies considered, which, additionally to be the most recent studies, are the ones to comprise greater number of biological replicates per tissues may be the sole reasons for the improved results. c.2 most variable genes c.2.1 Validation of the association of the most variable genes with the highest correlations After ordering the genes by decreasing order of their coefficient of variation within each of the datasets comprised in 𝒲1, I have calculated the size of overlap for each rank (i.e. from 1 to 12,268) between the five datasets. To help with the interpretation, I finally divide the previous number by the rank. Figure C.19 and Figure C.20 present the result for𝒲1 and𝒲2. Many of the most variable genes are commonly present in the top tier of the five studies, though they have different individual rank. There is a strong growth for about the first 1,250 genes that then settles 215 supplementary material for chapter 4 231 133 159 106 9035 3444 26 57 35 76 26 69 32 22 614530 26 39 31 32 30 79 173 92 83 52 57 2147 Castle Brawand IBM Uhlen Gtex Figure C.13. Overlap of the most variable genes across the five studies for the set of the four common tissues. In each study, I rank the protein-coding genes in decreasing order. This Venn diagram presents the shared and unique protein-coding genes in the top quarter of the most variable genes. a plateau which increases toward the final ratio (1). Using the first quarter of the most variable genes as a cut-off appears to be an acceptable threshold as it comprises the initial growth and part of the plateau. 216 C.2 most variable genes Fig ur eC .14 .M ea ne xp re ssi on of ge ne sc om pa re dt ot he ir co eff ici en to fv ar iat ion . 217 supplementary material for chapter 4 Te sti s (C ast le) Te sti s (I BM ) Te sti s (B raw an d) Te sti s (G tex ) Te sti s (U hle n) He art (C ast le) He art (B raw an d) He art (IB M) He art (U hle n) He art (G tex ) Liv er (Ca stl e) Liv er (Uh len ) Liv er (IB M) Liv er (Br aw an d) Liv er (G tex ) Kid ne y ( Ca stl e) Kid ne y ( Bra wa nd ) Kid ne y ( Gt ex) Kid ne y ( Uh len ) Kid ne y ( IBM ) Testis (Castle) Testis (IBM) Testis (Brawand) Testis (Gtex) Testis (Uhlen) Heart (Castle) Heart (Brawand) Heart (IBM) Heart (Uhlen) Heart (Gtex) Liver (Castle) Liver (Uhlen) Liver (IBM) Liver (Brawand) Liver (Gtex) Kidney (Castle) Kidney (Brawand) Kidney (Gtex) Kidney (Uhlen) Kidney (IBM) -1 -0.5 0 0.5 1 Value 0 40 80 Spearman Correlation Co un t Figure C.15. Clustering of the four common tissues across the five studies for the most common variable genes. The samples cluster by tissue of origin rather than by original study. Each cluster of tissue presents a different hierarchy of study: for Kidney, Uhlén sample is closer to the IBM sample, while for Testis, Uhlén sample is closer to the GTEx one. 218 C.2 most variable genes Liv er (Ca stl e) Kid ne y ( Ca stl e) Te sti s (C ast le) He art (C ast le) Te sti s (I BM ) Te sti s (B raw an d) Te sti s (G tex ) Te sti s (U hle n) He art (B raw an d) He art (G tex ) He art (IB M) He art (U hle n) Liv er (Br aw an d) Liv er (G tex ) Liv er (Uh len ) Liv er (IB M) Kid ne y ( Bra wa nd ) Kid ne y ( Gt ex) Kid ne y ( Uh len ) Kid ne y ( IBM ) Liver (Castle) Kidney (Castle) Testis (Castle) Heart (Castle) Testis (IBM) Testis (Brawand) Testis (Gtex) Testis (Uhlen) Heart (Brawand) Heart (Gtex) Heart (IBM) Heart (Uhlen) Liver (Brawand) Liver (Gtex) Liver (Uhlen) Liver (IBM) Kidney (Brawand) Kidney (Gtex) Kidney (Uhlen) Kidney (IBM) 0.2 0.4 0.6 0.8 1 Value 0 20 40 Spearman Correlation Co un t Figure C.16. Clustering of the four common tissues across the five studies (excluding themost variable genes). Apart from the Castle samples, the samples cluster by tissue of origin rather than by original study. (Note that Pearson correlations give stronger clustering results towards the biological origin of the TREPs.) 219 supplementary material for chapter 4 Testis (Castle) Testis (IBM) Testis (Brawand) Testis (Uhlen) Testis (Gtex) Liver (Castle) Liver (Brawand) Liver (Uhlen) Liver (IBM) Liver (Gtex) Heart (Castle) Heart (IBM) Heart (Uhlen) Heart (Brawand) Heart (Gtex) Kidney (Gtex) Kidney (Brawand) Kidney (Castle) Kidney (IBM) Kidney (Uhlen) 0 5 10 15 Log2(FPKM+1) 0 Expression Co un t Figure C.17. Expression of the most common variable genes. Figure C.18. Ratio of Maximum of expression/Sum of expression for the most variable genes (cv≥1.5) in 𝒲1 that are expressed at least in two different tissues at 1 FPKM. The lowest ratio is above 0.79 and the highest ratio is close to 1. This range shows that the most variable genes are expressed in one tissue more specifically than the three others as the tissue where they are the highest expressed accounts for more than 79% of the sum of expression across the four tissues. 220 C.2 most variable genes One quarter <-- of the genes --> <-- a --> <---------------- b ----------------><------------------- c ------------------> 0.00 0.25 0.50 0.75 1.00 0 2500 5000 7500 10000 12500 Number of ranked genes (by decreasing order of coefficient of variation) Ra tio o f c om m on g en es ac ro ss a ll da ta se ts Data Real Randomised Figure C.19. Intersection size course of 𝒲1 genes (based on their coefficient of variation rank in each of the five studies). There are three main parts. There is an initial strong growth (a) which then settles a plateau (b). Eventually, the ratio increases slowly again until reaching the expected ratio of 1 once all the genes from 𝒲1 are included (c). The first quarter of the genes covers (a) and a part of (b). Apart from (a), the overlap of shared genes between the five datasets when ranked on their coefficient of variation is above 70%. The sigmoid curve (dashed line) is based on randomised data where permutations break the original order of the genes. (Within each dataset, all the gene expression levels are permuted within each tissue, i.e. the overall pattern of expression of each tissue is conserved. This operation is performed 10,000 times. The dashed line is a summary of all these permutations). There is a distinct dissimilarity between the real and the randomised data. 221 supplementary material for chapter 4 One quarter <-- of the genes --> <---- a ----> <--------------------------------------- b ---------------------------------------> 0.00 0.25 0.50 0.75 1.00 0 5000 10000 15000 Number of ranked genes (by decreasing order of coefficient of variation) Ra tio o f c om m on g en es ac ro ss a ll da ta se ts Data Real Randomised Figure C.20. Intersection size course of 𝒲2 genes (based on their coefficient ofvariation rank in each of the two studies). Globally, there are two parts: one initial strong growth (a) and a second part (b) where the curve shifts shallowly towards the expected final ratio of 1. While the number of genes involved in𝒲2 is higher than in𝒲1, the ratio of common genes that are thetwentieth most variable in each study is above 75%. There are three main reasons that may explain this improved result compared to Figure C.19. • 𝒲2 involves a smaller degree (i.e. number of studies) than 𝒲1 (respectively 2 and 5). Hence, bigger intersection sizes are easier to occur. • As previously mentioned, GTEx and Uhlén studies provide probably more accurate TREPs than the other studies (see section 4.2 on page 98). • The greater number of tissues induces a wider range of coefficients of variation, which allows picking up genes with more subtle variations. 222 C.2 most variable genes 0 500 1000 1 2 3 4 Number of tissues Ge ne s c ou nt Castle: Most variable genes breadth (cv ≥1.5; expression ≥ 1 FPKM) (a) Castle 0 500 1000 1500 2000 1 2 3 4 Number of tissues Ge ne s c ou nt Brawand: Most variable genes breadth (cv ≥1.5; expression ≥ 1 FPKM) (b) Brawand 0 500 1000 1500 2000 1 2 3 4 Number of tissues Ge ne s c ou nt IBM: Most variable genes breadth (cv ≥1.5; expression ≥ 1 FPKM) (c) IBM 0 500 1000 1500 2000 2500 1 2 3 4 Number of tissues Ge ne s c ou nt Uhlen (4 Tissues): Most variable genes breadth (cv ≥1.5; expression ≥ 1 FPKM) (d) Uhlen 0 500 1000 1500 2000 1 2 3 4 Number of tissues Ge ne s c ou nt GTEx (4 Tissues): Most variable genes breadth (cv ≥1.5; expression ≥ 1 FPKM) (e) GTEx Figure C.21. Breadth of expression (≥1 FPKM) for the most variable mRNAs (cv≥1.5) across𝒲1. Most of these mRNAs are expressed only at 1 FPKM or above. 223 supplementary material for chapter 4 c.3 tissue specific (ts) genes c.3.1 Hampel method The Hampel method allows detecting outliers and it relies on the median and the MAD (median absolute deviation) as robust estimate of the location and spread (instead of the more commonly used mean and standard deviation). [Hampel, 1971; Hampel, 1974] Algorithm 1: Hampel method Data: Expression matrix; Genes as rows and conditions (tissues) as columns Input: threshold: numeric Input: bool: boolean Result: Indicates if a gene presents an atypical (outlier) expression for any condition (as a boolean or a numeric ratio) foreach Gene g (i.e. row) of the input matrix do med=compute median(g); /* compute the M.A.D. (median absolute deviation) */ mad=median(absolute(g-med)); if !bool then /* Return boolean answer */ newg=absolute(g-med) > threshold*mad; else /* Return ratios that can be later sorted */ newg=(absolute(g-med))/mad; end return(newg); end c.3.1.1 Median Median can be found by listing all values from smallest to greatest. If the number of values is odd, the median is the middle value. If the number of values is even, the median is the mean of the two middle values. c.3.1.2 Median absolute deviation (M.A.D.) For a give variable X, M.A.D. is defined as follow: 𝑀𝐴𝐷 = 𝑚𝑒𝑑𝑖𝑎𝑛(|𝑋𝑖 −𝑚𝑒𝑑𝑖𝑎𝑛(𝑋)|) (M.A.D.) 224 C.4 list of publications based on rna-seq and covering at least partially its robustness c.3.2 List of the tissues available in TiGER Thirty tissues (or equivalent) are available through this database: Bladder, Blood, Bone, Bone marrow, Brain, Cervix, Colon, Eye, Heart, Kidney, Larynx, Liver, Lung, Lymph node, Mammary gland, Muscle, Ovary, Pancreas, Peripheral nervous system, Placenta, Prostate, Skin, Small intestine, Soft tissue, Spleen, Stomach, Testis, Thymus, Tongue and Uterus. c.3.3 Uhlén categories c.4 list of publications based on rna-seq and covering at least partially its robustness • SEQC/MAQC-III Consortium (2014). ‘A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium’. Nat. Biotechnol. 32 (9), pp. 903–914 • A. Santos et al. (2015). ‘Comprehensive comparison of large-scale tissue expression datasets’. PeerJ 3, e1054 • P. H. Sudmant et al. (2015). ‘Meta-analysis of RNA-seq expression data across species, tissues and studies’. Genome Biol. 16, p. 287 • F. Danielsson et al. (2015). ‘Assessing the consistency of public human tissue RNA-seq data sets’. Briefings Bioinf. 16 (6), pp. 941–949 • M. Uhlén, B. M. Hallström, et al. (2016). ‘Transcriptomics resources of human tissues and organs’. Mol. Syst. Biol. 12 (4) • L. Peixoto et al. (2015). ‘How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets’. Nucleic Acids Res. 43 (16), pp. 7664–7674 • Q. Wang et al. (2017). ‘Enabling cross-study analysis of RNA-Sequencing data’. bioRxiv (110734) 225 supplementary material for chapter 4 TableC.3.Uhlénetal.genecategoriesforallgenes(i.e.unrestrictedtoprotein-codinggenes) Ensembl76 (62,757genedefinitions) Not detected Notexpressed at1FPKM cut-off Mixedexpression Ubiquitousexpression Group Enhanced Tissue Enhanced Tissue Enriched Low (<10FPKM) High (≥10FPKM) Low (<10FPKM) High (≥10FPKM) Whole dataset Castle 18,836 16,258 19,079 1,203 1,456 703 77 8,319 3,896 Brawand 18,278 20,173 15,254 2,057 1,873 977 0 6,180 5,442 IBM 14,494 20,858 16,633 1,582 1,194 926 733 10,042 4,453 Uhlen 17,345 16,548 15,372 1,351 467 419 4,615 10,644 5,498 Gtex 5,755 25,138 17,172 1,464 775 713 7,164 10,032 5,117 Consensus 5,747 4,231 4,121 230 33 166 0 1,073 531[518] Common 4 tissues Working datasets Castle 43,921 3,267 14,850 1,735 3,267 1,181 — — 4,645 Brawand 44,479 3,193 13,975 2,541 3,193 1,282 — — 7,002 IBM 48,263 3,262 13,672 2,160 3,262 1,299 — — 5,242 Uhlen 45,412 3,146 14,332 2,546 3,146 1,213 — — 7,665 Gtex 57,002 4,516 16,652 2,771 4,516 1,459 — — 8,155 Consensus 9,655 557 4,366 675 557 448 — — 1,960 Common 23 tissues Working datasets Uhlen 17,345 27,575 14,981 1,427 611 440 2,203 11,252 5,678 Gtex 5,755 38,988 16,982 1,909 2,122 1,021 1,746 11,236 5,971 Consensus 5,755 27,149 12,250 973 433 430 797 8,030 4,281 226 C.4 list of publications based on rna-seq and covering at least partially its robustness Fig ur eC .22 .M os ts pe cif ic ge ne sh igh lig ht ed in EB Ig en ee xp re ssi oa tla s. 227 D SUPPLEMENTARY MATER IAL FORCHAPTER 5 44 48 596 230 600 416 498 Cutler Kuster Pandey (a) Heart 492 126 660 90 440 1062 374 Cutler Kuster Pandey (b) Lung 7 8 366 10 59 2221 1888 Cutler Kuster Pandey (c) Ovary 167 36 337 265 501 1230 1690 Cutler Kuster Pandey (d) Pancreas Figure D.1. Unique and shared proteins across the proteomic studies 2606 4795649 Pandey Kuster Figure D.2. Proteins overlap between the fourteen common tissues between Pandey and Kuster proteome data. 229 supplementary material for chapter 5 776 6402064 Kuster Pandey (a) Adrenal 767 5092046 Pandey Kuster (b) Colon 1738 1851124 Kuster Pandey (c) Oesophagus 1169 5531080 Pandey Kuster (d) Gall bladder 617 4461482 Pandey Kuster (e) Kidney 1244 2671791 Pandey Kuster (f) Liver 649 5371437 Kuster Pandey (g) Placenta 1317 4442064 Pandey Kuster (h) Prostate 1019 3721673 Pandey Kuster (i) Rectum 2343 4822566 Pandey Kuster (j) Testis Figure D.3. Unique and shared proteins across the other ten common tissues between Pandey and Kuster proteomic studies Table D.1. Proteins found in every tissue in all three datasets Ensembl (76) gene ID Gene symbol ENSG00000163631 ALB ENSG00000171403 KRT9 ENSG00000186395 KRT10 230 supplementary material for chapter 5 Table D.2. Proteins found in every tissue in Pandey and Kuster datasets Ensembl (76) ID Gene symbol ENSG00000023191 RNH1 ENSG00000044574 HSPA5 ENSG00000067225 PKM ENSG00000071127 WDR1 ENSG00000074800 ENO1 ENSG00000080824 HSP90AA1 ENSG00000089220 PEBP1 ENSG00000089597 GANAB ENSG00000092820 EZR ENSG00000096384 HSP90AB1 ENSG00000100345 MYH9 ENSG00000102144 PGK1 ENSG00000108518 PFN1 ENSG00000108953 YWHAE ENSG00000111530 CAND1 ENSG00000111640 GAPDH ENSG00000111669 TPI1 ENSG00000111716 LDHB ENSG00000117450 PRDX1 ENSG00000130985 UBA1 ENSEMBL (76) ID Gene symbol ENSG00000134308 YWHAQ ENSG00000134333 LDHA ENSG00000140575 IQGAP1 ENSG00000148180 GSN ENSG00000149925 ALDOA ENSG00000160752 FDPS ENSG00000163631 ALB ENSG00000164924 YWHAZ ENSG00000165280 VCP ENSG00000166598 HSP90B1 ENSG00000166794 PPIB ENSG00000167658 EEF2 ENSG00000170027 YWHAG ENSG00000170248 PDCD6IP ENSG00000171403 KRT9 ENSG00000178209 PLEC ENSG00000179218 CALR ENSG00000182718 ANXA2 ENSG00000186395 KRT10 ENSG00000204628 GNB2L1 231 supplementary material for chapter 5 Table D.3. Tissue specific proteins found both in Pandey et al. and Kuster et al. datasets Tissue Ensembl (76) ID Gene symbol Adrenal gland ENSG00000141744 PNMT Adrenal gland ENSG00000148655 C10orf11 Adrenal gland ENSG00000160882 CYP11B1 Adrenal gland ENSG00000163428 LRRC58 Adrenal gland ENSG00000163626 COX18 Kidney ENSG00000074803 SLC12A1 Kidney ENSG00000100253 MIOX Kidney ENSG00000112499 SLC22A2 Kidney ENSG00000113361 CDH6 Kidney ENSG00000148942 SLC5A12 Kidney ENSG00000149452 SLC22A8 Kidney ENSG00000154025 SLC5A10 Kidney ENSG00000158296 SLC13A3 Kidney ENSG00000169344 UMOD Kidney ENSG00000170482 SLC23A1 Kidney ENSG00000186335 SLC36A2 Kidney ENSG00000197901 SLC22A6 Liver ENSG00000084734 GCKR Liver ENSG00000100197 CYP2D6 Liver ENSG00000135094 SDS Liver ENSG00000172497 ACOT12 Liver ENSG00000198650 TAT Pancreas ENSG00000010438 PRSS3 Tissue Ensembl (76) ID Gene symbol Pancreas ENSG00000114204 SERPINI2 Pancreas ENSG00000141086 CTRL Pancreas ENSG00000143954 REG3G Pancreas ENSG00000187021 PNLIPRP1 Pancreas ENSG00000215704 CELA2B Pancreas ENSG00000266200 PNLIPRP2 Placenta ENSG00000105825 TFPI2 Placenta ENSG00000116183 PAPPA2 Placenta ENSG00000137868 STRA6 Placenta ENSG00000148848 ADAM12 Placenta ENSG00000163283 ALPP Placenta ENSG00000172296 SPTLC3 Placenta ENSG00000172901 Placenta ENSG00000183668 PSG9 Placenta ENSG00000243137 PSG4 Prostate ENSG00000044524 EPHA3 Prostate ENSG00000103710 RASL12 Rectum ENSG00000205277 MUC12 Testis ENSG00000052841 TTC17 Testis ENSG00000109762 SNX25 Testis ENSG00000130948 HSD17B3 Testis ENSG00000160310 PRMT2 232 supplementary material for chapter 5 Lu ng (C utl er) Pa nc rea s ( Cu tle r) He art (P an de y) He art (C utl er) Pa nc rea s ( Pa nd ey ) Lu ng (P an de y) Ov ary (C utl er) He art (K ust er) Ov ary (P an de y) Pa nc rea s ( Ku ste r) Lu ng (K ust er) Ov ary (K ust er) Lung (Cutler) Pancreas (Cutler) Heart (Pandey) Heart (Cutler) Pancreas (Pandey) Lung (Pandey) Ovary (Cutler) Heart (Kuster) Ovary (Pandey) Pancreas (Kuster) Lung (Kuster) Ovary (Kuster) -1 -0.5 0 0.5 1 Pearson Correlation 0 10 20 30 40 Co un t Figure D.4. Heatmap of the four common tissues between the three proteome datasets based on the pairwise Pearson correlations clustering of 1,384 proteins expression levels. See Figure 5.5 for the heatmap based on Spearman correlation. 233 supplementary material for chapter 5 Figure D.5. Scatterplot of Heart for Cutler and Pandey data. There is an overall strong correlation between the expression of the Heart proteins between Pandey (y-axis) and Cutler (x-axis) even though the dispersion is quite substantial. The black line 𝑦 = 𝑥 is only present as a visual reference. 234 supplementary material for chapter 5 Figure D.6. Scatterplot of Heart for Kuster and Pandey data. The dispersion of expression is more significant than between Pandey and Cutler (see Figure D.5). It is rather difficult to assess the protein expression in Heart from Pandey (or Kuster) based on the other study. The black line 𝑦 = 𝑥 is only present as a visual reference. 235 supplementary material for chapter 5 Kid ne y ( Ku ste r) Kid ne y ( Pa nd ey ) Liv er (Pa nd ey ) Ga ll b lad de r (K ust er) Liv er (K ust er) Pla cen ta (K ust er) Pla cen ta (Pa nd ey ) Lu ng (P an de y) Lu ng (K ust er) He art (P an de y) Oe sop ha gu s ( Pa nd ey ) Pa nc rea s ( Pa nd ey ) Ga ll b lad de r (P an de y) Co lon (P an de y) Re ctu m (Pa nd ey ) Pro sta te (Pa nd ey ) Ov ary (K ust er) Ov ary (P an de y) Te sti s ( Pa nd ey ) Ad ren al (Pa nd ey ) Te sti s ( Ku ste r) Ad ren al (K ust er) Pa nc rea s ( Ku ste r) He art (K ust er) Pro sta te (K ust er) Oe sop ha gu s ( Ku ste r) Re ctu m (K ust er) Co lon (K ust er) Kidney (Kuster) Kidney (Pandey) Liver (Pandey) Gall bladder (Kuster) Liver (Kuster) Placenta (Kuster) Placenta (Pandey) Lung (Pandey) Lung (Kuster) Heart (Pandey) Oesophagus (Pandey) Pancreas (Pandey) Gall bladder (Pandey) Colon (Pandey) Rectum (Pandey) Prostate (Pandey) Ovary (Kuster) Ovary (Pandey) Testis (Pandey) Adrenal (Pandey) Testis (Kuster) Adrenal (Kuster) Pancreas (Kuster) Heart (Kuster) Prostate (Kuster) Oesophagus (Kuster) Rectum (Kuster) Colon (Kuster) 0.5 0.7 0.9 Spearman Correlation 0 40 80 12 0 Co un t Figure D.7. Heatmap of the fourteen common tissues between Pandey and Kuster datasets based on the pairwise Spearman correlations of the expression levels of their 4,172 common proteins. Placenta, Lung and Kidney TREPs between Pandey and Kuster show an overall higher biological similarity than technical variability to the other tissues from the same study source. See also Figures D.8 to D.11. 236 supplementary material for chapter 5 Liv er (K ust er) Liv er (Pa nd ey ) Ga ll b lad de r (K ust er) He art (P an de y) Pla cen ta (Pa nd ey ) Pla cen ta (K ust er) Pa nc rea s ( Pa nd ey ) Ga ll b lad de r (P an de y) Lu ng (P an de y) Re ctu m (Pa nd ey ) Pro sta te (Pa nd ey ) Co lon (P an de y) Re ctu m (K ust er) He art (K ust er) Pro sta te (K ust er) Co lon (K ust er) Ad ren al (Pa nd ey ) Ad ren al (K ust er) Lu ng (K ust er) Ov ary (K ust er) Ov ary (P an de y) Te sti s ( Pa nd ey ) Kid ne y ( Pa nd ey ) Oe sop ha gu s ( Pa nd ey ) Te sti s ( Ku ste r) Kid ne y ( Ku ste r) Pa nc rea s ( Ku ste r) Oe sop ha gu s ( Ku ste r) Liver (Kuster) Liver (Pandey) Gall bladder (Kuster) Heart (Pandey) Placenta (Pandey) Placenta (Kuster) Pancreas (Pandey) Gall bladder (Pandey) Lung (Pandey) Rectum (Pandey) Prostate (Pandey) Colon (Pandey) Rectum (Kuster) Heart (Kuster) Prostate (Kuster) Colon (Kuster) Adrenal (Pandey) Adrenal (Kuster) Lung (Kuster) Ovary (Kuster) Ovary (Pandey) Testis (Pandey) Kidney (Pandey) Oesophagus (Pandey) Testis (Kuster) Kidney (Kuster) Pancreas (Kuster) Oesophagus (Kuster) 0.2 0.4 0.6 0.8 1 Pearson Correlation 0 40 80 12 0 Co un t Figure D.8. Heatmap of the fourteen common tissues between Pandey and Kuster datasets based on the pairwise Pearson correlations of the expression levels of their 4,172 common proteins. Only Placenta andAdrenal gland TREPs between Pandey and Kuster show a greater biological similarity than technical one. See also Figures D.7 and D.9 to D.11. 237 supplementary material for chapter 5 Figure D.9. Scatterplot of Placenta for Kuster and Pandey data. While the expression of some proteins are more spread between both datasets, there is an overall strong linear correlation between Pandey and Kuster for their Placenta tissue. Besides a few exception, protein expression levels in Pandey seem to be underestimated compared to Kuster. This is most probably due to the normalisation method. The black line 𝑦 = 𝑥 is only present as a visual reference. 238 supplementary material for chapter 5 Figure D.10. Scatterplot of Pancreas andAdrenal (fromKuster). Although Kuster’s Pancreas and Adrenal gland are never found in Figures D.7 and D.8 as the most similar to each other, their expression levels present a strong linear relationship to each other. The dispersion and outliers seem insuffisant to unquestionably distinguish different tissues. Figure D.11 is even more compelling. The black line 𝑦 = 𝑥 is only present as a visual reference. 239 supplementary material for chapter 5 Figure D.11. Scatterplot of Kuster Pancreas and PandeyAdrenal. As in Figure D.10, Kuster’s Pancreas and Pandey’s Adrenal gland are never found as the most similar to each other. Once again, there is a strong linear relationship between their protein expression levels, even though the dispersion is greater and the outliers more numerous. The black line 𝑦 = 𝑥 is only present as a visual reference. 240 supplementary material for chapter 5 Figure D.12. Heatmap of the fourteen common tissues between Pandey and Kuster (PPKM) datasets based on the pairwise Spearman correlations of the expression levels of their 8,680 common proteins. Compared to Figure D.7, once again Placenta, Lung, Kidney are displaying a higher biological signal than technical variability. This is also the case of Adrenal gland tissue (as in Figure D.8). Thus, the results are similar to the ones from the first quantification method. 241 supplementary material for chapter 5 Adrenal Heart Colon Oesophagus Gallbladder Kidney Liver Lung Ovary Pancreas Placenta Prostate Rectum Testis 0 2500 5000 7500 Protein count Ti ss ue Present in Both Kuster & Pandey Pandey only Kuster only Figure D.13. Number of identified proteins in each of the fourteen common tissues for Kuster and Pandey proteomic data with our new PPKM quantification method. Although the number of proteins quantified by each method is different, this figure is very similar to Figure D.14. Adrenal Heart Colon Oesophagus Gallbladder Kidney Liver Lung Ovary Pancreas Placenta Prostate Rectum Testis 0 2000 4000 Protein count Ti ss ue Present in Both Kuster & Pandey Pandey only Kuster only Figure D.14. Number of identified proteins in each of the fourteen common tissues for Kuster and Pandey proteomic data quantified with the quantification described in Chapter 2. 242 E SUPPLEMENTARY MATER IAL FORCHAPTER 6 e.1 hypergeometric test The hypergeometric test uses the hypergeometric distribution and equates to the one-sided Fisher’s exact test. It allows measuring the statistical significance of randomly sampling 𝑘 successes out of 𝑛 draws, without replacement, from a population of𝑁 that contains𝐾 successes. Depending on whether the test is about an over or under-representation, the p- value is the probability of drawing respectively a minimum or a maximum of 𝑘 successes. See also N. L. Johnson et al. (2005) for more examples using this test. Figure E.1. Scatterplot of protein (Pandey et al. — Top3 quantification) and mRNA (Uhlén et al.) expression for Kidney. 243 Figure E.2. Overview of the tissue scatterplots between Uhlén and Pandey data. The Liver presents the highest correlation and the Oesophagus the lowest one. 244 Table E.1. Found proteins without a counterpart in the transcriptomic data Set ENSEMBL (76) ID Gene name Biotype Description Source andAccessing number a, b, c, d ENSG00000173349 SFT2D3 p. coding SFT2 domain containing 3 HGNC SymbolAcc: 28767 a, b ENSG00000198788 MUC2 processedtranscript mucin 2, oligomeric mucus/gel-forming HGNC Symbol Acc: 7512 a, b, c, d ENSG00000223953 C1QTNF5 p. coding C1q and tumor necrosisfactor related protein 5 HGNC Symbol Acc:14344 a, b ENSG00000256453 DND1 p. coding DND microRNA-mediatedrepression inhibitor HGNC Symbol Acc:23799 a, b, c, d ENSG00000262664 OVCA2 p. coding ovarian tumor suppressorcandidate 2 HGNC Symbol Acc:24203 b ENSG00000163157 TMOD4 p. coding tropomodulin 4 (muscle) HGNC SymbolAcc:11874 b ENSG00000203618 GP1BB p. coding glycoprotein Ib (platelet),beta polypeptide HGNC Symbol Acc:4440 b ENSG00000251322 SHANK3 processedtranscript SH3 and multiple ankyrin repeat domains 3 HGNC Symbol Acc:14294 c, d ENSG00000105371 ICAM4 p. coding intercellular adhesion molecule 4(Landsteiner-Wiener blood group) HGNC Symbol Acc:5347 c ENSG00000164708 PGAM2 p. coding phosphoglycerate mutase 2 (muscle) HGNC SymbolAcc:8889 c, d ENSG00000181404 XXyac-YRM2039.2 unprocesssedpseudogene c, d ENSG00000183336 BOLA2 p. coding bolA family member 2 HGNC SymbolAcc:29488 245 Table E.1. Found proteins without a counterpart in the transcriptomic data Set ENSEMBL (76) ID Gene name Biotype Description Source andAccessing number c ENSG00000196101 HLA-DRB3 p. coding major histocompatibility complex,class II, DR beta 3 HGNC Symbol Acc:4951 c ENSG00000203618 GP1BB glycoprotein Ib (platelet),beta polypeptide HGNC Symbol Acc:4440 c, d ENSG00000206203 TSSK2 p. coding testis-specific serine kinase 2 HGNC SymbolAcc:1140 c, d ENSG00000206240,ENSG00000206306 HLA-DRB1 p. coding major histocompatibility complex, class II, DR beta 1 HGNC Symbol Acc:4948 c, d ENSG00000206305 HLA-DQA1 major histocompatibility complex,class II, DQ alpha 1 HGNC Symbol Acc:4942 c, d ENSG00000206450,ENSG00000223532 HLA-B p. coding major histocompatibility complex, class I, B HGNC Symbol Acc:4932 c, d ENSG00000225691 HLA-C p. coding major histocompatibility complex,class I, C HGNC Symbol Acc:4933 c, d ENSG00000206505, ENSG00000224320, ENSG00000227715, ENSG00000235657, ENSG00000223980, ENSG00000229215 HLA-A p. coding major histocompatibility complex,class I, A HGNC Symbol Acc:4931 c, d ENSG00000211594 IGKJ4 IG J gene immunoglobulin kappa joining 4 HGNC SymbolAcc:5722 c, d ENSG00000211595 IGKJ3 IG J gene immunoglobulin kappa joining 3 HGNC SymbolAcc:5721 246 Table E.1. Found proteins without a counterpart in the transcriptomic data Set ENSEMBL (76) ID Gene name Biotype Description Source andAccessing number c ENSG00000213402 PTPRCAP p. coding protein tyrosine phosphatase,receptor type, C-associated protein HGNC Symbol Acc:9667 c ENSG00000215695 RSC1A1 p. coding regulatory solute carrier protein,family 1, member 1 HGNC Symbol Acc:10458 c, d ENSG00000227357 HLA-DRB4 p. coding major histocompatibility complex,class II, DR beta 4 HGNC Symbol Acc:4952 c, d ENSG00000231021 HLA-DRB4 p. coding major histocompatibility complex,class II, DR beta 4 RefSeq mRNA Acc:NM_021983 c, d ENSG00000231286 HLA-DQB1 p. coding major histocompatibility complex,class II, DQ beta 1 HGNC Symbol Acc:4944 c, d ENSG00000231679 HLA-DRB3 p. coding major histocompatibility complex,class II, DR beta 3 RefSeq mRNA Acc:NM_022555 c, d ENSG00000256453 DND1 p. coding DND microRNA-mediatedrepression inhibitor HGNC Symbol Acc:23799 c ENSG00000263353 CH17-118O6.1 processedtranscript c, d ENSG00000276938 FAM157A p. coding Homo sapiens family withsequence similarity 157, member A RefSeq mRNA Acc:NM_001145248 c, d ENSG00000277656 GSTT1 p. coding glutathione S-transferase theta 1 HGNC SymbolAcc:4641 c, d ENSG00000277897 GSTT2 p. coding d ENSG00000105507 CABP5 p. coding calcium binding protein 5 HGNC SymbolAcc:3714 d ENSG00000105954 NPVF p. coding neuropeptide VF precursor HGNC SymbolAcc:13782 247 Table E.1. Found proteins without a counterpart in the transcriptomic data Set ENSEMBL (76) ID Gene name Biotype Description Source andAccessing number d ENSG00000142539 CTD-2545M3.6 p. coding d ENSG00000147896 IFNK p. coding interferon, kappa HGNC SymbolAcc:21714 d ENSG00000148136 OR13C4 p. coding olfactory receptor, family 13,subfamily C, member 4 HGNC Symbol Acc:4722 d ENSG00000163157 TMOD p. coding tropomodulin 4 (muscle) HGNC SymbolAcc:11874 d ENSG00000164708 PGAM2 p. coding phosphoglycerate mutase 2(muscle) HGNC Symbol Acc:8889 d ENSG00000166884 OR4D6 p. coding olfactory receptor, family 4,subfamily D, member 6 HGNC Symbol Acc:15175 d ENSG00000169840 GSX1 p. coding GS homeobox 1 HGNC SymbolAcc:20374 d ENSG00000170929 OR1M1 p. coding olfactory receptor, family 1,subfamily M, member 1 HGNC Symbol Acc:8220 d ENSG00000171053 PATE1 p. coding prostate and testis expressed 1 HGNC SymbolAcc:24664 d ENSG00000171396 KRTAP4-4 p. coding keratin associated protein 4-4 HGNC SymbolAcc:16928 d ENSG00000172155 LCE1D p. coding late cornified envelope 1D HGNC SymbolAcc:29465 d ENSG00000176239 OR51B6 p. coding olfactory receptor, family 51,subfamily B, member 6 HGNC Symbol Acc:19600 d ENSG00000182346 DAOA p. coding D-amino acid oxidase activator HGNC SymbolAcc:21191 248 Table E.1. Found proteins without a counterpart in the transcriptomic data Set ENSEMBL (76) ID Gene name Biotype Description Source andAccessing number d ENSG00000182591 KRTAP11-1 p. coding keratin associated protein 11-1 HGNC SymbolAcc:18922 d ENSG00000184321 OR51J1 p. coding olfactory receptor, family 51, subfamily J, member 1 (gene/pseudogene) HGNC Symbol Acc:14856 d ENSG00000187173 LCE2A p. coding late cornified envelope 2A HGNC SymbolAcc:29469 d ENSG00000187766 KRTAP10-8 p. coding keratin associated protein 10-8 HGNC SymbolAcc:20525 d ENSG00000196101 HLA-DRB3 p. coding major histocompatibility complex,class II, DR beta 3 HGNC Symbol Acc:4951 d ENSG00000203618 GP1BB p. coding glycoprotein Ib (platelet),beta polypeptide HGNC Symbol Acc:4440 d ENSG00000203818 HIST2H3PS2 p. coding histone cluster 2, H3,pseudogene 2 HGNC Symbol Acc:32060 d ENSG00000205883 DEFB135 p. coding defensin, beta 135 HGNC SymbolAcc:32400 d ENSG00000206452 HLA-C p. coding major histocompatibility complex,class I, C HGNC Symbol Acc:4933 d ENSG00000211831 TRAJ61 TR J gene T cell receptor alpha joining 61(non-functional) HGNC Symbol Acc:12094 d ENSG00000211835 TRAJ56 TR J gene T cell receptor alpha joining 56 HGNC SymbolAcc:12088 d ENSG00000213316 LTC4S p. coding leukotriene C4 synthase HGNC SymbolAcc:6719 249 Table E.1. Found proteins without a counterpart in the transcriptomic data Set ENSEMBL (76) ID Gene name Biotype Description Source andAccessing number d ENSG00000213402 PTPRCAP p. coding protein tyrosine phosphatase,receptor type, C-associated protein HGNC Symbol Acc:9667 d ENSG00000215695 RSC1A1 p. coding regulatory solute carrier protein,family 1, member 1 HGNC Symbol Acc:10458 d ENSG00000224902 GAGE12H p. coding G antigen 12H HGNC SymbolAcc:31908 d ENSG00000233732 IGHV3OR16-10 IG V gene immunoglobulin heavy variable 3OR16-10 (non-functional) HGNC Symbol Acc:5634 d ENSG00000249209 AP000304.12 p. coding d ENSG00000249730 OR10J4 polymorphicpseudogene olfactory receptor, family 10, subfamily J, member 4 (gene/pseudogene) HGNC Symbol Acc:15408 d ENSG00000253148 RGS21 p. coding regulator of G-protein signaling 21 HGNC SymbolAcc:26839 d ENSG00000255009 UBTFL1 processedpseudogene upstream binding transcription factor, RNA polymerase I-like 1 HGNC Symbol Acc:14533 d ENSG00000255472 RP11-998D10.1 p. coding uncharacterized protein UniProtKB/TrEMBLAcc:E9PR74 d ENSG00000259490 IGHV3OR15-7 IG V gene immunoglobulin heavy variable 3OR15-7 (pseudogene) HGNC Symbol Acc:5633 d ENSG00000263353 CH17-118O6.1 processedtranscript d ENSG00000270467 IGHV3OR16-12 IG V gene immunoglobulin heavy variable 3OR16-12 (non-functional) HGNC Symbol Acc:5636 250 Table E.1. Found proteins without a counterpart in the transcriptomic data Set ENSEMBL (76) ID Gene name Biotype Description Source andAccessing number d ENSG00000270472 IGHV3OR16-9 IG V gene immunoglobulin heavy variable 3OR16-9 (non-functional) HGNC Symbol Acc:5644 q21.11Chromosome bands Showing all 9 features - click to show fewer Proteins from UniProtKB Showing all 14 features - click to show fewer Human cDNAs (RefSeq/ENA) Showing all 58 features - click to show fewer Human cDNAs (RefSeq/ENA) Showing 27 of 39 features, due to track being limited to 6 rows by default - click to show more Proteins from UniProtKB Ensembl Homo sapiens version 98.38 (GRCh38.p13) Chromosome 8: 73,038,197 - 73,745,474 Figure E.3. STAU2 definition The chromosome annotations for the mRNA and the protein of STAU2 are different. 251 Spearman Pearson 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Co rr el at io n Compared studies Pandey x GTEx Pandey x Uhlén Uhlén x GTEx Quantification methods HTSeq x HTSeq PPKM x HTSeq Top3 x HTSeq Tissue Number 12 15 Figure E.4. Distribution of Pearson and Spearman correlation coefficients for same-tissue proteomic and transcriptomic pairs versus random tissue pairs (untransformed data). 252 Table E.2. Summary of Pearson and Spearman correlation coefficients between proteomics and transcriptomics across several data combinations. See also Figure 6.8. Mean correlation of Datasets Numberof tissues Quantification methods Scaled data log2(𝑥 + 1) Correlation method Same-tissue pairs Different tissues pairs p-value Pandey et al. Uhlén et al. 12 Top3 x HTSeq True Spearman 0.51 0.37 4.66e-07 Pandey et al. GTEx 12 Top3 x HTSeq True Spearman 0.5 0.37 7.379e-07 Uhlén et al. GTEx 12 HTSeq x HTSeq True Spearman 0.91 0.66 < 2.2e-16 Pandey et al. Uhlén et al. 15 Top3 x HTSeq True Spearman 0.5 0.38 2.659e-08 Pandey et al. Uhlén et al. 12 Top3 x HTSeq True Pearson 0.11 0.06 0.03696 Pandey et al. GTEx 12 Top3 x HTSeq True Pearson 0.12 0.07 0.02895 Uhlén et al. GTEx 12 HTSeq x HTSeq True Pearson 0.93 0.68 < 2.2e-16 Pandey et al. Uhlén et al. 15 Top3 x HTSeq True Pearson 0.1 0.06 0.02271 Pandey et al. Uhlén et al. 12 PPKM x HTSeq True Spearman 0.52 0.42 4.795e-05 Pandey et al. GTEx 12 PPKM x HTSeq True Spearman 0.52 0.43 8.475e-05 Uhlén et al. GTEx 12 HTSeq x HTSeq True Spearman 0.92 0.72 < 2.2e-16 Pandey et al. Uhlén et al. 15 PPKM x HTSeq True Spearman 0.52 0.43 8.422e-06 Pandey et al. Uhlén et al. 12 PPKM x HTSeq True Pearson 0.5 0.37 0.0004002 Pandey et al. GTEx 12 PPKM x HTSeq True Pearson 0.5 0.41 0.0003306 Uhlén et al. GTEx 12 HTSeq x HTSeq True Pearson 0.94 0.73 < 2.2e-16 Pandey et al. Uhlén et al. 15 PPKM x HTSeq True Pearson 0.49 0.4 9.941e-05 Pandey et al. Uhlén et al. 12 Top3 x HTSeq False Spearman 0.51 0.37 4.66e-07 Pandey et al. GTEx 12 Top3 x HTSeq False Spearman 0.5 0.37 7.379e-07 Uhlén et al. GTEx 12 HTSeq x HTSeq False Spearman 0.91 0.66 < 2.2e-16 Pandey et al. Uhlén et al. 15 Top3 x HTSeq False Spearman 0.5 0.38 2.66e-08 253 Table E.2. Summary of Pearson and Spearman correlation coefficients between proteomics and transcriptomics across several data combinations. See also Figure 6.8. Mean correlation of Datasets Numberof tissues Quantification methods Scaled data log2(𝑥 + 1) Correlation method Same-tissue pairs Different tissues pairs p-value Pandey et al. Uhlén et al. 12 Top3 x HTSeq False Pearson 0.17 0.09 0.022 Pandey et al. GTEx 12 Top3 x HTSeq False Pearson 0.17 0.1 0.015 Uhlén et al. GTEx 12 HTSeq x HTSeq False Pearson 0.92 0.64 < 2.2e-16 Pandey et al. Uhlén et al. 15 Top3 x HTSeq False Pearson 0.16 0.1 0.012 Pandey et al. Uhlén et al. 12 PPKM x HTSeq False Spearman 0.52 0.42 4.795e-05 Pandey et al. GTEx 12 PPKM x HTSeq False Spearman 0.52 0.43 8.475e-05 Uhlén et al. GTEx 12 HTSeq x HTSeq False Spearman 0.92 0.72 < 2.2e-16 Pandey et al. Uhlén et al. 15 PPKM x HTSeq False Spearman 0.52 0.43 8.422e-06 Pandey et al. Uhlén et al. 12 PPKM x HTSeq False Pearson 0.55 0.43 1.059e-06 Pandey et al. GTEx 12 PPKM x HTSeq False Pearson 0.56 0.45 2.026e-06 Uhlén et al. GTEx 12 HTSeq x HTSeq False Pearson 0.93 0.69 < 2.2e-16 Pandey et al. Uhlén et al. 15 PPKM x HTSeq False Pearson 0.55 0.45 1.061e-07 254 E.1 hypergeometric test AdrenalAdrenal Adrenal ColonColon Colon GallbladderGallbladderGallbladder Heart Heart Heart Kidney Kidney Kidney LiverLiver Liver Lung Lung Lung OesophagusOesophagus Oesophagus Ovary Ovary Ovary PancreasPancreas Pancreas PlacentaPlacenta Placenta Prostate Prostate Prostate RectumRectum Rectum Testis Testis Testis Urinarybladder UrinarybladderUrinarybladder 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Jaccard index Pearson correlation Spearman correlation Ra nk Figure E.5. Rank comparison between the Pearson/Spearman correlation and the Jaccard indices computed for matching proteomics and transcriptomics. 255 supplementary material for chapter 6 e.2 ts protein percent The percentage of TS proteins is calculated as follow: ∀𝑎 ∈ 𝒜,∀𝑛 ∈ [1,𝒩] 𝑝𝒯𝒮(𝑛, 𝑎) = 𝑛 ∑ 𝑘=1 𝛿𝑔𝑎,𝑘 ⋅ 1 𝑛 ⋅ 100 (TS protein percentage) where: • 𝒮 is the set of 10,000 randomised expression datasets based on Pandey Lab data. These simulated datasets are created by random permutation of the gene labels and their associated vector of expression values across the tissues. •𝒟 is the set of expression datasets; 𝒟 = 𝒮 ∪ { protein expression for Pandey Lab data; mRNA expression for Uhlén et al. data; mRNA expression for GTEx data }. • 𝒢 is the set of genes 𝑔 that are shared by all elements of𝒟. •𝒩 is the number of elements in 𝒢. • 𝒯𝒮 is the set of genes 𝑔 for which the protein is TS (tissue-specific) in Pandey Lab data. 𝒯𝒮 ⊂ 𝒢. • ∀𝑔 ∈ 𝒢, 𝛿𝑔 = { 1 if 𝑔 ∈ 𝒯𝒮 0 if 𝑔 ∉ 𝒯𝒮 •𝒜 is a set of unordered 2-tuples of elements from𝒟;𝒜 = {(protein expression for Pandey Lab data, mRNA expression for Uhlén et al. data); (mRNA expression for Uhlén et al. data, mRNA expression for GTEx data); (𝑠, mRNA expression for Uhlén et al. data)}. ∀𝑠 ∈ 𝒮. • 𝒞 is a correlation function such that: ∀𝑔 ∈ 𝒢, ∀𝑎 = (𝑑1, 𝑑2) ∈ 𝒜 𝒞(𝑔, 𝑎) ⟼ correlation coefficient of 𝑔 for its expression across tissues shared by 𝑑1 and 𝑑2. • (𝑔𝑎,𝑘) is the sequence of genes 𝑔𝑎,𝑘 of 𝒢 such that: ∀𝑘 ∈ [1;𝒩− 1] 𝒞(𝑔𝑎,𝑘, 𝑎) ≥ 𝒞(𝑔𝑎,𝑘+1, 𝑎) 256 F L I ST OF R PACKAGES R [R Core Team, 2019] packages versions are only given as an indication. Most of the code can be run with older or newer versions. • extrafont (0.17) [Chang, 2014] • RColorBrewer (1.1) [Neuwirth, 2014] • Cairo (1.50) [Urbanek et al., 2019] • reshape2 (1.4.3) [Wickham, 2007] • scales (1.0) [Wickham, 2018] • MASS (7.3) [Venables et al., 2002] • data.table (1.12.2) [Dowle et al., 2019] • ggplot2 (3.1.1) [Wickham, 2016] • gridExtra (2.3) [Auguie, 2017] • gridBase (0.4) [Murrell, 2014] • ggthemes (4.1.1) [Arnold, 2019] • devtools (2.1.0) [Wickham et al., 2019] • modules [Schubert and Rudolph, 2014] • ebits [Rudolph, 2014] • VennDiagram (1.6.20) [H. Chen, 2018] • gplots (3.0.1.1) [Warnes et al., 2019] • Bioconductor (2.44) [Huber et al., 2015] • ape (5.3) [Paradis et al., 2019] • biomaRt (2.40) [Durinck et al., 2005] • clusterProfiler (3.12) [G. Yu et al., 2012] • org.Hs.eg.db (3.8.2) [Carlson, 2019] • WGCNA (1.67) [Langfelder et al., 2008] • mgcv (1.8) [Wood, 2004] • europepmc (0.3) [Jahn, 2018] • rmarkdown (1.12) [Xie, Allaire, et al., 2018] • DT (0.6) [Xie, Cheng, et al., 2019] • shiny (1.3.2) [Chang et al., 2019] • clustermq (0.8.5) [Schubert, 2019] I have created two packages to help reproduce the different analyses presented in this thesis: • barzinePhdData for the data (https://github.com/barzine/barzinePhdData), and • barzinePhdR (https://github.com/barzine/barzinePhdR). 257 G L I ST OF P UBL ICAT IONS Barzine, M. P., K. Freivalds, J. C. Wright, M. Opmanis, D. Rituma, F. Z. Ghavidel, A. F. Jarnuczak, E. Celms, K. Čerāns, I. Jonassen, L. Lace, J. Antonio Vizcaíno, J. S. Choudhary, A. Brazma, and J. Viksna (2020). ‘Using Deep Learning to Extrapolate Protein Expression Measurements’. Proteomics 20 (21-22), e2000009. Jarnuczak, A. F., H. Najgebauer, M. Barzine, D. J. Kundu, F. Ghavidel, Y. Perez-Riverol, I. Papatheodorou, A. Brazma, and J. A. Vizcaíno (2019). ‘An integrated landscape of protein expression in human cancer’. (under review). Petryszak, R., M. Keays, Y. A. Tang, N. A. Fonseca, E. Barrera, T. Burdett, A. Füllgrabe, A. M.-P. Fuentes, S. Jupp, S. Koskinen, O. Mannion, L. Huerta, K. Megy, C. Snow, E. Williams, M. Barzine, E. Hastings, H. Weisser, J. Wright, P. Jaiswal, W. Huber, J. Choudhary, H. E. Parkinson, and A. Brazma (2015). ‘Expression Atlas update—an integrated database of gene and protein expression in humans, animals and plants’. Nucleic Acids Research 44.D1, pp. D746–52. Rustici, G., E. Williams, M. Barzine, A. Brazma, R. Bumgarner, M. Chierici, C. Furlanello, L. Greger, G. Jurman, M. Miller, B. F. Francis Ouellette, J. Quackenbush, M. Reich, C. J. Stoeckert, R. C. Taylor, S. C. Trutane, J. Weller, B. Wilhelm, and N. Winegarden (2021). ‘Transcriptomics data availability and reusability in the transition from microarray to next-generation sequencing’. Wright, J. C., J. Mudge, H. Weisser, M. P. Barzine, J. M. Gonzalez, A. Brazma, J. S. Choudhary, and J. Harrow (2016). ‘Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow’. Nature Communications 7, p. 11778. 259 REFERENCES Aebersold, R. (2011). ‘Editorial: from data to results’. Mol. Cell. Proteom. 10 (11), E111.014787. Aebersold, R., J. N. Agar, I. J. Amster, M. S. Baker, C. R. Bertozzi, E. S. Boja, C. E. Costello, B. F. Cravatt, C. Fenselau, B. A. Garcia, Y. Ge, J. Gunawardena, R. C. Hendrickson, P. J. Hergenrother, C. G. Huber, A. R. Ivanov, O. N. Jensen, M. C. Jewett, N. L. Kelleher, L. L. Kiessling, N. J. Krogan, M. R. Larsen, J. A. Loo, R. R. Ogorzalek Loo, E. Lundberg, M. J. MacCoss, P. Mallick, V. K. Mootha, M. Mrksich, T. W. Muir, S. M. Patrie, J. J. Pesavento, S. J. Pitteri, H. Rodriguez, A. Saghatelian, W. Sandoval, H. Schlüter, S. Sechi, S. A. Slavoff, L. M. Smith, M. P. Snyder, P. M. Thomas, M. Uhlén, J. E. Van Eyk, M. Vidal, D. R. Walt, F. M. White, E. R. Williams, T. Wohlschlager, V. H. Wysocki, N. A. Yates, N. L. Young, and B. Zhang (2018). ‘How many human proteoforms are there?’ Nat. Chem. Biol. 14 (3), pp. 206–214. Aebersold, R. and M. Mann (2003). ‘Mass spectrometry-based proteomics’. Nature 422 (6928), pp. 198–207. Aebersold, R. and M. Mann (2016). ‘Mass-spectrometric exploration of proteome structure and function’. Nature 537 (7620), pp. 347–355. Aggarwal, S. and A. K. Yadav (2015). ‘False Discovery Rate Estimation in Proteomics’. Statistical Analysis in Proteomics. 1362 of the series Methods in Molecular Biology. Vol. 1362. New York, NY, US: Springer, pp. 119–128. Ahrné, E., L. Molzahn, T. Glatter, and A. Schmidt (2013). ‘Critical assessment of proteome-wide label-free absolute abundance estimation strategies’. Proteomics 13 (17), pp. 2567–2578. Ahrné, E., F. Nikitin, F. Lisacek, and M. Müller (2011). ‘QuickMod: A tool for open modification spectrum library searches’. J. Proteome Res. 10 (7), pp. 2913–2921. Akers, N. K., E. E. Schadt, and B. Losic (2018). ‘STAR Chimeric Post For Rapid Detection of Circular RNA and Fusion Transcripts’. Bioinformatics 34 (14). Al Shweiki, M. R., S. Mönchgesang, P. Majovsky, D. Thieme, D. Trutschel, and W. Hoehenwarter (2017). ‘Assessment of Label-Free Quantification in Discovery Proteomics and Impact of Technological Factors and Natural Variability of Protein Abundance’. J. Proteome Res. 16 (4), pp. 1410–1424. Alberts, B., A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter (2002). Molecular Biology of the Cell. Oxford, UK: Garland Science. Alves, P., R. J. Arnold, M. V. Novotny, P. Radivojac, J. P. Reilly, and H. Tang (2007). ‘Advancement in protein inference from shotgun proteomics using peptide detectability’. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, pp. 409–420. Anders, S., P. T. Pyl, and W. Huber (2015). ‘HTSeq–a Python framework to work with high-throughput sequencing data’. Bioinformatics 31 (2), pp. 166–169. Anderson, L. and J. Seilhamer (1997). ‘A comparison of selected mRNA and protein abundances in human liver’. Electrophoresis 18 (3-4), pp. 533–537. Anscombe, F. J. (1973). ‘Graphs in Statistical Analysis’. Am. Stat. 27 (1), pp. 17–21. Apfalter, S., R. Krska, T. Linsinger, A. Oberhauser, W. Kandler, and M. Grasserbauer (1999). ‘Interlaboratory comparison study for the determination of halogenated hydrocarbons in water’. Fresenius J. Anal. Chem. 364 (7), pp. 660–665. 261 references Arike, L., K. Valgepea, L. Peil, R. Nahku, K. Adamberg, and R. Vilu (2012). ‘Comparison and applications of label-free absolute proteome quantification methods on Escherichia coli’. J. Proteom. 75 (17), pp. 5437–5448. Armour, C. D., J. C. Castle, R. Chen, T. Babak, P. Loerch, S. Jackson, J. K. Shah, J. Dey, C. A. Rohl, J. M. Johnson, andC. K. Raymond (2009). ‘Digital transcriptome profiling using selective hexamer priming for cDNA synthesis’. Nat. Methods 6 (9), pp. 647–649. Arnold, J. B. (2019). ggthemes: Extra Themes, Scales and Geoms for ’ggplot2’. R package version 4.1.1. Arvid, S. D. (1997). Chimie analytique; Trad. et révision scientifique de la 7e éd. américaine par C. Buess-Herman, J. Dauchot-Weymeers et F. Dumont. Bruxelles, BE: DeBoeck Université. Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock (2000). ‘Gene ontology: tool for the unification of biology. The Gene Ontology Consortium’. Nat. Genet. 25 (1), pp. 25–29. Asimov, I. (1989). The relativity of wrong. Essays on the Solar System and Beyond. Guernsey, Channels Islands, GB: Oxford University Press. Chap. Beginning with Bone. Asmann, Y. W., B. M. Necela, K. R. Kalari, A. Hossain, T. R. Baker, J. M. Carr, C. Davis, J. E. Getz, G. Hostetter, X. Li, S. A. McLaughlin, D. C. Radisky, G. P. Schroth, H. E. Cunliffe, E. A. Perez, and E. A. Thompson (2012). ‘Detection of redundant fusion transcripts as biomarkers or disease- specific therapeutic targets in breast cancer’. Cancer Res. 72 (8), pp. 1921–1928. Aston, F. W. (1919). ‘A positive ray spectrograph’. Philosophical Magazine 38 (228), pp. 707–714. Audain, E., J. Uszkoreit, T. Sachsenberg, J. Pfeuffer, X. Liang, H. Hermjakob, A. Sanchez, M. Eisenacher, K. Reinert, D. L. Tabb, O. Kohlbacher, and Y. Perez-Riverol (2017). ‘In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics’. J. Proteom. 150, pp. 170–182. Audi, G. and A. H.Wapstra (1993). ‘The 1993 atomic mass evaluation (I) Atomic mass table’.Nuclear Physics A 565 (1), pp. 1–65. Audi, G. and A. H.Wapstra (1995). ‘The 1995 update to the atomic mass evaluation’.Nuclear Physics A 595 (4), pp. 409–480. Auer, P. L. and R. W. Doerge (2010). ‘Statistical design and analysis of RNA sequencing data’. Genetics 185 (2), pp. 405–416. Auguie, B. (2017). gridExtra: Miscellaneous Functions for ”Grid” Graphics. R package version 2.3. Bahcall, O. G. (2015). ‘Human genetics: GTEx pilot quantifies eQTL variation across tissues and individuals’. Nat. Rev. Genet. 16 (7), p. 375. Bantscheff, M., S. Lemeer, M. M. Savitski, and B. Kuster (2012). ‘Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present’. Anal. Bioanal. Chem. 404 (4), pp. 939–965. Barbosa-Morais, N. L., M. Irimia, Q. Pan, H. Y. Xiong, S. Gueroussov, L. J. Lee, V. Slobodeniuc, C. Kutter, S. Watt, R. Colak, T. Kim, C. M. Misquitta-Ali, M. D. Wilson, P. M. Kim, D. T. Odom, B. J. Frey, and B. J. Blencowe (2012). ‘The evolutionary landscape of alternative splicing in vertebrate species.’ Science 338 (6114), pp. 1587–93. Barshad, G., S. Marom, T. Cohen, and D. Mishmar (2018). ‘Mitochondrial DNA Transcription and Its Regulation: An Evolutionary Perspective’. Trends Genet. 34 (9), pp. 682–692. Barzine, M. P., K. Freivalds, J. C. Wright, M. Opmanis, D. Rituma, F. Z. Ghavidel, A. F. Jarnuczak, E. Celms, K. Čerāns, I. Jonassen, L. Lace, J. Antonio Vizcaíno, J. S. Choudhary, A. Brazma, and J. Viksna (2020). ‘Using Deep Learning to Extrapolate Protein Expression Measurements’. Proteomics 20 (21-22), e2000009. Bauer, C., R. Cramer, and J. Schuchhardt (2011). ‘Evaluation of Peak-Picking Algorithms for Protein Mass Spectrometry’. Data Mining in Proteomics: From Standards to Applications. Ed. by M. Hamacher, M. Eisenacher, and C. Stephan. Totowa, NJ, US: Humana Press, pp. 341–352. 262 references Begley, C. G. and L. M. Ellis (2012). ‘Drug development: Raise standards for preclinical cancer research’. Nature 483 (7391), pp. 531–533. Benhaïm, M. (2017). ‘Développements méthodologiques en protéomique quantitatives pour mieux comprendre la biologie évolutive d’espèces non séquencées’. PhD thesis. Université de Strasbourg. Benjamini, Y. and Y. Hochberg (1995). ‘Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing’. J. R. Stat. Soc. B 57 (1), pp. 289–300. Bentley, D. R. et al. (2008). ‘Accurate whole human genome sequencing using reversible terminator chemistry’. Nature 456 (7218), pp. 53–59. Berg, J. M. and L. Stryer (2002). Biochemistry. New York, NY, US: W.H. Freeman. Bergeron, J. J. M. and M. Hallett (2007). ‘Peptides you can count on’. Nat. Biotechnol. 25 (1), pp. 61– 62. Bern, M., Y. J. Kil, and C. Becker (2012). ‘Byonic: advanced peptide and protein identification software’. Curr. Protoc.Bioinform. Chapter 13, Unit13.20. Bernhardt, O., N. Selevsek, L. Gillet, O. Rinner, P. Picotti, R. Aebersold, and L. Reiter (2014). Spectronaut: a fast and efficient algorithm for MRM-like processing of data independent acquisition (SWATH-MS) data. Biemann, K. (1988). ‘Contributions of mass spectrometry to peptide and protein structure’. Biomed. Environ. Mass Spectrom. 16 (1-12), pp. 99–111. Bilan, V., M. Leutert, P. Nanni, C. Panse, and M. O. Hottiger (2017). ‘Combining Higher-Energy Collision Dissociation and Electron-Transfer/Higher-Energy Collision Dissociation Fragmentation in a Product-Dependent Manner Confidently Assigns Proteomewide ADP-Ribose Acceptor Sites’. Anal. Chem. 89 (3), pp. 1523–1530. Bittremieux, W., P. Meysman, W. S. Noble, and K. Laukens (2018). ‘Fast Open Modification Spectral Library Searching through Approximate Nearest Neighbor Indexing’. J. Proteome Res. 17 (10), pp. 3463–3474. Blanco, L., J. A. Mead, and C. Bessant (2009). ‘Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG2006 standard MS/MS data sets’. J. Proteome Res. 8 (4), pp. 1782–1791. Blein-Nicolas, M. and M. Zivy (2016). ‘Thousand and one ways to quantify and compare protein abundances in label-free bottom-up proteomics’. BBA 1864 (8), pp. 883–895. Bodzon-Kulakowska, A., A. Bierczynska-Krzysik, T. Dylag, A. Drabik, P. Suder, M. Noga, J. Jarzebinska, and J. Silberring (2007). ‘Methods for samples preparation in proteomic research’. J. Chromatogr. B 849 (1-2), pp. 1–31. Boguszewska, K., M. Szewczuk, J. Kaźmierczak-Barańska, and B. T. Karwowski (2020). ‘The Similarities between Human Mitochondria and Bacteria in the Context of Structure, Genome, and Base Excision Repair System’. Molecules 25 (12). Bonaldo, M. F., G. Lennon, andM. B. Soares (1996). ‘Normalization and subtraction: two approaches to facilitate gene discovery’. Genome Res. 6 (9), pp. 791–806. Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992). ‘A Training Algorithm for Optimal Margin Classifiers’. Proceedings of the Fifth Annual Workshop on Computational Learning Theory. COLT ’92. Pittsburgh, Pennsylvania, USA: ACM, pp. 144–152. Boyle, E. I., S. Weng, J. Gollub, H. Jin, D. Botstein, J. M. Cherry, and G. Sherlock (2004). ‘GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes’. Bioinformatics 20 (18), pp. 3710–3715. Braisted, J. C., S. Kuntumalla, C. Vogel, E. M. Marcotte, A. R. Rodrigues, R. Wang, S.-T. Huang, E. S. Ferlanti, A. I. Saeed, R. D. Fleischmann, S. N. Peterson, and R. Pieper (2008). ‘The APEX 263 references Quantitative Proteomics Tool: generating protein quantitation estimates from LC-MS/MS proteomics results’. BMC Bioinf. 9, p. 529. Bratic, A., P. Clemente, J. Calvo-Garrido, C. Maffezzini, A. Felser, R. Wibom, A. Wedell, C. Freyer, and A. Wredenberg (2016). ‘Mitochondrial Polyadenylation Is a One-Step Process Required for mRNA Integrity and tRNA Maturation’. PLOS Genet. 12 (5), e1006028. Brawand, D., M. Soumillon, A. Necsulea, P. Julien, G. Csárdi, P. Harrigan, M. Weier, A. Liechti, A. Aximu-Petri, M. Kircher, F. W. Albert, U. Zeller, P. Khaitovich, F. Grützner, S. Bergmann, R. Nielsen, S. Pääbo, and H. Kaessmann (2011). ‘The evolution of gene expression levels in mammalian organs’. Nature 478 (7369), pp. 343–348. Brosch, M., L. Yu, T. Hubbard, and J. Choudhary (2009). ‘Accurate and sensitive peptide identification with Mascot Percolator’. J. Proteome Res. 8 (6), pp. 3176–3181. Brosh,M. (2009). ‘Development of computational methods for analysing proteomic data for genome annotation’. PhD thesis. University of Cambridge. Brown, S. D., L. A. Raeburn, and R. A. Holt (2015). ‘Profiling tissue-resident T cell repertoires by RNA sequencing’. Genome Med. 7, p. 125. Bruce, C., K. Stone, E. Gulcicek, and K. Williams (2013). ‘Proteomics and the analysis of proteomic data: 2013 overview of current protein-profiling technologies’. Curr. Protoc. Bioinformatics S41 (13.21), pp. 1–17. Bubis, J. A., L. I. Levitsky, M. V. Ivanov, I. A. Tarasova, and M. V. Gorshkov (2017). ‘Comparative evaluation of label-free quantification methods for shotgun proteomics’. Rapid Commun. Mass Spectrom. 31 (7), pp. 606–612. Bulyk, M. L. (2007). ‘Protein binding microarrays for the characterization of DNA-protein interactions’. Adv. Biochem. Eng. Biotechnol. 104, pp. 65–85. Bulyk, M. L., P. L. F. Johnson, and G. M. Church (2002). ‘Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors’. Nucleic Acids Res. 30 (5), pp. 1255–1261. Bumgarner, R. (2013). ‘Overview of DNA microarrays: types, applications, and their future’. Curr. Protoc. Mol. Biol. Chapter 22, Unit 22.1. Bussotti, G., T. Leonardi, M. B. Clark, T. R. Mercer, J. Crawford, L. Malquori, C. Notredame, M. E. Dinger, J. S. Mattick, and A. J. Enright (2016). ‘Improved definition of the mouse transcriptome via targeted RNA sequencing’. Genome Res. 26 (5), pp. 705–716. Bythell, B. J., P. Maître, and B. Paizs (2010). ‘Cyclization and rearrangement reactions of a(n) fragment ions of protonated peptides’. J. Am. Chem. Soc 132 (42), pp. 14766–14779. Callen, J.-C. (2005). Biologie cellulaire — Des molécules aux organismes (2eme edition). Paris, FR: Dunod. Cantalupo, P. G., J. P. Katz, and J. M. Pipas (2015). ‘HeLa nucleic acid contamination in the cancer genome atlas leads to the misidentification of human papillomavirus 18’. J. Virol. 89 (8), pp. 4051– 4057. Canterbury, J. D., G. E. Merrihew, M. J. MacCoss, D. R. Goodlett, and S. A. Shaffer (2014). ‘Comparison of data acquisition strategies on quadrupole ion trap instrumentation for shotgun proteomics’. J. Am. Soc. Mass Spectrom. 25 (12), pp. 2048–2059. Cappadona, S., P. R. Baker, P. R. Cutillas, A. J. R. Heck, and B. van Breukelen (2012). ‘Current challenges in software solutions for mass spectrometry-based quantitative proteomics’. Amino Acids 43 (3), pp. 1087–1108. Carlson, M. (2019). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.8.2. Castle, J. C., C. D. Armour, M. Löwer, D. Haynor, M. Biery, H. Bouzek, R. Chen, S. Jackson, J. M. Johnson, C. A. Rohl, and C. K. Raymond (2010). ‘Digital Genome-Wide ncRNA Expression, Including SnoRNAs, across 11 Human Tissues Using PolyA-Neutral Amplification’. PLOS ONE 5 (7), pp. 1–9. 264 references Catherman, A. D., O. S. Skinner, and N. L. Kelleher (2014). ‘Top Down proteomics: facts and perspectives’. BBRC 445 (4), pp. 683–693. Cavalli, F. M. G., R. Bourgon, W. Huber, J. M. Vaquerizas, and N. M. Luscombe (2011). ‘SpeCond: a method to detect condition-specific gene expression’. Genome Biol. 12 (10), R101. Chambers, M. C., B. Maclean, R. Burke, D. Amodei, D. L. Ruderman, S. Neumann, L. Gatto, B. Fischer, B. Pratt, J. Egertson, K. Hoff, D. Kessner, N. Tasman, N. Shulman, B. Frewen, T. A. Baker, M.-Y. Brusniak, C. Paulse, D. Creasy, L. Flashner, K. Kani, C. Moulding, S. L. Seymour, L. M. Nuwaysir, B. Lefebvre, F. Kuhlmann, J. Roark, P. Rainer, S. Detlev, T. Hemenway, A. Huhmer, J. Langridge, B. Connolly, T. Chadick, K. Holly, J. Eckels, E. W. Deutsch, R. L. Moritz, J. E. Katz, D. B. Agus, M. MacCoss, D. L. Tabb, and P. Mallick (2012). ‘A cross-platform toolkit for mass spectrometry and proteomics’. Nat. Biotechnol. 30 (10), pp. 918–920. Chang, W. (2014). extrafont: Tools for using fonts. R package version 0.17. Chang, W., J. Cheng, J. Allaire, Y. Xie, and J. McPherson (2019). shiny: Web Application Framework for R. R package version 1.3.2. Chapman, J. D., D. R. Goodlett, and C. D. Masselon (2014). ‘Multiplexed and data-independent tandem mass spectrometry for global proteome profiling’. Mass Spectrom. Rev. 33 (6), pp. 452– 470. Chen, C., J. Hou, J. J. Tanner, and J. Cheng (2020). ‘Bioinformatics Methods for Mass Spectrometry- Based Proteomics Data Analysis’. Int. J. Mol. Sci. 21 (8). Chen, G., B. Ning, and T. Shi (2019). ‘Single-Cell RNA-Seq Technologies and Related Computational Data Analysis’. Front. Genet. 10, p. 317. Chen, G., T. G. Gharib, C.-C. Huang, J. M. G. Taylor, D. E. Misek, S. L. R. Kardia, T. J. Giordano, M. D. Iannettoni, M. B. Orringer, S. M. Hanash, and D. G. Beer (2002). ‘Discordant protein and mRNA expression in lung adenocarcinomas’. Mol. Cell. Proteom. 1 (4), pp. 304–313. Chen, H. (2018). VennDiagram: Generate High-Resolution Venn and Euler Plots. R package version 1.6.20. Chen, J., K. Sathiyamoorthy, X. Zhang, S. Schaller, B. E. Perez White, T. S. Jardetzky, and R. Longnecker (2018). ‘Ephrin receptor A2 is a functional entry receptor for Epstein-Barr virus’. Nat. Microbiol. 3 (2), pp. 172–180. Chen, X., S. Wei, Y. Ji, X. Guo, and F. Yang (2015). ‘Quantitative proteomics using SILAC: Principles, applications, and developments’. Proteomics 15 (18), pp. 3175–3192. Chen, Y., F. Wang, F. Xu, and T. Yang (2016). ‘Mass Spectrometry-Based Protein Quantification’. Modern Proteomics – Sample Preparation, Analysis and Practical Applications. Adv. Exp. Med. Biol. Cham, CH: Springer, Cham, pp. 255–279. Cheng, J., P. Kapranov, J. Drenkow, S. Dike, S. Brubaker, S. Patel, J. Long, D. Stern, H. Tammana, G. Helt, V. Sementchenko, A. Piccolboni, S. Bekiranov, D. K. Bailey, M. Ganesh, S. Ghosh, I. Bell, D. S. Gerhard, and T. R. Gingeras (2005). ‘Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution’. Science 308 (5725), pp. 1149–1154. Chi, H., C. Liu, H. Yang,W.-F. Zeng, L. Wu,W.-J. Zhou, R.-M.Wang, X.-N. Niu, Y.-H. Ding, Y. Zhang, Z.-W. Wang, Z.-L. Chen, R.-X. Sun, T. Liu, G.-M. Tan, M.-Q. Dong, P. Xu, P.-H. Zhang, and S.-M. He (2018). ‘Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine’. Nat. Biotechnol. Choi, H., D. Ghosh, and A. I. Nesvizhskii (2008). ‘Statistical validation of peptide identifications in large-scale proteomics using the target-decoy database search strategy and flexible mixture modeling’. J. Proteome Res. 7 (1), pp. 286–292. Chrominski, K. and M. Tkacz (2015). ‘Comparison of High-Level Microarray Analysis Methods in the Context of Result Consistency’. PLOS ONE 10 (6), e0128845. Clark, E. L., S. J. Bush, M. E. B. McCulloch, I. L. Farquhar, R. Young, L. Lefevre, C. Pridans, H. Tsang, C. Wu, C. Afrasiabi, M. Watson, C. B. Whitelaw, T. C. Freeman, K. M. Summers, A. L. Archibald, 265 references and D. A. Hume (2017). ‘A high resolution atlas of gene expression in the domestic sheep (Ovis aries)’. PLOS Genet. 13 (9), e1006997. Cock, P. J. A., C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice (2010). ‘The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants’. Nucleic Acids Res. 38 (6), pp. 1767–1771. Codrea, M. C. and S. Nahnsen (2016). ‘Platforms and Pipelines for Proteomics Data Analysis and Management’.Modern Proteomics – Sample Preparation, Analysis and Practical Applications. Adv. Exp. Med. Biol. Cham, CH: Springer, Cham, pp. 203–215. Coiera, E., E. Ammenwerth, A. Georgiou, and F. Magrabi (2018). ‘Does health informatics have a replication crisis?’ JAMIA 25 (8), pp. 963–968. Conesa, A., P. Madrigal, S. Tarazona, D. Gomez-Cabrero, A. Cervera, A. McPherson, M. W. Szcześniak, D. J. Gaffney, L. L. Elo, X. Zhang, and A. Mortazavi (2016). ‘A survey of best practices for RNA-seq data analysis’. Genome Biol. 17, p. 13. Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2009). Introduction to Algorithms. Cambridge, MA, US: MIT Press. Corpas, M., R. Jimenez, S. J. Carbon, A. García, L. Garcia, T. Goldberg, J. Gomez, A. Kalderimis, S. E. Lewis, I. Mulvany, A. Pawlik, F. Rowland, G. Salazar, F. Schreiber, I. Sillitoe, W. H. Spooner, A. S. Thanki, J. M. Villaveces, G. Yachdav, and H. Hermjakob (2014). ‘BioJS: an open source standard for biological visualisation - its status in 2014’. F1000Research 3, p. 55. Cottrell, J. (2013). Does protein FDR have any meaning? http://www.matrixscience.com/blog/does- protein-fdr-have-any-meaning.html. Accessed: 2018-10-30. Cottrell, J. S. (2011). ‘Protein identification using MS/MS data’. J. Proteom. 74 (10), pp. 1842–1851. Cox, J. and M. Mann (2008). ‘MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification’. Nat. Biotechnol. 26 (12), pp. 1367–1372. Cox, J. and M. Mann (2011). ‘Quantitative, High-Resolution Proteomics for Data-Driven Systems Biology’. Annu. Rev. Biochem. 80 (1), pp. 273–299. Cox, J., N. Neuhauser, A. Michalski, R. A. Scheltema, J. V. Olsen, and M. Mann (2011). ‘Andromeda: a peptide search engine integrated into the MaxQuant environment’. J. Proteome Res. 10 (4), pp. 1794–1805. Cozzolino, F., A. Landolfi, I. Iacobucci, V. Monaco, M. Caterino, S. Celentano, C. Zuccato, E. Cattaneo, and M. Monti (2020). ‘New label-free methods for protein relative quantification applied to the investigation of an animal model of Huntington Disease’. PLOS ONE 15 (9), e0238037. Cresko Lab (2017). RNA-seqlopedia. url: http://rnaseq.uoregon.edu. Crick, F. (1958). ‘On protein synthesis’. Symposia of the Society for Experimental Biology 12, pp. 138– 163. Crick, F. (1970). ‘Central Dogma of Molecular Biology’. Nature 227 (5258), pp. 561–563. Cuperlovic-Culf, M., N. Belacel, A. S. Culf, and R. J. Ouellette (2006). ‘Microarray analysis of alternative splicing’. OMICS 10 (3), pp. 344–357. Danielsson, F., T. James, D. Gomez-Cabrero, and M. Huss (2015). ‘Assessing the consistency of public human tissue RNA-seq data sets’. Briefings Bioinf. 16 (6), pp. 941–949. Dapas, M., M. Kandpal, Y. Bi, and R. V. Davuluri (2017). ‘Comparative evaluation of isoform-level gene expression estimation algorithms for RNA-seq and exon-array platforms’. Briefings Bioinf. 18 (2), pp. 260–269. Dar, R. D., B. S. Razooky, L. S. Weinberger, C. D. Cox, and M. L. Simpson (2015). ‘The Low Noise Limit in Gene Expression’. PLOS ONE 10 (10), e0140969. Darnell Jr, J. E. (2013). ‘Reflections on the history of pre-mRNA processing and highlights of current knowledge: a unified picture’. RNA 19 (4), pp. 443–460. 266 references Dasari, S., M. C. Chambers, M. A. Martinez, K. L. Carpenter, A.-J. L. Ham, L. J. Vega-Montoto, and D. L. Tabb (2012). ‘Pepitome: evaluating improved spectral library search for identification complementarity and quality assessment’. J. Proteome Res. 11 (3), pp. 1686–1695. Davidson, V. L. and D. B. Sittman (1999). Biochemistry. Philadelphia, PA ,US: Lippincott Williams & Wilkins. Davies, L. and U. Gather (1993). ‘The Identification of Multiple Outliers’. J. Am. Stat. Assoc. 88 (423), pp. 782–792. De Simone,M., A. Arrigoni, G. Rossetti, P. Gruarin, V. Ranzani, C. Politano, R. J. P. Bonnal, E. Provasi, M. L. Sarnicola, I. Panzeri, M. Moro, M. Crosti, S. Mazzara, V. Vaira, S. Bosari, A. Palleschi, L. Santambrogio, G. Bovo, N. Zucchini, M. Totis, L. Gianotti, G. Cesana, R. A. Perego, N. Maroni, A. Pisani Ceretti, E. Opocher, R. De Francesco, J. Geginat, H. G. Stunnenberg, S. Abrignani, and M. Pagani (2016). ‘Transcriptional Landscape of Human Tissue Lymphocytes Unveils Uniqueness of Tumor-Infiltrating T Regulatory Cells’. Immunity 45 (5), pp. 1135–1147. De Siqueira Santos, S., D. Y. Takahashi, A. Nakata, and A. Fujita (2014). ‘A comparative study of statistical methods used to identify dependencies between gene expression signals’. Briefings Bioinf. 15 (6), pp. 906–918. Delacre, M., D. Lakens, and C. Leys (2017). Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test (in press for the International Review of Social Psychology). Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). ‘Maximum Likelihood from Incomplete Data via the EM Algorithm’. J. R. Stat. Soc. B 39 (1), pp. 1–38. Derrick, B., D. Toher, and P. White (2016). ‘Why Welch’s test is Type I error robust’. TQMP 12.1, pp. 30–38. Derrien, T., R. Johnson, G. Bussotti, A. Tanzer, S. Djebali, H. Tilgner, G. Guernec, D. Martin, A. Merkel, D. G. Knowles, J. Lagarde, L. Veeravalli, X. Ruan, Y. Ruan, T. Lassmann, P. Carninci, J. B. Brown, L. Lipovich, J. M. Gonzalez, M. Thomas, C. A. Davis, R. Shiekhattar, T. R. Gingeras, T. J. Hubbard, C. Notredame, J. Harrow, and R. Guigó (2012). ‘The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression’. Genome Res. 22 (9), pp. 1775–1789. Desiere, F., E. W. Deutsch, N. L. King, A. I. Nesvizhskii, P. Mallick, J. Eng, S. Chen, J. Eddes, S. N. Loevenich, and R. Aebersold (2006). ‘The PeptideAtlas project’. Nucleic Acids Res. 34 (Suppl. 1), p. D655. Deutsch, E. W., Z. Sun, D. Campbell, U. Kusebauch, C. S. Chu, L. Mendoza, D. Shteynberg, G. S. Omenn, and R. L. Moritz (2015). ‘State of the Human Proteome in 2014/2015 As Viewed through PeptideAtlas: Enhancing Accuracy and Coverage through the AtlasProphet’. J. Proteome Res. 14 (9), pp. 3461–3473. Diedrich, J. K., A. F. M. Pinto, and J. R. Yates 3rd (2013). ‘Energy dependence of HCD on peptide fragmentation: stepped collisional energy finds the sweet spot’. J. Am. Soc. Mass Spectrom. 24 (11), pp. 1690–1699. Dillies, M.-A., A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime, G. Marot, D. Castel, J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloë, C. Le Gall, B. Schaëffer, S. Le Crom, M. Guedj, F. Jaffrézic, and French StatOmique Consortium (2013). ‘A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis’. Briefings Bioinf. 14 (6), pp. 671–683. Do, C. B. and S. Batzoglou (2008). ‘What is the expectationmaximization algorithm?’Nat. Biotechnol. 26 (8), pp. 897–899. Dobin, A., C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras (2013). ‘STAR: ultrafast universal RNA-seq aligner’. Bioinformatics 29 (1), pp. 15–21. Domon, B. and R. Aebersold (2006). ‘Mass spectrometry and protein analysis’. Science 312 (5771), pp. 212–217. 267 references Domon, B. and R. Aebersold (2010). ‘Options and considerations when selecting a quantitative proteomics strategy’. Nat. Biotechnol. 28 (7), pp. 710–721. Dorfer, V., P. Pichler, T. Stranzl, J. Stadlmann, T. Taus, S. Winkler, and K. Mechtler (2014). ‘MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra’. J. Proteome Res. 13 (8), pp. 3679–3684. Dowle, M. and A. Srinivasan (2019). data.table: Extension of ‘data.frame‘. R package version 1.12.2. Doyle, A. C. (1892). ‘The Adventure of the Copper Beechees’. The Strand Magazine. Duarte, J. G. and J. M. Blackburn (2017). ‘Advances in the development of human protein microarrays’. Expert Rev. Proteomics 14 (7), pp. 627–641. Dupree, E. J., M. Jayathirtha, H. Yorkey, M. Mihasan, B. A. Petre, and C. C. Darie (2020). ‘A Critical Review of Bottom-Up Proteomics: The Good, the Bad, and the Future of this Field’. Proteomes 8 (3). Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. DeMoor, A. Brazma, andW.Huber (2005). ‘BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis’. Bioinformatics 21 (16), pp. 3439–3440. Efron, B., R. Tibshirani, J. D. Storey, and V. Tusher (2001). ‘Empirical Bayes Analysis of a Microarray Experiment’. J. Am. Stat. Assoc. 96 (456), pp. 1151–1160. Egger, M., G. D. Smith, and A. N. Phillips (1997). ‘Meta-analysis: principles and procedures’. BMJ 315 (7121), pp. 1533–1537. Elias, J. E. and S. P. Gygi (2007). ‘Target-decoy search strategy for increased confidence in large- scale protein identifications by mass spectrometry’. Nat. Methods 4 (3), pp. 207–214. Elias, J. E. and S. P. Gygi (2010). ‘Target-decoy search strategy for mass spectrometry-based proteomics’. Methods Mol. Biol. 604, pp. 55–71. Eng, J. K., A. L. McCormack, and J. R. Yates (1994). ‘An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database’. J. Am. Soc. Mass Spectrom. 5 (11), pp. 976–989. Eng, J. K., B. C. Searle, K. R. Clauser, and D. L. Tabb (2011). ‘A face in the crowd: recognizing peptides through database search’. Mol. Cell. Proteom. 10 (11), R111.009522. Engström, P. G., T. Steijger, B. Sipos, G. R. Grant, A. Kahles, G. Rätsch, N. Goldman, T. J. Hubbard, J. Harrow, R. Guigó, P. Bertone, and RGASP Consortium (2013). ‘Systematic evaluation of spliced alignment programs for RNA-seq data’. Nat. Methods 10 (12), pp. 1185–1191. Ensembl Blog (2011). Human BodyMap 2.0 data from Illumina. url: http://www.ensembl.info/blog/ 2011/05/24/human-bodymap-2-0-data-from-illumina/ (visited on 03/11/2013). Eraslan, B., D. Wang, M. Gusic, H. Prokisch, B. M. Hallström, M. Uhlén, A. Asplund, F. Pontén, T. Wieland, T. Hopf, H. Hahne, B. Kuster, and J. Gagneur (2019). ‘Quantification and discovery of sequence determinants of protein-per-mRNA amount in 29 human tissues’.Mol. Syst. Biol. 15 (2), e8513. Eriksson, J. and D. Fenyö (2007). ‘Improving the success rate of proteome analysis by modeling protein-abundance distributions and experimental designs’. Nat. Biotechnol. 25 (6), pp. 651–655. Esteve-Codina, A., O. Arpi, M. Martinez-García, E. Pineda, M. Mallo, M. Gut, C. Carrato, A. Rovira, R. Lopez, A. Tortosa, M. Dabad, S. Del Barco, S. Heath, S. Bagué, T. Ribalta, F. Alameda, N. de la Iglesia, C. Balaña, and GLIOCAT Group (2017). ‘A Comparison of RNA-Seq Results from Paired Formalin-Fixed Paraffin-Embedded and Fresh-Frozen Glioblastoma Tissue Samples’. PLOS ONE 12 (1), e0170632. Everaert, C., M. Luypaert, J. L. V. Maag, Q. X. Cheng, M. E. Dinger, J. Hellemans, and P. Mestdagh (2017). ‘Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT- qPCR expression data’. Sci. Rep. 7, p. 1559. Ezkurdia, I., J. Vázquez, A. Valencia, and M. Tress (2014). ‘Analyzing the first drafts of the human proteome’. J. Proteome Res. 13 (8), pp. 3854–3855. 268 references Fagerberg, L., B. M. Hallström, P. Oksvold, C. Kampf, D. Djureinovic, J. Odeberg, M. Habuka, S. Tahmasebpoor, A. Danielsson, K. Edlund, A. Asplund, E. Sjöstedt, E. Lundberg, C. A.-K. Szigyarto, M. Skogs, J. O. Takanen, H. Berling, H. Tegel, J. Mulder, P. Nilsson, J. M. Schwenk, C. Lindskog, F. Danielsson, A. Mardinoglu, A. Sivertsson, K. von Feilitzen, M. Forsberg, M. Zwahlen, I. Olsson, S. Navani, M. Huss, J. Nielsen, F. Ponten, and M. Uhlén (2014). ‘Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics’. Mol. Cell. Proteom. 13 (2), pp. 397–406. Fagerland, M. W. (2012). ‘t-tests, non-parametric tests, and large studies–a paradox of statistical practice?’ BMC Med. Res. Methodol. 12, p. 78. FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al. (2014). ‘A promoter-level mammalian expression atlas’. Nature 507 (7493), pp. 462–470. Al-Faresi, R. A. Z., R. N. Lightowlers, and Z. M. A. Chrzanowska-Lightowlers (2019). ‘Mammalian mitochondrial translation - revealing consequences of divergent evolution’. Biochem. Soc. Trans. 47 (5), pp. 1429–1436. Fatovich, D. M. and M. Phillips (2017). ‘The probability of probability and research truths’. EMA 29 (2), pp. 242–244. Feist, P. and A. B. Hummon (2015). ‘Proteomic challenges: sample preparation techniques for microgram-quantity protein analysis from biological samples’. Int. J. Mol. Sci. 16 (2), pp. 3537–3563. Felsenstein, J. and J. Felenstein (2004). Inferring phylogenies. Vol. 2. Sunderland, MA, US: Sinauer associates. Feng, J., D. Q. Naiman, and B. Cooper (2007). ‘Probability model for assessing proteins assembled from peptide sequences inferred from tandem mass spectrometry data’. Anal. Chem. 79 (10), pp. 3901–3911. Field, A., J. Miles, and Z. Field (2012). Discovering Statistics Using R. London, UK: SAGE Publications. Fischer, B., V. Roth, F. Roos, J. Grossmann, S. Baginsky, P. Widmayer, W. Gruissem, and J. M. Buhmann (2005). ‘NovoHMM: a hidden Markov model for de novo peptide sequencing’. Anal. Chem. 77 (22), pp. 7265–7273. Florea, L., L. Song, and S. L. Salzberg (2013). ‘Thousands of exon skipping events differentiate among splicing patterns in sixteen human tissues’. F1000Research 2. Fonseca, N. A., R. Petryszak, J. Marioni, and A. Brazma (2014). ‘iRAP - an integrated RNA-seq Analysis Pipeline’. bioRxiv (005991). Fonseca, N. A., J. Marioni, and A. Brazma (2014). ‘RNA-Seq gene profiling–a systematic empirical comparison’. PLOS ONE 9 (9), e107026. Frank, A. M. (2009). ‘Predicting intensity ranks of peptide fragment ions’. J. Proteome Res. 8 (5), pp. 2226–2240. Frankish, A., B. Uszczynska, G. R. S. Ritchie, J. M. Gonzalez, D. Pervouchine, R. Petryszak, J. M. Mudge, N. Fonseca, A. Brazma, R. Guigo, and J. Harrow (2015). ‘Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction’. BMC Genom. 16 Suppl 8, S2. Franks, A., E. Airoldi, and N. Slavov (2017). ‘Post-transcriptional regulation across human tissues’. PLOS Comput. Biol. 13 (5), e1005535. Freeman, T. C., A. Ivens, J. K. Baillie, D. Beraldi, M.W. Barnett, D. Dorward, A. Downing, L. Fairbairn, R. Kapetanovic, S. Raza, A. Tomoiu, R. Alberio, C. Wu, A. I. Su, K. M. Summers, C. K. Tuggle, A. L. Archibald, and D. A. Hume (2012). ‘A gene expression atlas of the domestic pig’. BMC Biol. 10, p. 90. Freiberg, J. A., Y. Le Breton, B. Q. Tran, A. J. Scott, J. M. Harro, R. K. Ernst, Y. A. Goo, E. F. Mongodin, D. R. Goodlett, K. S. McIver, and M. E. Shirtliff (2016). ‘Global Analysis and Comparison of the Transcriptomes and Proteomes of Group AStreptococcusBiofilms’. mSystems 1 (6). 269 references Frewen, B. and M. J. MacCoss (2007). ‘Using BiblioSpec for creating and searching tandem MS peptide libraries’. Curr. Protoc. Bioinformatics Chapter 13, Unit 13.7. Fusaro, V. A., D. R. Mani, J. P. Mesirov, and S. A. Carr (2009). ‘Prediction of high-responding peptides for targeted protein assays by mass spectrometry’. Nat. Biotechnol. 27 (2), pp. 190–198. Gagniuc, P. A. (2017). Markov Chains: From Theory to Implementation and Experimentation. Hoboken, NJ, US: Wiley. Gagnon-Bartsch, J. A. and T. P. Speed (2012). ‘Using control genes to correct for unwanted variation in microarray data’. Biostatistics 13 (3), pp. 539–552. Gallien, S., E. Duriez, C. Crone, M. Kellmann, T. Moehring, and B. Domon (2012). ‘Targeted proteomic quantification on quadrupole-orbitrap mass spectrometer’. Mol. Cell. Proteom. 11 (12), pp. 1709–1723. Garalde, D. R., E. A. Snell, D. Jachimowicz, A. J. Heron, M. Bruce, J. Lloyd, A. Warland, N. Pantic, T. Admassu, J. Ciccone, S. Serra, J. Keenan, S. Martin, L. McNeill, J. Wallace, L. Jayasinghe, C. Wright, J. Blasco, B. Sipos, S. Young, S. Juul, J. Clarke, and D. J. Turner (2016). ‘Highly parallel direct RNA sequencing on an array of nanopores’. bioRxiv (068809). Gardner, M. L. and M. A. Freitas (2020). ‘Multiple Imputation Approaches Applied to the Missing Value Problem in Bottom-up Proteomics’. Gerrard, D. T., A. A. Berry, R. E. Jennings, K. Piper Hanley, N. Bobola, and N. A. Hanley (2016). ‘An integrative transcriptomic atlas of organogenesis in human embryos’. eLife 5. Gerster, S., E. Qeli, C. H. Ahrens, and P. Bühlmann (2010). ‘Protein and gene model inference based on statistical modeling in k-partite graphs’. PNAS 107 (27), pp. 12101–12106. Ghazalpour, A., B. Bennett, V. A. Petyuk, L. Orozco, R. Hagopian, I. N. Mungrue, C. R. Farber, J. Sinsheimer, H. M. Kang, N. Furlotte, C. C. Park, P.-Z. Wen, H. Brewer, K. Weitz, D. G. Camp II, C. Pan, R. Yordanova, I. Neuhaus, C. Tilford, N. Siemers, P. Gargalovic, E. Eskin, T. Kirchgessner, D. J. Smith, R. D. Smith, and A. J. Lusis (2011). ‘Comparative Analysis of Proteome and Transcriptome Variation in Mouse’. PLOS Genet. 7 (6), e1001393. Giansanti, P., L. Tsiatsiani, T. Y. Low, and A. J. R. Heck (2016). ‘Six alternative proteases for mass spectrometry-based proteomics beyond trypsin’. Nat. Protoc. 11 (5), pp. 993–1006. Gibson, G. (2015). ‘Human genetics. GTEx detects genetic effects’. Science 348 (6235), pp. 640–641. Gillet, L. C., P. Navarro, S. Tate, H. Röst, N. Selevsek, L. Reiter, R. Bonner, and R. Aebersold (2012). ‘Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis’. Mol. Cell. Proteom. 11 (6), O111.016717. Glenn Begley, C. and J. P. A. Ioannidis (2015). ‘Reproducibility in Science’. Circ. Res. 116 (1), pp. 116– 126. Gonnelli, G., M. Stock, J. Verwaeren, D. Maddelein, B. De Baets, L. Martens, and S. Degroeve (2015). ‘A decoy-free approach to the identification of peptides’. J. Proteome Res. 14 (4), pp. 1792–1798. Gonzàlez-Porta, M. (2014). ‘RNA sequencing for the study of splicing’. PhD thesis. University of Cambridge. Gonzàlez-Porta, M., A. Frankish, J. Rung, J. Harrow, and A. Brazma (2013). ‘Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene’. Genome Biol. 14 (7), R70. Goodman, S. N., D. Fanelli, and J. P. A. Ioannidis (2016). ‘What does research reproducibility mean?’ Sci. Transl. Med. 8 (341), 341ps12. Goodwin, S., J. D. McPherson, and W. R. McCombie (2016). ‘Coming of age: ten years of next- generation sequencing technologies’. Nat. Rev. Genet. 17 (6), pp. 333–351. Granholm, V., S. Kim, J. C. F. Navarro, E. Sjölund, R. D. Smith, and L. Käll (2014). ‘Fast and accurate database searches with MS-GF+Percolator’. J. Proteome Res. 13 (2), pp. 890–897. 270 references Gremel, G., A. Wanders, J. Cedernaes, L. Fagerberg, B. Hallström, K. Edlund, E. Sjöstedt, M. Uhlén, and F. Pontén (2015). ‘The human gastrointestinal tract-specific transcriptome and proteome as defined by RNA sequencing and antibody-based profiling’. J. Gastroenterol. 50 (1), pp. 46–57. Griss, J. (2016). ‘Spectral library searching in proteomics’. Proteomics 16 (5), pp. 729–740. Gry, M., R. Rimini, S. Strömberg, A. Asplund, F. Pontén, M. Uhlén, and P. Nilsson (2009). ‘Correlations between RNA and protein expression profiles in 23 human cell lines’. BMC Genom. 10, p. 365. GTEx Consortium (2013). ‘The Genotype-Tissue Expression (GTEx) project’. Nat. Genet. 45 (6), pp. 580–585. GTEx Consortium (2015). ‘Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans’. Science 348 (6235), pp. 648–660. Guillaumot, N. (2017). ‘Nouvelles applications et opportunités en protéomique’. PhD thesis. Université de Strasbourg. Guinand, B., A. Topchy, K. S. Page, M. K. Burnham-Curtis, W. F. Punch, and K. T. Scribner (2002). ‘Comparisons of likelihood and machine learning methods of individual classification’. J. Hered. 93 (4), pp. 260–269. Gunderson, K. L., F. J. Steemers, H. Ren, P. Ng, L. Zhou, C. Tsan, W. Chang, D. Bullis, J. Musmacker, C. King, L. L. Lebruska, D. Barker, A. Oliphant, K. M. Kuhn, and R. Shen (2006). ‘Whole-genome genotyping’. Meth. Enzymol. 410, pp. 359–376. Guo, Y., Y. Dai, H. Yu, S. Zhao, D. C. Samuels, and Y. Shyr (2017). ‘Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis’. Genomics 109 (2), pp. 83–90. Guthals, A., K. R. Clauser, A. M. Frank, and N. Bandeira (2013). ‘Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides’. J. Proteome Res. 12 (6), pp. 2846– 2857. Gutstein, H. B., J. S. Morris, S. P. Annangudi, and J. V. Sweedler (2008). ‘Microproteomics: analysis of protein diversity in small samples’. Mass Spectrom. Rev. 27 (4), pp. 316–330. Gygi, S. P., B. Rist, S. A. Gerber, F. Turecek, M. H. Gelb, and R. Aebersold (1999). ‘Quantitative analysis of complex protein mixtures using isotope-coded affinity tags’. Nat. Biotechnol. 17 (10), pp. 994–999. Gygi, S. P., Y. Rochon, B. R. Franza, and R. Aebersold (1999). ‘Correlation between protein and mRNA abundance in yeast’. Mol. Cell. Biol. 19 (3), pp. 1720–1730. Haag, A. M. (2016). ‘Mass Analyzers and Mass Spectrometers’. Modern Proteomics – Sample Preparation, Analysis and Practical Applications. Adv. Exp. Med. Biol. Cham, CH: Springer, Cham, pp. 157–169. Hajnsdorf, E. and V. R. Kaberdin (2018). ‘RNA polyadenylation and its consequences in prokaryotes’. Philos. Trans. R. Soc. Lond., B, Biol. Sci. 373 (1762). Hall, D. A., J. Ptacek, and M. Snyder (2007). ‘Protein microarray technology’.Mech. Ageing Dev. 128 (1), pp. 161–167. Halloran, J. T., H. Zhang, K. Kara, C. Renggli, M. The, C. Zhang, D. M. Rocke, L. Käll, and W. S. Noble (2019). ‘Speeding Up Percolator’. J. Proteome Res. 18 (9), pp. 3353–3359. Hampel, F. R. (1971). ‘A General Qualitative Definition of Robustness’. Ann. Math. Statist. 42 (6), pp. 1887–1896. Hampel, F. R. (1974). ‘The Influence Curve and its Role in Robust Estimation’. J. Am. Stat. Assoc. 69 (346), pp. 383–393. Han, H., Y. Xia, and S. A. McLuckey (2007). ‘Ion trap collisional activation of c and z* ions formed via gas-phase ion/ion electron-transfer dissociation’. J. Proteome Res. 6 (8), pp. 3062–3069. Hansen, K. D., S. E. Brenner, and S. Dudoit (2010). ‘Biases in Illumina transcriptome sequencing caused by random hexamer priming’. Nucleic Acids Res. 38 (12), e131. 271 references Hansen, K. D., R. A. Irizarry, and Z. Wu (2012). ‘Removing technical variability in RNA-seq data using conditional quantile normalization’. Biostatistics 13 (2), pp. 204–216. Harbers, M. (2008). ‘The current status of cDNA cloning’. Genomics 91 (3), pp. 232–242. Hardwick, S. A., I. W. Deveson, and T. R. Mercer (2017). ‘Reference standards for next-generation sequencing’. Nat. Rev. Genet. 18 (6). Hawrylycz, M. J., E. S. Lein, A. L. Guillozet-Bongaarts, E. H. Shen, L. Ng, J. A. Miller, L. N. van de Lagemaat, K. A. Smith, A. Ebbert, Z. L. Riley, C. Abajian, C. F. Beckmann, A. Bernard, D. Bertagnolli, A. F. Boe, P. M. Cartagena, M. M. Chakravarty, M. Chapin, J. Chong, R. A. Dalley, B. David Daly, C. Dang, S. Datta, N. Dee, T. A. Dolbeare, V. Faber, D. Feng, D. R. Fowler, J. Goldy, B. W. Gregor, Z. Haradon, D. R. Haynor, J. G. Hohmann, S. Horvath, R. E. Howard, A. Jeromin, J. M. Jochim, M. Kinnunen, C. Lau, E. T. Lazarz, C. Lee, T. A. Lemon, L. Li, Y. Li, J. A. Morris, C. C. Overly, P. D. Parker, S. E. Parry, M. Reding, J. J. Royall, J. Schulkin, P. A. Sequeira, C. R. Slaughterbeck, S. C. Smith, A. J. Sodt, S. M. Sunkin, B. E. Swanson, M. P. Vawter, D. Williams, P. Wohnoutka, H. R. Zielke, D. H. Geschwind, P. R. Hof, S. M. Smith, C. Koch, S. G. N. Grant, and A. R. Jones (2012). ‘An anatomically comprehensive atlas of the adult human brain transcriptome’. Nature 489 (7416), pp. 391–399. He, Z., T. Huang, C. Zhao, and B. Teng (2016). ‘Protein Inference’. Adv. Exp. Med. Biol. 919, pp. 237– 242. Hebenstreit, D., M. Fang, M. Gu, V. Charoensawan, A. van Oudenaarden, and S. A. Teichmann (2011). ‘RNA sequencing reveals two major classes of gene expression levels in metazoan cells’. Mol. Syst. Biol. 7 (1). Higgs, R. E., J. P. Butler, B. Han, and M. D. Knierman (2013). ‘Quantitative Proteomics via High Resolution MS Quantification: Capabilities and Limitations’. Int. J. Proteomics 2013, p. 674282. Hilbrig, F. and R. Freitag (2003). ‘Protein purification by affinity precipitation’. J. Chromatogr. B 790 (1-2), pp. 79–90. Hillen, H. S., D. Temiakov, and P. Cramer (2018). ‘Structural basis of mitochondrial transcription’. Nat. Struct. Mol. Biol. 25 (9), pp. 754–765. Hochbaum, D. S. (1997). ‘Approximation Algorithms for NP-hard Problems’. Ed. by D. S. Hochbaum. Boston, MA, US: PWS Publishing Co. Chap. Approximating Covering and Packing Problems: Set Cover, Vertex Cover, Independent Set, and Related Problems, pp. 94–143. Hoheisel, J. D. (2006). ‘Microarray technology: beyond transcript profiling and genotype analysis’. Nat. Rev. Genet. 7 (3), pp. 200–210. Holman, J. D., D. L. Tabb, and P. Mallick (2014). ‘Employing ProteoWizard to Convert Raw Mass Spectrometry Data’. Curr. Protoc. Bioinformatics 46 (13.24), pp. 1–9. Hou, Z., P. Jiang, S. A. Swanson, A. L. Elwell, B. K. S. Nguyen, J. M. Bolin, R. Stewart, and J. A. Thomson (2015). ‘A cost-effective RNA sequencing protocol for large-scale gene expression studies’. Sci. Rep. 5, p. 9570. Hsu, J.-L., S.-Y. Huang, N.-H. Chow, and S.-H. Chen (2003). ‘Stable-isotope dimethyl labeling for quantitative proteomics’. Anal. Chem. 75 (24), pp. 6843–6852. Hu, A., W. S. Noble, and A. Wolf-Yadlin (2016). ‘Technical advances in proteomics: new developments in data-independent acquisition’. F1000Research 5. Hu, Q., R. J. Noll, H. Li, A. Makarov, M. Hardman, and R. Graham Cooks (2005). ‘The Orbitrap: a new mass spectrometer’. J. Mass Spectrom. 40 (4), pp. 430–443. Huang, D. W., B. T. Sherman, and R. A. Lempicki (2009). ‘Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists’. Nucleic Acids Res. 37 (1), pp. 1– 13. Huang, T., J. Wang, W. Yu, and Z. He (2012). ‘Protein inference: a review’. Briefings Bioinf. 13 (5), pp. 586–614. 272 references Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, S. Davis, L. Gatto, T. Girke, R. Gottardo, F. Hahne, K. D. Hansen, R. A. Irizarry, M. Lawrence, M. I. Love, J. MacDonald, V. Obenchain, A. K. Ole’s, H. Pag‘es, A. Reyes, P. Shannon, G. K. Smyth, D. Tenenbaum, L. Waldron, and M. Morgan (2015). ‘Orchestrating high-throughput genomic analysis with Bioconductor’. Nat. Methods 12 (2), pp. 115–121. Ilicic, T., J. K. Kim, A. A. Kolodziejczyk, F. O. Bagger, D. J. McCarthy, J. C. Marioni, and S. A. Teichmann (2016). ‘Classification of low quality cells from single-cell RNA-seq data’. Genome Biol. 17, p. 29. Illumina (2016). Illumina Sequencing by Synthesis. url: https://youtu.be/fCd6B5HRaZ8. International HumanGenome Sequencing Consortium (2004). ‘Finishing the euchromatic sequence of the human genome’. Nature 431 (7011), pp. 931–945. Irizarry, R. A., C.Wang, Y. Zhou, and T. P. Speed (2009). ‘Gene set enrichment analysis made simple’. Stat. Methods Med. Res. 18 (6), pp. 565–575. Irizarry, R. A., D. Warren, F. Spencer, I. F. Kim, S. Biswal, B. C. Frank, E. Gabrielson, J. G. N. Garcia, J. Geoghegan, G. Germino, C. Griffin, S. C. Hilmer, E. Hoffman, A. E. Jedlicka, E. Kawasaki, F. Martínez-Murillo, L. Morsberger, H. Lee, D. Petersen, J. Quackenbush, A. Scott, M. Wilson, Y. Yang, S. Q. Ye, and W. Yu (2005). ‘Multiple-laboratory comparison of microarray platforms’. Nat. Methods 2 (5), pp. 345–350. Ishihama, Y., Y. Oda, T. Tabata, T. Sato, T. Nagasu, J. Rappsilber, and M. Mann (2005). ‘Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein’. Mol. Cell. Proteom. 4 (9), pp. 1265–1272. Iwakiri, J., M. Hamada, and K. Asai (2016). ‘Bioinformatics tools for lncRNA research’. BBA 1859 (1), pp. 23–30. Jaccard, P. (1901). ‘Etude de la distribution florale dans une portion des Alpes et du Jura’. Bulletin de la Société Vaudoise des Sciences Naturelles 37 (142), pp. 547–579. Jahn, N. (2018). europepmc: R Interface to the Europe PubMed Central RESTful Web Service. R package version 0.3. Jänes, J., F. Hu, A. Lewin, and E. Turro (2015). ‘A comparative study of RNA-seq analysis strategies’. Briefings Bioinf. 16 (6), pp. 932–940. Jaskowiak, P. A., R. J. G. B. Campello, and I. G. Costa (2014). ‘On the selection of appropriate distances for gene expression data clustering’. BMC Bioinf. 15 Suppl. 2, S2. Jeong, K., S. Kim, and P. A. Pevzner (2013). ‘UniNovo: a universal tool for de novo peptide sequencing’. Bioinformatics 29 (16), pp. 1953–1962. Jiang, C., Y. Li, Z. Zhao, J. Lu, H. Chen, N. Ding, G. Wang, J. Xu, and X. Li (2016). ‘Identifying and functionally characterizing tissue-specific and ubiquitously expressed human lncRNAs’. Oncotarget 7 (6), pp. 7120–7133. Jiang, L., F. Schlesinger, C. A. Davis, Y. Zhang, R. Li, M. Salit, T. R. Gingeras, and B. Oliver (2011). ‘Synthetic spike-in standards for RNA-seq experiments’. Genome Res. 21 (9), pp. 1543–1551. Jiménez-Lozano, N., J. Segura, J. R. Macías, J. Vega, and J. M. Carazo (2012). ‘Integrating human and murine anatomical gene expression data for improved comparisons’. Bioinformatics 28 (3), pp. 397–402. Johnson, N. L., A. W. Kemp, and S. Kotz (2005). Univariate Discrete Distributions. Hoboken, NJ, US: Wiley. Johnson, R. S., S. A. Martin, K. Biemann, J. T. Stults, and J. T. Watson (1987). ‘Novel fragmentation process of peptides by collision-induced decomposition in a tandem mass spectrometer: differentiation of leucine and isoleucine’. Anal. Chem. 59 (21), pp. 2621–2625. Johnson, W. E., C. Li, and A. Rabinovic (2007). ‘Adjusting batch effects in microarray expression data using empirical Bayes methods’. Biostatistics 8 (1), pp. 118–127. 273 references Jovanovic, M., M. S. Rooney, P. Mertins, D. Przybylski, N. Chevrier, R. Satija, E. H. Rodriguez, A. P. Fields, S. Schwartz, R. Raychowdhury, M. R. Mumbach, T. Eisenhaure, M. Rabani, D. Gennert, D. Lu, T. Delorey, J. S. Weissman, S. A. Carr, N. Hacohen, and A. Regev (2015). ‘Immunogenetics. Dynamic profiling of the protein life cycle in response to pathogens’. Science 347 (6226), p. 1259038. Kadota, K., J. Ye, Y. Nakai, T. Terada, and K. Shimizu (2006). ‘ROKU: a novelmethod for identification of tissue-specific genes’. BMC Bioinf. 7, p. 294. Käll, L., J. D. Canterbury, J. Weston, W. S. Noble, and M. J. MacCoss (2007). ‘Semi-supervised learning for peptide identification from shotgun proteomics datasets’. Nat. Methods 4 (11), pp. 923–925. Käll, L., J. D. Storey, M. J. MacCoss, and W. S. Noble (2008). ‘Posterior error probabilities and false discovery rates: two sides of the same coin’. J. Proteome Res. 7 (1), pp. 40–44. Karro, J. E., Y. Yan, D. Zheng, Z. Zhang, N. Carriero, P. Cayting, P. Harrrison, and M. Gerstein (2007). ‘Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation’. Nucleic Acids Res. 35 (Database issue), pp. D55–60. Karthik, D., G. Stelzer, S. Gershanov, D. Baranes, and M. Salmon-Divon (2016). ‘Elucidating tissue specific genes using the Benford distribution’. BMC Genom. 17, p. 595. Kechavarzi, B. and S. C. Janga (2014). ‘Dissecting the expression landscape of RNA-binding proteins in human cancers’. Genome Biol. 15 (1), R14. Keller, A., A. I. Nesvizhskii, E. Kolker, and R. Aebersold (2002). ‘Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search’. Anal. Chem. 74 (20), pp. 5383–5392. Kern, D. M., P. K. Nicholls, D. C. Page, and I. M. Cheeseman (2016). ‘A mitotic SKAP isoform regulates spindle positioning at astral microtubule plus ends’. J. Cell Biol. 213 (3), pp. 315–328. Khang, T. F. and C. Y. Lau (2015). ‘Getting the most out of RNA-seq data analysis’. PeerJ 3, e1360. Khatri, P. and S. Drăghici (2005). ‘Ontological analysis of gene expression data: current tools, limitations, and open problems’. Bioinformatics 21 (18), pp. 3587–3595. Khatri, P., M. Sirota, and A. J. Butte (2012). ‘Ten years of pathway analysis: current approaches and outstanding challenges’. PLOS Comput. Biol. 8 (2), e1002375. Kim, D., B. Langmead, and S. L. Salzberg (2015). ‘HISAT: a fast spliced aligner with low memory requirements’. Nat. Methods 12 (4), pp. 357–360. Kim, D., G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg (2013). ‘TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions’. Genome Biol. 14 (4), R36. Kim, M.-S., S. M. Pinto, D. Getnet, R. S. Nirujogi, S. S. Manda, R. Chaerkady, A. K. Madugundu, D. S. Kelkar, R. Isserlin, S. Jain, J. K. Thomas, B. Muthusamy, P. Leal-Rojas, P. Kumar, N. A. Sahasrabuddhe, L. Balakrishnan, J. Advani, B. George, S. Renuse, L. D. N. Selvan, A. H. Patil, V. Nanjappa, A. Radhakrishnan, S. Prasad, T. Subbannayya, R. Raju, M. Kumar, S. K. Sreenivasamurthy, A. Marimuthu, G. J. Sathe, S. Chavan, K. K. Datta, Y. Subbannayya, A. Sahu, S. D. Yelamanchi, S. Jayaram, P. Rajagopalan, J. Sharma, K. R. Murthy, N. Syed, R. Goel, A. A. Khan, S. Ahmad, G. Dey, K. Mudgal, A. Chatterjee, T.-C. Huang, J. Zhong, X. Wu, P. G. Shaw, D. Freed, M. S. Zahari, K. K. Mukherjee, S. Shankar, A. Mahadevan, H. Lam, C. J. Mitchell, S. K. Shankar, P. Satishchandra, J. T. Schroeder, R. Sirdeshmukh, A. Maitra, S. D. Leach, C. G. Drake, M. K. Halushka, T. S. K. Prasad, R. H. Hruban, C. L. Kerr, G. D. Bader, C. A. Iacobuzio-Donahue, H. Gowda, and A. Pandey (2014). ‘A draft map of the human proteome’. Nature 509 (7502), pp. 575–581. Kim, M., A. Eetemadi, and I. Tagkopoulos (2017). ‘DeepPep: Deep proteome inference from peptide profiles’. PLOS Comput. Biol. 13 (9), e1005661. 274 references Kim, P., A. Park, G. Han, H. Sun, P. Jia, and Z. Zhao (2017). ‘TissGDB: tissue-specific gene database in cancer’. Nucleic Acids Res. 46 (D1). Kim, S. and P. A. Pevzner (2014). ‘MS-GF+ makes progress towards a universal database search tool for proteomics’. Nat. Commun. 5, p. 5277. Kohlbacher, O., K. Reinert, C. Gröpl, E. Lange, N. Pfeifer, O. Schulz-Trieglaff, and M. Sturm (2007). ‘TOPP–the OpenMS proteomics pipeline’. Bioinformatics 23 (2), e191–7. Kong, A. T., F. V. Leprevost, D. M. Avtonomov, D. Mellacheruvu, and A. I. Nesvizhskii (2017). ‘MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics’. Nat. Methods 14 (5), pp. 513–520. Kosti, I., N. Jain, D. Aran, A. J. Butte, andM. Sirota (2016). ‘Cross-tissue Analysis of Gene and Protein Expression in Normal and Cancer Tissues’. Sci. Rep. 6, 24799ep. Kotrys, A. V. and R. J. Szczesny (2019). ‘Mitochondrial Gene Expression and Beyond-Novel Aspects of Cellular Physiology’. Cells 9 (1). Koziol, J., N. Griffin, F. Long, Y. Li, M. Latterich, and J. Schnitzer (2013). ‘On protein abundance distributions in complex mixtures’. Proteome Sci. 11 (1), p. 5. Kratz, A. and P. Carninci (2014). ‘The devil in the details of RNA-seq’.Nat. Biotechnol. 32 (9), pp. 882– 884. Kroll, K. W., N. E. Mokaram, A. R. Pelletier, D. E. Frankhouser, M. S. Westphal, P. A. Stump, C. L. Stump, R. Bundschuh, J. S. Blachly, and P. Yan (2014). ‘Quality Control for RNA-Seq (QuaCRS): An Integrated Quality Control Pipeline’. Cancer Informat. 13 (Suppl 3), pp. 7–14. Krupp, M., J. U. Marquardt, U. Sahin, P. R. Galle, J. Castle, and A. Teufel (2012). ‘RNA-Seq Atlas–a reference database for gene expression profiling in normal tissue by next-generation sequencing’. Bioinformatics 28 (8), pp. 1184–1185. Kryuchkova-Mostacci, N. andM. Robinson-Rechavi (2017). ‘A benchmark of gene expression tissue- specificity metrics’. Briefings Bioinf. 18 (2), pp. 205–214. Kumar, A., S. K. Ghosh, M. A. Faiq, V. R. Deshmukh, C. Kumari, and V. Pareek (2019). ‘A brief review of recent discoveries in human anatomy’. QJM 112 (8), pp. 567–573. Kurt, W. (2019). Bayesian Statistics the Fun Way: Understanding Statistics and Probability with Star Wars, LEGO, and Rubber Ducks. San Francisco, CA, US: No Starch Press, Incorporated. Ladoukakis, E. D. and E. Zouros (2017). ‘Evolution and inheritance of animal mitochondrial DNA: rules and exceptions’. J. Biol. Res. 24, p. 2. Lam, H., E.W. Deutsch, J. S. Eddes, J. K. Eng, S. E. Stein, and R. Aebersold (2008). ‘Building consensus spectral libraries for peptide identification in proteomics’. Nat. Methods 5 (10), pp. 873–875. Lander, E. S. et al. (2001). ‘Initial sequencing and analysis of the human genome’. Nature 409 (6822), pp. 860–921. Langfelder, P. and S. Horvath (2008). ‘WGCNA: an R package for weighted correlation network analysis’. BMC Bioinf. 9, p. 559. Langmead, B., C. Trapnell, M. Pop, and S. L. Salzberg (2009). ‘Ultrafast and memory-efficient alignment of short DNA sequences to the human genome’. Genome Biol. 10 (3), R25. Laskay, Ü. A., A. A. Lobas, K. Srzentić, M. V. Gorshkov, and Y. O. Tsybin (2013). ‘Proteome digestion specificity analysis for rational design of extended bottom-up and middle-down proteomics experiments’. J. Proteome Res. 12 (12), pp. 5558–5569. Laurent, J. M., C. Vogel, T. Kwon, S. A. Craig, D. R. Boutz, H. K. Huse, K. Nozue, H. Walia, M. Whiteley, P. C. Ronald, and E. M. Marcotte (2010). ‘Protein abundances are more conserved than mRNA abundances across diverse taxa’. Proteomics 10 (23), pp. 4209–4212. Lazar, C., L. Gatto, M. Ferro, C. Bruley, and T. Burger (2016). ‘Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies’. J. Proteome Res. 15 (4), pp. 1116–1125. 275 references Lazar, C., S. Meganck, J. Taminau, D. Steenhoff, A. Coletta, C. Molter, D. Y. Weiss-Solís, R. Duque, H. Bersini, and A. Nowé (2013). ‘Batch effect removal methods for microarray gene expression data integration: a survey’. Briefings Bioinf. 14 (4), pp. 469–490. Lee, H. Y., E. G. Kim, H. R. Jung, J. W. Jung, H. B. Kim, J. W. Cho, K. M. Kim, and E. C. Yi (2019). ‘Refinements of LC-MS/MS Spectral Counting Statistics Improve Quantification of Low Abundance Proteins’. Sci. Rep. 9, p. 13653. Lee, L. H. (2015). ‘Quantitative and functional analysis pipeline for label-free metaproteomics data and its applications’. PhD thesis. University of Tennessee. Lee, M.-L. (2006). Analysis of Microarray Gene Expression Data. New York, NY, US: Springer. Leek, J. T. (2014). ‘svaseq: removing batch effects and other unwanted noise from sequencing data’. Nucleic Acids Res. 42 (21). Leek, J. T., R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead,W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry (2010). ‘Tackling the widespread and critical impact of batch effects in high- throughput data’. Nat. Rev. Genet. 11 (10), pp. 733–739. Levin, J. Z., M. Yassour, X. Adiconis, C. Nusbaum, D. A. Thompson, N. Friedman, A. Gnirke, and A. Regev (2010). ‘Comprehensive comparative analysis of strand-specific RNA sequencing methods’. Nat. Methods 7 (9), pp. 709–715. Lewczuk, P., G. Beck, O. Ganslandt, H. Esselmann, F. Deisenhammer, A. Regeniter, H.-F. Petereit, H. Tumani, A. Gerritzen, P. Oschmann, J. Schröder, P. Schönknecht, K. Zimmermann, H. Hampel, K. Bürger, M. Otto, S. Haustein, K. Herzog, R. Dannenberg, U. Wurster, M. Bibl, J. M. Maler, U. Reubach, J. Kornhuber, and J. Wiltfang (2006). ‘International quality control survey of neurochemical dementia diagnostics’. Neurosci. Lett. 409 (1), pp. 1–4. Li, B. and G. J. Babu (2019). ‘Bayesian Inference’. A Graduate Course on Statistical Inference. Ed. by B. Li and G. J. Babu. New York, NY, US: Springer New York, pp. 173–201. Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup (2009). ‘The Sequence Alignment/Map format and SAMtools’. Bioinformatics 25 (16), pp. 2078–2079. Li, J. J., P. J. Bickel, and M. D. Biggin (2014). ‘System wide analyses have underestimated protein abundances and the importance of transcription in mammals’. PeerJ 2, e270. Li, P., Y. Piao, H. S. Shon, and K. H. Ryu (2015). ‘Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data.’ BMC Bioinf. 16, p. 347. Li, S., P. P. Łabaj, P. Zumbo, P. Sykacek, W. Shi, L. Shi, J. Phan, P.-Y. Wu, M. Wang, C. Wang, D. Thierry-Mieg, J. Thierry-Mieg, D. P. Kreil, and C. E. Mason (2014). ‘Detecting and correcting systematic variation in large-scale RNA sequencing data’. Nat. Biotechnol. 32 (9), pp. 888–895. Liang, S., Y. Li, X. Be, S. Howes, and W. Liu (2006). ‘Detecting and profiling tissue-selective genes’. Physiol. Genomics 26 (2), pp. 158–162. Lide, D. R., ed. (2005). Handbook of Chemistry and Physics, 85th edition. Boca Raton, FL, US: CRC Press. Liebal, U. W., A. N. T. Phan, M. Sudhakar, K. Raman, and L. M. Blank (2020). ‘Machine Learning Applications for Mass Spectrometry-Based Metabolomics’. Metabolites 10 (6). Lin, T. Y., Y. Xie, A. Wasilewska, and C.-J. Liau, eds. (2008). Data Mining: Foundations and Practice. Berlin, DE: Berlin Springer. Lindemann, C., N. Thomanek, F. Hundt, T. Lerari, H. E. Meyer, D. Wolters, and K. Marcus (2017). ‘Strategies in relative and absolute quantitativemass spectrometry based proteomics’. Biol. Chem. 398 (5-6), pp. 687–699. Lindner, M. D., K. D. Torralba, and N. A. Khan (2018). ‘Scientific productivity: An exploratory study of metrics and incentives’. PLOS ONE 13 (4), e0195321. 276 references Linsinger, T. P. J., W. Kandler, R. Krska, and M. Grasserbauer (1998). ‘The influence of different evaluation techniques on the results of interlaboratory comparisons’. Accredit. Qual. Assur. 3 (8), pp. 322–327. Liu, H., S. Shah, and W. Jiang (2004). ‘On-line outlier detection and data cleaning’. Comput. Chem. Eng. 28 (9), pp. 1635–1647. Liu, H., R. G. Sadygov, and J. R. Yates 3rd (2004). ‘A model for random sampling and estimation of relative protein abundance in shotgun proteomics’. Anal. Chem. 76 (14), pp. 4193–4201. Liu, K., J. Zhang, J. Wang, L. Zhao, X. Peng, W. Jia, W. Ying, Y. Zhu, H. Xie, F. He, and X. Qian (2009). ‘Relationship between sample loading amount and peptide identification and its effects on quantitative proteomics’. Anal. Chem. 81 (4), pp. 1307–1314. Liu,W., J. Wang, T.Wang, and H. Xie (2014). ‘Construction and analyses of human large-scale tissue specific networks’. PLOS ONE 9 (12), e115074. Liu, X., X. Yu, D. J. Zack, H. Zhu, and J. Qian (2008). ‘TiGER: a database for tissue-specific gene expression and regulation’. BMC Bioinf. 9, p. 271. Liu, Y., A. Beyer, and R. Aebersold (2016). ‘On the Dependency of Cellular Protein Levels on mRNA Abundance’. Cell 165 (3), pp. 535–550. Love, M. I., W. Huber, and S. Anders (2014). ‘Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2’. Genome Biol. 15 (12), p. 550. Lowe, R., N. Shirley, M. Bleackley, S. Dolan, and T. Shafee (2017). ‘Transcriptomics technologies’. PLOS Comput. Biol. 13 (5), e1005457. Lukk, M., M. Kapushesky, J. Nikkilä, H. Parkinson, A. Goncalves, W. Huber, E. Ukkonen, and A. Brazma (2010). ‘A global map of human gene expression’. Nat. Biotechnol. 28 (4), pp. 322–324. Lundberg, E., L. Fagerberg, D. Klevebring, I. Matic, T. Geiger, J. Cox, C. Algenäs, J. Lundeberg, M. Mann, and M. Uhlen (2010). ‘Defining the transcriptome and proteome in three functionally different human cell lines’. Mol. Syst. Biol. 6 (1), p. 450. Lyu, Z., K. Peng, and C.-P. Hu (2018). ‘P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation’. Front. Psychol. 9, p. 868. Ma, B. (2015). ‘Novor: real-time peptide de novo sequencing software’. J. Am. Soc. Mass Spectrom. 26 (11), pp. 1885–1894. Ma, B., K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, and G. Lajoie (2003). ‘PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry’. Rapid Commun. Mass Spectrom. 17 (20), pp. 2337–2342. Ma, K., O. Vitek, and A. I. Nesvizhskii (2012). ‘A statistical model-building perspective to identification of MS/MS spectra with PeptideProphet’. BMC Bioinf. 13 Suppl 16, S1. Mabbott, N. A., J. K. Baillie, H. Brown, T. C. Freeman, and D. A. Hume (2013). ‘An expression atlas of human primary cells: inference of gene function from coexpression networks’. BMC Genom. 14, p. 632. Macias, L. A., I. C. Santos, and J. S. Brodbelt (2020). ‘Ion Activation Methods for Peptides and Proteins’. Anal. Chem. 92 (1), pp. 227–251. Mackay, K. I. (2015). ‘A Comparative Study of Analysis Methods in Quantitative Label-free Proteomics’. PhD thesis. University of Liverpool. MacLean, B., J. K. Eng, R. C. Beavis, and M. McIntosh (2006). ‘General framework for developing and evaluating database scoring algorithms using the TANDEM search engine’. Bioinformatics 22 (22), pp. 2830–2832. Madalinski, G., E. Godat, S. Alves, D. Lesage, E. Genin, P. Levi, J. Labarre, J.-C. Tabet, E. Ezan, and C. Junot (2008). ‘Direct introduction of biological samples into a LTQ-Orbitrap hybrid mass spectrometer as a tool for fast metabolome analysis’. Anal. Chem. 80 (9), pp. 3291–3303. Maes, E., P. Kelchtermans, W. Bittremieux, K. De Grave, S. Degroeve, J. Hooyberghs, I. Mertens, G. Baggerman, J. Ramon, K. Laukens, L. Martens, and D. Valkenborg (2016). ‘Designing biomedical 277 references proteomics experiments: state-of-the-art and future perspectives’. Expert Rev. Proteomics 13 (5), pp. 495–511. Maier, T., M. Güell, and L. Serrano (2009). ‘Correlation of mRNA and protein in complex biological samples’. FEBS Lett. 583 (24), pp. 3966–3973. Maier, T., A. Schmidt, M. Güell, S. Kühner, A.-C. Gavin, R. Aebersold, and L. Serrano (2011). ‘Quantification of mRNA and protein and integration with protein turnover in a bacterium’. Mol. Syst. Biol. 7 (1), p. 511. Makarov, A. (2000). ‘Electrostatic Axially Harmonic Orbital Trapping: A High-Performance Technique of Mass Analysis’. Anal. Chem. 72 (6), pp. 1156–1162. Mallick, P., M. Schirle, S. S. Chen, M. R. Flory, H. Lee, D. Martin, J. Ranish, B. Raught, R. Schmitt, T. Werner, B. Kuster, and R. Aebersold (2007). ‘Computational prediction of proteotypic peptides for quantitative proteomics’. Nat. Biotechnol. 25 (1), pp. 125–131. Manza, L. L., S. L. Stamer, A.-J. L. Ham, S. G. Codreanu, and D. C. Liebler (2005). ‘Sample preparation and digestion for proteomic analyses using spin filters’. Proteomics 5 (7), pp. 1742–1745. Marguerat, S., A. Schmidt, S. Codlin, W. Chen, R. Aebersold, and J. Bähler (2012). ‘Quantitative analysis of fission yeast transcriptomes and proteomes in proliferating and quiescent cells’. Cell 151 (3), pp. 671–683. Marioni, J. C., C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad (2008). ‘RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays’. Genome Res. 18 (9), pp. 1509–1517. Martens, L., M. Chambers, M. Sturm, D. Kessner, F. Levander, J. Shofstahl, W. H. Tang, A. Römpp, S. Neumann, A. D. Pizarro, L. Montecchi-Palazzi, N. Tasman, M. Coleman, F. Reisinger, P. Souda, H. Hermjakob, P.-A. Binz, and E. W. Deutsch (2011). ‘mzML — A Community Standard for Mass Spectrometry Data’. Mol. Cell. Proteom. 10 (1). Martin, J. A. and Z.Wang (2011). ‘Next-generation transcriptome assembly’.Nat. Rev. Genet. 12 (10), pp. 671–682. Martin, W. F., S. Garg, and V. Zimorski (2015). ‘Endosymbiotic theories for eukaryote origin’. Philos. Trans. R. Soc. Lond., B, Biol. Sci. 370 (1678), p. 20140330. Martínez, O. and M. H. Reyes-Valdés (2008). ‘Defining diversity, specialization, and gene specificity in transcriptomes through information theory’. PNAS 105 (28), pp. 9709–9714. Marx, V. (2019). ‘A dream of single-cell proteomics’. Nat. Methods 16 (9), pp. 809–812. Mathur, R., D. Rotroff, J. Ma, A. Shojaie, and A. Motsinger-Reif (2018). ‘Gene set analysis methods: a systematic comparison’. BioData Min. 11, p. 8. McCaffrey, A. (1964). Dragonflight. New York, NY, US: Ballantine Books. McIlwain, S., K. Tamura, A. Kertesz-Farkas, C. E. Grant, B. Diament, B. Frewen, J. J. Howbert, M. R. Hoopmann, L. Käll, J. K. Eng, M. J. MacCoss, and W. S. Noble (2014). ‘Crux: rapid open source protein tandem mass spectrometry analysis’. J. Proteome Res. 13 (10), pp. 4488–4491. McPherson, J. D. (2014). ‘A defining decade in DNA sequencing’. Nat. Methods 11 (10), pp. 1003– 1005. Medzihradszky, K. F. and R. J. Chalkley (2015). ‘Lessons in de novo peptide sequencing by tandem mass spectrometry’. Mass Spectrom. Rev. 34 (1), pp. 43–63. Melé, M., P. G. Ferreira, F. Reverter, D. S. DeLuca, J. Monlong, M. Sammeth, a. R. Young, J. M. Goldmann, D. D. Pervouchine, T. J. Sullivan, R. Johnson, A. V. Ségrè, S. Djebali, A. Niarchou, T. G. Consortium, F. A. Wright, T. Lappalainen, M. Calvo, G. Getz, E. T. Dermitzakis, K. G. Ardlie, and R. Guigó (2015). ‘The human transcriptome across tissues and individuals’. Science 348 (6235), pp. 660–665. Menschaert, G. and D. Fenyö (2017). ‘Proteogenomics from a bioinformatics angle: A growing field’. Mass Spectrom. Rev. 36 (5), pp. 584–599. 278 references Miller, R. M., R. J. Millikin, C. V. Hoffmann, S. K. Solntsev, G. M. Sheynkman, M. R. Shortreed, and L. M. Smith (2019). ‘Improved Protein Inference from Multiple Protease Bottom-Up Mass Spectrometry Data’. J. Proteome Res. 18 (9), pp. 3429–3438. Minoche, A. E., J. C. Dohm, and H. Himmelbauer (2011). ‘Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems’. Genome Biol. 12 (11), R112. Morlan, J. D., K. Qu, and D. V. Sinicropi (2012). ‘Selective depletion of rRNA enables whole transcriptome profiling of archival fixed tissue’. PLOS ONE 7 (8), e42882. Morris, J., D. Hartl, A. Knoll, R. Lue, M. Michael, A. Berry, A. Biewener, B. Farrell, and N. M. Holbrook (2016). Biology: How Life Works, 2nd edition. New York, NY, US: W. H. Freeman. Morrison, S. J. (2014). ‘Time to do something about reproducibility’. eLife 3. Mortazavi, A., B. A.Williams, K. McCue, L. Schaeffer, and B.Wold (2008). ‘Mapping and quantifying mammalian transcriptomes by RNA-Seq’. Nat. Methods 5 (7), pp. 621–628. Murrell, P. (2014). gridBase: Integration of base and grid graphics. R package version 0.4-7. Muth, T., F. Hartkopf, M. Vaudel, and B. Y. Renard (2018). ‘A Potential Golden Age to Come-Current Tools, Recent Use Cases, and Future Avenues for De Novo Sequencing in Proteomics’. Proteomics 18 (18), e1700150. Nagaraj, S. H., R. B. Gasser, and S. Ranganathan (2007). ‘A hitchhiker’s guide to expressed sequence tag (EST) analysis’. Briefings Bioinf. 8 (1), pp. 6–21. Nahnsen, S., C. Bielow, K. Reinert, and O. Kohlbacher (2013). ‘Tools for label-free peptide quantification’. Mol. Cell. Proteom. 12 (3), pp. 549–556. Navarro, P., J. Kuharev, L. C. Gillet, O. M. Bernhardt, B. MacLean, H. L. Röst, S. A. Tate, C.-C. Tsou, L. Reiter, U. Distler, G. Rosenberger, Y. Perez-Riverol, A. I. Nesvizhskii, R. Aebersold, and S. Tenzer (2016). ‘A multicenter study benchmarks software tools for label-free proteome quantification’. Nat. Biotechnol. 34 (11), pp. 1130–1136. Neilson, K. A., N. A. Ali, S. Muralidharan, M. Mirzaei, M. Mariani, G. Assadourian, A. Lee, S. C. van Sluyter, and P. A. Haynes (2011). ‘Less label, more free: approaches in label-free quantitative mass spectrometry’. Proteomics 11 (4), pp. 535–553. Nesvizhskii, A. (2006). Statistical Validation of Protein Identifications. Nesvizhskii, A. I. (2010). ‘A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics’. J. Proteom. 73 (11), pp. 2092–2123. Nesvizhskii, A. I. and R. Aebersold (2005). ‘Interpretation of shotgun proteomic data: the protein inference problem’. Mol. Cell. Proteom. 4 (10), pp. 1419–1440. Nesvizhskii, A. I., A. Keller, E. Kolker, and R. Aebersold (2003). ‘A statistical model for identifying proteins by tandem mass spectrometry’. Anal. Chem. 75 (17), pp. 4646–4658. Neuwirth, E. (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. Nilsson, T., M. Mann, R. Aebersold, J. R. Yates 3rd, A. Bairoch, and J. J. M. Bergeron (2010). ‘Mass spectrometry in high-throughput proteomics: ready for the big time’.Nat. Methods 7 (9), pp. 681– 685. Noor, Z., S. B. Ahn, M. S. Baker, S. Ranganathan, and A. Mohamedali (2020). ‘Mass spectrometry- based protein identification in proteomics-a review’. Briefings Bioinf. O’Bryon, I., S. C. Jenson, and E. D. Merkley (2020). ‘Flying blind, or just flying under the radar? The underappreciated power of de novo methods of mass spectrometric peptide identification’. Protein Sci. 29 (9), pp. 1864–1878. O’Malley, C. D. (1964). Andreas Vesalius of Brussels, 1514–1564. Berkeley, CA, US: Berkeley: University of California Press. Oytam, Y., F. Sobhanmanesh, K. Duesing, J. C. Bowden, M. Osmond-McLeod, and J. Ross (2016). ‘Risk-conscious correction of batch effects: maximising information extraction from high-throughput genomic datasets’. BMC Bioinf. 17 (1), p. 332. 279 references Pachter, L. (2015).What is a read mapping? Accessed: 2017-6-27. url: https://liorpachter.wordpress. com/2015/11/01/what-is-a-read-mapping/. Palasca, O., A. Santos, C. Stolte, J. Gorodkin, and L. J. Jensen (2018). ‘TISSUES 2.0: an integrative web resource on mammalian tissue expression’. Database 2018. Palmblad, M., C. V. Henkel, R. P. Dirks, A. H. Meijer, A. M. Deelder, and H. P. Spaink (2013). ‘Parallel deep transcriptome and proteome analysis of zebrafish larvae’. BMC Res. Notes 6, p. 428. Pan, Q., O. Shai, L. J. Lee, B. J. Frey, and B. J. Blencowe (2008). ‘Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing’. Nat. Genet. 40 (12), pp. 1413–1415. Papachristodoulou, D., A. Snape, W. H. Elliott, and D. C. Elliott (2014). Biochemistry and Molecular Biology. Oxford, UK: OUP Oxford. Pappireddi, N., L. Martin, and M. Wühr (2019). ‘A Review on Quantitative Multiplexed Proteomics’. Chembiochem 20 (10), pp. 1210–1224. Paradis, E. and K. Schliep (2019). ‘ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R’. Bioinformatics 35 (3), pp. 526–528. Park, C. Y., A. A. Klammer, L. Käll, M. J. MacCoss, and W. S. Noble (2008). ‘Rapid and accurate peptide identification from tandem mass spectra’. J. Proteome Res. 7 (7), pp. 3022–3027. Parkhomchuk, D., T. Borodina, V. Amstislavskiy, M. Banaru, L. Hallen, S. Krobitsch, H. Lehrach, andA. Soldatov (2009). ‘Transcriptome analysis by strand-specific sequencing of complementary DNA’. Nucleic Acids Res. 37 (18), e123. Parkinson, J. and M. Blaxter (2009). ‘Expressed Sequence Tags: An overview’. Expressed Sequence Tags (ESTs): Generation and Analysis. Ed. by J. Parkinson. Vol. 533. Methods in Molecular Biology. Totowa, NJ, US: Humana Press, pp. 1–12. Pascal, L. E., L. D. True, D. S. Campbell, E. W. Deutsch, M. Risk, I. M. Coleman, L. J. Eichner, P. S. Nelson, and A. Y. Liu (2008). ‘Correlation of mRNA and protein levels: Cell type-specific gene expression of cluster designation antigens in the prostate’. BMC Genom. 9, p. 246. Pasquali, L., K. J. Gaulton, S. A. Rodríguez-Seguí, L. Mularoni, I. Miguel-Escalada, İ. Akerman, J. J. Tena, I. Morán, C. Gómez-Marín, M. van de Bunt, J. Ponsa-Cobas, N. Castro, T. Nammo, I. Cebola, J. García-Hurtado, M. A. Maestro, F. Pattou, L. Piemonti, T. Berney, A. L. Gloyn, P. Ravassard, J. L. G. Skarmeta, F.Müller, M. I. McCarthy, and J. Ferrer (2014). ‘Pancreatic islet enhancer clusters enriched in type 2 diabetes risk-associated variants’. Nat. Genet. 46 (2), pp. 136–143. Pazhamala, L. T., S. Purohit, R. K. Saxena, V. Garg, L. Krishnamurthy, J. Verdier, and R. K. Varshney (2017). ‘Gene expression atlas of pigeonpea and its application to gain insights into genes associated with pollen fertility implicated in seed formation’. J. Exp. Bot. 68 (8), pp. 2037–2054. Pearce, S., H. Vazquez-Gross, S. Y. Herin, D. Hane, Y. Wang, Y. Q. Gu, and J. Dubcovsky (2015). ‘WheatExp: an RNA-seq expression database for polyploid wheat’. BMC Plant Biol. 15, p. 299. Pearson, R. K. (2002). ‘Outliers in process modeling and identification’. IEEE Trans. Control Syst. Technol. 10 (1), pp. 55–63. Peixoto, L., D. Risso, S. G. Poplawski, M. E. Wimmer, T. P. Speed, M. A. Wood, and T. Abel (2015). ‘How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets’. Nucleic Acids Res. 43 (16), pp. 7664–7674. Penkert, M., A. Hauser, R. Harmel, D. Fiedler, C. P. R. Hackenberger, and E. Krause (2019). ‘Electron Transfer/Higher Energy Collisional Dissociation of Doubly Charged Peptide Ions: Identification of Labile Protein Phosphorylations’. J. Am. Soc. Mass Spectrom. 30 (9), pp. 1578–1585. Penkert, M., L. M. Yates, M. Schümann, D. Perlman, D. Fiedler, and E. Krause (2017). ‘Unambiguous Identification of Serine and Threonine Pyrophosphorylation Using Neutral-Loss-Triggered Electron-Transfer/Higher-Energy Collision Dissociation’. Anal. Chem. 89.6, pp. 3672–3680. 280 references Perkins, D. N., D. J. Pappin, D. M. Creasy, and J. S. Cottrell (1999). ‘Probability-based protein identification by searching sequence databases using mass spectrometry data’. Electrophoresis 20 (18), pp. 3551–3567. Petryszak, R., T. Burdett, B. Fiorelli, N. A. Fonseca, M. Gonzalez-Porta, E. Hastings, W. Huber, S. Jupp, M. Keays, N. Kryvych, J. McMurry, J. C. Marioni, J. Malone, K. Megy, G. Rustici, A. Y. Tang, J. Taubert, E. Williams, O. Mannion, H. E. Parkinson, and A. Brazma (2014). ‘Expression Atlas update–a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments’. Nucleic Acids Res. 42 (Database issue), pp. D926–32. Petryszak, R., M. Keays, Y. A. Tang, N. A. Fonseca, E. Barrera, T. Burdett, A. Füllgrabe, A. M.-P. Fuentes, S. Jupp, S. Koskinen, O. Mannion, L. Huerta, K. Megy, C. Snow, E. Williams, M. Barzine, E. Hastings, H. Weisser, J. Wright, P. Jaiswal, W. Huber, J. Choudhary, H. E. Parkinson, and A. Brazma (2015). ‘Expression Atlas update—an integrated database of gene and protein expression in humans, animals and plants’. Nucleic Acids Res. 44 (D1), pp. D746–52. Pfeuffer, J., T. Sachsenberg, O. Alka, M. Walzer, A. Fillbrunn, L. Nilse, O. Schilling, K. Reinert, and O. Kohlbacher (2017). ‘OpenMS - A platform for reproducible analysis of mass spectrometry data’. J. Biotechnol. 261, pp. 142–148. Pfeuffer, J., T. Sachsenberg, T. M. H. Dijkstra, O. Serang, K. Reinert, and O. Kohlbacher (2020). ‘EPIFANY: A Method for Efficient High-Confidence Protein Inference’. J. Proteome Res. 19 (3), pp. 1060–1072. Picotti, P. and R. Aebersold (2012). ‘Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions’. Nat. Methods 9 (6), pp. 555–566. Pierce, B. A. (2005). Genetics: A conceptual approach, 2nd edition. New York, NY, US: Macmillan. Piétu, G., R. Mariage-Samson, N. A. Fayein, C. Matingou, E. Eveno, R. Houlgatte, C. Decraene, Y. Vandenbrouck, F. Tahi, M. D. Devignes, U. Wirkner, W. Ansorge, D. Cox, T. Nagase, N. Nomura, and C. Auffray (1999). ‘The Genexpress IMAGE knowledge base of the human brain transcriptome: a prototype integrated resource for functional and computational genomics’. Genome Res. 9 (2), pp. 195–209. Pino, L. K., B. C. Searle, J. G. Bollinger, B. Nunn, B. MacLean, and M. J. MacCoss (2020). ‘The Skyline ecosystem: Informatics for quantitative mass spectrometry proteomics’. Mass Spectrom. Rev. 39 (3), pp. 229–244. Plotkin, J. B. (2010). ‘Transcriptional regulation is only half the story’. Mol. Syst. Biol. 6 (1), p. 406. Pontius, Joan U. and Wagner, Lukas and Schuler, Gregory D. (2002). ‘UniGene: A Unified View of the Transcriptome’. The NCBI Handbook [Internet]. Ed. by O. J. McEntyre J. Bethesda, MD, US: National Center for Biotechnology Information (US). Pootakham, W., W. Mhuantong, T. Yoocha, L. Putchim, C. Sonthirod, C. Naktang, N. Thongtham, and S. Tangphatsornruang (2017). ‘High resolution profiling of coral-associated bacterial communities using full-length 16S rRNA sequence data from PacBio SMRT sequencing system’. Sci. Rep. 7, p. 2774. Poverennaya, E. V., E. V. Ilgisonis, E. A. Ponomarenko, A. T. Kopylov, V. G. Zgoda, S. P. Radko, A. V. Lisitsa, and A. I. Archakov (2017). ‘Why Are the Correlations between mRNA and Protein Levels so Low among the 275 Predicted Protein-Coding Genes on Human Chromosome 18?’ J. Proteome Res. 16 (12), pp. 4311–4318. Prieto, G., K. Aloria, N. Osinalde, A. Fullaondo, J. M. Arizmendi, and R. Matthiesen (2012). ‘PAnalyzer: a software tool for protein inference in shotgun proteomics’. BMC Bioinf. 13, p. 288. R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. Ramakrishnan, S. R., C. Vogel, J. T. Prince, Z. Li, L. O. Penalva, M. Myers, E. M. Marcotte, D. P. Miranker, and R. Wang (2009). ‘Integrating shotgun proteomics and mRNA expression data to improve protein identification’. Bioinformatics 25 (11), pp. 1397–1403. 281 references Ramsköld, D., E. T. Wang, C. B. Burge, and R. Sandberg (2009). ‘An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data.’ PLOS Comput. Biol. 5 (12), e1000598. Rau, A., G. Marot, and F. Jaffrézic (2014). ‘Differential meta-analysis of RNA-seq data frommultiple studies’. BMC Bioinf. 15, p. 91. Rechenberger, J., P. Samaras, A. Jarzab, J. Behr, M. Frejno, A. Djukovic, J. Sanz, E. M. González-Barberá, M. Salavert, J. L. López-Hontangas, K. B. Xavier, L. Debrauwer, J.-M. Rolain, M. Sanz, M. Garcia-Garcera, M. Wilhelm, C. Ubeda, and B. Kuster (2019). ‘Challenges in Clinical Metaproteomics Highlighted by the Analysis of Acute Leukemia Patients with Gut Colonization by Multidrug-Resistant Enterobacteriaceae’. Proteomes 7 (1). Renard, B. Y., M. Kirchner, H. Steen, J. A. J. Steen, and F. A. Hamprecht (2008). ‘NITPICK: peak identification for mass spectrometry data’. BMC Bioinf. 9, p. 355. Révész, Á., M. G. Milley, K. Nagy, D. Szabó, G. Kalló, É. Csősz, K. Vékey, and L. Drahos (2021). ‘Tailoring to Search Engines: Bottom-Up Proteomics with Collision Energies Optimized for Identification Confidence’. J. Proteome Res. 20, pp. 474–484. Ringwald, M., C. Wu, and A. I. Su (2012). ‘BioGPS and GXD: mouse gene expression data-the benefits and challenges of data integration’. Mamm. Genome 23 (9-10), pp. 550–558. Risso, D., J. Ngai, T. P. Speed, and S. Dudoit (2014). ‘Normalization of RNA-seq data using factor analysis of control genes or samples’. Nat. Biotechnol. 32 (9), pp. 896–902. Robert, C. and M. Watson (2015). ‘Errors in RNA-Seq quantification affect genes of relevance to human disease’. Genome Biol. 16, p. 177. Roberts, A., H. Pimentel, C. Trapnell, and L. Pachter (2011). ‘Identification of novel transcripts in annotated genomes using RNA-Seq’. Bioinformatics 27 (17), pp. 2325–2329. Robertson, G., J. Schein, R. Chiu, R. Corbett, M. Field, S. D. Jackman, K. Mungall, S. Lee, H. M. Okada, J. Q. Qian, M. Griffith, A. Raymond, N. Thiessen, T. Cezard, Y. S. Butterfield, R. Newsome, S. K. Chan, R. She, R. Varhol, B. Kamoh, A.-L. Prabhu, A. Tam, Y. Zhao, R. A. Moore, M. Hirst, M. A. Marra, S. J. M. Jones, P. A. Hoodless, and I. Birol (2010). ‘De novo assembly and analysis of RNA-seq data’. Nat. Methods 7 (11), pp. 909–912. Robinson, M. D., D. J. McCarthy, and G. K. Smyth (2010). ‘edgeR: a Bioconductor package for differential expression analysis of digital gene expression data’. Bioinformatics 26 (1), pp. 139– 140. Robotham, S. A., A. P. Horton, J. R. Cannon, V. C. Cotham, E. M. Marcotte, and J. S. Brodbelt (2016). ‘UVnovo: A de Novo Sequencing Algorithm Using Single Series of Fragment Ions via Chromophore Tagging and 351 nm Ultraviolet Photodissociation Mass Spectrometry’. Anal. Chem. 88 (7), pp. 3990–3997. Rocke, D. M. (1983). ‘Robust statistical analysis of interlaboratory studies’. Biometrika 70 (2), pp. 421–431. Rodriguez, J. M., J. Rodriguez-Rivas, T. Di Domenico, J. Vázquez, A. Valencia, and M. L. Tress (2018). ‘APPRIS 2017: principal isoforms for multiple gene sets’. Nucleic Acids Res. 46 (D1), pp. D213– D217. Roepstorff, P. and J. Fohlman (1984). ‘Proposal for a common nomenclature for sequence ions in mass spectra of peptides’. Biomed. Mass Spectrom. 11 (11), p. 601. Rorbach, J., A. Bobrowicz, S. Pearce, and M. Minczuk (2014). ‘Polyadenylation in Bacteria and Organelles’. Polyadenylation: Methods and Protocols. Ed. by J. Rorbach and A. J. Bobrowicz. Totowa, NJ, US: Humana Press, pp. 211–227. Rosenberger, G., Y. Liu, H. L. Röst, C. Ludwig, A. Buil, A. Bensimon, M. Soste, T. D. Spector, E. T. Dermitzakis, B. C. Collins, L. Malmström, and R. Aebersold (2017). ‘Inference and quantification of peptidoforms in large sample cohorts by SWATH-MS’. Nat. biotechnol. 35 (8), pp. 781–788. 282 references Rosman, K. J. R. and P. D. P. Taylor (1998). ‘Isotopic compositions of the elements 1997’. Pure Appl. Chem. 70 (1), pp. 217–235. Ross, P. L., Y. N. Huang, J. N. Marchese, B. Williamson, K. Parker, S. Hattan, N. Khainovski, S. Pillai, S. Dey, S. Daniels, S. Purkayastha, P. Juhasz, S. Martin, M. Bartlet-Jones, F. He, A. Jacobson, and D. J. Pappin (2004). ‘Multiplexed protein quantitation in Saccharomyces cerevisiae using amine- reactive isobaric tagging reagents’. Mol. Cell. Proteom. 3 (12), pp. 1154–1169. Röst, H. L., T. Sachsenberg, S. Aiche, C. Bielow, H. Weisser, F. Aicheler, S. Andreotti, H.-C. Ehrlich, P. Gutenbrunner, E. Kenar, X. Liang, S. Nahnsen, L. Nilse, J. Pfeuffer, G. Rosenberger, M. Rurik, U. Schmitt, J. Veit, M. Walzer, D. Wojnar, W. E. Wolski, O. Schilling, J. S. Choudhary, L. Malmström, R. Aebersold, K. Reinert, and O. Kohlbacher (2016). ‘OpenMS: a flexible open-source software platform for mass spectrometry data analysis’. Nat. Methods 13 (9), pp. 741–748. Rouillard, A. D., G. W. Gundersen, N. F. Fernandez, Z. Wang, C. D. Monteiro, M. G. McDermott, and A. Ma’ayan (2016). ‘The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins’. Database 2016. Rudnick, P. A., X. Wang, X. Yan, N. Sedransk, and S. E. Stein (2014). ‘Improved normalization of systematic biases affecting ion current measurements in label-free proteomics data’. Mol. Cell. Proteom. 13 (5), pp. 1341–1351. Rudolph, K. L. M. (2014). ebits: An alternative module system for R. Rung, J. and A. Brazma (2013). ‘Reuse of public genome-wide gene expression data’. Nat. Rev. Genet. 14 (2), pp. 89–99. Sadygov, R. G., D. Cociorva, and J. R. Yates 3rd (2004). ‘Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book’. Nat. Methods 1 (3), pp. 195–202. Saltzman, A. B., M. Leng, B. Bhatt, P. Singh, D. W. Chan, L. Dobrolecki, H. Chandrasekaran, J. M. Choi, A. Jain, S. Y. Jung, M. T. Lewis, M. J. Ellis, and A. Malovannaya (2018). ‘gpGrouper: A Peptide Grouping Algorithm for Gene-Centric Inference and Quantitation of Bottom-Up Proteomics Data’. Mol. Cell. Proteom. 17 (11), pp. 2270–2283. Sandin, M., J. Teleman, J. Malmström, and F. Levander (2014). ‘Data processing methods and quality control strategies for label-free LC-MS protein quantification’. BBA 1844 (1 Pt A), pp. 29–41. Sanger, F. and A. R. Coulson (1975). ‘A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase’. J. Mol. Biol. 94 (3), pp. 441–448. Santos, A., K. Tsafou, C. Stolte, S. Pletscher-Frankild, S. I. O’Donoghue, and L. J. Jensen (2015). ‘Comprehensive comparison of large-scale tissue expression datasets’. PeerJ 3, e1054. Savitski, M. M., M. Wilhelm, H. Hahne, B. Kuster, and M. Bantscheff (2015). ‘A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets’.Mol. Cell. Proteom. 14 (9), pp. 2394–2404. Schena, M., D. Shalon, R. W. Davis, and P. O. Brown (1995). ‘Quantitative monitoring of gene expression patterns with a complementary DNA microarray’. Science 270 (5235), pp. 467–470. Schubert, M. (2019). ‘clustermq enables efficient parallelisation of genomic analyses’. Bioinformatics 35 (21). Schubert, M. and K. L. M. Rudolph (2014). modules: An alternative module system for R. Schwanhäusser, B., D. Busse, N. Li, G. Dittmar, J. Schuchhardt, J. Wolf, W. Chen, and M. Selbach (2011). ‘Global quantification of mammalian gene expression control.’Nature 473 (7347), pp. 337– 342. Schwanhäusser, B., D. Busse, N. Li, G. Dittmar, J. Schuchhardt, J. Wolf, W. Chen, and M. Selbach (2013). ‘Corrigendum: Global quantification of mammalian gene expression control’. Nature 495 (7439), pp. 126–127. Scigelova, M., M. Hornshaw, A. Giannakopulos, and A. Makarov (2011). ‘Fourier transform mass spectrometry’. Mol. Cell. Proteom. 10 (7), p. M111.009431. 283 references Searle, B. C., M. Turner, and A. I. Nesvizhskii (2008). ‘Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies’. J. Proteome Res. 7 (1), pp. 245– 253. SEQC/MAQC-III Consortium (2014). ‘A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium’. Nat. Biotechnol. 32 (9), pp. 903–914. Serang, O., M. J. MacCoss, and W. S. Noble (2010). ‘Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data’. J. Proteome Res. 9 (10), pp. 5346– 5357. Serang, O. and W. Noble (2012). ‘A review of statistical methods for protein identification using tandem mass spectrometry’. Stat. Interface 5 (1), pp. 3–20. Shaffer, J. P. (1995). ‘Multiple Hypothesis Testing’. Annu. Rev. Psychol. 46 (1), pp. 561–584. Shevchenko, A., H. Tomas, J. Havlis, J. V. Olsen, and M. Mann (2006). ‘In-gel digestion for mass spectrometric characterization of proteins and proteomes’. Nat. Protoc. 1 (6), pp. 2856–2860. Shi Jing, L., L. S. Jing, F. F. M. Shah, M. S. Mohamad, K. Moorthy, S. Deris, Z. Zakaria, and S. Napis (2015). ‘A Review on Bioinformatics Enrichment Analysis Tools Towards Functional Analysis of High Throughput Gene Set Data’. Curr. Proteomics 12 (1), pp. 14–27. Shi, T., E. Song, S. Nie, K. D. Rodland, T. Liu,W.-J. Qian, and R. D. Smith (2016). ‘Advances in targeted proteomics and applications to biomedical research’. Proteomics 16 (15-16), pp. 2160–2182. Shiferaw, G. A., E. Vandermarliere, N. Hulstaert, R. Gabriels, L. Martens, and P.-J. Volders (2020). ‘COSS: A Fast and User-Friendly Tool for Spectral Library Searching’. J. Proteome Res. 19 (7), pp. 2786–2793. Shokolenko, I. N. andM. F. Alexeyev (2017). ‘Mitochondrial transcription inmammalian cells’. Front. Biosci. 22, pp. 835–853. Shteynberg, D., E. W. Deutsch, H. Lam, J. K. Eng, Z. Sun, N. Tasman, L. Mendoza, R. L. Moritz, R. Aebersold, and A. I. Nesvizhskii (2011). ‘iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates’. Mol. Cell. Proteom. 10 (12), p. M111.007690. Shteynberg, D., A. I. Nesvizhskii, R. L. Moritz, and E. W. Deutsch (2013). ‘Combining results of multiple search engines in proteomics’. Mol. Cell. Proteom. 12 (9), pp. 2383–2393. Sikdar, S., R. Gill, and S. Datta (2016). ‘Improving protein identification from tandem mass spectrometry data by one-step methods and integrating data from other platforms’. Brief. Bioinformatics 17 (2), pp. 262–269. Silva, J. C., M. V. Gorenstein, G.-Z. Li, J. P. C. Vissers, and S. J. Geromanos (2006). ‘Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition’. Mol. Cell. Proteom. 5 (1), pp. 144–156. Sinitcyn, P., J. D. Rudolph, and J. Cox (2018). ‘Computational Methods for Understanding Mass Spectrometry-Based Shotgun Proteomics Data’. Annu. Rev. Biomed. Data Sci. Annual Review of Biomedical Data Science 1. Ed. by R. B. Altman and M. Levitt, pp. 207–234. Slavin, R. E. (1986). ‘Best-Evidence Synthesis: An Alternative to Meta-Analytic and Traditional Reviews’. Educ. Res. 15 (9), pp. 5–11. Smith, R. N., J. Aleksic, D. Butano, A. Carr, S. Contrino, F. Hu, M. Lyne, R. Lyne, A. Kalderimis, K. Rutherford, R. Stepan, J. Sullivan, M. Wakeling, X. Watkins, and G. Micklem (2012). ‘InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data’. Bioinformatics 28 (23), pp. 3163–3165. Spivak, M., J. Weston, L. Bottou, L. Käll, and W. S. Noble (2009). ‘Improvements to the percolator algorithm for Peptide identification from shotgun proteomics data sets’. J. Proteome Res. 8 (7), pp. 3737–3745. 284 references Spivak, M., J. Weston, D. Tomazela, M. J. MacCoss, and W. S. Noble (2012). ‘Direct maximization of protein identifications from tandem mass spectra’. Mol. Cell. Proteom. 11 (2), p. M111.012161. Stairs, C. W., M. M. Leger, and A. J. Roger (2015). ‘Diversity and origins of anaerobic metabolism in mitochondria and related organelles’. Philos. Trans. R. Soc. Lond., B, Biol. Sci. 370 (1678), p. 20140326. Stanke, M., R. Steinkamp, S. Waack, and B. Morgenstern (2004). ‘AUGUSTUS: a web server for gene finding in eukaryotes’. Nucleic Acids Res. 32 (Web Server issue), W309–12. Starr, A. E., S. A. Deeke, L. Li, X. Zhang, R. Daoud, J. Ryan, Z. Ning, K. Cheng, L. V. H. Nguyen, E. Abou-Samra, M. Lavallée-Adam, and D. Figeys (2018). ‘Proteomic and Metaproteomic Approaches to Understand Host-Microbe Interactions’. Anal. Chem. 90 (1), pp. 86–109. Stead, D. A., N. W. Paton, P. Missier, S. M. Embury, C. Hedeler, B. Jin, A. J. P. Brown, and A. Preece (2008). ‘Information quality in proteomics’. Briefings Bioinf. 9 (2), pp. 174–188. Steen, H. and M. Mann (2004). ‘The ABC’s (and XYZ’s) of peptide sequencing’. Nat. Rev. Mol. Cell Biol. 5 (9), pp. 699–711. Stegle, O., L. Parts, M. Piipari, J. Winn, and R. Durbin (2012). ‘Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses’. Nat. Protoc. 7 (3), pp. 500–507. Steiner, C., A. Ducret, J.-C. Tille, M. Thomas, T. A. McKee, L. Rubbia-Brandt, A. Scherl, P. Lescuyer, and P. Cutler (2014). ‘Applications of mass spectrometry for quantitative protein analysis in formalin-fixed paraffin-embedded tissues’. Proteomics 14 (4-5), pp. 441–451. Stelpflug, S. C., R. S. Sekhon, B. Vaillancourt, C. N. Hirsch, C. R. Buell, N. de Leon, and S. M. Kaeppler (2016). ‘An Expanded Maize Gene Expression Atlas based on RNA Sequencing and its Use to Explore Root Development’. Plant Genome 9 (1). Student (Gosset, W. S. (1908). ‘The probable error of a mean’. Biometrika 6 (1), pp. 1–25. Su, A. I., T. Wiltshire, S. Batalov, H. Lapp, K. A. Ching, D. Block, J. Zhang, R. Soden, M. Hayakawa, G. Kreiman, M. P. Cooke, J. R. Walker, and J. B. Hogenesch (2004). ‘A gene atlas of the mouse and human protein-encoding transcriptomes’. PNAS 101 (16), pp. 6062–6067. Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov (2005). ‘Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles’. PNAS 102 (43), pp. 15545–15550. Sudmant, P. H., M. S. Alexis, and C. B. Burge (2015). ‘Meta-analysis of RNA-seq expression data across species, tissues and studies’. Genome Biol. 16, p. 287. Sultan, M., V. Amstislavskiy, T. Risch, M. Schuette, S. Dökel, M. Ralser, D. Balzereit, H. Lehrach, and M.-L. Yaspo (2014). ‘Influence of RNA extraction methods and library selection schemes on RNA-seq data’. BMC Genom. 15 (1), p. 675. Sun, Y., U. Braga-Neto, and E. R. Dougherty (2012). ‘A systematic model of the LC-MS proteomics pipeline’. BMC Genom. 13 Suppl 6, S2. Suntsova, M., N. Gaifullin, D. Allina, A. Reshetun, X. Li, L. Mendeleeva, V. Surin, A. Sergeeva, P. Spirin, V. Prassolov, A. Morgan, A. Garazha, M. Sorokin, and A. Buzdin (2019). ‘Atlas of RNA sequencing profiles for normal human tissues’. Sci. Data 6, p. 36. Sutandy, F. X. R., J. Qian, C.-S. Chen, and H. Zhu (2013). ‘Overview of protein microarrays’. Curr. Protoc. Protein Sci. Chapter 27, Unit 27.1. Syka, J. E. P., J. J. Coon, M. J. Schroeder, J. Shabanowitz, and D. F. Hunt (2004). ‘Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry’. PNAS 101 (26), pp. 9528– 9533. Tabb, D. L. (2015). ‘The SEQUEST family tree’. J. Am. Soc. Mass Spectrom. 26 (11), pp. 1814–1819. 285 references Tabb, D. L., Z.-Q. Ma, D. B. Martin, A.-J. L. Ham, and M. C. Chambers (2008). ‘DirecTag: accurate sequence tags from peptide MS/MS through statistical scoring’. J. Proteome Res. 7 (9), pp. 3838– 3846. Tabb, D. L., W. H. McDonald, and J. R. Yates 3rd (2002). ‘DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics’. J. Proteome Res. 1 (1), pp. 21–26. Tabb, D. L., A. Saraf, and J. R. Yates 3rd (2003). ‘GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model’. Anal. Chem. 75 (23), pp. 6415–6421. Tabb, D. L., L. Vega-Montoto, P. A. Rudnick, A. M. Variyath, A.-J. L. Ham, D. M. Bunk, L. E. Kilpatrick, D. D. Billheimer, R. K. Blackman, H. L. Cardasis, S. A. Carr, K. R. Clauser, J. D. Jaffe, K. A. Kowalski, T. A. Neubert, F. E. Regnier, B. Schilling, T. J. Tegeler, M. Wang, P. Wang, J. R. Whiteaker, L. J. Zimmerman, S. J. Fisher, B. W. Gibson, C. R. Kinsinger, M. Mesri, H. Rodriguez, S. E. Stein, P. Tempst, A. G. Paulovich, D. C. Liebler, and C. Spiegelman (2010). ‘Repeatability and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry’. J. Proteome Res. 9 (2), pp. 761–776. Tamayo, P., G. Steinhardt, A. Liberzon, and J. P. Mesirov (2012). ‘The limitations of simple gene set enrichment analysis assuming gene independence’. Stat. Methods Med. Res. 25 (1), pp. 472–487. Taminau, J., C. Lazar, S. Meganck, and A. Nowé (2014). ‘Comparison of merging and meta-analysis as alternative approaches for integrative gene expression analysis’. ISRN Bioinform. 2014, p. 345106. Tang, H., R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, and P. Radivojac (2006). ‘A computational approach toward label-free protein quantification using predicted peptide detectability’. Bioinformatics 22 (14), e481–8. Tanner, S., H. Shu, A. Frank, L.-C. Wang, E. Zandi, M. Mumby, P. A. Pevzner, and V. Bafna (2005). ‘InsPecT: identification of posttranslationallymodified peptides from tandemmass spectra’.Anal. Chem. 77 (14), pp. 4626–4639. Taylor, J. A. and R. S. Johnson (1997). ‘Sequence database searches via de novo peptide sequencing by tandem mass spectrometry’. Rapid Commun. Mass Spectrom. 11 (9), pp. 1067–1075. The UniProt Consortium (2017). ‘UniProt: the universal protein knowledgebase’. Nucleic Acids Res. 45 (D1), pp. D158–D169. The, M., F. Edfors, Y. Perez-Riverol, S. H. Payne, M. R. Hoopmann, M. Palmblad, B. Forsström, and L. Käll (2018). ‘A Protein Standard That Emulates Homology for the Characterization of Protein Inference Algorithms’. J. Proteome Res. 17 (5), pp. 1879–1886. The, M., M. J. MacCoss, W. S. Noble, and L. Käll (2016). ‘Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0’. J. Am. Soc. Mass Spectrom. 27 (11), pp. 1719–1727. The, M., A. Tasnim, and L. Käll (2016). ‘How to talk about protein-level false discovery rates in shotgun proteomics’. Proteomics 16 (18), pp. 2461–2469. Thompson, A., J. Schäfer, K. Kuhn, S. Kienle, J. Schwarz, G. Schmidt, T. Neumann, R. Johnstone, A. K. A. Mohammed, and C. Hamon (2003). ‘Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS’. Anal. Chem. 75 (8), pp. 1895– 1904. Thomson, J. J. (1913). ‘Rays of Positive Electricity’. Proc. R. Soc. A 89 (607), pp. 1–20. Tian, Q., S. B. Stepaniants, M.Mao, L.Weng, M. C. Feetham,M. J. Doyle, E. C. Yi, H. Dai, V. Thorsson, J. Eng, D. Goodlett, J. P. Berger, B. Gunter, P. S. Linseley, R. B. Stoughton, R. Aebersold, S. J. Collins, W. A. Hanlon, and L. E. Hood (2004). ‘Integrated genomic and proteomic analyses of gene expression in Mammalian cells’. Mol. Cell. Proteom. 3 (10), pp. 960–969. Tipney, H. and L. Hunter (2010). ‘An introduction to effective use of enrichment analysis software’. Hum. Genomics 4 (3), pp. 202–206. 286 references Tran, N. H., X. Zhang, L. Xin, B. Shan, and M. Li (2017). ‘De novo peptide sequencing by deep learning’. PNAS 114 (31), pp. 8247–8252. Trapnell, C., B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter (2010). ‘Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation’. Nat. Biotechnol. 28 (5), pp. 511–515. Tsiatsiani, L. and A. J. R. Heck (2015). ‘Proteomics beyond trypsin’. FEBS J. 282 (14), pp. 2612–2626. Tu, C., J. Li, S. Shen, Q. Sheng, Y. Shyr, and J. Qu (2016). ‘Performance Investigation of Proteomic Identification by HCD/CID Fragmentations in Combination with High/Low-Resolution Detectors on a Tribrid, High-Field Orbitrap Instrument’. PLOS ONE 11 (7), e0160160. Tu, C., J. Li, Q. Sheng, M. Zhang, and J. Qu (2014). ‘Systematic assessment of survey scan and MS2- based abundance strategies for label-free quantitative proteomics using high-resolutionMS data’. J. Proteome Res. 13 (4), pp. 2069–2079. Tu, C., Q. Sheng, J. Li, D. Ma, X. Shen, X. Wang, Y. Shyr, Z. Yi, and J. Qu (2015). ‘Optimization of Search Engines and Postprocessing Approaches to Maximize Peptide and Protein Identification for High-Resolution Mass Data’. J. Proteome Res. 14 (11), pp. 4662–4673. Turner, S. D. (2015). RNA-SEQ quality control and analysis of Differential gene expression using the TUXEDO software suite. Tyanova, S., T. Temu, and J. Cox (2016). ‘The MaxQuant computational platform for mass spectrometry-based shotgun proteomics’. Nat. Protoc. 11 (12), pp. 2301–2319. Tyanova, S., T. Temu, P. Sinitcyn, A. Carlson, M. Y. Hein, T. Geiger, M. Mann, and J. Cox (2016). ‘The Perseus computational platform for comprehensive analysis of (prote)omics data’. Nat. Methods 13 (9), pp. 731–740. Uhlén, M., L. Fagerberg, B. M. Hallström, C. Lindskog, P. Oksvold, A. Mardinoglu, Å. Sivertsson, C. Kampf, E. Sjöstedt, A. Asplund, I. Olsson, K. Edlund, E. Lundberg, S. Navani, C. A.-K. Szigyarto, J. Odeberg, D. Djureinovic, J. O. Takanen, S. Hober, T. Alm, P.-H. Edqvist, H. Berling, H. Tegel, J. Mulder, J. Rockberg, P. Nilsson, J. M. Schwenk, M. Hamsten, K. von Feilitzen, M. Forsberg, L. Persson, F. Johansson, M. Zwahlen, G. von Heijne, J. Nielsen, and F. Pontén (2015). ‘Tissue-based map of the human proteome’. Science 347 (6220). Uhlén, M., B. M. Hallström, C. Lindskog, A. Mardinoglu, F. Pontén, and J. Nielsen (2016). ‘Transcriptomics resources of human tissues and organs’. Mol. Syst. Biol. 12 (4). Unlü, M., M. E. Morgan, and J. S. Minden (1997). ‘Difference gel electrophoresis: a single gel method for detecting changes in protein extracts’. Electrophoresis 18 (11), pp. 2071–2077. Urbanek, S. and J. Horner (2019). Cairo: R Graphics Device using Cairo Graphics Library for Creating High-Quality Bitmap (PNG, JPEG, TIFF), Vector (PDF, SVG, PostScript) and Display (X11 andWin32) Output. R package version 1.5-10. Ushey, K., J. McPherson, J. Cheng, A. Atkins, and J. Allaire (2018). packrat: A Dependency Management System for Projects and their R Package Dependencies. R package version 0.5.0. Uszkoreit, J., A. Maerkens, Y. Perez-Riverol, H. E. Meyer, K. Marcus, C. Stephan, O. Kohlbacher, and M. Eisenacher (2015). ‘PIA: An Intuitive Protein Inference Engine with a Web-Based User Interface’. J. Proteome Res. 14 (7), pp. 2988–2997. Välikangas, T., T. Suomi, and L. L. Elo (2018a). ‘A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation’. Briefings Bioinf. 19 (6), pp. 1344–1355. Välikangas, T., T. Suomi, and L. L. Elo (2018b). ‘A systematic evaluation of normalization methods in quantitative label-free proteomics’. Briefings Bioinf. 19 (1), pp. 1–11. Van Dijk, E. L., H. Auger, Y. Jaszczyszyn, and C. Thermes (2014). ‘Ten years of next-generation sequencing technology’. Trends Genet. 30 (9), pp. 418–426. 287 references Van Leeuwen, J. and J. Leeuwen (1990). Handbook of Theoretical Computer Science. Cambridge, MA, US: Elsevier. VanGuilder, H. D., K. E. Vrana, and W. M. Freeman (2008). ‘Twenty-five years of quantitative PCR for gene expression analysis’. BioTechniques 44 (5), pp. 619–626. Väremo, L., C. Scheele, C. Broholm, A. Mardinoglu, C. Kampf, A. Asplund, I. Nookaew, M. Uhlén, B. K. Pedersen, and J. Nielsen (2015). ‘Proteome- and transcriptome-driven reconstruction of the human myocyte metabolic network and its use for identification of markers for diabetes’. Cell Rep. 11 (6), pp. 921–933. Velculescu, V. E., L. Zhang, W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett Jr, P. Hieter, B. Vogelstein, and K. W. Kinzler (1997). ‘Characterization of the yeast transcriptome’. Cell 88 (2), pp. 243–251. Venables, W. N. and B. D. Ripley (2002).Modern Applied Statistics with S. Fourth. New York, NY, US: Springer. Venter, J. C. et al. (2001). ‘The sequence of the human genome’. Science 291 (5507), pp. 1304–1351. Visser, N. F. C., H. Lingeman, and H. Irth (2005). ‘Sample preparation for peptides and proteins in biological matrices prior to liquid chromatography and capillary zone electrophoresis’. Anal. Bioanal. Chem. 382 (3), pp. 535–558. Vitek, O. (2009). ‘Getting started in computational mass spectrometry-based proteomics’. PLOS Comput. Biol. 5 (5), e1000366. Vogel, C., R. d. S. Abreu, D. Ko, S.-Y. Le, B. A. Shapiro, S. C. Burns, D. Sandhu, D. R. Boutz, E. M. Marcotte, and L. O. Penalva (2010). ‘Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line’. Mol. Syst. Biol. 6 (1), p. 400. Vogel, C. and E. M. Marcotte (2012). ‘Insights into the regulation of protein abundance from proteomic and transcriptomic analyses’. Nat. Rev. Genet. 13 (4), pp. 227–232. Vyatkina, K., S. Wu, L. J. M. Dekker, M. M. VanDuijn, X. Liu, N. Tolić, M. Dvorkin, S. Alexandrova, T. M. Luider, L. Paša-Tolić, and P. A. Pevzner (2015). ‘De Novo Sequencing of Peptides from Top-Down Tandem Mass Spectra’. J. Proteome Res. 14 (11), pp. 4450–4462. Wagner, G. P., K. Kin, and V. J. Lynch (2012). ‘Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples’. Theory Biosci. 131 (4), pp. 281–285. Wajid, B. and E. Serpedin (2012). ‘Review of general algorithmic features for genome assemblers for next generation sequencers’. Genomics, Proteomics & Bioinformatics 10 (2), pp. 58–73. Walsh, C. J., P. Hu, J. Batt, and C. C. D. Santos (2015). ‘Microarray Meta-Analysis and Cross- Platform Normalization: Integrative Genomics for Robust Biomarker Discovery’. Microarrays 4 (3), pp. 389–406. Walther, T. C. and M. Mann (2010). ‘Mass spectrometry–based proteomics in cell biology’. J. Cell Biol. 190 (4), pp. 491–500. Wang, D. G., J. B. Fan, C. J. Siao, A. Berno, P. Young, R. Sapolsky, G. Ghandour, N. Perkins, E. Winchester, J. Spencer, L. Kruglyak, L. Stein, L. Hsie, T. Topaloglou, E. Hubbell, E. Robinson, M. Mittmann, M. S. Morris, N. Shen, D. Kilburn, J. Rioux, C. Nusbaum, S. Rozen, T. J. Hudson, R. Lipshutz, M. Chee, and E. S. Lander (1998). ‘Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome’. Science 280 (5366), pp. 1077–1082. Wang, D., B. Eraslan, T. Wieland, B. Hallström, T. Hopf, D. P. Zolg, J. Zecha, A. Asplund, L.-H. Li, C. Meng, M. Frejno, T. Schmidt, K. Schnatbaum, M. Wilhelm, F. Ponten, M. Uhlen, J. Gagneur, H. Hahne, and B. Kuster (2019). ‘A deep proteome and transcriptome abundance atlas of 29 healthy human tissues’. Mol. Syst. Biol. 15 (2), e8503. Wang, E. T., R. Sandberg, R. Sandberg, R. Sandberg, L. Zhang, C.Mayr, S. F. Kingsmore, G. P. Schroth, and C. B. Burge (2008). ‘Alternative isoform regulation in human tissue transcriptomes’. Nature 456 (7221), pp. 470–476. 288 references Wang, G., W. W. Wu, Z. Zhang, S. Masilamani, and R.-F. Shen (2009). ‘Decoy methods for assessing false positives and false discovery rates in shotgun proteomics’. Anal. Chem. 81 (1), pp. 146–159. Wang, J., J. Pérez-Santiago, J. E. Katz, P. Mallick, and N. Bandeira (2010). ‘Peptide identification from mixture tandem mass spectra’. Mol. Cell. Proteom. 9 (7), pp. 1476–1485. Wang, Q., J. Armenia, C. Zhang, A. V. Penson, E. Reznik, L. Zhang, T. Minet, A. Ochoa, B. E. Gross, C. A. Iacobuzio-Donahue, D. Betel, B. S. Taylor, J. Gao, and N. Schultz (2017). ‘Enabling cross- study analysis of RNA-Sequencing data’. bioRxiv (110734). Wang, S., I. Pandis, D. Johnson, I. Emam, F. Guitton, A. Oehmichen, and Y. Guo (2014). ‘Optimising parallel R correlationmatrix calculations on gene expression data usingMapReduce’. BMCBioinf. 15, p. 351. Wang, X., S. Shen, S. S. Rasam, and J. Qu (2019a). ‘MS1 ion current-based quantitative proteomics: A promising solution for reliable analysis of large biological cohorts’.Mass Spectrom. Rev. 38 (6), pp. 461–482. Wang, X., S. Shen, S. S. Rasam, and J. Qu (2019b). ‘MS1 ion current-based quantitative proteomics: A promising solution for reliable analysis of large biological cohorts’.Mass Spectrom. Rev. 38 (6), pp. 461–482. Wang, Z., M. Gerstein, and M. Snyder (2009). ‘RNA-Seq: a revolutionary tool for transcriptomics’. Nat. Rev. Genet. 10 (1), pp. 57–63. Ward, J. H. (1963). ‘Hierarchical Grouping to Optimize an Objective Function’. J. Am. Stat. Assoc. 58 (301), pp. 236–244. Warnes, G. R., B. Bolker, L. Bonebakker, R. Gentleman, W. H. A. Liaw, T. Lumley, M. Maechler, A. Magnusson, S. Moeller, M. Schwartz, and B. Venables (2019). gplots: Various R Programming Tools for Plotting Data. R package version 3.0.1.1. Webb-Robertson, B.-J. M., H. K. Wiberg, M. M. Matzke, J. N. Brown, J. Wang, J. E. McDermott, R. D. Smith, K. D. Rodland, T. O. Metz, J. G. Pounds, and K. M. Waters (2015). ‘Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics’. J. Proteome Res. 14 (5), pp. 1993–2001. Weir, A. (2014). The Martian. New York, NY, US: Penguin Random House. Weisser, H., J. C. Wright, J. M. Mudge, P. Gutenbrunner, and J. S. Choudhary (2016). ‘Flexible Data Analysis Pipeline for High-Confidence Proteogenomics’. J. Proteome Res. 15 (12), pp. 4686–4695. Welch, B. L. (1947). ‘The generalization of Student’s problem when several different population variances are involved’. Biometrika 34 (1-2), pp. 28–35. Welch, B. L. (1951). ‘On the Comparison of Several Mean Values: An Alternative Approach’. Biometrika 38 (3/4), pp. 330–336. Wenger, C. D. and J. J. Coon (2013). ‘A proteomics search algorithm specifically designed for high- resolution tandem mass spectra’. J. Proteome Res. 12 (3), pp. 1377–1386. Westerhoff, H. V., C. Winder, H. Messiha, E. Simeonidis, M. Adamczyk, M. Verma, F. J. Bruggeman, and W. Dunn (2009). ‘Systems biology: the elements and principles of life’. FEBS Lett. 583 (24), pp. 3882–3890. Wheeler, D. L., D. M. Church, S. Federhen, A. E. Lash, T. L. Madden, J. U. Pontius, G. D. Schuler, L. M. Schriml, E. Sequeira, T. A. Tatusova, and L. Wagner (2003). ‘Database resources of the National Center for Biotechnology’. Nucleic Acids Res. 31 (1), pp. 28–33. Wickham, H. (2007). ‘Reshaping Data with the reshape Package’. J. Stat. Softw. 21 (12), pp. 1–20. Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York, NY, US: Springer-Verlag New York. Wickham, H. (2018). scales: Scale Functions for Visualization. R package version 1.0.0. Wickham, H., J. Hester, and W. Chang (2019). devtools: Tools to Make Developing R Packages Easier. R package version 2.1.0. 289 references Wilhelm, M., J. Schlegl, H. Hahne, A. M. Gholami, M. Lieberenz, M. M. Savitski, E. Ziegler, L. Butzmann, S. Gessulat, H. Marx, T. Mathieson, S. Lemeer, K. Schnatbaum, U. Reimer, H. Wenschuh, M. Mollenhauer, J. Slotta-Huspenina, J.-H. Boese, M. Bantscheff, A. Gerstmair, F. Faerber, and B. Kuster (2014). ‘Mass-spectrometry-based draft of the human proteome’. Nature 509 (7502), pp. 582–587. Williams, C. R., A. Baccarella, J. Z. Parrish, and C. C. Kim (2016). ‘Trimming of sequence reads alters RNA-Seq gene expression estimates’. BMC Bioinf. 17, p. 103. Wiśniewski, J. R., M. Y. Hein, J. Cox, and M. Mann (2014). ‘A “proteomic ruler” for protein copy number and concentration estimation without spike-in standards’. Mol. Cell. Proteom. 13 (12), pp. 3497–3506. Wiśniewski, J. R., A. Zougman, N. Nagaraj, and M. Mann (2009). ‘Universal sample preparation method for proteome analysis’. Nat. Methods 6 (5), pp. 359–362. Wood, S. N. (2004). ‘Stable and efficient multiple smoothing parameter estimation for generalized additive models’. J. Am. Stat. Assoc. 99 (467), pp. 673–686. Wright, J. C. and J. S. Choudhary (2016). ‘DecoyPyrat: Fast Non-redundant Hybrid Decoy Sequence Generation for Large Scale Proteomics’. J. Proteom. Bioinform. 9 (6), pp. 176–180. Wright, J. C., M. O. Collins, L. Yu, L. Käll, M. Brosch, and J. S. Choudhary (2012). ‘Enhanced peptide identification by electron transfer dissociation using an improved Mascot Percolator’. Mol. Cell. Proteom. 11 (8), pp. 478–491. Wright, J. C., J. Mudge, H. Weisser, M. P. Barzine, J. M. Gonzalez, A. Brazma, J. S. Choudhary, and J. Harrow (2016). ‘Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow’. Nat. Commun. 7, p. 11778. Wu, C., I. Macleod, and A. I. Su (2013). ‘BioGPS and MyGene.info: organizing online, gene-centric information’. Nucleic Acids Res. 41 (Database issue), pp. D561–5. Wu, C., C. Orozco, J. Boyer, M. Leglise, J. Goodale, S. Batalov, C. L. Hodge, J. Haase, J. Janes, J. W. Huss 3rd, and A. I. Su (2009). ‘BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources’. Genome Biol. 10 (11), R130. Wu, C., J. C. Tran, L. Zamdborg, K. R. Durbin, M. Li, D. R. Ahlf, B. P. Early, P. M. Thomas, J. V. Sweedler, and N. L. Kelleher (2012). ‘A protease for ’middle-down’ proteomics’. Nat. Methods 9 (8), pp. 822–824. Wu, X., C.-W. Tseng, andN. Edwards (2007). ‘HMMatch: peptide identification by spectral matching of tandem mass spectra using hidden Markov models’. J. Comput. Biol. 14 (8), pp. 1025–1043. Wysocki, V. H., K. A. Resing, Q. Zhang, and G. Cheng (2005). ‘Mass spectrometry of peptides and proteins’. Methods 35 (3), pp. 211–222. Xiao, S.-J., C. Zhang, Q. Zou, and Z.-L. Ji (2010). ‘TiSGeD: a database for tissue-specific genes’. Bioinformatics 26 (9), pp. 1273–1275. Xie, Y., J. Allaire, and G. Grolemund (2018). R Markdown: The Definitive Guide. Boca Raton, FL, US: Chapman and Hall/CRC. Xie, Y., J. Cheng, and X. Tan (2019). DT: AWrapper of the JavaScript Library ’DataTables’. R package version 0.6. Xu, M., Z. Li, and L. Li (2013). ‘Combining percolator with X!Tandem for accurate and sensitive peptide identification’. J. Proteome Res. 12 (6), pp. 3026–3033. Yagüe, J., A. Paradela, M. Ramos, S. Ogueta, A. Marina, F. Barahona, J. A. López de Castro, and J. Vázquez (2003). ‘Peptide rearrangement during quadrupole ion trap fragmentation: added complexity to MS/MS spectra’. Anal. Chem. 75 (6), pp. 1524–1535. Yang, H., H. Chi, W.-F. Zeng, W.-J. Zhou, and S.-M. He (2019). ‘pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework’. Bioinformatics 35 (14), pp. i183–i190. 290 references Yang, X., V. Dondeti, R. Dezube, D. M. Maynard, L. Y. Geer, J. Epstein, X. Chen, S. P. Markey, and J. A. Kowalak (2004). ‘DBParser: web-based software for shotgun proteomic data analyses’. J. Proteome Res. 3 (5), pp. 1002–1008. Yao, L., H. Wang, Y. Song, and G. Sui (2017). ‘BioQueue: a novel pipeline framework to accelerate bioinformatics analysis’. Bioinformatics 33 (20), pp. 3286–3288. Yao, S., C. Jiang, Z. Huang, I. Torres-Jerez, J. Chang, H. Zhang, M. Udvardi, R. Liu, and J. Verdier (2016). ‘The Vigna unguiculata Gene Expression Atlas (VuGEA) from de novo assembly and quantification of RNA-seq data provides insights into seed maturation mechanisms’. Plant J. 88 (2), pp. 318–327. Ye, D., Y. Fu, R.-X. Sun, H.-P. Wang, Z.-F. Yuan, H. Chi, and S.-M. He (2010). ‘Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate’. Bioinformatics 26 (12), pp. i399–406. Ye, X., B. Luke, T. Andresson, and J. Blonder (2009). ‘18O stable isotope labeling in MS-based proteomics’. Brief. Funct. Genom. Proteomics 8 (2), pp. 136–144. Yeung, E. S. (2011). ‘Genome-wide correlation between mRNA and protein in a single cell’. Angew. Chem. 50 (3), pp. 583–585. Yi, H., A. T. Raman, H. Zhang, G. I. Allen, and Z. Liu (2018). ‘Detecting hidden batch factors through data-adaptive adjustment for biological effects’. Bioinformatics 34 (7), pp. 1141–1147. Yi, H., L. Xue, M.-X. Guo, J. Ma, Y. Zeng,W.Wang, J.-Y. Cai, H.-M. Hu, H.-B. Shu, Y.-B. Shi, andW.-X. Li (2010). ‘Gene expression atlas for human embryogenesis’. FASEB J. 24 (9), pp. 3341–3350. Yu, G., L.-G. Wang, Y. Han, and Q.-Y. He (2012). ‘clusterProfiler: an R package for comparing biological themes among gene clusters’. OMICS 16 (5), pp. 284–287. Yu, N. Y.-L., B. M. Hallström, L. Fagerberg, F. Ponten, H. Kawaji, P. Carninci, A. R. R. Forrest, Fantom Consortium, Y. Hayashizaki, M. Uhlén, and C. O. Daub (2015). ‘Complementing tissue characterization by integrating transcriptome profiling from the Human Protein Atlas and from the FANTOM5 consortium’. Nucleic Acids Res. 43 (14), pp. 6787–6798. Yu, X., J. Lin, D. J. Zack, and J. Qian (2006). ‘Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues’. Nucleic Acids Res. 34 (17), pp. 4925–4936. Zhang, J., L. Xin, B. Shan, W. Chen, M. Xie, D. Yuen, W. Zhang, Z. Zhang, G. A. Lajoie, and B. Ma (2012). ‘PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification’. Mol. Cell. Proteom. 11 (4), p. M111.010587. Zhang, Y., Q. Li, F. Wu, R. Zhou, Y. Qi, N. Su, L. Chen, S. Xu, T. Jiang, C. Zhang, G. Cheng, X. Chen, D. Kong, Y. Wang, T. Zhang, J. Zi, W. Wei, Y. Gao, B. Zhen, Z. Xiong, S. Wu, P. Yang, Q. Wang, B. Wen, F. He, P. Xu, and S. Liu (2015). ‘Tissue-Based Proteogenomics Reveals that Human Testis Endows Plentiful Missing Proteins’. J. Proteome Res. 14 (9), pp. 3583–3594. Zhang, Y., B. R. Fonslow, B. Shan, M.-C. Baek, and J. R. Yates 3rd (2013). ‘Protein analysis by shotgun/bottom-up proteomics’. Chemical Rev. 113 (4), pp. 2343–2394. Zhang, Z., S. Wu, D. L. Stenoien, and L. Paša-Tolić (2014). ‘High-throughput proteomics’. Annu. Rev. Anal. Chem. 7, pp. 427–454. Zhao, S. (2014). ‘Assessment of the impact of using a reference transcriptome in mapping short RNA-Seq reads’. PLOS ONE 9 (7), e101374. Zhao, S., Y. Zhang, R. Gamini, B. Zhang, and D. von Schack (2018). ‘Evaluation of twomain RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion’. Sci. Rep. 8, p. 4781. Zhou, Y., Y. Shan, L. Zhang, and Y. Zhang (2014). ‘Recent advances in stable isotope labeling based techniques for proteome relative quantification’. J. Chromatogr. A 1365, pp. 1–11. 291 references Zhu, J., G. Chen, S. Zhu, S. Li, Z. Wen, Bin Li, Y. Zheng, and L. Shi (2016). ‘Identification of Tissue- Specific Protein-Coding and Noncoding Transcripts across 14 Human Tissues Using RNA-seq’. Sci. Rep. 6, p. 28400. Zhuang, F., R. T. Fuchs, andG. B. Robb (2012). ‘Small RNAExpression Profiling byHigh-Throughput Sequencing: Implications of Enzymatic Manipulation’. J. Nucleic Acids 2012. Zhuo, B., S. Emerson, J. H. Chang, and Y. Di (2016). ‘Identifying stably expressed genes frommultiple RNA-Seq data sets’. PeerJ 4, e2791. Zwiener, I., B. Frisch, and H. Binder (2014). ‘Transforming RNA-Seq data to improve the performance of prognostic gene signatures’. PLOS ONE 9 (1), e85150. Zyprych-Walczak, J., A. Szabelska, L. Handschuh, K. Górczak, K. Klamecka, M. Figlerowicz, and I. Siatkowski (2015). ‘The Impact of Normalization Methods on RNA-Seq Data Analysis’. BioMed Res. Int. 2015, p. 621690. 292