1 Supplementary Materials Novel Open Reading Frames in Human Accelerated Regions and Transposable Elements reveal new leads to understand schizophrenia and bipolar disorder Chaitanya Erady 1†, Krishna Amin 1†, Temiloluwa O. A. E. Onilogbo 1,2, Jakub Tomasik 2, Rebekah Jukes-Jones 3, Yagnesh Umrania 4, Sabine Bahn 2, and Sudhakaran Prabakaran 5* 1 Department of Genetics, University of Cambridge, CB2 3EH, UK 2Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK 3 Leicester Cancer Research Centre, RKCSB, University of Leicester, University Road, Leicester, LE1 7RH, UK 4Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK 5NonExomics, Inc, 2 Simon Willard Road, Acton, MA 01720, US †These authors contributed equally to the manuscript *Corresponding author, email: sudhakaran.prabakaran@nonexomics.com 2 Supplementary Figure 1 Supplementary Figure 1. An ideogram of unique HARs. Constructed using https://www.ncbi.nlm.nih.gov/genome/tools/gdp) the figure illustrates the location of unique HARs within the hg19 genome assembly. 3 Supplementary Figure 2 Supplementary Figure 2A Supplementary Figure 2B Supplementary Figure 2. Classification of nORFs. A. The 248,135 curated nORF entries were classified with respect to known genes. The translation frame was compared between nORFs and protein-coding genes and the nORF was categorised as in-frame or not in-frame. For not in-frame nORFs, their genomic position in relation to the gene structure was annotated. For nORFs in noncoding region, the overlapping transcripts’ biotype was used for categorisation. B. Approximately, 42% of the nORFs in our dataset are within the CDS of a protein-coding gene, albeit in an alternative frame. 4 Supplementary Figure 3 Supplementary Figure 3A Supplementary Figure 3B Supplementary Figure 3. PsychENCODE sample set. A. Three different studies from the PsychENCODE consortium were used in our analysis. Transcript expression correlation (Spearman’s correlation) between the different samples analyzed in this study shows greater correlation between samples from the same study than those from different studies. Dark Blue cluster – BrainGVEX; Light Blue cluster – CMC_HBCC; Pink cluster – CMC. B. PCA of gene and transcript expression for the samples highlights the gene-level and transcript-level differences based on the study used. SVA and MARS analysis were utilized for additional covariate identification for downstream DE analysis. 5 Supplementary Figure 4 Supplementary Figure 4. Metadata gender labels evaluation against gene expression for PsychENCODE dataset. Male and female metadata sample labels were evaluated by comparing chrY (left) and XIST (right) gene expression for three different PsychENCODE studies – BrainGVEX (top), CMC (middle) and CMC_HBCC (bottom). PCA based clustering of the chrY gene expression 6 split on the y-axis based on the original metadata labels, shows no discrepancy in sample labelling for BrainGVEX and CMC_HBCC, whereas the male and female clusters overlap for CMC. Hierarchical clustering of XIST gene expression gave 3 clusters coloured with different shades of blue. Open circles represent female samples and closed circles represent male samples. XIST expression analysis did not identify any sample labelling discrepancy for BrainGVEX and CMC_HBCC. XIST expression analysis on the CMC study, revealed female samples within the lower XIST expression cluster and a few male samples within the higher XIST expression clusters. 7 Supplementary Figure 5 Supplementary Figure 5. Metadata gender labels evaluation against gene expression for PsychENCODE study: CMC. Male and female metadata sample labels evaluated by comparing chrY and XIST gene expression revealed potential incorrect labelling for a few samples within the CMC study (supplementary figure 19). After removing these seven samples, chrY (top) and XIST (bottom) expression was re-evaluated. PCA of chrY expression plotted across top 2 PCs (top right) confirmed two distinct male and female clusters. No further sample label discrepancies were identified 8 Supplementary Figure 6 Supplementary Figure 6: The number of disorder-associated SNP loci for each disorder and each p- value cut-off. 9 Supplementary Figure 7 Supplementary Figure 7. SMRI samples used in our analysis. RNA-seq and Mass spec samples for CNT, SCZ and BD were obtained from the Array Collection via SMRI. Available sample numbers are highlighted in the figure. 10 Supplementary Figure 8 Supplementary Figure 8. HISAT2 alignment of SMRI samples. HISAT2 alignment summary showing ~95% alignment rate. Y-axis shows proportion of different types of read alignments. Concordant alignment (dark blue) refers to reads that aligned to the reference genome with a specified orientation (forward-reverse) and within a specific distance with respect to each other. Other alignment (light grey) contains reads that mapped concordantly >1-time, discordant alignments as well as single reads of a mate pair that aligned at least 1 time. Unaligned (dark grey) refers to reads that aligned to the genome 0 times. 11 Supplementary Figure 9 Supplementary Figure 9. StringTie transcript assembly of SMRI samples. No. of transcripts identified before and after StringTie merge. Blue bars indicate unannotated transcripts (no corresponding ENSMEBL identifiers), and grey bars indicate transcripts with Ensembl ids. 12 Supplementary Figure 10 Supplementary Figure 10. Identification of confounding factors using sample metadata. Significant differences between disorder and control for each metadata category was evaluated using Mann-Whitney U test in R. Significance was assigned as ‘*’ for p-value < 0.05, ‘**’ for p-value < 0.01, ‘***’ for p-value < 0.001 and N.S. for non-significant p-values. Fluphenazine concentration was considered as a covariate for downstream DE analysis. 13 Supplementary Figure 11 Supplementary Figure 11. PCA of gene expression levels for Array samples before and after outlier sample removal. After conduction a PCA analysis on gene expression values of our samples, three outlier samples: A61, A6 and A16 (left plot) were removed and the recalculated gene expression levels were replotted (right plot). Labels denote sample ids. The three batches identified were fastq +33: FASTQ files with Phred +33 encoding; fastq +64: FASTQ files with Phred +64 encoding; solexa +64: FASTQ files with +64 encoding and file names suffixed with ‘solexa’. 14 Supplementary Figure 12 Supplementary Figure 12. PCA re-done after sample removal evaluated for batch-metadata correlations. 17 different sample metadata categories were evaluated for batch-metadata category correlation. Different colours indicate different sub-categories being evaluated and labels represent different batches. 15 Supplementary Figure 13 Supplementary Figure 13. Gene expression PCA of SMRI samples. Gene expression PCA of SMRI samples based on batch (left) and disorder (right) is shown in the figure above. 16 Supplementary Figure 14 Supplementary Figure 14. Metadata gender labels evaluation against gene expression for SMRI dataset. Male and female metadata sample labels were evaluated by comparing chrY (left) and XIST (right) gene expression. PCA based clustering of the chrY gene expression split on the y-axis based on the original metadata labels, shows no discrepancy in sample labelling. Hierarchical clustering of XIST gene expression gave 3 clusters coloured with different shades of blue. Open circles represent female samples and closed circles represent male samples. XIST expression analysis did not identify any sample labelling discrepancy either. 17 Supplementary Figure 15 Supplementary Figure 15. Proteogenomic workflow for novel peptide identification. Proteomic samples collected from brain tissue were processed using mass spectrometry, and known proteins were filtered out based on matches of the resultant spectra against the human UniProt database. Any unmapped spectra were processed in two ways: (1) by mapping against the amino acid sequences of nORFs and (2) by processing matched RNA-seq samples isolated from the same brain tissue, assembling the transcripts and six-frame translating their sequences to map against the MS spectra. After additional filtration steps, this proteogenomic workflow allows for the identification of novel peptides or nORF peptides within our protein samples. 18 Supplementary Figure 16 Supplementary Figure 16. Novel peptide identification. Distribution of MASCOT peptide and protein scores calculated for novel peptides in neuropsychiatric samples using proteogenomic analysis are shown. For novel peptide filtration, a peptide score of greater than 50 and peptide expectation value of less than 0.05 was used. 19 Supplementary Figure 17 Supplementary Figure 17. nORF peptide identification. Distribution of MASCOT peptide and protein scores calculated for nORF peptides in neuropsychiatric samples using proteogenomic analysis are shown. For nORF peptide filtration, a protein score of greater than 50 and a protein expectation score of less than 0.05 was used. Following this, a peptide score of greater than 50 and peptide expectation value of less than 0.05 was used for further filtration, Finally, nORF peptides expressed in at least 30% of each case/control group were retained. 20 Supplementary Figure 18 Supplementary Figure 18. nORF amino acid length distribution. For the 21 nORFs identified as translated in neuropsychiatric samples, the distribution plot above shows the length (no. of amino acids) of these nORFs. Majority of the nORFs identified are 200-300 aa in length. 21 Supplementary Figure 19 Supplementary Figure 19A Supplementary Figure 19B Supplementary Figure 19. Evaluation of known protein-coding genes containing nORFs. A. Number of genes with and without nORFs, split according to gene biotype is presented. nORFs are largely localized within protein-coding genes. B. Further evaluation of protein-coding genes was performed using FunRich v3.1.3. Percentage of genes with and without nORFs were evaluated for enrichment of clinical phenotypes. Importantly, genes with nORFs are associated significantly with neurological disorders. 22 Supplementary Figure 20 Supplementary Figure 20. An ideogram (constructed using https://www.ncbi.nlm.nih.gov/genome/tools/gdp) illustrating the location of DE nORFs within the hg19 genome assembly. 23 Supplementary Figure 21 24 Supplementary Figure 21. nORF enrichment in BD- and SCZ- specific loci. nORFs transcribed within neuropsychiatric samples (blue peaks) and DE in SCZ (red peaks) or DE in BD (green peaks), were evaluated for overlap and enrichment within SCZ- specific CNVs and BD-specific loci (black vertical lines – outermost circular panel), using GLANET. If nORF enrichment was identified, the corresponding loci (vertical line) is marked with circular dots. For BD-specific loci and SCZ-specific CNVs (copy-number variations) showed no nORF enrichment. Enrichment was found within SCZ-specific loci (Figure 2A) 25 Supplementary Figure 22 Supplementary Figure 22: nORFs transcribed within neuropsychiatric samples were defined to be associated with a unique HAR if the unique HAR overlapped the nORF or regions extending 1 kb upstream or downstream of the nORF. All other details are as described in Figure 2C. 26 Supplementary Figure 23 Supplementary Figure 23. Types of individual HARs within the sets of associated HARs. BD: DE nORF-HARs in BD; S1: DE nORF-HARs in SCZ; S2: DE nORF - DE HARs in SCZ. 27 Supplementary Figure 24 nORFID Disorder Classification Chr Nearby gene tracer_109118 BD nmd 9 STXBP1, syntaxin binding protein tracer_31861 BD utr5-cds 14 HSPA2: Heat shock protein family A (Hsp70) member 2 tracer_21834 BD, SCZ nmd(truncation) 12 ENO2: enolase 2 dvgsH1 BD, SCZ pseudogene 11 TUBAP2: tubulin alpha pseudogene 2 and larger DISC1FP1: DISC 1 Fusion partner 1 Supplementary Figure 24. Characterization of disorder-specific translated nORFs. Of the 21 nORFs that were identified to be translated in neuropsychiatric samples, two were uniquely expressed in BD and two in BD and SCZ. The table above highlights key features of these four nORFs including the chromosome there are present in, the nearby genes and the classification of the nORFs against the sample transcript they arise from. 28 Supplementary Figure 25 Supplementary Figure 25. GO enrichment results for transcribed and translated nORFs. For all the 248,135 nORFs used in this study, GO terms were obtained using InterProScan. For nORFs with evidence of transcription, structural molecular activity within ribosomes and therefore, potential involvement in translation was found. For nORFs that were DE, no enrichment was found possibly owing to the small set of DE nORFs with GO terms (2 in BD and 13 in SCZ). Structural molecular activity as part of the myelin sheath and cytoskeleton, GTP binding, GTPase and oxidoreductase activity were found enriched within the translated nORF set. 29 Supplementary Figure 26 Supplementary Figure 26. Enrichment of genomic elements within nORFs. DHS1, histone modifications and TF enrichment within nORFs transcribed in neuropsychiatric samples (3103) and DE nORFs as evaluated using the GLANET tool. 30 Supplementary Figure 27 2ysyH1 ajg1H1 brt7H1 cycoH1 dvgsH1 tracer_21667 31 tracer_23418 tracer_31861 tracer_34487 tracer_44350 tracer_54924 tracer_65165 32 tracer_65237 tracer_65246 tracer_91684 tracer_103118 tracer_109118 fs1rH2 33 ji52H8 jsksH2 tracer_70164 tracer_87517 2v5xH1 cp2xH1 34 Supplementary Figure 27: Predicted structure for nORFs. Structures predicted for nORFs using i- TASSER or Raptor-X and visualised with Avogadro or Jena3D viewer. These nORFs are either translated in neuropsychiatric analysis or are associated with HARs in SCD and BD samples. To complement Figure 5. Structures for two nORFs - eveeH1 & 7t8hH4 are not shown because their corresponding amino acid length (20 and 15, respectively) is below the minimum sequence threshold required for structure prediction.