Nucleic Acids Research, 2024, 52, D107–D114
https://doi.org/10.1093/nar/gkad1021
Advance access publication date: 22 November 2023
Database issue

Expression Atlas update: insights from sequencing data at both bulk and single cell level

Nancy George 1,†, Silvie Fexova 1, Alfonso Munoz Fuentes 1, Pedro Madrigal 1,†, Yalan Bi 1, Haider Iqbal 1, Upendra Kumbham 1, Nadja Francesca Nolte 1,†, Lingyun Zhao 1, Anil S. Thanki 1, Iris D. Yu 1,†, Jose C. Marugan Calles 1, Karoly Erdos 1, Liora Vilmovsky 1, Sandeep R. Kurri 1, Anna Vathrakokoili-Pournara 1, David Osumi-Sutherland 1, Ananth Prakash 1,†, Shengbo Wang 1,†, Marcela K. Tello-Ruiz 2, Sunita Kumari 2, Doreen Ware 2,3, Damien Goutte-Gattat 4,†, Yanhui Hu 5, Nick Brown 4, Norbert Perrimon 5,6, Juan Antonio Vizcaíno 1,†, Tony Burdett 1, Sarah Teichmann 7, Alvis Brazma 1 and Irene Papatheodorou 1,*

1 European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton CB10 1SD, UK
2 Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA
3 USDA ARS NEA, Plant Soil & Nutrition Laboratory Research Unit, Ithaca, NY 14853, USA
4 FlyBase-Cambridge, Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK
5 Perrimon Lab, Department of Genetics, Harvard Medical School, Boston MA 02115, USA
6 FlyBase-Harvard Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
7 Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK

* To whom correspondence should be addressed. Tel: +44 122349 2568; Email: irenep@ebi.ac.uk
† Contributed to the manuscript.

Abstract

Expression Atlas (www.ebi.ac.uk/gxa) and its newest counterpart the Single Cell Expression Atlas (www.ebi.ac.uk/gxa/sc) are EMBL-EBI’s knowledgebases for gene and protein expression and localisation in bulk and at single cell level.
These resources allow users to investigate gene and protein expression in normal tissue (baseline) or in response to perturbations such as disease or changes to genotype (differential) across multiple species. Users are invited to search for genes or metadata terms across species or biological conditions in a standardised, consistent interface. Alongside these data, new features in Single Cell Expression Atlas allow users to query metadata through our new cell type wheel search. At the experiment level, data can be explored through two types of dimensionality reduction plots, t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP), overlaid with either clustering or metadata information to assist users’ understanding. Data are also visualised as marker gene heatmaps identifying genes that help confer cluster identity. For some data, additional visualisations are available as interactive cell level anatomograms and cell type gene expression heatmaps.

[Graphical abstract]

Received: September 15, 2023. Revised: October 13, 2023. Editorial Decision: October 13, 2023. Accepted: October 30, 2023
© The Author(s) 2023. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Table 1.
Top 10 species represented by experiment number in Single Cell Expression Atlas and Expression Atlas

Species                        Single Cell Expression Atlas   Expression Atlas   Proteomics
Homo sapiens                   146                            1600               64
Mus musculus                   122                            1273               20
Drosophila melanogaster        29                             150                -
Danio rerio                    14                             26                 -
Arabidopsis thaliana           13                             625                -
Gallus gallus                  4                              40                 -
Rattus norvegicus              3                              186                9
Zea mays                       2                              89                 -
Oryza sativa Japonica Group    2                              112                -
Saccharomyces cerevisiae       1                              49                 -

Species                        Single Cell Expression Atlas   Expression Atlas
Arabidopsis thaliana           13                             625
Zea mays                       2                              89
Oryza sativa                   3                              112
Glycine max                    -                              22
Triticum aestivum              -                              19
Vitis vinifera                 -                              33
Sorghum bicolor                -                              11
Solanum lycopersicum           2                              20
Hordeum vulgare                -                              16
Medicago truncatula            -                              10

Additional table representing the scarcity of single cell sequencing data in plants compared to the top 10 plant species represented in Expression Atlas.

Introduction

Expression Atlas (1) (https://www.ebi.ac.uk/gxa) and the Single Cell Expression Atlas (https://www.ebi.ac.uk/gxa/sc) are knowledgebases (added-value resource) for consistently analysed gene and protein expression data that are developed and maintained by the Gene Expression team at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI). Data provided in the Expression Atlases are aimed at allowing users to investigate the localisation and abundance of gene and protein expression at the bulk and single cell level. Presenting these data in a consistent interface across resources allows users to investigate gene expression data across technologies (bulk sequencing, microarray, well-based and multiplexed single cell sequencing) to gain consensus and insight from publicly available and controlled access data in a single resource.

Datasets are sourced from public archives, such as BioStudies (2), PRIDE (3), NCBI’s Gene Expression Omnibus (GEO) (4), the European Nucleotide Archive (ENA) (5), dbGaP (6) and the European Genome-Phenome Archive (EGA) (7).
With the continuous advancements in single-cell technologies and the increasing availability of data from a wider array of organisms, Single Cell Expression Atlas has increased its coverage of scRNA-seq datasets from various plant species and cell atlas projects, such as the Fly Cell Atlas (8).

In addition to transcriptomics data, Expression Atlas has integrated protein expression information in the same web interface alongside the gene expression data. Public mass spectrometry (MS)-based proteomics datasets, coming mainly from the PRIDE database, are selected, manually curated and reanalysed. To enable data integration, protein expression results are reported in gene coordinates.

With the increase in datasets within Single Cell Expression Atlas and the addition of new species, the user interface has been improved to accommodate metadata searches in addition to gene searches. Moreover, the results of metadata searches (including cell type search) can be viewed in a summarised way, highlighting the coverage of data that matches these keywords and cell types across different species. Top-scoring genes can be easily viewed across studies and species, providing a powerful way for easy interpretation by the scientific research community.

All data and analysis pipelines are designed to incorporate FAIR data principles (9) (Findable, Accessible, Interoperable, Reusable) to allow for their reuse and uptake by the scientific community. Data ingestion focuses on structuring metadata and mapping to ontology terms (controlled vocabularies) with input from species-specific and subject matter experts (SMEs) where appropriate. This work aims to richly describe the entities represented in the resource and allow comparison of the same uniquely identified entity (such as cell type) across multiple datasets.
Additionally, we apply these principles to our analysis workflows so that users can freely access and reuse the tools and processes developed by the team in their own work with a greater understanding of how data is derived. Lastly, as part of EMBL-EBI’s access policy, all data visualisations and processed data are also made available under a CC0 licence.

Main updates

Datasets and species update

Since their inception, we have aimed to consistently improve the quality and quantity of datasets in both Expression Atlas and Single Cell Expression Atlas. The latest release of Expression Atlas in July 2023 focuses on increasing the quality of data presented to users. The resource contains 4424 datasets comprising 15 840 assays. The inclusion of data from a new species, Drosophila pseudoobscura, brings the total of represented organisms in the knowledgebase to 66. Within these datasets, we also continue to increase the representation of proteomics data (93 datasets) and the inclusion of baseline expression data, comprising 340 experiments across 47 different organisms.

Single Cell Expression Atlas has also focused on increasing the coverage of species and the inclusion of new data into the resource. As of its latest release in March 2023, the knowledgebase contains 355 datasets across 21 species. These data are derived from over 17 million cells, of which 10.5 million passed our quality checks and are displayed for users to explore through dimensionality reduction plots (t-SNE and UMAP) as well as gene expression heatmaps and marker gene identification. We also include a new species in this release, Xenopus tropicalis, increasing the representation of single cell data from multiple species.

The top 10 species represented by experiments in Expression Atlas and Single Cell Expression Atlas are summarised in Table 1.
Proteomics data

In the last two years, we have continued to increase the content of proteomics datasets in Expression Atlas, in collaboration with the PRIDE team at EMBL-EBI. Expression Atlas now includes protein expression results, which have been further refined, coming from 93 proteomics datasets. The datasets, all of them generated using label-free techniques, can be split into two main groups according to the type of proteomics data acquisition:

(i) Datasets generated using Data Dependent Acquisition (DDA) approaches. In this case, MaxQuant (10) was used as the analysis software, followed by an in-house post-processing pipeline. The integration of three groups of baseline tissue-based datasets was finalised, coming from human (32 organs represented) (11), mouse (13 organs) and rat (8 organs) (12) samples. To complement these, a study including 15 datasets coming from farm pig (Sus scrofa) (14 organs) has just been completed and is now being integrated (integration not yet finalised at the time of writing). In addition to baseline tissue data, 12 datasets coming from colorectal cancer samples have been reanalysed and integrated (datasets in Expression Atlas tagged as ‘ColCancer2023’), enabling the detection of biomarkers at the protein level. This is a continuation of our previous efforts in cell lines and tumour tissue (13).

(ii) Datasets generated using Data Independent Acquisition (DIA) approaches. We performed a pilot project to study the feasibility of performing a systematic reanalysis of DIA datasets and included cell-line, human cancer-related and plasma samples (14). At the time, a spectral library based approach using the ‘Pan Human library’ was used; these datasets are tagged as ‘DIAPilot2021’.
At present we are benchmarking a library-free methodology using DIA-NN (15), and applying it to additional datasets generated from human baseline tissues as a starting point. These datasets will be integrated into Expression Atlas in the near future, once the analyses are finalised.

Data integration between transcriptomics and proteomics datasets is enabled because protein expression data is reported in a gene-centric manner.

Standardisation of analysis workflow

In the last two years we have shifted most of our analysis pipelines to modern workflow managers (Snakemake (16), Nextflow (17), Galaxy (18)), which are portable and easily cloud deployed, with automatic dependency resolution. Migration to a more modern, community maintained, explicit workflow environment enables a faster turn-over in terms of variations to the analysis workflows, execution on multiple environments and granular tool updates. In general, it makes the workflows more maintainable and continues our adherence to incorporating FAIR (9) principles into our pipelines. Because of the workflow modernisation achieved, the transition of our pipelines between different environments is straightforward, facilitating not only deployment in HPC and cloud environments, but also smoother release cycles and the continuation of the service in the future.

Improvement of bioinformatics pipelines

In terms of bioinformatics tool updates, we have adapted the Single Cell Expression Atlas droplet quantification workflow, which runs under Nextflow (17), for single-nucleus RNA sequencing experiments (snRNA-seq), which presented low alignment rates. We performed an internal benchmark and analysed the impact of different references and tools (alevin (19), alevin-fry (20), kallisto|bustools (21) and STARsolo (22)) on the mapping rates. In addition to increasing the alignment rate for snRNA-seq experiments, in corroboration with the original publication (20) we observed that alevin-fry was faster and required less memory.
Therefore, we decided to switch from alevin to alevin-fry with unspliced (intron-containing) transcript references for quantification of droplet-based experiments.

User interface

Data curation

At the heart of all data that gets ingested into Expression Atlas and Single Cell Expression Atlas is the incorporation of FAIR principles and data curation. From the outset, data are identified by the curation team for their rich biological and technical metadata and file integrity. These are then curated so that all metadata available for the experiment at the sample and file level are identified and incorporated into the dataset. Where possible, all metadata are mapped to the corresponding ontology (controlled hierarchical vocabulary) term, and to a species-specific ontology where applicable (e.g. Drosophila melanogaster metadata are mapped to the ‘Drosophila gross anatomy ontology’ (FBbt) maintained by the FlyBase (23) team).

In addition, we strive to contact authors for cell type specific information which is inferred directly from the cells’ expression profile. These are represented to users in the original format as provided by the data owner, termed ‘authors cell type’. A second mapping, where terms are mapped by the curation team to the closest relevant ontology term, is also provided and represented to users as ‘authors cell type – ontology labels’. These are overlaid as metadata onto the dimensionality reduction plots on the results page for every experiment where available. These data visualisations are also made freely available to users for download as high quality images for reuse and integration into their work.

The benefit of this ontology mapping is clear: it allows users to consolidate and investigate all data where cell types are identified across diverse species, tissues and datasets, and to look for common or cell type specific genes which may indicate function.
This functionality is also integral to the metadata search wheel visualisation described in this paper (see below).

Metadata search and cell type wheel

The latest feature for Single Cell Expression Atlas is the addition of a ‘metadata search’ option that allows users to investigate data annotated with a biological entity, such as an organism part, cell type or disease. The addition of this feature to Single Cell Expression Atlas mirrors the existing functionality in Expression Atlas and helps users to understand the data in greater detail and gain insight from it. The metadata search allows users to search for a biological entity from the search bar on the Single Cell Expression Atlas landing page. From there, a powerful ontology-mediated search expansion is applied so that a user’s search encompasses all synonyms, spellings and ‘child terms’ (associated more specific terms for a search, e.g. a search for cancer would include child terms such as lung cancer, glioblastoma etc.).

Once a search is submitted, the results are presented in a ‘cell type wheel’. This presents the search to the user as a series of ‘layers’ in a wheel. The innermost entity is the search term, and the surrounding rings increase in specificity, from species, to tissue, to finally the cell type associated with that entity in the outermost ring. Selecting a ring updates the results to expand the next associated outer ring and display to the user in more detail the entities associated with that ring. Users can manipulate the search display by clicking on the ‘rings’ along the top of the page to go back through the specific layers and their previous search history within the metadata results for their search.
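The ontology-mediated search expansion described in this section can be illustrated with a minimal sketch. It assumes a toy ontology held in two dictionaries (`CHILDREN` and `SYNONYMS`, both hypothetical and filled with illustrative entries only); it is not the Atlas implementation, which relies on curated ontologies. The function collects the search term, its synonyms and all of its descendant terms.

```python
# Toy ontology (hypothetical, for illustration only): term -> more specific child terms.
CHILDREN = {
    "cancer": ["lung cancer", "glioblastoma"],
    "lung cancer": ["small cell lung cancer"],
}

# Toy synonym table (hypothetical): term -> alternative labels.
SYNONYMS = {
    "cancer": ["malignant neoplasm"],
    "glioblastoma": ["GBM"],
}


def expand_query(term, children=CHILDREN, synonyms=SYNONYMS):
    """Return the term plus the labels of all its synonyms and descendant terms."""
    visited = set()    # ontology terms already traversed
    labels = set()     # every label the expanded search should match
    to_visit = [term]
    while to_visit:
        current = to_visit.pop()
        if current in visited:
            continue
        visited.add(current)
        labels.add(current)
        labels.update(synonyms.get(current, []))
        # Descend to the more specific child terms.
        to_visit.extend(children.get(current, []))
    return labels


if __name__ == "__main__":
    for label in sorted(expand_query("cancer")):
        print(label)
```

In this sketch a search for "cancer" also matches datasets annotated with any child term or synonym, which is the behaviour the text describes for the metadata search.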
Upon selecting an entity (cell type) in the outermost ring, a cell type heatmap showing the top 5 genes associated with that cell type is displayed alongside the cell type wheel where possible. This allows users to see all datasets where their particular entity (e.g. Paneth cell in pancreas in human datasets) has been investigated. Users can then understand broad consensus and differences in the top genes expressed for that entity (Figure 1). Again, these data visualisations can also be freely downloaded by users as high quality images for integration into their work.

Anatomograms expansion and inclusion of new species

As described in the previous Expression Atlas update paper (1), anatomograms are interactive single cell visualisations of cell types ‘in situ’ for healthy adult human tissue. These are developed through cross-team collaborations between web developers, bioinformaticians, artists and data curators, as well as SMEs in the scientific research community for that organ and species. The aim of these visualisations is to allow users to gain an ‘in situ’ understanding of organ structure and cell types alongside a ‘cell type heatmap’ that shows the top 5 marker genes specific to that cell type. An example of the use of anatomograms is shown in Figure 2, with a study on pancreatic cells.

In order to develop an anatomogram, the curation team and artists, in collaboration with research experts, create an overview of the organ structure, including sub-organism and cell type configurations. These are represented as a structural series of images from top level macrostructures through a series of ‘zoom in’ images to microstructures and cell level architecture. Images are developed as shape layers on top of the base level image, representing the hierarchical nature of these tissues, where cells are layered onto sub-tissue structures within the organ. This hierarchical nature is also recreated in the corresponding organ and cell type ontologies.
Where required, additional cell type and sub-tissue structures and relationships missing from the relevant ontologies are identified, defined and added to the relevant ontology through community collaborations. In this way the tissue is represented both visually and hierarchically. All structures are then mapped to the corresponding image and ontology term prior to implementation by the web development team.

To link an image shape (e.g. cell type) to the corresponding data we again leverage the power of metadata curation. Curated datasets where possible are mapped to ontology terms, including as described earlier, inferred cell type information. These mappings correspond to the ontology shape mapping defined in the anatomogram. This linkage allows us to create a link between the shape and the corresponding expression profile for that entity and the cell type heatmap.

We continue to develop anatomograms and as part of these efforts, we have developed these visualisations for the organs related to the anatomy of the gut. Therefore in collaboration with the Human Gut Cell Atlas (24), we have developed a series of anatomograms representing the whole healthy adult human digestive tract, as well as a series representing the composite elements, including the colon, large and small intestines, mesenteric lymph nodes and anus.

Discussion

Community curation

Since their inception, both Expression Atlas and the Single Cell Expression Atlas have been committed to incorporating public datasets and where possible, controlled access, large scale and selected consortia data for use by the scientific community. This involves working with species and project communities to identify and incorporate data of value to their members. Additionally, we work with these community experts to ensure that the data incorporated aligns with their standards, ontologies and any additional requirements.
Through collaborating with these communities (Plant Cell Atlas, Fly Cell Atlas, Human Cell Atlas, European Diagnostic Transcriptomic Library (EDTL) and more) we are one of the few resources to contain multiple datasets across multiple species. This is particularly relevant for the plant community, where single cell data is difficult to obtain (Table 1).

Another element where community engagement is essential is the incorporation of inferred cell types. These are cell types conferred by investigation of their transcriptional profile as opposed to their biological characteristics, which have been the main source of cell classification. By reaching out to data owners we aim to include this information as often as possible for their dataset. Inferred cell type information is incorporated into the data visualisation both as a metadata overlay on the dimensionality reduction plots and, as mentioned previously for selected datasets, in the anatomograms and cell type heatmap (see the relevant sections). Ontology mapping is done in collaboration with a community expert where possible, and the corresponding ontology is updated where required to classify these new cell types.

The addition of inferred cell type identity to these data has been instrumental for subsequent work, such as the cell type deconvolution of existing bulk data (see the cell-type deconvolution section below), as well as any future pipelines aiming to programmatically identify cells from existing expression profiles.

Future work in the team aims to make the process of community engagement as easy as possible, with the aim of encouraging users to submit data directly to the Atlases as a potential endpoint. This would help resources continue to ingest data in line with the increase in publications, which far outstrips the ability of manual curation teams to do so.
With this in mind we aim to make the process of converting datasets from existing International Nucleotide Sequence Database Collaboration (INSDC, https://www.insdc.org/) resources, namely ENA (25), the Sequence Read Archive, SRA (26), and the DNA Data Bank of Japan, DDBJ (27), into MAGE-TAB (28) format an automated process, requiring only the corresponding dataset accession. This is possible due to the shared data model across these resources and the alignment of this to the MAGE-TAB model. These MAGE-TAB files would then be annotated by the community with ontology terms where possible and submitted for curation review. For single cell transcriptomic data for Single Cell Expression Atlas we would also encourage users to submit the corresponding inferred cell type data in a standardised format for inclusion in their dataset.

We have already started on this process, partly through the improvement of existing scripts which query the ENA API (Application Programming Interface) to convert both GEO and ENA data into MAGE-TAB format programmatically rather than manually, significantly reducing manual effort. We have also worked extensively with community partners and through training events in data and knowledge management, MAGE-TAB structure and the inclusion of ontology terms to train and pass on our knowledge and tools to these communities.

Figure 1. Cell type wheel visualisation of pancreatic D (delta) cell metadata search in Single Cell Expression Atlas alongside heatmap showing the top 5 genes across datasets for this cell type in Homo sapiens datasets.
Our aim is to allow these communities to have these tools at their disposal and promote their reuse to their communities, so that data comes directly from them rather than being identified by the curation team through publication searches, which are not exhaustive.

Single cell expression atlas and community resources

FlyBase

As part of our commitment to working with communities, we also make data available to feed back into community repositories. For the FlyBase collaboration, data from Single Cell Expression Atlas are incorporated into this repository. The point of this is to fulfil three main aims: (i) to help FlyBase users to discover what data and datasets are available; (ii) to provide information about relevant datasets and (iii) to give a quick overview of expression data from these datasets. To assist with this, Single Cell Expression Atlas provides, firstly, additional metadata about samples from the manual curation of datasets. Secondly, Single Cell Expression Atlas provides data matrices containing gene expression per cell alongside the inferred cell type identity. With these data, FlyBase extracts (i) the extent of expression, i.e. the proportion of cells of a given cell type in that dataset in which a gene is detected, and (ii) the average expression (normalised to CPM) in cells of that type which do express that gene.

For a particular dataset record in FlyBase, users are provided with links to the corresponding datasets in Single Cell Expression Atlas. In the specific case of the Fly Cell Atlas dataset (8), users can also explore the data for a given gene using either the cell type ribbon, where tiles are coloured by the extent of expression of that gene in a range of cell types from the dataset, or a graphical display of high throughput expression data as a bar graph corresponding to both the proportion of cell types which express the gene alongside the average expression of that gene in those cell types (Figure 3) (23).
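The two per-cell-type summaries described above, the extent of expression and the average expression among expressing cells, can be sketched as follows. This is a minimal illustration under stated assumptions, not the FlyBase or Atlas code: the function name `summarise_gene` is hypothetical, the input is assumed to be one gene's CPM-normalised values with one inferred cell type label per cell, and a gene is treated as "detected" in a cell when its value is greater than zero.

```python
from collections import defaultdict


def summarise_gene(expression, cell_types):
    """Summarise one gene per cell type.

    expression: per-cell values for a single gene (assumed already normalised to CPM).
    cell_types: inferred cell type label for each cell, in the same order.
    Returns {cell_type: (extent, mean_expression_in_expressing_cells)}.
    """
    grouped = defaultdict(list)
    for value, cell_type in zip(expression, cell_types):
        grouped[cell_type].append(value)

    summary = {}
    for cell_type, values in grouped.items():
        # Assumption: a gene counts as detected when its value is above zero.
        expressing = [v for v in values if v > 0]
        extent = len(expressing) / len(values)
        mean_expr = sum(expressing) / len(expressing) if expressing else 0.0
        summary[cell_type] = (extent, mean_expr)
    return summary


if __name__ == "__main__":
    cpm = [0, 10, 30, 0, 0, 5]
    labels = ["neuron", "neuron", "neuron", "glia", "glia", "glia"]
    for cell_type, (extent, mean_expr) in summarise_gene(cpm, labels).items():
        print(f"{cell_type}: extent={extent:.2f}, mean CPM in expressing cells={mean_expr:.1f}")
```

Keeping the two quantities separate matters: a gene detected in few cells of a type can still have a high average among the cells that do express it, which is why both are shown in the FlyBase display.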
An additional expansion to this project is the development of anatomograms for healthy adult tissue from Drosophila melanogaster as part of the community Fly Cell Atlas project and the FlyBase curation team. Initial development includes anatomograms for reproductive organs, the ovary and testis, as well as a representation of the whole adult fly and composite organs.

Gramene

Another long-standing collaboration between the Expression Atlases and the plant community is our collaboration with the Gramene project (29). Expression data derived from plant data ingested into the Atlases is dynamically represented and updated via an embedded Atlas widget within Gramene’s search browser, with future plans to include Single Cell Expression data at the cellular level. We also work closely with the Gramene team to identify and ingest key datasets of interest to the plant community for both bulk and single cell expression.

Cell-type deconvolution

Deconvolution of RNA-seq experiments in Atlas has been implemented based on the recommendations of Vathrakokoili-Pournara et al. (30) using a set of organism part-specific references from the Single Cell Expression Atlas. Three selected deconvolution tools implemented in R (DWLS (31), FARDEEP (32), EpiDISH (33)) are run for bulk RNA-seq experiments from human, mouse and fruit fly. The estimated deconvolution results are reported if the mean Pearson correlation between the output cell proportion matrices is equal to or higher than 0.6, to ensure robustness of the predictions. In the future, the results provided by this analysis will provide Atlas users with additional information about the estimated cellular heterogeneity of bulk samples. The user will then be linked back to the cell type wheel of the respective organism part and cell type in Single Cell Expression Atlas.

Figure 2. (A) The Single Cell Expression Atlas organ anatomogram for pancreas (e.g. https://www.ebi.ac.uk/gxa/sc/experiments/E-GEOD-83139/results/anatomogram), displaying marker genes for the different pancreatic cell types. Hovering over sections of the heatmap gives details about the gene’s expression. As the user clicks on an active section of the pancreas anatomogram, the heatmap to the right changes to display only cell types that exist under that specific part of the organ. (B) As the user dives into more and more detailed views, they will end up at a cellular view.

Figure 3. FlyBase results for Dmel/w showing the cell type ribbon, where tiles are coloured by the extent of expression of Dmel/w in a range of cell types from the dataset, alongside a graphical display of high throughput expression data as a bar graph corresponding to both the proportion of cell types which express Dmel/w alongside the average expression of that gene in those cell types.

Proteomics

In the case of label-free quantitative proteomics datasets, the main focus in the proteomics field is shifting to DIA approaches, thanks to advances in instrumentation and computational analysis. One of the effects in DIA datasets is the reduction of missing values. DDA MS2-labelled approaches such as TMT (Tandem Mass Tagging) also remain popular, although they are preferred for differential studies. However, methodology has recently been developed to represent these datasets as baseline data (34). In addition to bulk tissue data, single cell proteomics datasets are being generated at an increasing pace (35), although the instrumentation required still makes this possible only for a small number of groups.
Although many of hese datasets are still generated for method development pur- oses mainly, we anticipate a higher number of biologically elevant ones. We will attempt to use the ‘Single Cell Expres- ion Atlas’ for providing access to these datasets. ata availability xpression Atlas and Single Cell Expression Atlas are avail- ble for users at https:// www.ebi.ac.uk/ gxa/ and at https: / www.ebi.ac.uk/ gxa/ sc/ , respectively. The Expression Atlas eb application is open source and available in the GitHub epositories https:// github.com/ ebi- gene- expression- group/ tlas- web- single- cell DOI: 10.5281 / zenodo.10021406, https: / github.com/ ebi- gene- expression- group/atlas- web- bulk OI: 10.5281 / zenodo.10021638 and https://github. om/ Papatheodorou-Group/ CATD _ snakemake DOI: 0.5281 / zenodo.10021678 among others. The Nextflow ipeline to perform benchmark of different tools and refer- nces for snRNA-seq datasets is available at https://github. om/ebi- gene- expression- group/snRNA- mapping- rate DOI 0.5281 / zenodo.10021661. c kno wledg ements e would like to thank Olamidipupo Ajigboye and Helen arkinson for their contributions in enriching EFO in terms eeded to describe samples studied in Atlas; Awais Athar, hmed Ali, Ugis Sarkans for their help with the BioStudies nterface and assistance in submissions of new functional ge- omics studies to BioStudies. We would like to thank the Bio- conda community, the Galaxy community for assistance with Bioconda and Galaxy. We would like to thank the data wran- glers, past and present of the Human Cell Atlas Data Coordi- nation Platform for their assistance collating HCA data for the Single Cell Expression Atlas. Finally, we thank the Expression Atlas SAB members, Jurg Bahler (University College London), Angela Brookes (University of California Santa Cruz), Roderic Guigó (Center for Genomic Regulation, chair), Kathryn Lilley (Cambridge University) and Zemin Zhang (Peking University). 
Funding
European Molecular Biology Laboratory (EMBL) member states; Wellcome Trust [Single Cell Gene Expression Atlas 108437/Z/15/Z, 221401/Z/20/Z and PRIDE 223745/Z/21/Z]; BBSRC grants ‘DIA-eXchange’ [BB/X001911/1], ‘GRAPPA’ [BB/T019670/1]; ‘Fly Cell Atlas’ [BB/T014563/1]; Open Targets ‘Cell type deconvolution’ project and ‘Target Safety’ project; ‘Gramene’ [USDA-ARS-8062-21000-041-00D]. Funding for open access charge: European Molecular Biology Laboratory (EMBL) [108437/Z/15/Z, 221401/Z/20/Z, 223745/Z/21/Z].

Conflict of interest statement
None declared.

References
1. Papatheodorou, I., Moreno, P., Manning, J., Fuentes, A.M.-P., George, N., Fexova, S., Fonseca, N.A., Füllgrabe, A., Green, M., Huang, N., et al. (2020) Expression Atlas update: from tissues to single cells. Nucleic Acids Res., 48, D77–D83.
2. Sarkans, U., Gostev, M., Athar, A., Behrangi, E., Melnichuk, O., Ali, A., Minguet, J., Rada, J.C., Snow, C., Tikhonov, A., et al. (2018) The BioStudies database—one stop shop for all data supporting a life sciences study. Nucleic Acids Res., 46, D1266–D1270.
3. Perez-Riverol, Y., Bai, J., Bandla, C., García-Seisdedos, D., Hewapathirana, S., Kamatchinathan, S., Kundu, D.J., Prakash, A., Frericks-Zipper, A., Eisenacher, M., et al. (2022) The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res., 50, D543–D55
4. Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Holko, M., et al. (2013) NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res., 41, D991–D995.
5. Toribio, A.L., Alako, B., Amid, C., Cerdeño-Tarrága, A., Clarke, L., Cleland, I., Fairley, S., Gibson, R., Goodgame, N., Ten Hoopen, P., et al. (2017) European Nucleotide Archive in 2016. Nucleic Acids Res., 45, D32–D36.
6. Tryka, K.A., Hao, L., Sturcke, A., Jin, Y., Wang, Z.Y., Ziyabari, L., Lee, M., Popova, N., Sharopova, N., Kimura, M., et al. (2014) NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Res., 42, D975–D979.
7. Lappalainen, I., Almeida-King, J., Kumanduri, V., Senf, A., Spalding, J.D., Ur-Rehman, S., Saunders, G., Kandasamy, J., Caccamo, M., Leinonen, R., et al. (2015) The European genome-phenome archive of human data consented for biomedical research. Nat. Genet., 47, 692–695.
8. Li, H., Janssens, J., De Waegeneer, M., Kolluru, S.S., Davie, K., Gardeux, V., Saelens, W., David, F., Brbić, M., Spanier, K., et al. (2022) Fly Cell Atlas: a single-nucleus transcriptomic atlas of the adult fruit fly. Science, 375, eabk2432.
9. Wilkinson, M., Dumontier, M., Aalbersberg, I., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L.B., Bourne, P.E., et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data, 3, 160018.
10. Sinitcyn, P., Tiwary, S., Rudolph, J., Gutenbrunner, P., Wichmann, C., Yılmaz, Ş., Hamzeiy, H., Salinas, F. and Cox, J. (2018) MaxQuant goes Linux. Nat. Methods, 15, 401.
11. Prakash, A., García-Seisdedos, D., Wang, S., Kundu, D.J., Collins, A., George, N., Moreno, P., Papatheodorou, I., Jones, A.R. and Vizcaíno, J.A. (2023) Integrated view of baseline protein expression in human tissues. J. Proteome Res., 22, 729–742.
12. Wang, S., García-Seisdedos, D., Prakash, A., Kundu, D.J., Collins, A., George, N., Fexova, S., Moreno, P., Papatheodorou, I., Jones, A.R., et al. (2022) Integrated view and comparative analysis of baseline protein expression in mouse and rat tissues. PLoS Comput. Biol., 18, e1010174.
13. Jarnuczak, A.F., Najgebauer, H., Barzine, M., Kundu, D.J., Ghavidel, F., Perez-Riverol, Y., Papatheodorou, I., Brazma, A. and Vizcaíno, J.A. (2023) An integrated landscape of protein expression in human cancer. Sci. Data, 8, 115.
14. Walzer, M., García-Seisdedos, D., Prakash, A., Brack, P., Crowther, P., Graham, R.L., George, N., Mohammed, S., Moreno, P., Papatheodorou, I., et al. (2022) Implementing the reuse of public DIA proteomics datasets: from the PRIDE database to Expression Atlas. Sci. Data, 9, 335.
15. Demichev, V., Messner, C.B., Vernardis, S.I., Lilley, K.S. and Ralser, M. (2020) DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat. Methods, 17, 41–44.
16. Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., et al. (2021) Sustainable data analysis with Snakemake [version 2; peer review: 2 approved]. F1000Research, 10, 33.
17. Di Tommaso, P., Chatzou, M., Floden, E., Barja, P.P., Palumbo, E. and Notredame, C. (2017) Nextflow enables reproducible computational workflows. Nat. Biotechnol., 35, 316–319.
18. Galaxy Community (2022) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res., 50, W345–W351.
19. Srivastava, A., Malik, L., Smith, T., Sudbery, I. and Patro, R. (2019) Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biol., 20, 65.
20. He, D., Zakeri, M., Sarkar, H., Soneson, C., Srivastava, A. and Patro, R. (2022) Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat. Methods, 19, 316–322.
21. Melsted, P., Booeshaghi, A.S., Liu, L., Gao, F., Lu, L., Min, K.H.J., da Veiga Beltrame, E., Hjörleifsson, K.E., Gehring, J. and Pachter, L. (2021) Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat. Biotechnol., 39, 813–818.
22. Kaminow, B., Yunusov, D. and Dobin, A. (2021) STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. bioRxiv doi: https://doi.org/10.1101/2021.05.05.442755, 05 May 2021, preprint: not peer reviewed.
23. The FlyBase Consortium, Gramates, L.S., Agapite, J., Attrill, H., Calvi, B.R., Crosby, M.A., Santos, G., Goodman, J.L., Goutte-Gattat, D., Jenkins, V.K., Kaufman, T., et al. (2022) FlyBase: a guided tour of highlighted features. Genetics, 220, iyac035.
24. Elmentaite, R., Kumasaka, N., Roberts, K., Fleming, A., Dann, E., King, H.W., Kleshchevnikov, V., Dabrowska, M., Pritchard, S., Bolt, L., et al. (2021) Cells of the human intestinal tract mapped across space and time. Nature, 597, 250–255.
25. Harrison, P.W., Ahamed, A., Aslam, R., Alako, B.T.F., Burgin, J., Buso, N., Courtot, M., Fan, J., Gupta, D., Haseeb, M., et al. (2021) The European Nucleotide Archive in 2020. Nucleic Acids Res., 49, D82–D85.
26. International Nucleotide Sequence Database Collaboration, Leinonen, R., Sugawara, H. and Shumway, M. (2011) The sequence read archive. Nucleic Acids Res., 39, D19–D21.
27. Tanizawa, Y., Fujisawa, T., Kodama, Y., Kosuge, T., Mashima, J., Tanjo, T. and Nakamura, Y. (2023) DNA Data Bank of Japan (DDBJ) update report 2022. Nucleic Acids Res., 51, D101–D105.
28. Rayner, T.F., Rocca-Serra, P., Spellman, P.T., Causton, H.C., Farne, A., Holloway, E., Irizarry, R.A., Liu, J., Maier, D.S., Miller, M., et al. (2006) A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinf., 7, 489.
29. Tello-Ruiz, M.K., Jaiswal, P. and Ware, D. (2022) Gramene: a Resource for Comparative Analysis of Plants Genomes and Pathways. Methods Mol. Biol., 2443, 101–131.
30. Vathrakokoili Pournara, A., Miao, Z., Beker, O.Y., Brazma, A. and Papatheodorou, I. (2023) Power analysis of cell-type deconvolution methods across tissues. bioRxiv doi: https://doi.org/10.1101/2023.01.19.523443, 20 January 2023, preprint: not peer reviewed.
31. Tsoucas, D., Dong, R., Chen, H., Zhu, Q., Guo, G. and Yuan, G.-C. (2019) Accurate estimation of cell-type composition from gene expression data. Nat. Commun., 10, 2975.
32. Hao, Y., Yan, M., Heath, B.R., Lei, Y.L. and Xie, Y. (2019) Fast and robust deconvolution of tumor infiltrating lymphocyte from expression profiles using least trimmed squares. PLoS Comput. Biol., 15, e1006976.
33. Teschendorff, A.E., Breeze, C.E., Zheng, S.C. and Beck, S. (2017) A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies. BMC Bioinf., 18, 105.
34. Wang, H., Dai, C., Pfeuffer, J., Sachsenberg, T., Sanchez, A., Bai, M. and Perez-Riverol, Y. (2023) Tissue-based absolute quantification using large-scale TMT and LFQ experiments. Proteomics, 24, e2300188.
35. Bennett, H.M., Stephenson, W., Rose, C.M. and Darmanis, S. (2023) Single-cell proteomics enabled by next-generation sequencing or mass spectrometry. Nat. Methods, 20, 363–374.