Models and methods to process single-cell RNA sequencing data for neuroscience

dc.contributor.advisor Berens, Philipp (Prof. Dr.)
dc.contributor.author Lause, Jan
dc.date.accessioned 2025-02-24T12:02:54Z
dc.date.available 2025-02-24T12:02:54Z
dc.date.issued 2025-02-24
dc.identifier.uri http://hdl.handle.net/10900/162431
dc.identifier.uri http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1624312 de_DE
dc.identifier.uri http://dx.doi.org/10.15496/publikation-103763
dc.description.abstract Implementing brain functions like vision, memory and cognition requires billions of cells of diverse types to work together. Thus, one important goal in neuroscience is to comprehensively map which cell types exist. These cell types differ in factors such as physiology, morphology and location. Those factors are determined by the transcriptome, the entirety of RNA molecules expressed in each cell. Measuring this RNA fingerprint for brain cells with single-cell RNA sequencing has therefore become a popular route to understanding neural systems. In recent years, large transcriptomic atlas datasets have been published, containing the expression of tens of thousands of genes for millions of brain cells. However, data from single-cell sequencing is subject to strong technical noise and thus requires special preprocessing. The noise sources include cell-to-cell variability in sequencing, unequal variances between genes of different expression levels, and noise from PCR amplification. Currently, computational biologists often address these noise sources with heuristics for normalization and variance stabilization, but these methods have intrinsic limitations and are poorly motivated by theory. In addition, single-cell data is high-dimensional and therefore hard to visualize. Many practitioners use PCA followed by non-linear embedding methods like UMAP or t-SNE to reduce single-cell data to two dimensions. This practice has received substantial criticism, as it is impossible to preserve all aspects of the original data, e.g., high-dimensional distances. As a result, the field currently debates whether UMAP and t-SNE should be used at all. In this thesis, we address challenges in both preprocessing and visualization. For preprocessing, we present a model-based strategy to normalize single-cell RNA sequencing data: null model Pearson residuals. In this approach, we model the expected technical and statistical noise from the data generation process. Consequently, the residuals of this null model contain the biologically meaningful signal and can be used for downstream processing. We show that this approach leads to fast, scalable and effective normalization, and additionally allows for theoretical insights into the data generation process of single-cell RNA sequencing. For visualization, we investigate the claim that 2D embeddings of single-cell data are generally arbitrary and misleading. We show that this claim is itself false and misleading, as it was based on inadequate and limited metrics of embedding quality. More appropriate metrics quantifying neighborhood and class preservation reveal that while t-SNE and UMAP embeddings of single-cell data do not preserve high-dimensional distances, they can nevertheless provide biologically relevant information. Finally, we reflect on future directions for the field of single-cell data preprocessing and visualization, sketch out how neuroscience can build on the exciting single-cell work of the last decade, and consider how this might change how we think about brain cell types in the future. en
dc.language.iso en de_DE
dc.publisher Universität Tübingen de_DE
dc.rights cc_by de_DE
dc.rights ubt-podok de_DE
dc.rights.uri https://creativecommons.org/licenses/by/4.0/legalcode.de de_DE
dc.rights.uri https://creativecommons.org/licenses/by/4.0/legalcode.en en
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=de de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=en en
dc.subject.classification Transcriptome, Cells, Neuroscience, Preprocessing, Data processing, Visualization, Dimensionality reduction, Normalization, Generalized linear model, Biology, Data Science, Single-cell analysis de_DE
dc.subject.ddc 500 de_DE
dc.subject.ddc 510 de_DE
dc.subject.ddc 570 de_DE
dc.subject.ddc 590 de_DE
dc.subject.other single cell genomics en
dc.subject.other dimensionality reduction en
dc.subject.other visualisation en
dc.subject.other variance stabilization en
dc.subject.other normalization en
dc.subject.other preprocessing en
dc.subject.other computational biology en
dc.subject.other Compound models en
dc.subject.other UMAP en
dc.subject.other t-SNE en
dc.subject.other Pearson residuals en
dc.subject.other scRNA en
dc.subject.other scRNA-seq en
dc.subject.other single cell transcriptomics en
dc.subject.other noise removal en
dc.subject.other GLM en
dc.subject.other linear models en
dc.subject.other Poisson models en
dc.subject.other negative binomial models en
dc.subject.other overdispersion en
dc.title Models and methods to process single-cell RNA sequencing data for neuroscience en
dc.type PhDThesis de_DE
dcterms.dateAccepted 2025-01-24
utue.publikation.fachbereich Medicine de_DE
utue.publikation.fakultaet 4 Faculty of Medicine de_DE
utue.publikation.source Three chapters of this thesis have already been published separately (under different CC licenses): 1) Lause et al., Genome Biology 22, 258 (2021), CC-BY 4.0. DOI: https://doi.org/10.1186/s13059-021-02451-7 -- 2) Lause et al., bioRxiv 2023.08.02.551637, CC-BY-NC-ND 4.0. DOI: https://doi.org/10.1101/2023.08.02.551637 -- 3) Lause et al., PLOS Computational Biology 20(10): e1012403, CC-BY 4.0. DOI: https://doi.org/10.1371/journal.pcbi.1012403 de_DE
utue.publikation.noppn yes de_DE

cc_by Unless otherwise indicated, the license is described as follows: cc_by