dc.contributor.advisor |
Berens, Philipp (Prof. Dr.) |
|
dc.contributor.author |
Lause, Jan |
|
dc.date.accessioned |
2025-02-24T12:02:54Z |
|
dc.date.available |
2025-02-24T12:02:54Z |
|
dc.date.issued |
2025-02-24 |
|
dc.identifier.uri |
http://hdl.handle.net/10900/162431 |
|
dc.identifier.uri |
http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1624312 |
de_DE |
dc.identifier.uri |
http://dx.doi.org/10.15496/publikation-103763 |
|
dc.description.abstract |
Implementing brain functions like vision, memory and cognition requires billions of cells of diverse types to work together. Thus, one important goal in neuroscience is to comprehensively map which cell types exist. These cell types differ in factors such as physiology, morphology and location. Those factors are determined by the transcriptome, the entirety of RNA molecules expressed in each cell. Measuring this RNA fingerprint for brain cells with single-cell RNA sequencing has therefore become a popular route to understanding neural systems. In recent years, large transcriptomic atlas datasets have been published, containing the expression of tens of thousands of genes for millions of brain cells. However, data from single-cell sequencing is subject to strong technical noise from several sources: cell-to-cell variability in sequencing, unequal variances between genes of different expression levels, and noise from PCR amplification. It therefore requires special preprocessing. Currently, computational biologists often address these noise sources with heuristics for normalization and variance stabilization, but these methods have intrinsic limitations and are poorly motivated by theory. In addition, single-cell data is high-dimensional and therefore hard to visualize. Many practitioners use PCA followed by non-linear embedding methods like UMAP or t-SNE to reduce single-cell data to two dimensions. This practice has received substantial criticism, as it is impossible to preserve all aspects of the original data, e.g., high-dimensional distances, in two dimensions. As a result, the field currently debates whether UMAP and t-SNE should be used at all. In this thesis, we address challenges in both preprocessing and visualization. For preprocessing, we present a model-based strategy to normalize single-cell RNA sequencing data: null model Pearson residuals. In this approach, we model the expected technical and statistical noise from the data generation process. 
Consequently, the residuals of this null model contain the biologically meaningful signal and can be used for downstream processing. We show that this approach leads to fast, scalable and effective normalization, and additionally allows for theoretical insights into the data generation process of single-cell RNA sequencing. For visualization, we investigate the claim that 2D embeddings of single-cell data are generally arbitrary and misleading. We show that this claim is itself false and misleading, as it was based on inadequate and limited metrics of embedding quality. More appropriate metrics quantifying neighborhood and class preservation reveal that, while t-SNE and UMAP embeddings of single-cell data do not preserve high-dimensional distances, they can nevertheless provide biologically relevant information. Finally, we reflect on future directions for the field of single-cell data preprocessing and visualization, sketch out how neuroscience can build on the exciting single-cell work of the last decade, and discuss how this might change how we think about brain cell types in the future. |
en |
dc.language.iso |
en |
de_DE |
dc.publisher |
Universität Tübingen |
de_DE |
dc.rights |
cc_by |
de_DE |
dc.rights |
ubt-podok |
de_DE |
dc.rights.uri |
https://creativecommons.org/licenses/by/4.0/legalcode.de |
de_DE |
dc.rights.uri |
https://creativecommons.org/licenses/by/4.0/legalcode.en |
en |
dc.rights.uri |
http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=de |
de_DE |
dc.rights.uri |
http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=en |
en |
dc.subject.classification |
Transcriptome, Cells, Neuroscience, Preprocessing, Data processing, Visualization, Dimensionality reduction, Normalization, Generalized linear model, Biology, Data Science, Single-cell analysis |
en |
dc.subject.ddc |
500 |
de_DE |
dc.subject.ddc |
510 |
de_DE |
dc.subject.ddc |
570 |
de_DE |
dc.subject.ddc |
590 |
de_DE |
dc.subject.other |
single cell genomics |
en |
dc.subject.other |
dimensionality reduction |
en |
dc.subject.other |
visualisation |
en |
dc.subject.other |
variance stabilization |
en |
dc.subject.other |
normalization |
en |
dc.subject.other |
preprocessing |
en |
dc.subject.other |
computational biology |
en |
dc.subject.other |
Compound models |
en |
dc.subject.other |
UMAP |
en |
dc.subject.other |
t-SNE |
en |
dc.subject.other |
Pearson residuals |
en |
dc.subject.other |
scRNA |
en |
dc.subject.other |
scRNA-seq |
en |
dc.subject.other |
single cell transcriptomics |
en |
dc.subject.other |
noise removal |
en |
dc.subject.other |
GLM |
en |
dc.subject.other |
linear models |
en |
dc.subject.other |
Poisson models |
en |
dc.subject.other |
negative binomial models |
en |
dc.subject.other |
overdispersion |
en |
dc.title |
Models and methods to process single-cell RNA sequencing data for neuroscience |
en |
dc.type |
PhDThesis |
de_DE |
dcterms.dateAccepted |
2025-01-24 |
|
utue.publikation.fachbereich |
Medizin |
de_DE |
utue.publikation.fakultaet |
4 Medizinische Fakultät |
de_DE |
utue.publikation.source |
Three chapters of the thesis have already been published individually (under different CC licenses): 1) Lause et al., Genome Biology 22, 258 (2021), CC-BY 4.0. DOI: https://doi.org/10.1186/s13059-021-02451-7 -- 2) Lause et al., bioRxiv 2023.08.02.551637, CC-BY-NC-ND 4.0. DOI: https://doi.org/10.1101/2023.08.02.551637 -- 3) Lause et al., PLOS Computational Biology 20(10): e1012403, CC-BY 4.0. DOI: https://doi.org/10.1371/journal.pcbi.1012403 |
en |
utue.publikation.noppn |
yes |
de_DE |