Biomedical Machine Learning Beyond the Training Distribution

Citable link (URI): http://hdl.handle.net/10900/172807
http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1728078
http://dx.doi.org/10.15496/publikation-114132
Document type: Dissertation
Publication date: 2025-12-03
Language: English
Faculty: 7 Mathematisch-Naturwissenschaftliche Fakultät
Department: Computer Science
Reviewer: Schölkopf, Bernhard (Prof. Dr.)
Date of oral examination: 2025-10-29
DDC classification: 004 - Computer science
Keywords: Machine Learning, Biomedicine, Generalization, Epigenetics
Free keywords: Antimicrobial Resistance
Epigenetics
Biomedicine
Machine Learning
Generalization
License: http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en

Abstract:

Machine learning (ML) holds the potential to impact many aspects of our lives, particularly in high-stakes areas like law, autonomous systems, and healthcare. The prospect of leveraging large quantities of data to mine patterns, improve decision-making, and navigate the complexity of biological systems is especially appealing and can have far-ranging consequences; however, ensuring the robustness and reliability of machine learning models has proven remarkably difficult, prompting considerable effort from the research community. In particular, understanding how ML models generalize to new observations is a necessary condition for translating these advances into clinical practice or for expanding biological domain knowledge. In the so-called independent and identically distributed (IID) setting, where the training and test settings correspond and individual observations do not affect each other, machine learning and deep learning have displayed remarkable capabilities. But when the data-generating distribution shifts, or when we want to solve related but slightly different tasks, the quality of a model's predictions can deteriorate rapidly. In this thesis, I will examine the challenges that arise when generalizing beyond the training distribution in biomedical machine learning and the approaches developed to tackle them. The first part of the thesis will provide a broad overview of generalization in machine learning, starting from a conceptual formulation of the generalization problem and the progress made in laying its theoretical foundations. I will then examine the most common paradigms developed to improve predictive performance when generalizing outside the training distribution, and I will discuss the role of causal reasoning within this picture.
Afterwards, I will review the state of biomedical applications of machine learning, highlighting some of the most well-studied areas of research, as well as fields where the use of ML has yet to deliver on its promise. Of particular interest is the topic of biases in biomedical data: given the staggering complexity of biological phenomena and the considerable experimental constraints on gathering relevant data, it is crucial that we understand how to separate noise and natural variability from meaningful signal. In this context, I will also discuss the ever-present challenge of validating the results of biomedical ML models. Following these broad overviews of generalization and biomedical machine learning, I will present two works that apply deep learning to biological and clinical data. In each of them, the generalization challenges and paradigms presented in the earlier chapters play a crucial role, enabling novel prediction tasks or revealing insights into the properties of the models. The first work, which focuses on the task of imputing epigenomic signals, shows how transfer learning enables the out-of-distribution imputation of individual-specific epigenomic patterns, a case study in personalized epigenomics that is, to the best of my knowledge, the first of its kind. The second work tackles the task of predicting antimicrobial resistance from clinical proteomics data; in examining the workings of the proposed models, the analysis of zero-shot prediction tasks offers a window into their robustness, which can guide future model development and inform the data collection efforts required to progress further.