Abstract:
Variation graphs offer a powerful way to overcome the limitations of linear reference
genomes, especially in representing the diversity and structural complexity within species.
As genome sequencing becomes more accessible and datasets grow in both quality and
scope, it is increasingly clear that traditional reference-based analyses fall short in capturing
large-scale variation, population structure, and genomic complexity. However, the practical
interpretation and use of genome graphs remain an open challenge. Both graph construction
and downstream analysis require new tools that can operate at scale, preserve biological
interpretability, and offer meaningful metrics to describe the underlying structure.
In this thesis, I present a set of tools developed to address key challenges in variation
graph analysis. The core contribution is gretl, a fast and flexible framework for computing
graph- and path-based statistics. It enables systematic comparisons across parameter settings
and graph construction methods, and has been used to analyze graphs built from multiple
species, including a yeast dataset and the 1001 Genomes Arabidopsis pangenome. The
framework reveals how parameters such as segment length and alignment thresholds strongly
affect graph structure and interpretability. I also introduce gfa2bin, a graph-to-GWAS
bridge that supports association testing directly from graph node coverage. This method
demonstrates the potential of graph-based GWAS to detect both known and novel trait
associations. In addition, I develop a novel variation detection approach based on
bifurcation events between paths, offering a complementary alternative to standard bubble
detection algorithms.
Together, these tools enable direct statistical exploration and biological analysis of
genome graphs at both global and sample-specific levels. Applied to the Arabidopsis dataset,
they reveal population structure, patterns of pangenome expansion, and the role of private and
structural variation across diverse accessions. While challenges remain in variant extraction,
graph augmentation, and performance scaling, this work demonstrates that genome graphs
can be used not only to store variation, but also to interpret and analyze it in meaningful
ways. The tools and methods presented here are a step toward more flexible, interpretable,
and biologically aware graph-based genomics.