Algebraic and semi-algebraic phylogenetic reconstruction by Marina Garrote López

Marina Garrote López defended her PhD thesis Algebraic and semi-algebraic phylogenetic reconstruction, supervised by Professor Marta Casanellas and Professor Jesús Fernández-Sánchez, on 22 July 2021 within the UPC doctoral program in Applied Mathematics. She is currently a postdoctoral researcher in the Nonlinear Algebra group of Professor Bernd Sturmfels at the Max Planck Institute for Mathematics in the Sciences in Leipzig. In spring 2022, she will visit Professor Elizabeth S. Allman and Professor John A. Rhodes at the University of Alaska, Fairbanks. In summer 2022, she will move to the University of British Columbia to work as a postdoctoral researcher with Professor Elina Robeva.

Thesis summary

Phylogenetics is the study of the evolutionary history and relationships among groups of biological entities (called taxa). These evolutionary processes are modeled by phylogenetic trees whose nodes represent different taxa and whose branches correspond to the evolutionary processes between them. The leaves symbolize contemporary taxa and the root is their common ancestor. Phylogenetic reconstruction aims to estimate the phylogenetic tree that best explains the evolutionary relationships of current taxa using solely information from their genome. We focus on the reconstruction of the topology of phylogenetic trees, which means reconstructing the shape of the tree considering labels at the leaves.

To this end, one usually assumes that DNA sequences evolve according to a Markov process on a phylogenetic tree ruled by a model of nucleotide substitutions. These substitution models are specified by transition matrices associated with the edges of the tree and by a distribution of nucleotides at the root. Given a tree T, one can compute the distribution of nucleotide patterns at the leaves of T in terms of the parameters of the model. This joint distribution is represented as a vector whose entries can be expressed as polynomials in the model parameters. There exist certain algebraic relationships between the entries of the joint distribution, and the study of these relationships and of the geometry of the algebraic varieties that they define (called phylogenetic varieties) has provided successful results on the problem of phylogenetic reconstruction. However, from a biological perspective we are not interested in the whole variety, but only in points that have arisen on a tree with stochastic parameters. The description of such distributions leads to semi-algebraic constraints, and the region of the algebraic varieties defined by them is called the phylogenetic stochastic region. This semi-algebraic description seems important since it characterizes the distributions that make biological and probabilistic sense, but could it improve the existing algebraic tools for phylogenetic reconstruction?
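The construction of the joint distribution can be illustrated with a small numerical sketch. It assumes a quartet tree with leaves 1 and 2 attached to the root and leaves 3 and 4 attached to a second internal node, with randomly generated stochastic parameters; the function names and the parameter values are hypothetical and are not taken from the thesis. The final lines check one well-known example of an algebraic relationship, the rank of the flattening matrix along the internal edge, which the summary above does not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transition_matrix(k=4, diag_weight=5.0):
    """Hypothetical helper: a random row-stochastic k x k matrix."""
    M = rng.random((k, k)) + diag_weight * np.eye(k)
    return M / M.sum(axis=1, keepdims=True)

def quartet_joint_distribution(pi, M1, M2, M3, M4, M5):
    """Joint distribution of nucleotide patterns at the four leaves.

    Each entry p[i, j, k, l] is a polynomial in the parameters:
    sum over hidden states x (root) and y (other internal node) of
    pi[x] * M1[x, i] * M2[x, j] * M5[x, y] * M3[y, k] * M4[y, l].
    """
    return np.einsum("x,xi,xj,xy,yk,yl->ijkl", pi, M1, M2, M5, M3, M4)

# Stochastic parameters (assumed values, for illustration only).
pi = rng.random(4)
pi = pi / pi.sum()
M1, M2, M3, M4, M5 = (random_transition_matrix() for _ in range(5))

p = quartet_joint_distribution(pi, M1, M2, M3, M4, M5)

# Flattening along the true split 12|34: rows indexed by (leaf1, leaf2),
# columns by (leaf3, leaf4). Its rank is at most 4.
flat_true = p.reshape(16, 16)
# Flattening along the wrong split 13|24.
flat_wrong = p.transpose(0, 2, 1, 3).reshape(16, 16)

rank_true = np.linalg.matrix_rank(flat_true, tol=1e-10)
rank_wrong = np.linalg.matrix_rank(flat_wrong, tol=1e-10)
print("total probability:", p.sum())
print("rank along 12|34:", rank_true, "| rank along 13|24:", rank_wrong)
```

The tensor `p` has 256 entries, one per nucleotide pattern at the leaves. The low rank of the flattening along the true split, compared with the generically higher rank along a wrong split, is the kind of algebraic constraint that can distinguish tree topologies.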

To answer this question, we compute the Euclidean distance of data points to the phylogenetic varieties and to their stochastic regions for cases of special interest in phylogenetics, such as trees with short external branches and trees subject to the long branch attraction phenomenon [5]. In some cases, we compute these distances analytically and can decide which tree's stochastic region is closer to the data point. As a consequence, we can prove that, even if the data point is close to the phylogenetic variety of a given tree, it might be closer to the stochastic region of another tree. In particular, under long branch attraction, considering the phylogenetic stochastic region seems to be fundamental for the phylogenetic reconstruction problem [2].
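A much simpler, related computation can be sketched numerically. The example below is not the distance computed in the thesis: it measures the Frobenius distance from each flattening of a four-leaf distribution to the set of matrices of rank at most 4, using the Eckart–Young theorem. That set is only a relaxation of the phylogenetic variety and ignores the stochastic region entirely. All names, parameter values, and the sample size are assumptions made for illustration.

```python
import numpy as np

# Axis orders that bring each leaf bipartition to (rows | columns).
SPLITS = {"12|34": (0, 1, 2, 3), "13|24": (0, 2, 1, 3), "14|23": (0, 3, 1, 2)}

def flatten(p, split):
    """Flatten a 4x4x4x4 tensor into a 16x16 matrix along a leaf split."""
    return np.transpose(p, SPLITS[split]).reshape(16, 16)

def distance_to_rank_at_most(F, k=4):
    """Frobenius distance from F to the nearest matrix of rank <= k.

    By the Eckart-Young theorem this is the square root of the sum of
    the squared singular values beyond the k largest ones.
    """
    s = np.linalg.svd(F, compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))

def closest_split(p):
    """Return the split with the smallest distance, plus all distances."""
    scores = {name: distance_to_rank_at_most(flatten(p, name)) for name in SPLITS}
    return min(scores, key=scores.get), scores

# Toy data (assumed parameters): a distribution generated on the tree 12|34.
rng = np.random.default_rng(1)

def random_transition_matrix():
    M = rng.random((4, 4)) + 5.0 * np.eye(4)
    return M / M.sum(axis=1, keepdims=True)

pi = np.full(4, 0.25)
M = [random_transition_matrix() for _ in range(5)]
p = np.einsum("x,xi,xj,xy,yk,yl->ijkl", pi, M[0], M[1], M[4], M[2], M[3])

# A simulated "data point": relative pattern frequencies in an alignment
# of 10000 sites sampled from p.
counts = rng.multinomial(10000, p.ravel()).reshape(p.shape)
p_hat = counts / counts.sum()

best_exact, scores_exact = closest_split(p)
best_sample, scores_sample = closest_split(p_hat)
print("exact distribution:", best_exact, scores_exact)
print("sampled data point:", best_sample, scores_sample)
```

For the exact distribution, the distance along the true split is zero up to rounding error. For the sampled data point, all three distances are positive and the comparison between them is what carries the signal.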

However, incorporating semi-algebraic tools into phylogenetic reconstruction methods can be extremely difficult, and how to do so is far from evident. In this thesis, we present two phylogenetic reconstruction methods that combine algebraic and semi-algebraic conditions for the general Markov model, based on the result proved in [1]. The first method is SAQ, which stands for Semi-Algebraic Quartet reconstruction method [3]. Next, we introduce an improved method, ASAQ (for Algebraic and Semi-Algebraic Quartet reconstruction method [4]), which combines SAQ with the method Erik+2 (based on certain algebraic constraints). Both are phylogenetic reconstruction methods for DNA alignments on four taxa and have been proven to be statistically consistent.

We test the proposed methods on simulated and real data to check their actual performance in several scenarios, both consistent with and violating the assumptions of the methods. Our results show that both SAQ and ASAQ are highly successful, even with short alignments and with data that violate their assumptions.

Highlighted publications

References
[1] M. Casanellas, J. Fernández-Sánchez, and M. Garrote-López. The inertia of the symmetric approximation for low-rank matrices. Linear and Multilinear Algebra 66(11):2349–2353 (2018).

[2] M. Casanellas, J. Fernández-Sánchez, and M. Garrote-López. Distance to the stochastic part of phylogenetic varieties. Journal of Symbolic Computation 104:653–682 (2021).

[3] M. Casanellas, J. Fernández-Sánchez, and M. Garrote-López. SAQ: Semi-algebraic quartet reconstruction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1 (2021).

[4] M. Casanellas, J. Fernández-Sánchez, M. Garrote-López, and M. Sabaté-Vidales. Designing weights for quartet-based methods when data is branch-heterogeneous. In Preparation, 2021.

[5] M. Garrote-López. Computing the distance to the stochastic part of a phylogenetic variety. Extended Abstracts GEOMVAP 2019: Geometry, Topology, Algebra, and Applications. Women in Geometry and Topology, 2022.
