The embedding problem for Markov matrices by Jordi Roca Lacostena

Jordi Roca Lacostena defended his PhD thesis The embedding problem for Markov matrices, supervised by Professors Marta Casanellas Rius and Jesús Fernández Sánchez, on 18 May 2021 within the UPC doctoral program in Applied Mathematics. Currently, he is a researcher and data scientist at Amalfi Analytics, a UPC spin-off focused on the use of artificial intelligence and machine learning for health and clinic management.

Thesis summary

The goal of this thesis is to solve the embedding problem for Markov matrices, posed by Gustav Elfving in 1937. This problem consists in deciding whether a given non-negative square matrix M with rows summing to 1 (Markov matrix or transition matrix) can be written as M = exp(Q) for some real square matrix Q with rows summing to zero and non-negative off-diagonal entries). In this case, such a Markov matrix is said to be embeddable and Q is called a Markov generator for M.

The embedding problem is motivated by Markov processes, which are used to model the change of state of a random variable over time under the assumption that future is independent from past given the present. In this framework, the entries of Markov matrices represent the substitution probabilities between states along a fixed time interval. When these probabilities are considered to be continuous (and differentiable) functions depending on time, then the process is ruled by the instantaneous rates of substitution between states, which are usually assumed to be constant with respect to the time in order to keep the process tractable. In this context, the Markov process is said to be a homogeneous Markov process in continuous-time and the substitution rates ruling it are displayed all together in a rate matrix Q. By construction, the substitution probabilities between states after time t correspond to the entries of the matrix M(t) = exp(Qt) for all t ≥ 0. Hence, the embedding problem is equivalent to decide whether the substitution process given by a transition matrix could potentially arise from a homogeneous Markov process in continuous-time. It is worth noting
that, in this case such a transition matrix may admit more than one Markov generator, each corresponding to a different underlying continuous-time process with the same terminal substitution probabilities but with different values for the intermediate substitution probabilities. The study of the uniqueness of Markov generators is known as the rate-identifiability problem and it is also studied in this thesis.

Prior to the results presented in the thesis, the embedding problem was already solved for 2 × 2 and 3 × 3 Markov matrices, as well as for some other particular cases such as Markov matrices close to the identity or Markov matrices with different real eigenvalues. However, these solutions took around fifty years to be found. The characterization of 4 × 4 embeddable matrices did not only mean solving the embedding problem for the smaller size of matrices for which it is not solved, but also had practical consequences in biology. Indeed, Markov processes are typically used in the context of phylogenetics to model the substitution of nucleotides over time in a given DNA sequence. The embedding problem appears related to fundamental questions concerning the consistency and definition of some of the resulting nucleotide substitution models.

In this thesis, we present a method to decide whether a Markov matrix with different eigenvalues (real or not) is embeddable. This effectively solves the embedding problem in a dense subset of Markov matrices for any size. Moreover, we also provide a explicit solution for all 4 × 4 Markov matrices and the resulting criteria for embeddability when restricted to some commonly used nucleotide substitution models, such as Kimura models and the strand-symmetric model. We also show that the proportion of embeddable matrices (i.e. those potentially arising from a continuous-time version of the model) decreases as the model constraints are relaxed. More precisely, we observed that the proportion of embeddable matrices lies in the range between the 0.05% and the 75% depending on the complexity of the model, showing that the restriction to time-continuous nucleotide substitution models might be much more restrictive than expected.

Highlighted publication

References
[1] Marta Casanellas, Jesús Fernández-Sanchez, Jordi RocaLacostena (2021). The embedding problem for Markov matrices. Publicacions Matemàtiques (accepted for publication).

[2] Jordi Roca-Lacostena (2021). Generating embeddable matrices whose principal logarithm is not a Markov generator. Chapter in Extended Abstracts GEOMVAP 2019 – Geometry, Topology, Algebra, and Applications; Women in Geometry and Topology. Birkhäuser Basel. DOI: 10.1007/978-3-030-84800-2.

[3] Marta Casanellas, Jesús Fernández-Sanchez, Jordi Roca-Lacostena (2021). An open set of 4×4 embeddable matrices whose principal logarithm is not a Markov generator. Linear and Multilinear Algebra, 0(0), pp. 1-12. DOI: 10.1080/03081087.2020.1854165.

[4] Marta Casanellas, Jesús Fernández-Sanchez, Jordi RocaLacostena (2020). Embeddability and rate identifiability of Kimura 2-parameter matrices. Journal of Mathematical Biology, 80(4), pp. 995-1019. DOI: 10.1007/s00285-019-01446-0

[5] Jordi Roca-Lacostena, Jesús Fernández-Sanchez (2018). Embeddability of Kimura 3ST Markov matrices. Journal of Theoretical Biology, 445, pp. 128-135. DOI: 10.1016/j.jtbi.2018.02.005