Tropical Geometry of Phylogenetic Trees

Phylogenetic tree is widely used in many fields especially bioinfomatics. The evolution history manifested in the trees are powerful tools in the study of the migration of animals, spread of viruses and so on.

How to reconstruct efficiently and effectively such trees from real world data has been an problem of significant importance. This project utilized the tools from tropical geometry to study the structures of the space consisted of the phylogenetic trees.

$$ \def\SHADE{\color{gray}} \def\HIGH{\color{utcolor}} $$

Phylogenetic tree

Various models of phylogenetic trees are propsed and studied. One that is commonly used is the rooted tree with weights specified on each internal node. Here is an example of a phylogenetic tree on four species ${A,B,C,D}$. The lower the common ancestor is, the closer the two species are. For example, specie $A$ is closer to $B$ than to $D$.

It has a corresponding matrix representation and we call it $\delta$. The $ij^{\,\text{th}}$ entry $\delta(i,j)$ represents the weight of the lowest common ancestor of specie $i$ and $j$. Notice that this matrix should be symmetric and has zero diagonal. $$\delta=\left( \begin{matrix} \SHADE 0 & 5 & 7 & 9 \\ \SHADE5 & \SHADE0 & 7 & 9 \\ \SHADE7 & \SHADE7 & \SHADE0 & 9 \\ \SHADE9 & \SHADE9 & \SHADE{9} & \SHADE0 \
\end{matrix} \right)\in \mathbb{R}^{[4] \choose 2}$$

Dissimilarity map

Typically, the differences between species are compared pair-wisely and quantified via some distance measures, e.g., Hamming distance of their gene sequences:

Suppose we are interested in 4 species, we are likely to get a matrix called a dissimilarity map, where the $ij^{\,\text{th}}$ entry $d(i,j)$ represents the distance between specie $i$ to $j$. Also, it is symmetric and has zero diagonal. $$d=\left( \begin{matrix} \SHADE 0 & 2 & 4 & 6 \\ \SHADE2 & \SHADE0 & 8 & 10 \\ \SHADE4 & \SHADE8 & \SHADE0 & 12 \\ \SHADE6 & \SHADE{10} & \SHADE{12} & \SHADE0 \
\end{matrix} \right)\in \mathbb{R}^{[4] \choose 2}$$

Not all dissimilarity maps are valid phylogenetic trees. For it to be a valid tree, it should satisfy the so-called strengthened triangle inequality: $$\forall i,j,k,\,\, d(i,k) \leq \max(d(i,j), d(j,k))$$ A $d$ that satisfying such inequality is called an ultrametric.

Reconstruction

Thus, a natural question is how to reconstruct phylogenetic trees from the dissimilarity maps we acquired from experimental data. This has already been studied in the old days ¹². But they give only one or few reconstructions that can potentially introduce bias to the later analysis.

Our goal is to find a complete description of the reconstructed tree space. Before doing that, we need to find a way to measure how good our reconstruction is. A popular choice is the $\ell^\infty$-distance measure: $$ {\left\lVert d-\delta \right\rVert}_{\infty} = \max |d(i,j)-\delta(i,j)| $$

Then, mathematically, what we want to do is: given the experimental dissimilarity map $d$, find the ultrametrics that are the minimizers: $$ \text{argmin}_{\text{ultrametric } \delta} {\left\lVert d-\delta \right\rVert}_{\infty} $$ Of course, we don’t expect an analytical solution to that due to its combinatorial nature of the problem (think about permuting the vertices of the trees). Instead, we are in pursuit of an efficient algorithm, i.e., runs in $O(\text{poly})$ time, that can generate all such ultrametrics. But how can a problem with potentially combinatorially many solutions have an efficient algorithm?

Tropical geometry

Wikipedia says:

Tropical geometry is a relatively new area in mathematics, which might loosely be described as a piecewise linear or skeletonized version of algebraic geometry, using the tropical semiring instead of a field.

which translated to human language is basically that: what if we replace the addition and multiplication in our everyday algebra to the $\max$ (or $\min$) operation and addition? $$\begin{align} + &\rightarrow \max \,\,(\text{or } \min) \\ \times &\rightarrow + \end{align}$$

Surprisingly, a lot of things can happen and people have found tropical analog of classical geometry and many non-trivial applications are proposed, e.g., tropical PCA³. A good resources of tropical geometry is this book⁴.

Why tropical geometry here?

The obvious reason is that both the strengthened triangle inequality $$\forall i,j,k,\,\, d(i,k) \leq \max(d(i,j), d(j,k))$$ and the $\ell^\infty$-distance measure: $$ {\left\lVert d-\delta \right\rVert}_{\infty} = \max |d(i,j)-\delta(i,j)| $$ contains the $\max$ operation, which is perfectly suitable to study using tropical geometry. And there are many pioneer studies about efficient algorithms in tropical geometry⁵. Moreover, people have found deeper and profound connections between phylogenetic trees and tropical geometry⁶⁷.

Let’s get back to our problem. What tropical geometry tells us is that the space of all minimizer ultrametrics is a tropical polytope, which is the tropical analog of the classical polytope.

A tropical polytope projected on $\mathbb{R}^2$. Black dots are vertices of the polytope.

A tropical polytope admits multiple vertices. Among those vertices, there is a special set of the vertices that fully describes the polytope — they are called the extreme vertices or simply extremes. So the key question here is that whether or not we can find all the extreme vertices efficiently.

Current progress

Firstly, given a vertex, we need to know whether it is an extreme. We’ve shown the previous characterization by Bernstein⁸ for the extreme vertices is not sufficient for most cases. The only case when it’s sufficient is when the number of the leaves of the tree is $3$⁹.

We proposed an exterior description of the tropical polytope formed by the minimizer ultrametrics. Thus we can use the algorithms proposed by Allamigeon⁵ to determine the extremality of a vertex efficiently. We can also compute all the extremes. However, the time complexity is exponential.

The tangent hypergraph techniques proposed in ⁵ enables us to transform each vertex of the polytope as a hypergraph. We are working on an algorithm that computes the extremes based on the manipulation of the hypergraph.

References

Phylogenetic trees

1 Naoko Takezaki and Masatoshi Nei. Genetic distances and reconstruction of phylogenetic trees from microsatellite DNA. Genetics, 144(1):389-399, 1996.

2 Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4):406-425, 1987.

Tropical geometry

3 Ruriko Yoshida, Leon Zhang and Xu Zhang. Tropical principal component analysis and its application to phylogenetics. Bulletin of mathematical biology, 81(2):568-597, 2019.

4 Diane Maclagan and Bernd Sturmfels. Introduction to tropical geometry. American Mathematical Soc., volume 161, 2015.

5 Xavier Allamigeon, Stéphane Gaubert and Éric Goubault. Computing the vertices of tropical polyhedra using directed hypergraphs. Discrete & Computational Geometry, 49(2):247-279, 2013.

6 Federico Ardila and Caroline J Klivans. The Bergman complex of a matroid and phylogenetic trees. Journal of Combinatorial Theory, Series B, 96(1):38-49, 2006.

7 Bo Lin, Anthea Monod and Ruriko Yoshida. Tropical foundations for probability & statistics on phylogenetic tree space. arXiv preprint arXiv:1805.12400, 2018.

8 Daniel Irving Bernstein. L-infinity optimization to Bergman fans of matroids with an application to phylogenetics. arXiv preprint arXiv:1702.05141, 2017.

9 Luyan Yu. Extreme rays of the $\ell^\infty $-nearest ultrametric tropical polytope. Linear Algebra and its Application, 587:23-44, 2019.