Phylogenetic tree is widely used in many fields especially bioinfomatics. The evolution history manifested in the trees are powerful tools in the study of the migration of animals, spread of viruses and so on.
How to reconstruct efficiently and effectively such trees from real world data has been an problem of significant importance. This project utilized the tools from tropical geometry to study the structures of the space consisted of the phylogenetic trees.
$$ \def\SHADE{\color{gray}} \def\HIGH{\color{utcolor}} $$
Various models of phylogenetic trees are propsed and studied. One that is commonly used is the rooted tree with weights specified on each internal node. Here is an example of a phylogenetic tree on four species ${A,B,C,D}$. The lower the common ancestor is, the closer the two species are. For example, specie $A$ is closer to $B$ than to $D$.
It has a corresponding matrix representation and we call it $\delta$. The $ij^{\,\text{th}}$ entry $\delta(i,j)$ represents the weight of the lowest common ancestor of specie $i$ and $j$.
Notice that this matrix should be symmetric and has zero diagonal.
$$\delta=\left(
\begin{matrix}
\SHADE 0 & 5 & 7 & 9 \\ \SHADE5 & \SHADE0 & 7 & 9 \\ \SHADE7 & \SHADE7 & \SHADE0 & 9 \\ \SHADE9 & \SHADE9 & \SHADE{9} & \SHADE0 \
\end{matrix}
\right)\in \mathbb{R}^{[4] \choose 2}$$
Typically, the differences between species are compared pair-wisely and quantified via some distance measures, e.g., Hamming distance of their gene sequences:
Suppose we are interested in 4 species, we are likely to get a matrix called a dissimilarity map, where the $ij^{\,\text{th}}$ entry $d(i,j)$ represents the distance between specie $i$ to $j$.
Also, it is symmetric and has zero diagonal.
$$d=\left(
\begin{matrix}
\SHADE 0 & 2 & 4 & 6 \\ \SHADE2 & \SHADE0 & 8 & 10 \\ \SHADE4 & \SHADE8 & \SHADE0 & 12 \\ \SHADE6 & \SHADE{10} & \SHADE{12} & \SHADE0 \
\end{matrix}
\right)\in \mathbb{R}^{[4] \choose 2}$$
Not all dissimilarity maps are valid phylogenetic trees. For it to be a valid tree, it should satisfy the so-called strengthened triangle inequality: $$\forall i,j,k,\,\, d(i,k) \leq \max(d(i,j), d(j,k))$$ A $d$ that satisfying such inequality is called an ultrametric.
Thus, a natural question is how to reconstruct phylogenetic trees from the dissimilarity maps we acquired from experimental data. This has already been studied in the old days 1 2 . But they give only one or few reconstructions that can potentially introduce bias to the later analysis.
Our goal is to find a complete description of the reconstructed tree space. Before doing that, we need to find a way to measure how good our reconstruction is. A popular choice is the $\ell^\infty$-distance measure: $$ {\left\lVert d-\delta \right\rVert}_{\infty} = \max |d(i,j)-\delta(i,j)| $$
Then, mathematically, what we want to do is: given the experimental dissimilarity map $d$, find the ultrametrics that are the minimizers: $$ \text{argmin}_{\text{ultrametric } \delta} {\left\lVert d-\delta \right\rVert}_{\infty} $$ Of course, we don’t expect an analytical solution to that due to its combinatorial nature of the problem (think about permuting the vertices of the trees). Instead, we are in pursuit of an efficient algorithm, i.e., runs in $O(\text{poly})$ time, that can generate all such ultrametrics. But how can a problem with potentially combinatorially many solutions have an efficient algorithm?
Wikipedia says:
Tropical geometry is a relatively new area in mathematics, which might loosely be described as a piecewise linear or skeletonized version of algebraic geometry, using the tropical semiring instead of a field.
which translated to human language is basically that: what if we replace the addition and multiplication in our everyday algebra to the $\max$ (or $\min$) operation and addition? $$\begin{align} + &\rightarrow \max \,\,(\text{or } \min) \\ \times &\rightarrow + \end{align}$$
Surprisingly, a lot of things can happen and people have found tropical analog of classical geometry and many non-trivial applications are proposed, e.g., tropical PCA 3 . A good resources of tropical geometry is this book 4 .
The obvious reason is that both the strengthened triangle inequality $$\forall i,j,k,\,\, d(i,k) \leq \max(d(i,j), d(j,k))$$ and the $\ell^\infty$-distance measure: $$ {\left\lVert d-\delta \right\rVert}_{\infty} = \max |d(i,j)-\delta(i,j)| $$ contains the $\max$ operation, which is perfectly suitable to study using tropical geometry. And there are many pioneer studies about efficient algorithms in tropical geometry 5 . Moreover, people have found deeper and profound connections between phylogenetic trees and tropical geometry 6 7 .
Let’s get back to our problem. What tropical geometry tells us is that the space of all minimizer ultrametrics is a tropical polytope, which is the tropical analog of the classical polytope.
A tropical polytope admits multiple vertices. Among those vertices, there is a special set of the vertices that fully describes the polytope — they are called the extreme vertices or simply extremes. So the key question here is that whether or not we can find all the extreme vertices efficiently.
Firstly, given a vertex, we need to know whether it is an extreme. We’ve shown the previous characterization by Bernstein 8 for the extreme vertices is not sufficient for most cases. The only case when it’s sufficient is when the number of the leaves of the tree is $3$ 9 .
We proposed an exterior description of the tropical polytope formed by the minimizer ultrametrics. Thus we can use the algorithms proposed by Allamigeon 5 to determine the extremality of a vertex efficiently. We can also compute all the extremes. However, the time complexity is exponential.
The tangent hypergraph techniques proposed in 5 enables us to transform each vertex of the polytope as a hypergraph. We are working on an algorithm that computes the extremes based on the manipulation of the hypergraph.