## Topological Data Analysis - overview

Data analysis is challenging in almost all areas of applied science, including computational genomics, because large, high-dimensional, and often noisy data are inherently difficult to interpret. Traditional data analysis tools, despite their effectiveness, have drawbacks that can introduce unwanted biases or require ad hoc adjustments. One such drawback is that results can depend on how the metric for the data is chosen.

Topological data analysis (TDA) provides a general framework for analyzing data, with the advantages of being able to extract information from large volumes of high-dimensional data, while not depending on the choice of metrics and providing stability against noise. TDA combines tools from algebraic topology and statistical learning to give a quantitative basis for the study of the "shape" of data.

One of the tools from algebraic topology used in TDA is persistent homology. It detects features in the data, at large or small scales, in the form of holes. The ability to identify large-scale holes in data, for instance, is a powerful tool for a wide range of problems in computational biology, from unraveling the history of admixtures to recognizing the presence of pathogenic pathways.
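To make the idea of "holes at a given scale" concrete, here is a minimal sketch in plain Python. It is not a full persistent homology algorithm and not the group's implementation: it builds a Vietoris-Rips complex (an assumed choice of complex) on a toy point cloud and counts connected components (b0) and one-dimensional holes (b1) over GF(2) at a few hand-picked scales. The function names and the toy data are illustrative assumptions.

```python
import math
from itertools import combinations


def gf2_rank(columns):
    """Rank over GF(2) of a matrix whose columns are Python int bitmasks."""
    basis = {}  # highest set bit -> reduced column with that pivot
    rank = 0
    for col in columns:
        while col:
            pivot = col.bit_length() - 1
            if pivot in basis:
                col ^= basis[pivot]  # eliminate the pivot bit
            else:
                basis[pivot] = col
                rank += 1
                break
    return rank


def rips_betti(points, eps):
    """Return (b0, b1) of the Vietoris-Rips complex at scale eps."""
    n = len(points)
    # Edges join points at distance <= eps.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    # Triangles are filled in whenever all three sides are edges.
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if (i, j) in edge_index and (i, k) in edge_index
                 and (j, k) in edge_index]
    # Boundary matrices, one bitmask column per simplex.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [(1 << edge_index[(i, j)]) | (1 << edge_index[(i, k)])
          | (1 << edge_index[(j, k)]) for i, j, k in triangles]
    r1, r2 = gf2_rank(d1), gf2_rank(d2)
    b0 = n - r1                   # connected components
    b1 = (len(edges) - r1) - r2   # independent 1-dimensional holes
    return b0, b1


def circle_points(n):
    """n evenly spaced points on the unit circle (toy data)."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


points = circle_points(8)
for eps in (0.5, 1.0, 2.1):
    print(eps, rips_betti(points, eps))
# 0.5 -> (8, 0): isolated points
# 1.0 -> (1, 1): one component with one hole
# 2.1 -> (1, 0): the hole is filled in
```

The hole of the circle exists only over an intermediate range of scales. Persistent homology tracks exactly such ranges across the whole filtration rather than at a few sampled values of `eps`.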

The Computational Genomics Group pursues basic and exploratory use of TDA on genomic data. Below are some examples of such applications.

### Population Genomics

As populations with multilinear transmission (e.g., mixing of genetic material from two parents) evolve over generations, the genetic transmission lines constitute complicated networks. In contrast, unilinear transmission leads to simpler network structures (trees). The genetic exchange in multilinear transmission is further influenced by migration, incubation, mixing, and other factors.

In [1] we show, based on controlled simulations, that topological characteristics have the potential to detect subtle admixture in related populations. We then apply the technique successfully to a set of avocado germplasm data, which indicates that the approach has the potential to yield novel characterizations of relatedness in populations. The figure below shows the different signatures with and without admixture, on simulated data and on real avocado data [1]. An improved and extended version of [1] is also available for download.

In [5], we identify essential elements in a basis of homology and prove that these elements are unique. We also propose a visualization of these essential elements through a rainfall-like plot (RFL), in which each essential element is associated with individual samples of the data.

### Metagenomics

A microbiome can have a complex relationship with the environment or host it inhabits, such as in gastrointestinal disease. In situations like these, visualization-of-prediction techniques can help us understand how the data is distributed and lead to new insights.

In [2] we applied TDA in the form of the Mapper algorithm to cat, dog and human microbiome data obtained from fecal samples. The goal of this approach is to accurately predict a host’s trait using only metagenomic data, by training a statistical model on available metagenome sequencing data. Mapper outputs the data as a network of clusters, and the clusters and connections can be visualized with additional meta-data about the individuals, as shown in the figure.
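The general shape of the Mapper construction can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the pipeline used in [2]: the filter function (the x coordinate), the interval cover, the fixed-threshold single-linkage clustering, and all names and parameters below are illustrative choices.

```python
import math
from itertools import combinations


def single_linkage_clusters(indices, points, threshold):
    """Connected components of the 'distance <= threshold' graph on indices."""
    parent = {i: i for i in indices}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(indices, 2):
        if math.dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)
    clusters = {}
    for i in indices:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())


def mapper_graph(points, filter_values, n_intervals, overlap, threshold):
    """Return (nodes, edges): nodes are sets of point indices, and an edge
    joins two nodes that share at least one point."""
    lo, hi = min(filter_values), max(filter_values)
    length = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        # Each interval is widened on both sides so that neighbours overlap.
        a = lo + k * length - overlap * length
        b = lo + (k + 1) * length + overlap * length
        members = [i for i, f in enumerate(filter_values) if a <= f <= b]
        nodes.extend(single_linkage_clusters(members, points, threshold))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# Toy data: 40 points on a circle, filtered by their x coordinate.
points = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]
filter_values = [x for x, _ in points]
nodes, edges = mapper_graph(points, filter_values,
                            n_intervals=4, overlap=0.25, threshold=0.3)
print(len(nodes), "clusters,", len(edges), "connections")
```

On this toy input the output network is itself a loop of clusters, which is the sense in which Mapper summarizes the shape of the data. In an application, each node could additionally be colored by metadata about the samples it contains.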

We have studied the cross-mapping of sequencing reads to multiple genomes with related sequence content and demonstrated how TDA can be applied to disentangle the resulting signals [6].

### Application to Epidemiology: topology of logic

Much of epidemiological analysis involves determining the relationship between disease and exposure to risk factors, and in particular whether a candidate exposure condition E changes the probability that an individual will have a disease diagnosis D: P(D|E). In general, the analysis tests whether P(D|E)=P(D) (independence) or not (dependence). In practice, the tests are usually more complicated, with multiple exposure variables: genetic, behavioral, and medical conditions in their own right (e.g. hyperlipidemia and type II diabetes diagnoses affect the risk for atherosclerosis), which are often highly correlated among themselves. Even the conditions of enrollment into a study can induce spurious correlations that are not present in the population at large.
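As a small worked example of the comparison between P(D|E) and P(D), the sketch below estimates both from counts. The function name and the counts are made up for illustration. Since P(D|E) = P(D&E)/P(E), the ratio P(D&E)/(P(D)P(E)) equals 1 exactly when D and E are independent.

```python
def exposure_summary(n_total, n_exposed, n_diseased, n_both):
    """Estimate P(D), P(D|E) and the ratio P(D&E) / (P(D) P(E)) from counts."""
    p_d = n_diseased / n_total
    p_d_given_e = n_both / n_exposed
    # Same ratio written with integer counts to avoid rounding noise.
    ratio = (n_both * n_total) / (n_diseased * n_exposed)
    return p_d, p_d_given_e, ratio


# Hypothetical study: 1000 subjects, 200 exposed, 100 diseased, 40 both.
p_d, p_d_given_e, ratio = exposure_summary(1000, 200, 100, 40)
print(p_d, p_d_given_e, ratio)
```

Here the estimated P(D|E) is twice P(D), so the ratio is 2. Whether such a departure from 1 is statistically meaningful is a separate question that requires a significance test.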

One approach to identifying risk exposures and relationships between diseases and conditions, described in [3], is simply to identify whether some patterns occur more or less frequently than expected if they were independent: P(D&E) > P(D)P(E) or P(D&E) < P(D)P(E), where “>” and “<” must pass statistical (binomial) tests indicating that the sample has sufficient statistical power to resolve the relationship. The variables E1, E2, … D1, D2, … represent patterns matched by some subset of the study population (subjects enrolled in the study), so each logical combination of variables labels a list of subjects. For example, E1 & E2 (“&” = “and” = set intersection) represents a list of subjects. If E1 & E2 = E1, then E1 is a subset of E2, or E1 implies E2.

If the patterns are significant (e.g. it is possible to identify a meaningful interaction or dependence between the variables), then equality between the lists can be characterized by Jaccard distances and tested by Fisher exact tests. Significant equalities are called "redescriptions": different descriptions that capture the same lists of subjects. It therefore becomes possible to explore logical relationships, including implications, via tests that identify equality among pattern members.

Nearest-neighbor linkage between statements, treated as vertices, can be defined by whether the Jaccard distance between two vertices is less than some threshold. Clusters defined as sets of vertices that are each connected to at least one other cluster member correspond most simply to the “nerves” and filtrations from which topological data analysis derives simplicial complexes, and they are reflected in hierarchical clustering with single linkage.
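The ingredients above can be sketched on a toy example: a one-sided binomial test of P(D&E) > P(D)P(E), Jaccard distances between subject lists, and threshold linkage between patterns. This is an illustrative sketch, not the procedure of [3]. The subject lists, the threshold, and the function names are assumptions, and the Fisher exact test used for redescriptions is omitted.

```python
from itertools import combinations
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); adequate for small n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets of subject IDs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def linked_components(patterns, threshold):
    """Group pattern names whose Jaccard distance is below the threshold,
    chaining links as in single-linkage clustering."""
    names = list(patterns)
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(names, 2):
        if jaccard_distance(patterns[u], patterns[v]) < threshold:
            parent[find(u)] = find(v)
    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical study of 20 subjects; each pattern is the set of matching IDs.
n_subjects = 20
patterns = {
    "E1": {0, 1, 2, 3, 4, 5},
    "E2": {0, 1, 2, 3, 4, 5, 6},
    "D1": {0, 1, 2, 3, 4, 6},
    "E3": {10, 11, 12, 13},
}

# E1 & E2 == E1, so E1 is a subset of E2 (E1 implies E2).
print(patterns["E1"] & patterns["E2"] == patterns["E1"])

# Is D1 & E1 more frequent than independence would predict?
p_null = (len(patterns["D1"]) / n_subjects) * (len(patterns["E1"]) / n_subjects)
observed = len(patterns["D1"] & patterns["E1"])
p_value = binomial_upper_tail(observed, n_subjects, p_null)
print(observed, round(p_value, 3))

# Patterns linked at Jaccard distance < 0.2.
print(linked_components(patterns, 0.2))
```

In this toy data E1 and D1 are not directly within the threshold of each other, but both are close to E2, so all three end up in one cluster. That chaining behavior is what single linkage means.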

The homology groups provide more information about structure in the logical relationships among phenotypic, demographic, and disease variables, with multiply connected homologies possibly signaling the presence of more complex physiological pathways to disease. How the relationships among these logical statements change across the range of connectivities explored by a filtration, the “persistence,” gives further clues to connectivity and structure among the variables, spanning the ranges of significant sensitivities available within a study dataset.
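For connected components only (dimension 0), persistence across a distance-threshold filtration can be computed with a union-find pass over the sorted pairwise distances, which is also how single-linkage merge heights arise. The sketch below is a generic illustration with made-up data and a hypothetical function name. Higher-dimensional homology, which is where multiply connected structure shows up, needs a full persistence algorithm.

```python
def h0_death_scales(distances, n):
    """Thresholds at which connected components merge as the filtration grows.

    distances maps a pair (i, j) of vertex indices to their distance;
    n is the number of vertices. Every component is born at threshold 0.
    """
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    deaths = []
    for (i, j), d in sorted(distances.items(), key=lambda item: item[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj   # two components merge at threshold d
            deaths.append(d)
    return deaths


# Toy data: four vertices on a line at positions 0, 1, 3 and 7.
positions = [0, 1, 3, 7]
distances = {(i, j): abs(positions[i] - positions[j])
             for i in range(4) for j in range(i + 1, 4)}
print(h0_death_scales(distances, 4))  # [1, 2, 4]
```

A large gap between consecutive merge thresholds, such as the final merge at 4 here, marks a cluster structure that persists over a wide range of thresholds.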

### Related Publications

1. Topological Signatures for Population Admixture. L. Parida, F. Utro, D. Yorukoglu, A.P. Carrieri, D. Kuhn, S. Basu. *Research in Computational Molecular Biology*, Elsevier, 2015.
2. Host Trait Prediction of Metagenomic Data for Topology-based Visualization. L. Parida, N. Haiminen, D. Haws, J. Suchodolski. *Lecture Notes in Computer Science*, Springer, 2015.
3. Characterizing redescriptions using persistent homology to isolate genetic pathways contributing to pathogenesis. D.E. Platt, S. Basu, P.A. Zalloua, L. Parida. *BMC Syst Biomed Biology*, 2016.
4. Spectral Sequences, Exact Couples and Persistent Homology of filtrations. L. Parida, S. Basu. *Expositiones Mathematicae*, 2017.
5. Essential Simplices in Persistent Homology and Subtle Admixture Detection. S. Basu, F. Utro, L. Parida. *WABI 2018, LIPIcs*, 2018.
6. Signal Enrichment of Metagenome Sequencing Reads using Topological Data Analysis. A. Guzmán-Sáenz, N. Haiminen, S. Basu, L. Parida. *BMC Genomics Supplement*, 2019.
7. A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data. S. Mandal, A. Guzmán-Sáenz, N. Haiminen, S. Basu, L. Parida. *Proc. 7th International Conference on Algorithms for Computational Biology (AlCoB), Lecture Notes in Bioinformatics*, pp. 178-187, Springer, 2020.