Introduction

Topological data analysis is a relatively new area of data science which can compare and contrast data sets via non-linear global structure. The main tool of topological data analysis, persistent homology (Edelsbrunner, Letscher, and Zomorodian 2000; Zomorodian and Carlsson 2005), builds on techniques from the field of algebraic topology to describe shape features present in a data set (stored in a “persistence diagram”). Persistent homology has been used in a number of applications, including

For a broad introduction to the mathematical background and main tools of topological data analysis with applications, see (Chazal and Michel 2017; G. Carlsson and Vejdemo-Johansson 2021).

Traditional data science pipelines in academia and industry are focused on machine learning and statistical inference of structured (tabular) data, being able to answer questions like:

While persistence diagrams have been found to be a useful summary of datasets in many domains, they are not structured data and therefore require special analysis methods. Some papers (for example (Robinson and Turner 2017; Le and Yamada 2018)) have described post-processing pipelines for analyzing persistence diagrams built on distance (Kerber, Morozov, and Nigmetov 2017) and kernel (Le and Yamada 2018) calculations, however these papers are lacking publicly available implementations in R (and Python), and many more data science methods are possible using such calculations (Murphy 2012; Scholkopf, Smola, and Muller 1998; Gretton et al. 2007; Cox and Cox 2008; Dhillon, Guan, and Kulis 2004).

TDApplied is the first R package which provides applied analysis implementations of published methods for analyzing persistence diagrams using machine learning and statistical inference. Its functions contain highly optimized and scalable code (see the package vignette “Benchmarking and Speedups”) and have been tested and validated (see the package vignette “Comparing Distance Calculations”). TDApplied can interface with other data science packages to perform powerful and flexible analyses (see the package vignette “Personalized Analyses with TDApplied”), and an example usage of TDApplied on real data has been demonstrated (see the package vignette “Human Connectome Project Analysis”).

This vignette documents the background of TDApplied functions and the usage of those functions on simulated data, by considering a typical data analysis workflow for topological data analysis:

  1. Computing and comparing persistence diagrams.
  2. Visualizing and interpreting persistence diagrams.
  3. Analyzing statistical properties of groups of persistence diagrams.
  4. Finding latent structure in groups of persistence diagrams.
  5. Predicting labels from persistence diagram structure.

To start we must load the TDApplied package:

library("TDApplied")

Let’s get started!

Computing and Comparing Persistence Diagrams

Computing Diagrams and TDApplied’s PyH Function

The main tool of topological data analysis is called persistent homology (Edelsbrunner, Letscher, and Zomorodian 2000; Zomorodian and Carlsson 2005). Persistent homology starts with either data points and a distance function, or a distance matrix storing distance values between data points. It assumes that these points arise from a dataset with some kind of “shape”. This “shape” has certain features that exist at various scales, but sampling induces noise in these features. Persistent homology aims to describe certain mathematical features of this underlying shape, by forming approximations to the shape at various distance scales. The mathematical features which are tracked include clusters (connected components), loops (ellipses) and voids (spheres), which are examples of cycles (i.e. different types of holes). The homological dimension of these features are 0, 1 and 2, respectively. What is interesting about these particular mathematical features is that they can tell us where our data is not, which is extremely important information which other data analysis methods cannot provide.

The persistent homology algorithm proceeds in the following manner: first, if the input is a dataset and distance metric, then the distance matrix, storing the distance metric value of each pair of points in the dataset, is computed. Next, a parameter \(\epsilon \geq 0\) is grown starting at 0, and at each \(\epsilon\) value we compute a shape approximation of the dataset \(C_{\epsilon}\), called a simplicial complex or in this case a Rips complex. We construct \(C_{\epsilon}\) by connecting all pairs of points whose distance is at most \(\epsilon\). To encode higher-dimensional structure in these approximations, we also add a triangle between any triple of points which are all connected (note that no triangles are formally shaded on the above diagram, even though there are certainly triples of connected points), a tetrahedron between any quadruple of points which are all connected, etc. Note that this process of forming a sequence of skeletal approximations is called a Rips-Vietoris filtration, and other methods exist for forming the approximations.

At any given \(\epsilon\) value, some topological features will exist in \(C_{\epsilon}\). As \(\epsilon\) grows, the \(C_{\epsilon}\)’s will contain each other, i.e. if \(\epsilon_{1} < \epsilon_{2}\), then every edge (triangle, tetrahedron etc.) in \(C_{\epsilon_1}\) will also be present in \(C_{\epsilon_2}\). Each topological feature of interest will be “born” at some \(\epsilon_{birth}\) value, and “die” at some some \(\epsilon_{death}\) value – certainly each feature will die once the whole dataset is connected and has trivial shape structure. Consider the example of a loop – a loop will be “born” when the last connection around the circumference of the loop is connected (at the \(\epsilon\) value which is the largest distance between consecutive points around the loop), and the loop will “die” when enough connections across the loop fill in its hole. Since the topological features are tracked across multiple scales, we can estimate their (static) location in the data, i.e. finding the points on these structures, by calculating what are called representative cycles.

The output of persistent homology, a persistence diagram, has one 2D point for each topological feature found in the filtration process in each desired homological dimension, where the \(x\)-value of the point is the birth \(\epsilon\)-value and the \(y\)-value is the death \(\epsilon\)-value. Hence every point in a persistence diagram lies above the diagonal – features die after they are born! The difference of a points \(y\) and \(x\) value, \(y-x\), is called the “persistence” of the corresponding topological feature. Points which have high (large) persistence likely represent real topological features of the dataset, whereas points with low persistence likely represent topological noise.

For a more practical and scalable computation of persistence diagrams, a method has been introduced called persistence cohomology (Silva and Vejdemo-Johansson 2011b, 2011a) which calculates the exact same output as persistent homology (with analogous “representative co-cycles” to persistent homology’s representative cycles) just much faster. Persistent cohomology is implemented in the c++ library ripser (Bauer 2015), which is wrapped in R in the TDAstats package (Wadhwa et al. 2019) and in Python in the ripser module. However, it was observed in simulations that the Python implementation of ripser seemed faster, even when called in R via the reticulate package (Ushey, Allaire, and Tang 2022) (see the package vignette “Benchmarking and Speedups”). Therefore, the TDApplied PyH function has been implemented as a wrapper of the Python ripser module for fast calculations of persistence diagrams.

There are three prerequisites that must be satisfied in order to use the PyH function:

  1. The reticulate package must be installed.
  2. Python must be installed and configured to work with reticulate.
  3. The ripser Python module must be installed, via reticulate::py_install("ripser"). Some windows machines have had issues with recent versions of the ripser module, but version 0.6.1 has been tested and does work on Windows. So, Windows users may try reticulate::py_install("ripser==0.6.1").

For a sample use of PyH we will use the following pre-loaded dataset called “circ” (which is stored as a data frame in this vignette):

We would then calculate the persistence diagram as follows:

# import the ripser module
ripser <- import_ripser()

# calculate the persistence diagram
diag <- PyH(X = circ,maxdim = 1,thresh = 2,ripser = ripser)

# view last five rows of the diagram
diag[47:51,]
#>    dimension     birth     death
#> 47         0 0.0000000 0.2545522
#> 48         0 0.0000000 0.2813237
#> 49         0 0.0000000 0.2887432
#> 50         0 0.0000000 2.0000000
#> 51         1 0.5579783 1.7385925

In the package vignette “Benchmarking and Speedups”, simulations are used to demonstrate the practical advantages of using PyH to calculate persistence diagrams compared to other alternatives.

Note that the installation status of Python for PyH is checked using the function reticulate::py_available(), which according to several online forums does not always behave as expected. If error messages occur using TDApplied functions regarding Python not being installed then we recommend consulting online resources to ensure that the py_available function returns TRUE on your system. Due to the complicated dependencies required to use the PyH function, it is only an optional function in the TDApplied package and therefore the reticulate package is only suggested in the TDApplied namespace.

Determining the Radius Threshold with TDApplied’s enclosing_radius Function

One of the most important parameters in a persistent (co)homology calculation is the maximum connectivity radius of the filtration (which is the maxdim parameter in the PyH function). A small radius threshold may prematurely stop the filtration before we can identify real large-scale features of our data, while an overly large threshold could quickly cause the calculations to become intractable. While there is no method that can determine an optimal balance between these two extremes, a radius threshold exists, called the “enclosing radius”, beyond which the topology of the dataset is trivial (i.e. there are no topological features except for one connected component). This radius is obtained by calculating, for each data point, the maximum distance from that point to all other points, and taking the minimum of these values.

The enclosing radius of a dataset can be computed using TDApplied’s enclosing_radius function. Consider the topology of a dataset sampled from a circle and its origin:

Without knowing anything about the structure of our new dataset we would not know which distance threshold to use for persistent homology calculations. We would compute the enclosing radius as follows:

enc_rad <- enclosing_radius(new_data, distance_mat = FALSE)
print(enc_rad)
#> [1] 1

This radius is the distance from the origin point to all the points on the circumference - at which point the whole graph is connected and the loop is gone:

We will see in a later section how to construct these types of graphs using the vr_graphs and plot_vr_graph functions.

Converting Diagrams to DataFrames with TDApplied’s diagram_to_df Function

The most typical data structure used in R for data science is a data frame. The output of the PyH function is a data frame (unless representatives are calculated, in which case the output is a list containing a data frame and other information), but the persistence diagrams calculated from the popular R packages TDA (B. T. Fasy et al. 2021) and TDAstats (Wadhwa et al. 2019) are not stored in data frames, making subsequent machine learning and inference analyses of these diagrams challenging. Since In order to solve this problem the TDApplied function diagram_to_df can convert TDA/TDAstats persistence diagrams into data frames:

# convert TDA diagram into data frame
diag1 <- TDA::ripsDiag(circ,maxdimension = 1,maxscale = 2,library = "dionysus")
diag1_df <- diagram_to_df(diag1)
class(diag1_df)
#> [1] "data.frame"
# convert TDAstats diagram into data frame
diag2 <- TDAstats::calculate_homology(circ,dim = 1,threshold = 2)
diag2_df <- diagram_to_df(diag1)
class(diag2_df)
#> [1] "data.frame"

When a persistence diagram is calculated with either PyH, ripsDiag or alphaComplexDiag and contains representatives, diagram_to_df only returns the persistence diagram data frame (i.e. the representatives are ignored).

Comparing Persistence Diagrams and TDApplied’s diagram_distance and diagram_kernel Functions

Earlier we mentioned that persistence diagrams do not form structured data, and now we will give an intuitive argument for why this is the case. A persistence diagram \(\{(x_1,y_1),\dots,(x_n,y_n)\}\) containing \(n\) topological features can be represented in a vector of length \(2n\), \((x_1,y_1,x_2,y_2,\dots,x_n,y_n)\). However, we cannot easily combine vectors calculated in this way into a table with a fixed number of feature columns because

  1. persistence diagrams can contain different numbers of features, meaning their vectors would be of different lengths, and
  2. the ordering of the features is arbitrary, calling into question what the right way to compare the vectors would be.

Fortunately, we can still apply many machine learning and inference techniques to persistence diagrams provided we can quantify how near (similar) or far (distant) they are from each other, and these calculations are possible with distance and kernel functions.

There are several ways to compute distances between persistence diagrams in the same homological dimension. The most common two are called the 2-wasserstein and bottleneck distances (Kerber, Morozov, and Nigmetov 2017; Edelsbrunner and Harer 2010). These techniques find an optimal matching of the 2D points in their input two diagrams, and compute a cost of that optimal matching. A point from one diagram is allowed either to be paired (matched) with a point in the other diagram or its diagonal projection, i.e. the nearest point on the diagonal line \(y=x\) (matching a point to its diagonal projection is essentially saying that feature is likely topological noise because it died very soon after it was born).

Allowing points to be paired with their diagonal projections both allows for matchings of persistence diagrams with different numbers of points (which is almost always the case in practice) and also formalizes the idea that some points in a persistence diagram represent noise. The “cost” value associated with a matching is given by either (i) the maximum of infinity-norm distances between paired points, or (ii) the square-root of the sum of squared infinity-norm between matched points. The cost of the optimal matching under loss (i) is the bottleneck distance of persistence diagrams, and the cost of the optimal matching of cost (ii) is called the 2-wasserstein metric of persistence diagrams. Both distance metrics have been used in a number of applications, but the 2-wasserstein metric is able to find more fine-scale differences in persistence diagrams compared to the bottleneck distance. The problem of finding an optimal matching can be solved with the Hungarian algorithm, which is implemented in the R package clue (Hornik 2005).

We will introduce three new simple persistence diagrams, D1, D2 and D3, for examples in this section (and future ones):

Here is a plot of the optimal matchings between D1 and D2, and between D1 and D3:

In the picture we can see that there is a “better” matching between D1 and D2 compared to D1 and D3, so the (wasserstein/bottleneck) distance value between D1 and D2 would be smaller than that of D1 and D3.

Another distance metric between persistence diagrams, which will be useful for kernel calculations, is called the Fisher information metric, \(d_{FIM}(D_1,D_2,\sigma)\) (Le and Yamada 2018). The idea is to represent the two persistence diagrams as probability density functions, with a 2D-Gaussian point mass centered at each point in both diagrams (including the diagonal projections of the points in the opposite diagram), all of variance \(\sigma^2 > 0\), and calculate how much those distributions disagree on their pdf value at each point in the plane (called their Fisher information metric).

Points in the rightmost plot which are close to white in color have the most similar pdf values in the two distributions, and would not contribute to a large distance value; however, having more points with a red color would contribute to a larger distance value.

The wasserstein and bottleneck distances have been implemented in the TDApplied function diagram_distance. We can confirm that the distance between D1 and D2 is smaller than D1 and D3 for both distances:

# calculate 2-wasserstein distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.3905125

# calculate 2-wasserstein distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.559017

# calculate bottleneck distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.3

# calculate bottleneck distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.5

There is a generalization of the 2-wasserstein distance for any \(p \geq 1\), the p-wasserstein distance, which can also be computed using the diagram_distance function by varying the parameter p, although \(p = 2\) seems to be the most popular parameter choice.

The diagram_distance function can also calculate the Fisher information metric between persistence diagrams:

# Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.02354779

# Fisher information metric calculation between D1 and D3 for sigma = 1
diagram_distance(D1,D3,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.08821907

Again, D1 and D2 are less different than D1 and D3 using the Fisher information metric.

A fast approximation to the Fisher information metric was described in (Le and Yamada 2018), and C++ code in the GitHub repository (https://github.com/vmorariu/figtree) was used to calculate this approximation in Matlab. Using the Rcpp package (Eddelbuettel and Francois 2011) this code is included in TDApplied and the approximation can be calculated by providing the positive rho parameter:

# Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.02354779
# fast approximate Fisher information metric calculation between D1 and D3 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1,rho = 0.001)
#> [1] 0.02354779

Now we will explore calculating similarity of persistence diagrams using kernel functions. A kernel function is a special (positive semi-definite) symmetric similarity measure between objects in some complicated space which can be used to project data into a space suitable for machine learning (Murphy 2012). Some examples of machine learning techniques which can be “kernelized” when dealing with complicated data are k-means (kernel k-means), principal components analysis (kernel PCA), and support vector machines (SVM) which are inherently based on kernel calculations.

There have been, to date, four main kernels proposed for persistence diagrams. In TDApplied the persistence Fisher kernel (Le and Yamada 2018) has been implemented because of its practical advantages over the other kernels – smaller cross-validation SVM error on a number of test data sets and a faster method for cross validation. For information on the other three kernels see (Kusano, Fukumizu, and Hiraoka 2018; Carriere, Cuturi, and Oudot 2017; Reininghaus et al. 2014).

The persistence Fisher kernel is computed directly from the Fisher information metric between two persistence diagrams: let \(\sigma > 0\) be the parameter for \(d_{FIM}\), and let \(t > 0\). Then the persistence Fisher kernel is defined as \(k_{PF}(D_1,D_2) = \mbox{exp}(-t*d_{FIM}(D_1,D_2,\sigma))\).

Computing the persistence Fisher kernel can be achieved with the diagram_kernel function in TDApplied:

# calculate the kernel value between D1 and D2 with sigma = 2, t = 2
diagram_kernel(D1,D2,dim = 0,sigma = 2,t = 2)
#> [1] 0.9872455
# calculate the kernel value between D1 and D3 with sigma = 2, t = 2
diagram_kernel(D1,D3,dim = 0,sigma = 2,t = 2)
#> [1] 0.9707209

As before, D1 and D2 are more similar than D1 and D3, and if desired a fast approximation to the kernel value can be computed.

Visualizing and Interpreting Persistence Diagrams

TDApplied’s Function plot_diagram

Persistence diagrams can contain any number of points representing different types of topological features from different homological dimensions. However we can easily view this information in a two-dimensional plot of the birth and death values of the topological features, where each homological dimension has a unique point shape and color, using TDApplied’s plot_diagram function:

plot_diagram(diag,title = "Circle diagram")

Now that we can visualize persistence diagrams we will describe four tools which can be used for their interpretation – filtering out noisy topological features with the universal null distribution and bootstrapping methods, representative (co)cycles and Vietoris-VR graphs (called VR graphs for short).

Filtering Topological Features and TDApplied’s universal_null Function

Noise in datasets can drastically affect the results of inference and machine learning procedures. Therefore it is desirable to clean data before applying such procedures. Noise in persistence diagrams are features whose birth and death values are very similar. The question of determining which points in a persistence diagrams are “significant” and which are “noise” has been recently addressed via a parametric approach in (Bobrowski and Skraba 2023).

The idea is to compute a test statistic for each feature based on the ratio of death divided by birth and to compare these statistics against the left-skewed Gumbel distribution to compute p-values, and use a Bonferroni correction to ensure that the probability of getting a false positive significant feature is at most \(\alpha\). This is a very strict thresholding procedure, testing each p-value \(p\) for a topological feature in dimension \(i\) against \(\alpha/n_i\) where \(n_i\) is the number of topological features of that dimension (so typical choices for \(\alpha\), like 0.05 or 0.01 etc., may be too strict for some analyses).

We would be tempted to use this procedure to locate the main loop in our circ dataset, but the results are surprising:

# determine significant features
circ_result <- universal_null(circ,
                                   thresh = enclosing_radius(circ))
circ_result$subsetted_diag
#> [1] dimension birth     death    
#> <0 rows> (or 0-length row.names)

No loop was found! The reason for this is because since circ is a “topologically perfect” dataset, with no noise and therefore only one loop, the singular loop can never be significant without smaller loops to compare it to. The same phenomenon will occur in other datasets like a sampled unit sphere, etc., but does this mean that the procedure does not work?

Certainly not! All real world datasets will contain some measure of noise and therefore will likely have more than one feature in a given homological dimension. Let’s add some noise to our unit_circ dataset and rerun the procedure:

# create circle with noise dataset and plot
circ_with_noise <- circ
x_noise <- stats::rnorm(n = 50,sd = 0.1)
y_noise <- stats::rnorm(n = 50,sd = 0.1)
circ_with_noise$x <- circ_with_noise$x + x_noise
circ_with_noise$y <- circ_with_noise$y + y_noise
plot(circ_with_noise)


# rerun the inference procedure
library(TDA)
noisy_circ_result <- universal_null(circ_with_noise, 
                                         FUN_diag = "ripsDiag",
                                         thresh = enclosing_radius(circ_with_noise),
                                         return_pvals = TRUE)
noisy_circ_result$subsetted_diag
#>    dimension     birth    death
#> 54         1 0.7802017 1.542033
noisy_circ_result$pvals
#> [1] 0.0002181983

We have successfully found the significant loop in the noisy dataset (with p-value less than alpha). Remember though that this procedure is very strict - it is much more likely to disregard real topological features than to retain noise features, so if the universal_null function says that your dataset has no significant features when you are confident there are some you can try increasing the alpha parameter to carry out a more relaxed thresholding.

While the enclosing_radius function, used in our examples, may be an appropriate connectivity threshold for most applications, it may be too computationally intensive for others. When lower thresholds are used, real topological features may be obscured because the threshold is much smaller than its (unobserved) death radius. Thankfully though, the universal_null procedure can handle these “infinite cycles”, by:

  1. Using the left-skewed Gumbel distribution and alpha parameter to determine the minimum connectivity threshold for persistent homology where a given infinite cycle feature might be significant,
  2. Computing the persistence diagram at that threshold, and
  3. Checking if the feature is still infinite (i.e. if its death value is the new larger threshold). If yes then the feature is reported as being significant, otherwise it is filtered out.

Picking a very small threshold, resulting in many infinite cycles, will be very computationally demanding requiring the computation of many persistence diagrams, so care should be taken when picking the threshold when performing this infinite cycle inference.

Here’s an example of performing inference on the noisy circ dataset, with and without infinite cycle inference at a lower threshold value:

# inference without infinite cycle inference
res_non_inf_small_thresh <- universal_null(circ_with_noise,
                                           FUN_diag = "ripsDiag",
                                           thresh = 0.9)
res_non_inf_small_thresh$subsetted_diag
#> [1] dimension birth     death    
#> <0 rows> (or 0-length row.names)

# inference with infinite cycle inference
res_inf_small_thresh <- universal_null(circ_with_noise,
                                       FUN_diag = "ripsDiag",
                                       thresh = 0.9,
                                       infinite_cycle_inference = TRUE)
res_inf_small_thresh$subsetted_diag
#>   dimension     birth death
#> 1         1 0.7802017   0.9

At this smaller threshold the procedure with infinite cycle inference was able to locate the main loop.

In a later section you will read about “representative (co)cycles”, which are subsets of the datapoints on a topological feature that can be used to help locate the feature in our dataset, but for now just know that if desired this procedure will subset the representatives for only the significant topological features, but infinite cycles will not have representatives.

Bootstrapping Topological Features and TDApplied’s bootstrap_persistence_thresholds Function

Another approach in the literature for filtering significant topological features of datasets was based on bootstrapping ((B. Fasy et al. 2014)), although this method is far slower than the aforementioned universal null distribution approach because it requires the computation of many persistence diagrams and distances. On the other hand, for topologically simple datasets without noise (like a sampled unit circle) the universal null distribution will often fail to retrieve the real topological features.

The idea of the bootstrap procedure is as follows, where \(X\) is the input data set and \(\alpha\) is the desired threshold for type 1 error:

  1. We first compute \(D\), the diagram of \(X\).
  2. Then we repeatedly sample, with replacement, the original data to obtain \(\{X_1,\dots,X_n\}\) and compute new persistence diagrams \(\{D_1,\dots,D_n\}\).
  3. We calculate the bottleneck distance of each new diagram with the original, \(d_{\infty}(D_i,D)\), in each dimension separately.
  4. Finally we compute the \(1-\alpha\) percentile of these values in each dimension.

These thresholds form a square-shaped “confidence interval” around each point in \(D\). In particular, if \(t\) was the threshold found for dimension \(k\) then the confidence interval around a point \((x,y) \in D\) (of dimension \(k\)) is the set of points \(\{(x',y'):\mbox{max}(|x-x'|,|y-y'|) < t\}\). For example, if we calculated the bottleneck-threshold-based confidence interval around the single 1-dimensional point in the circ dataset’s persistence diagram, we would get something like this:

In this setup, “significant” points will be those whose confidence intervals do not touch the diagonal line, i.e. where birth and death is the same. Note that the persistence threshold is twice the bottleneck distance threshold calculated above: the (Euclidean) distance from the point to the bottom right corner of the box is \(\sqrt{2}t\), which must be greater than or equal to the (Euclidean) distance of the point to its diagonal projection, which is \(\frac{y-x}{\sqrt{2}}\). Therefore, for the point to be considered real (i.e. significant), \(\sqrt{2}t \leq \frac{y-x}{\sqrt{2}}\), implying that the persistence of the point, \(y-x\), must be no less than twice the bottleneck threshold \(t\).

Like in (Robinson and Turner 2017) we can calculate the p-value for a feature as \(p = \frac{Z + 1}{N + 1}\) where \(Z\) is the number of bootstrap iterations which, when doubled, are at least the persistence of the feature, and \(N\) is the number of bootstrap samples. In order to ensure that the p-value of any topological feature which survives the thresholding is less than \(\alpha\) we transform \(\alpha\) by \(\alpha \leftarrow \frac{\mbox{max}\{\alpha(N + 1) - 1,0\}}{N + 1}\).

The function bootstrap_persistence_thresholds can be used to determine these persistence thresholds. Here is an example for the circ dataset, and the results can be plotted with the plot_diagram function (with overlaid p-values using the graphics text function):

# calculate the bootstrapped persistence thresholds using 2 cores
# and 30 iterations. We'll use the distance matrix of circ to
# make representative cycles more comprehensible
library("TDA")
thresh <- bootstrap_persistence_thresholds(X = as.matrix(dist(circ)),
                                           FUN_diag = 'ripsDiag',
                                           FUN_boot = 'ripsDiag',
                                           distance_mat = T,
                                           maxdim = 1,thresh = 2,num_workers = 2,
                                           alpha = 0.05,num_samples = 30,
                                           return_subsetted = T,return_pvals = T,
                                           calculate_representatives = T)
diag <- thresh$diag

# plot original diagram and thresholded diagram side-by-side, including
# p-values. These p-values are the smallest possible (1/31) when there
# are 30 bootstrap iterations
par(mfrow = c(1,2))

plot_diagram(diag,title = "Circ diagram")

plot_diagram(diag,title = "Circ diagram with thresholds",
             thresholds = thresh$thresholds)
text(x = c(0.2,0.5),y = c(2,1.8),
     paste("p = ",round(thresh$pvals,digits = 3)),
     cex = 0.5)

To see why we needed to load the TDA package prior to the function call please see the bootstrap_persistence_thresholds function documentation. The bootstrap procedure can be done in parallel or sequentially, depending on which function is specified to calculate the persistence diagrams. There are three possible such functions – TDAstatscalculate_homology, TDA’s ripsDiag and TDApplied’s PyH. The functions calculate_homology and ripsDiag can be run in parallel. However, PyH is the fastest function followed by calculate_homology based on our simulations. Both ripsDiag and PyH allow for the calculation of representative (co)cycles (i.e. the approximate location in the data of each topological feature), whereas calculate_homology does not. Therefore, our recommendation is to pick the function according to the following criteria: if a user can use the PyH function, then it should be used in all cases except for when the input data set is small, the machine has many available cores and the number of desired bootstrap iterations is large. Otherwise, use calculate_homology for speed or ripsDiag for calculating representatives.

Representative (Co)Cycles

One of the advantages of the R package TDA over TDAstats is its ability to calculate representative cycles in the data, i.e. locating the persistence diagram topological features in the input data set. Having access to representative cycles can permit deep analyses of the original data set by finding particular types of features spanned by certain subsets of data points. For example, a representative cycle of a 1-dimensional topological feature would be a set of edges between data points which lie along that feature (a loop). The PyH function can also find representative cocycles (i.e. analogues of representative cycles for persistent cohomology) in its input data, which are returned if the calculate_representatives parameter is set to TRUE. In that case, the PyH function returns a list with a data frame called “diagram”, containing the persistence diagram, and a list called “representatives”. The “representatives” list has one element for each homological dimension in the persistence diagram, with one matrix/array for each point in the persistence diagram of that dimension (except for dimension 0, where the list is always empty). The matrix for a \(d\)-dimensional feature (1 for loops, 2 for voids, etc.) has \(d + 1\) columns, where row \(i\) contains the row indices in the data set of the data points in the \(i\)-th substructure in the representative (a substructure of a loop would be an edge, a substructure of a void would be a triangle, etc.). Here is an example where we calculate the representative cocycles of our circ dataset:

# ripser has already been imported, so calculate diagram with representatives
diag_rep <- PyH(circ,maxdim = 1,thresh = 2,ripser = ripser,calculate_representatives = T)

# identify the loops in the diagram
diag_rep$diagram[which(diag_rep$diagram$dimension == 1),]
#>    dimension     birth    death
#> 50         1 0.5579783 1.738593
# show the representative for the loop, just the first five rows
diag_rep$representatives[[2]][[1]][1:5,]
#>      [,1] [,2]
#> [1,]   50   42
#> [2,]   46   42
#> [3,]   50    4
#> [4,]   42   29
#> [5,]   42   16

The representative of the one loop contains the edges found to be present in the loop. We could iterate over the representative for the loop to find all the data points in that representative:

unique(c(diag_rep$representatives[[2]][[1]][,1],diag_rep$representatives[[2]][[1]][,2]))
#>  [1] 50 46 42 29 16  7 48  4 11 25 40 37 22 43 15 20 34 32 27  8 44 36 39  5  1
#> [26] 21 24

Since the circ dataset is two-dimensional, we could actually plot the loop representative according to the following process:

  1. Pick a threshold \(\epsilon\) between the birth and death radii of the cocycle.
  2. Plot all edges between pairs of points in circ of distance no more than \(\epsilon\).
  3. Highlight all edges in the representative.

Since the death radius of the main cocycle is 1.738894, we can use the following code to plot the cocycle at thresholds value 1.7:

plot(x = circ$x,y = circ$y,xlab = "x",ylab = "y",main = "circ with representative")
for(i in 1:nrow(circ))
{
  for(j in 1:nrow(circ))
  {
    pt1 <- circ[i,]
    pt2 <- circ[j,]
    if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7)
    {
      graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]))
    }
  }
}

for(i in 1:nrow(diag_rep$representatives[[2]][[1]]))
{
  pt1 <- circ[diag_rep$representatives[[2]][[1]][i,1],]
  pt2 <- circ[diag_rep$representatives[[2]][[1]][i,2],]
  if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7)
  {
    graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]),col = "red")
  }
}

This plot shows the main loop that was found via persistent cohomology, and the representative is a set of edges (in this 2D case) whose removal would destroy the loop. A more intuitive notion of a “representative loop” can be found with persistent homology, for instance using the TDA ripsDiag function with the location parameter set to TRUE. Another example of the representative cocycle can be found using another threshold value, for instance the (rounded up) birth value of 0.6009:

Since we know that the data points in the representatives help form a loop in the original data set, we could perform a further exploratory analysis in circ to explore the periodic nature of the feature. While in this example we know that a loop will be present, this type of analysis could help find hidden latent structure in data sets.

VR Graphs and TDApplied’s vr_graphs and plot_vr_graph Functions

Integral to the plots in the previous section were that the circ dataset is 2D, but most datasets are not 2D. In order to investigate topological features of high-dimensional datasets we can use the Rips complexes which are calculated as an intermediate step in persistent homology (as described earlier). Recall that the Rips complex of a dataset is a skeletal representations of the dataset’s structure at a particular scale \(\epsilon\) formed by connecting data points of distance at most \(\epsilon\) and adding higher-dimensional structures (like triangles between triples of connected points). The connections between data points in a Rips complex form a “VR graph” (Zomorodian 2010), and this graph can be visualized regardless of the dimension of the underlying dataset.

We can use the function vr_graphs to compute VR graphs at a variety of scales, and can visualize the resulting graphs with the igraph package (Csardi and Nepusz 2006) and the plot_vr_graph function. To demonstrate this functionality, we will pick two \(\epsilon\) scales to study the dataset structure of circ, only based on its persistence diagram - the first scale will be half the birth radius of the loop and the second scale will be the average of the loop’s birth and death radius:

# get half of loop's birth radius
eps_1 <- diag[nrow(diag),2L]/2

# get mean of loop's birth and death radii
eps_2 <- (diag[nrow(diag),3L] + diag[nrow(diag),2L])/2

# compute two VR graphs
gs <- vr_graphs(X = circ,eps = c(eps_1,eps_2))

Next we can plot both VR graphs:

# plot first graph
plot_vr_graph(gs,eps_1)


# plot second graph
layout <- plot_vr_graph(gs,eps_2,return_layout = TRUE)

layout <- apply(layout,MARGIN = 2,FUN = function(X){
  
  return(-1 + 2*(X - min(X))/(max(X) - min(X)))
  
})

The first graph shows that at a scale before the loop is born, the dominant loop does not exist in the data (and that the space is more fragmented), while the second graph was able to retrieve the dominant loop. Visualizing multiple VR graphs of a dataset, choosing particular scales of interest from the persistence diagram of the dataset, can be an effective tool for interpreting topological features.

The input parameters of plot_vr_graph can help customize the graph visualizations. For example, if we want to investigate the loop we can also customize the plot to highlight the loop representative, and to only plot the component of the graph which contains the loop vertices:

We can customize one level further by removing the vertex labels and using the graph layout (i.e. x-y coordinates of all vertices) to plot other things on the graph:

# plot only component containing the loop stimuli with vertex colors
plot_vr_graph(gs,eps_2,cols = colors,
              component_of = stimuli_in_loop[[1]],
              vertex_labels = FALSE,
              layout = layout)

# get indices of vertices in loop
# not necessary in this case but necessary when we have
# removed some vertices from the graph
vertex_inds <- match(stimuli_in_loop,as.numeric(rownames(layout)))

# add volcano image over loop nodes
# image could be anything like rasters read from
# png files! this is just an example..
utils::data("volcano")
volcano <- (volcano - min(volcano)) / diff(range(volcano))
for(i in vertex_inds)
{
  graphics::rasterImage(volcano,xleft = layout[i,1L] - 0.05,
                              xright = layout[i,1L] + 0.05,
                              ybottom = layout[i,2L] - 0.05,
                              ytop = layout[i,2L] + 0.05)
}

These visualizations can help determine which points are part of a representative and what the dataset structure is at various scales.

Hypothesis Testing

Having visualized and interpreted our data of interest (persistence diagrams), a common next step in data analysis pipelines is to ask statistical questions in the form of hypothesis testing (Casella, Berger, and Company 2002). Three such questions that we will focus on are:

  1. Are groups of persistence diagrams different from each other?
  2. Are two groups of persistence diagrams independent of each other?
  3. Are two datasets good topological models of each other?

In order to answer these questions we need to generate groups of persistence diagrams. For the third question we will generate diagrams from simulated circle datasets. For the first two questions we will generate diagrams which are random deviations of D1, D2 and D3 diagrams from earlier:

The helper function generate_TDApplied_vignette_data (seen below) adds Gaussian noise with a small variance to the birth and death values of the points in D1, D2 and D3, making “noisy copies” of the three diagrams:

Here is an example D1 and two noisy copies:

Detecting Group Differences and TDApplied’s permutation_test Function

One of the most important inference procedures in classical statistics is the analysis of variance (ANOVA), which can find differences in the means of groups of normally-distributed measurements (Casella, Berger, and Company 2002). Distributions of persistence diagrams and their means can be complicated (see (Turner 2013) and (Turner et al. 2014)). Therefore, a non-parametric permutation test has been proposed which can find differences in groups of persistence diagrams. Such a test was first proposed in (Robinson and Turner 2017), and some variations have been suggested in later publications. In (Robinson and Turner 2017), two groups of persistence diagrams would be compared. The null hypothesis, \(H_0\), is that the diagrams from the two groups are generated from shapes with the same type and scale of topological features, i.e. they “come” from the same “shape”. The alternative hypothesis, \(H_A\), is that the underlying type or scale of the features are different between the two groups. In each dimension a p-value is computed, finding evidence against \(H_0\) in that dimension. A measure of within-group distances (a “loss function”) is calculated for the two groups, and that measure is compared to a null distribution for when the group labels are permuted.

This inference procedure is implemented in the permutation_test function, with several speedups and additional functionalities. Firstly, the loss function is computed in parallel for scalability since distance computations can be expensive. Secondly, we store distance calculations as we compute them because these calculations are often repeated. Additional functionality includes allowing for any number groups (not just two) and allowing for a pairing between groups of the same size as described in (Abdallah (2021)). When a natural pairing exists between the groups (such as if the groups represent persistence diagrams from the same subject of a study in different conditions) we can simulate a more realistic null distribution by restricting the way in which we permute group labels, achieving higher statistical power.

In order to demonstrate the permutation test we will detect differences between noisy copies of D1, D2, D3:

# permutation test between three diagrams
g1 <- generate_TDApplied_vignette_data(3,0,0)
g2 <- generate_TDApplied_vignette_data(0,3,0)
g3 <- generate_TDApplied_vignette_data(0,0,3)
perm_test <- permutation_test(g1,g2,g3,
                              num_workers = 2,
                              dims = c(0))
perm_test$p_values
#>          0 
#> 0.04761905

As expected, a difference was found (at the \(\alpha = 0.05\) significance level) between the three groups.

The package TDAstats also has a function called permutation_test which is based on the same test procedure. However, it uses an unpublished distance metric between persistence diagrams (see the package vignette “Comparing Distance Calculations”) and does not use parallelization for scalability. As such, care must be taken when both TDApplied and TDAstats are attached in an R script to use the particular permutation_test function desired.

Independence Between Two Groups of Paired Diagrams and TDApplied’s independence_test Function

An important question when presented with two groups of paired data is determining if the pairings are independent or not. A procedure was described in (Gretton et al. 2007) which can be used to answer this question using kernel computations, and importantly uses a parametric null distribution. The null hypothesis for this test is that the groups are independent, and the alternative hypothesis is that the groups are not independent. A test statistic called the Hilbert-Schmidt independence criteria (Gretton et al. 2007) is calculated, and its value is compared to a gamma distribution with certain parameters which are estimated from the data.

This inference procedure has been implemented in the independence_test function, and returns the p-value of the test in each desired dimension of the diagrams (among other additional information). We would expect to find no dependence between noisy copies of D1, D2 and D3, since each copy is generated randomly:

# create 10 noisy copies of D1 and D2
g1 <- generate_TDApplied_vignette_data(10,0,0)
g2 <- generate_TDApplied_vignette_data(0,10,0)

# do independence test with sigma = t = 1
indep_test <- independence_test(g1,g2,dims = c(0),num_workers = 2)
indep_test$p_values
#>         0 
#> 0.3045212

The p-value of this test would not be significant at any typical significance threshold, reflecting the fact that there is no real (i.e. non-spurious) dependence between the two groups, as expected.

Model Inference and TDApplied’s permutation_model_inference Function

The permutation test procedure allows us to test whether two groups of persistence diagrams are different, and the bootstrap procedure allows us to estimate the sampling distribution of the persistence diagram of a single dataset. Putting these two concepts together we can test whether two datasets are topologically similar or not, i.e. if each dataset is a good topological “model” of the other.

The procedure we will employ is to

  1. compute two groups of bootstrapped persistence diagrams, one from each dataset, and
  2. carry out a permutation test on the two groups.

A small p-value in a given homological dimension of this test indicates that the two datasets had a significant topological difference in that dimension. Let’s take an example of two datasets generated by a circle, our original circ and a new dataset circ2:

When we run the model inference procedure we find that there is no topological difference (at the 0.05 significance level) in dimensions 0 or 1:

mod_comp <- permutation_model_inference(circ, circ2, iterations = 20,
                                        thresh = 2, num_samples = 3,
                                        num_workers = 2L)
mod_comp$p_values
#>         0         1 
#> 0.6190476 1.0000000

Now suppose we knew there was a correspondence between the rows of the two datasets, for instance row \(i\) in both datasets corresponded to the same subject in two studies. For a practical example consider the two datasets circ and circ with a small variance Gaussian noise source:

We can carry out a paired permutation test to obtain increased statistical power as follows:

mod_comp_paired <- permutation_model_inference(circ, circ_noise, iterations = 20,
                                        thresh = 2, num_samples = 3,
                                        paired = TRUE, num_workers = 2L)
mod_comp_paired$p_values
#>         0         1 
#> 0.7619048 0.6666667

Once again the p-values are not significant. In general a high p-value only “suggests” a better model fit, but in this case we know that the two datasets are highly related.

If we wanted to carry out multiple paired tests between datasets (for instance data from the same subjects across multiple studies) we can ensure that the same bootstrap samples are used in all comparisons by explicitly defining those samples and supplying them to the procedure. If we had defined another dataset, circ3, we could carry out the paired tests as follows:

# in this case we are creating 3 samples of the rows of circ
# (and equivalently the other two datasets)
boot_samples <- lapply(X = 1:3,
                       FUN = function(X){return(unique(sample(1:nrow(circ),size = nrow(circ),replace = TRUE)))})

# carry out three model comparisons between circ, circ2 and circ3
mod_comp_1_2 <- permutation_model_inference(circ, circ2, iterations = 20,
                                        thresh = 2, num_samples = 3,
                                        paired = TRUE, num_workers = 2L,
                                        samp = boot_samples)
mod_comp_1_3 <- permutation_model_inference(circ, circ3, iterations = 20,
                                        thresh = 2, num_samples = 3,
                                        paired = TRUE, num_workers = 2L,
                                        samp = boot_samples)
mod_comp_2_3 <- permutation_model_inference(circ2, circ3, iterations = 20,
                                        thresh = 2, num_samples = 3,
                                        paired = TRUE, num_workers = 2L,
                                        samp = boot_samples)

Finding Latent Structure

Patterns may exist in a collection of persistence diagrams. For example, if the collection contained diagrams from three distinct shapes then we would like to be able to assign three distinct (discrete) labels to the correct subsets of diagrams. On the other hand, maybe the diagrams vary along several dimensions, like the mean persistence of their loops and the mean persistence of their components. In this case we would like to be able to retrieve these continuous features for visualizing all diagrams in the same (2D in this example) space. These two types of analyses are called clustering and dimension reduction respectively, and are two of the most common and popular machine learning applications. We will explore three techniques from these areas - kernel k-means from clustering, and multidimensional scaling and kernel principal components analysis from dimension reduction.

Kernel k-means and TDApplied’s diagram_kkmeans Function

Kernel k-means (Dhillon, Guan, and Kulis 2004) is a method which can find hidden groups in complex data, extending regular k-means clustering (Murphy 2012) via a kernel. A “kernel distance” is calculated between a persistence diagram and a cluster center using only the kernel function, and the algorithm converges like regular k-means. This algorithm is implemented in the function diagram_kkmeans as a wrapper of the kernlab function kkmeans. Moreover, a prediction function predict_diagram_kkmeans can be used to find the nearest cluster labels for a new set of diagrams. Here is an example clustering three groups of noisy copies from D1, D2 and D3:

# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)
                              
# calculate kmeans clusters with centers = 3, and sigma = t = 0.1
clust <- diagram_kkmeans(diagrams = g,centers = 3,dim = 0,t = 0.1,sigma = 0.1,num_workers = 2)

# display cluster labels
clust$clustering@.Data
#> [1] 3 3 3 2 2 2 1 1 1

As we can see, the diagram_kkmeans function was able to correctly separate the three generating diagrams D1, D2 and D3 (the cluster labels are arbitrary and therefore may not be 1,1,1,2,2,2,3,3,3; however, they induce the correct partition).

If we wish to predict the cluster label for new persistence diagrams (computed via the largest kernel value to any cluster center), we can use the predict_diagram_kkmeans function as follows:

# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# predict cluster labels
predict_diagram_kkmeans(new_diagrams = g_new,clustering = clust,num_workers = 2)
#> [1] 3 3 3 2 2 2 1 1 1

This function correctly predicted the cluster for each new diagram (assigning each diagram to the cluster label by D1, D2 or D3, depending on which diagram it was generated from).

Multidimensional Scaling and TDApplied’s diagram_mds Function

Dimension reduction is a task in machine learning which is commonly used for data visualization, removing noise in data, and decreasing the number of covariates in a model (which can be helpful in reducing overfitting). One common dimension reduction technique in machine learning is called multidimensional scaling (MDS) (Cox and Cox 2008). MDS takes as input an \(n\) by \(n\) distance (or dissimilarity) matrix \(D\), computed from \(n\) points in a dataset, and outputs an embedding of those points into a Euclidean space of chosen dimension \(k\) which best preserves the inter-point distances. MDS is often used for visualizing data in exploratory analyses, and can be particularly useful when the input data points do not live in a shared Euclidean space (as is the case for persistence diagrams). Using the R function cmdscale from the package stats (R Core Team (2021)) we can compute the optimal embedding of a set of persistence diagrams using any of the three distance metrics with the function diagram_mds. Here is an example of the diagram_mds function projecting nine persistence diagrams, three noisy copies sampled from each of D1, D2 and D3, into 2D space:

# create 9 diagrams based on D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate their 2D MDS embedding in dimension 0 with the bottleneck distance
mds <- diagram_mds(diagrams = g,dim = 0,p = Inf,k = 2,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(mds[,1],mds[,2],xlab = "Embedding coordinate 1",ylab = "Embedding coordinate 2",
     main = "MDS plot",col = as.factor(rep(c("D1","D2","D3"),each = 3)),bty = "L")
legend("topright", inset=c(-0.2,0), 
       legend=levels(as.factor(c("D1","D2","D3"))), 
       pch=16, col=unique(as.factor(c("D1","D2","D3"))))

The MDS plot shows the clear separation between the three generating diagrams (D1, D2 and D3), and the embedded coordinates could be used for further downstream analyses.

Kernel Principal Components Analysis, and TDApplied’s diagram_kpca Function

PCA is another dimension reduction technique in machine learning, but can be preferable compared to MDS in certain situations because it allows for the projection of new data points onto an old embedding model (Murphy 2012). For example, this can be important if PCA is used as a pre-processing step in model fitting. Kernel PCA (kPCA) (Scholkopf, Smola, and Muller 1998) is an extension of regular PCA which uses a kernel to project complex data into a high-dimensional Euclidean space and then uses PCA to project that data into a low-dimensional space. The diagram_kpca method computes the kPCA embedding of a set of persistence diagrams, and the predict_diagram_kpca function can be used to project new diagrams using a pre-trained kPCA model. Here is an example using a group of noisy copies of D1, D2 and D3:

# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate their 2D PCA embedding with sigma = t = 2
pca <- diagram_kpca(diagrams = g,dim = 0,t = 2,sigma = 2,features = 2,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(pca$pca@rotated[,1],pca$pca@rotated[,2],xlab = "Embedding coordinate 1",
     ylab = "Embedding coordinate 2",main = "PCA plot",
     col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0), 
       legend=levels(as.factor(c("D1","D2","D3"))), pch=16, 
       col=unique(as.factor(c("D1","D2","D3"))))

The function was able to recognize the three groups, and the embedding coordinates can be used for further downstream analysis. However, an important advantage of kPCA over MDS is that in kPCA we can project new points onto an old embedding using the predict_diagram_kpca function:

# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# project new diagrams onto old model
new_pca <- predict_diagram_kpca(new_diagrams = g_new,embedding = pca,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(new_pca[,1],new_pca[,2],xlab = "Embedding coordinate 1",
     ylab = "Embedding coordinate 2",main = "PCA prediction plot",
     col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0), 
       legend=levels(as.factor(c("D1","D2","D3"))), pch=16, 
       col=unique(as.factor(c("D1","D2","D3"))))

As we can see, the original three groups, and their approximate location in 2D space, is preserved during prediction.

Predicting Labels of Persistence Diagrams

One of the most valuable problems that machine learning is used to solve is that of prediction - can a variable \(Y\) be predicted from other variables \(X\). Kernel functions can be used to predict a variable \(Y\) from a persistence diagram \(D\) using a class of models called support vector machines (SVMs) (Murphy 2012).

Support Vector Machines and TDApplied’s diagram_ksvm Function

SVMs are one of the most popular machine learning techniques for regression and classification tasks. SVMs use a kernel function to project complex data into a high-dimensional space and then find a sparse set of training examples, called “support vectors”, which maximally linearly separate the outcome variable classes (or yield the highest explained variance in the case of regression).

Unlike in the case of dimension reduction or clustering, it is possible that our persistence diagrams may each have a label, either discrete or continuous, which gives us more information about the underlying data represented by the diagram. For instance, if our persistence diagrams represented periodic behavior in stock market trends during a particular temporal window then a useful (discrete) label could be whether the overall market went up or down during that period. Being able to predict such labels from the persistence diagrams themselves is a way to link persistence diagrams to external data through modeling, i.e. classification and regression. Support vector machines

SVMs have been implemented in TDApplied by the function diagram_ksvm, tailored for input datasets which contain pairs of persistence diagrams and their outcome variable labels. A prediction method is supplied called predict_diagram_ksvm which can be used to predict the label value of a set of new persistence diagrams given a pre-trained model. A parallelized implementation of cross-validation model-fitting is used based on the remarks in (Le and Yamada 2018) for scalability (which avoids needlessly recomputing persistence Fisher information metric values). Here is an example of fitting an SVM model on a list of persistence diagrams for a classification task (guessing whether the diagram comes from D1, D2 or D3):

# create thirty noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(10,10,10)

# create response vector
y <- as.factor(rep(c("D1","D2","D3"),each = 10))

# fit model with cross validation
model_svm <- diagram_ksvm(diagrams = g,cv = 2,dim = c(0),
                          y = y,sigma = c(1,0.1),t = c(1,2),
                          num_workers = 2)

We can use the function predict_diagram_ksvm to predict new diagrams like so:

# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# predict
predict_diagram_ksvm(new_diagrams = g_new,model = model_svm,num_workers = 2)
#> [1] D1 D1 D1 D2 D2 D2 D3 D3 D3
#> Levels: D1 D2 D3

As we can see the best SVM model was able to separate the three diagrams We can gain more information about the best model found during model fitting and the CV results by accessing different list elements of model_svm.

Limitations of TDApplied Functionality

There is one main limitation of TDApplied which should be discussed for its own future improvements – TDApplied functions cannot analyze numeric and factor features with persistence diagrams. This may be too inflexible for some applications, where the data may include several persistence diagrams, or a mix of persistence diagrams, numeric and categorical variables. The package vignette “Personalized Analyses with TDApplied” demonstrates how one can circumvent this issue using extra code; however, a future update to TDApplied might construct such functionality directly into its functions.

Conclusion

Current topological data analysis packages in R (and Python) do not provide the ability to carry out statistics and machine learning with persistence diagrams, leading to a high barrier to adoption of topological data analysis in academia and industry. By filling in this gap, the TDApplied package aims to bridge topological data analysis with researchers and data practitioners in the R community. Topological data analysis is an exciting and powerful new field of data analysis, and with TDApplied anyone can access its power for meaningful and creative analyses of data.

References

Abdallah, Hassan et al. 2021. Github. https://github.com/hassan-abdallah/Statistical_Inference_PH_fMRI/blob/main/Abdallah_et_al_Statistical_Inference_PH_fMRI.pdf.
Bauer, Ulrich. 2015. Persistent Homology Algorithm Toolbox. https://github.com/Ripser/ripser.
Bobrowski, Omer, and Primoz Skraba. 2023. “A Universal Null-Distribution for Topological Data Analysis.” Scientific Reports 13 (1): 12274.
Carlsson, Gunnar E., Tigran Ishkhanov, Vin de Silva, and Afra Zomorodian. 2007. “On the Local Behavior of Spaces of Natural Images.” International Journal of Computer Vision 76: 1–12.
Carlsson, Gunnar, and Mikael Vejdemo-Johansson. 2021. Topological Data Analysis with Applications. Cambridge University Press. https://doi.org/10.1017/9781108975704.
Carriere, Mathieu, Marco Cuturi, and Steve Oudot. 2017. “Sliced Wasserstein Kernel for Persistence Diagrams.” In Proceedings of the 34th International Conference on Machine Learning, edited by Doina Precup and Yee Whye Teh, 70:664–73. Proceedings of Machine Learning Research. PMLR.
Casella, G., R. L. Berger, and Brooks/Cole Publishing Company. 2002. Statistical Inference. Duxbury Advanced Series in Statistics and Decision Sciences. Thomson Learning.
Chazal, Frederic, and Bertrand Michel. 2017. “An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists.” Frontiers in Artificial Intelligence 4 (October). https://doi.org/10.3389/frai.2021.667963.
Cox, Michael A. A., and Trevor F. Cox. 2008. “Multidimensional Scaling.” In Handbook of Data Visualization, 315–47. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-33037-0_14.
Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.
Dhillon, Inderjit S., Yuqiang Guan, and Brian Kulis. 2004. “A Unified View of Kernel k-Means , Spectral Clustering and Graph Cuts.” In UTCS Technical Report.
Eddelbuettel, Dirk, and Romain Francois. 2011. Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. https://doi.org/10.18637/jss.v040.i08.
Edelsbrunner, Herbert, and John Harer. 2010. Computational Topology: An Introduction. https://doi.org/10.1007/978-3-540-33259-6_7.
Edelsbrunner, Herbert, David Letscher, and Afra Zomorodian. 2000. “Topological Persistence and Simplification.” Discrete & Computational Geometry 28: 511–33.
Fasy, Brittany T., Jisu Kim, Fabrizio Lecci, Clement Maria, David L. Millman, and Vincent Rouvreau. 2021. TDA: Statistical Tools for Topological Data Analysis. https://CRAN.R-project.org/package=TDA.
Fasy, Brittany, Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman, Sivaraman Balakrishnan, and Aarti Singh. 2014. “Confidence Sets for Persistence Diagrams.” The Annals of Statistics 42 (March): 2301–39.
Gracia-Tabuenca, Zeus, Juan Carlos Diaz-Patino, Isaac Arelio, and Sarael Alcauter. 2020. “Topological Data Analysis Reveals Robust Alterations in the Whole-Brain and Frontal Lobe Functional Connectomes in Attention-Deficit/Hyperactivity Disorder.” Eneuro.
Gretton, Arthur, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Scholkopf, and Alex Smola. 2007. “A Kernel Statistical Test of Independence.” In Advances in Neural Information Processing Systems, edited by J. Platt, D. Koller, Y. Singer, and S. Roweis. Vol. 20. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2007/file/d5cfead94f5350c12c322b5b664544c1-Paper.pdf.
Haim Meirom, Shaked, and Omer Bobrowski. 2022. “Unsupervised Geometric and Topological Approaches for Cross-Lingual Sentence Representation and Comparison.” In Proceedings of the 7th Workshop on Representation Learning for NLP, 173–83. Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.repl4nlp-1.18.
Hornik, Kurt. 2005. “A CLUE for CLUster Ensembles.” Journal of Statistical Software 14 (12). https://doi.org/10.18637/jss.v014.i12.
Hyekyoung, Lee, Moo Chung, Hyejin Kang, and Dong Lee. 2014. “Hole Detection in Metabolic Connectivity of Alzheimer’s Disease Using Kappa-Laplacian.” In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, 17:297–304. https://doi.org/10.1007/978-3-319-10443-0_38.
Kerber, Michael, Dmitriy Morozov, and Arnur Nigmetov. 2017. “Geometry Helps to Compare Persistence Diagrams.” ACM Journal of Experimental Algorithmics 22 (September). https://doi.org/10.1145/3064175.
Krishnapriyan, Aditi S. et al. 2021. “Machine Learning with Persistent Homology and Chemical Word Embeddings Improves Prediction Accuracy and Interpretability in Metal-Organic Frameworks.” Nature Scientific Report 11 (April).
Kusano, Genki, Kenji Fukumizu, and Yasuaki Hiraoka. 2018. “Kernel Method for Persistence Diagrams via Kernel Embedding and Weight Factor.” Journal of Machine Learning Research 18 (189): 1–41. https://jmlr.org/papers/v18/17-317.html.
Le, Tam, and Makoto Yamada. 2018. “Persistence Fisher Kernel: A Riemannian Manifold Kernel for Persistence Diagrams.” In Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/959ab9a0695c467e7caf75431a872e5c-Paper.pdf.
Lee, Hyekyoung, Moo K. Chung, Hyejin Kang, Bung-Nyun Kim, and Dong Soo Lee. 2011. “Discriminative Persistent Homology of Brain Networks.” In 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 841–44. https://doi.org/10.1109/ISBI.2011.5872535.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT press.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Reininghaus, Jan, Stefan Huber, Ulrich Bauer, and Roland Kwitt. 2014. “A Stable Multi-Scale Kernel for Topological Machine Learning,” December.
Robinson, Andrew, and Katharine Turner. 2017. “Hypothesis Testing for Topological Data Analysis.” Journal of Applied and Computational Topology 1.
Scholkopf, Bernhard, Alexander Smola, and Klaus-Robert Muller. 1998. “Nonlinear Component Analysis as a Kernel Eigenvalue Problem.” Neural Computation 10: 1299–319.
Silva, Morozov de, V., and M. Vejdemo-Johansson. 2011a. “Dualities in Persistent (Co)homology.” Inverse Problems 27.
———. 2011b. “Persistent Cohomology and Circular Coordinates.” Discrete & Computational Geometry 45: 737–59.
Turner, Katharine. 2013. “Means and Medians of Sets of Persistence Diagrams.” arXiv, July.
Turner, Katharine, Yuriy Mileyko, Sayan Mukherjee, and John Harer. 2014. “Frechet Means for Distributions of Persistence Diagrams.” Discrete & Computational Geometry 52 (1): 44–70.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2022. Reticulate: Interface to ’Python’. https://CRAN.R-project.org/package=reticulate.
Wadhwa, Raoul, Andrew Dhawan, Drew Williamson, and Jacob Scott. 2019. TDAstats: Pipeline for Topological Data Analysis. https://github.com/rrrlw/TDAstats.
Yen, Peter Tsung-Wen, and Siew Ann Cheong. 2021. “Using Topological Data Analysis (TDA) and Persistent Homology to Analyze the Stock Markets in Singapore and Taiwan.” Frontiers in Physics 9. https://doi.org/10.3389/fphy.2021.572216.
Zomorodian, Afra. 2010. “The Tidy Set: A Minimal Simplicial Set for Computing Homology of Clique Complexes.” In Proceedings of the Twenty-Sixth Annual Symposium on Computational Geometry, 257–66. SoCG ’10. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1810959.1811004.
Zomorodian, Afra, and Gunnar Carlsson. 2005. “Computing Persistent Homology.” Discrete and Computational Geometry 33 (February): 249–74. https://doi.org/10.1007/s00454-004-1146-y.