Here, we will provide a walkthrough of jrSiCKLSNMF on simulated data. First, install and load the jrSiCKLSNMF package:
For this walkthrough, we will be working with simulated data object SimData. These data were generated from GSE130399 using the packages SparSim and simATAC. The details for the simulations can be found in our paper. SimData has already gone through quality control (QC); however, when working with real data, you should QC your data and select features that appear in at least 10 cells for both modalities. After loading jrSiCKLSNMF into R, you should have access to SimData.
Now, we create a SickleJr object. This object will hold all of the results and intermediary steps for jrSiCKLSNMF. We will also create a vector containing the cell types for the simulated data for use in the next section.
SimData<-jrSiCKLSNMF::SimData
DataMatrices<-SimData$Xmatrices
cell_type<-SimData$cell_type
SimSickleJr<-CreateSickleJr(DataMatrices)
#> Warning in CreateSickleJr(DataMatrices):
#> Length of names does not match length of list of count matrices. Names of count matrices will remain unchanged.
rm(DataMatrices,SimData)
Next, we will add metadata in the form of the true cell identity to the SickleJr object.
You can add any metadata to the SickleJr object. Make sure to set the variable metadata to the variable containing the metadata you wish to add and the metadataname to a string containing what you wish the metadata to be called. Here, we call the metadata “true_cell_type.” # Generation of Graph Laplacians for Graph Regularization
Next, we build our KNN graphs and extract the graph Laplacian to use for graph regularization. Note that for large matrices, this can take some time. Please set a seed if you want to have the exact same graph Laplacians every time.
set.seed(10)
SimSickleJr<-BuildKNNGraphLaplacians(SimSickleJr)
#> Warning in (function (to_check, X, clust_centers, clust_info, dtype, nn, :
#> detected tied distances to neighbors, see ?'BiocNeighbors-ties'
#> Warning in (function (to_check, X, clust_centers, clust_info, dtype, nn, :
#> detected tied distances to neighbors, see ?'BiocNeighbors-ties'
For values \(\lambda_{W^v}\) and our value \(\lambda_H\), when we constrain the rows of \(\textbf{H}\) such that the L2 Norm of each row is equal to 1 (the default), we suggest setting \(\lambda_{\textbf W^{RNA}}=3\), \(\lambda_{\textbf W^{ATAC}}=15\), and \(\lambda_{\textbf H}=0\). When not using this constraint, we suggest using \(\lambda_{\textbf W^{RNA}}=10\), \(\lambda_{W^{ATAC}}=50\), and \(\lambda_{H}=500\).
Next, we will normalize our count matrices using median library size normalization.
Now we can make plots to determine an ideal number of latent factors. Here, 5 or 6 appears to be an ideal number of latent factors. Note that this does change depending on the initialization of the graph matrices and the seed for the number of initial samples; however, the ideal number appears to be between 4 and 7. To more accurately identify an ideal number of latent factors, perform more rounds. This is the opposite behavior of the L2 Norm regularized version.
Also note that you should use run this in parallel, for more iterations, and over more d values. Default is 2 to 20 d values. and 200 iterations
SimSickleJr<-PlotLossvsLatentFactors(SimSickleJr,d_vector=4:6,parallel = FALSE,rounds=25)
#>
#> Subsample not specified. Setting to number of cells.
#>
#> Calculating initial WH matrices...
#>
#> Running jrSiCKLSNMF...
In this toy example, around 5 latent factors appears to be a good value. If you already have an idea of the number of latent factors you would like to use, you may generate the H and W matrices as below:
However, since we already calculated part of the W matrices and the shared H matrix for all cells, we can use the previously calculated values to run jrSiCKLSNMF.
Finally, we can run jrSicKLSNMF. By default, we constrain the rows of \(\textbf{H}\) such that the L2 Norm of each row is equal to 1. Please note that we store \(\textbf{H}\) as \(\textbf{H}^T\). Please note that, for real data, you should use enough rounds to reach convergence of \(10^{-6}\).
After this, we can perform cell clustering. In version 1.0, users can use k-means or spectral clustering. Here, we demonstrate k-means. We can first determine an optimal number of clusters using the fviz_nbclust function from the library factoextra and clValid from package clValid. From the plots and diagnostics, we can see that 3 appears to be the optimal number of clusters.
Finally, we can calculate and then plot the UMAP of our SickleJr. Note that if you would like to plot the UMAP of the compressed \(\mathbf{W}^v\mathbf{H}\) matrix, please enter the number corresponding to the modality you wish to see. For the plots, you can either color based off of identified cluster or based off of metadata.
SimSickleJr<-CalculateUMAPSickleJr(SimSickleJr)
#Plotting based off of cluster
SimSickleJr<-PlotSickleJrUMAP(SimSickleJr,title="K-means clusters")
#Plotting based off of true cell type metadata
SimSickleJr<-PlotSickleJrUMAP(SimSickleJr,colorbymetadata="true_cell_type",title="True Cell Types",legendname="True Cell Types")
We can also visualize data in the RNA modality and the ATAC modality.
SimSickleJr<-CalculateUMAPSickleJr(SimSickleJr,modality=1)
SimSickleJr<-PlotSickleJrUMAP(SimSickleJr,title="K-means clusters: RNA modality",
umap.modality="W1H")
SimSickleJr<-PlotSickleJrUMAP(SimSickleJr,colorbymetadata="true_cell_type",
title="True Cell Type: RNA modality",legendname="True Cell Types",
umap.modality="W1H")