This R package (Hahsler, Piekenbrock, and Doran 2019) provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:
Clustering
Outlier Detection
Fast Nearest-Neighbor Search (using kd-trees)
The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search and are typically faster than the native R implementations (e.g., dbscan() in package fpc) as well as the implementations in WEKA, ELKI, and Python's scikit-learn.
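For example, the fast nearest-neighbor search can be used on its own via kNN(); a minimal sketch on the numeric iris measurements (the variable names here are only illustrative):

library("dbscan")

nn <- kNN(as.matrix(iris[, 1:4]), k = 5)  # kd-tree-backed 5-nearest-neighbor search
head(nn$id)    # row i: indices of the 5 nearest neighbors of point i
head(nn$dist)  # the corresponding distances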
The following R packages use dbscan: AFM, bioregion, CLONETv2, ClustAssess, cordillera, CPC, crosshap, daltoolbox, DDoutlier, diceR, dobin, doc2vec, dPCP, EHRtemporalVariability, eventstream, evprof, FCPS, fdacluster, FORTLS, funtimes, FuzzyDBScan, karyotapR, ksharp, LOMAR, maotai, metaCluster, mlr3cluster, MOSS, oclust, openSkies, opticskxi, OTclust, pagoda2, parameters, ParBayesianOptimization, performance, rMultiNet, seriation, sfdep, sfnetworks, sharp, shipunov, smotefamily, snap, spdep, spNetwork, squat, ssMRCD, stream, supc, synr, tidySEM, and weird.
To cite package ‘dbscan’ in publications use:
Hahsler M, Piekenbrock M, Doran D (2019). “dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software, 91(1), 1-30. doi:10.18637/jss.v091.i01 https://doi.org/10.18637/jss.v091.i01.
@Article{,
title = {{dbscan}: Fast Density-Based Clustering with {R}},
author = {Michael Hahsler and Matthew Piekenbrock and Derek Doran},
journal = {Journal of Statistical Software},
year = {2019},
volume = {91},
number = {1},
pages = {1--30},
doi = {10.18637/jss.v091.i01},
}
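Within R, the same citation entry can be retrieved with:

citation("dbscan")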
Stable CRAN version: Install from within R with
install.packages("dbscan")
Current development version: Install from r-universe.
install.packages("dbscan",
    repos = c("https://mhahsler.r-universe.dev",
              "https://cloud.r-project.org/"))
Load the package and use the numeric variables in the iris dataset
library("dbscan")
data("iris")
x <- as.matrix(iris[, 1:4])
DBSCAN
db <- dbscan(x, eps = 0.42, minPts = 5)
db
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.42, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 3 cluster(s) and 29 noise points.
##
## 0 1 2 3
## 29 48 37 36
##
## Available fields: cluster, eps, minPts, metric, borderPoints
Visualize the resulting clustering (noise points are shown in black).
pairs(x, col = db$cluster + 1L)
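A common heuristic for choosing eps is to look for a knee in the sorted k-nearest-neighbor distance plot with k = minPts - 1; a minimal sketch using kNNdistplot(), where the horizontal line simply marks the eps value used above:

kNNdistplot(x, k = 5 - 1)
abline(h = 0.42, lty = 2)  # eps used in the clustering above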
OPTICS
opt <- optics(x, eps = 1, minPts = 4)
opt
## OPTICS ordering/clustering for 150 objects.
## Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = NA
## Available fields: order, reachdist, coredist, predecessor, minPts, eps,
## eps_cl, xi
Extract a DBSCAN-like clustering from OPTICS and create a reachability plot (the DBSCAN clusters extracted at eps_cl = 0.4 are colored).
opt <- extractDBSCAN(opt, eps_cl = 0.4)
plot(opt)
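Clusters of varying density can also be extracted from the OPTICS ordering with the xi method; a short sketch using extractXi() (the xi value here is arbitrary):

opt_xi <- extractXi(opt, xi = 0.05)
plot(opt_xi)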
HDBSCAN
hdb <- hdbscan(x, minPts = 4)
hdb
## HDBSCAN clustering for 150 objects.
## Parameters: minPts = 4
## The clustering contains 2 cluster(s) and 0 noise points.
##
## 1 2
## 100 50
##
## Available fields: cluster, minPts, coredist, cluster_scores,
## membership_prob, outlier_scores, hc
Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.
plot(hdb, show_flat = TRUE)
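The outlier_scores field listed above holds per-point outlier scores (GLOSH) that can be used for outlier detection; a minimal sketch that flags points above an arbitrarily chosen 95th-percentile cutoff:

outliers <- hdb$outlier_scores > quantile(hdb$outlier_scores, probs = 0.95)
pairs(x, col = ifelse(outliers, "red", "grey"))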
dbscan provides tidy(), augment(), and glance() for all clustering algorithms so they can be easily used with the tidyverse, ggplot2, and tidymodels.
library(tidyverse)
db <- x %>%
    dbscan(eps = 0.42, minPts = 5)
Get cluster statistics as a tibble
tidy(db)
## # A tibble: 4 × 3
## cluster size noise
## <fct> <int> <lgl>
## 1 0 29 TRUE
## 2 1 48 FALSE
## 3 2 37 FALSE
## 4 3 36 FALSE
Visualize the clustering with ggplot2 (use an x for noise points)
augment(db, x) %>%
    ggplot(aes(x = Petal.Length, y = Petal.Width)) +
    geom_point(aes(color = .cluster, shape = noise)) +
    scale_shape_manual(values = c(19, 4))
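glance() gives a one-row summary of the clustering in the usual broom style:

glance(db)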
Using dbscan from Python requires R, the R package dbscan, and the Python package rpy2 to be installed.
import pandas as pd
import numpy as np
### prepare data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   header = None,
                   names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
# get R dbscan package
from rpy2.robjects import packages
dbscan = packages.importr('dbscan')
# enable automatic conversion of pandas dataframes to R dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()
db = dbscan.dbscan(iris_numeric, eps = 0.5, minPts = 5)
print(db)
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 17 noise points.
##
## 0 1 2
## 17 49 84
##
## Available fields: cluster, eps, minPts, dist, borderPoints
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels
## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
## 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
## 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
## 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
## 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
## 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
## dtype=int32)
The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3) with permission from the original author, Erich Schubert.