DBSCAN is an unsupervised learning algorithm. Unsupervised algorithms examine an unlabeled data set to discover its hidden structure and hierarchy automatically, and to find the patterns underlying the data.
Clustering models play two roles in data analysis: they can be used as a standalone process to find the intrinsic patterns in data, or as a preliminary step for other analytical tasks such as classification.
In the previous article, we covered k-means, a prototype-based clustering algorithm. In this article, we cover DBSCAN, a density-based clustering algorithm that often performs better, particularly when clusters are irregularly shaped or the data contains noise.
What is DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm that accounts for noise. In simple terms, given a set of points, DBSCAN groups points that are close to each other (by Euclidean distance) into a cluster, and it marks points in low-density regions as outliers. To understand the DBSCAN algorithm, let’s first go through some key concepts:
Take the following figure as an example. A circle of radius ε is drawn around every point, and the density threshold (MinPts) is set to 3. Each red point has at least 3 points within its radius, so the red points are core points.
Points B and C lie in the neighborhood of a core point, but each has only 2 points within its own radius, which is below the specified density, so B and C are boundary points.
Point N is not in the neighborhood of any core point and is not density-reachable from any core point, so N is an outlier. Point A is density-connected to B and to C, so A, B, and C belong to the same cluster.
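The three point types can be made concrete with a short sketch. The function names (`region_query`, `classify_points`) and the sample coordinates are made up for illustration, and the sketch assumes that a point counts itself when its neighborhood is measured against MinPts.

```python
import math

def region_query(points, i, eps):
    """Return the indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    # Assumption: the neighborhood count includes the point itself.
    is_core = [len(n) >= min_pts for n in neighborhoods]
    labels = []
    for i, neigh in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neigh):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical data: a tight group of three, one point on its edge, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0), (5.0, 5.0)]
labels = classify_points(points, eps=0.8, min_pts=3)
print(labels)  # ['core', 'core', 'core', 'border', 'noise']
```

Here the fourth point plays the role of B or C in the figure (only 2 points within its radius, yet inside a core point's neighborhood), and the last point plays the role of N.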
1. The algorithm finds the core points among all points according to the specified neighborhood parameters (ε, MinPts) and collects them into the core point set Ω.
2. A core point is selected at random from Ω, and all samples that are density-reachable from it are gathered to form a cluster.
3. Step 2 is repeated, each time selecting at random a core point in Ω that has not yet been assigned to a cluster, until every cluster formed by the core points has been discovered.
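The procedure above can be sketched in plain Python. This is a minimal illustration rather than an optimized implementation: it uses a brute-force O(n²) neighbor search, the names `dbscan`, `remaining`, and `NOISE` are assumptions, and a point again counts itself toward MinPts. A border point that is reachable from two clusters is assigned to whichever cluster reaches it first.

```python
import math
import random
from collections import deque

NOISE = -1  # label for points that end up in no cluster

def dbscan(points, eps, min_pts, seed=0):
    """Return one cluster label per point; NOISE marks outliers."""
    rng = random.Random(seed)
    n = len(points)
    # Brute-force eps-neighborhoods (each includes the point itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Find the set of core points (Omega in the text).
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    remaining = set(core)  # core points not yet assigned to a cluster
    labels = [NOISE] * n
    cluster_id = 0
    while remaining:
        # Pick a random unclustered core point and grow a cluster from it.
        start = rng.choice(sorted(remaining))
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            p = queue.popleft()
            if p not in core:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id
                    queue.append(q)
        # Drop every core point that was just clustered, then repeat.
        remaining -= {i for i in core if labels[i] == cluster_id}
        cluster_id += 1
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0.5, 0), (0, 0.5), (1.2, 0),
          (5, 5), (5.5, 5), (5, 5.5), (9, 0)]
labels = dbscan(points, eps=0.8, min_pts=3)
print(labels)
```

Because the starting core point is chosen at random, which group receives cluster id 0 depends on the seed. The grouping itself does not change for this data.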
Not well suited to high-dimensional data
Choosing reasonable parameters (ε, MinPts) is difficult, and the choice has a large impact on the results
Running more efficiently in Sklearn
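For reference, here is a minimal sketch of running DBSCAN through scikit-learn's `sklearn.cluster.DBSCAN`, assuming scikit-learn and NumPy are installed. The sample data and the two `eps` values are made up for illustration. In scikit-learn, `eps` corresponds to ε, `min_samples` corresponds to MinPts (the point itself is counted), and noise points receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups and one isolated point.
X = np.array([
    [0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [1.2, 0.0],
    [5.0, 5.0], [5.5, 5.0], [5.0, 5.5],
    [9.0, 0.0],
])

# eps is the neighborhood radius, min_samples the density threshold.
model = DBSCAN(eps=0.8, min_samples=3).fit(X)
labels = model.labels_                   # one label per sample, -1 means noise
core_idx = model.core_sample_indices_    # indices of the core points
print(labels)
print(core_idx)

# The same data with a smaller radius: no point has 3 samples within
# eps=0.4, so every sample is labeled as noise.
labels_small = DBSCAN(eps=0.4, min_samples=3).fit(X).labels_
print(labels_small)
```

The second run illustrates how sensitive the result is to the parameters: halving `eps` on this data turns two clean clusters into noise only.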