DBSCAN

Density-Based Spatial Clustering of Applications with Noise
Like mean-shift clustering, it is a density-based algorithm: clusters are regions where the points are densely packed
steps
- 1. start with a point that hasn’t been visited
- 2. if this point has at least minPoints neighbours, the clustering process starts and the point becomes the first point of a new cluster
  - a neighbour is any point within distance epsilon of it
  - if there are too few neighbours, the point is labelled as noise (it may still join a cluster later if it lies within epsilon of one of that cluster’s dense points)
  - in both cases the point is marked as visited
- 3. the points within epsilon of it become part of the same cluster
- 4. repeat steps 2 and 3 for each newly added point to find all of the points of this cluster
  - we are done when all points in the epsilon neighbourhoods of the cluster have been visited
- 5. once we’re done with the current cluster, a new unvisited point is retrieved and processed, until every point has been visited
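The steps above can be sketched in plain Python. This is a minimal illustration and not a reference implementation: the names `dbscan`, `region_query`, `eps` and `min_points` are made up for this note, the neighbour count is assumed to include the point itself, and the brute-force neighbour search is only reasonable for small datasets.

```python
import math

NOISE = -1        # label for points that end up in no cluster
UNVISITED = None  # label for points that have not been processed yet


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (assumption: includes i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue                       # only start from unvisited points
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE              # too few neighbours: noise (for now)
            continue
        cluster_id += 1                    # enough neighbours: start a new cluster
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:                       # grow the cluster through its neighbours
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # earlier noise point joins as a border point
                continue
            if labels[j] is not UNVISITED:
                continue                   # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # dense point: keep expanding from it
    return labels


# Two small squares of points plus one isolated point in between.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)  # → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two squares come out as clusters 0 and 1 without the number of clusters being specified, and the isolated point is labelled as noise (-1).
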
pros
- it doesn’t require a pre-set number of clusters
- it identifies outliers as noise
- it can find arbitrarily shaped clusters quite well
cons
- it doesn’t perform as well when the clusters are of varying density
  - because the suitable epsilon and minPoints differ from cluster to cluster, while a single setting is used for the whole dataset
- it’s hard to choose epsilon for high-dimensional data, since distances between points become less informative as the number of dimensions grows
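
One common heuristic for choosing epsilon, which these notes do not cover, is to look at each point's distance to its k-th nearest neighbour and pick epsilon near the "knee" where the sorted distances jump. The sketch below is only an illustration under assumptions: `k_distances` and `k` are hypothetical names, the search is brute force, and reading off the knee remains a judgment call (even more so in high dimensions).

```python
import math


def k_distances(points, k):
    """Distance from each point to its k-th nearest other point, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Two small squares of points plus one isolated point in between.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
curve = k_distances(points, k=2)
print([round(d, 2) for d in curve])
# → [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 6.4]
```

Here the sorted distances stay at 1.0 for the dense points and jump to about 6.4 for the isolated one, which suggests an epsilon slightly above 1.0. When clusters have different densities the curve tends to show several jumps, which is the same difficulty described in the cons above.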
