TL;DR – Yet another clustering evaluation metric
The Davies-Bouldin index was introduced by David L. Davies and Donald W. Bouldin in "A Cluster Separation Measure" (IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1 (2): 224–227, doi:10.1109/TPAMI.1979.4766909, full pdf)
Just like the Silhouette score, the Calinski-Harabasz index and the Dunn index, the Davies-Bouldin index provides an internal evaluation scheme, i.e. the score is based on the clustered data itself and not on external knowledge such as ground-truth labels.
The Silhouette score reflects how similar a point is to the cluster it is assigned to. For each point we compute the average distance of the point to the points in the nearest neighboring cluster, subtract the average distance of the point to the points in its own cluster, and divide by the maximum of those two distances. The overall score is the average of the per-point scores. The Silhouette score is bounded between -1 and 1, and a higher score means more distinct clusters.
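For a quick sanity check, here is a minimal sketch computing the score with scikit-learn (the synthetic blobs and the KMeans setup are just for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: three well-separated blobs, clustered with KMeans.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

# Bounded in [-1, 1]; higher means more distinct clusters.
print(silhouette_score(X, labels))
```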
The Calinski-Harabasz index compares the between-cluster variance to the within-cluster variance. This measure is much simpler to calculate than the Silhouette score; however, it is not bounded. The higher the score, the better the separation.
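The same setup works here; a minimal sketch, assuming a recent scikit-learn release (older versions spell the function calinski_harabaz_score):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

# Unbounded; higher means better separation.
print(calinski_harabasz_score(X, labels))
```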
The intuition behind the Davies-Bouldin index is to take the ratio between the within-cluster distances and the between-cluster distances and average it over all clusters. It is therefore relatively simple to compute, it is bounded below by 0, and a lower score is better (note that, unlike the Silhouette score, it has no upper bound). However, since it measures the distance between cluster centroids, it is restricted to the Euclidean distance function.
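To make that concrete, here is a minimal NumPy sketch that follows the definition above (an illustration of the formula, not necessarily identical to the implementation linked below): S_i is the average distance of the points in cluster i to its centroid, M_ij is the distance between the centroids of clusters i and j, and the index averages max_j (S_i + S_j) / M_ij over the clusters.

```python
import numpy as np

def davies_bouldin_index(X, labels):
    clusters = np.unique(labels)
    k = len(clusters)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # S_i: average Euclidean distance of cluster i's points to its centroid.
    s = np.array([np.mean(np.linalg.norm(X[labels == c] - centroids[i], axis=1))
                  for i, c in enumerate(clusters)])
    total = 0.0
    for i in range(k):
        # R_ij = (S_i + S_j) / M_ij; each cluster contributes its
        # worst (largest) ratio against any other cluster.
        total += max((s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                     for j in range(k) if j != i)
    return total / k
```

On well-separated data like the blobs above the value stays small, while overlapping clusters with close centroids can push it above 1, as discussed in the comments below.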
The Silhouette score and the Calinski-Harabasz index were implemented previously and are already part of scikit-learn, so I implemented the Davies-Bouldin index. Hopefully this will be my first contribution to scikit-learn; my implementation is here.
Thanks for this nice implementation.
Hey! It is a nice implementation. I copied your function davies_bouldin_index into my source code and imported the required modules, but I am getting a score greater than 1. Is that possible?
I think that if the clustering is something like the DBSCAN result in the first row (here – http://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/cluster/plot_cluster_comparison.html), then the centroids (M_ij) are really close to each other while the S_i's are larger, and that can produce values greater than 1. For example, with S_i = S_j = 1 and M_ij = 0.5, the ratio (S_i + S_j) / M_ij is already 4.
Hi Tom! Congrats. I found your implementation by chance, only after I had implemented my own function to compute the DB index. I checked mine against yours and we're getting the same results (values greater than 1 included). Why is your implementation not merged into scikit-learn already on GitHub? Could you check if there is something else you need to address? I would really like to see it in scikit. 🙂
Thank you for the kind words.
This is the current status of the PR – https://github.com/scikit-learn/scikit-learn/pull/7942
I might get to it someday 🙂
I know, I downloaded the code from there. However, it's been stalled for a while now, and it would be nice for us to finally see it merged…