K-nearest-neighbors is a very well known classification algorithm. It is based on the phrase – “show me who your friends are and I’ll tell you who you are”.
Apache Spark MLLib contains several algorithms including linear regression, k-means, etc . But it does not currently include an implementation to KNN. One of the reasons for that is the time complexity it requires (roughly n^2 where n is the number of items, ignoring the dimension).
At Apache Spark JIRA you can see 2 tickets involving this issue – SPARK-2335, SPARK-2336. The first ask for KNN feature and discuss the difficulties. The second, open based on the first, discuss approximations to KNN and wish to implement it.
I implemented a very naive approximation to KNN algorithm on Apache Spark with a distance function of similarity (looking for max) rather hen euclidian but this can be easily changed (change distance function and change sorting key in line 71).
The algorithm is based on splitting the data to partitions and calculating item distances only in the same partition. You can increase the accuracy either by decreasing the number of partitions (compare to more items) or by repeating the process several times (repartition differently every time) and choose the best results.
This calculation retrieves the list of the most similar neighbors and then one can decide how to use this data.