KNN approximation Apache Spark

K-nearest-neighbors is a very well known classification algorithm. It is based on the phrase – “show me who your friends are and I’ll tell you who you are”.

Apache Spark MLLib contains several algorithms including linear regression, k-means, etc . But it does not currently include an implementation to KNN. One of the reasons for that is the time complexity it requires (roughly n^2 where n is the number of items, ignoring the dimension).
At Apache Spark JIRA you can see 2 tickets involving this issue – SPARK-2335, SPARK-2336. The first ask for KNN feature and discuss the difficulties. The second, open based on the first, discuss approximations to KNN and wish to implement it.

I implemented a very naive approximation to KNN algorithm on Apache Spark with a distance function of similarity (looking for max) rather hen euclidian but this can be easily changed (change distance function and change sorting key in line 71).

The algorithm is based on splitting the data to partitions and calculating item distances only in the same partition. You can increase the accuracy either by decreasing the number of partitions (compare to more items) or by repeating the process several times (repartition differently every time) and choose the best results.

This calculation retrieves the list of the most similar neighbors and then one can decide how to use this data.

My gist –


5 interesting things (15/11/2015)

Counting things in Python – This post spotlights very nicely and simple how Python and Pythonic writing changed over the years. One of the interesting things in this post is the analogy to natural languages. Natural languages also changes and evolve over time – slang, new phrases, out-dates expressions etc. and apparently so is programming languages.

Travelling Beer drinker problem – Figuring out the shortest road trip path to visit best microbreweries in US. Finally a good use to the travelling sales man problem :). Data science wise there is not much into this problem – connecting the dots between location data and google API but that is an exciting ground for this connection.

Hacker news 9 year statistics – analysis about Hacker news activities, volume, users and trends over the last 9 years. Interesting specially because it became such a central place to consume technology news.

Evolving strategies for an iterated Prisoner’s Dilemma tournament – I don’t get to read many posts about evolutionary algorithms or practical game theory. So such in in-depth post with python implementation is really refreshing.
Visualizing Chess data with ggplot – Although I’m not an R user I love chess (and I’m horrible player) and I loved the analysis