5 interesting things (1/12/2015)

Improving My CLI’s Autocomplete with Markov Chains – Markov chains are the basis for many auto-complete algorithms we know and use on a daily basis, e.g. keyboards on mobile devices. In this case it is a developer hack to improve auto-complete in a development tool. It is always nice when theory comes to life.

http://nicolewhite.github.io/2015/10/05/improving-cycli-autocomplete-markov-chains.html
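To get a feel for the idea, here is a minimal sketch of my own (not the post’s implementation) – a first-order Markov chain over a command history that suggests the most likely next tokens:

```python
from collections import Counter, defaultdict

class MarkovSuggester:
    """First-order Markov chain over tokens: suggest likely next tokens."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def train(self, history):
        # Count how often each token follows each other token.
        for line in history:
            tokens = line.split()
            for current_token, next_token in zip(tokens, tokens[1:]):
                self.transitions[current_token][next_token] += 1

    def suggest(self, token, n=3):
        # Return the n most frequent followers of the given token.
        return [t for t, _ in self.transitions[token].most_common(n)]

# Toy command history:
suggester = MarkovSuggester()
suggester.train(["git commit -m", "git push origin master", "git commit --amend"])
print(suggester.suggest("git"))     # ['commit', 'push']
print(suggester.suggest("commit"))  # ['-m', '--amend'] (tie order may vary)
```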

10 more lessons learned from building Machine Learning systems – slides of a presentation by Xavier Amatriain, VP Engineering at Quora (previously Director of Algorithms Engineering @Netflix). A very insightful presentation (I would have loved to hear the full one). The name refers to a lecture Amatriain gave exactly a year earlier, called “10 lessons learned from building Machine Learning systems”.

http://www.slideshare.net/xamat/10-more-lessons-learned-from-building-machine-learning-systems

http://www.slideshare.net/xamat/10-lessons-learned-from-building-machine-learning-systems

Beyond One-Hot: an exploration of categorical variables – It is not all about numbers; in many cases features are not numeric. If we are lucky, features will be binary – has \ does not have a symptom, spam \ not spam – or ordinal – amount of pain a patient experiences, etc. But sometimes they are neither – e.g. a state, a color, etc. What then? This post compares several techniques for dealing with categorical variables. While it is very basic, it is well explained (although examples would have helped) and it can give great intuition to someone facing those problems for the first time.

http://willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
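To make it concrete, here is a tiny illustration of mine: an ordinal feature mapped to ordered integers next to a nominal feature that gets one-hot encoded with pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "pain": ["none", "mild", "severe", "mild"],  # ordinal: order matters
    "color": ["red", "green", "blue", "red"],    # nominal: no order
})

# Ordinal feature: map categories to ordered integers.
pain_levels = {"none": 0, "mild": 1, "severe": 2}
df["pain_encoded"] = df["pain"].map(pain_levels)

# Nominal feature: one-hot encode, one binary column per category.
df = pd.get_dummies(df, columns=["color"])
print(df)
```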

Bandit Algorithms for Bullying – Getting More Lunch Money – this post explains bandit algorithms using a very common, easy-to-understand example. However, I would have built the post a bit differently. There is a lot of text telling the story compared to the scientific parts. In my opinion the scientific parts should be emphasized a bit more (bold text, bullets, etc.).

http://1oclockbuzz.com/2015/11/24/bandit-algorithms-for-bullying-getting-more-lunch-money/
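For the curious, a minimal epsilon-greedy sketch of my own (the post covers the story; the code below is one standard bandit algorithm, with made-up payoff probabilities):

```python
import random

def epsilon_greedy(arm_probs, rounds=10000, epsilon=0.1):
    """Epsilon-greedy bandit: explore with probability epsilon,
    otherwise exploit the arm with the best observed average reward."""
    counts = [0] * len(arm_probs)    # pulls per arm
    values = [0.0] * len(arm_probs)  # running average reward per arm
    total_reward = 0
    for _ in range(rounds):
        if random.random() < epsilon:
            arm = random.randrange(len(arm_probs))                     # explore
        else:
            arm = max(range(len(arm_probs)), key=lambda a: values[a])  # exploit
        reward = 1 if random.random() < arm_probs[arm] else 0  # Bernoulli payoff
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]     # incremental mean
        total_reward += reward
    return total_reward, counts

# Three "victims" paying lunch money with different probabilities.
reward, pulls = epsilon_greedy([0.2, 0.5, 0.8])
print(reward, pulls)  # most pulls should go to the 0.8 arm
```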

C.H.I.P vs Pi Zero – The new Pi Zero made a lot of buzz last week, offering a computer for $5. However, the Pi Zero is not the only player in the field of sub-$10 computers. This post compares the Pi Zero’s and C.H.I.P’s specs and capabilities. There are many comments saying the comparison is biased towards C.H.I.P (ignoring shipping costs, unfair comparison of cable costs, reputation, etc.), but overall I think it is worth reading.

http://makezine.com/2015/11/28/chip-vs-pi-zero/

KNN approximation on Apache Spark

K-nearest-neighbors is a very well known classification algorithm. It is based on the phrase – “show me who your friends are and I’ll tell you who you are”.

Apache Spark MLlib contains several algorithms including linear regression, k-means, etc., but it does not currently include an implementation of KNN. One of the reasons is the time complexity it requires (roughly n^2 where n is the number of items, ignoring the dimension).
In the Apache Spark JIRA you can see two tickets on this issue – SPARK-2335 and SPARK-2336. The first asks for a KNN feature and discusses the difficulties. The second, opened based on the first, discusses approximations to KNN and aims to implement one.

I implemented a very naive approximation of the KNN algorithm on Apache Spark, with a similarity function (looking for the max) rather than a Euclidean distance, but this can easily be changed (change the distance function and the sorting key in line 71).

The algorithm is based on splitting the data into partitions and calculating item distances only within the same partition. You can increase the accuracy either by decreasing the number of partitions (comparing each item against more items) or by repeating the process several times (repartitioning differently each time) and choosing the best results.

This calculation yields, for each item, a list of its most similar neighbors; one can then decide how to use this data.

My gist – https://gist.github.com/tomron/70e5fefe128214b7d2a1
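For reference, below is a condensed sketch of my own of the same idea (the gist above is the fuller version). It assumes an RDD built from (id, vector) pairs and uses cosine similarity as the “distance”:

```python
import heapq
import math
import random

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_in_partition(items, k):
    # Compare every pair of items inside a single partition only.
    items = list(items)
    for item_id, vector in items:
        neighbors = [
            (cosine_similarity(vector, other_vector), other_id)
            for other_id, other_vector in items if other_id != item_id
        ]
        yield item_id, heapq.nlargest(k, neighbors)  # top-k by similarity

def approximate_knn(sc, data, k=5, num_partitions=10, repetitions=3):
    merged = None
    data = list(data)  # (id, vector) pairs
    for _ in range(repetitions):
        random.shuffle(data)  # different partitioning on every repetition
        candidates = (sc.parallelize(data, num_partitions)
                        .mapPartitions(lambda part: knn_in_partition(part, k)))
        merged = candidates if merged is None else merged.union(candidates)
    # Merge candidate lists across repetitions, keeping the overall top-k
    # (duplicate neighbors across repetitions are left as-is in this sketch).
    return merged.reduceByKey(lambda a, b: heapq.nlargest(k, a + b))
```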

5 interesting things (15/11/2015)

Counting things in Python – This post shows very nicely and simply how Python and Pythonic writing have changed over the years. One of the interesting things in this post is the analogy to natural languages. Natural languages also change and evolve over time – slang, new phrases, outdated expressions, etc. – and apparently so do programming languages.

http://treyhunner.com/2015/11/counting-things-in-python/
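The flavor of the post, condensed into one illustration of mine (not the post’s exact snippets) – the same word-counting task written three ways, oldest style to most Pythonic:

```python
from collections import Counter, defaultdict

words = "the quick brown fox jumps over the lazy dog the end".split()

# Old-school: check membership by hand.
counts = {}
for word in words:
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

# Better: defaultdict removes the membership check.
counts = defaultdict(int)
for word in words:
    counts[word] += 1

# Most Pythonic today: Counter does it in one line.
counts = Counter(words)
print(counts.most_common(2))  # [('the', 3), ...]
```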

Travelling Beer Drinker problem – Figuring out the shortest road-trip path to visit the best microbreweries in the US. Finally, a good use for the travelling salesman problem :). Data-science-wise there is not much to this problem – it mostly connects the dots between location data and Google’s API – but it is an exciting ground for that connection.

http://flowingdata.com/2015/10/26/top-brewery-road-trip-routed-algorithmically/

Hacker News 9-year statistics – an analysis of Hacker News activity, volume, users and trends over the last 9 years. Especially interesting because it has become such a central place to consume technology news.

Evolving strategies for an iterated Prisoner’s Dilemma tournament – I don’t get to read many posts about evolutionary algorithms or practical game theory, so such an in-depth post with a Python implementation is really refreshing (a minimal taste of the setup follows below).
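The sketch below is my own, not the post’s implementation – an iterated Prisoner’s Dilemma match between two classic strategies, with the standard payoff matrix:

```python
# Payoffs (mine, other): C = cooperate, D = defect.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, opponent_history):
    # Cooperate first, then copy the opponent's last move.
    return opponent_history[-1] if opponent_history else "C"

def always_defect(my_history, opponent_history):
    return "D"

def play(strategy_a, strategy_b, rounds=100):
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        payoff_a, payoff_b = PAYOFFS[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, always_defect))  # (99, 104): one sucker round, then mutual defection
```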
Visualizing Chess data with ggplot – Although I’m not an R user, I love chess (though I’m a horrible player) and I loved the analysis.

Five interesting things (25/10/2015)

Whatsapp CLI – control your server using Whatsapp. The next step for me is to connect my home (heating, lights, doors) and control it via Whatsapp. To be honest, I assume some such implementations already exist.

https://github.com/KarimJedda/whatsappcli

Receipt parser – A project which was created as part of a Trivago hackathon. I have actually thought about the need for this product many times – first for personal accounting, keeping an eye on my spending; second for taking it to the next step – alerts about things I should buy, alerts about buying things I don’t need and alerts about cheaper prices.

http://tech.trivago.com/2015/10/06/python_receipt_parser/

Time magazine visual trends – Good ideas are priceless, and when the implementation is also clean and nice and reveals interesting insights, it is even more exciting.

http://www.pyimagesearch.com/2015/10/19/analyzing-91-years-of-time-magazine-covers-for-visual-trends

Google spreadsheet to ElasticSearch – To tell the truth, I could not think of a worse architecture than using a Google spreadsheet as a database (or as a proxy on the way to a DB). Having said that, Elastic released a Google Sheets add-on to import spreadsheet content into an Elasticsearch instance.

https://www.elastic.co/blog/introducing-google-sheets-to-elasticsearch-add-on

Snakefooding python – snakefood is a tool for creating dependency graphs. This post shows the dependency graphs of some very common Python libraries (Flask, Django, Celery, requests, etc.). It presents the pros and cons of using snakefood – what it exposes and what it does not (many small files -> many imports -> complex dependency graph vs. one file of spaghetti code -> no imports -> very clear, simple graph). I see it as a tool that supports development and detecting unnecessary dependencies.

http://grokcode.com/864/snakefooding-python-code-for-complexity-visualization/

AWS loft Berlin

This week Amazon opened a loft in Berlin, which is supposed to stay open for 4 weeks. The Berlin loft is currently a pilot, and there are several other lofts around the world. I think it is a very good strategic call to have it in Berlin, as the startup scene there is emerging and many people want to try and learn more about cloud services, while hands-on experience is sometimes limited.

So what is going on in the loft? It is open every day 10:00–18:00, and there is an Amazon employee on duty whom you can consult regarding AWS services. A very inviting work space. And workshops, demos, bootcamps, etc. – all, of course, related to AWS services.

I took part in two workshops on Thursday morning – “An overview of Hadoop & Spark, using Amazon Elastic MapReduce” and “Processing streams of data with Amazon Kinesis (and other tools)“. Both lectures were given by Michael Hanisch, a solutions architect at Amazon. The first talk was a bit messy, as it covered many topics and ended up jumping back and forth between general things about Hadoop, tips about EMR, changes in the AMI concepts and versioning, and hints about Spark.

The second talk was much more focused. It started by introducing the need for Amazon Kinesis, then explained the architecture – producers, streams, shards, clients – clearing up the capabilities and constraints and mentioning the Kinesis autoscaling utilities. The next step was a deeper dive into the Kinesis Producer Library and the Kinesis Client Library, moving forward to Kinesis Firehose (which was introduced at re:Invent the week before) and integration with additional input and output sources and AWS services. The talk ended with tips and best practices. AWS Lambda was also mentioned several times during the talk as a tool to process stream data.
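To give a feel for the producer side, here is a minimal sketch of mine using boto3 (the stream name and region are made up, and the stream is assumed to already exist):

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

def send_event(event):
    # The partition key determines which shard receives the record.
    kinesis.put_record(
        StreamName="example-stream",  # assumed to exist
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )

send_event({"user_id": 42, "action": "click", "page": "/home"})
```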

Quite an exciting time to be in Berlin.

5 interesting things (06/10/2015)

Restaurant recommendations – I have read quite a lot about recommendation systems lately, and I love this post because it deals with the restaurant domain, while most posts about recommendation systems refer to music, television and movies. Geospatial features are far more important here than in movie, music and television recommendations.

http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opentable

Data Science workflow – this blog post presents a well-structured approach to the data science process. While I loved the structured approach, the entry point – “get data, have fun with it” – is twisted. I believe that when working on a product, most of the time you will want to solve a problem or introduce a new feature, i.e. you already have the question you want to answer, rather than explore a dataset and think about the questions you could answer with it. I also missed the part about documenting your work.

http://blog.binaryedge.io/2015/09/08/the-data-science-workflow/

AWS in plain English – or AWS for humans. If you are not familiar yet with the different services, this is a nice way to get introduced to the terminology.

https://www.expeditedssl.com/aws-in-plain-english

Federer vs Djokovic – why not Nadal? It seems like kind of an abuse of the product, but it surely exposes the product and makes some buzz.

https://www.elastic.co/blog/building-dashboards-using-data-from-the-federer-&-djokovic-tennis-rivalry

What to do with small data – big data has been one of the buzzwords of the last few years. While it was previously whispered only by tech people, it is now a common, well-known phrase. But many companies do not really have big data; they have few users and need to perform well for those users. One day they might have big data, many users and tons of features – until then, there are some clues in this post.

https://medium.com/rants-on-machine-learning/what-to-do-with-small-data-d253254d1a89

#dmconf15

I got a free ticket to the distributed matters conference after winning the challenge of the Python big data analytics meetup group. The lineup was impressive (including Kyle Kingsbury of the “call me maybe” project), the topics were very interesting, and the location was very nice (the first time I have heard a talk in a disco hall) and close to my house – a perfect match. Due to earlier commitments I could only stay for half of the day, but what a day 🙂

The day started with Kyle Kingsbury (Stripe) giving a keynote about “call me maybe” and the Jepsen test. The basic setup of the test is creating a cluster of some database, injecting random events (read, write, update, query, etc.) and verifying the database’s behavior. The next level includes injecting all kinds of failures – network errors, node pauses, etc. Following the description of the test, Kingsbury gave a brief summary of the results. The “call me maybe” results create pressure on the different database vendors to improve their products and align with the promises they make. It definitely serves as the watchdog of this industry. It is sometimes amazing how untested some products are.

The sentence I took with me from his talk is that choosing a database is a contract, and you should check all the relevant aspects of it. Settings, complexity and needs sometimes change over time, but as long as the initial contract is honored there are no complaints.

The second talk I went to was Michael Hackstein‘s (ArangoDB) talk “NoSQL meets microservices”. Hackstein presented the problem of multiple calls to multiple databases, where each answers a different need, and presented the solution – a multi-model NoSQL database, where a single call can answer all needs. ArangoDB is an example of such a database; OrientDB is a major competitor. The problem a multi-model database solves might be phrased as just multi-model data, i.e. document data, graph data, key-value data, etc. (see footnote 1).

The rest of the talk focused on ArangoDB different features –

  • Foxx – a REST API over the database which works as a microservice.
  • Drivers – including Python, PHP, Java, JS, Ruby, etc.
  • AQL – the ArangoDB query language.

The next talk was given by Charity Majors (Parse – acquired by Facebook), who introduced herself as someone who “hates software for a living”. This talk was a really, really great one. Majors is super experienced, has worked with many technologies and is very charismatic. She spoke about her many years of experience upgrading databases. Sounds easy? Well, not that much. Taking a step back, why would you want to upgrade a database? New features, improved performance, bug fixes, etc. So what can go wrong? Well, many things: downtime, losing data, losing transactions, different hashing and indexing, optimizations which are not optimized for your data, changes in clients, and much more fun. The good news – you have enough interesting and important data to care for.

What should you do about it? tl;dr – be paranoid. Longer version –

  • Change only one component at a time.
  • Read the docs, change logs and release notes (also mentioned by Kyle Kingsbury).
  • Don’t believe database vendors’ benchmarks; test for your expected load (if you can predict it – another painful point) with at least 24h of data, and make sure to clear the cache.
  • Take consistent snapshots.
  • Prepare a rollback option.

I really hope her slides will be published soon since it was a very informative talk.

Tour de Force: Stream Based Textanalysis by Hendrik Saly and Stefan Siprell (codecentric).

This talk was classified as expert level and I expected it to be a bit more than what it actually was. Many buzzwords were thrown around – R, machine learning, Twitter, Apache Spark, Elasticsearch, Kibana, AWS, etc. – but the big picture was missing: how those technologies relate to each other, and what the alternatives are. For example, for the goal of enriching the data, R can easily be replaced with Python; or, in order to read from Twitter, why use files and not Spark’s streaming capabilities (or some queue)? Besides Kibana being a great and easy-to-use tool, the reason for preferring Elasticsearch over other options was not clear. To sum up – I also missed some reference to the distributed properties of those tools. That is the reason we were all there, after all, no?

Footnotes

1. Multi-model data is just a way of thinking about data structures in patterns we are used to. Maybe there are other, more efficient ways to think about this data?

2. Sponsor-wise – on one hand there were a few sponsors that didn’t give any talk, for example idealo. On the other hand there were a few companies I expected to sponsor the conference, e.g. crate.io, Elastic and MongoDB, which were mentioned a lot but unfortunately didn’t give any dedicated talk.

3. Database vendors’ business models – I think the most popular model as of today is to develop a database, possibly open source (Elasticsearch, ArangoDB, etc.), and make money from supporting it. Is there any other way to make money from database vendoring these days (hosting not included under vendoring)?

4. Diversity in tech – I saw no more than 5 women in the crowd.

5 interesting things (11/9/2015)

Density based clustering – the clearest and most practical guide I have read about density-based clustering.

http://blog.dominodatalab.com/topology-and-density-based-clustering/
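If you want to experiment alongside the post, here is a minimal density-based clustering run with scikit-learn’s DBSCAN (my own illustration, not taken from the guide):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: k-means struggles here, DBSCAN does not.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels))  # cluster ids; -1 marks noise points
```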

Word segment – this Python library, which is trained on a corpus of over a trillion words, aims to help segment text into words, e.g. “thisisatest” to “this is a test”. I tried a random example – “helloworld” – and it didn’t split it at all. I tried other examples as well (“mynameis<x>”, “ilivein<y>”, etc.) and it worked well. Besides the segmentation functionality it also offers unigram and bigram counts; this can be useful for all kinds of applications without the need to get the data, clean it and process it yourself. Numbers do not appear in the unigram counts; I find it interesting for needs other than splitting.

http://www.grantjenks.com/docs/wordsegment
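Usage is basically a one-liner (note: newer versions of the library may also require loading the data explicitly first, hence the hedge in the comment below):

```python
from wordsegment import segment  # newer versions may also need: from wordsegment import load; load()

print(segment("thisisatest"))  # ['this', 'is', 'a', 'test']
print(segment("mynameistom")) # ['my', 'name', 'is', 'tom']
```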

Funny haha?! Predicting whether a joke is funny or not based on the words it contains, using the Naive Bayes classifier from the NLTK package. It is a good and funny beginners’ tutorial. NLTK contains additional classifiers besides the Naive Bayes classifier, e.g. the Decision Tree classifier; it would also be interesting to see how they perform on this dataset.

http://vknight.org/unpeudemath/code/2015/06/14/natural-language-and-predicting-funny/
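The core of the approach in a few lines of my own (bag-of-words features fed into NLTK’s Naive Bayes classifier; the jokes and labels below are made up):

```python
import nltk

def word_features(text):
    # Bag-of-words features: which words appear in the joke.
    return {word: True for word in text.lower().split()}

train = [
    (word_features("why did the chicken cross the road"), "funny"),
    (word_features("a man walks into a bar ouch"), "funny"),
    (word_features("the quarterly report is due on friday"), "not funny"),
    (word_features("please review the attached spreadsheet"), "not funny"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(word_features("why did the report cross the bar")))
classifier.show_most_informative_features(5)
```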

Scaling decision forests lessons learned – some of those lessons are specific to decision forests, some are about scaling and some are general good practices. I wish there were a more profound discussion about dealing with missing data \ features – not specifically in this post but in general.

http://blog.siftscience.com/blog/2015/large-scale-decision-forests-lessons-learned

Redshift howto – a very extensive guide to Redshift, mostly admin- and configuration-related. I miss a chapter about tools on top of Redshift, such as re:dash.

https://www.periscope.io/amazon-redshift-guide

5 interesting things (05/09/2015)

Time map visualization of discrete events – A very good idea for visualizing discrete events when what matters is not the order of events but the general pattern. Good for visualizing time between failures, time between user visits to a site \ user actions, etc.

https://districtdatalabs.silvrback.com/time-maps-visualizing-discrete-events-across-many-timescales
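The technique is simple to reproduce – for every event, plot the time since the previous event against the time until the next one. A minimal sketch of mine, with synthetic timestamps:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic event timestamps (seconds); replace with your own event log.
timestamps = np.random.exponential(scale=60, size=500).cumsum()

before = np.diff(timestamps)[:-1]  # time since the previous event
after = np.diff(timestamps)[1:]    # time until the next event

plt.scatter(before, after, s=5, alpha=0.5)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("time since previous event (s)")
plt.ylabel("time until next event (s)")
plt.title("Time map")
plt.show()
```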

Cyber attacks map – so cool

http://map.norsecorp.com/

Why different people think differently – the actual title of this post is “Why a Mathematician, Statistician, & Machine Learner Solve the Same Problem Differently”, but I think it misses in many aspects. First, the comparison is a bit shallow, ignoring non-parametric statistics, machine learning models with hyper-parameters, etc. Machine learning does make assumptions about the data – by choosing the features you use (even if they eventually get assigned a weight of 0) you make an assumption about the data. Moreover, choosing the model, kernel, etc. assumes something about the features’ distributions.

Researchers, data scientists, statisticians – people think differently. Some of them tend to use tools they know and that have worked for them before; some of them want to use new tools and ideas. I believe that people with the same education (ML, statistics, etc.) but from different sources \ institutes will also have different approaches, not matching the thesis of this post.

http://www.galvanize.com/blog/2015/08/26/why-a-mathematician-statistician-machine-learner-solve-the-same-problem-differently-2

Spreadsheets are graphs? I like this post because it brings some fresh spirit to a painful problem we all experience – sharing documents \ preserving knowledge. In almost every organization there are all kinds of documents (spreadsheets, Word documents, presentations, RFPs, etc.), but they are almost never connected to each other and almost always a mess. So this is another angle to look at this problem from.

http://neo4j.com/blog/spreadsheets-are-graphs-too/

Cross validation done wrong – things that are clear when you think about them, but we don’t usually spend time thinking about them. The bottom line: always isolate your training set completely from the cross-validation and test sets.

http://www.alfredo.motta.name/cross-validation-done-wrong/
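The classic mistake is fitting preprocessing (scaling, feature selection) on the whole dataset before splitting. A sketch of the safe pattern with scikit-learn (my own example): putting the preprocessing inside a Pipeline ensures it is refit on each training fold only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Wrong: a scaler/selector fitted on ALL the data leaks test-fold information.
# Right: put them in a Pipeline so each CV fold refits them on its training part only.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```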

Apache Flink workshop

On Wednesday I took part in “Stream Processing with Apache Flink“. The workshop was hosted by Carmeq, and the hosting was super generous.
 
Apache Flink is a distributed streaming dataflow engine. There are several obvious competitors, including Apache Spark, Apache Storm and MapReduce (and possibly Apache Tez).
 
The main question for me when coming to adopt a new tool is why it is better than what I already use – which problems it solves for me.
 
Apache Flink’s main advantages compared to Apache Storm are its batch capabilities, windowing support and exactly-once guarantee. Apache Storm is designed for event processing, i.e. streaming data. Flink’s streaming windows allow very easy and native aggregation by both time and count windows.
 
Advantages compared to MapReduce are strong support for pipelines and iterative jobs, as well as many types of data – Flink is more high-level than MapReduce. And, of course, the streaming.
 
Compared to Apache Spark: Spark Streaming is implemented differently, as small batches. Apache Spark is limited by memory size, while Flink is less sensitive to it. However, I think Spark has a very big advantage at the moment in having APIs for R and Python (in addition to Scala and Java), which are very common among data scientists, while Flink currently supports only Scala and Java.
 
Both Spark and Flink have graph (GraphX and Gelly) and machine learning (MLlib and FlinkML) support, which makes them much friendlier and more high-level than both MapReduce and Storm.

I think Spark and Flink have a lot in common, and knowing one it is relatively easy to switch to the other. Currently Apache Spark is much more popular – 2,273 vs 59 results on Stack Overflow, and 8,270,000 vs 363,000 results on Google.

For further reading – flink overview.
 
The workshop focused on Flink, and we went through the slides and exercises on the Flink training site. There were a few issues – bugs, Java version and Flink version issues – but it was generally well organized, and the guides were eager to help and explain.
 
Related links –