I got a free ticket to the Distributed Matters conference by winning the challenge of the Python Big Data Analytics meetup group. The line-up was impressive (including Kyle Kingsbury of the “Call Me Maybe” project), the topics were very interesting, and the location was very nice (the first time I have heard a talk in a disco hall) and close to my house, a perfect match. Due to earlier commitments I could only stay for half of the day, but what a day 🙂

The day started with Kyle Kingsbury (Stripe) giving a keynote about “Call Me Maybe” and the Jepsen test. The basic setup of the test is creating a cluster of some database, injecting random events (reads, writes, updates, queries, etc.) and verifying the database’s behavior. The next level includes injecting all kinds of failures: network errors, node pauses, etc. Following the description of the test, Kingsbury gave a brief summary of the results. The “Call Me Maybe” project puts pressure on the different database vendors to improve their products and align them with the promises they make. It definitely serves as the watchdog of this industry. It is sometimes amazing how poorly tested some products are.
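The idea can be illustrated with a deterministic toy sketch (plain Python, not the actual Jepsen harness, which is written in Clojure): apply a sequence of operations to a single register and flag every read that does not return the last acknowledged write. The `DroppyRegister` below, a made-up stand-in for a buggy database, silently loses every third write.

```python
def jepsen_style_check(write, read, n_ops=100):
    """Alternate writes and reads against a register and record every read
    that does not return the last acknowledged write."""
    last = None
    violations = []
    for i in range(n_ops):
        if i % 2 == 0:
            last = i            # write a fresh, distinct value
            write(last)
        else:
            got = read()
            if got != last:
                violations.append((i, last, got))
    return violations

class DroppyRegister:
    """A 'database' that silently loses every third write -- the kind of bug
    such tests are built to expose."""
    def __init__(self):
        self.value = None
        self.writes = 0
    def write(self, v):
        self.writes += 1
        if self.writes % 3 != 0:    # every third write is dropped
            self.value = v
    def read(self):
        return self.value

db = DroppyRegister()
print(len(jepsen_style_check(db.write, db.read)))  # 16: one violation per dropped write
```

A real test adds randomized operations, concurrency and injected network failures, but the verification loop is the same shape: generate history, then check it against the guarantees the vendor promises.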

The sentence I took with me from his talk is that choosing a database is a contract, and you should check all the relevant aspects of it. Settings, complexity and needs sometimes change over time, but as long as the initial contract is honored there are no complaints.

The second talk I attended was Michael Hackstein‘s (ArangoDB) talk “NoSQL meets Microservices”. Hackstein presented the problem of making multiple calls to multiple databases, each answering a different need, and presented a solution: a multi-model NoSQL database, where a single call can answer all needs. ArangoDB is an example of such a database; OrientDB is a major competitor. The multi-model database problem might be phrased as just multi-model data, i.e. document data, graph data, key-value data, etc. (see footnote 1).
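To make the “one store, several data models” idea concrete, here is a toy sketch in Python (an illustration of the concept, not ArangoDB’s actual API): the same store holds key-value pairs, documents and graph edges, and a single traversal call returns full documents, combining two models in one query.

```python
class MultiModelStore:
    """Toy multi-model store: documents, key-value pairs and graph edges
    living side by side, queryable in a single call."""
    def __init__(self):
        self.documents = {}   # document model: id -> dict
        self.kv = {}          # key-value model
        self.edges = []       # graph model: (from_id, to_id, label)

    def insert_doc(self, doc_id, doc):
        self.documents[doc_id] = doc

    def put(self, key, value):
        self.kv[key] = value

    def link(self, a, b, label):
        self.edges.append((a, b, label))

    def neighbors(self, doc_id, label):
        """Graph traversal that returns full documents -- one call, two models."""
        return [self.documents[b] for (a, b, lbl) in self.edges
                if a == doc_id and lbl == label]

store = MultiModelStore()
store.insert_doc("alice", {"name": "Alice"})
store.insert_doc("bob", {"name": "Bob"})
store.link("alice", "bob", "follows")
print(store.neighbors("alice", "follows"))  # [{'name': 'Bob'}]
```

In a polyglot-persistence setup the same question would need one call to a graph database and another to a document store; that round-trip elimination is the pitch of the multi-model approach.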

The rest of the talk focused on ArangoDB’s different features:

  • Foxx – a REST API over the database which works as a microservice.
  • drivers – including Python, PHP, Java, JavaScript, Ruby, etc.
  • AQL – the ArangoDB query language

The next talk was given by Charity Majors (Parse, acquired by Facebook), who introduced herself as someone who “hates software for a living”. This talk was a really great one. Majors is super experienced, has worked with many technologies and is very charismatic. She spoke about her many years of experience upgrading databases. Sounds easy? Well, not so much. Stepping back for a moment, why would you want to upgrade a database at all? New features, improved performance, bug fixes, etc. So what can go wrong? Well, many things: downtime, losing data, losing transactions, different hashing and indexing, optimizations which are not optimized for your data, changes in clients, and much more fun. The good news: you have enough interesting and important data to care for.

What should you do about it? tl;dr: be paranoid. Longer version:

  • Change only one component at a time.
  • Read the docs, change logs and release notes (also mentioned by Kyle Kingsbury).
  • Don’t trust database vendors’ benchmarks; test for your expected load (if you can predict it, which is another painful point), with at least 24 hours of data, and make sure to clear the cache.
  • Take consistent snapshots.
  • Prepare a rollback option.

I really hope her slides will be published soon, since it was a very informative talk.

“Tour de Force: Stream Based Textanalysis” by Hendrik Saly and Stefan Siprell (codecentric).

This talk was classified as expert level and I expected it to be a bit more than what it actually was. Many buzzwords were thrown around – R, machine learning, Twitter, Apache Spark, Elasticsearch, Kibana, AWS, etc. – but the big picture was missing: how those technologies relate to each other, and what the alternatives are. For example, for the goal of enriching the data, R can easily be replaced with Python. Or, in order to read from Twitter, why use files and not Spark’s streaming capabilities (or some queue)? Besides Kibana being a great and easy-to-use tool, the reason for preferring Elasticsearch over other options was not clear. To sum up, I also missed some reference to the distributed properties of those tools. This is the reason we were all there, after all, no?


1. Multi-model data is just a way of thinking about data structures in patterns we are used to. Maybe there are other, more efficient ways to think of this data?

2. Sponsor-wise: on the one hand, there were a few sponsors that didn’t give any talk, for example idealo. On the other hand, there were a few companies I expected to sponsor the conference, e.g. Crate.io, Elastic and MongoDB, which was mentioned a lot but unfortunately didn’t give any dedicated talk.

3. Database vendors’ business models: I think the most popular model as of today is to develop a database, possibly open source (Elasticsearch, ArangoDB, etc.), and make money from supporting it. Is there any other way to make money from database vendoring these days (hosting not included under vendoring)?

4. Diversity in tech: I saw no more than 5 women in the crowd.


5 interesting things (11/9/2015)

Density based clustering – the clearest and most practical guide I have read about density-based clustering.
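For reference, the core density-based idea fits in a few lines. Below is a minimal DBSCAN sketch in plain Python (a toy for building intuition, not a replacement for a library implementation): a point with at least `min_pts` neighbors within distance `eps` is a core point, clusters grow by expanding outward from core points, and everything unreachable is noise.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (-1 = noise)."""
    def region(i):
        # indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1               # noise (may later become a border point)
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # former noise joins as a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:   # j is core too: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, -5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Two dense blobs become clusters 0 and 1, and the isolated point at (5, -5) is labeled noise; no number of clusters needs to be specified up front, which is the main practical appeal over k-means.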


Word segment – this Python library, which is trained on a corpus of over a trillion words, aims to help segment text into words, e.g. “thisisatest” to “this is a test”. I tried a random example, “helloworld”, and it didn’t split it at all. I tried other examples as well (“mynameis<x>”, “ilivein<y>”, etc.) and it worked well. Besides the segmentation functionality, it also offers unigram and bigram counts; these can be useful for all kinds of applications without the need to get the data, clean it and process it yourself. Numbers do not appear in the unigram counts; I find it interesting for needs other than splitting.
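The standard approach behind such libraries is a dynamic program over word probabilities (this is the general idea, not necessarily wordsegment’s exact internals, and the four-word unigram table is made up for the example): choose the split that minimizes the summed negative log-probabilities of the words.

```python
import math

def segment(text, word_cost):
    """Dynamic-programming segmentation: pick the split of `text` that
    minimizes the total cost (negative log-probability) of its words."""
    best = [(0.0, 0)]  # best[i] = (cost, split point j) for text[:i]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - 20), i):   # cap word length at 20 chars
            word = text[j:i]
            candidates.append((best[j][0] + word_cost(word), j))
        best.append(min(candidates))
    # backtrack from the end to recover the words
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

counts = {"this": 5, "is": 8, "a": 9, "test": 4}  # hypothetical unigram counts
total = sum(counts.values())

def cost(word):
    if word in counts:
        return -math.log(counts[word] / total)
    return 1e6 + len(word)  # heavily penalize words outside the vocabulary

print(segment("thisisatest", cost))  # ['this', 'is', 'a', 'test']
```

With real corpus counts the cost function also needs smoothing for unseen words (wordsegment penalizes them by length rather than rejecting them outright), which is why quality depends so heavily on the underlying corpus.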


Funny haha?! Predicting whether a joke is funny or not based on the words it contains, using the Naive Bayes classifier from the NLTK package. It is a good and funny beginners’ tutorial. NLTK contains additional classifiers besides the Naive Bayes classifier, e.g. the Decision Tree classifier; it would also be interesting to see how they perform on this dataset.
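The classifier itself is simple enough to sketch from scratch (a toy stand-in for `nltk.NaiveBayesClassifier`, with made-up training examples): count word frequencies per class and pick the class that maximizes the log prior plus the smoothed log-likelihood of the words.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (word_list, label). Returns a classify function."""
    label_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocab.update(words)

    def classify(words):
        best_label, best_score = None, -math.inf
        for label in label_counts:
            # log prior + add-one-smoothed log likelihood of each word
            score = math.log(label_counts[label] / len(labeled_docs))
            total = sum(word_counts[label].values()) + len(vocab)
            for w in words:
                score += math.log((word_counts[label][w] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify

train = [
    (["knock", "knock", "who", "there"], "funny"),
    (["chicken", "crossed", "road"], "funny"),
    (["tax", "form", "deadline"], "not"),
    (["meeting", "agenda", "minutes"], "not"),
]
classify = train_nb(train)
print(classify(["knock", "knock"]))  # funny
```

The naive part is assuming word occurrences are independent given the class; a decision tree drops that assumption and can capture word interactions, which is exactly why comparing the two on the joke dataset would be interesting.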


Scaling decision forests lessons learned – some of those lessons are specific to decision forests, some are about scaling and some are generally good practices. I wish there were a more profound discussion about dealing with missing data / features, not specifically in this post but in general.


Redshift howto – a very extensive guide to Redshift, mostly admin- and configuration-related stuff. I miss another chapter regarding tools on top of Redshift, such as re:dash.


5 interesting things (05/09/2015)

Time map visualization of discrete events – a very good idea for visualizing discrete events when the order of the events is not important but rather the general pattern. Good for visualizing time between failures, time between user visits to a site / user actions, etc.
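The construction itself is tiny (my own sketch of the idea): for every event, plot the time since the previous event against the time until the next one. Bursty behavior then shows up as points hugging the axes, while regular behavior sits on the diagonal.

```python
def time_map_points(timestamps):
    """For each interior event, return (time since previous, time until next)."""
    ts = sorted(timestamps)
    return [(ts[i] - ts[i - 1], ts[i + 1] - ts[i])
            for i in range(1, len(ts) - 1)]

events = [0, 1, 2, 10, 11, 12]  # two bursts separated by a long gap
print(time_map_points(events))  # [(1, 1), (1, 8), (8, 1), (1, 1)]
```

The (1, 8) and (8, 1) points mark the entry to and exit from the gap; a scatter plot of these pairs (often on log scales) is the time map.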


Cyber attacks map – so cool.


Why different people think differently – the actual title of this post is “Why a Mathematician, Statistician, & Machine Learner Solve the Same Problem Differently”, but I think it misses in many aspects. First, the comparison is a bit shallow, ignoring non-parametric statistics, machine learning models with hyper-parameters, etc. Machine learning does make assumptions about the data: by choosing the features you use (even if they will eventually be assigned a weight of 0) you make assumptions about the data. Moreover, choosing the model, kernel, etc. assumes something regarding the features’ distribution.

Researchers, data scientists, statisticians – people think differently. Some of them tend to use tools they know and that have worked for them before; some of them want to use new tools and ideas. I believe that people with the same education (ML, statistics, etc.) but from different sources / institutes will also have different approaches, not matching the thesis of this post.


Spreadsheets are graphs? I like this post because it brings some fresh spirit to a painful problem we all experience – sharing documents / preserving knowledge. In almost every organization there are all kinds of documents (spreadsheets, Word documents, presentations, RFPs, etc.), but they are almost never connected to each other and almost always a mess. So this is another angle to look at this problem from.


Cross validation done wrong – things that are clear when you think about them, but we don’t usually spend time thinking about them. The bottom line: always isolate your training set completely from the cross-validation and test sets.
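A minimal sketch of the leakage principle (my own toy example, not code from the post): any preprocessing statistic computed on the full dataset lets test-set information leak into training. The same reasoning applies to feature selection, scaling and imputation, all of which must be fit on the training fold only.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)]
train, test = data[:80], data[80:]

# Wrong: the centering statistic is computed on ALL data,
# so the test points influence how the training data is preprocessed.
mu_leaky = statistics.mean(data)

# Right: compute preprocessing statistics on the training fold only,
# then apply them unchanged to the test fold.
mu_clean = statistics.mean(train)
test_centered = [x - mu_clean for x in test]

print(mu_leaky != mu_clean)  # True: the test points shifted the leaky statistic
```

Centering is the mildest case; with data-driven feature selection done before the split, the inflation in reported accuracy can be dramatic, which is the main point of the post.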


Apache Flink workshop

On Wednesday I took part in “Stream Processing with Apache Flink“. The workshop was hosted by Carmeq and was super generous.
Apache Flink is a distributed streaming dataflow engine. There are several obvious competitors, including Apache Spark, Apache Storm and MapReduce (and possibly Apache Tez).
The main question for me when adopting a new tool is why it is better than what I already use and which problems it solves for me.
Apache Flink’s main advantages compared to Apache Storm are its batch capabilities, windowing support and exactly-once guarantee. Apache Storm is designed for event processing, i.e. streaming data. Flink’s streaming windows allow very easy and native aggregation by both time and count windows.
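To illustrate what time and count windows mean (a plain-Python toy of the concepts, not Flink’s DataStream API): a tumbling time window groups events by fixed time intervals, while a count window closes after every n events, regardless of time.

```python
def tumbling_time_windows(events, size):
    """Group (timestamp, value) events into fixed time intervals and sum each."""
    windows = {}
    for ts, value in events:
        windows.setdefault(ts // size, []).append(value)
    return {w * size: sum(vals) for w, vals in sorted(windows.items())}

def count_windows(events, n):
    """Close a window after every n events and sum each window."""
    values = [v for _, v in events]
    return [sum(values[i:i + n]) for i in range(0, len(values), n)]

stream = [(0, 1), (1, 2), (5, 3), (6, 4), (12, 5)]
print(tumbling_time_windows(stream, 5))  # {0: 3, 5: 7, 10: 5}
print(count_windows(stream, 2))          # [3, 7, 5]
```

In a real engine the events arrive continuously and possibly out of order, which is where Flink’s native windowing (with watermarks and exactly-once state) earns its keep over hand-rolled aggregation in Storm.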
Its advantages compared to MapReduce are strong support for pipelines and iterative jobs, as well as for many types of data – Flink is more high-level than MR. And of course the streaming.
Compared to Apache Spark, the implementation of Spark Streaming is different: it works as small batches (micro-batching). Apache Spark is limited by memory size, which Flink is less sensitive to. However, I think Spark has a very big advantage at the moment in having APIs for R and Python (in addition to Scala and Java), which are very common among data scientists, while Flink currently supports only Scala and Java.
Both Spark and Flink have graph (GraphX and Gelly) and machine learning (MLlib and FlinkML) support, which makes them much friendlier and more high-level than both MapReduce and Storm.

I think Spark and Flink have a lot in common, and knowing one makes it relatively easy to switch to the other. Currently Apache Spark is much more popular – 2,273 results vs. 59 on Stack Overflow and 8,270,000 vs. 363,000 on Google.

For further reading – the Flink overview.
The workshop focused on Flink, and we went through the slides and exercises on the Flink training site. There were a few issues – bugs, Java version and Flink version mismatches – but it was generally well organized, and the guides were eager to help and to explain.
Related links –