On Wednesday I took part in “Stream Processing with Apache Flink”. The workshop was hosted by Carmeq, who were super generous.
Apache Flink is a distributed streaming dataflow engine. There are several obvious competitors, including Apache Spark, Apache Storm and MapReduce (and possibly Apache Tez).
The main question for me when adopting a new tool is why it is better than what I already use and which problems it solves for me.
Apache Flink’s main advantages compared to Apache Storm are its batch capabilities, windowing support and exactly-once guarantee. Apache Storm is designed for event processing, i.e. streaming data. Flink’s streaming windows allow very easy, native aggregation over both time windows and capacity (count) windows.
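To make the two window types concrete, here is a minimal sketch in plain Python. It is not Flink's API: the function names `tumbling_time_windows` and `count_windows` and the sample events are made up for illustration, and it only shows the idea of grouping a stream by time or by number of elements before aggregating.

```python
from collections import defaultdict


def tumbling_time_windows(events, size):
    """Sum values per non-overlapping time window of `size` seconds.

    `events` is an iterable of (timestamp_seconds, value) pairs.
    Returns a dict mapping window start time -> sum of values.
    """
    sums = defaultdict(int)
    for timestamp, value in events:
        # Every timestamp falls into exactly one window.
        window_start = (timestamp // size) * size
        sums[window_start] += value
    return dict(sums)


def count_windows(values, size):
    """Sum values per window of `size` consecutive elements.

    A trailing window with fewer than `size` elements is not emitted,
    because it never fills up.
    """
    results = []
    buffer = []
    for value in values:
        buffer.append(value)
        if len(buffer) == size:
            results.append(sum(buffer))
            buffer = []
    return results


# Hypothetical sample stream: (timestamp in seconds, value)
events = [(0, 1), (2, 2), (5, 3), (7, 4), (11, 5)]

time_result = tumbling_time_windows(events, size=5)
count_result = count_windows([value for _, value in events], size=2)
print(time_result)
print(count_result)
```

In a real engine the windows are evaluated continuously as data arrives; the sketch works on a finished list only to keep it short.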
Compared to MapReduce, the advantages are strong support for pipelines and iterative jobs as well as for many types of data; Flink is more high-level than MR. And of course, the streaming.
Compared to Flink, Spark Streaming takes a different approach: it is implemented as small batches (micro-batches). Apache Spark is also limited by memory size, to which Flink is less sensitive. However, I think Spark has a very big advantage at the moment in having APIs for R and Python (in addition to Scala and Java), which are very common among data scientists, while Flink currently supports only Scala and Java.
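The difference between processing small batches and processing each event as it arrives can be sketched in plain Python. This is only an illustration of the idea, not how Spark or Flink are actually implemented; `per_event`, `micro_batches` and the sample events are hypothetical names and data.

```python
def per_event(events):
    """Update a running total and emit a result for every single event."""
    total = 0
    outputs = []
    for timestamp, value in events:
        total += value
        outputs.append((timestamp, total))
    return outputs


def micro_batches(events, interval):
    """Group events into batches of `interval` seconds.

    The running total is emitted only once per batch, stamped with the
    time at which that batch ends.
    """
    total = 0
    outputs = []
    current_batch = None
    for timestamp, value in events:
        batch = timestamp // interval
        if current_batch is not None and batch != current_batch:
            # The previous batch is complete: emit its result.
            outputs.append(((current_batch + 1) * interval, total))
        current_batch = batch
        total += value
    if current_batch is not None:
        # Emit the last (possibly partial) batch.
        outputs.append(((current_batch + 1) * interval, total))
    return outputs


# Hypothetical sample stream: (timestamp in seconds, value)
events = [(0, 1), (1, 2), (3, 3), (4, 4)]

event_outputs = per_event(events)
batch_outputs = micro_batches(events, interval=2)
print(event_outputs)
print(batch_outputs)
```

With per-event processing a result is available at the event's own timestamp, while with batching it only appears at the end of the batch interval, so the interval sets a lower bound on latency.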
Both Spark and Flink have graph (GraphX and Gelly) and machine learning (MLlib and FlinkML) support, which makes them much friendlier and more high-level than both MapReduce and Storm.
I think Spark and Flink have a lot in common, and if you know one it is relatively easy to switch to the other. Currently Apache Spark is much more popular: 2273 results vs 59 results on Stack Overflow and 8270000 results vs 363000 on Google.
For further reading – Flink overview.
The workshop focused on Flink and we went through the slides and exercises on the Flink training site. There were a few issues (bugs, Java version and Flink version problems), but it was generally well organized and the guides were eager to help and explain.
Related links –
- Interview with Matei Zaharia, creator of Apache Spark
- Flink Forward conference – in Berlin next month
- Flink meetup Berlin