PyData Berlin 2016 #pydatabln

I received a diversity scholarship from NumFOCUS to attend PyData Berlin. NumFOCUS is an NGO that supports open source data science projects, among them Jupyter, matplotlib, NumPy and pandas.

This post is not a summary of the event or of every talk I attended, but rather a set of pointers to a subset of the talks.

Keynote – Olivier Grisel (Inria)
Grisel talked about the evolution of predictive modeling and about scaling it. Why do we need to scale predictive modeling?
  1. I/O-intensive operations, e.g. feature engineering and model serving.
  2. CPU-intensive operations, e.g. hyper-parameter search and cross-validation.
PySpark seems like a very legitimate tool, but it has its drawbacks:
  • No pure Python local mode, so it is impossible to use a profiler or ipdb.
  • There is latency induced by the network architecture (Python driver -> Scala (JVM) -> Python worker).
  • Tracebacks are sometimes hard to understand, as they mix Scala and Python errors and log data.
Grisel suggested using Dask and Distributed instead, as native Python packages for parallel and distributed computing.
Being a young project, Dask also has its limitations: mainly, there are no distributed shuffles, which means it does not support distributed merge, join, groupby and aggregation operations at the moment.
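To make the "CPU-intensive" case concrete, here is a minimal sketch of my own (not code from the talk) of a hyper-parameter grid search whose evaluations are independent tasks handed to a pool of workers. It uses only the standard library's concurrent.futures rather than Dask, and the evaluate function and its toy score are hypothetical stand-ins for a real cross-validated model fit.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for one cross-validated model fit; returns a score."""
    alpha, depth = params["alpha"], params["depth"]
    return -((alpha - 0.1) ** 2 + (depth - 5) ** 2)


def grid(space):
    """Yield every combination of values from a {name: [values]} search space."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def parallel_grid_search(space, max_workers=4):
    """Score every candidate in the pool and return the best (params, score)."""
    candidates = list(grid(space))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each evaluation is independent of the others, so they can run anywhere.
        scores = list(pool.map(evaluate, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


best_params, best_score = parallel_grid_search(
    {"alpha": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
)
```

Threads will not speed up pure Python CPU-bound work because of the GIL; the point is only the shape of the problem. In real use the same map-over-candidates pattern would be handed to a process pool or a distributed scheduler.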

Frontera: open source, large scale web crawling framework (Alexander Sibiryakov, ScrapingHub)
When I read the talk abstract I was not sure what the difference between Frontera and Scrapy was. After hearing the talk, it is clear that the goal of Frontera is different: scheduling the crawl and defining the crawling strategy. Their architecture uses Scrapy for crawling, but other crawlers can be used as well.
After introducing Frontera, Sibiryakov showed their results from crawling the Spanish internet and the problems they faced. Some of the solutions were quite trivial: indexing the data differently to avoid hotspots, caching, limiting the crawl depth and the number of pages per host, etc. I cannot say that I was fully convinced.
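As an illustration of two of those "trivial" fixes, limiting the crawl depth and the number of pages per host, here is a toy frontier of my own. It is a sketch under my own assumptions and not Frontera's API: the SimpleFrontier class, its method names and the example URLs are all hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class SimpleFrontier:
    """Toy crawl frontier: FIFO queue with a depth limit and a per-host page cap."""

    def __init__(self, max_depth=2, max_pages_per_host=3):
        self.max_depth = max_depth
        self.max_pages_per_host = max_pages_per_host
        self.queue = deque()
        self.seen = set()
        self.pages_per_host = defaultdict(int)

    def add(self, url, depth=0):
        """Schedule url unless it is a duplicate, too deep, or its host is full."""
        host = urlparse(url).netloc
        if url in self.seen or depth > self.max_depth:
            return False
        if self.pages_per_host[host] >= self.max_pages_per_host:
            return False
        self.seen.add(url)
        self.pages_per_host[host] += 1
        self.queue.append((url, depth))
        return True

    def next_request(self):
        """Return the next (url, depth) pair to fetch, or None when empty."""
        return self.queue.popleft() if self.queue else None


frontier = SimpleFrontier(max_depth=1, max_pages_per_host=2)
results = [
    frontier.add("http://example.es/", depth=0),
    frontier.add("http://example.es/a", depth=1),
    frontier.add("http://example.es/b", depth=1),   # host cap reached
    frontier.add("http://other.es/deep", depth=2),  # deeper than max_depth
    frontier.add("http://example.es/", depth=0),    # already seen
]
```

The crawler itself (Scrapy or anything else) only asks the frontier what to fetch next and reports back the links it found; the limits live entirely in the scheduling layer.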

Setting up predictive analytics services with Palladium (Andreas Lattner, Otto group)
The motivation came from how the company’s tooling evolved: 80% development time and 20% deployment overhead, which tends to repeat across projects. So why not build a framework and try to automate it? This is exactly why Otto group developed Palladium: to ease the development, deployment and integration of predictive analytics services.
The main limitation of Palladium is that it does not support distributed computation. Another issue is controlling the results in real time (boosting, filtering, etc.). At the current time, an alternative (not written in Python) is in a more mature state.

Spotting trends and tailoring recommendations: PySpark on Big Data in fashion (Martina Pugliese, Mallzee)
Mallzee, as Pugliese put it, is “the Tinder of fashion”: the application shows you a stream of fashion items from multiple providers, and you swipe them left or right according to your preferences. Their current big challenge is to produce valuable recommendations for each customer.
Their input is, on one hand, the items and the item information, which they crawl from the providers’ websites and normalize to fit their own set of tags and properties, plus of course the brand itself. The second input is user behavior: swipes, purchases, favourite and non-favourite brands, etc. Their current choice is to build a random forest for each user based on that user’s actions. This approach can work for users with a lot of signals, and specifically positive signals (purchases and positive swipes). Possibly the next step will be to cluster the users in order to learn more about the users with few positive signals.

Practical Word2Vec in Gensim (Lev Konstantinovskiy, Gensim)

Gensim is “topic modeling for humans”. It implements many interesting algorithms, including LDA, word2vec and doc2vec; it has lots of potential and is interesting to play with.
But this tutorial was a mess. The room did not suit a tutorial: there was not enough space, and there were no tables on which people could put their computers and actually run the code.

Konstantinovskiy wanted to present many options and ideas and ran too fast between the different algorithms without really explaining them. On the other hand, there was no real focus on running the code and showing the package API. Unfortunately, I feel this talk was a bit mishandled.

Bayesian Optimization and it’s application to Neural Networks (Moritz Neeb, TU Berlin)
What I would take from this talk is the idea of approaching hyper-parameter tuning as an optimization problem. For example, when we use grid search for hyper-parameter tuning we invest a lot of resources, but the experiments are independent of one another, so nothing learned in one experiment informs the next. Instead, we can introduce a method in which the different experiments interact.
Neeb introduced an approach which models the objective as a Gaussian process and, at each step, tries to evaluate the point with the maximal possible gain.
He also mentioned a few Python libraries:
  • Spearmint – designed to automatically run experiments (thus the code name spearmint) in a manner that iteratively adjusts a number of parameters so as to minimize some objective in as few runs as possible.
  • HPOlib – hyperparameter optimization library
  • Hyperopt – a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.
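To show what "each experiment learns from the previous ones" can look like, here is a small pure-Python sketch of my own. It is not Neeb's code and not the API of any of the libraries above. It assumes a zero-mean Gaussian process with an RBF kernel (length scale 0.2), uses expected improvement as the "gain", and minimizes a hypothetical one-dimensional objective over a fixed grid of candidates.

```python
import math


def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        total = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - total) / aug[row][row]
    return x


def posterior(xs, ys, x_new, jitter=1e-6):
    """Zero-mean GP posterior mean and standard deviation at x_new."""
    gram = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    k_new = [rbf(x_new, a) for a in xs]
    weights = solve(gram, k_new)  # gram^-1 @ k_new (gram is symmetric)
    mean = sum(w * y for w, y in zip(weights, ys))
    variance = rbf(x_new, x_new) - sum(w * k for w, k in zip(weights, k_new))
    return mean, math.sqrt(max(variance, 1e-12))


def expected_improvement(mean, std, best):
    """Expected amount by which a new evaluation beats the lowest value so far."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def minimize(objective, candidates, initial, n_iter=10):
    """Sequentially evaluate the untried candidate with the highest expected gain."""
    xs = list(initial)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        best = min(ys)
        untried = [c for c in candidates if c not in xs]
        gains = [expected_improvement(*posterior(xs, ys, c), best) for c in untried]
        x_next = untried[gains.index(max(gains))]
        xs.append(x_next)
        ys.append(objective(x_next))
    best_index = ys.index(min(ys))
    return xs[best_index], ys[best_index], xs


def objective(x):
    """Hypothetical validation loss as a function of one hyper-parameter."""
    return (x - 0.3) ** 2


candidates = [i / 100 for i in range(101)]
best_x, best_y, history = minimize(objective, candidates, initial=[0.0, 1.0])
```

Unlike grid search, every new evaluation here is chosen using all previous results; the price is that the experiments can no longer all run independently in parallel.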

Keynote – Wes McKinney (Cloudera)
Wes McKinney is probably Mr. Open Source: he is involved in and leads some of the best-known open source data science projects (in Python, but not only), such as pandas, Apache Parquet, etc.
His talk had roughly two parts: one about the community, the code of conduct, etc., and one about the challenges that he believes will play a major role in the data science / Python / big data communities in the near future. Some of the projects / ideas / challenges he mentioned –

What’s new in Deep Learning (Dr Kashif Rasul, Zalando SE)
This was a deep learning survey on speed: a fast run through several recent papers in deep learning. Besides one mention of Theano and a short code example, there was no strong connection to Python. I would have settled for less code and a slower pace.
Some of the papers he pointed to –

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons (Jose Quesada, Data Science Retreat)
Jose Quesada presented their experience with teaching and using Python scikit-learn versus Scala Spark. The most interesting points, in my opinion, were PySpark’s limitations and future directions and, of course, the comparison of existing features.
One of the most important insights was to use Spark DataFrames. Spark ML is built on top of DataFrames and will replace Spark MLlib in the future.

Data Integration in the World of Microservices (Valentine Gogichashvili, Zalando)
Gogichashvili is the head of Data Engineering at Zalando and has been working there for 5.5 years, which means he was there during the business, structural and technological changes the organization went through.
While he talked about microservices architecture, I think there were two other interesting points in the talk:
  • Team structure – every team is responsible for its own components and infrastructure (including naming) while aligning with engineering guidelines, such as API structure, data types, etc., which are set by cross-team guilds.
  • Open source code – whenever a team starts a new project, the project will by default become open source unless the team can justify why not. Zalando open source projects can be viewed here.

Brand recognition in real-life photos using deep learning (Lukasz Czarnecki, Samsung)
Czarnecki presented his project from Data Science Retreat: brand recognition in Instagram photos, where he combined neural networks with an SVM. He showed the baseline he started from (plus some preprocessing) and the steps he took to improve his results.
Some of the steps he took:
  • Increasing the training set from 300 images per brand to 800.
  • Multiplying the training set by cropping parts of the pictures.
  • Looking at the errors and adding training examples to make the NN more general.
  • Increasing the threshold of the SVM.
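The cropping step can be sketched roughly as follows. This is my own illustration, not Czarnecki's pipeline: the image is a plain list of pixel rows, and the random_crops helper, the crop size and the seed are all hypothetical choices.

```python
import random


def random_crops(image, crop_height, crop_width, n_crops, seed=0):
    """Return n_crops random sub-windows of a 2-D image given as a list of rows."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    crops = []
    for _ in range(n_crops):
        # Pick the top-left corner so the window stays inside the image.
        top = rng.randint(0, height - crop_height)
        left = rng.randint(0, width - crop_width)
        crops.append([row[left:left + crop_width]
                      for row in image[top:top + crop_height]])
    return crops


# A fake 4x6 "image" whose pixel values encode their own position.
image = [[10 * r + c for c in range(6)] for r in range(4)]
crops = random_crops(image, crop_height=3, crop_width=4, n_crops=5)
```

Each crop keeps the label of the original picture, so one labelled photo turns into several slightly different training examples.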
That’s all for now, until next year 🙂

5 interesting things (16/05/2016)

Similar Wikipedia Pages – This post presents a Chrome extension that suggests similar Wikipedia pages. In other words, it is a recommendation system for Wikipedia pages. They are not the first to do it: Wikiwand, as well as Wikipedia itself (in beta), created such a feature. They compare the results briefly, and a deeper dive would have been interesting. One obvious difference between them and Wikipedia is that while Wikipedia can analyze user behavior and use collaborative filtering or a hybrid approach, they only have access to the content data. However, they could integrate page view data or edit data to inspect trends and bring more temporal data into their recommendations.



Cron best practices – while it is very common to set up a cron job, it is sometimes not fully understood: which user runs the command, where it runs, where to put the scripts, how to orchestrate cron jobs, and so on. A life hack post.



Resume Dreamer – since the release of TensorFlow as open source, many people have experimented with TensorFlow on different tasks and in different settings and shared their work. This post by the untapt team shows their efforts at generating CVs using TensorFlow. While the task is amusing, and one could think of the output as a draft for one’s own CV, I missed some technical details about the process. Seeing that the engine recognized the paragraph structure, I wonder how it would deal with date patterns.



Apache Libcloud – a Python library for interacting with many of the popular cloud service providers using a unified API. This library has two major use cases as far as I can see: migration between providers, and working with multiple providers on a regular basis, which I believe is rare. In both cases this package can decrease the “lock-in” to a specific provider, and that’s a good thing.



Graph DataFrames – Databricks introduced GraphFrames, a graph processing library for Apache Spark. The functionality of Apache Spark is being extended step by step, and now it is time for graph algorithms. My guess is that the query syntax will change a bit in the future to support user-defined functions and so on, but that’s a start.