I got a diversity scholarship from NumFOCUS to attend the PyData Berlin event. NumFOCUS is an NGO which supports open source data science projects, among them Jupyter, matplotlib, NumPy, pandas, etc.
This post is not a summary of the event or of the talks I attended, but rather highlights a subset of them.
Keynote – Olivier Grisel (Inria)
Grisel talked about the evolution of predictive modeling and about scaling it. Why do we need to scale predictive modeling?
- I/O intensive operations – e.g. feature engineering and model serving.
- CPU intensive operations – e.g. hyper-parameter search and cross validation.
PySpark seems like a very legitimate tool but it has its drawbacks –
- No pure Python local mode – impossible to use a profiler or ipdb.
- There is latency induced by the network architecture (Python driver -> Scala (JVM) -> Python worker).
- Tracebacks are sometimes hard to understand as there is a mix of Scala and Python errors and log data.
Grisel suggested using Dask and Distributed instead, as native Python packages for parallel and distributed computing.
Being a young project, Dask also has its limitations – mainly no distributed shuffles, which means it does not support distributed merge, join, groupby and aggregation operations at the moment.
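To make the programming model concrete, here is a minimal sketch (my own toy example, not from the talk) of how Dask builds a task graph lazily with `dask.delayed` and then executes the independent tasks in parallel:

```python
import dask

# dask.delayed wraps plain functions; calling them builds a task graph lazily
@dask.delayed
def load(part):
    # stand-in for an I/O-heavy load of one data partition
    return list(range(part * 10, part * 10 + 10))

@dask.delayed
def summarize(rows):
    # stand-in for a CPU-heavy per-partition computation
    return sum(rows)

@dask.delayed
def combine(totals):
    return sum(totals)

# Nothing has run yet -- only a graph has been built
graph = combine([summarize(load(p)) for p in range(4)])

# compute() executes the graph, scheduling independent tasks in parallel
result = graph.compute()
print(result)  # sum of 0..39 -> 780
```

The per-partition tasks are independent, which is exactly what Dask parallelizes well; the distributed shuffle needed by groupby and join is the part that was still missing.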
Frontera: open source, large scale web crawling framework (Alexander Sibiryakov, ScrapingHub)
When I read the talk abstract I was not sure what the difference between Frontera and Scrapy was. After hearing the talk, it is clear the goal of Frontera is actually different – scheduling crawling and the crawling strategy. In their architecture they use Scrapy for crawling, but other crawlers can be used as well.
After introducing Frontera, Sibiryakov showed the results of crawling the Spanish internet and the problems they faced. Some of the solutions were quite trivial – indexing the data differently to avoid hotspots, caching, limiting the depth of crawling / number of pages per host, etc. I cannot say that I was fully convinced.
Setting up predictive analytics services with Palladium (Andreas Lattner, Otto group)
A common evolution of in-house tooling – 80% development time, 20% deployment overhead which tends to repeat between different projects. So why not build a framework and try to automate it? This is exactly the reason the Otto group developed Palladium – to ease the development, deployment and integration of predictive analytics services.
The main limitation of Palladium is that it does not support distributed computation. Another issue is controlling the results in real time – boosting, filtering, etc. At the moment, prediction.io (not written in Python) is in a more mature state.
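The core idea – factor the repeating deployment overhead into generic framework code and keep only a declarative config per project – can be sketched in a few lines. This is an illustration of the pattern only, not Palladium's actual API:

```python
# Toy sketch of a config-driven prediction service: the framework code is
# generic, and each project supplies only a config. Illustrative only --
# not Palladium's actual API.

def load_dataset():
    # stand-in dataset loader: returns (features, labels)
    return [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]

class ThresholdModel:
    """Trivial stand-in model: predicts 1 above a learned threshold."""
    def fit(self, X, y):
        ones = [x[0] for x, label in zip(X, y) if label == 1]
        zeros = [x[0] for x, label in zip(X, y) if label == 0]
        self.threshold = (min(ones) + max(zeros)) / 2
        return self

    def predict(self, X):
        return [int(x[0] > self.threshold) for x in X]

# Per-project configuration: the only part that changes between projects
CONFIG = {
    "dataset_loader": load_dataset,
    "model": ThresholdModel,
}

def fit(config):
    # Generic "framework" code, shared across all projects
    X, y = config["dataset_loader"]()
    return config["model"]().fit(X, y)

model = fit(CONFIG)
print(model.predict([[0.2], [2.5]]))  # -> [0, 1]
```

Swapping the dataset loader or the model class is then a config change, not new service code – which is the 20% of overhead the framework absorbs.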
Spotting trends and tailoring recommendations: PySpark on Big Data in fashion (Martina Pugliese, Mallzee)
Mallzee, as stated by Pugliese, is “the Tinder of fashion”. I.e. the application shows you a stream of fashion items from multiple providers and you swipe them left and right according to your preferences. Their current big challenge is to produce valuable recommendations for each customer.
Their input is, on one hand, the items and the item information, which they crawl from the providers' websites and normalize to fit their set of tags and properties, and of course the brand itself. The second input is user behavior – swipes, buy actions, favourite and non-favourite brands, etc. Their current choice is training a random forest for each user based on their actions. This approach can work for users with a lot of signals, specifically positive signals (buys and positive swipes). Possibly the next step will be to cluster the users in order to learn more about users with few positive signals.
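A minimal sketch of the per-user model idea – one classifier per user, trained on that user's swipe history. The item features, labels and preference here are entirely invented for illustration, not Mallzee's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Toy swipe history for one user: each row is an item, e.g. (price, sportiness),
# label 1 = positive signal (right swipe / buy). Pretend alice likes cheap items.
X = rng.rand(40, 2)
y = (X[:, 0] < 0.5).astype(int)

# One model per user, keyed by user id
models = {"alice": RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)}

# Score unseen items for alice: a cheap item and an expensive one
new_items = np.array([[0.1, 0.9], [0.9, 0.1]])
preds = models["alice"].predict(new_items)
print(preds)  # cheap item -> 1, expensive item -> 0
```

The weakness the talk acknowledged is visible here too: the model only works when a user has enough labeled history, hence the idea of clustering users with few positive signals.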
Practical Word2Vec in Gensim (Lev Konstantinovskiy, Gensim)
Gensim is “topic modeling for humans”. It implements many interesting algorithms including LDA, word2vec and doc2vec – lots of potential and interesting to play with.
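To make the word2vec idea concrete (a sketch of the skip-gram training-pair generation only, not Gensim's API): skip-gram word2vec learns embeddings by predicting the context words within a window around each center word, so the raw training data is a stream of (center, context) pairs:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by skip-gram word2vec."""
    pairs = []
    for i, center in enumerate(tokens):
        # context = words at distance <= window, excluding the center itself
        start = max(0, i - window)
        for j in range(start, min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "topic modeling for humans".split()
print(skipgram_pairs(sentence, window=1))
```

Gensim handles this pair generation (plus subsampling and negative sampling) internally; you just hand its `Word2Vec` model a corpus of tokenized sentences.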
But this tutorial was a mess... the room didn't fit a tutorial – not enough space, no tables people could put their computers on to actually run the code.
Konstantinovskiy wanted to present many options and ideas and ran too fast between the different algorithms without really explaining them. On the other hand there was no real focus on running the code and showing the package API. Unfortunately I feel this talk was a bit mishandled.
Bayesian Optimization and its application to Neural Networks (Moritz Neeb, TU Berlin)
What I would take from this talk is approaching hyper-parameter tuning as an optimization problem. For example, when we use grid search for hyper-parameter tuning we invest a lot of resources, but the experiments are independent and don't learn from one another. Instead we can introduce a method with some interaction between the different experiments.
Neeb introduced an approach which models the objective as a Gaussian process and, at each step, evaluates the point with the maximal expected gain.
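A compressed numpy sketch of that loop (using an upper-confidence-bound acquisition for brevity rather than expected improvement; the kernel, objective and all constants are my own, not from the talk):

```python
import numpy as np

def rbf(a, b, length=0.3):
    # Squared-exponential kernel between two sets of 1-D points
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def objective(x):
    # The "expensive" black box we want to maximize (optimum at x = 0.7)
    return -(x - 0.7) ** 2

grid = np.linspace(0, 1, 201)          # candidate hyper-parameter values
X = np.array([0.0, 1.0])               # two initial evaluations
y = objective(X)

for _ in range(10):
    # Gaussian process posterior on the grid, conditioned on (X, y)
    K = rbf(X, X) + 1e-6 * np.eye(len(X))
    Ks = rbf(grid, X)
    Kinv = np.linalg.inv(K)
    mean = Ks @ Kinv @ y
    var = 1.0 - np.sum((Ks @ Kinv) * Ks, axis=1)
    # Acquisition: evaluate where the plausible gain is largest (mean + 2*std)
    ucb = mean + 2.0 * np.sqrt(np.clip(var, 0.0, None))
    x_next = grid[np.argmax(ucb)]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

best = X[np.argmax(y)]
print(best)
```

Unlike grid search, every new evaluation updates the posterior, so each experiment learns from all the previous ones.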
He also mentioned a few Python libraries –
- Spearmint – designed to automatically run experiments (thus the code name spearmint) in a manner that iteratively adjusts a number of parameters so as to minimize some objective in as few runs as possible.
- HPOlib – hyperparameter optimization library
- Hyperopt – a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.
Keynote – Wes McKinney (Cloudera)
Wes McKinney is probably Mr. Open Source – he is involved in and leading some of the best known data science open source projects (in Python but not only) – pandas, Apache Parquet, etc.
His talk had roughly two parts – the community, code of conduct, etc., and the challenges that he believes will play a major role in the data science / Python / big data communities in the near future. Some of the projects / ideas / challenges he mentioned –
- manylinux – Python wheels that work on (almost) any Linux
What’s new in Deep Learning (Dr Kashif Rasul, Zalando SE)
A fast-paced survey of several recent papers in deep learning. Beside one mention of Theano and a short code example, there was no strong connection to Python. I would have settled for less code and a slower pace.
Some of the papers he pointed to –
- Understanding the difficulty of training deep feedforward neural networks (Glorot and Bengio) – The term Xavier initialization comes from here. See here for additional information.
- Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (He, Zhang, Ren, Sun – Microsoft Research)
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe and Szegedy – Google) – a method to accelerate the training of deep neural networks by modifying the distribution of activations. See tutorial here.
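The Xavier initialization from the first paper fits in a few lines: weights are drawn uniformly with the variance scaled by the layer's fan-in and fan-out, so activations neither shrink nor explode as they pass through deep networks:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random):
    # Glorot & Bengio: Var(W) = 2 / (fan_in + fan_out), achieved by
    # sampling U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.std())  # close to sqrt(2 / (256 + 128)) ~= 0.072
```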
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons (Jose Quesada, Data Science Retreat)
Jose Quesada presented their experience with teaching and using Python's scikit-learn versus Scala Spark. The most interesting points in my opinion were the PySpark limitations and future directions, and of course the comparison of existing features.
One of the most important insights was to use Spark DataFrames: spark.ml is built on top of DataFrames and will replace the RDD-based Spark MLlib API in the future.
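For comparison, the scikit-learn side of such a pipeline looks like this (a generic sketch with toy data; the talk's actual pipeline is not reproduced here). spark.ml's `Pipeline` plays the same role over DataFrames:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data: two well-separated clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [5, 5]])
y = np.array([0] * 50 + [1] * 50)

# A full pipeline: preprocessing + model fit (and cross-validated) as one unit
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
preds = pipe.predict([[0, 0], [5, 5]])
print(preds)  # -> [0 1]
```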
Data Integration in the World of Microservices (Valentine Gogichashvili, Zalando)
Gogichashvili is the head of Data Engineering at Zalando and has been working there for 5.5 years already, which means he was there during the business, structural and technological changes the organization went through.
While he talked about micro-services architecture, I think there were two other interesting points in the talk –
- Team structure – every team is responsible for its own components and infrastructure (including naming) while aligning to engineering guidelines such as API structure, data types, etc., which are set by cross-team guilds.
- Open source code – whenever a team starts a new project, by default the project will grow to be open source unless they give a reason why not. Zalando open source projects can be viewed here.
Brand recognition in real-life photos using deep learning (Lukasz Czarnecki, Samsung)
Czarnecki presented his project from Data Science Retreat – brand recognition in Instagram photos, where he combined neural networks with an SVM. He showed the baseline he started from (+ some preprocessing) and the steps he took to improve his results.
Some of the steps he took –
- Increasing the training set from 300 images per brand to 800.
- Multiplying the training set by cropping parts of the pictures.
- Looking at the errors and adding training examples to make the NN more general.
- Increasing the threshold of the SVM.
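The cropping step above can be sketched like this, with numpy arrays standing in for images (illustrative only – I don't know Czarnecki's exact crop scheme):

```python
import numpy as np

def corner_crops(image, crop_h, crop_w):
    """Multiply one training image into five crops: four corners + center."""
    h, w = image.shape[:2]
    tops = [0, 0, h - crop_h, h - crop_h, (h - crop_h) // 2]
    lefts = [0, w - crop_w, 0, w - crop_w, (w - crop_w) // 2]
    return [image[t:t + crop_h, l:l + crop_w] for t, l in zip(tops, lefts)]

image = np.arange(64 * 64 * 3).reshape(64, 64, 3)  # fake 64x64 RGB image
crops = corner_crops(image, 56, 56)
print(len(crops), crops[0].shape)  # 5 crops of shape (56, 56, 3)
```

Each labeled photo becomes five training examples, which both enlarges the training set and makes the network less sensitive to where the brand logo sits in the frame.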
That’s all for now, until next year 🙂