5 interesting things (10/01/2016)

Everything.me open source their inheritance – Everything.me was an Israeli startup that closed a few weeks ago. They have now open sourced major parts of their code and tools, including their prediction algorithm, Re:dash and others.

https://medium.com/@joeysim/everythingme-open-and-out-6ed94b436e4c

Python 3 module of the week – over the years, Python Module of the Week by Doug Hellmann has been one of the most reliable documentation resources for Python’s standard library. The documentation is always accompanied by very good examples for almost every functionality. It is also available as a command line tool and has been translated to Chinese, German, Italian, Spanish and Japanese. And.. it is now updated to Python version 3.5. Kudos.

https://pymotw.com/3/

The Star Wars social network – although I’m not a Star Wars fan, I think this provides a very accessible introduction to graph algorithms and measures.

http://evelinag.com/blog/2015/12-15-star-wars-social-network/

D3 in Jupyter – an intersection of 2 tools which I use quite a lot and find very important for data scientists. Not those tools specifically, but tools which make the data science magic more approachable to others, so one can share one’s findings and get feedback.

http://multithreaded.stitchfix.com/blog/2015/12/15/d3-jupyter/

Bonus – Building interactive dashboards with Jupyter – http://blog.dominodatalab.com/interactive-dashboards-in-jupyter/

Tl;dr man page – given a bash command, creates a tl;dr of its man page. More of a gimmick, but a nice one.

http://www.ostera.io/tldr.jsx/

Surviving Black Friday & Turning Behavioural Signals Into User Profiles

I was on a visit to Israel and went to the “Surviving Black Friday & Turning Behavioural Signals Into User Profiles” meetup. It is a very long name which actually refers to 2 talks. The meetup took place in the Sears Israel offices; Sears Israel is the department behind Shop Your Way.

The first talk – “Surviving Black Friday” – was given by Omri Fima, a resilience tech lead at Sears. Omri talked about resilience and scalability lessons learned from Black Friday. Every internet shopping site knows that traffic is much higher on Black Friday, so how can you prepare and test whether your system can deal with such a load? What happens if one service fails? What is a graceful failure and what is less graceful? He presented 7 steps to make your service more stable and mentioned a few tools, both for testing and development.
Pablo Rosenman, VP Development @ Adience, gave the second talk – “Turning Behavioural Signals Into User Profiles” (slides, video). Pablo presented 2 of Adience’s products – Adience SDK and Events SDK – and showed how they use AWS services in their pipeline. He talked about the Adience pipeline and what their main concerns and focus were when designing and implementing it – scalability, decoupling and cost effectiveness. In the end he also presented what they would do differently if they were designing it today.
All in all, interesting talks and a very good atmosphere. I hope Omri’s slides will also be made available online.

5 interesting things (13/12/2015)

SystemML – a distributed and declarative machine learning platform. Looks like a promising project; it was initially developed by IBM and has now joined the Apache Software Foundation. I wonder how it will influence the development of Spark MLlib.

http://systemml.apache.org/

Monopoly as a Markov chain – I guess I am developing a fetish for Markov chains. Although the model is not always realistic (as in this case..), it is amazing what we can get out of it and the approximate simulations we can create. This post simulates a Monopoly game using Markov chains and does a very interesting job. Monopoly is linear in the sense that you must move according to the dice and the choices you make are limited (buy or don’t buy). Backgammon, in contrast, also involves strategy, which makes Monopoly the better candidate for Markov chain modeling.

http://koaning.io/monopoly-simulations.html
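
The post’s idea can be sketched with a toy transition matrix: a 40-square board where you move by the sum of two dice, plus a single “Go To Jail” rule (Chance cards, doubles and the other mechanics are ignored here – this is an illustration, not the post’s actual model):

```python
import numpy as np

N = 40  # squares on a Monopoly board

# P(sum of two dice = s) for s in 2..12
dice_probs = {s: (6 - abs(s - 7)) / 36 for s in range(2, 13)}

# Transition matrix: from each square, move by the dice sum;
# landing on "Go To Jail" (square 30) sends you to Jail (square 10).
P = np.zeros((N, N))
for i in range(N):
    for s, p in dice_probs.items():
        j = (i + s) % N
        if j == 30:
            j = 10
        P[i, j] += p

# Stationary distribution via power iteration: long-run visit frequencies.
v = np.full(N, 1.0 / N)
for _ in range(1000):
    v = v @ P

# Jail is visited most often; "Go To Jail" itself is never occupied.
print(v[10], v[30])
```

Even this stripped-down model reproduces the post’s headline result: the Jail square dominates the long-run distribution, which is why the orange and red streets it feeds into are valuable.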

Vocabulary – “Python Module to get Meanings, Synonyms and what not for a given word”. This module brands itself as an alternative to NLTK, presenting data about meaning, synonyms, antonyms, part of speech, pronunciation, etc. with a leaner, more Pythonic approach. I don’t know if it is as good as NLTK or will evolve to that point, but it sure looks like an alternative worth checking.

http://vocabulary.readthedocs.org/en/latest/

Probability recap – if you forgot your probability class from university, this will probably be a good recap. If you work as a data scientist you probably use these distributions daily, but the good visualizations and examples make it a good way to share your knowledge with colleagues.

http://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/
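
As a tiny illustration of the kind of relationship the crib sheet maps out: a Binomial draw is a sum of Bernoulli draws, and by the CLT it is approximately normal for large n (the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, samples = 100, 0.3, 200_000

# A Binomial(n, p) draw is the sum of n Bernoulli(p) draws.
bernoulli_sums = rng.binomial(1, p, size=(samples, n)).sum(axis=1)

# By the CLT it is approximately Normal(np, np(1-p)) for large n.
print(bernoulli_sums.mean())  # ≈ n*p = 30
print(bernoulli_sums.var())   # ≈ n*p*(1-p) = 21
```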

What we talk about when we talk about distributed systems – for once, a non-misleading title. Thinking about the distributed systems course I took in grad school, this would have been a great introduction.

http://videlalvaro.github.io/2015/12/learning-about-distributed-systems.html

5 interesting things (1/12/2015)

Improving My CLI’s Autocomplete with Markov Chains – Markov chains are the basis for many auto-complete algorithms we know and use on a daily basis, e.g. keyboards on mobile devices. In this case it is a developer hack to improve auto-complete in a development tool. It is always nice when theory comes to life.

http://nicolewhite.github.io/2015/10/05/improving-cycli-autocomplete-markov-chains.html
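
The underlying idea can be sketched as a first-order Markov model over command history (this is an illustration of the technique, not the post’s actual cycli code):

```python
from collections import defaultdict, Counter

def build_model(history):
    """First-order Markov model: count which token follows which."""
    model = defaultdict(Counter)
    for line in history:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def suggest(model, prev, k=3):
    """Most likely next tokens after `prev`."""
    return [t for t, _ in model[prev].most_common(k)]

history = ["git status", "git commit -m", "git push origin", "git status"]
model = build_model(history)
print(suggest(model, "git"))  # 'status' ranks first (seen twice)
```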

10 more lessons learned from building Machine Learning systems – slides of a presentation by Xavier Amatriain, VP Engineering at Quora (previously Director of Algorithms Engineering @Netflix). A very insightful presentation (I would have loved to hear the full talk). The name refers to a lecture by Amatriain called “10 lessons learned from building Machine Learning systems” given exactly a year earlier.

http://www.slideshare.net/xamat/10-more-lessons-learned-from-building-machine-learning-systems

http://www.slideshare.net/xamat/10-lessons-learned-from-building-machine-learning-systems

Beyond One-Hot: an exploration of categorical variables – it is not all about numbers; in many cases features are not numeric. If we are lucky, features will be binary (has \ does not have a symptom, spam \ not spam) or ordinal (amount of pain a patient experiences, etc.). But sometimes they are neither – e.g. a state, a color, etc. What then? This post compares several techniques to deal with categorical variables. While it is very basic, it is well explained (although examples would have helped) and it can give great intuition to someone who faces these problems for the first time.

http://willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
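
As a minimal example of the two basic encodings the post starts from (using pandas here, which is my choice and not necessarily the post’s):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot: one binary column per category, no implied ordering.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: only appropriate when categories have a natural order.
pain = pd.Series(["low", "high", "medium"])
ordinal = pain.map({"low": 0, "medium": 1, "high": 2})

print(one_hot)
print(list(ordinal))  # [0, 2, 1]
```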

Bandit Algorithms for Bullying – Getting More Lunch Money – this post explains bandit algorithms using a very common, easy-to-understand example. However, I would have built the post a bit differently: there is a lot of text telling the story compared to the scientific parts. In my opinion the scientific parts should be emphasized a bit more (bold text, bullets, etc.).

http://1oclockbuzz.com/2015/11/24/bandit-algorithms-for-bullying-getting-more-lunch-money/
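
A minimal epsilon-greedy bandit, the simplest member of the family the post discusses (the probabilities and “victims” below are made up for the sketch):

```python
import random

def epsilon_greedy(true_probs, steps=10_000, eps=0.1, seed=0):
    """Pick the best-looking arm most of the time, explore with prob. eps."""
    rng = random.Random(seed)
    counts = [0] * len(true_probs)
    values = [0.0] * len(true_probs)   # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(len(true_probs))               # explore
        else:
            arm = max(range(len(true_probs)), key=values.__getitem__)  # exploit
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total += reward
    return total / steps

# Three "victims" paying lunch money with different probabilities;
# the average reward converges towards the best arm's 0.8.
print(epsilon_greedy([0.2, 0.5, 0.8]))
```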

C.H.I.P vs Pi Zero – the new Pi Zero made a lot of buzz last week, offering a computer for $5. However, the Pi Zero is not the only player in the field of sub-$10 computers. This post compares the Pi Zero’s and C.H.I.P’s specs and abilities. There are many comments saying the comparison is biased towards C.H.I.P (ignoring shipping costs, unfair comparison of cable costs, reputation, etc.) but overall I think it is worth reading.

http://makezine.com/2015/11/28/chip-vs-pi-zero/

KNN approximation in Apache Spark

K-nearest-neighbors is a very well known classification algorithm. It is based on the phrase – “show me who your friends are and I’ll tell you who you are”.

Apache Spark MLlib contains several algorithms, including linear regression, k-means, etc. But it does not currently include an implementation of KNN. One of the reasons for that is the time complexity it requires (roughly n^2 where n is the number of items, ignoring the dimension).
In the Apache Spark JIRA you can see 2 tickets involving this issue – SPARK-2335 and SPARK-2336. The first asks for a KNN feature and discusses the difficulties. The second, opened based on the first, discusses approximations to KNN and aims to implement one.

I implemented a very naive approximation of the KNN algorithm on Apache Spark with a similarity function as the distance (looking for the max) rather than the Euclidean distance, but this can be easily changed (change the distance function and the sorting key in line 71).

The algorithm is based on splitting the data into partitions and calculating item distances only within the same partition. You can increase the accuracy either by decreasing the number of partitions (comparing each item to more items) or by repeating the process several times (repartitioning differently every time) and choosing the best results.

This calculation retrieves the list of the most similar neighbors and then one can decide how to use this data.

My gist – https://gist.github.com/tomron/70e5fefe128214b7d2a1
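
The same partition-based idea can be sketched in plain Python (this is an illustration of the approach, not the gist’s actual Spark code; `approx_nearest` and its parameters are made-up names):

```python
import random

def approx_nearest(items, distance, n_partitions=4, repeats=3, seed=0):
    """Approximate each item's nearest neighbour by comparing it only to
    items in the same random partition; repeat with fresh partitions and
    keep the best candidate found so far."""
    rng = random.Random(seed)
    best = {}  # item index -> (distance, neighbour index)
    for _ in range(repeats):
        indices = list(range(len(items)))
        rng.shuffle(indices)
        size = (len(indices) + n_partitions - 1) // n_partitions
        for start in range(0, len(indices), size):
            part = indices[start:start + size]
            for i in part:
                for j in part:
                    if i == j:
                        continue
                    d = distance(items[i], items[j])
                    if i not in best or d < best[i][0]:
                        best[i] = (d, j)
    return best

points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
dist = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
print(approx_nearest(points, dist, n_partitions=2))
```

Fewer partitions mean more comparisons per item (better accuracy, closer to the exhaustive n^2 cost); more repeats give each item more chances to land next to its true neighbour.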

5 interesting things (15/11/2015)

Counting things in Python – this post shows very nicely and simply how Python and Pythonic writing have changed over the years. One of the interesting things in this post is the analogy to natural languages. Natural languages also change and evolve over time – slang, new phrases, outdated expressions etc. – and apparently so do programming languages.

http://treyhunner.com/2015/11/counting-things-in-python/
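
A quick before-and-after of the kind of evolution the post walks through:

```python
from collections import Counter

words = ["spam", "eggs", "spam", "ham", "spam"]

# The old-school way: a plain dict with explicit membership checks.
counts = {}
for w in words:
    if w not in counts:
        counts[w] = 0
    counts[w] += 1

# The modern, Pythonic way: collections.Counter does it in one line.
print(Counter(words).most_common(1))  # → [('spam', 3)]
```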

Travelling Beer Drinker problem – figuring out the shortest road trip to visit the best microbreweries in the US. Finally, a good use for the travelling salesman problem :). Data-science-wise there is not much to this problem – connecting the dots between location data and the Google API – but that is an exciting ground for this connection.

http://flowingdata.com/2015/10/26/top-brewery-road-trip-routed-algorithmically/
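
A common baseline for this kind of routing is the nearest-neighbour heuristic, sketched here on made-up coordinates (the post’s actual route was computed with its own method, not necessarily this one):

```python
import math

def nearest_neighbour_tour(cities, start=0):
    """Greedy TSP heuristic: always hop to the closest unvisited city.
    Not optimal, but a cheap baseline for road-trip-style routing."""
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    unvisited = set(range(len(cities))) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda i: dist(cities[last], cities[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Hypothetical brewery coordinates (not the post's real data).
breweries = [(0, 0), (1, 0), (1, 1), (0, 2), (5, 5)]
print(nearest_neighbour_tour(breweries))  # → [0, 1, 2, 3, 4]
```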

Hacker News 9-year statistics – an analysis of Hacker News activity, volume, users and trends over the last 9 years. Interesting especially because it became such a central place to consume technology news.

Evolving strategies for an iterated Prisoner’s Dilemma tournament – I don’t get to read many posts about evolutionary algorithms or practical game theory, so such an in-depth post with a Python implementation is really refreshing.
Visualizing Chess data with ggplot – although I’m not an R user, I love chess (and I’m a horrible player) and I loved the analysis.
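
The iterated Prisoner’s Dilemma setup behind such tournaments can be sketched minimally (not the post’s implementation – just the classic payoff matrix and two textbook strategies):

```python
# Payoffs (mine, theirs) for each pair of cooperate/defect choices.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    """Cooperate first, then copy the opponent's last move."""
    return history[-1] if history else "C"

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b = [], []   # what each player has seen the *other* do
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strategy_a(hist_a), strategy_b(hist_b)
        pa, pb = PAYOFF[(a, b)]
        score_a += pa
        score_b += pb
        hist_a.append(b)
        hist_b.append(a)
    return score_a, score_b

# Tit-for-tat loses only the first round, then matches defection.
print(play(tit_for_tat, always_defect))  # → (9, 14)
```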

Five interesting things (25/10/2015)

Whatsapp CLI – control your server using Whatsapp. The next step for me is to connect my home (heating, lights, doors) and control it via Whatsapp. To be honest, I assume some such implementations already exist.

https://github.com/KarimJedda/whatsappcli

Receipt parser – a project which was created as part of a Trivago hackathon. I have actually thought about the need for this product many times – first for personal accounting, keeping an eye on my spending; second for taking it to the next step – alerts about things I should buy, alerts about things I don’t need and alerts about cheaper prices.

http://tech.trivago.com/2015/10/06/python_receipt_parser/

Time magazine visual trends – good ideas are priceless, and when the implementation is also clean and nice and reveals interesting insights, it is even more exciting.

http://www.pyimagesearch.com/2015/10/19/analyzing-91-years-of-time-magazine-covers-for-visual-trends

Google Sheets to Elasticsearch – to tell the truth, I could not think of a worse architecture than using a Google spreadsheet as a database (or as a proxy on the way to a DB). Having said that, Elastic released a Google Sheets add-on to import spreadsheet content into an Elasticsearch instance.

https://www.elastic.co/blog/introducing-google-sheets-to-elasticsearch-add-on

Snakefooding Python – snakefood is a tool to create dependency graphs. This post shows the dependency graphs for some very common Python libraries (Flask, Django, Celery, requests, etc.). It presents the pros and cons of using snakefood – what it exposes and what it does not (many small files -> many imports -> complex dependency graph vs. one file of spaghetti code -> no imports -> a very clear, simple graph). I see it as a tool that supports development and helps detect unnecessary dependencies.

http://grokcode.com/864/snakefooding-python-code-for-complexity-visualization/

AWS loft Berlin

This week Amazon opened a loft in Berlin which is supposed to be open for 4 weeks. The loft is currently a pilot and there are several other lofts around the world. I think it is a very good strategic call to have it in Berlin, as the startup scene is emerging and many people want to try and learn more about cloud services while the hands-on experience is sometimes limited.

So what is going on in the loft? It is open every day 10-18 and there is an Amazon employee on duty whom you can consult regarding AWS services. A very inviting work space. And workshops, demos, bootcamps, etc. All, of course, related to AWS services.

I took part in two workshops on Thursday morning – “An overview of Hadoop & Spark, using Amazon Elastic MapReduce” and “Processing streams of data with Amazon Kinesis (and other tools)”. Both lectures were given by Michael Hanisch, a solutions architect at Amazon. The first talk was a bit messy as it covered many topics, and it ended up jumping here and there between general things about Hadoop, tips about EMR, changes in the AMI concepts and versioning, and clues about Spark.

The second talk was much more focused. It started by introducing the need for Amazon Kinesis, then explained the architecture – producers, streams, shards, clients – clearing up the capabilities and constraints and also mentioning the Kinesis autoscaling utils. The next step was a deeper dive into the Kinesis Producer Library and the Kinesis Client Library, moving forward to Kinesis Firehose (which was introduced at re:Invent last week) and integration with additional input and output sources and AWS services. The talk ended with tips and best practices. AWS Lambda was also mentioned several times during the talk as a tool to process stream data.

Quite exciting time to be in Berlin.

5 interesting things (06/10/2015)

Restaurant recommendations – I have read quite a lot about recommendation systems lately, and I love this post because it deals with the restaurant domain, while many posts related to recommendation systems refer to music, television and movies. Geospatial features are very important here compared to movie, music and television recommendations.

http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opentable

Data Science workflow – this blog post presents a well structured approach to the data science process. While I loved the structured approach, the entry point – “get data, have fun with it” – is twisted. I believe that when working on a product, most of the time you will want to solve a problem or introduce a new feature, i.e. you already have the question you want to answer, rather than explore a dataset and think about the questions you can answer with it. I also missed a part about documenting your work.

http://blog.binaryedge.io/2015/09/08/the-data-science-workflow/

AWS in plain English – or AWS for humans. If you are not there yet, or not familiar with the different services, this is a nice way to get introduced to the terminology.

https://www.expeditedssl.com/aws-in-plain-english

Federer vs Djokovic – why not Nadal? It seems like kind of an abuse of the product, but it surely exposes the product and makes some buzz.

https://www.elastic.co/blog/building-dashboards-using-data-from-the-federer-&-djokovic-tennis-rivalry

What to do with small data – big data has been one of the buzzwords of the last few years. While it was previously whispered only by tech people, it is now a common, well known phrase. But many companies do not really have big data; they have few users and need to perform well for those users. One day they might have big data, many users and tons of features – until then, there are some clues in this post.

https://medium.com/rants-on-machine-learning/what-to-do-with-small-data-d253254d1a89

#dmconf15

I got a free ticket to the distributed matters conference by winning the challenge of the Python big data analytics meetup group. The lineup was impressive (including Kyle Kingsbury from the “call me maybe” project), the topics were very interesting and the location was very nice (first time I heard a talk in a disco party hall) and close to my house – a perfect match. Due to earlier commitments I could only stay for half of the day, but what a day 🙂

It started with Kyle Kingsbury (Stripe) as a keynote, talking about “call me maybe” and the Jepsen test. The basic setup of the test is creating a cluster of some database, injecting random events (read, write, update, query, etc.) and verifying the database’s behavior. The next level includes injecting all kinds of failures – network errors, node pauses, etc. Following the description of the test, Kingsbury gave a brief summary of the results. The “call me maybe” results create pressure on the different database vendors to improve their products and align with the promises they make. It definitely serves as the watchdog of this industry. It is sometimes amazing how untested some products are.

The sentence I took with me from his talk is that choosing a database is a contract, and you should check all the relevant aspects of it. Settings, complexity and needs sometimes change over time, but as long as the initial contract is honored there are no complaints.

The second talk I went to was Michael Hackstein‘s (ArangoDB) talk “NoSQL meets microservices”. Hackstein presented the problem of multiple calls to multiple databases, where each answers a different need, and presented the solution – a multi-model NoSQL database, where a single call can answer all needs. ArangoDB is an example of such a database; OrientDB is also a major competitor. The multi-model database problem might be phrased as just multi-model data, i.e. document data, graph data, key-value data, etc. (see footnote 1).

The rest of the talk focused on ArangoDB different features –

  • Foxx – a REST API over the database which works as a microservice.
  • drivers – including Python, PHP, Java, JS, Ruby, etc.
  • AQL – the ArangoDB query language

The next talk was given by Charity Majors (Parse – acquired by Facebook), who introduced herself as someone who “hates software for a living”. This talk was a really, really great one. Majors is super experienced, has worked with many technologies and is very charismatic. She spoke about her many years of experience upgrading databases. Sounds easy? Well.. not that much. Stepping back, why would you want to upgrade a database? New features, improved performance, bug fixes, etc. So what can go wrong? Well, many things: downtime, losing data, losing transactions, different hashing and indexing, optimizations which are not optimized for your data, changes in clients, and much more fun. The good news – you have enough interesting and important data to care for.

What should you do about it? tl;dr – be paranoid. Longer version –

  • Change only one component at a time
  • Read the docs, change logs and release notes (also mentioned by Kyle Kingsbury)
  • Don’t believe database vendors’ benchmarks; test for your expected load (if you can predict it – another painful point) – at least 24h of data, and make sure to clear the cache.
  • Take consistent snapshots
  • Prepare for rollback option.

I really hope her slides will be published soon since it was a very informative talk.

Tour de Force: Stream Based Text Analysis by Hendrik Saly and Stefan Siprell (codecentric).

This talk was classified as expert level and I expected it to be a bit more than what it actually was. Many buzzwords were thrown around – R, machine learning, Twitter, Apache Spark, Elasticsearch, Kibana, AWS, etc. – but the big picture was missing: how those technologies relate to each other, and what the alternatives are. For example, for the goal of enriching the data, R can easily be replaced with Python. Or, in order to read from Twitter, why use files and not Spark’s streaming capabilities (or some queue)? Besides Kibana being a great and easy-to-use tool, the reason for preferring Elasticsearch over other options was not clear. To sum up – I also missed some reference to the distributed properties of those tools. This is the reason we were all there, after all, no?

Footnotes

1. Multi-model data is just a way of thinking about data structures in patterns we are used to. Maybe there are other, more efficient ways to think about this data?

2. Sponsor-wise – on one hand there were a few sponsors that didn’t give any talk, for example idealo. On the other hand there were a few companies I expected to sponsor the conference, e.g. crate.io, Elastic and MongoDB, which was mentioned a lot but unfortunately didn’t give any dedicated talk.

3. Database vendors’ business models – I think that the most popular model as of today is to develop a database, possibly open source (Elasticsearch, ArangoDB, etc.), and make money from supporting it. Is there any other way to make money from database vendoring these days (hosting not included under vendoring)?

4. Diversity in tech – I saw no more than 5 women in the crowd.