5 interesting things (23/07/2015)

Diversity in tech? This post describes a black woman’s experience in the Stanford Computer Science major. As part of the TOA events I went to Zalando’s diversity in tech panel last week, so I got to think about this subject for a while, both from my own situation (a foreign woman in tech who does not speak the local language) and from a wider point of view. Having women (and any other population) in universities is a necessary, but not sufficient, condition for having diversity in tech.

Invitation to Scala
 – this post tries to make Scala less intimidating for Scala beginners. May the force be with him (and with me).

Pyxley
 – Python powered dashboards. I’m always excited about visualization tools. It is built with the Pandas data frame in mind and therefore should be relatively intuitive for data scientists who use Python.

Cloudera Ibis
 – Cloudera reveals the Ibis project, which aims to give Python end-to-end pipelines, especially for data scientists, within the well-known PyLab eco-system (pandas, scikit-learn, scipy, etc).

Clustering check-ins with Spark and Cassandra
 – the title is self-explanatory… Loading check-in data into Cassandra, analyzing it with Spark and visualizing it with Zeppelin. All in all, a reasonable data product pipeline put together beautifully.

 

Database debate

As part of the Berlin Tech Open Air events I went to the “Database debate” at DC Media networks. The two sides of the debate were Simon Willnauer, lead engineer of ElasticSearch and co-founder of Elastic, and Carter Page, technical lead for Google BigTable.

The talk was hosted by a DC Media networks employee and was navigated really well, with nice questions and interactions between Willnauer and Page. However, it was not a “Database debate”, for at least two reasons. ElasticSearch currently does not brand itself as a database and, as admitted by Willnauer, is not mature enough. Therefore it was not a debate: no pros and cons, no one against the other, but rather two solutions for different problems which both somehow relate to the buzzword “big data”.

Some expected questions were asked – use cases, road map and future features, bug fixes, comparison to other solutions. But there were also less expected questions – what would you do differently if rebuilding the product, pitfalls for beginners, and some more technical deep-dive questions asked by the audience.

One of the nice questions asked by the host was – “What is the weirdest usage you have seen of your product?”. Willnauer answered – “playing chess with it using near neighbor to compute the next step”. Following the Call me maybe project, I thought of a “Call me checkmate” project – playing chess using different databases.

Overall, very nice and chill atmosphere. Although I’m not sure why it is called open air..


5 interesting things (12/07/2015)

Code management tools by AWS – at the last re:Invent event (November 2014) Amazon said that this year they were going to focus on new tools for code management and deployment. Now they have revealed those tools –

https://aws.amazon.com/blogs/aws/code-management-and-deployment/

Mail received from the closest Oak tree – I find this blog post and the whole process charming. The city of Melbourne created a technology interface to get the citizens more involved in city life, and exciting things happened. I find those interactions between everyday life and the public sphere to be one of the most fascinating challenges of the coming years – making the public sphere more accessible, smart and open.

http://www.citylab.com/tech/2015/07/when-you-give-a-tree-an-email-address/398219/

Toyplot – another Python plotting library. Seems to work natively with Numpy. Still quite young – the number of different possible charts is limited, but I’m sure it will become more mature in the near future. Besides nice, interactive charts which I’m able to configure to my needs (axes, legend, colors, scale, exporting \ embedding visualizations, etc), what I look for in a good plotting package is that it covers all the different types of charts I need. I don’t want to start juggling between several packages, each for a different type. At least in this area there is always room to grow – heat maps, geo-spatial maps, 3d, etc.

http://toyplot.readthedocs.org/

Python design patterns – I’m Tom and I’m lazy, I admit it. I think I said it before, but IMHO a good software developer is a lazy one. One who automates what she can, uses existing tools and packages when available and reuses her own code. This is the main task of design patterns – to solve common problems and provide best practices, and also to create a common language so different developers and stakeholders can communicate. This github repository collects design pattern implementations in Python.

https://github.com/faif/python-patterns

Mining twitter data with Python – a seven-post series by Marco Bonzanini. It goes through the entire process, starting with getting a Twitter access token and ending with data visualization using d3.js and sentiment analysis.

http://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/

5 interesting things (10/07/2015)

Three Useful Python Libraries for Startups – tl;dr this post suggests Whitenoise, Phonenumbers and Pdfkit as 3 important packages for startups, with requests and Python-dateutil as runners-up. IMHO, those are very strange choices. I agree with choosing requests, as a package that simplifies http requests and is important to infrastructure. I expected the other packages to relate to infrastructure as well. Libraries I would think of are django and boto, and possibly also numpy and \ or pandas for very common statistics and analysis use cases.

http://blog.instavest.com/three-useful-python-libraries-for-startups

Trending @ Instagram – I used to work on a very similar problem to this one and faced almost the same challenges – ranking and scoring, grouping. It is always interesting to see how different people approach a problem which I know intimately.

http://instagram-engineering.tumblr.com/post/122961624217/trending-at-instagram

Git from the inside out – version control is a very important tool in everyday life, so it is nice to look into one possible implementation of it.

https://codewords.recurse.com/issues/two/git-from-the-inside-out

Document clustering with Python – a simple, clear how-to guide which explains the theory lightly, examines several clustering algorithms and sums up with visualizations.

http://nbviewer.ipython.org/github/brandomr/document_cluster/blob/master/cluster_analysis_web.ipynb

Deploying python packages @ Nylas – I love posts like this, which explain the real-life problem the authors faced, suggest several solutions \ possible alternatives with their pros and cons, and show what they eventually chose and why. Especially as I believe every Python developer runs into those problems at least once (a day :))

https://nylas.com/blog/packaging-deploying-python

5 interesting things (26/06/2015)

Spark Summit 2015 Highlights in Tweets – Spark Summit was held in San Francisco on 15-17 June. The Euro-Spark summit will be held on 27-29 October, and its call for papers is open. This presentation is a brief summary of the highlights of the different talks. I believe that most of the talks and presentations will be available online soon, if they are not already.

How to implement neural network – somehow, neural networks are not fully supported by scikit-learn (this will probably happen in the future), and there are several alternatives such as PyBrain (not sure if it is maintained), PyLearn2, theano and so on. This tutorial is Neural Network DIY. It is both a technical Python guide and a step-by-step neural network reminder.

http://peterroelants.github.io/posts/neural_network_implementation_part01/

Designing large scale python application – a talk by M.-A. Lemburg, Python core developer, at the PyWaw Summit. He also gave a keynote, “Past, present and future of Python”, which I intend to listen to when I have some time.

http://www.egenix.com/library/presentations/PyWaw-Summit-2015-Designing-Large-Scale-Applications-in-Python/

The artpiece of command line – I’m a geek, I admit it, and I like working from the command line (we love you Tom), both for small data science things (sum, average, uniq, etc.) and for checking up on what is going on in my system (atop, ps, etc.). I feel I should keep this repository no more than a click away.

https://github.com/jlevy/the-art-of-command-line

Mattermost – mattermost is an open source alternative to Slack. In one of the organizations I worked at, introducing Slack changed the entire communication in the organization. People were sharing more of their ideas, thoughts and results, and got involved in additional projects. Still, some things made me feel uncomfortable, like people preferring Slack over documentation, the use of external servers, and how hard it is to search \ back up. I therefore find it very exciting to have an open source alternative that will improve both products.

http://www.mattermost.org/

5 interesting things (19/06/2015)

Interviewing as an art – Finding smart, talented people with the right vibe for your company is always hard. Being on the other side is also hard. Not only in the sense of being good enough but also in “interviewing” the work place – is the job interesting \ challenging? are the colleagues nice and talented, will you like to work with them? etc. There are many posts and theories about what you should ask in order to find the right candidate, and fewer posts and discussions about how to approach the candidate, make him or her comfortable and get the maximum out of a stressful interview. I find this post especially sensitive and inviting.

http://www.zdfs.com/code/2015/on-interviewing-software-engineers

Airbnb airflow – a framework by Airbnb which helps create, schedule and monitor data pipelines.
https://github.com/airbnb/airflow

Seaborn cheat-sheet – the statistical data visualization package completing \ improving the PyData eco-system. A simple tutorial that can also be used as a cheat-sheet.
https://beta.oreilly.com/learning/data-visualization-with-seaborn

Data Science Ipython notebooks – a collection of links to Ipython notebooks which relate to data science. Includes notebooks from Kaggle, AWS, PyCon, etc. One could also add a link to some seaborn notebook..
https://github.com/donnemartin/data-science-ipython-notebooks

Pitfalls of A\B testing – tl;dr – statistics is tricky and you should remember the underlying assumptions. Also a good chance to recall A\A testing as a sanity check.
http://blog.dato.com/how-to-evaluate-machine-learning-models-the-pitfalls-of-ab-testing
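To make the A\A sanity check concrete, here is a minimal sketch of my own, not taken from the linked post. It assumes simulated users, a pooled two-proportion z-test and made-up numbers (10% conversion rate, 500 users per group); the function names are hypothetical. Both groups get exactly the same treatment, so roughly alpha of the runs should still come out as “significant”.

```python
import math
import random


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (successes_a / n_a - successes_b / n_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(runs=1000, users_per_group=500, rate=0.1,
                           alpha=0.05, seed=0):
    """Share of A/A runs that look 'significant' although nothing changed."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        # Both groups are drawn from the same (assumed) conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(users_per_group))
        conv_b = sum(rng.random() < rate for _ in range(users_per_group))
        p = two_proportion_p_value(conv_a, users_per_group,
                                   conv_b, users_per_group)
        if p < alpha:
            false_positives += 1
    return false_positives / runs


fp_rate = aa_false_positive_rate()
print(f"A/A false positive rate: {fp_rate:.3f}")
```

If the rate printed here were far from alpha, I would suspect the splitting or the test itself before trusting any A\B result.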

5 interesting things (04/06/2015)

Pitfalls when mining Wikipedia – this research was done on somewhat old data (2013), but it shows a quite easy way to clean Wikipedia dumps. Although they suggest a way to repeat the process they did, I suspect that the structure of some of the dumps has changed a bit.

An interesting question for me in this scope is the relation between the content \ views of the same entities across different languages. This can be calculated by joining page count results with Wikidata dumps once they are complete. Having crunched the Wikidata dumps a bit, at the moment they do not feel mature enough and do not include data about all the articles in Wikipedia.

https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/
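The cross-language join I have in mind could look roughly like the sketch below. Everything in it is an assumption for illustration: the sample numbers, the entity ids (“E1”, “E2”) and the dictionary layout are made up and do not reflect the real dump formats.

```python
from collections import defaultdict

# Hypothetical sample data, not taken from real dumps.
# Page counts keyed by (language, article title).
page_views = {
    ("en", "Berlin"): 1200,
    ("de", "Berlin"): 900,
    ("en", "Cat"): 500,
    ("de", "Katze"): 300,
}

# Wikidata-style site links: entity id -> {language: article title}.
sitelinks = {
    "E1": {"en": "Berlin", "de": "Berlin", "fr": "Berlin"},
    "E2": {"en": "Cat", "de": "Katze"},
}


def views_per_entity(page_views, sitelinks):
    """Join page counts to entities, giving views per language per entity."""
    result = defaultdict(dict)
    for entity, titles in sitelinks.items():
        for lang, title in titles.items():
            views = page_views.get((lang, title))
            if views is not None:  # skip languages with no page count record
                result[entity][lang] = views
    return dict(result)


for entity, views in views_per_entity(page_views, sitelinks).items():
    print(entity, views)
```

Articles missing from the entity mapping simply drop out of the join, which is exactly where an incomplete dump would hurt.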

Weather prediction with Amazon Machine Learning – nice, simple way to start playing with Amazon machine learning –

http://arnesund.com/2015/05/31/using-amazon-machine-learning-to-predict-the-weather/

How to evaluate machine learning models – a series of 5 posts by dato  –

http://blog.dato.com/how-to-evaluate-machine-learning-models-part-1-orientation

Mean shift clustering – a clustering approach with a few advantages over common methods. E.g. you don’t have to define the number of clusters in advance, but you do have to tune the bandwidth. On the other hand, it is slower than other algorithms.

http://spin.atomicobject.com/2015/05/26/mean-shift-clustering
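A minimal sketch of the idea, assuming a flat kernel (every neighbour within the bandwidth counts equally) and plain Python tuples as points. This is my own toy version, not the code from the linked post, and the merge threshold of half the bandwidth is an arbitrary choice.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift: move every point to the mean of its neighbours."""
    modes = [tuple(p) for p in points]
    for _ in range(max_iter):
        moved = 0.0
        new_modes = []
        for m in modes:
            # Neighbours are taken from the original data, within `bandwidth`.
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            centre = tuple(sum(c) / len(near) for c in zip(*near))
            moved = max(moved, math.dist(m, centre))
            new_modes.append(centre)
        modes = new_modes
        if moved < tol:
            break
    # Points whose modes ended up close together share a cluster label.
    centres, labels = [], []
    for m in modes:
        for i, c in enumerate(centres):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return centres, labels


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centres, labels = mean_shift(points, bandwidth=1.0)
print(len(centres), labels)
```

The number of clusters is never passed in, it falls out of the bandwidth: rerunning the same data with a much larger bandwidth merges everything into one cluster. Every iteration compares all modes with all points, which is where the slowness comes from.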

The future of spark – post following Strata + Hadoop conference in London.
https://www.linkedin.com/pulse/future-apache-spark-rodrigo-rivera

5 interesting things (31/05/2015)

Vis.js demo – (Yet another) Javascript visualization library. I love the relative graph approach, I find it easier for Python developers than d3.js (which I also like)
Prediction.io – a machine learning server for developers and data scientists. Built on Apache Spark, HBase and Spray. Haven’t used it but it seems very interesting
Django performance – 4 little tricks – I’m familiar and experienced with Django but have not dived deep into it. Possibly those are very basic tricks, but for me they were good to keep in mind
Deep dive into elasticsearch storage – as I currently work a lot with elasticsearch (should say elastic..), I find this post very interesting
Amazon machine learning – the new buzz in the neighborhood

6 interesting things (20/2/2015)

Making an exception but I really came across some interesting things –

A\A Testing – a well-known idea in machine learning (train, cross-validation, test) and in other research best practices, now used as a take-off on the buzzword “A\B testing”. Still, it is well explained and can be eye-opening at the right moment
Intro to into – Easier conversion between somewhat complex data types in Python
Reading this I was also exposed
Getting started with Spark in Python – very very clear tutorial about all the required steps to get started. I cannot wait to find a good enough excuse to work with spark.
Fuzzy – Fast Python phonetic algorithms. Nothing new or too fancy, just came across it this week and found it useful.
Typo Distance – finds typo distance between two strings. It uses qwerty layout but you can configure different layout pretty easily. The algorithm is quite heavy and time consuming, there is some room for improvement (although it is not actively maintained). For example – adding a max parameter which stops the computation once the typo distance is higher than the allowed distance.
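The max parameter suggested for Typo Distance can be sketched on a simpler relative. The code below is a plain Levenshtein distance, not the keyboard-aware typo distance of that package, and the function name and signature are my own assumptions. It shows only the early-exit idea: stop as soon as the distance can no longer stay within the allowed maximum.

```python
def bounded_edit_distance(a, b, max_distance):
    """Levenshtein distance, or None as soon as it must exceed max_distance."""
    if abs(len(a) - len(b)) > max_distance:
        return None  # the length gap alone already costs too much
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        # The final distance is at least the smallest value in this row.
        if min(current) > max_distance:
            return None
        previous = current
    return previous[-1] if previous[-1] <= max_distance else None


print(bounded_edit_distance("kitten", "sitting", 3))
print(bounded_edit_distance("kitten", "sitting", 2))
```

For long strings that are clearly far apart, most of the table is never filled in, which is the saving I was hoping for.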

Topy – a Python script to fix typos in text, using rule sets developed by the RegExTypoFix project from Wikipedia. The basic rule set is an English one, but other rule sets are also available. Having tried it, I’m positive about it, but it is not baked \ mature enough, and I would like it to be easier to use from code rather than only as a command line tool.

https://github.com/intgr/topy

5 interesting things (03/02/2015)

spaCy – yet another Python library for text processing? Maybe. I haven’t tried it, but it seems like another tool taking part in this growing world. I’m looking for a package that will do a good job on short texts such as tweets and so on.
Moto – not what you thought.. not short for automotive but mock + boto. “A library that allows your python tests to easily mock out the boto library”. Looks very neat but still has some way to go.
Nothing like CLI – examples showing that for certain use cases command line tools can be much faster than a hadoop cluster.
Self hell – elegance in code although I’m not sure how efficient it is.
NLP made visual with Neo4j – the concept is cool – creating an interesting visualization with a tool other than the trivial \ designated one for it. I doubt it scales well.