6 interesting things (20/2/2015)

Making an exception this time, because I really came across some interesting things –

A/A Testing – a well-known idea in machine learning (train, cross-validation, test) and in other research best practices, now presented as a take on the buzzword “A/B testing”. Still, it is well explained and can be eye-opening at the right moment.
Intro to into – easier conversion between somewhat complex data types in Python.
Reading this I was also exposed
Getting started with Spark in Python – a very clear tutorial covering all the steps required to get started. I cannot wait to find a good enough excuse to work with Spark.
Fuzzy – Fast Python phonetic algorithms. Nothing new or too fancy, just came across it this week and found it useful.
Typo Distance – finds the typo distance between two strings. It uses the QWERTY layout, but you can configure a different layout pretty easily. The algorithm is quite heavy and time-consuming, so there is some room for improvement (although the project is not actively maintained). One example would be adding a max parameter that stops the computation once the typo distance exceeds the allowed distance.
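To make the max parameter idea for Typo Distance concrete, here is a minimal sketch of a keyboard-aware edit distance that gives up early. It is not the Typo Distance package's code or API: the function names, the `max_distance` parameter, the simplified QWERTY grid and the 0.5 cost for neighbouring keys are all my own assumptions for illustration. The cutoff works because costs are never negative, so once every cell in a row of the table is above the limit, the final distance must be above it too.

```python
# Hypothetical sketch: names, weights and layout are assumptions,
# not the Typo Distance package's API or cost model.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each key to its (row, column) position on a simplified, unstaggered grid.
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}


def substitution_cost(a, b):
    """0 for identical characters, 0.5 for neighbouring keys, 1 otherwise."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5
    return 1.0


def typo_distance(s, t, max_distance=None):
    """Weighted edit distance; returns None if it exceeds max_distance."""
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [float(i)]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,                          # delete a
                current[j - 1] + 1.0,                       # insert b
                previous[j - 1] + substitution_cost(a, b),  # substitute a with b
            ))
        # Early exit: no later cell can be cheaper than the cheapest cell in this row.
        if max_distance is not None and min(current) > max_distance:
            return None
        previous = current
    result = previous[-1]
    if max_distance is not None and result > max_distance:
        return None
    return result
```

With this sketch, `typo_distance("cat", "cst")` is cheaper than `typo_distance("cat", "cpt")` because a and s are neighbours, and `typo_distance("hello", "world", max_distance=1)` returns None after the second row instead of filling the whole table.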

Topy – a Python script to fix typos in text, using rule sets developed by the RegExTypoFix project from Wikipedia. The basic rule set is English, but other rule sets are also available. Having tried it, I’m positive about it, but it is not mature enough yet, and I would like it to be easier to use from code rather than only as a command-line tool.

https://github.com/intgr/topy

5 interesting things (03/02/2015)

spaCy – yet another Python library for text processing? Maybe. I haven’t tried it, but it seems like another tool joining this growing field. I’m looking for a package that does a good job on short texts such as tweets.
Moto – not what you thought: not short for automotive, but mocking + boto. “A library that allows your python tests to easily mock out the boto library”. Looks very neat but still has some way to go.
Nothing like CLI – examples showing that for certain use cases command-line tools can be much faster than a Hadoop cluster.
Self hell – elegance in code, although I’m not sure how efficient it is.
NLP made visual with Neo4j – the concept is cool: creating an interesting visualization with something other than the obvious / designated tool for it. I doubt it scales well.