NLP is a broad term covering many types of questions and challenges, such as language detection, part-of-speech tagging, relation extraction, named entity recognition, OCR, speech recognition, sentiment extraction, and many more.
There are, of course, several Python libraries that try to tackle some of these problems. This post aims to provide a short overview of those packages.
NLTK is probably the oldest and best-known package in this area. Started in 2001 in the University of Pennsylvania's computer science department, the Natural Language Toolkit aims to support scientific research. The last stable release came out at the beginning of April, so the project is alive and under constant development.
The NLTK package includes a wide variety of modules covering text tokenization, POS tagging, text classification, sentiment analysis, and more.
The downside of this package is that it is often ungainly, heavy, and complicated. It is pitched more at the academic level than the industry level. If you have lots of text to analyze, especially if it is complicated, expect it to run for ages.
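As a minimal sketch of NLTK's tokenization, here is the Treebank tokenizer, which works out of the box without downloading extra corpora (fuller pipelines such as word_tokenize and pos_tag additionally require an nltk.download step):

```python
from nltk.tokenize import TreebankWordTokenizer

# Penn Treebank-style tokenization: splits off punctuation and contractions
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("NLTK isn't light, but it's versatile.")
print(tokens)
```

Note how contractions like "isn't" are split into "is" and "n't", following the Penn Treebank conventions.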
spaCy is possibly NLTK's strongest competitor, with the goal of delivering production-level code.
Its main capabilities include tokenization, tagging, parsing, entity recognition, and pattern matching. For now it supports only English and German.
It is faster and better suited to industrial needs, but the community is still small compared to NLTK's, and the feature set also lags behind.
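A minimal sketch of spaCy's pipeline API; a blank English pipeline is used here so the example runs without downloading a statistical model (tagging and entity recognition would additionally require one, e.g. loading en_core_web_sm via spacy.load):

```python
import spacy

# A blank pipeline provides tokenization only; spacy.load("en_core_web_sm")
# would add the tagger, parser, and entity recognizer on top of it.
nlp = spacy.blank("en")
doc = nlp("spaCy aims at production-level speed.")
tokens = [token.text for token in doc]
print(tokens)
```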
Gensim's tagline is "topic modelling for humans".
It implements top-notch algorithms focused on topic modeling, document ranking, and significant-term identification, e.g. tf-idf, word2vec, doc2vec, latent Dirichlet allocation (LDA), and latent semantic analysis (LSA). Some of the algorithms can be run in a distributed manner.
It uses NumPy and SciPy for efficient processing.
While NLTK holds to a "one package to rule them all" philosophy, ldig does one thing: language detection for short texts. Its n-gram distributions are built from Twitter data, so it is meant for relatively short texts of at least three words rather than long documents.
The authors report 99.1% accuracy over 17 languages. In my experience, ldig's accuracy was a bit lower (around 80%), but still relatively good, especially for Latin-script languages.
Scikit-learn is one of the biggest machine learning packages in Python. Since NLP is one application of machine learning, it also features modules specifically for dealing with text. So once you know the mathematical background of the algorithm you want to use, you can reach for scikit-learn's implementation.
Text feature extraction – http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
LDA – http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
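A minimal sketch combining the two links above: bag-of-words feature extraction feeding scikit-learn's LDA implementation (the toy corpus and the choice of n_components=2 are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks rose as markets rallied",
        "investors sold stocks today"]

# Turn raw text into a document-term count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit a 2-topic LDA model on the counts; each row of the result is a
# document's distribution over topics (rows sum to 1)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)
print(doc_topics.shape)
```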