Detecting Data Errors: Where are we and what needs to be done?

My summary and notes for "Detecting Data Errors: Where are we and what needs to be done?" by Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker and Nan Tang. Proceedings of the VLDB Endowment 9.12 (2016): 993-1004.

The paper can be found here.

In this paper the researchers evaluate several data cleaning tools for detecting different types of data errors and suggest a strategy for holistically running multiple tools to optimize the detection effort. The study focuses on automatically detecting errors, not repairing them, since automatic repair is rarely allowed.

Current status – current data cleaning solutions usually belong to one or more of the following categories:

  • Rule-based detection algorithms – the user specifies a set of rules (e.g. not null, functional dependencies, user-defined functions) that the data must obey, and the data cleaner finds any violation. Example: NADEEF.
  • Pattern enforcement and transformation tools – tools in this category discover syntactic or semantic patterns in the data and detect values that violate them. Examples: OpenRefine, Data Wrangler, DataXFormer, Trifacta, Katara.
  • Quantitative error detection algorithms – find outliers and glitches in the data.
  • Record linkage and de-duplication algorithms – identify records which refer to the same entity but are inconsistent or appear multiple times. Examples: Data Tamer, TAMR.

Evaluation of tools

  1. Precision and recall of each tool
  2. Errors detected when applying all the tools together
  3. How many false positives are detected, since we would like to minimize the human effort.

Error types

  • Outliers include data values that deviate from the distribution of values in a column of a table.
  • Duplicates are distinct records that refer to the same real-world entity. If attribute values do not match, this could signify an error.
  • Rule violations refer to values that violate any kind of integrity constraints, such as Not Null constraints and Uniqueness constraints.
  • Pattern violations refer to values that violate syntactic and semantic constraints, such as alignment, formatting, misspellings and semantic data types.

(p. 995)

Some errors overlap and fit into more than one category.

[TR] – are these all the error types that exist? What about correlated errors between several records?

Data sets

[TR] – there are many dataset-specific details in the paper. As I am more interested in the ideas, those details are omitted here.

[TR] – the paper evaluated relatively small datasets with a small number of columns. It would also be interesting to evaluate on bigger and more complex datasets, e.g. Wikidata.

[TR] – consider the temporal dimension of the data, i.e. some properties may have an expiration date while others do not (e.g. birth place never changes while current location does).

Data cleaning tools

Coverage of error types by the evaluated tools (DBoost, DC-Clean, OpenRefine, Trifacta, Pentaho, KNIME, Katara, TAMR):
  • Pattern violations – OpenRefine, Trifacta, Pentaho, KNIME, Katara
  • Constraint violations – DC-Clean
  • Outliers – DBoost
  • Duplicates – TAMR
  • DBoost – uses three common methods for outlier detection – histograms, Gaussian and multivariate Gaussian mixtures. The unique value proposition of this tool is decomposing types into their building blocks, for example expanding dates into day, month and year. DBoost requires configuration, such as the number of bins and their width for histograms, and mean and standard deviation for Gaussian and GMM (a minimal sketch of this style of outlier detection appears after this list).
  • DC-Clean – focuses on denial constraints, which subsume the majority of the commonly used constraint languages. The collection of denial constraints was designed separately for each data set.
  • OpenRefine – can digest data in multiple formats. Data exploration is performed through faceting and filtering operations ([TR] – reminds me of DBoost histograms).
  • Trifacta – a commercial product which evolved from Data Wrangler. It can predict and apply syntactic data transformations for data preparation and data cleaning. Transformations can also involve business logic.
  • Katara – uses external knowledge bases, e.g. Yago, in order to detect errors that violate a semantic pattern. It does this by first identifying the type of each column and the relations between pairs of columns in the data set using the knowledge base.

         [TR] – assumes that the knowledge base is ground truth. We need to doubt this as well.

  • Pentaho – provides a graphical interface for data wrangling and can orchestrate ETL processes.
  • KNIME – focuses on workflow authoring and encapsulating data processing tasks with some machine learning capabilities.
  • TAMR – uses machine learning models to learn duplicate features through expert sourcing and similarity metrics.
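
To make the DBoost approach concrete, here is a minimal sketch of Gaussian-based outlier flagging after decomposing a date column into its building blocks. This is not DBoost's actual API; the rows, the column and the threshold k are illustrative assumptions.

```python
from datetime import datetime
import statistics

# Hypothetical example rows: (record id, signup date as a string).
rows = [("r1", "2016-01-05"), ("r2", "2016-01-07"), ("r3", "2016-01-06"),
        ("r4", "2016-06-30"), ("r5", "2016-01-08")]

# Decompose the date type into its building blocks (day, month, year), as DBoost does.
expanded = [(rid, datetime.strptime(d, "%Y-%m-%d")) for rid, d in rows]
months = [dt.month for _, dt in expanded]

# Simple univariate Gaussian test: flag values more than k standard deviations
# away from the column mean (k plays the role of a user-supplied configuration knob).
k = 1.5
mean, std = statistics.mean(months), statistics.pstdev(months)
outliers = [rid for rid, dt in expanded
            if std > 0 and abs(dt.month - mean) > k * std]
print(outliers)  # ['r4'] – the June date stands out against the January cluster
```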

Combination of Multiple tools

    • Union all and Min-K –
      • Union all – takes the union of the errors emitted by all tools (i.e. k = 1).
      • Min-k – errors detected by at least k tools (a small sketch of min-k and the precision threshold below appears after this list).
    • Ordering based on Precision
      • Cost model –
        • C – cost of having a human check a detected error
        • V – value of identifying a real error (V > C, otherwise running the tool makes no sense).
        • P – Number of true positives
        • N – Number of false positives

        The total value should satisfy P * V > (P + N) * C, i.e. P/(P+N) > C/V, and P/(P+N) is the precision. Therefore, if a tool's precision is less than C/V we should not run it. The precision can be estimated by sampling the detected errors.
        “We observed that some tools are not worth evaluating even if their precision is higher than the threshold, since the errors they detect may be covered by other tools with higher estimate precision (which would have been run earlier).” (p. 998)
        [TR] – the cost and value cannot always be estimated correctly and easily, and not all errors have the same cost and value.

      • Maximum entropy-based order selection – the algorithm estimates the overlap between the tools' results and, at each step, picks the tool with the highest estimated precision to reduce the entropy. Algorithm steps:
        1. Run individual tools – run each tool and get the detected errors.
        2. Estimate the precision of each tool by checking samples.
        3. Pick the tool which maximizes the entropy among the unused tools so far – i.e. the one with the highest estimated precision on the sample – and verify its detected errors on the complete data that have not been verified before.
        4. Update – update the errors that were detected by the chosen tool in the last step and repeat the last two steps.
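
Below is a minimal sketch of the union-all / min-k combination and the C/V precision check described above. The per-tool detections and the precision estimates are made-up placeholders, not results from the paper.

```python
from collections import Counter

# Hypothetical detections: each tool returns a set of (row, column) cells it flags.
detections = {
    "dboost":   {(1, "age"), (7, "zip")},
    "dc_clean": {(1, "age"), (3, "state")},
    "katara":   {(3, "state"), (9, "city")},
}

def min_k(detections, k):
    """Cells flagged by at least k tools; k = 1 is 'union all'."""
    counts = Counter(cell for cells in detections.values() for cell in cells)
    return {cell for cell, c in counts.items() if c >= k}

print(min_k(detections, 1))  # union all
print(min_k(detections, 2))  # only cells two or more tools agree on

# Precision-based ordering: only hand a tool's output to a human if its
# estimated precision exceeds C/V, highest precision first.
C, V = 1.0, 5.0  # cost of checking a detected error vs. value of a true error
estimated_precision = {"dboost": 0.6, "dc_clean": 0.45, "katara": 0.15}
worth_running = sorted(
    (t for t, p in estimated_precision.items() if p > C / V),
    key=lambda t: estimated_precision[t], reverse=True)
print(worth_running)  # katara (0.15 < 0.2) is skipped
```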

    Discussion and Improvements

    • Using domain-specific tools – for example AddressCleaner. [TR] – only relevant when such a tool exists and is easy / cheap to use.
    • Enrichment – rule-based systems and duplicate detection strategies can benefit from additional data. Future work should consider a data enrichment system.

    Conclusions

    1. No clear winner – different tools worked well on different data sets, mainly due to different error-type distributions. Therefore a holistic strategy must be used.
    2. Since some errors overlap, one can order the tools to minimize false positives. However, the ordering strategy is data-set specific.
    3. Yet, not 100% of the errors are detected. Suggested ways to improve this are –
      1. Type-specific cleaning – for example date cleaning tools, address cleaning tools, etc. Even these tools are limited in their recall.
      2. Enrichment of the data from external sources

    Future work

    1. A holistic combination of tools – algorithms for combining tools. [TR] – reminds me of ensemble methods from machine learning.
    2. Data enrichment system – adding relevant knowledge and context to the data set.
    3. Interactive dashboard
    4. Reasoning on real-world data

5 interesting things (28/08/2016)

Joel test for Data Science – the inspiration and the adjustment to data science done at Domino; both were interesting reads for me.

What I Wish I Knew About Data For Startups – I especially related to documenting and testing event tracking. I was surprised by how little this topic and its best practices are discussed.

A Paper a Week Keeps the Doctor Away? A Git repository by Shagun Sodhani, who summarizes and sometimes comments on those papers.

Inside Elasticsearch – a three-part blog series about how Elasticsearch works.

Is it brunch time? Someone is finally asking the important questions in life… defining brunch time based on the timing of tweets. I liked both the question and the method used to define the lunch interval, which requires no more than high-school math. From an anthropological point of view it would be interesting to see whether different locations in the world place breakfast / brunch / lunch at different times, e.g. in Spain people eat lunch at 14:00 while in Israel they eat at 12:00. It might also be interesting to compare tweets with pictures to tweets without pictures. My assumption is that tweets without pictures come from people on their way to brunch and would be a bit earlier compared to tweets with pictures, which are posted during the brunch itself.

Bonus track – Prediction.io is now an Apache project (incubating). After a few changes in the last few months – being bought by Salesforce, a fork by ActionML, etc. – Salesforce donated Prediction.io to the ASF. Looking forward to the future of this project.

5 Python NLP packages

NLP is a broad term which covers many types of questions and challenges, such as language detection, part-of-speech tagging, relation extraction, named entity recognition, OCR, speech recognition, sentiment extraction and many more.
There are, of course, several Python libraries which try to tackle some of those problems. This post aims to provide a short overview of those packages.

NLTK
Probably the oldest and best-known package in this area. Started in 2001 at the University of Pennsylvania's computer science department, the Natural Language Toolkit aims to support scientific research. The last stable release came out at the beginning of April, so the project is alive and constantly developing.
The NLTK package includes a wide variety of modules, including text tokenization, POS tagging, text classification, sentiment analysis, etc.
The downside of this package is that it is often ungainly, heavy and complicated. It is more academic-grade than industry-grade. If you have lots of text to analyze, especially if it is complicated, expect it to run for ages.
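
As a quick illustration of the tokenization and POS-tagging modules mentioned above, here is a minimal NLTK sketch (resource names may vary slightly between NLTK versions):

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "NLTK started as an academic project and is still widely used."
tokens = nltk.word_tokenize(text)  # text tokenization
tagged = nltk.pos_tag(tokens)      # part-of-speech tagging
print(tagged[:4])                  # e.g. [('NLTK', 'NNP'), ...]
```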

Spacy.io
Possibly NLTK's strongest competitor, with the goal of providing production-level code.
Main capabilities include tokenizing, tagging, parsing, entity recognition and pattern matching. For now it only supports English and German.
It is faster and better suited to industrial needs, but the community is still small compared to NLTK's and the feature set is also behind.
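
A short spaCy sketch of the tokenizing, tagging and entity-recognition capabilities listed above, assuming the small English model is installed (older releases loaded it simply as 'en'):

```python
import spacy

# python -m spacy download en_core_web_sm  (model install, done once)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin hosted PyData in May 2016.")

print([(token.text, token.pos_) for token in doc])   # tokens with POS tags
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities
```
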
gensim
"Topic modelling for humans"
Implements top-notch algorithms focusing on topic modeling, document ranking and significant-term identification, e.g. tf-idf, word2vec, doc2vec, latent Dirichlet allocation (LDA), latent semantic analysis (LSA). Some of the algorithms can be run in a distributed manner.
Uses NumPy and SciPy for efficient processing.
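
A tiny gensim word2vec sketch for the algorithms mentioned above; the toy corpus is made up, and older gensim versions use size/iter instead of vector_size/epochs:

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens.
sentences = [
    ["data", "cleaning", "tools", "detect", "errors"],
    ["topic", "modeling", "for", "humans"],
    ["word2vec", "learns", "word", "vectors", "from", "text"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("word2vec", topn=3))  # nearest words in vector space
```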
 

LDIG
While NLTK holds the philosophy of "one package to rule them all", LDIG does one thing – language detection for short texts: its n-gram distributions are based on Twitter, and it is meant to analyze relatively short texts of at least 3 words.
They report 99.1% accuracy over 17 languages. From my experience the accuracy with ldig was a bit lower (around 80%), but still relatively good, especially for Latin-script languages.

scikit-learn
Scikit-learn is one of the biggest machine learning packages in Python. As NLP is one application of machine learning, it also features specific modules for dealing with text. So once you know the mathematical background of the algorithm you want to use, you can use the scikit-learn implementation.
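
A minimal scikit-learn text-classification sketch using its text-specific modules; the tiny dataset and labels are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and acting",
         "what a wonderful film", "boring and way too long"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Vectorize the raw text with tf-idf and feed it into a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["loved the acting"]))
```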

PyData Berlin 2016 #pydatabln

I got a diversity scholarship from NumFOCUS to attend the PyData Berlin event. NumFOCUS is an NGO which supports open source data science projects, among them Jupyter, matplotlib, NumPy, pandas, etc.

This post is not a summary of the event or of the talks that I attended, but rather hints at a subset of the talks.

Keynote – Olivier Grisel (Inria)
Grisel talked about the evolution of predictive modeling and about scaling it. Why do we need to scale predictive modeling?
  1. I/O-intensive operations – e.g. feature engineering and model serving.
  2. CPU-intensive operations – e.g. hyper-parameter search and cross-validation.
PySpark seems like a very legit tool, but it has its drawbacks –
  • No pure Python local mode – impossible to use a profiler or ipdb.
  • There is latency induced by the network architecture (Python driver -> Scala (JVM) -> Python worker).
  • Tracebacks are sometimes hard to understand, as there is a mix of Scala and Python errors and log data.
Grisel suggests using Dask and Distributed instead, as native Python packages for parallel and distributed computing.
Being a young project, Dask also has its limitations – mainly no distributed shuffles, which means it does not support distributed merge, join, groupby and aggregation operations at the moment.
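
To give a feel for the native-Python parallelism Grisel advocated, here is a small dask.delayed sketch; the chunking and the score function are hypothetical, and, as noted above, distributed shuffles were not supported at the time:

```python
from dask import delayed

def score(chunk):
    # Placeholder for a CPU-intensive step, e.g. evaluating one hyper-parameter set.
    return sum(x * x for x in chunk)

chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]

# Build a lazy task graph and run it on a local thread/process pool.
partials = [delayed(score)(c) for c in chunks]
total = delayed(sum)(partials)
print(total.compute())
```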

Frontera: open source, large scale web crawling framework (Alexander Sibiryakov, ScrapingHub)
When I read the talk abstract I was not sure what the difference between Frontera and Scrapy was. Hearing the talk, Frontera's goal is actually different – scheduling crawling and the crawling strategy. In their architecture they use Scrapy for crawling, but other crawlers can be used as well.
After introducing Frontera, Sibiryakov showed their results from crawling the Spanish internet and the problems they faced. Some of the solutions were quite trivial – indexing the data differently to avoid hotspots, caching, limiting the crawling depth / number of pages per host, etc. I cannot say that I was fully convinced.

Setting up predictive analytics services with Palladium (Andreas Lattner, Otto group)
The evolution of a company tool – 80% development time and 20% deployment overhead, which tends to repeat across different projects. So why not build a framework and try to automate it? This is exactly why the Otto group developed Palladium – to ease the development, deployment and integration of predictive analytics services.
The main limitation of Palladium is that it does not support distributed computation. Another issue is controlling the results in real time – boosting, filtering, etc. At the current time, prediction.io (not written in Python) is in a more mature state.

Spotting trends and tailoring recommendations: PySpark on Big Data in fashion (Martina Pugliese, Mallzee)
Mallzee, as stated by Pugliese, is "the Tinder of fashion", i.e. the application shows you a stream of fashion items from multiple providers and you swipe them left and right according to your preferences. Their current big challenge is to produce valuable recommendations for each customer.
Their input, on one hand, is the items and the item information, which they crawl from the providers' websites and normalize to fit their set of tags and properties – and of course the brand itself. The second input is user behavior – swipes, buy actions, favourite and non-favourite brands, etc. Their current choice is building a random forest for each user based on their actions. This approach can work for users with a lot of signals, specifically positive signals (buys and positive swipes). Possibly the next step will be to cluster the users in order to learn more about users with few positive signals.

Practical Word2Vec in Gensim (Lev Konstantinovskiy, Gensim)

Gensim is "topic modeling for humans". It implements many interesting algorithms, including LDA, word2vec and doc2vec – lots of potential and interesting to play with.
But this tutorial was a mess: the room didn't fit a tutorial – not enough space and no tables people could put their computers on to really run the code.

Konstantinovskiy wanted to present many options and ideas and ran too fast between the different algorithms without really explaining them. On the other hand there was no real focus on running the code and showing the package API. Unfortunately I feel this talk was a bit mishandled.


Bayesian Optimization and its application to Neural Networks (Moritz Neeb, TU Berlin)
What I would take from this talk is approaching hyper-parameter tuning as an optimization problem. For example, when we use grid search for hyper-parameter tuning we invest a lot of resources, but the experiments are independent of one another and we don't learn from one to the next. Instead we can introduce a method in which the different experiments interact.
Neeb introduced an approach which treats it as a Gaussian process and at each step tries to evaluate the point with the maximal expected gain.
He also mentioned a few Python libraries –
  • Spearmint – designed to automatically run experiments (thus the code name spearmint) in a manner that iteratively adjusts a number of parameters so as to minimize some objective in as few runs as possible.
  • HPOlib – hyperparameter optimization library
  • Hyperopt – a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.
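
As a taste of these libraries, a minimal Hyperopt sketch; note that Hyperopt's default algorithm is TPE rather than the Gaussian-process approach Neeb described, and the objective below is a toy function:

```python
from hyperopt import fmin, tpe, hp

def objective(x):
    # Toy objective standing in for, e.g., the validation loss of a model.
    return (x - 3) ** 2

best = fmin(fn=objective,
            space=hp.uniform("x", -10, 10),  # the search space
            algo=tpe.suggest,                # Tree-structured Parzen Estimator
            max_evals=100)
print(best)  # something close to {'x': 3.0}
```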

Keynote – Wes McKinney (Cloudera)
Wes McKinney is probably Mr. Open Source – he is involved in and leading some of the best-known data science open source projects (in Python but not only), such as pandas and Apache Parquet.
His talk had roughly two parts – the community, code of conduct, etc., and challenges that he believes will play a major role in the data science / Python / big data communities in the near future. Some of the projects / ideas / challenges he mentioned –

What’s new in Deep Learning (Dr Kashif Rasul, Zalando SE)
A survey of several recent papers in deep learning, delivered at speed. Besides one mention of Theano and a short code example, there was no strong connection to Python. I would have compromised on less code and a slower pace.
Some of the papers he pointed at –

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons (Jose Quesada, Data Science Retreat)
Jose Quesada presented their experience with teaching and using Python's scikit-learn versus Scala's Spark. The most interesting points, in my opinion, were the PySpark limitations and future directions, and of course the comparison of existing features.
One of the most important insights was to use Spark DataFrames: spark.ml is built on top of DataFrames and will replace Spark MLlib in the future.
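
A minimal PySpark sketch of the DataFrame-based spark.ml API mentioned above; the toy training data is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()
train = spark.createDataFrame(
    [(0, "spark ml is built on dataframes", 1.0),
     (1, "mllib rdd api will be replaced", 0.0)],
    ["id", "text", "label"])

# spark.ml pipelines chain DataFrame transformations with an estimator.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)
model.transform(train).select("id", "prediction").show()
```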

Data Integration in the World of Microservices (Valentine Gogichashvili, Zalando)
Gogichashvili is the head of Data Engineering at Zalando and has been working there for 5.5 years already, which means he was there during the business, structural and technological changes the organization went through.
While he talked about micro-services architecture, I think there were two other interesting points in the talk –
  • Team structure – every team is responsible for its own components and infrastructure (including naming), while aligning to engineering guidelines such as API structure, data types, etc., which are set by cross-team guilds.
  • Open source code – whenever a team starts a new project, by default it will grow to be open source unless they justify why not. Zalando's open source projects can be viewed here.

Brand recognition in real-life photos using deep learning (Lukasz Czarnecki, Samsung)
Czarnecki presented his project from Data Science Retreat – brand recognition in Instagram photos, where he combined neural networks with an SVM. He showed the baseline he started from (plus some preprocessing) and the steps he took to improve his results.
Some of the steps he did –
  • Increasing the training set from 300 images per brand to 800.
  • Augmenting the training set by cropping parts of the pictures.
  • Looking at the errors and adding training examples to make the NN more general.
  • Increasing the threshold of the SVM.
That’s all for now, until next year 🙂

5 interesting things (16/05/2016)

Similar Wikipedia Pages – This post presents a Chrome extension for similar Wikipedia pages; in other words, a recommendation system for Wikipedia pages. They are not the first to do it – Wikiwand, as well as Wikipedia themselves (in beta), created such a feature. They compare the results a bit, and it would have been interesting to have a deeper dive into it. One obvious difference between them and Wikipedia is that while Wikipedia can analyze user behavior and use collaborative filtering or a hybrid approach, they only have access to the content data. However, they could integrate page-view or edit data to inspect trends and bring more temporal signals into their recommendations.

Cron best practices – while it is very common to set a cron job, it is sometimes not fully understood – which user runs the command, where it runs, where to put the scripts, how to orchestrate cron jobs and so on. A life-hack post.

Resume Dreamer – since the release of TensorFlow as open source, many people experiment with TensorFlow on different tasks and settings and share their work. This post by the untapt team shows their efforts at generating CVs using TensorFlow. While the task is amusing and one could think of it as a draft for their own CV, I missed some technical details about the process. Seeing that the engine recognized the paragraph structure, I wonder how it will deal with date patterns.

Apache Libcloud – a Python library for interacting with many of the popular cloud service providers using a unified API. This library has two major use cases as far as I see it – migration between different providers, and working with multiple providers on a regular basis, which I believe is rare. In both cases this package can decrease the "lock-in" to a specific provider, and that's a good thing.

Graph DataFrames – Databricks exposed GraphFrames, a graph processing library for Apache Spark. The functionality of Apache Spark is extended step by step, and now it is time for graph algorithms. My guess is that the query syntax will change a bit in the future to support user-defined functions and so on, but it's a start.

5 interesting things (20/04/2016)

Spoiler detector – get your annotated data set for free! A nice way to get an annotated data set (though not a perfect one) and create some social good. It would be interesting to expand it to more movies and television shows and to other social networks, and to see how well a model trained on one social network performs on other social networks (for this case and in general).

http://www.insightdatascience.com/blog/fanguard.html

Deep dive to Python virtualenv – although it only addresses Python requirements and not more general system requirements (e.g. this case), it is a very important and common tool to make sure that all your Python requirements are in place. This post provides a deeper look into virtualenv and pyenv, a Python version management tool.

https://realpython.com/blog/python/python-virtual-environments-a-primer/

6 Lesser Known Python Data Analysis Libraries – TL;DR – mrjob, delorean, natsort, tinydb, prettytable and vincent. If I had to write the same blog post, I am not sure those are the packages I would have chosen. mrjob is maintained, but I feel it is a bit outdated and there are better ways nowadays to run multi-step MapReduce jobs (e.g. Apache Spark). natsort – I understand the need; personally it is easier for me to write the sort function myself, and even better to avoid sorting as much as possible. prettytable – I prefer pandas printing over this printing. And last but not least – I am not sure I would really categorize all of these packages as "Data Analysis".

http://jyotiska.github.io/blog/posts/python_libraries.html

Python datacleaner – would definitely be a strong candidate when writing a post such as the one above. Designed to make the process of cleaning pandas data frames quicker and easier. Interesting to see how this project will evolve.

https://github.com/rhiever/datacleaner

Deep Detect – "Open Source + Deep Learning + API + Server". A deep learning version of PredictionIO. It is written in C++11 and uses Caffe for deep learning. It seems natural to me that PredictionIO and DeepDetect will cooperate in the future, or that someone will develop a deep learning template for PredictionIO.

http://www.deepdetect.com/

5 interesting things (22/03/2016) – Hebrew

I usually don't post links to posts in Hebrew because the audience is limited. This time I decided to make a Hebrew special. I appreciate people writing about technology in Hebrew for several reasons – the audience and feedback are limited, not all the terms exist in Hebrew, indentation, etc. It took me a while to write this post – to find 5 blogs I can really recommend. I hope all these blogs will keep being active and that more will join them.
This time the links are not to specific posts but to blogs.

Reversim – a podcast for developers, but not only for them. The blog / podcast was founded by Ran Tavory and Ori Lahav and is almost 7 years old now. They also hold an annual summit in Israel. There are several kinds of sub-series / episode types; my favorite is Bumpers, a monthly survey of cool things Ran Tavory and friends bumped into that month.

Software Archiblog – this blog is written by Lior Bar-On, who is currently Chief Architect @ Gett. On one hand there are deep-dive posts on AWS services, the Go language and so on. On the other hand there are softer posts regarding processes, choosing a technology stack, time management and so on.

the bloggerit – a blog by Hadas Sheinfeld, former VP Product at ClickTale. This blog has been running for almost 8 years now. She is the only female writer in this post – although Reversim hosts many guests from time to time, I don't remember the last time they hosted a female guest. She writes about product and organizational culture, as well as other topics from a wider angle.

Navi Sheker – the name of the blog refers to the biblical term for a "false prophet", and the subtitle of the blog is "Prophecy was given to fools", which is from Baba Batra, one of the tractates of the Mishnah. The posts in this blog mostly analyze Twitter chatter and Google searches in Hebrew, sometimes in correlation with actual events.

I, Code – compared to the other blogs, this one is relatively young – a few months old. Hopefully it will still be active a year from now (and beyond, of course). The posts in this blog are mostly code samples in Python and CSS. I liked the post about how to secure your home in less than 80 lines of code.

5 interesting things (12/03/2016)

In memory of Udy Brill

pypi analysis – this post starts with some technical issues, e.g. how to scrape package dependencies from PyPI packages, and goes on to show a graph of dependencies between Python packages as well as an analysis of the graph structure. What we learn from this analysis: requests, django and six are among the most important packages, both PageRank-wise and connectivity-degree-wise. requests also leads in the "betweenness centrality" metric, while django and six do not appear in its top 10 results. I believe the reason is that django and six are used for more specific use cases / applications, while requests is more general. One can see that the top-10 packages by "betweenness centrality" are quite general ones (testing and setup) as well as OpenStack clients.

The post includes two more visualizations – an adjacency matrix, which exposes the existence of cliques in the graph with some details about them, and the degree distribution of the graph: a long-tail distribution, as one would expect – most packages are not imported by anyone else, and a few packages are imported by many other packages.

http://kgullikson88.github.io/blog/pypi-analysis.html
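
The centrality measures discussed above can be reproduced on any dependency graph with networkx; here is a small sketch on a made-up graph:

```python
import networkx as nx

# Toy dependency graph: an edge A -> B means "package A depends on package B".
G = nx.DiGraph()
G.add_edges_from([
    ("flask", "requests"), ("scrapy", "requests"),
    ("pandas", "numpy"), ("scipy", "numpy"), ("requests", "six"),
])

print(nx.pagerank(G))                # importance of each package
print(nx.betweenness_centrality(G))  # how often a package bridges others
```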

Estimate user locations in social media – in the world of targeting and online advertising, one of the challenges is to learn as much as possible about the user for better targeting. Such data includes gender, age, marital status, fields of interest and of course location – there is no use in advertising a shop which is 500 km away from the user. But no one is perfect, and neither is data: we don't always know the user's location, and this post describes two approaches for estimating user location from social media.

http://www.lab41.org/2-highly-effective-ways-to-estimate-user-location-in-social-media/

10 Lessons from 10 Years of Amazon Web Services – a post by Werner Vogels, Amazon's CTO, concluding 10 years of AWS. Although I sometimes have criticism, they (together with other players and technology enhancements) truly changed the world. A very interesting read for both AWS users and non-users.

http://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html

Introduction to Boosting – TL;DR – think iteratively: at each step you improve the model so that it predicts well the samples which were not predicted correctly so far. A really good overview of the concept of boosting; now I want to use this knowledge and play with it –

https://codesachin.wordpress.com/2016/03/06/a-small-introduction-to-boosting/
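
As a starting point for playing with boosting, a minimal scikit-learn sketch on synthetic data (purely illustrative, unrelated to the linked post's code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new weak learner focuses on the samples the previous ones got wrong.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```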

curl vs wget – a comparison of the two tools by a contributor to both projects. While (for me) there is no immediate day-to-day implication, it is good to look a bit under the hood and see the pros and cons of each of two such common tools. For me it would have been easier to see it as a table.

https://daniel.haxx.se/docs/curl-vs-wget.html


Udy was a friend and a colleague of mine. He died on a trek in New Zealand during his honeymoon. For me he had the perfect mix of curiosity, professionalism, being a team player and being a positive person.

I'll remember Udy in everyday moments like eating food with a lot of sauce, reading something about sorting algorithms or listening to Tracy Chapman.

This is a video we made at a company hackathon where we worked together.

https://www.youtube.com/watch?v=jDZs4Mv_nMU