Detecting Data Errors: Where are we and what needs to be done?
My summary and notes for “Detecting Data Errors: Where are we and what needs to be done?” by Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang. Proceedings of the VLDB Endowment 9.12 (2016): 993-1004.
Paper can be found – here
In this paper the researchers evaluate several data cleaning tools on different types of data errors and suggest a strategy for holistically running multiple tools to optimize the detection effort. The study focuses on automatically detecting errors rather than repairing them, since fully automatic repair is rarely allowed.
Current status – current data cleaning solutions usually belong to one or more of the following categories:
- Rule-based detection algorithms – the user specifies a set of rules, such as not-null constraints, functional dependencies or user-defined functions, that the data must obey, and the data cleaner finds any violations (see the sketch after this list). Example: NADEEF.
- Pattern enforcement and transformation tools – tools in this category discover syntactic or semantic patterns in the data and flag values that violate them. Examples: OpenRefine, Data Wrangler, DataXFormer, Trifacta, Katara.
- Quantitative error detection algorithms – find outliers and glitches in the data.
- Record linkage and de-duplication algorithms – identify records which refer to the same entity but are inconsistent or appear multiple times. Examples: Data Tamer, TAMR.
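To make the rule-based category concrete, here is a minimal sketch of how not-null and functional-dependency checks might look with pandas; the table and the zip -> city dependency are invented for illustration and have nothing to do with the paper's data sets.

```python
import pandas as pd

# Toy data (all names and values are made up): zip code should determine city.
df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Dana"],
    "zip":  ["10001", "10001", "94105", "94105"],
    "city": ["New York", "Newark", "San Francisco", "San Francisco"],
})

# Not-null rule: flag rows with a missing name.
not_null_violations = df[df["name"].isnull()]

# Functional dependency zip -> city: flag zip codes mapped to more than one city.
fd_violations = df.groupby("zip")["city"].nunique().loc[lambda s: s > 1]

print(not_null_violations)
print(fd_violations)
```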
Evaluation of tools
- Precision and recall of each tool
- Errors detected when applying all the tools together
- How many false positives are detected, since we would like to minimize the human effort.
Error types
- Outliers include data values that deviate from the distribution of values in a column of a table.
- Duplicates are distinct records that refer to the same real-world entity. If attribute values do not match, this could signify an error.
- Rule violations refer to values that violate any kind of integrity constraints, such as Not Null constraints and Uniqueness constraints.
- Pattern violations refer to values that violate syntactic and semantic constraints, such as alignment, formatting, misspellings and semantic data types.
(p. 995)
Some errors overlap and fit into more than one category.
[TR] – are these all the error types that exist? What about correlated errors across several records?
Data sets
[TR] – there are many data-set-specific details in the paper. As I am more interested in the ideas, those details are omitted here.
[TR] – the evaluation used relatively small datasets with a small number of columns. It would also be interesting to evaluate bigger and more complex datasets, e.g. Wikidata.
[TR] – consider the temporal dimension of the data, i.e. some properties may have an expiration date while others do not (e.g. birth place never changes while current location does).
Data cleaning tools
| | DBoost | DC-Clean | OpenRefine | Trifacta | Pentaho | KNIME | Katara | TAMR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pattern violations | | | + | + | + | + | + | |
| Constraint violations | | + | | | | | | |
| Outliers | + | | | | | | | |
| Duplicates | | | | | | | | + |
- DBoost – uses three common methods for outlier detection: histograms, Gaussian and multivariate Gaussian mixtures. The UVP of this tool is decomposing types into their building blocks, for example expanding dates into day, month and year (a sketch of the Gaussian variant follows after this tool list). DBoost requires configuration, such as the number of bins and their width for histograms, and mean and standard deviation for Gaussian and GMM.
- DC-Clean – focuses on denial constraints, which subsume the majority of the commonly used constraint languages. The collection of denial constraints was designed for each data set.
- OpenRefine – can digest data in multiple formats. Data exploration is performed through faceting and filtering operations ([TR] – reminds me of DBoost histograms).
- Trifacta – a commercial product which grew out of DataWrangler. It can predict and apply syntactic data transformations for data preparation and data cleaning. Transformations can also involve business logic.
- Katara – uses external knowledge bases, e.g. Yago, in order to detect errors that violate a semantic pattern. It does so by first identifying the type of each column and the relations between pairs of columns in the data set using the knowledge base.
[TR] – this assumes that the knowledge base is ground truth. We need to doubt this as well.
- Pentaho – provides a graphical interface for data wrangling and can orchestrate ETL processes.
- KNIME – focuses on workflow authoring and encapsulating data processing tasks with some machine learning capabilities.
- TAMR – uses machine learning models to learn duplicate features through expert sourcing and similarity metrics.
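As a rough illustration of the Gaussian flavour of DBoost-style outlier detection (this is my own sketch, not DBoost's implementation; the 2-standard-deviation threshold is an arbitrary choice):

```python
import numpy as np

def gaussian_outliers(values, n_std=2.0):
    """Flag values more than n_std standard deviations away from the mean,
    similar in spirit to DBoost's single-Gaussian detector (threshold is arbitrary)."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return np.zeros(len(values), dtype=bool)
    return np.abs(values - mean) > n_std * std

ages = [23, 25, 27, 24, 26, 250]    # 250 looks like a data entry error
print(gaussian_outliers(ages))       # only the last value is flagged
```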
Combination of Multiple tools
- Union all and Min-k (a sketch of both follows below):
- Union all – takes the union of the errors emitted by all tools (i.e. k = 1).
- Min-k – errors detected by at least k tools.
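A minimal sketch of these two strategies over per-tool error sets (the cell coordinates and tool outputs are invented):

```python
from collections import Counter

def min_k(tool_outputs, k=1):
    """Return cells flagged by at least k tools; k=1 is 'union all'."""
    counts = Counter(cell for cells in tool_outputs for cell in set(cells))
    return {cell for cell, count in counts.items() if count >= k}

# Each tool reports a set of (row_id, column) cells it believes are erroneous.
tool_outputs = [
    {(1, "city"), (4, "zip")},     # e.g. a rule-based tool
    {(1, "city"), (7, "age")},     # e.g. an outlier detector
    {(4, "zip")},                  # e.g. a pattern tool
]
print(min_k(tool_outputs, k=1))    # union of all detected errors
print(min_k(tool_outputs, k=2))    # errors confirmed by at least two tools
```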
- Ordering based on Precision
- Cost model –
- C – cost of having a human check a detected error
- V – Value of identifying a real error (V > C, otherwise it makes no sense).
- P – Number of true positives
- N – Number of false positives
For the total value to be positive, P * V > (P + N) * C must hold, i.e. P / (P + N) > C / V. P / (P + N) is the precision, therefore if a tool's precision is less than C / V we should not run it. Precision can be estimated by sampling the detected errors.
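A tiny sketch of this decision rule (the cost and value numbers are invented):

```python
def worth_running(estimated_precision, check_cost, error_value):
    """Run a tool only if its expected value beats the human verification cost,
    i.e. precision > C / V."""
    return estimated_precision > check_cost / error_value

# Hypothetical numbers: verifying a flagged cell costs 1 unit,
# while catching a real error is worth 5 units, so the threshold is C / V = 0.2.
print(worth_running(estimated_precision=0.4, check_cost=1, error_value=5))   # True
print(worth_running(estimated_precision=0.1, check_cost=1, error_value=5))   # False
```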
“We observed that some tools are not worth evaluating even if their precision is higher than the threshold, since the errors they detect may be covered by other tools with higher estimate precision (which would have been run earlier).” (p. 998)
[TR] – the cost and value cannot always be estimated correctly and easily, and not all errors have the same cost and value.
- Maximum entropy-based order selection – the algorithm estimates the overlap between the tools' results and picks the tool with the highest precision to reduce the entropy. Algorithm steps (a rough sketch follows after these steps):
- Run each tool individually and collect the detected errors.
- Estimate the precision of each tool by checking a sample of its detected errors.
- Pick the tool that maximizes the entropy among the tools not used so far – i.e. the one with the highest estimated precision on the sample – and verify those of its detected errors on the complete data that have not been verified before.
- Update the set of verified errors with those detected by the chosen tool and repeat the last two steps.
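A rough sketch of the greedy loop described above; it ranks by estimated precision and skips already-verified errors, but leaves out the paper's actual entropy and overlap estimation (tool names, outputs and precisions are invented):

```python
def greedy_tool_order(tool_errors, estimated_precision):
    """Greedily pick the remaining tool with the highest estimated precision and
    only verify the errors not already covered by previously chosen tools."""
    verified, order = set(), []
    remaining = dict(tool_errors)
    while remaining:
        best = max(remaining, key=lambda tool: estimated_precision[tool])
        new_errors = remaining.pop(best) - verified
        order.append((best, new_errors))
        verified |= new_errors
    return order

# Hypothetical tool outputs (sets of flagged cells) and sampled precision estimates.
tool_errors = {
    "dboost":   {(7, "age"), (9, "salary")},
    "dc_clean": {(1, "city"), (4, "zip")},
    "katara":   {(1, "city")},
}
estimated_precision = {"dboost": 0.3, "dc_clean": 0.8, "katara": 0.6}

for tool, errors in greedy_tool_order(tool_errors, estimated_precision):
    print(tool, "-> verify", errors)
```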
- Using domain-specific tools – for example AddressCleaner. [TR] – only relevant when such a tool exists and is easy / cheap to use.
- Enrichment – rule-based systems and duplicate detection strategies can benefit from additional data. Future work should consider a data enrichment system.
- No clear winner – different tools worked well on different data sets, mainly due to different distributions of error types. Therefore a holistic strategy must be used.
- Since some errors overlap, one can order the tools to minimize false positives. However, the ordering strategy is data-set specific.
- Still, not 100% of the errors are detected. Suggested ways to improve this are:
- Type-specific cleaning – for example date cleaning tools, address cleaning tools etc. Even those tools are limited in their recall.
- Enrichment of the data from external sources
- A holistic combination of tools – algorithms for combining tools. [TR] – reminds me of ensemble methods from machine learning.
- Data enrichment system – adding relevant knowledge and context to the data set.
- Interactive dashboard
- Reasoning on real-world data
Discussion and Improvements
Conclusions
Future work
5 interesting things (28/08/2016)
5 Python NLP packages
PyData Berlin 2016 #pydatabln
I got a diversity scholarship from NumFOCUS to attend the PyData Berlin event. NumFOCUS is an NGO which supports open source data science projects, among them Jupyter, matplotlib, NumPy, pandas, etc.
- I/O intensive operations – e.g. feature engineering and model serving.
- CPU intensive operations – e.g. hyper-parameter search and cross validation.
- No pure-Python local mode – impossible to use a profiler or ipdb.
- There is latency induced by the network architecture (Python driver -> Scala (JVM) -> Python worker).
- Tracebacks are sometimes hard to understand as there is a mix of Scala and Python errors and log data.
Konstantinovskiy wanted to present many options and ideas and ran too fast between the different algorithms without really explaining them. On the other hand there was no real focus on running the code and showing the package API. Unfortunately I feel this talk was a bit mishandled.
- Spearmint – designed to automatically run experiments (thus the code name spearmint) in a manner that iteratively adjusts a number of parameters so as to minimize some objective in as few runs as possible.
- HPOlib – hyperparameter optimization library
- Hyperopt – a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.
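A minimal Hyperopt usage sketch (the objective is a toy function, not a real model):

```python
from hyperopt import fmin, tpe, hp

# Toy objective: find x minimizing (x - 2)^2 over a continuous search space.
best = fmin(
    fn=lambda x: (x - 2) ** 2,
    space=hp.uniform("x", -10, 10),
    algo=tpe.suggest,
    max_evals=100,
)
print(best)   # something close to {'x': 2.0}
```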
- conda-forge
- manylinux – Python wheels that work on any linux (almost)
- Understanding the difficulty of training deep feedforward neural networks (Glorot and Bengio) – The term Xavier initialization comes from here. See here for additional information.
- Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (He, Zhang, Ren, Sun – Microsoft Research)
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe and Szegedy – Google) – a method to accelerate the training of deep neural networks by modifying the distribution of activations. See tutorial here.
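For reference, a minimal NumPy sketch of the Xavier (Glorot) uniform initialization for a fully connected layer (the layer sizes below are arbitrary):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Glorot & Bengio (2010): W ~ U(-limit, +limit) with limit = sqrt(6 / (fan_in + fan_out)),
    which keeps the variance of activations and gradients roughly constant across layers."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(784, 256)    # e.g. the first layer of an MNIST-sized MLP
print(W.std())                  # roughly sqrt(2 / (fan_in + fan_out))
```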
- Team structure – every team is responsible for its own components and infrastructure (including naming) while aligning with engineering guidelines, such as API structure, data types, etc., which are set by cross-team guilds.
- Open source code – whenever a team starts a new project, by default the project will be open source unless they can justify why not. Zalando's open source projects can be viewed here.
- Increase the training set from 300 images per brand to 800.
- Augment the training set by cropping parts of the pictures.
- Look at the errors and add training examples to make the NN more general.
- Increase the threshold of the SVM.
5 interesting things (16/05/2016)
5 interesting things (20/04/2016)
Spoiler detector – get your annotated data set for free! A nice (though not perfect) way to get an annotated data set and create some social good. It would be interesting to expand it to more movies and television shows and to other social networks, and to see how well a model trained on one social network performs on other social networks (for this case and in general).
http://www.insightdatascience.com/blog/fanguard.html
Deep dive into Python virtualenv – although it only addresses Python requirements and not more general system requirements (e.g. this case), it is a very important and common tool for making sure that all your Python requirements are in place. This post provides a deeper look into virtualenv and pyenv, a Python version management tool.
https://realpython.com/blog/python/python-virtual-environments-a-primer/
6 Lesser Known Python Data Analysis Libraries – TL;DR – mrjob, delorean, natsort, tinydb, prettytable and vincent. If I had to write the same blog post, I am not sure those are the packages I would have chosen. mrjob is maintained, but I feel it is a bit outdated and there are better ways nowadays to run multi-step MapReduce jobs (e.g. Apache Spark). natsort – I understand the need; personally it is easier for me to write the sort function myself, and even better, to avoid sorting as much as possible. prettytable – I prefer pandas' printing over this one. And last but not least – I am not sure I would really categorize all those packages as “Data Analysis”.
http://jyotiska.github.io/blog/posts/python_libraries.html
Python datacleaner – would definitely be a strong candidate for a post such as the one above. It is designed to make cleaning pandas data frames quicker and easier. It will be interesting to see how this project evolves.
https://github.com/rhiever/datacleaner
Deep Detect – “Open Source + Deep Learning + API + Server”. A deep learning version of PredictionIO. It is written in C++11 and uses Caffe for deep learning. It seems natural to me that PredictionIO and DeepDetect will cooperate in the future, or that someone will develop a deep learning template for PredictionIO.
Python’s Guide to the Galaxy – SPS16
My talk from Swiss Python Summit is online –
5 interesting things (22/03/2016) – Hebrew
I usually don't post links to posts in Hebrew because the audience is limited. This time I decided to make a Hebrew special. I appreciate people writing about technology in Hebrew for several reasons – the audience and feedback are limited, not all the terms exist in Hebrew, indentation issues, etc. It took me a while to write this post, to find 5 blogs I can really recommend. I hope all those blogs will keep being active and that more will join them.
This time the links are not to specific posts but to blogs.
Reversim – a podcast for developers, but not only for them. The blog / podcast was founded by Ran Tavory and Ori Lahav and is almost 7 years old now. They also hold an annual summit in Israel. There are several kinds of sub-series / chapter types; my favorite is Bumpers, which is a monthly survey of cool things Ran Tavory and friends bumped into that month.
Software Archiblog – this blog is written by Lior Bar-On, who is currently Chief Architect @ Gett. On one hand there are deep-dive posts about AWS services, the Go language and so on. On the other hand there are softer posts regarding processes, choosing a technology stack, time management and so on.
the bloggerit – a blog by Hadas Sheinfeld, former VP product at ClickTale. This blog has been running for almost 8 years now. She is the only female writer in this post; although Reversim hosts many guests from time to time, I don't remember the last time they hosted a female guest. She writes about product and organizational culture, as well as other topics from a wider angle.
Navi Sheker – the name of the blog comes from the Bible, meaning “false prophet”, and the subtitle of the blog is “Prophecy was given to fools”, which is from Bava Batra, one of the books of the Mishnah. The posts in this blog mostly analyze Twitter chatter and Google searches in Hebrew, sometimes in correlation with actual events.
I, Code – compared to the other blogs this one is relatively young – a few months old. Hopefully it will still be active a year from now (and beyond, of course). The posts in this blog are mostly code samples in Python and CSS. I like the post about how to secure your home in less than 80 lines of code.
5 interesting things (12/03/2016)
In memory of Udy Brill
pypi analysis – this post starts with some technical issues, e.g. how to scrape package dependencies from pypi packages, and goes on to show a graph of dependencies between Python packages as well as an analysis of the graph structure. What we learn from this analysis: requests, django and six are among the most important packages, PageRank- and connectivity-degree-wise. requests also leads in the betweenness centrality metric, while django and six do not appear in its top 10. I believe the reason is that django and six are used for more specific use cases / applications while requests is more general. One can see that the top 10 packages by betweenness centrality are quite general ones (testing and setup) as well as OpenStack clients.
The post includes two more visualizations – an adjacency matrix, which exposes the existence of cliques in the graph with some details about them, and the degree distribution of the graph: a long-tail distribution, as one can easily expect. Most packages are not imported by any other package, while a few packages are imported by many.
http://kgullikson88.github.io/blog/pypi-analysis.html
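For context, these centrality measures can be computed with networkx roughly like this (the dependency graph below is a toy example, not the real pypi graph):

```python
import networkx as nx

# Toy dependency graph: an edge A -> B means package A depends on package B.
g = nx.DiGraph([
    ("pkg_a", "requests"),
    ("pkg_b", "requests"),
    ("requests", "urllib3"),
    ("pkg_c", "six"),
])

print(nx.pagerank(g))                 # importance of each package
print(nx.betweenness_centrality(g))   # how often a package sits "between" other packages
print(dict(g.in_degree()))            # how many packages depend on each package
```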
Estimate user locations in social media – in the world of targeting and online advertisement, one of the challenges is to learn as much as possible about the user for better targeting. Such data includes gender, age, marital status, fields of interest and, of course, location – there is no use in advertising a shop which is 500 km away from the user. But no one is perfect, and neither is data. We don't always know the user's location, and this post describes two approaches for estimating it from social media.
http://www.lab41.org/2-highly-effective-ways-to-estimate-user-location-in-social-media/
10 Lessons from 10 Years of Amazon Web Services – a post by Werner Vogels, Amazon's CTO, concluding 10 years of AWS. Although I sometimes have criticism, they (together with other players and technology enhancements) truly changed the world. A very interesting read for both AWS users and non-users.
http://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html
Introduction to Boosting – TL;DR – think iteratively: at each step you improve the model so it will predict well the samples which were not predicted correctly so far. A really good overview of the concept of boosting; now I want to use this knowledge and play with it –
https://codesachin.wordpress.com/2016/03/06/a-small-introduction-to-boosting/
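A minimal scikit-learn sketch of the idea (toy data and default parameters, not taken from the linked post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy binary classification problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: each new weak learner focuses on the samples the previous ones got wrong.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```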
curl vs wget – a comparison of the two tools from a contributor to both projects. While (for me) there is no immediate day-to-day implication, it is good to look a bit under the hood and see the pros and cons of each of such common tools. For me it would have been easier to see it as a table.
https://daniel.haxx.se/docs/curl-vs-wget.html
Udy was a friend and a colleague of mine. He died on a track in New Zealand during his honeymoon. For me he had the perfect mix of curiosity, professionalism, being a team player and being a positive person.
I'll remember Udy in everyday moments, like eating food with a lot of sauce, reading something about sorting algorithms and listening to Tracy Chapman.
This is a video we made in a company hackathon we worked together.