My summary and notes for “Detecting Data Errors: Where are we and what needs to be done?” by Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang. Proceedings of the VLDB Endowment 9.12 (2016): 993-1004.
The paper can be found here.
In this paper the researchers evaluate several data cleaning tools on different types of data errors and suggest a strategy for holistically running multiple tools to optimize the detection effort. The study focuses on automatically detecting errors rather than repairing them, since automatic repair is rarely allowed.
Current status – current data cleaning solutions usually belong to one or more of the following categories:
- Rule-based detection algorithms – the user specifies a set of rules that the data must obey, such as not-null constraints, functional dependencies, or user-defined functions, and the data cleaner finds any violation. Example: NADEEF.
- Pattern enforcement and transformation tools – tools in this category discover syntactic or semantic patterns in the data and detect values that violate them. Examples: OpenRefine, Data Wrangler, DataXFormer, Trifacta, Katara.
- Quantitative error detection algorithms – find outliers and glitches in the data.
- Record linkage and de-duplication algorithms – identify records that refer to the same entity but are inconsistent or appear multiple times. Examples: Data Tamer, TAMR.
Evaluation of tools
- Precision and recall of each tool
- Errors detected when applying all the tools together
- How many false positives are detected, since we would like to minimize the human verification effort.
- Outliers include data values that deviate from the distribution of values in a column of a table.
- Duplicates are distinct records that refer to the same real-world entity. If attribute values do not match, this could signify an error.
- Rule violations refer to values that violate any kind of integrity constraints, such as Not Null constraints and Uniqueness constraints.
- Pattern violations refer to values that violate syntactic and semantic constraints, such as alignment, formatting, misspellings and semantic data types.
Some errors overlap and fit into more than one category.
[TR] – are those all the error types that exist? What about correlated errors across several records?
[TR] – there are many data-set-specific details in the paper. As I am more interested in the ideas, those details are omitted here.
[TR] – the paper evaluated relatively small datasets with a small number of columns. It would also be interesting to evaluate on bigger and more complex datasets, e.g. Wikidata.
[TR] – consider the temporal dimension of the data, i.e. some properties may have an expiration date while others do not (e.g. birth place never changes while current location does).
Data cleaning tools
- dBoost – uses three common methods for outlier detection: histograms, Gaussian, and multivariate Gaussian mixtures. The unique value proposition of this tool is decomposing types into their building blocks, for example expanding dates into day, month and year. dBoost requires configuration, such as the number of bins and their widths for histograms, and the mean and standard deviation for Gaussian and GMM.
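The Gaussian strategy can be sketched roughly as follows. This is a minimal illustration, not dBoost's actual implementation; the two-standard-deviation threshold is an arbitrary configuration choice, mirroring the kind of parameters dBoost expects from the user:

```python
import statistics

def gaussian_outliers(values, n_std=2.0):
    """Flag values further than n_std standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > n_std * std]

ages = [34, 29, 31, 33, 30, 32, 250]  # 250 is a likely data-entry error
print(gaussian_outliers(ages))  # -> [250]
```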
- DC-Clean – focuses on denial constraints, which subsume the majority of commonly used constraint languages. The collection of denial constraints was designed separately for each data set.
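As a toy illustration of a denial constraint (not DC-Clean's syntax or engine), consider "no two rows may share a zip code but disagree on state" – the hypothetical attributes are mine:

```python
from itertools import combinations

def violates_zip_state_dc(rows):
    """Find tuple pairs violating the denial constraint:
    not exists t1, t2 with t1.zip == t2.zip and t1.state != t2.state."""
    violations = []
    for t1, t2 in combinations(rows, 2):
        if t1["zip"] == t2["zip"] and t1["state"] != t2["state"]:
            violations.append((t1, t2))
    return violations

rows = [
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "NJ"},   # violates the constraint
    {"zip": "60601", "state": "IL"},
]
print(violates_zip_state_dc(rows))
```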
- OpenRefine – can ingest data in multiple formats. Data exploration is performed through faceting and filtering operations ([TR] – reminds me of dBoost's histograms).
- Trifacta – a commercial product developed from Data Wrangler. It can predict and apply syntactic data transformations for data preparation and data cleaning. Transformations can also involve business logic.
- Katara – uses external knowledge bases, e.g. Yago, to detect errors that violate a semantic pattern. It first identifies the type of each column and the relations between pairs of columns in the data set using the knowledge base.
[TR] – this assumes the knowledge base is ground truth. We need to doubt this as well.
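The column-typing step could look roughly like this majority-vote sketch. The tiny in-memory "knowledge base" is my hypothetical stand-in for Yago, and Katara's actual matching is far more sophisticated:

```python
from collections import Counter

# Toy "knowledge base": value -> semantic type (hypothetical stand-in for Yago)
KB_TYPES = {
    "Paris": "city", "Berlin": "city", "Rome": "city",
    "France": "country", "Germany": "country",
}

def infer_column_type(values):
    """Infer a column's semantic type by majority vote over KB lookups;
    values that do not match the winning type are flagged as suspects."""
    types = Counter(KB_TYPES[v] for v in values if v in KB_TYPES)
    if not types:
        return None, []
    column_type, _ = types.most_common(1)[0]
    suspects = [v for v in values if KB_TYPES.get(v) != column_type]
    return column_type, suspects

print(infer_column_type(["Paris", "Berlin", "Germany", "Rome"]))
```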
- Pentaho – provides a graphical interface for data wrangling and can orchestrate ETL processes.
- KNIME – focuses on workflow authoring and encapsulating data processing tasks with some machine learning capabilities.
- TAMR – uses machine learning models to learn duplicate features through expert sourcing and similarity metrics.
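One similarity metric such systems might build on is token-level Jaccard similarity; the sketch below is a generic duplicate-candidate generator of my own, not TAMR's learned model, and the 0.5 threshold is arbitrary:

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def duplicate_candidates(records, threshold=0.5):
    """Pair up records whose token similarity reaches the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((records[i], records[j]))
    return pairs

recs = ["MIT CSAIL Cambridge MA", "MIT CSAIL Cambridge", "QCRI Doha Qatar"]
print(duplicate_candidates(recs))
```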
Combination of Multiple tools
- Union all and Min-k –
- Union all – takes the union of the errors emitted by all tools (i.e. k=1).
- Min-k – keeps only errors detected by at least k tools.
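Both strategies can be sketched with one counting function (my own minimal rendering; "cells" here are hypothetical (row, column) pairs flagged by each tool):

```python
from collections import Counter

def min_k(tool_outputs, k=1):
    """Keep cells flagged as errors by at least k tools (k=1 is union-all)."""
    counts = Counter(cell for output in tool_outputs for cell in set(output))
    return {cell for cell, n in counts.items() if n >= k}

outputs = [
    {("row1", "zip"), ("row2", "state")},   # tool A
    {("row1", "zip")},                      # tool B
    {("row3", "date")},                     # tool C
]
print(min_k(outputs, k=1))  # union of all detections
print(min_k(outputs, k=2))  # only cells two or more tools agree on
```

Raising k trades recall for precision: fewer false positives reach the human verifier, but errors found by a single tool are lost.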
- Ordering based on Precision
- Cost model –
- C – cost of having a human check a detected error
- V – value of identifying a real error (V > C, otherwise checking makes no sense).
- P – Number of true positives
- N – Number of false positives
For the total value to be positive we need P * V > (P + N) * C, i.e. P/(P + N) > C/V. P/(P + N) is the precision; therefore, if a tool's precision is below C/V we should not run it. Precision can be estimated by sampling the detected errors.
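The decision rule is a one-liner; the numbers below are made up for illustration:

```python
def worth_running(sample_errors, sample_true, cost, value):
    """Estimate precision from a verified sample and compare it to C/V."""
    precision = sample_true / sample_errors
    return precision > cost / value

# 100 sampled detections, 40 verified as real errors; C=1, V=3
print(worth_running(100, 40, cost=1.0, value=3.0))  # 0.4 > 0.333 -> True
```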
“We observed that some tools are not worth evaluating even if their precision is higher than the threshold, since the errors they detect may be covered by other tools with higher estimate precision (which would have been run earlier).” (p. 998)
[TR] – the cost and value cannot always be estimated correctly and easily, and not all errors have the same cost and value.
- Maximum entropy-based order selection – the algorithm estimates the overlap between tool results and repeatedly picks the tool with the highest estimated precision to reduce the entropy. Algorithm steps:
- Run individual tools – run each tool and collect the detected errors.
- Estimate precision for each tool by checking samples
- Pick the tool that maximizes the entropy among the tools unused so far – the one with the highest estimated precision on the sample – and verify its detected errors on the complete data that have not been verified before.
- Update – update the set of errors detected by the chosen tool in the last step and repeat the last two steps.
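The greedy skeleton of this loop might look as follows. This is a simplified sketch that omits the overlap estimation and re-sampling; the tool names and sampled precisions are hypothetical inputs, not the paper's results:

```python
def order_tools(tools, cost, value):
    """Greedy ordering: repeatedly pick the unused tool with the highest
    estimated precision, stopping once no tool clears the C/V threshold.
    `tools` maps a tool name to its sampled precision."""
    remaining = dict(tools)
    order = []
    while remaining:
        best = max(remaining, key=remaining.get)
        if remaining[best] <= cost / value:
            break  # no remaining tool is worth the verification cost
        order.append(best)
        del remaining[best]
    return order

print(order_tools({"dBoost": 0.6, "DC-Clean": 0.8, "Katara": 0.2},
                  cost=1.0, value=3.0))  # -> ['DC-Clean', 'dBoost']
```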
- Using domain-specific tools – for example AddressCleaner. [TR] – only relevant when such a tool exists and is easy/cheap to use.
- Enrichment – rule-based systems and duplicate detection strategies can benefit from additional data. Future work could consider a data enrichment system.
Discussion and Improvements
- No clear winner – different tools worked well on different data sets, mainly due to different error-type distributions. Therefore a holistic strategy must be used.
- Since some errors overlap, one can order the tools to minimize false positives. However, the ordering strategy is data-set specific.
- Still, not 100% of the errors are detected. Suggested ways to improve this are:
- Type-specific cleaning – for example date-cleaning tools, address-cleaning tools etc. Even those tools are limited in their recall.
- Enrichment of the data from external sources
- A holistic combination of tools – algorithms for combining tools. [TR] – reminds me of ensemble methods from machine learning.
- Data enrichment system – adding relevant knowledge and context to the data set.
- Interactive dashboard
- Reasoning on real-world data