Under construction. Coming soon!
- IEEE Data Engineering Bulletin March 2021 Special Issue on Data Validation for ML
- Xu Chu, Ihab F. Ilyas, Sanjay Krishnan, Jiannan Wang: Data Cleaning: Overview and Emerging Challenges. SIGMOD Conference 2016: 2201-2206
These tools focus on identifying errors in datasets, without taking the downstream model or application into account. These include traditional constraint-based data cleaning methods, as well as those that use machine learning to detect and resolve data errors.
- HoloClean: combines functional dependencies, quantitative statistics, and external information in a single factor-graph model.
- Raha uses a library of error detectors, and treats the output of each as a feature in a holistic detection model. It then uses clustering and active learning to train the holistic model with few labels.
- Picket: Self-supervised Data Diagnostics for ML Pipelines: uses self-supervision to learn an error detection model.
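The detector-library idea described for Raha can be illustrated with a small sketch. This is not Raha's implementation: the toy table, the three detector functions, and the exact-match grouping (a stand-in for Raha's clustering and active learning) are all assumptions made for illustration. Each detector votes on each cell, the votes form a per-cell feature vector, and a handful of user labels are propagated to cells with the same vector.

```python
from collections import Counter, defaultdict

# Toy table (hypothetical values); every cell is a string.
rows = [
    {"zip": "10027", "city": "New York", "age": "34"},
    {"zip": "10027", "city": "New York", "age": ""},
    {"zip": "10027", "city": "Nwe York", "age": "41"},
    {"zip": "60601", "city": "Chicago", "age": ""},
    {"zip": "60601", "city": "Chicago", "age": "380"},
]

def is_missing(rows, i, col):
    """Flags empty cells."""
    return rows[i][col].strip() == ""

def violates_zip_city_fd(rows, i, col):
    """Flags city cells that disagree with the majority city for the same zip."""
    if col != "city":
        return False
    cities = Counter(r["city"] for r in rows if r["zip"] == rows[i]["zip"])
    return rows[i]["city"] != cities.most_common(1)[0][0]

def out_of_range_age(rows, i, col):
    """Flags numeric ages outside an assumed plausible range of 0 to 120."""
    if col != "age" or not rows[i][col].isdigit():
        return False
    return not (0 <= int(rows[i][col]) <= 120)

DETECTORS = [is_missing, violates_zip_city_fd, out_of_range_age]

def featurize(rows):
    """One binary feature per detector for every cell (row index, column)."""
    return {
        (i, col): tuple(int(d(rows, i, col)) for d in DETECTORS)
        for i, row in enumerate(rows)
        for col in row
    }

def propagate_labels(features, labeled):
    """Copy the majority user label to all cells sharing a feature vector."""
    groups = defaultdict(list)
    for cell, vec in features.items():
        groups[vec].append(cell)
    predicted = {}
    for cells in groups.values():
        labels = [labeled[c] for c in cells if c in labeled]
        if labels:
            verdict = Counter(labels).most_common(1)[0][0]
            for c in cells:
                predicted[c] = verdict
    return predicted

# The user labels four cells (True = erroneous); the rest are inferred.
user_labels = {(0, "zip"): False, (2, "city"): True,
               (3, "age"): True, (4, "age"): True}
predicted = propagate_labels(featurize(rows), user_labels)
errors = sorted(cell for cell, is_error in predicted.items() if is_error)
print(errors)
```

Note that the missing age in row 1 is never labeled by the user; it is flagged only because it shares a feature vector with the labeled missing age in row 3, which is the label-saving effect the holistic model is after.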
These data cleaning tools are meant to clean training datasets, and are co-designed with the trained model in mind.
- ActiveClean VLDB 2016: leverages model convexity to treat cleaning as an active learning problem.
- CPClean VLDB 2021: leverages the robustness of nearest-neighbor (NN) classifiers to local perturbations.
- BoostClean and AlphaClean: model data cleaning pipeline generation as an optimization problem, given a "data quality" function.
- Conformance Constraints SIGMOD 2021: learns constraints whose violation by a test record signals that inference over that record may be untrustworthy.
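To make the "pipeline generation as optimization" framing concrete, here is a minimal sketch. It is not the search procedure of any of the systems above: the records, the four cleaning operators, the `quality` function, and the brute-force enumeration are all hypothetical choices for illustration. The point is only that, once a data quality function is given, choosing a cleaning pipeline becomes a search for the operator sequence that maximizes it.

```python
from itertools import permutations

# Toy records (hypothetical); -9999.0 is an assumed sentinel for "no reading".
records = [
    {"country": "usa ", "temp_f": 72.0},
    {"country": "USA", "temp_f": 68.0},
    {"country": "U.S.A.", "temp_f": -9999.0},
    {"country": "Canada", "temp_f": 41.0},
]

# Candidate cleaning operators; each maps a list of records to a new list.
def strip_whitespace(rs):
    return [{**r, "country": r["country"].strip()} for r in rs]

def uppercase_country(rs):
    return [{**r, "country": r["country"].upper()} for r in rs]

def drop_dots(rs):
    return [{**r, "country": r["country"].replace(".", "")} for r in rs]

def null_sentinel_temps(rs):
    return [{**r, "temp_f": None if r["temp_f"] == -9999.0 else r["temp_f"]}
            for r in rs]

OPERATORS = [strip_whitespace, uppercase_country, drop_dots, null_sentinel_temps]
VALID_COUNTRIES = {"USA", "CANADA"}

def quality(rs):
    """Fraction of cells passing the checks; an explicit None counts as
    acceptable because it marks the value as missing rather than wrong."""
    ok = 0
    for r in rs:
        ok += r["country"] in VALID_COUNTRIES
        ok += r["temp_f"] is None or -100 <= r["temp_f"] <= 150
    return ok / (2 * len(rs))

def best_pipeline(rs, operators, max_len):
    """Exhaustively score every ordered operator sequence up to max_len."""
    best = ((), quality(rs))
    for k in range(1, max_len + 1):
        for pipeline in permutations(operators, k):
            cleaned = rs
            for op in pipeline:
                cleaned = op(cleaned)
            score = quality(cleaned)
            if score > best[1]:  # strict, so shorter pipelines win ties
                best = (pipeline, score)
    return best

baseline = quality(records)
pipeline, score = best_pipeline(records, OPERATORS, max_len=4)
print(baseline, score, [op.__name__ for op in pipeline])
```

Exhaustive enumeration is only feasible for a handful of operators; the search space grows factorially, which is why a real system needs a smarter search strategy than this one.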
These data cleaning tools are used to clean training datasets by using errors detected in the downstream application results. For instance, the application may use the model as part of an analytic query and visualize the result. If the user sees an anomaly in the visualization, she can submit the issue as a complaint.
- From Cleaning before ML to Cleaning for ML DE Bulletin 2021: recent survey of cleaning for and using machine learning.
- Complaint-driven Training Data Debugging for Query 2.0 SIGMOD 2020: leverages downstream query outputs to identify erroneous training data, framed as an influence analysis problem.
- Explaining Inference Queries with Bayesian Optimization VLDB 2021: leverages downstream query outputs to identify erroneous training data, framed as a hyperparameter search problem.
This line of work is closely related to the area of query explanations (e.g., Wu2013, Roy2014, Abuzaid2019) in that it uses errors in downstream results for data debugging.
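The complaint-driven workflow can be sketched end to end on a toy example. Everything here is an assumption for illustration: the one-dimensional data, the nearest-centroid model, the count query, and especially the brute-force leave-one-out retraining, which stands in for the influence analysis used in the papers above. The user complains that an aggregate over model predictions is wrong, and training records are ranked by how much removing each one moves the query result toward the expected value.

```python
def train(points):
    """Nearest-centroid 'model': the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))

def query(model, serving):
    """Analogue of: SELECT COUNT(*) FROM serving WHERE predict(x) = 1."""
    return sum(predict(model, x) == 1 for x in serving)

def rank_suspects(training, serving, expected):
    """Retrain without each record; a smaller gap to the expected query
    result makes that record a stronger suspect."""
    ranked = []
    for i in range(len(training)):
        model = train(training[:i] + training[i + 1:])
        gap = abs(query(model, serving) - expected)
        ranked.append((gap, i))
    return sorted(ranked)

# (feature, label) pairs; the last record is a planted label error.
training = [(1.0, 0), (2.0, 0), (3.0, 0),
            (8.0, 1), (9.0, 1), (10.0, 1),
            (4.0, 1)]
serving = [1.5, 2.5, 5.0, 5.4, 9.0]

observed = query(train(training), serving)
# Complaint: the user believes only one serving record should be class 1.
suspects = rank_suspects(training, serving, expected=1)
top_gap, top_index = suspects[0]
print(observed, training[top_index], top_gap)
```

Leave-one-out retraining costs one full training run per record, so it only works at toy scale; it is shown here because it makes explicit what "the influence of a training record on the query result" means.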
- Data Standardization:
- Label cleaning: