Paper #1 - Auto-Detect: Data-Driven Error Detection in Tables
Key Notes
- Values in a column not conforming to patterns associated with a data-type are flagged as errors.
- Formulas inconsistent with other formulas in the same region are flagged as potential errors.
- Text clustering groups together similar values in a column.
- Single-column approaches detect errors only based on values within an input column.
- When certain multi-column data quality rules (e.g. functional dependencies and other types of first-order logic) are available, multi-column approaches can detect errors across columns.
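To make the pattern idea from the first note concrete, here is a minimal sketch of single-column detection. It is not the paper's corpus-driven method: it is a simple within-column frequency heuristic. The function names `generalize` and `flag_minority_patterns` and the `min_share` threshold are my own assumptions for illustration.

```python
import re
from collections import Counter


def generalize(value: str) -> str:
    """Map a value to a coarse pattern: digit runs -> D, letter runs -> L.

    Punctuation and whitespace are kept as-is, so "2011-01-01" becomes "D-D-D".
    """
    def repl(match: re.Match) -> str:
        return "D" if match.group(0)[0].isdigit() else "L"

    # Single pass, so an inserted "D" is never re-read as a letter.
    return re.sub(r"[0-9]+|[A-Za-z]+", repl, value)


def flag_minority_patterns(values, min_share=0.25):
    """Return (index, value) pairs whose pattern covers < min_share of the column."""
    patterns = [generalize(v) for v in values]
    counts = Counter(patterns)
    total = len(values)
    return [
        (i, v)
        for i, (v, p) in enumerate(zip(values, patterns))
        if counts[p] / total < min_share
    ]


if __name__ == "__main__":
    column = ["2011-01-01", "2011-01-02", "2011-01-03", "2011.01.04", "2011-01-05"]
    print(flag_minority_patterns(column))
```

In this toy column only the value with dots instead of dashes has a pattern (`D.D.D`) that differs from the majority (`D-D-D`), so it is the only one flagged. A column with several legitimate formats would need a lower threshold or a smarter compatibility measure.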
Methods
- Fixed-Regex (F-Regex)
- dBoost
- Compression-based dissimilarity measure (CDM)
- Support vector data description (SVDD)
- Distance-based outlier detection (DBOD)
- Local outlier factor (LOF)
- Multi-column error detection using rules
- Single-column error detection
- Numeric error detection
- Outlier detection
- Application-driven error correction: recent approaches such as BoostClean and ActiveClean
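Of the methods listed above, the compression-based dissimilarity measure (CDM) is the easiest to try out. A rough sketch, assuming the usual definition CDM(x, y) = C(xy) / (C(x) + C(y)), where C is the compressed size. Using `zlib` as the compressor and ranking values by their mean CDM to the rest of the column are my own choices for illustration, not something taken from the paper.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def cdm(x: str, y: str) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)).

    Lower values mean x and y share more structure.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def rank_by_mean_cdm(values):
    """Sort values by mean CDM to all other values, most dissimilar first."""
    if len(values) < 2:
        return [(0.0, v) for v in values]
    scored = []
    for i, v in enumerate(values):
        others = [w for j, w in enumerate(values) if j != i]
        mean = sum(cdm(v, w) for w in others) / len(others)
        scored.append((mean, v))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    x = "abcdefghij" * 50
    y = "0123456789" * 50
    print(cdm(x, x), cdm(x, y))  # the first should be the smaller of the two
```

One caveat: on very short cell values the fixed overhead of the compressor dominates, so the scores get noisy. The measure behaves better on longer strings.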
I like this technique for data merging
- Similarity between two words
- Match between numbers
- Match between First Name
- Match between Last Name
Similarity distance function
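The matching signals listed above can be combined into one similarity score. This is only a sketch under my own assumptions: the record fields (`first_name`, `last_name`, `phone`), the weights, and the use of `difflib.SequenceMatcher` for word similarity are illustrative choices, not a specific published technique.

```python
import re
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two words, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numbers_match(a: str, b: str) -> float:
    """1.0 if the digits of both strings are identical and non-empty, else 0.0."""
    digits_a = re.sub(r"\D", "", a)
    digits_b = re.sub(r"\D", "", b)
    return 1.0 if digits_a and digits_a == digits_b else 0.0


def record_similarity(r1: dict, r2: dict, weights=None) -> float:
    """Weighted average of first name, last name, and number match scores."""
    # Hypothetical default weights; tune them for real data.
    weights = weights or {"first_name": 0.3, "last_name": 0.4, "phone": 0.3}
    scores = {
        "first_name": word_similarity(r1["first_name"], r2["first_name"]),
        "last_name": word_similarity(r1["last_name"], r2["last_name"]),
        "phone": numbers_match(r1["phone"], r2["phone"]),
    }
    total_weight = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total_weight


if __name__ == "__main__":
    a = {"first_name": "Jon", "last_name": "Smith", "phone": "555-0100"}
    b = {"first_name": "John", "last_name": "Smith", "phone": "5550100"}
    c = {"first_name": "Mary", "last_name": "Jones", "phone": "555-0199"}
    print(record_similarity(a, b), record_similarity(a, c))
```

Pairs scoring above some threshold would be treated as candidates for merging. The threshold is left out on purpose because it depends heavily on the data.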
Deep learning for ER
BoostClean selects an ensemble of methods (statistical and logic rules) for error detection and for repair combinations using statistical boosting.
- ActiveDetect - detects and prioritizes the most important data errors in a dataset.
- SampleClean - A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data
- AlphaClean - declaratively synthesizes data cleaning programs
- A Data Quality Metric (DQM)
- Data Cleaning for Data Science - PrivateClean, ActiveClean, and BoostClean.
More Reads
- BoostClean: Automated Error Detection and Repair for Machine Learning
- Machine Learning-Based Data Cleaning : Current Solutions and Challenges
- Auto-Data Cleaning
- A Demonstration of DBWipes: Clean as You Query
- SampleClean: Fast and Reliable Analytics on Dirty Data
- ActiveClean: Interactive Data Cleaning For Statistical Modeling
- AlphaClean: Automatic Generation of Data Cleaning Pipelines
- Towards Automated Data Cleaning Workflows
Keep Exploring!!!