"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 25, 2021

Data Curation Paper Reads - Data Quality - Data Cleaning

Paper #1 - Auto-Detect: Data-Driven Error Detection in Tables

Key Notes

  • Values in a column not conforming to patterns associated with a data-type are flagged as errors.
  • Formulas inconsistent with other formulas in the same region are also flagged as errors.
  • A text clustering feature groups together similar values in a column.
  • Single-column approaches detect errors only based on values within an input column.
  • When certain multi-column data quality rules (e.g. functional dependencies and other types of first-order logic) are available, values that violate them can be flagged as errors.
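
To make the single-column idea concrete, here is a minimal sketch of pattern-based detection: each value is generalized to a coarse pattern and values whose pattern is rare within the column are flagged. This is only an illustration and not the Auto-Detect algorithm from the paper; the function names, the D/L pattern alphabet and the `max_share` threshold are my own assumptions.

```python
import re
from collections import Counter

# Runs of digits or runs of ASCII letters; everything else is kept as-is.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+")


def generalize(value: str) -> str:
    """Map a value to a coarse pattern: digit runs -> 'D', letter runs -> 'L'."""
    return _TOKEN.sub(lambda m: "D" if m.group()[0].isdigit() else "L", value)


def flag_rare_patterns(column, max_share=0.2):
    """Return values whose pattern covers less than max_share of the column.

    max_share is a hypothetical threshold chosen for this example.
    """
    patterns = [generalize(v) for v in column]
    counts = Counter(patterns)
    total = len(column)
    return [v for v, p in zip(column, patterns) if counts[p] / total < max_share]


dates = [
    "2021-09-25",
    "2021-09-26",
    "2021-10-01",
    "2021-10-02",
    "2021-11-13",
    "Sept 25, 2021",  # different pattern from the rest of the column
]
print(flag_rare_patterns(dates))  # ['Sept 25, 2021']
```

Five of the six values share the pattern `D-D-D`, so the one value with pattern `L D, D` falls under the threshold and is flagged. A fixed threshold like this is crude and would need tuning per column.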

Methods

  • Fixed-Regex (F-Regex)
  • dBoost
  • Compression-based dissimilarity measure (CDM)
  • Support vector data description (SVDD)
  • Distance-based outlier detection (DBOD)
  • Local outlier factor (LOF)
  • Multi-column error detection using rules
  • Single-column error detection
  • Numeric error detection
  • Outlier detection
  • Application-driven error correction, with recent approaches such as BoostClean and ActiveClean
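
As a quick illustration of one method from the list, the compression-based dissimilarity measure (CDM) can be sketched with the standard-library `zlib` compressor as CDM(x, y) = C(xy) / (C(x) + C(y)), where C is the compressed size. The choice of `zlib` and the sample strings are my assumptions; this is a sketch of the measure itself, not of how the paper applies it.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def cdm(x: str, y: str) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)).

    Values closer to 1 mean the two strings share little structure;
    lower values mean one string helps compress the other.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    return compressed_size(bx + by) / (compressed_size(bx) + compressed_size(by))


repeated_dates = "2021-09-25;" * 40
free_text = "The quick brown fox jumps over the lazy dog 0123456789 ,.;:!?"

print(cdm(repeated_dates, repeated_dates))  # low: identical structure
print(cdm(repeated_dates, free_text))       # higher: little shared structure
```

Compressors add fixed overhead, so the measure is noisy on very short strings; it is more meaningful on whole columns or longer values than on single cells.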

Record Linkage


I like this technique for data merging

  • Similarity between two words 
  • Match between numbers
  • Match between First Name
  • Match between Last Name

Similarity distance function
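
A minimal sketch of such a similarity function, assuming edit distance (Levenshtein) for the name fields and an exact digit match for the number field. The field names `first_name`, `last_name`, `number`, the weights and the sample records are hypothetical, picked only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def digits_only(text: str) -> str:
    """Keep only the digits so formatting differences are ignored."""
    return "".join(ch for ch in text if ch.isdigit())


def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted similarity of two records (weights are assumptions)."""
    score = 0.3 * string_similarity(r1["first_name"], r2["first_name"])
    score += 0.4 * string_similarity(r1["last_name"], r2["last_name"])
    same_number = digits_only(r1["number"]) == digits_only(r2["number"])
    score += 0.3 * (1.0 if same_number else 0.0)
    return score


a = {"first_name": "Jon", "last_name": "Smith", "number": "555-0100"}
b = {"first_name": "John", "last_name": "Smyth", "number": "(555) 0100"}
c = {"first_name": "Maria", "last_name": "Garcia", "number": "555-0199"}

print(record_similarity(a, b))  # high: likely the same person
print(record_similarity(a, c))  # low: likely different people
```

Pairs scoring above a chosen cutoff would be treated as candidate matches to merge; where to set that cutoff depends on the data and is not something these notes specify.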

Deep learning for ER



BoostClean selects an ensemble of methods (statistical and logic rules) for error detection and for repair combinations using statistical boosting.

More Reads

Keep Exploring!!!
