"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 04, 2021

Data Curation Papers - Reads

Paper #1 - A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

  • Two aspects of data cleaning: what to clean and how to clean

Key Notes

  • SampleClean: Simulated Clean Data Instances - SampleClean proposes sampling the raw data and cleaning only that sample, so the sample can better represent clean data instances.
  • Approximate Query Processing (AQP). AQP consists of two steps: first, in the Direct Estimate (DE) step, a set of k rows is sampled randomly and cleaned, and the result is computed from the cleaned sample alone, independently of the dirty data; then a Correction step reweights the sample based on the contribution of the cleaned data (rough sketch after this list).
  • ActiveClean: Incremental Data Cleaning in Convex Models. ActiveClean gradually cleans a dirty dataset to learn a convex-loss model, such as Logistic Regression and Support Vector Machine (SVM).
  • HoloClean: Holistic Data Repairs With Probabilistic Inference
  • AlphaClean: Generate-Then-Search Parallel Data Cleaning
  • CPClean: Reusable Computation in Data Cleaning
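
My rough reading of the DE + Correction idea as a toy sketch over an AVG query; the synthetic data, the clean() stand-in, and the "value doubled" error are all made up for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dirty" column: true values ~N(50, 10), but some rows were doubled by a bug.
true_vals = rng.normal(50, 10, 10_000)
dirty = true_vals.copy()
corrupt = rng.random(10_000) < 0.2           # 20% of rows are corrupted
dirty[corrupt] *= 2                          # the error: value recorded twice too large

def clean(value, is_corrupt):
    """Stand-in for a (possibly expensive, human-in-the-loop) cleaning routine."""
    return value / 2 if is_corrupt else value

# Sample k rows and clean only those.
k = 500
idx = rng.choice(len(dirty), size=k, replace=False)
cleaned_sample = np.array([clean(dirty[i], corrupt[i]) for i in idx])

# Direct Estimate: answer the AVG query from the cleaned sample alone.
direct_estimate = cleaned_sample.mean()

# Correction: start from the dirty answer and subtract the estimated bias,
# i.e. reweight by the average (dirty - clean) difference seen in the sample.
dirty_answer = dirty.mean()
estimated_bias = (dirty[idx] - cleaned_sample).mean()
corrected_estimate = dirty_answer - estimated_bias

print(f"true mean      : {true_vals.mean():.2f}")
print(f"dirty mean     : {dirty_answer:.2f}")
print(f"direct estimate: {direct_estimate:.2f}")
print(f"corrected      : {corrected_estimate:.2f}")
```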

ML Papers - Learning-with-Label-Noise

Paper #2 - Advancing Data Curation With Metadata and Statistical Relational Learning

Key Notes

  • The authors refer to data science as an umbrella term gathering algorithms and techniques from several disciplines, such as statistics, software engineering, and machine learning.
  • Data is often inconsistent, duplicated, stale, incomplete, and/or inaccurate; common data errors include outliers, duplicates, missing values, and inconsistencies.
  • Mapping Metadata to Data Quality Issues
  • Error Detection
  • Joint Error Detection and Repair Suggestion


Data Quality Fundamentals

  • The Consistency dimension refers to the validity and integrity of values and tuples with respect to defined inter- and intra-relational constraints that exist within either single or multiple relations
  • The accuracy dimension identifies correct and true values of the entities presented by data.
  • Completeness is the degree to which values are included in a data collection
  • Timeliness dimension reflects the change and update of data by identifying the most current value of an entity in a database
  • Violations of the core data quality dimensions (Accuracy, Consistency, Uniqueness, Completeness, and Timeliness) lead to data quality issues (a toy scoring sketch follows at the end of this section)

  • Metadata is "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource"
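
A toy way to turn a few of these dimensions into numbers for a table; the column names and the 30-day staleness threshold below are arbitrary choices of mine, not definitions from the paper.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "email": ["a@x.com", None, "b@x.com", "c@x.com", None],
    "updated_at": pd.to_datetime(
        ["2021-09-30", "2021-01-15", "2021-09-01", "2020-12-01", "2021-10-01"]),
})

# Completeness: share of non-missing values per column.
completeness = df.notna().mean()

# Uniqueness: does the supposed key actually identify rows?
key_uniqueness = df["customer_id"].nunique() / len(df)

# Timeliness: share of rows updated within the last 30 days (as of this note's date).
now = pd.Timestamp("2021-10-04")
timeliness = (now - df["updated_at"] <= pd.Timedelta(days=30)).mean()

print(completeness)
print(f"key uniqueness: {key_uniqueness:.2f}, timeliness: {timeliness:.2f}")
```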


Single-Column Profiling Tasks

  • Cardinalities refer to the counts of values in a column; this category includes the measures below (small pandas sketch after this list):
  • Number of rows: the number of entities which are available in the table;
  • Distinctness: the number of distinct values of the single attribute;
  • Uniqueness: the ratio of the number of distinct values to the number of rows
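
A minimal pandas version of these counts (the DataFrame and column name are invented):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Oslo", "Bergen", "Oslo", None]})

num_rows = len(df)                               # number of rows
distinctness = df["city"].nunique(dropna=True)   # number of distinct values
uniqueness = distinctness / num_rows             # ratio of distinct values to rows

print(num_rows, distinctness, round(uniqueness, 2))   # 5 2 0.4
```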

Value Distribution refers to the distribution of values in a column. This category includes (sketch after this list):

  • Constancy: the ratio between the most frequent value count and the number of rows;
  • Extreme values: minimum and maximum values in numeric columns; shortest and longest strings in categorical, alphanumeric, or text columns;
  • Histogram: a summary of the value distribution of an attribute;
  • Quartiles: three points that divide numeric distribution into four equal groups;
  • Inverse distribution: an inverse frequency distribution (a distribution of the frequency distribution);
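
The same value-distribution measures, roughly, in pandas (toy series, and the 4-bin histogram is an arbitrary choice):

```python
import pandas as pd

s = pd.Series([3, 7, 7, 7, 12, 18, 25, 31, 31, 40])

constancy = s.value_counts().iloc[0] / len(s)               # most frequent value count / rows
extremes = (s.min(), s.max())                               # extreme values
quartiles = s.quantile([0.25, 0.5, 0.75])                   # the three quartile points
histogram = pd.cut(s, bins=4).value_counts().sort_index()   # coarse histogram summary
inverse = s.value_counts().value_counts()                   # distribution of the frequency distribution

print(constancy, extremes)
print(quartiles)
print(histogram)
print(inverse)
```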

Patterns

  • Patterns refer to the syntactic properties of the values in an individual column (sketch after this list).
  • Lengths, which specify descriptive statistics of the column value lengths
  • Decimals, which determine the number of decimal places in numeric columns
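
A small sketch of length, decimal, and syntactic-pattern profiling; the letters-to-'A' / digits-to-'9' encoding is just one common convention, not something prescribed by the paper.

```python
import pandas as pd

codes = pd.Series(["A-101", "B-7", "C-1003", "A-88"])
prices = pd.Series([10.5, 3.125, 7.0, 99.99])

# Lengths: descriptive statistics of the value lengths.
length_stats = codes.str.len().describe()

# Decimals: number of decimal places per numeric value (string-based, for illustration).
decimals = prices.astype(str).str.split(".").str[1].str.len()

# A crude syntactic pattern: letters -> 'A', digits -> '9'.
patterns = codes.str.replace(r"[A-Za-z]", "A", regex=True).str.replace(r"\d", "9", regex=True)

print(length_stats)
print(decimals.tolist())     # [1, 3, 1, 2]
print(patterns.unique())     # ['A-999', 'A-9', 'A-9999', 'A-99']
```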

Multi-Column Profiling Tasks

  • Functional dependencies: one column (or column set) functionally determines another when each value of the determinant maps to exactly one value of the dependent column (see the check sketched after this list)
  • What. The first dimension captures common data quality issues and typical data cleaning tasks found in the literature.
  • How. The second dimension reflects differently focused data cleaning approaches.
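
One quick way to test a candidate functional dependency (zip determines city here; the columns are invented) is to check that every determinant value maps to a single dependent value:

```python
import pandas as pd

df = pd.DataFrame({
    "zip":  ["0150", "0150", "5003", "5003"],
    "city": ["Oslo", "Oslo", "Bergen", "Trondheim"],   # the last row violates zip -> city
})

# zip -> city holds iff every zip value maps to exactly one city value.
dependents_per_zip = df.groupby("zip")["city"].nunique()
violating_zips = dependents_per_zip[dependents_per_zip > 1]

print(violating_zips)        # zip 5003 maps to two cities, so the FD is violated
```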

Rule-Based Approaches

  • These approaches use data cleaning rules or integrity constraints to detect and repair various error types in the dataset (toy example below).
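
A toy flavour of the rule-based style: domain and format rules flag violations, and simple repair rules fix what they safely can. Both rules below are examples of mine, not taken from any particular system.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34.0, -2.0, 51.0, 340.0], "country": ["NO", "no", "SE", "NO"]})

# Detection rules (integrity constraints).
bad_age = ~df["age"].between(0, 120)          # domain rule: 0 <= age <= 120
bad_country = ~df["country"].str.isupper()    # format rule: ISO country codes are upper-case

# Repair rules: fix what can be fixed safely, null out the rest.
repaired = df.copy()
repaired["country"] = repaired["country"].str.upper()
repaired.loc[bad_age, "age"] = np.nan

print(df.assign(bad_age=bad_age, bad_country=bad_country))
print(repaired)
```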

Statistical Approaches

  • The DEC (Detect-Explore-Clean) framework [22] uses statistical and other analytical techniques, such as the Fleiss’ kappa measure, to compute a glitch score that identifies and scores data glitches (rough agreement sketch below)
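
Not DEC itself, but roughly how inter-detector agreement could be quantified with Fleiss' kappa (hand-rolled from the standard formula) over binary dirty/clean votes, as one ingredient of a glitch-style score:

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """ratings[i, j] = number of raters assigning subject i to category j."""
    n_sub, _ = ratings.shape
    n_rat = ratings[0].sum()                      # raters per subject (assumed constant)
    p_j = ratings.sum(axis=0) / (n_sub * n_rat)   # category proportions
    P_i = (np.square(ratings).sum(axis=1) - n_rat) / (n_rat * (n_rat - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# 6 rows, 3 error detectors, binary verdicts (1 = flagged as a glitch).
flags = np.array([[1, 1, 1],
                  [0, 0, 0],
                  [1, 1, 0],
                  [0, 0, 0],
                  [1, 0, 1],
                  [0, 0, 0]])

# Convert raw verdicts to per-row category counts: [#clean votes, #dirty votes].
counts = np.stack([(flags == 0).sum(axis=1), (flags == 1).sum(axis=1)], axis=1)
print(round(fleiss_kappa(counts), 3))
```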

Probabilistic and Machine Learning-Based Approaches

  • The BoostClean system [141] addresses domain value violations when cleaning training data for predictive models (a loosely related selection sketch follows after this list)
  • The HoloClean system [202] treats error detection as a black-box component, expects integrity constraints or aligned data quality rules as input, and makes probabilistic suggestions on how to repair erroneous data values.
  • Interactive Data Cleaning
  • Numerous data cleaning systems use crowdsourcing for duplicate detection and resolution
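
Very loosely in the spirit of this family, and not BoostClean's actual algorithm: generate a few candidate cleaning actions and keep the one that helps a downstream model most on a validation split. The data, cleaners, and model choice below are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_dirty = X.copy()
X_dirty[rng.random(1000) < 0.15, 0] = 999.0      # injected domain-value violations

def impute_out_of_domain(X):
    """Replace implausible values in column 0 with the mean of the plausible ones."""
    Xc = X.copy()
    bad = np.abs(Xc[:, 0]) > 100
    Xc[bad, 0] = Xc[~bad, 0].mean()
    return Xc

cleaners = [("do_nothing", lambda X: X),
            ("impute_col0", impute_out_of_domain)]

X_tr, X_val, y_tr, y_val = train_test_split(X_dirty, y, test_size=0.3, random_state=0)

best = None
for name, fn in cleaners:
    model = LogisticRegression().fit(fn(X_tr), y_tr)
    acc = accuracy_score(y_val, model.predict(fn(X_val)))
    print(f"{name:<12} validation accuracy = {acc:.3f}")
    if best is None or acc > best[1]:
        best = (name, acc)

print("selected cleaning action:", best[0])
```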

Supervised Error Detection with Metadata


The proposed framework consists of three components:

1) an Error Detection Suite, which includes pluggable error detection systems that function as black boxes to the overall system.

2) a Metadata Profiler Suite, which extracts various metadata categories, and 

3) an Aggregation Suite, which combines the output of the error detection suite and the profiler (a minimal aggregation sketch follows).
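
How I picture the aggregation step, purely as my own sketch rather than the paper's code: stack the black-box detectors' votes with a couple of profiler-style metadata features per cell and train a supervised classifier on labelled errors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_cells = 2000

# Error Detection Suite: binary votes from three black-box detectors per cell.
detector_votes = rng.integers(0, 2, size=(n_cells, 3))

# Metadata Profiler Suite: e.g. value-frequency and pattern-conformance features per cell.
value_frequency = rng.random(n_cells)          # how common the cell's value is in its column
pattern_match = rng.integers(0, 2, n_cells)    # does the value match the column's dominant pattern

# Ground-truth error labels for training; simulated from the features here
# just to keep the example self-contained.
is_error = ((detector_votes.sum(axis=1) >= 2) & (pattern_match == 0)).astype(int)

# Aggregation Suite: a classifier over detector votes + metadata features.
features = np.column_stack([detector_votes, value_frequency, pattern_match])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, is_error)

print("predicted error rate:", clf.predict(features).mean())
```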

Keep Exploring!!!
