"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 25, 2021

Data Curation Paper Reads - Data Quality - Data Cleaning

Paper #1 - Auto-Detect: Data-Driven Error Detection in Tables

Key Notes

  • Values in a column not conforming to patterns associated with a data-type are flagged as errors.
  • Formulas inconsistent with other formulas in the same region are flagged as errors.
  • Text clustering feature that groups together similar values in a column
  • Single-column approaches detect errors only based on values within an input column.
  • When certain multi-column data quality rules (e.g. functional dependencies and other types of first-order logic) are available, errors can also be detected across columns.
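
To make the first note concrete, here is a minimal sketch of single-column, pattern-based error detection. It is not the Auto-Detect algorithm itself; the helper names (`to_pattern`, `flag_rare_patterns`), the character classes, and the `min_share` threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Generalize a value: digits become 'D', letters become 'A', the rest is kept."""
    symbols = []
    for ch in value:
        if ch.isdigit():
            symbols.append("D")
        elif ch.isalpha():
            symbols.append("A")
        else:
            symbols.append(ch)
    return "".join(symbols)


def flag_rare_patterns(column, min_share=0.2):
    """Return values whose pattern covers less than min_share of the column.

    min_share is a hypothetical threshold, not a value from the paper.
    """
    patterns = [to_pattern(v) for v in column]
    counts = Counter(patterns)
    total = len(column)
    return [v for v, p in zip(column, patterns) if counts[p] / total < min_share]


dates = ["2021-09-25", "2021-10-01", "2021-11-13", "2021-12-30", "2022-01-07", "Sept 25"]
suspects = flag_rare_patterns(dates)
print(suspects)
```

In this toy column, five values share the pattern `DDDD-DD-DD`, so the one value with a different pattern is the only one flagged.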

Methods

  • Fixed-Regex (F-Regex)
  • dBoost
  • Compression-based dissimilarity measure (CDM)
  • Support vector data description (SVDD)
  • Distance-based outlier detection (DBOD)
  • Local outlier factor (LOF)
  • Multi-column error detection using rules
  • Single-column error detection
  • Numeric error detection
  • Outlier detection
  • Application-driven error correction, including recent approaches such as BoostClean and ActiveClean
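
As one example from the list above, the compression-based dissimilarity measure can be sketched in a few lines. CDM is commonly defined as C(xy) / (C(x) + C(y)), where C is the compressed size. The sketch assumes `zlib` as the compressor, and the sample byte strings are made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data))


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)).

    Lower values mean x and y share more structure, because the
    concatenation compresses almost as well as either part alone.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


# Hypothetical column contents, serialized as bytes.
date_like = b"2021-09-25;" * 50
noise_like = b"N/A,unknown,??\n" * 50

print(cdm(date_like, date_like))   # similar content: lower score
print(cdm(date_like, noise_like))  # dissimilar content: higher score
```

On inputs this short, the compressor's fixed overhead skews the absolute numbers, so only the relative ordering of the scores is meaningful.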

Record Linkage


I like this technique for data merging

  • Similarity between two words 
  • Match between numbers
  • Match between First Name
  • Match between Last Name

Similarity distance function
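
The field-by-field matching listed above can be sketched as follows. This is an illustration only: the record layout, the equal weighting of fields, and the 0.8 threshold are assumptions. String similarity uses the standard library's `difflib.SequenceMatcher`, standing in for whatever distance function a real record-linkage tool would use.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Similarity between two words in [0, 1], ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def number_match(a: str, b: str) -> float:
    """1.0 if the digits in both strings are identical, else 0.0."""
    def digits(s: str) -> str:
        return "".join(ch for ch in s if ch.isdigit())
    return 1.0 if digits(a) == digits(b) else 0.0


def record_similarity(r1: dict, r2: dict) -> float:
    """Average of per-field scores (equal weights are an assumption)."""
    scores = [
        text_similarity(r1["first_name"], r2["first_name"]),
        text_similarity(r1["last_name"], r2["last_name"]),
        number_match(r1["phone"], r2["phone"]),
    ]
    return sum(scores) / len(scores)


def is_match(r1: dict, r2: dict, threshold: float = 0.8) -> bool:
    """Declare two records the same entity when the score reaches the threshold."""
    return record_similarity(r1, r2) >= threshold


# Hypothetical records.
a = {"first_name": "Jon", "last_name": "Smith", "phone": "555-0101"}
b = {"first_name": "John", "last_name": "Smith", "phone": "5550101"}
c = {"first_name": "Maria", "last_name": "Lopez", "phone": "555-0199"}

print(is_match(a, b), is_match(a, c))
```

Here `a` and `b` differ only by a spelling variant and phone formatting, so they score high, while `a` and `c` share nothing and score zero.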

Deep learning for ER



BoostClean selects an ensemble of methods (statistical and logic rules) for error detection and for repair combinations using statistical boosting.
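
The sketch below illustrates the idea of choosing among detection and repair combinations by their effect on a downstream model. It is deliberately simplified and is not BoostClean: instead of statistical boosting over an ensemble, it tries each (detector, repair) pair once and keeps the one with the best accuracy. The toy data, the detector and repair libraries, and the threshold classifier are all assumptions.

```python
from itertools import product
from statistics import mean, median

# Hypothetical toy data: (feature, label); 999.0 is a sentinel for "missing".
train = [(1.0, 0), (2.0, 0), (999.0, 0), (1.5, 0),
         (8.0, 1), (9.0, 1), (999.0, 1), (8.5, 1)]
test = [(1.2, 0), (2.2, 0), (7.9, 1), (9.1, 1)]

# Candidate detectors (value -> is_error) and repairs (clean values -> fill value).
detectors = {
    "none": lambda v: False,
    "sentinel_999": lambda v: v == 999.0,
}
repairs = {
    "median_impute": lambda clean: median(clean),
    "mean_impute": lambda clean: mean(clean),
}


def apply_pair(rows, detect, repair):
    """Replace every detected value with the repair computed from undetected ones."""
    clean = [x for x, _ in rows if not detect(x)]
    fill = repair(clean)
    return [(fill if detect(x) else x, y) for x, y in rows]


def fit_threshold(rows):
    """Toy classifier: threshold halfway between the two class means."""
    m0 = mean(x for x, y in rows if y == 0)
    m1 = mean(x for x, y in rows if y == 1)
    return (m0 + m1) / 2


def accuracy(threshold, rows):
    """Share of rows where 'x > threshold' agrees with the label."""
    return mean(int((x > threshold) == bool(y)) for x, y in rows)


def select_best_pair(train_rows, test_rows):
    """Score every (detector, repair) pair by downstream accuracy; return the best."""
    scores = {}
    for (d_name, d), (r_name, r) in product(detectors.items(), repairs.items()):
        repaired = apply_pair(train_rows, d, r)
        scores[(d_name, r_name)] = accuracy(fit_threshold(repaired), test_rows)
    return max(scores, key=scores.get), scores


best, scores = select_best_pair(train, test)
print(best, scores[best])
```

Leaving the sentinel values in place drags both class means toward 999 and ruins the threshold, so any pair that detects and imputes them wins in this example.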

More Reads

Keep Exploring!!!

Data Quality - Algorithm Fairness - Data Curation Papers

Paper #1 - “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI

Key Notes

  • Poor data quality in high-stakes domains can have outsized effects on vulnerable communities and contexts
  • Data Cascades: compounding events causing negative, downstream effects from data issues, resulting in technical debt over time
  • Many researchers have pointed to the undervalued human labour that powers AI models 
  • Practitioners often work with a set of assumptions about their data during analysis and visualisation
  • Other frameworks to discover data bugs and clean data include ActiveClean and BoostClean
  • Data cascades are complex, long-term, occur frequently and persistently
  • Under-valuing of data work is common to all of AI development
  • Practitioners viewed data as operations, moved fast, hacked model performance (through hyperparameters rather than data quality)
  • Everyone wants to do the model work, not the data work
  • It was difficult to get buy-in from clients and funders to invest in good quality data collection and annotation work
  • Lack of adequate training on AI data quality
  • Cascades triggered by ‘hardware drifts’
  • Cascades triggered by ‘environmental drifts’

Paper #2 - Re-imagining Algorithmic Fairness in India and Beyond

Key Notes

  • While Indians are part of the AI workforce, a majority work in services, and engineers do not entirely represent marginalities, limiting re-mediation of distances
  • Other axes of discrimination and injustice, such as disability status
  • Algorithms are powerful in India, where the distance between models and oppressed communities is large
  • “rich people problems like cardiac disease and cancer, not poor people’s Tuberculosis, prioritised in AI”


More Reads

Keep Thinking!!!