Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Data Quality

September 25, 2021

Data Curation Paper Reads - Data Quality - Data Cleaning

Paper #1 - Auto-Detect: Data-Driven Error Detection in Tables

Key Notes

Values in a column that do not conform to the patterns associated with its data type are flagged as errors.
Formulas that are inconsistent with other formulas in the same region are flagged as well.
A text clustering feature groups together similar values in a column.
Single-column approaches detect errors based only on the values within an input column.
When certain multi-column data quality rules (e.g. functional dependencies and other types of first-order logic) are available, violations of those rules can also be flagged.
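
To make the single-column, pattern-based idea concrete, here is a minimal sketch. It is not the method from the Auto-Detect paper, only a simplified illustration: each value is generalized to a pattern (digits become D, letters become L), and values whose pattern is rare within the column are flagged. The function names and the min_share threshold are assumptions made for this example.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Generalize a value: digit -> 'D', letter -> 'L', anything else kept as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("L")
        else:
            out.append(ch)
    return "".join(out)


def flag_rare_patterns(column, min_share=0.2):
    """Return values whose pattern covers less than min_share of the column.

    min_share is a hypothetical threshold chosen for illustration only.
    """
    patterns = [to_pattern(v) for v in column]
    counts = Counter(patterns)
    total = len(column)
    return [v for v, p in zip(column, patterns) if counts[p] / total < min_share]


dates = [
    "2021-09-25", "2021-09-26", "2021-10-01",
    "2021-10-02", "2021-10-03", "25/09/2021",
]
suspicious = flag_rare_patterns(dates)
```

Here five of the six values share the pattern DDDD-DD-DD, so the one value written as DD/DD/DDDD is the only one flagged. A fixed threshold like this is crude; it will misfire on columns that legitimately mix formats.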

Methods

Fixed-Regex (F-Regex)
dBoost
Compression-based dissimilarity measure (CDM)
Support vector data description (SVDD)
Distance-based outlier detection (DBOD)
Local outlier factor (LOF)
Multi-column error detection using rules
Single-column error detection
Numeric error detection
Outlier detection
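Application-driven error correction: recent approaches such as BoostClean and ActiveClean clean data with a specific downstream application in mind, such as training a machine learning model.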
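
As a rough illustration of one method from the list above, distance-based outlier detection, here is a minimal sketch for a numeric column. A value is treated as an outlier when too few other values lie within a given radius of it. The function name and the radius and min_neighbors parameters are assumptions for this example, and the quadratic loop is only suitable for small columns.

```python
def distance_based_outliers(values, radius, min_neighbors):
    """Return values that have fewer than min_neighbors other values within radius."""
    outliers = []
    for i, x in enumerate(values):
        # Count the other points whose absolute distance to x is at most radius.
        neighbors = sum(
            1 for j, y in enumerate(values)
            if i != j and abs(x - y) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(x)
    return outliers


readings = [10.0, 10.5, 11.0, 9.8, 10.2, 55.0]
flagged = distance_based_outliers(readings, radius=2.0, min_neighbors=2)
```

With these settings only 55.0 is flagged, because no other reading lies within 2.0 of it. Both parameters need tuning per column, which is one reason such baselines are hard to apply across many tables.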

I like this technique for data merging

Similarity between two words
Match between numbers
Match between First Name
Match between Last Name

Similarity distance function
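
A similarity function for merging records could be sketched as below. This is only an assumed illustration of the ideas listed above (word similarity, number match, first and last name match), not code from any of the papers. The field names, the weights, and the use of Python's difflib for string similarity are all choices made for this example.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def digits_only(text: str) -> str:
    """Keep only the digits, so '555-1234' and '5551234' compare equal."""
    return "".join(ch for ch in text if ch.isdigit())


def record_similarity(rec_a: dict, rec_b: dict) -> float:
    """Weighted similarity of two records; the weights are hypothetical."""
    weights = {"first_name": 0.4, "last_name": 0.4, "phone": 0.2}
    score = weights["first_name"] * word_similarity(
        rec_a["first_name"], rec_b["first_name"]
    )
    score += weights["last_name"] * word_similarity(
        rec_a["last_name"], rec_b["last_name"]
    )
    # Numbers are compared exactly after stripping formatting characters.
    same_phone = digits_only(rec_a["phone"]) == digits_only(rec_b["phone"])
    score += weights["phone"] * (1.0 if same_phone else 0.0)
    return score


a = {"first_name": "John", "last_name": "Smith", "phone": "555-1234"}
b = {"first_name": "Jon", "last_name": "Smith", "phone": "5551234"}
c = {"first_name": "Alice", "last_name": "Brown", "phone": "555-9999"}

close_score = record_similarity(a, b)
far_score = record_similarity(a, c)
```

Records a and b score much higher than a and c, so a threshold on this score could decide which pairs to merge. In practice the weights and threshold would be tuned on labelled matches.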

Deep learning for ER

BoostClean selects an ensemble of methods (statistical and logic rules) for error detection and for repair combinations using statistical boosting.

ActiveDetect - detects and prioritizes the most important data errors in a dataset.
SampleClean - A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data
AlphaClean - declaratively synthesizes data cleaning programs
A Data Quality Metric (DQM)
Data Cleaning for Data Science - PrivateClean, ActiveClean, and BoostClean.

More Reads

Keep Exploring!!!

Data Quality - Algorithm Fairness - Data Curation Papers

Paper #1 - “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI

Key Notes

Poor data quality in high-stakes domains can have outsized effects on vulnerable communities and contexts
Data Cascades: compounding events causing negative, downstream effects from data issues, resulting in technical debt over time
Many researchers have pointed to the undervalued human labour that powers AI models
Practitioners often work with a set of assumptions about their data during analysis and visualisation
Other frameworks to discover data bugs and clean data include ActiveClean and BoostClean
Data cascades are complex, long-term, occur frequently and persistently
Under-valuing of data work is common to all of AI development
Practitioners viewed data as operations, moved fast, hacked model performance (through hyperparameters rather than data quality)
Everyone wants to do the model work, not the data work
It was difficult to get buy-in from clients and funders to invest in good quality data collection and annotation work
Lack of adequate training on AI data quality
Cascades triggered by ‘hardware drifts’
Cascades triggered by ‘environmental drifts’

Paper #2 - Re-imagining Algorithmic Fairness in India and Beyond

Key Notes

While Indians are part of the AI workforce, a majority work in services, and engineers do not entirely represent marginalities, limiting re-mediation of distances
While other axes of discrimination and injustices such as disability status
Algorithms are powerful in India, where the distance between models and oppressed communities is large
“rich people problems like cardiac disease and cancer, not poor people’s Tuberculosis, prioritised in AI”

More Reads

Keep Thinking!!!
