Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Data Curation Paper Reads - Data Quality

September 25, 2021

Data Curation Paper Reads - Data Quality - Data Cleaning

Paper #1 - Auto-Detect: Data-Driven Error Detection in Tables

Key Notes

Values in a column not conforming to patterns associated with a data-type are flagged as errors.
Formulas that are inconsistent with other formulas in the same region are flagged as errors.
A text clustering feature groups together similar values in a column.
Single-column approaches detect errors only based on values within an input column.
Multi-column approaches detect errors when certain multi-column data quality rules (e.g. functional dependencies and other types of first-order logic) are available.
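
To make the single-column idea above concrete, here is a minimal sketch of pattern-based error detection. It is not the Auto-Detect algorithm itself (the paper's approach is data-driven and more involved); it only illustrates the general notion of generalizing values to patterns and flagging the rare ones. The function names `generalize` and `flag_pattern_outliers` and the `min_share` threshold are assumptions made up for this example.

```python
from collections import Counter

def generalize(value: str) -> str:
    """Map each character to a class symbol: digit -> 'D', letter -> 'A', others kept as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def flag_pattern_outliers(column, min_share=0.2):
    """Return values whose pattern covers less than min_share of the column.

    min_share is a hypothetical threshold chosen for illustration only.
    """
    patterns = [generalize(v) for v in column]
    counts = Counter(patterns)
    total = len(column)
    return [v for v, p in zip(column, patterns) if counts[p] / total < min_share]

# Toy column: five dates use '-' separators, one uses '.'
dates = ["2011-01-01", "2011-01-02", "2011-01-03",
         "2011-01-04", "2011-01-05", "2011.01.06"]
suspects = flag_pattern_outliers(dates)
print(suspects)
```

A frequency threshold like this only works when the dominant pattern is the correct one, which is why it is a sketch rather than a general-purpose detector.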

Methods

Fixed-Regex (F-Regex)
dBoost
Compression-based dissimilarity measure (CDM)
Support vector data description (SVDD)
Distance-based outlier detection (DBOD)
Local outlier factor (LOF)
Multi-column error detection using rules
Single-column error detection
Numeric error detection
Outlier detection
Application-driven error correction: recent approaches such as BoostClean and ActiveClean
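
As an illustration of one method from the list above, the compression-based dissimilarity measure (CDM) is commonly defined as CDM(x, y) = C(xy) / (C(x) + C(y)), where C is the compressed size. Below is a rough sketch using `zlib` as the compressor; the choice of compressor and the sample strings are assumptions for this example, not something taken from the paper.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (compression level 9)."""
    return len(zlib.compress(data, 9))

def cdm(x: str, y: str) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)).

    Lower values suggest x and y share structure; values near 1 suggest
    they have little in common.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    return compressed_size(bx + by) / (compressed_size(bx) + compressed_size(by))

# Toy inputs: a block of date strings and an unrelated block of words
dates = "\n".join(f"2011-01-{d:02d}" for d in range(1, 29))
words = ("alpha bravo charlie delta echo foxtrot golf hotel india juliet "
         "kilo lima mike november oscar papa quebec romeo sierra tango")

same = cdm(dates, dates)        # identical content compresses well together
different = cdm(dates, words)   # unrelated content does not
print(same, different)
```

On very short strings the compressor's fixed overhead dominates, so a measure like this is more meaningful on whole columns or groups of values than on single cells.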

Record Linkage

I like this technique for data merging

Similarity between two words
Match between numbers
Match between First Name
Match between Last Name

Similarity distance function
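
A small sketch of what such a similarity function could look like for the fields listed above (first name, last name, numbers). It uses `difflib.SequenceMatcher` from the standard library for string similarity; the field names, weights and the 0.8 match threshold are hypothetical values chosen for illustration, and real record linkage systems typically tune or learn them.

```python
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def number_match(a: str, b: str) -> float:
    """1.0 if both values contain the same non-empty digit sequence, else 0.0."""
    digits_a = "".join(ch for ch in a if ch.isdigit())
    digits_b = "".join(ch for ch in b if ch.isdigit())
    return 1.0 if digits_a and digits_a == digits_b else 0.0

# Hypothetical per-field weights; they sum to 1 so the score stays in [0, 1]
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "phone": 0.3}

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted combination of per-field similarities."""
    return (
        WEIGHTS["first_name"] * string_similarity(r1["first_name"], r2["first_name"])
        + WEIGHTS["last_name"] * string_similarity(r1["last_name"], r2["last_name"])
        + WEIGHTS["phone"] * number_match(r1["phone"], r2["phone"])
    )

def is_match(r1: dict, r2: dict, threshold: float = 0.8) -> bool:
    """Declare two records the same entity when the score reaches the threshold."""
    return record_similarity(r1, r2) >= threshold

a = {"first_name": "Jon", "last_name": "Smith", "phone": "555-123-4567"}
b = {"first_name": "John", "last_name": "Smith", "phone": "(555) 123 4567"}
c = {"first_name": "Maria", "last_name": "Garcia", "phone": "555-987-6543"}

print(is_match(a, b), is_match(a, c))
```

Numbers are compared exactly after stripping formatting, while names get a fuzzy score, since a one-character difference in a name is often a typo but a one-digit difference in a phone number usually means a different number.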

Deep learning for ER

BoostClean selects an ensemble of methods (statistical and logic rules) for error detection and for repair combinations using statistical boosting.

ActiveDetect - detects and prioritizes the most important data errors in a dataset.
SampleClean - A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data
AlphaClean - declaratively synthesizes data cleaning programs
A Data Quality Metric (DQM)
Data Cleaning for Data Science - PrivateClean, ActiveClean, and BoostClean.
