"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 04, 2021

Data Curation Papers - Reads

Paper #1 - A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

  • Two aspects of data cleaning: what to clean and how to clean

Key Notes

  • SampleClean: Simulated Clean Data Instances - SampleClean proposes sampling the raw data and cleaning only that sample, so the sample can better represent clean data instances.
  • Approximate Query Processing (AQP). AQP consists of two steps: first, in the Direct Estimate (DE) step, a set of k rows is sampled randomly and cleaned, and the result is computed from the cleaned sample alone, independently of the dirty data; then a Correction step reweights the sample based on the contribution of the cleaned data (rough sketch after this list).
  • ActiveClean: Incremental Data Cleaning in Convex Models. ActiveClean gradually cleans a dirty dataset to learn a convex-loss model, such as Logistic Regression and Support Vector Machine (SVM).
  • HoloClean: Holistic Data Repairs With Probabilistic Inference
  • AlphaClean: Generate-Then-Search Parallel Data Cleaning
  • CPClean: Reusable Computation in Data Cleaning
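
My rough reading of the DE + Correction idea as a toy sketch over an AVG query; the synthetic data, the clean() stand-in, and the "value doubled" error are all made up for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dirty" column: true values ~N(50, 10), but some rows were doubled by a bug.
true_vals = rng.normal(50, 10, 10_000)
dirty = true_vals.copy()
corrupt = rng.random(10_000) < 0.2           # 20% of rows are corrupted
dirty[corrupt] *= 2                          # the error: value recorded twice too large

def clean(value, is_corrupt):
    """Stand-in for a (possibly expensive, human-in-the-loop) cleaning routine."""
    return value / 2 if is_corrupt else value

# Sample k rows and clean only those.
k = 500
idx = rng.choice(len(dirty), size=k, replace=False)
cleaned_sample = np.array([clean(dirty[i], corrupt[i]) for i in idx])

# Direct Estimate: answer the AVG query from the cleaned sample alone.
direct_estimate = cleaned_sample.mean()

# Correction: start from the dirty answer and subtract the estimated bias,
# i.e. reweight by the average (dirty - clean) difference seen in the sample.
dirty_answer = dirty.mean()
estimated_bias = (dirty[idx] - cleaned_sample).mean()
corrected_estimate = dirty_answer - estimated_bias

print(f"true mean      : {true_vals.mean():.2f}")
print(f"dirty mean     : {dirty_answer:.2f}")
print(f"direct estimate: {direct_estimate:.2f}")
print(f"corrected      : {corrected_estimate:.2f}")
```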

ML Papers - Learning-with-Label-Noise

Paper #2 - Advancing Data Curation With Metadata and Statistical Relational Learning

Key Notes

  • The authors refer to data science as an umbrella term gathering algorithms and techniques from several disciplines, such as statistics, software engineering, and machine learning.
  • Data is often inconsistent, duplicated, stale, incomplete, and/or inaccurate; common data errors include outliers, duplicates, missing values, and inconsistencies.
  • Mapping Metadata to Data Quality Issues
  • Error Detection
  • Joint Error Detection and Repair Suggestion


Data Quality Fundamentals

  • The Consistency dimension refers to the validity and integrity of values and tuples with respect to defined inter- and intra-relational constraints that exist within either single or multiple relations
  • The accuracy dimension identifies correct and true values of the entities presented by data.
  • Completeness is the degree to which values are included in a data collection
  • Timeliness dimension reflects the change and update of data by identifying the most current value of an entity in a database
  • Violations of the core data quality dimensions (Accuracy, Consistency, Uniqueness, Completeness, and Timeliness) lead to data quality issues (a toy scoring sketch follows at the end of this section)

  • Metadata is "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource"
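
A toy way to turn a few of these dimensions into numbers for a table; the column names and the 30-day staleness threshold below are arbitrary choices of mine, not definitions from the paper.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "email": ["a@x.com", None, "b@x.com", "c@x.com", None],
    "updated_at": pd.to_datetime(
        ["2021-09-30", "2021-01-15", "2021-09-01", "2020-12-01", "2021-10-01"]),
})

# Completeness: share of non-missing values per column.
completeness = df.notna().mean()

# Uniqueness: does the supposed key actually identify rows?
key_uniqueness = df["customer_id"].nunique() / len(df)

# Timeliness: share of rows updated within the last 30 days (as of this note's date).
now = pd.Timestamp("2021-10-04")
timeliness = (now - df["updated_at"] <= pd.Timedelta(days=30)).mean()

print(completeness)
print(f"key uniqueness: {key_uniqueness:.2f}, timeliness: {timeliness:.2f}")
```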


Single-Column Profiling Tasks

  • Cardinalities refer to the counts of values in a column; this category includes the measures below (small pandas sketch after this list):
  • Number of rows: the number of entities which are available in the table;
  • Distinctness: the number of distinct values of the single attribute;
  • Uniqueness: the ratio of the number of distinct values to the number of rows
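
A minimal pandas version of these counts (the DataFrame and column name are invented):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Oslo", "Bergen", "Oslo", None]})

num_rows = len(df)                               # number of rows
distinctness = df["city"].nunique(dropna=True)   # number of distinct values
uniqueness = distinctness / num_rows             # ratio of distinct values to rows

print(num_rows, distinctness, round(uniqueness, 2))   # 5 2 0.4
```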

Value Distribution refers to the distribution of values in a column. This category includes (sketch after this list):

  • Constancy: the ratio between the most frequent value count and the number of rows;
  • Extreme values: minimum and maximum values in numeric columns; shortest and longest strings in categorical, alphanumeric, or text columns;
  • Histogram: a summary of the value distribution of an attribute;
  • Quartiles: three points that divide numeric distribution into four equal groups;
  • Inverse distribution: an inverse frequency distribution (a distribution of the frequency distribution);
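
The same value-distribution measures, roughly, in pandas (toy series, and the 4-bin histogram is an arbitrary choice):

```python
import pandas as pd

s = pd.Series([3, 7, 7, 7, 12, 18, 25, 31, 31, 40])

constancy = s.value_counts().iloc[0] / len(s)               # most frequent value count / rows
extremes = (s.min(), s.max())                               # extreme values
quartiles = s.quantile([0.25, 0.5, 0.75])                   # the three quartile points
histogram = pd.cut(s, bins=4).value_counts().sort_index()   # coarse histogram summary
inverse = s.value_counts().value_counts()                   # distribution of the frequency distribution

print(constancy, extremes)
print(quartiles)
print(histogram)
print(inverse)
```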

Patterns

  • Patterns refer to the syntactic properties of the values in an individual column (sketch after this list).
  • Lengths, which specify descriptive statistics of the column value lengths
  • Decimals, which determine the number of decimal places in numeric columns
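
A small sketch of length, decimal, and syntactic-pattern profiling; the letters-to-'A' / digits-to-'9' encoding is just one common convention, not something prescribed by the paper.

```python
import pandas as pd

codes = pd.Series(["A-101", "B-7", "C-1003", "A-88"])
prices = pd.Series([10.5, 3.125, 7.0, 99.99])

# Lengths: descriptive statistics of the value lengths.
length_stats = codes.str.len().describe()

# Decimals: number of decimal places per numeric value (string-based, for illustration).
decimals = prices.astype(str).str.split(".").str[1].str.len()

# A crude syntactic pattern: letters -> 'A', digits -> '9'.
patterns = codes.str.replace(r"[A-Za-z]", "A", regex=True).str.replace(r"\d", "9", regex=True)

print(length_stats)
print(decimals.tolist())     # [1, 3, 1, 2]
print(patterns.unique())     # ['A-999', 'A-9', 'A-9999', 'A-99']
```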

Multi-Column Profiling Tasks

  • Functional dependencies: one column (or column set) functionally determines another when each value of the determinant maps to exactly one value of the dependent column (see the check sketched after this list)
  • What. The first dimension captures common data quality issues and typical data cleaning tasks found in the literature.
  • How. The second dimension reflects differently focused data cleaning approaches.
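
One quick way to test a candidate functional dependency (zip determines city here; the columns are invented) is to check that every determinant value maps to a single dependent value:

```python
import pandas as pd

df = pd.DataFrame({
    "zip":  ["0150", "0150", "5003", "5003"],
    "city": ["Oslo", "Oslo", "Bergen", "Trondheim"],   # the last row violates zip -> city
})

# zip -> city holds iff every zip value maps to exactly one city value.
dependents_per_zip = df.groupby("zip")["city"].nunique()
violating_zips = dependents_per_zip[dependents_per_zip > 1]

print(violating_zips)        # zip 5003 maps to two cities, so the FD is violated
```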

Rule-Based Approaches

  • These approaches use data cleaning rules or integrity constraints to detect and repair various error types in the dataset (toy example below).
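
A toy flavour of the rule-based style: domain and format rules flag violations, and simple repair rules fix what they safely can. Both rules below are examples of mine, not taken from any particular system.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34.0, -2.0, 51.0, 340.0], "country": ["NO", "no", "SE", "NO"]})

# Detection rules (integrity constraints).
bad_age = ~df["age"].between(0, 120)          # domain rule: 0 <= age <= 120
bad_country = ~df["country"].str.isupper()    # format rule: ISO country codes are upper-case

# Repair rules: fix what can be fixed safely, null out the rest.
repaired = df.copy()
repaired["country"] = repaired["country"].str.upper()
repaired.loc[bad_age, "age"] = np.nan

print(df.assign(bad_age=bad_age, bad_country=bad_country))
print(repaired)
```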

Statistical Approaches

  • The DEC (Detect-Explore-Clean) framework [22] uses statistical and other analytical techniques, such as the Fleiss’ kappa measure, to compute a glitch score that identifies and scores data glitches (rough agreement sketch below)
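
Not DEC itself, but roughly how inter-detector agreement could be quantified with Fleiss' kappa (hand-rolled from the standard formula) over binary dirty/clean votes, as one ingredient of a glitch-style score:

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """ratings[i, j] = number of raters assigning subject i to category j."""
    n_sub, _ = ratings.shape
    n_rat = ratings[0].sum()                      # raters per subject (assumed constant)
    p_j = ratings.sum(axis=0) / (n_sub * n_rat)   # category proportions
    P_i = (np.square(ratings).sum(axis=1) - n_rat) / (n_rat * (n_rat - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# 6 rows, 3 error detectors, binary verdicts (1 = flagged as a glitch).
flags = np.array([[1, 1, 1],
                  [0, 0, 0],
                  [1, 1, 0],
                  [0, 0, 0],
                  [1, 0, 1],
                  [0, 0, 0]])

# Convert raw verdicts to per-row category counts: [#clean votes, #dirty votes].
counts = np.stack([(flags == 0).sum(axis=1), (flags == 1).sum(axis=1)], axis=1)
print(round(fleiss_kappa(counts), 3))
```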

Probabilistic and Machine Learning-Based Approaches

  • The BoostClean system [141] addresses domain value violations when cleaning training data for predictive models (a loosely related selection sketch follows after this list)
  • The HoloClean system [202] treats error detection as a black-box component, expects integrity constraints or aligned data quality rules as input, and makes probabilistic suggestions on how to repair erroneous data values.
  • Interactive Data Cleaning
  • Numerous data cleaning systems use crowdsourcing for duplicate detection and resolution
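
Very loosely in the spirit of this family, and not BoostClean's actual algorithm: generate a few candidate cleaning actions and keep the one that helps a downstream model most on a validation split. The data, cleaners, and model choice below are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_dirty = X.copy()
X_dirty[rng.random(1000) < 0.15, 0] = 999.0      # injected domain-value violations

def impute_out_of_domain(X):
    """Replace implausible values in column 0 with the mean of the plausible ones."""
    Xc = X.copy()
    bad = np.abs(Xc[:, 0]) > 100
    Xc[bad, 0] = Xc[~bad, 0].mean()
    return Xc

cleaners = [("do_nothing", lambda X: X),
            ("impute_col0", impute_out_of_domain)]

X_tr, X_val, y_tr, y_val = train_test_split(X_dirty, y, test_size=0.3, random_state=0)

best = None
for name, fn in cleaners:
    model = LogisticRegression().fit(fn(X_tr), y_tr)
    acc = accuracy_score(y_val, model.predict(fn(X_val)))
    print(f"{name:<12} validation accuracy = {acc:.3f}")
    if best is None or acc > best[1]:
        best = (name, acc)

print("selected cleaning action:", best[0])
```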

Supervised Error Detection with Metadata


The proposed framework consists of three components:

1) an Error Detection Suite, which includes pluggable error detection systems that function as black boxes to the overall system.

2) a Metadata Profiler Suite, which extracts various metadata categories, and 

3) an Aggregation Suite, which combines the output of the error detection suite and the profiler (a minimal aggregation sketch follows).
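
How I picture the aggregation step, purely as my own sketch rather than the paper's code: stack the black-box detectors' votes with a couple of profiler-style metadata features per cell and train a supervised classifier on labelled errors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_cells = 2000

# Error Detection Suite: binary votes from three black-box detectors per cell.
detector_votes = rng.integers(0, 2, size=(n_cells, 3))

# Metadata Profiler Suite: e.g. value-frequency and pattern-conformance features per cell.
value_frequency = rng.random(n_cells)          # how common the cell's value is in its column
pattern_match = rng.integers(0, 2, n_cells)    # does the value match the column's dominant pattern

# Ground-truth error labels for training; simulated from the features here
# just to keep the example self-contained.
is_error = ((detector_votes.sum(axis=1) >= 2) & (pattern_match == 0)).astype(int)

# Aggregation Suite: a classifier over detector votes + metadata features.
features = np.column_stack([detector_votes, value_frequency, pattern_match])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, is_error)

print("predicted error rate:", clf.predict(features).mean())
```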

Keep Exploring!!!
