Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Data Governance


October 04, 2021

Data Curation paper - Reads

Paper #1 - A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Two aspects of data cleaning: what to clean and how to clean

Key Notes

SampleClean: Simulated Clean Data Instances - SampleClean proposes sampling the raw data and cleaning only the sample, so that the sample better represents clean data instances.
Approximate Query Processing (AQP). The AQP consists of two steps. First, in Direct Estimate (DE), a set of k rows is sampled randomly and cleaned, and the training result is returned independently of the dirty data. Second, the correction step reweights the sample based on the contribution of the cleaned data.
ActiveClean: Incremental Data Cleaning in Convex Models. ActiveClean gradually cleans a dirty dataset to learn a convex-loss model, such as Logistic Regression or a Support Vector Machine (SVM).
HoloClean: Holistic Data Repairs With Probabilistic Inference
AlphaClean: Generate-Then-Search Parallel Data Cleaning
CPClean: Reusable Computation in Data Cleaning
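The sample-then-clean idea from the SampleClean / AQP notes above can be sketched roughly as follows. This is a minimal sketch for a mean aggregate only, not SampleClean's actual implementation. The correction step is interpreted here as "subtract the average error measured on the cleaned sample", and the function names, the `clean_fn` callback and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean

def direct_estimate(dirty, clean_fn, k, seed=0):
    """Estimate the clean mean from a cleaned random sample of k rows only."""
    sample = random.Random(seed).sample(dirty, k)
    return mean(clean_fn(v) for v in sample)

def corrected_estimate(dirty, clean_fn, k, seed=0):
    """Take the mean over all dirty rows, then subtract the average
    error (dirty minus clean) measured on the cleaned sample."""
    sample = random.Random(seed).sample(dirty, k)
    avg_error = mean(v - clean_fn(v) for v in sample)
    return mean(dirty) - avg_error

# Hypothetical dirty column: two values were recorded in the wrong unit.
dirty_values = [10, 12, 1100, 11, 9, 1300]

def fix_unit(v):
    # Assumed cleaning rule for this toy data only.
    return v / 100 if v > 100 else v

# With k equal to the full column, both estimates equal the clean mean.
full_de = direct_estimate(dirty_values, fix_unit, k=len(dirty_values))
full_corr = corrected_estimate(dirty_values, fix_unit, k=len(dirty_values))
```

With a smaller k, both functions return approximations whose quality depends on which rows happen to be sampled.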

ML Papers - Learning-with-Label-Noise

Paper #2 - Advancing Data Curation With Metadata and Statistical Relational Learning

Key Notes

We refer to data science as an umbrella term gathering algorithms and techniques from several disciplines, such as statistics, software engineering, and machine learning.
Data can be inconsistent, duplicated, stale, incomplete, and/or inaccurate. Typical data errors include outliers, duplicates, missing values, and inconsistencies.
Mapping Metadata to Data Quality Issues
Error Detection
Joint Error Detection and Repair Suggestion

Data Quality fundamentals

The Consistency dimension refers to the validity and integrity of values and tuples with respect to defined inter- and intra-relational constraints that exist within either single or multiple relations.
The Accuracy dimension identifies correct and true values of the entities represented by the data.
Completeness is the degree to which values are included in a data collection.
The Timeliness dimension reflects the change and update of data by identifying the most current value of an entity in a database.
Violations of the core data quality dimensions (Accuracy, Consistency, Uniqueness, Completeness and Timeliness) lead to data quality issues.

Metadata is "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource"

Single-Column Profiling Tasks

Cardinalities refers to the counts of values
Number of rows: the number of entities which are available in the table;
Distinctness: the number of distinct values of the single attribute;
Uniqueness: the ratio of the number of distinct values to the number of rows
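The three cardinality measures above are simple enough to compute directly. A minimal sketch, assuming the column is available as a plain Python list (the function name and the sample column are made up for illustration):

```python
def cardinalities(values):
    """Return basic single-column cardinality metadata."""
    num_rows = len(values)
    distinct = len(set(values))
    # Uniqueness is the ratio of distinct values to rows.
    uniqueness = distinct / num_rows if num_rows else 0.0
    return {
        "num_rows": num_rows,
        "distinctness": distinct,
        "uniqueness": uniqueness,
    }

city_column = ["Chennai", "Pune", "Chennai", "Delhi"]  # hypothetical column
profile = cardinalities(city_column)
print(profile)  # {'num_rows': 4, 'distinctness': 3, 'uniqueness': 0.75}
```

A uniqueness of 1.0 would suggest a candidate key column.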

Value Distribution refers to the distribution of values on the column. This category includes:

Constancy: the ratio between the most frequent value count and the number of rows;
Extreme values: minimum and maximum values in numeric columns; shortest and longest strings in categorical, alphanumeric or text columns;
Histogram: a summary of the value distribution of an attribute;
Quartiles: three points that divide a numeric distribution into four equal groups;
Inverse distribution: an inverse frequency distribution (a distribution of the frequency distribution).
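The value distribution measures above can be sketched with the standard library alone. This is an illustrative sketch rather than any particular profiler's implementation: the function names and sample columns are assumptions, and the quartiles use `statistics.quantiles` with its default "exclusive" method, which is only one of several common conventions.

```python
from collections import Counter
from statistics import quantiles

def value_distribution(values):
    """Constancy, histogram and inverse distribution for one column."""
    counts = Counter(values)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "constancy": most_common_count / len(values),
        "histogram": dict(counts),
        # How many distinct values occur once, twice, and so on.
        "inverse_distribution": dict(Counter(counts.values())),
    }

def numeric_summary(numbers):
    """Extreme values and quartiles for a numeric column."""
    q1, q2, q3 = quantiles(numbers, n=4)
    return {"min": min(numbers), "max": max(numbers), "quartiles": (q1, q2, q3)}

def string_extremes(strings):
    """Shortest and longest strings for a text column."""
    return {"shortest": min(strings, key=len), "longest": max(strings, key=len)}

grades = ["A", "B", "A", "C", "A", "B"]          # hypothetical categorical column
dist = value_distribution(grades)
summary = numeric_summary([1, 2, 3, 4, 5, 6, 7])  # hypothetical numeric column
extremes = string_extremes(["Pune", "Chennai", "Delhi"])
```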

Patterns

Patterns refers to the syntactic properties of the values in an individual column.
Lengths, which specifies the descriptive statistics of the column value lengths
Decimals, which determines the number of decimals in numeric columns
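A small sketch of the two pattern measures above. It assumes numeric values are available as plain decimal strings (no scientific notation); the helper names and sample values are made up for illustration.

```python
from statistics import mean

def length_stats(values):
    """Descriptive statistics of the string lengths in a column."""
    lengths = [len(str(v)) for v in values]
    return {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}

def decimal_places(number_text):
    """Count the digits after the decimal point in a numeric string."""
    _, _, fraction = number_text.partition(".")
    return len(fraction)

def max_decimals(number_texts):
    """Largest number of decimals used anywhere in the column."""
    return max(decimal_places(t) for t in number_texts)

codes = ["ab", "abcd", "abc"]      # hypothetical text column
prices = ["1.5", "2.25", "3"]      # hypothetical numeric column, stored as text
code_lengths = length_stats(codes)
price_decimals = max_decimals(prices)
```

A column where most values have two decimals and a few have six is often worth a second look.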

Multi-Column Profiling Tasks

Functional dependencies
What. The first dimension captures common data quality issues and typical data cleaning tasks found in the literature.
How. The second dimension reflects differently focused data cleaning approaches.

Rule-Based Approaches

These approaches use data cleaning rules or integrity constraints to detect and repair various error types in the dataset.
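One common integrity constraint is a functional dependency, such as zip -> city. A minimal sketch of the detection half of a rule-based approach, assuming rows are plain dictionaries (the column names and data are hypothetical, and repair is not shown):

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value,
    i.e. the groups that violate the functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

rows = [
    {"zip": "600001", "city": "Chennai"},
    {"zip": "600001", "city": "Chenai"},   # typo breaks zip -> city
    {"zip": "411001", "city": "Pune"},
]
violations = fd_violations(rows, "zip", "city")
print(violations)
```

The rule only says that the group is inconsistent. Deciding which value is the right one is the repair step, which is where the systems in these notes differ.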

Statistical Approaches

The DEC (Detect-Explore-Clean) framework [22] uses statistical and other analytical techniques, such as the Fleiss’ kappa measure, to compute a glitch score that identifies and scores data glitches.
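For reference, Fleiss' kappa itself (an agreement measure between several raters) can be computed as below. This is only the standard kappa formula, not the DEC glitch score; how DEC combines it with other techniques is not covered in these notes. The input layout is an assumption: one row per subject, one column per category, each cell holding the number of raters who chose that category.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put subject i in category j.
    Every subject must be rated by the same number of raters.
    Undefined (division by zero) if all ratings fall in one category."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Observed agreement, averaged over subjects.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_subjects

    # Agreement expected by chance, from the overall category proportions.
    total = n_subjects * n_raters
    p_e = sum(
        (sum(row[j] for row in ratings) / total) ** 2
        for j in range(n_categories)
    )
    return (p_bar - p_e) / (1 - p_e)

# Three raters, two categories, full agreement on both subjects.
perfect = [[3, 0], [0, 3]]
kappa = fleiss_kappa(perfect)
```

A value of 1 means complete agreement, and values near 0 mean agreement no better than chance.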

Probabilistic and Machine Learning-Based Approaches

The BoostClean system [141] addresses domain value violations while cleaning training data for predictive models.
The HoloClean system [202] treats error detection as a black-box component and expects data quality rules, specified in line with integrity constraints, in order to make probabilistic suggestions on how to repair erroneous data values.
Interactive Data Cleaning
Numerous data cleaning systems use crowdsourcing for duplicate detection and resolution

Supervised Error Detection with Metadata

The system consists of three components:

1) an Error Detection Suite, which includes pluggable error detection systems that function as black boxes to the system,

2) a Metadata Profiler Suite, which extracts various metadata categories, and

3) an Aggregation Suite, which combines the output of the error detection suite and the profiler.
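One way to picture the three suites working together is as feature construction: each cell gets the 0/1 outputs of the black-box detectors plus the metadata of its column. The sketch below is an assumption about the general shape, not the paper's implementation; the detectors, the metadata fields and the data are all hypothetical.

```python
def build_features(cells, detectors, column_metadata):
    """For each (column, value) cell, concatenate the detector flags (0/1)
    with the metadata values of the cell's column."""
    features = []
    for column, value in cells:
        flags = [int(detect(value)) for detect in detectors]
        features.append(flags + list(column_metadata[column]))
    return features

# Hypothetical black-box detectors (error detection suite).
def is_missing(value):
    return value in ("", None)

def is_negative_number(value):
    return isinstance(value, (int, float)) and value < 0

cells = [("age", 34), ("age", -5), ("name", "")]
# Hypothetical per-column metadata from a profiler: (uniqueness, constancy).
metadata = {"age": (0.9, 0.05), "name": (1.0, 0.01)}

X = build_features(cells, [is_missing, is_negative_number], metadata)
```

Given labels for a subset of cells, vectors like these could feed a supervised classifier that learns which detectors to trust for which kind of column.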

Keep Exploring!!!

September 25, 2021

Data Curation Paper Reads - Data Quality - Data Cleaning

Paper #1 - Auto-Detect: Data-Driven Error Detection in Tables

Key Notes

Values in a column not conforming to patterns associated with a data-type are flagged as errors.
Formulas inconsistent with other formulas in the region
Text clustering feature that groups together similar values in a column
Single-column approaches detect errors only based on values within an input column.
Multi-column approaches apply when certain multi-column data quality rules (e.g. functional dependencies and other types of first-order logic) are available.
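The first note above (values that do not conform to the column's usual pattern are flagged) can be illustrated with a very simple single-column check. This is a stand-in for illustration and not the Auto-Detect algorithm; the generalization scheme, the `min_share` threshold and the sample column are assumptions.

```python
from collections import Counter

def generalize(value):
    """Map a value to a coarse pattern: digit -> 'D', letter -> 'L',
    every other character kept as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("L")
        else:
            out.append(ch)
    return "".join(out)

def flag_rare_patterns(values, min_share=0.2):
    """Return the values whose pattern covers less than min_share of the column."""
    patterns = [generalize(v) for v in values]
    counts = Counter(patterns)
    return [
        v for v, p in zip(values, patterns)
        if counts[p] / len(values) < min_share
    ]

dates = [
    "2021-10-04", "2021-09-25", "2020-12-18",
    "2020-12-17", "2020-12-17", "17 Dec 2020",   # last one breaks the pattern
]
suspects = flag_rare_patterns(dates)
print(suspects)  # ['17 Dec 2020']
```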

Methods

Fixed-Regex (F-Regex)
dBoost
Compression-based dissimilarity measure (CDM)
Support vector data description (SVDD)
Distance-based outlier detection (DBOD)
Local outlier factor (LOF)
Multi-column error detection using rules
Single-column error detection
Numeric error detection
Outlier detection
Application-driven error correction: recent approaches such as BoostClean and ActiveClean
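Of the methods listed above, distance-based outlier detection is the easiest to sketch. The version below works on a single numeric column and flags a value when too few other values lie within a given radius. The parameter names, the thresholds and the sample readings are assumptions, and the quadratic loop is only suitable for small columns.

```python
def distance_based_outliers(values, radius, min_neighbors):
    """Flag values with fewer than min_neighbors other values within radius."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(
            1
            for j, other in enumerate(values)
            if i != j and abs(v - other) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers

readings = [10.1, 9.8, 10.3, 10.0, 42.0, 9.9]   # hypothetical numeric column
flagged = distance_based_outliers(readings, radius=1.0, min_neighbors=2)
print(flagged)  # [42.0]
```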

Record Linkage

I like this technique for data merging

Similarity between two words
Match between numbers
Match between First Name
Match between Last Name

Similarity distance function
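A rough sketch of such a similarity function, combining fuzzy matches on the names with an exact match on a number. It uses `difflib.SequenceMatcher` from the standard library as the string similarity. The field names, the equal weighting and the sample records are assumptions, not taken from any of the papers.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Case-insensitive similarity between two words, in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(rec_a, rec_b):
    """Average of first-name similarity, last-name similarity
    and an exact match on the phone number."""
    first = text_similarity(rec_a["first_name"], rec_b["first_name"])
    last = text_similarity(rec_a["last_name"], rec_b["last_name"])
    phone = 1.0 if rec_a["phone"] == rec_b["phone"] else 0.0
    return (first + last + phone) / 3

# Hypothetical records from two sources.
a = {"first_name": "Jon", "last_name": "Smith", "phone": "5551234"}
b = {"first_name": "John", "last_name": "Smyth", "phone": "5551234"}
c = {"first_name": "Maria", "last_name": "Lopez", "phone": "5559876"}

likely_match = record_similarity(a, b)
unlikely_match = record_similarity(a, c)
```

Pairs scoring above a chosen threshold would be treated as the same entity and merged. Picking that threshold usually needs a labeled sample.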

Deep learning for ER

BoostClean selects an ensemble of methods (statistical and logic rules) for error detection and for repair combinations using statistical boosting.

ActiveDetect - detects and prioritizes the most important data errors in a dataset.
SampleClean - a sample-and-clean framework for fast and accurate query processing on dirty data
AlphaClean - declaratively synthesizes data cleaning programs
A Data Quality Metric (DQM)
Data Cleaning for Data Science - PrivateClean, ActiveClean, and BoostClean.

December 18, 2020

Review GDPR

Key Points

Paper - Link

Key notes

General Data Protection Regulation (GDPR)
The controller determines the purposes and means of processing personal data
The processor is responsible for processing personal data on behalf of a controller

Personal Data

Information about an identified or identifiable individual
Name, IP address or a cookie
Content of the information, the purpose
Identification number;
Location data; and
Online identifier

Impact on Individual

The content of the data – is it directly about the individual or their activities?
The purpose you will process the data for; and
Results of or effects on the individual from processing the data
Lawfulness, Fairness, Transparency
The GDPR does not dictate how long you should keep personal data.

Consent: the individual has given clear consent for you to process their personal data for a specific purpose.

Are they vulnerable?

Purpose

Scientific or historical research purposes; or
Statistical purposes

Key Points

Avoid making consent to processing a precondition of a service
Explicit consent requires a very clear and specific statement of consent
Are you processing children’s data?
Is any of the data particularly sensitive or private?
Sensitive Data - race; ethnic origin; politics; religion; trade union membership; genetics; biometrics (where used for ID purposes); health; sex life; or sexual orientation.

Rights for individuals:

The right to be informed
The right of access
The right to rectification
The right to erasure
The right to restrict processing
The right to data portability
The right to object
Rights in relation to automated decision making and profiling.

The right to restrict processing lets individuals request the restriction or suppression of their personal data.

The right to data portability allows individuals to obtain and reuse their personal data for their own purposes across different services.

I am not sure how, and to what extent, the data held by FANG (Facebook, Amazon, Netflix, Google) and Microsoft, along with cookies and browser info, is used :(

We hire people to handle the laws and circumvent the clauses. There is always a catch-up game between bending the rules and holding the line on privacy. Debatable from both sides.

Do we have clarity from Reliance, Airtel, Flipkart and other Indian providers on their GDPR-like policies for data usage, retention and user consent? Time to check on this!!!

Good Read - Link2

Good Read - Ethical AI

Keep Thinking!!!

December 17, 2020

Data Governance - Research paper Reads

Data Governance for Platform Ecosystems: Critical Factors and the State of Practice

Key Notes

Data Challenges - data abuse, privacy violation and proper distribution
Data Governance - (availability, usability, security and privacy)

Data Governance Factors

Data ownership and access - who owns and uses the data in platform ecosystems
Data ownership definition criteria
Monitoring - the invisible supply chain is a longstanding challenge
Conformance - audit for compliance based on strict processes and rules
Data use case - how the data is used in platform ecosystems
Data provenance - transparency about the data for all participating groups
Contribution estimation - users' contribution to value creation through the data they provide

AI GOVERNANCE FOR BUSINESSES

Key Notes

AI exhibits forms of intelligent behavior allowing for a large range of cost-efficient, well-performing applications
AI produces results that are partly outside the control of an organization or at least unexpected. It exhibits non-predictable, “ethics”-unaware, data-induced behavior yielding novel security, safety and fairness issues
To mitigate AI challenges and to raise AI potentials in organisations, governance mechanisms play an important role
Testing of ML models, ensuring fairness, explaining “black boxes”, data valuation
Data-driven lens of AI, based on the observation that most existing AI techniques
Prominent regulations include the European GDPR that touches upon data as well as models. Compliance monitoring, Audit

Data

Data is the representation of facts using text, numbers, images, sound or video
Another essential characteristic of data is its primary source: is it personal or non-personal?
GDPR 2018 grants the right to explanation to individuals for automated decisions based on their data
Governance model based on fairness, transparency, trustworthiness, accountability.

Model Explainability

Transparent models are intrinsically human understandable, whereas complex black-box models such as deep learning require external methods that provide explanations that might or might not suffice to understand the model

Data Valuation - the valuation of data gains relevance if acquiring the data comes with costs, e.g. data has to be labeled by humans as part of the construction of a dataset for an AI system, or data requires costly processing, such as manual cleansing to raise data quality

Data quality denotes the ability of data to meet its usage requirements in a given context
An important data quality aspect with respect to fairness of ML systems is bias
Data is biased if it is not representative of the population or phenomenon of study.
Concept drift implies that the data used to train an ML model does not capture the relationship that the model should capture
Robustness defines to what extent an ML model can function correctly in the presence of invalid inputs
Protected characteristics such as gender, religion, familial status, age and race must not be used
ML systems should allow tracking of provenance/lineage, ensure reproducibility, enable audits and compliance checks of models, foster reusability, handle scale and heterogeneity, and allow for flexible metadata usage
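The note above that data is biased when it is not representative of the population can be checked in a basic way by comparing group shares in a dataset with reference shares. A minimal sketch, assuming the reference shares are known (the group labels and numbers are made up for illustration):

```python
from collections import Counter

def representation_gaps(sample_groups, population_shares):
    """Return sample share minus population share for each group.
    Positive means over-represented, negative means under-represented."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {
        group: counts.get(group, 0) / n - share
        for group, share in population_shares.items()
    }

sample = ["urban"] * 8 + ["rural"] * 2       # hypothetical training data
population = {"urban": 0.5, "rural": 0.5}    # hypothetical reference shares
gaps = representation_gaps(sample, population)
```

Here urban records are over-represented by about 30 percentage points. A large gap is a prompt to look closer, not proof that a model trained on the data is unfair.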

Unionized Data Governance in Virtual Power Plants

Collective bargaining: The asset-owners should be able to bargain collectively about the conditions and purposes of the data flows. This includes which supplementary data flows to include and how to utilize them.

Representation: The asset-owners should be represented in a central organizational governing body, which is in charge of defining and overseeing the data principles.

Accountability: Transparency measures should be put in place to ensure the asset-owners' ability to audit the data usage performed by the aggregator, in order to detect misuse and assign accountability.

Social and Governance Implications of Improved Data Efficiency

Data Readiness Report

Alternative Personal data governance models

Design Choices for Data Governance in Platform Ecosystems – A Contingency Model

Data Governance Strategies from Experience

Happy Learning!!!

About Me and Disclaimer

Welcome Visitor,
I have 20 years of experience (Coder - Empirical Learner - Teacher). I am currently working in the Data Analytics (Video-Image-Text-Data) / Database / BI space. I dabble with "Data". Ping me or send a request to connect if what I do appeals to you and you want to talk about it (Data Science / Databases / Deep Learning / Architecture / Design Discussions / Consulting Projects / Machine Learning Trainings / Strategic Leadership Roles).
Personal Goal - Reach / Teach up to 10 Million Students through various mediums (Catalyst between Academics and Industry)
My request to readers: I hope you find the posts, code snippets and notes helpful. Please share your learning with others. We can grow only by learning and teaching.

6+ years in AI, working on Image, Video, Text, Numbers - Data

15+ years in Databases

10+ years in developing, deploying, monitoring large scale solutions in Supply Chain, Retail

It's my personal blog. The objective of this blog is to bookmark/share my learnings. Posts reflect my opinions, perspectives and interests. The blog posts presented are my personal views and do not represent my employer's view. I have acknowledged all posts with References/Bookmarks.

For questions/feedback/career opportunities/training / consulting assignments/mentoring - please drop a note to sivaram2k10(at)gmail(dot)com
Coach / Code / Innovate

A blogpost a day keeps your thinking going.
