"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label Data Governance. Show all posts

October 04, 2021

Data Curation paper - Reads

Paper #1 - A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

  • Two aspects of data cleaning: what to clean and how to clean

Key Notes

  • SampleClean: Simulated Clean Data Instances - SampleClean proposes sampling the raw data so that the sample better represents clean data instances.
  • Approximate Query Processing (AQP). The AQP consists of two steps. First, in Direct Estimate (DE), a set of k rows is sampled randomly and cleaned, and the training result is returned independently of the dirty data. Second, a correction step reweights the sample based on the contribution of the cleaned data
  • ActiveClean: Incremental Data Cleaning in Convex Models. ActiveClean gradually cleans a dirty dataset to learn a convex-loss model, such as Logistic Regression and Support Vector Machine (SVM).
  • HoloClean: Holistic Data Repairs With Probabilistic Inference
  • AlphaClean: Generate-Then-Search Parallel Data Cleaning
  • CPClean: Reusable Computation in Data Cleaning

ML Papers - Learning-with-Label-Noise

Paper #2 - Advancing Data Curation With Metadata and Statistical Relational Learning

Key Notes

  • We refer to data science as an umbrella term gathering algorithms and techniques from several disciplines, such as statistics, software engineering, and machine learning
  • Data is often inconsistent, duplicated, stale, incomplete, and/or inaccurate. Typical data errors include outliers, duplicates, missing values, and inconsistencies.
  • Mapping Metadata to Data Quality Issues
  • Error Detection
  • Joint Error Detection and Repair Suggestion


Data Quality fundamentals

  • The Consistency dimension refers to the validity and integrity of values and tuples with respect to defined inter- and intra-relational constraints that exist within either single or multiple relations
  • The Accuracy dimension identifies the correct and true values of the entities represented by the data.
  • Completeness is the degree to which values are included in a data collection
  • The Timeliness dimension reflects the change and update of data by identifying the most current value of an entity in a database
  • Violations of the core data quality dimensions (Accuracy, Consistency, Uniqueness, Completeness and Timeliness) lead to data quality issues

  • Metadata is "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource"


Single-Column Profiling Tasks

  • Cardinalities refers to the counts of values
  • Number of rows: the number of entities which are available in the table;
  • Distinctness: the number of distinct values of the single attribute;
  • Uniqueness: the ratio of the number of distinct values to the number of rows
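
The three cardinality measures above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `cardinality_profile` and the sample `city` column are my own assumptions, not something taken from the paper.

```python
def cardinality_profile(values):
    """Compute row count, distinctness and uniqueness for a single column."""
    num_rows = len(values)
    distinct = len(set(values))
    # Uniqueness is the ratio of distinct values to rows (0.0 for an empty column).
    uniqueness = distinct / num_rows if num_rows else 0.0
    return {"num_rows": num_rows, "distinctness": distinct, "uniqueness": uniqueness}


# Hypothetical sample column, for illustration only.
city = ["Chennai", "Bangalore", "Chennai", "Mumbai"]
profile = cardinality_profile(city)
print(profile)
```

A uniqueness of 1.0 would suggest a candidate key, while a value close to 0 suggests a low-cardinality categorical column.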

Value Distribution refers to the distribution of values on the column. This category includes:

  • Constancy: the ratio between the most frequent value count and the number of rows;
  • Extreme values: minimum and maximum values in numeric columns; shortest and longest strings in categorical, alphanumeric or text columns;
  • Histogram: values distribution summary on an attribute
  • Quartiles: three points that divide numeric distribution into four equal groups;
  • Inverse distribution: an inverse frequency distribution (a distribution of the frequency distribution);
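
A rough sketch of the value distribution measures listed above, using only the Python standard library. The function names and sample data are assumptions for illustration, and `statistics.quantiles` uses its default "exclusive" method here, which may differ from the quartile definition a given profiling tool uses.

```python
from collections import Counter
import statistics


def value_distribution(values):
    """Constancy, histogram and inverse distribution for one column."""
    counts = Counter(values)
    most_frequent_count = counts.most_common(1)[0][1]
    return {
        # Share of rows taken by the most frequent value.
        "constancy": most_frequent_count / len(values),
        # Value -> number of occurrences.
        "histogram": dict(counts),
        # Frequency -> number of values with that frequency.
        "inverse_distribution": dict(Counter(counts.values())),
    }


def numeric_summary(numbers):
    """Extreme values and the three quartile cut points of a numeric column."""
    return {
        "min": min(numbers),
        "max": max(numbers),
        "quartiles": statistics.quantiles(numbers, n=4),
    }


# Hypothetical sample columns.
grades = ["a", "a", "b", "c"]
scores = [1, 2, 3, 4, 5, 6, 7]
dist = value_distribution(grades)
summary = numeric_summary(scores)
```

For text columns, the shortest and longest strings can be found the same way with `min(values, key=len)` and `max(values, key=len)`.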

Patterns

  • Patterns refers to the syntactic properties of the values in an individual column.
  • Lengths, which specifies the descriptive statistics of the column value lengths
  • Decimals, which determines the number of decimals in numeric columns
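
The length and decimal measures can be sketched as below. This is a minimal illustration under my own assumptions: values arrive as strings, and the helper names `length_stats` and `max_decimals` are hypothetical.

```python
import statistics
from decimal import Decimal


def length_stats(values):
    """Descriptive statistics of the string lengths in a column."""
    lengths = [len(str(v)) for v in values]
    return {"min": min(lengths), "max": max(lengths), "mean": statistics.mean(lengths)}


def max_decimals(numeric_strings):
    """Largest number of decimal places among numeric values given as strings."""
    # Decimal keeps the digits as written, so "3.10" counts as two decimals.
    return max(max(0, -Decimal(s).as_tuple().exponent) for s in numeric_strings)


# Hypothetical sample columns.
codes = ["abc", "abcde", "abcd"]
prices = ["3.10", "5", "12.345"]
code_lengths = length_stats(codes)
price_decimals = max_decimals(prices)
```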

Multi-Column Profiling Tasks

  • Functional dependencies
  • What. The first dimension captures common data quality issues and typical data cleaning tasks, which had been found in the literature.
  • How. The second dimension reflects differently focused data cleaning approaches.
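
To make the functional dependencies item above concrete: a dependency such as zip -> city holds if every zip value maps to exactly one city. A minimal sketch of a violation check follows, where the column names, sample rows and the `fd_violations` helper are my own assumptions.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return each lhs value that maps to more than one rhs value (violating lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


# Hypothetical rows: the second row carries a typo in the city name.
rows = [
    {"zip": "600001", "city": "Chennai"},
    {"zip": "600001", "city": "Chenai"},
    {"zip": "560001", "city": "Bangalore"},
]
violations = fd_violations(rows, "zip", "city")
print(violations)
```

The check only reports that the group is inconsistent. Deciding which value is the correct one is the repair step, which is a separate problem.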

Rule-Based Approaches

  • Data cleaning rules or integrity constraints to detect and repair various error types in the dataset.

Statistical Approaches

  • DEC (Detect-Explore-Clean) framework [22] uses statistical and other analytical techniques, such as the Fleiss’ kappa measure, to compute the glitch score, which identifies and scores the data glitches

Probabilistic and Machine Learning-Based Approaches

  • The BoostClean system [141] addresses domain value violations while cleaning training data for predictive models
  • The HoloClean system [202] treats error detection as a black-box component and expects data quality rules, specified as integrity constraints, in order to make probabilistic suggestions on how to repair erroneous data values.
  • Interactive Data Cleaning
  • Numerous data cleaning systems use crowdsourcing for duplicate detection and resolution






Supervised Error Detection with Metadata


1) an Error Detection Suite, which includes pluggable error detection systems that function as black boxes to our system.

2) a Metadata Profiler Suite, which extracts various metadata categories, and 

3) an Aggregation Suite, which combines the output of the error detection suite and the profiler.
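
One way to picture how the three components fit together is the toy sketch below. Everything in it is an assumption made for illustration: the two detectors, the metadata fields and the feature layout are hypothetical, and the paper's actual system is certainly richer. The only idea taken from the notes is that detectors are pluggable black boxes whose output is combined with profiled metadata.

```python
def missing_value_detector(values):
    """Black-box detector: flag indices of empty or missing cells."""
    return {i for i, v in enumerate(values) if v in ("", None)}


def non_numeric_detector(values):
    """Black-box detector: flag indices of cells that do not parse as numbers."""
    flagged = set()
    for i, v in enumerate(values):
        try:
            float(v)
        except (TypeError, ValueError):
            flagged.add(i)
    return flagged


def profile_column(values):
    """Metadata profiler: a couple of single-column statistics."""
    non_missing = [v for v in values if v not in ("", None)]
    return {"num_rows": len(values), "distinct": len(set(non_missing))}


def aggregate(values, detectors):
    """Aggregation: one feature row per cell = detector votes + column metadata."""
    metadata = profile_column(values)
    flags = [detector(values) for detector in detectors]
    features = []
    for i in range(len(values)):
        votes = [int(i in flagged) for flagged in flags]
        features.append(votes + [metadata["num_rows"], metadata["distinct"]])
    return features


# Hypothetical column with one missing and one malformed value.
ages = ["34", "", "abc", "51"]
features = aggregate(ages, [missing_value_detector, non_numeric_detector])
```

In a supervised setting, feature rows like these, together with a small set of labeled cells, could be fed to a classifier that learns which combination of detector votes and metadata indicates a real error.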

Keep Exploring!!!

September 25, 2021

Data Curation Paper Reads - Data Quality - Data Cleaning

Paper #1 - Auto-Detect: Data-Driven Error Detection in Tables

Key Notes

  • Values in a column not conforming to patterns associated with a data-type are flagged as errors.
  • Formulas inconsistent with other formulas in the region 
  • Text clustering feature that groups together similar values in a column
  • Single-column approaches detect errors only based on values within an input column.
  • Multi-column approaches rely on data quality rules across columns (e.g. functional dependencies and other types of first-order logic)
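
The first note above, values that do not conform to the dominant pattern of a column, can be illustrated with a very simple single-column check. This is not the Auto-Detect algorithm itself, only a sketch under my own assumptions: the generalization (digits to `D`, letters to `L`) and the `min_share` threshold are hypothetical choices.

```python
from collections import Counter


def to_pattern(value):
    """Generalize a string: digits become 'D', letters become 'L', the rest is kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("L")
        else:
            out.append(ch)
    return "".join(out)


def flag_rare_patterns(values, min_share=0.5):
    """Flag values whose pattern covers less than min_share of the column."""
    patterns = [to_pattern(v) for v in values]
    counts = Counter(patterns)
    return [v for v, p in zip(values, patterns) if counts[p] / len(values) < min_share]


# Hypothetical date column with one value in a different format.
dates = ["2021-10-04", "2021-09-25", "2020-12-18", "Dec 17, 2020"]
suspects = flag_rare_patterns(dates)
print(suspects)
```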

Methods

  • Fixed-Regex (F-Regex)
  • dBoost
  • Compression-based dissimilarity measure (CDM)
  • Support vector data description (SVDD)
  • Distance-based outlier detection (DBOD)
  • Local outlier factor (LOF)
  • Multi-column error detection using rules
  • Single-column error detection
  • Numeric error detection
  • Outlier detection
  • Application-driven error correction: recent approaches such as BoostClean and ActiveClean
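
As a rough illustration of the distance-based outlier detection (DBOD) idea from the list above: a point is treated as an outlier when too few other points lie within a given distance of it. The sketch below handles one-dimensional numeric data only, and the `radius` and `min_neighbors` parameters and the sample readings are assumptions for illustration.

```python
def distance_based_outliers(points, radius, min_neighbors):
    """Return points with fewer than min_neighbors other points within radius."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points) if i != j and abs(p - q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


# Hypothetical sensor readings with one obvious glitch.
readings = [10.1, 10.3, 9.9, 10.0, 42.0]
outliers = distance_based_outliers(readings, radius=1.0, min_neighbors=2)
print(outliers)
```

The pairwise loop is quadratic in the number of points, which is fine for a sketch but would need an index structure for large columns.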

Record Linkage


I like this technique for data merging

  • Similarity between two words 
  • Match between numbers
  • Match between First Name
  • Match between Last Name

Similarity distance function
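
A minimal sketch of such a similarity function, using `difflib.SequenceMatcher` from the Python standard library. The field names, the equal weighting of the three signals and the sample records are my own assumptions; real record linkage tools typically use tuned weights and specialised string metrics.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(r1, r2):
    """Average of first-name similarity, last-name similarity and exact number match."""
    first = similarity(r1["first"], r2["first"])
    last = similarity(r1["last"], r2["last"])
    phone = 1.0 if r1["phone"] == r2["phone"] else 0.0
    return (first + last + phone) / 3


# Hypothetical records: a and b are likely the same person, c is not.
a = {"first": "Jon", "last": "Smith", "phone": "555"}
b = {"first": "John", "last": "Smith", "phone": "555"}
c = {"first": "Mary", "last": "Jones", "phone": "999"}

likely_match = record_score(a, b)
unlikely_match = record_score(a, c)
```

Pairs whose score exceeds a chosen threshold would be treated as candidate matches and merged or sent for review.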

Deep learning for ER



BoostClean selects an ensemble of methods (statistical and logic rules) for error detection and for repair combinations using statistical boosting.

More Reads

Keep Exploring!!!

December 18, 2020

Review GDPR

Key Points

Paper - Link

Key notes

  • General Data Protection Regulation (GDPR)
  • Controller determines the purposes
  • Processor is responsible for processing personal data on behalf of a controller

Personal Data

  • Information about an identified or identifiable individual
  • Name, IP address or a cookie
  • Content of the information, the purpose
  • Identification number;
  • Location data; and
  • Online identifier

Impact on Individual

  • The content of the data – is it directly about the individual or their activities?;
  • The purpose you will process the data for; and
  • Results of or effects on the individual from processing the data
  • Lawfulness, Fairness, Transparency
  • The GDPR does not dictate how long you should keep personal data. 

Consent: the individual has given clear consent for you to process their personal data for a specific purpose.

  • Are they vulnerable?

Purpose

  • Scientific or historical research purposes; or
  • Statistical purposes

Key Points

  • Avoid making consent to processing a precondition of a service
  • Explicit consent requires a very clear and specific statement of consent
  • Are you processing children’s data?
  • Is any of the data particularly sensitive or private?
  • Sensitive Data - race; ethnic origin; politics; religion; trade union membership; genetics; biometrics (where used for ID purposes); health; sex life; or sexual orientation.

Rights for individuals:

  • The right to be informed
  • The right of access
  • The right to rectification
  • The right to erasure
  • The right to restrict processing
  • The right to data portability
  • The right to object
  • Rights in relation to automated decision making and profiling.

Individuals have the right to request the restriction or suppression of their personal data.

The right to data portability allows individuals to obtain and reuse their personal data for their own purposes across different services.

I am not sure how much, and to what extent, the data held by FANG (Facebook, Amazon, Netflix, Google) and Microsoft, along with cookies and browser info, is actually used :(

We hire people to handle laws and circumvent the clauses. There is always a catch-up game between bending the rules and the line of privacy. Debatable from both sides.

Do we have clarity from Reliance, Airtel, Flipkart and other Indian providers on their GDPR-like policies for data usage, retention and user consent? Time to check on this!!!

Good Read - Link2

Good Read - Ethical AI

Keep Thinking!!!

December 17, 2020

Data Governance - Research paper Reads

Data Governance for Platform Ecosystems: Critical Factors and the State of Practice

Key Notes

  • Data Challenges - data abuse, privacy violation and proper distribution 
  • Data Governance - (availability, usability, security and privacy)

Data Governance Factors

  • Data ownership and access - who owns and uses the data in platform ecosystems.
  • Data ownership definition criteria
  • Monitoring - Invisible supply chain is a longstanding challenge
  • Conformance - Audit for compliance based on strict processes and rules
  • Data Use Case - Use the data in platform ecosystems
  • Data provenance - Data transparency for all participating groups
  • Contribution Estimation - User contribution against value creation by providing data




AI GOVERNANCE FOR BUSINESSES

Key Notes

  • AI exhibits forms of intelligent behavior allowing for a large range of cost-efficient, well-performing applications
  • AI produces results that are partly outside the control of an organization or at least unexpected. It exhibits non-predictable, “ethics”-unaware, data-induced behavior yielding novel security, safety and fairness issues
  • To mitigate AI challenges and to raise AI potentials in organisations, governance mechanisms play an important role
  • Testing of ML models, ensuring fairness, explaining “black boxes”, data valuation
  • Data-driven lens of AI, based on the observation that most existing AI techniques
  • Prominent regulations include the European GDPR that touches upon data as well as models. Compliance monitoring, Audit

Data

  • Data is the representation of facts using text, numbers, images, sound or video
  • An essential characteristic of data is also its primary source: is it personal or non-personal?
  • GDPR 2018 grants the right to explanation to individuals for automated decisions based on their data
  • Governance model based on fairness, transparency, trustworthiness, accountability.

Model Explainability

Transparent models are intrinsically human understandable, whereas complex black-box models such as deep learning require external methods that provide explanations that might or might not suffice to understand the model

Data Valuation - the valuation of data gains relevance if acquiring the data comes with costs, e.g. when data has to be labeled by humans as part of constructing a dataset for an AI system, or when data requires costly processing such as manual cleansing to raise data quality

  • Data quality denotes the ability of data to meet its usage requirements in a given context
  • An important data quality aspect with respect to fairness of ML systems is bias
  • Data is biased if it is not representative of the population or phenomenon of study.
  • Concept drift implies that the data used to train an ML model does not capture the relationship that the model should capture
  • Robustness defines to what extent an ML model can function correctly in the presence of invalid inputs
  • Protected characteristics such as gender, religion, familial status, age and race must not be used
  • ML systems should allow tracking of provenance/lineage, ensure reproducibility, enable audits and compliance checks of models, foster reusability, handle scale and heterogeneity, and allow for flexible metadata usage



Unionized Data Governance in Virtual Power Plants

Collective bargaining - The asset-owners should be able to bargain collectively about the conditions and purposes of the data flows. This includes which supplementary data flows to include and how to utilize them

Representation - The asset-owners should be represented in a central organizational governing body, which is in charge of defining and overseeing the data principles.

Accountability - Transparency measures should be put in place to ensure the asset-owners' ability to audit the data usage performed by the aggregator, in order to detect misuse and assign accountability

Social and Governance Implications of Improved Data Efficiency

Data Readiness Report



Alternative Personal data governance models

Design Choices for Data Governance in Platform Ecosystems – A Contingency Model

Data Governance Strategies from Experience







Happy Learning!!!