"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination"

September 15, 2021

Data Curation Reads

Data curation is the process of discovering, integrating, and cleaning data. It needs to be guided by data governance.

Paper - Data Curation with Deep Learning [Vision]

  • (Towards Automatic Curation)
  • Diverse data curation (DC) tasks (such as deduplication, error detection, data repair)
  • DC problems, such as data discovery [17] and entity resolution
  • Series of techniques (e.g., unsupervised representation learning, data augmentation, synthetic data generation, weak supervision, domain adaptation, and crowdsourcing)

An Approach Adapted from Word Embeddings

  • Map words to dense, high-dimensional vectors such that semantically related words are close to each other
  • A big difference between databases and documents is that databases have many data dependencies (or integrity constraints) within tables
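
To make the idea concrete, here is a minimal sketch of my own, not the paper's method. It assumes each tuple is treated like a "sentence" of cell tokens and uses plain co-occurrence counts as a stand-in for learned embeddings. The toy table, the `col=value` token format, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy relation; column names and values are made up for illustration.
rows = [
    {"name": "Alice", "city": "Paris", "country": "France"},
    {"name": "Bob", "city": "Paris", "country": "France"},
    {"name": "Carol", "city": "Lyon", "country": "France"},
    {"name": "Dave", "city": "Berlin", "country": "Germany"},
]


def cooccurrence_vectors(rows):
    """Map each cell token to a sparse vector of co-occurrence counts within tuples."""
    vectors = defaultdict(Counter)
    for row in rows:
        # Prefix the value with its column so equal strings in different columns stay distinct.
        tokens = [f"{col}={val}" for col, val in row.items()]
        for t in tokens:
            for u in tokens:
                if t != u:
                    vectors[t][u] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = cooccurrence_vectors(rows)
sim_paris_lyon = cosine(vectors["city=Paris"], vectors["city=Lyon"])
sim_paris_berlin = cosine(vectors["city=Paris"], vectors["city=Berlin"])
print(sim_paris_lyon, sim_paris_berlin)
```

Paris and Lyon share the context `country=France`, so they end up closer to each other than Paris and Berlin. A real setup would probably replace the count vectors with a trained embedding model and would also need to encode the integrity constraints mentioned above, which this sketch ignores.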

Combining Word and Graph Embeddings

  • Treat each relation as a heterogeneous network
  • Learn distributed representations for the cells over the entire data ocean, not only on one relation
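
A rough sketch of how the two points above could fit together, under my own assumptions: nodes are typed cell values `(column, value)`, edges link values that co-occur in the same tuple, and random walks over the graph produce sequences that an embedding model could be trained on (in the spirit of DeepWalk). The relations, names, and walk length are hypothetical, and the paper may define the network differently.

```python
import random
from collections import defaultdict

# Hypothetical toy "data ocean" with two relations that share the dept column.
relations = {
    "employees": [
        {"name": "Alice", "dept": "Sales"},
        {"name": "Bob", "dept": "Sales"},
    ],
    "departments": [
        {"dept": "Sales", "city": "Paris"},
    ],
}


def build_graph(relations):
    """Link every pair of cell values that appear together in one tuple."""
    graph = defaultdict(set)
    for rows in relations.values():
        for row in rows:
            nodes = list(row.items())  # each node is a (column, value) pair
            for a in nodes:
                for b in nodes:
                    if a != b:
                        graph[a].add(b)
    return graph


def random_walk(graph, start, length, rng):
    """Uniform random walk of up to `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = sorted(graph[walk[-1]])  # sorted for reproducibility
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


graph = build_graph(relations)
rng = random.Random(0)
walks = [random_walk(graph, node, 5, rng) for node in sorted(graph)]
for walk in walks:
    print(walk)
```

Because `("dept", "Sales")` appears in both relations, walks can cross from `employees` into `departments`, which is one simple way to learn representations over more than one relation.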

Experiment for different levels

  • Column Embeddings (Column2Vec)
  • Table Embeddings (Table2Vec) 
  • Database Embeddings (Database2Vec)
  • Contextual Embeddings for DC

Entity Matching

  • Entity matching is a key problem in data integration

DeepER applies deep learning (DL) techniques to entity resolution (ER).

DeepMatcher [35] proposes a template-based architecture for entity matching.
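
For intuition only, here is a small baseline sketch of the general shape of entity matching: turn each record into a vector, compare the vectors, and decide match or no match. This is not the DeepER or DeepMatcher architecture. The bag-of-tokens vector stands in for a learned tuple embedding, and the records, function names, and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from math import sqrt


def tuple_vector(record):
    """Bag-of-tokens vector over all attribute values (stand-in for a learned embedding)."""
    tokens = " ".join(str(v) for v in record.values()).lower().split()
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_match(r1, r2, threshold=0.5):
    """Declare a match when the record vectors are similar enough (threshold is arbitrary)."""
    return cosine(tuple_vector(r1), tuple_vector(r2)) >= threshold


# Hypothetical product records from two sources.
a = {"title": "Apple iPhone 12 64GB", "brand": "Apple"}
b = {"title": "iPhone 12 64GB by Apple", "brand": "Apple"}
c = {"title": "Samsung Galaxy S21", "brand": "Samsung"}

print(is_match(a, b))
print(is_match(a, c))
```

A learned approach would presumably replace both the vectorisation and the fixed threshold with trained components, which is where the DL techniques noted above come in.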


Paper - A Survey on Data Collection for Machine Learning

Key Notes

More Reads

Keep Reading!!!
