Data curation – the process of discovering, integrating, and cleaning data. Data curation needs to be guided by data governance.
Paper - Data Curation with Deep Learning [Vision]
- (Towards Automatic Curation)
- Diverse DC tasks (such as deduplication, error detection, data repair)
- DC problems, such as data discovery [17] and entity resolution
- Series of techniques (e.g., unsupervised representation learning, data augmentation, synthetic data generation, weak supervision, domain adaptation, and crowdsourcing)
An Approach Adapted from Word Embeddings
- Map words to dense, high-dimensional vectors such that semantically related words are close to each other
- A big difference between databases and documents is that databases have many data dependencies (or integrity constraints) within tables
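To make the "related values end up close together" idea concrete, here is a minimal sketch. It assumes each tuple is treated like a sentence, and it uses raw co-occurrence counts as a stand-in for a trained embedding model such as word2vec. The toy rows and function names are made up for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy relation: each tuple plays the role of one "sentence".
rows = [
    ("Alice", "Paris", "France"),
    ("Bob", "Paris", "France"),
    ("Carol", "Lyon", "France"),
    ("Dave", "Berlin", "Germany"),
    ("Erin", "Munich", "Germany"),
]

def cooccurrence_vectors(rows):
    """Map each cell value to counts of the values it shares a tuple with."""
    vectors = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for a, b in combinations(row, 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(rows)
# Cities that share a country share context, so they score higher.
sim_same_country = cosine(vectors["Paris"], vectors["Lyon"])
sim_diff_country = cosine(vectors["Paris"], vectors["Berlin"])
print(sim_same_country > sim_diff_country)
```

Note that this only captures co-occurrence inside a tuple. It knows nothing about integrity constraints, which is the gap the next idea tries to close.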
Combining Word and Graph Embeddings
- Treat each relation as a heterogeneous network
- Learn distributed representations for the cells over the entire data ocean, not only on one relation
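A rough sketch of the "relation as a heterogeneous network" idea follows. It assumes nodes are (column, value) pairs and edges link values that appear in the same tuple. The random walks over the graph could then be fed to a skip-gram style model, which is an assumption here and not something these notes confirm. The column names, rows, and helper functions are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy relation.
columns = ("name", "city", "country")
rows = [
    ("Alice", "Paris", "France"),
    ("Bob", "Paris", "France"),
    ("Dave", "Berlin", "Germany"),
]

def build_graph(columns, rows):
    """Nodes are (column, value) pairs; edges link cells of the same tuple."""
    graph = defaultdict(set)
    for row in rows:
        nodes = list(zip(columns, row))
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                graph[u].add(v)
                graph[v].add(u)
    return graph

def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes, picking a random neighbor at each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = sorted(graph[walk[-1]])  # sorted for reproducibility
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

graph = build_graph(columns, rows)
walk = random_walk(graph, ("city", "Paris"), 5, random.Random(0))
print(walk)
```

Typing each node by its column keeps a value like "Paris" in a city column distinct from the same string in another column. Extra edges for known dependencies could be added to the same structure.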
Experiment with embeddings at different levels
- Column Embeddings (Column2Vec)
- Table Embeddings (Table2Vec)
- Database Embeddings (Database2Vec)
- Contextual Embeddings for DC
Entity Matching
- Entity matching is a key problem in data integration
- DeepER applies DL techniques to ER
- DeepMatcher [35] proposes a template-based architecture for entity matching
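For intuition about what entity matching has to decide, here is a very simple baseline sketch. It is not DeepER or DeepMatcher: it compares records attribute by attribute with token-level Jaccard similarity and applies a threshold. The records, attribute names, and threshold value are all hypothetical.

```python
def tokens(value):
    """Lowercase a string and split it into a set of whitespace tokens."""
    return set(value.lower().split())

def jaccard(a, b):
    """Token-level Jaccard similarity between two attribute values."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def match_score(r1, r2, attributes):
    """Average the per-attribute similarities into one score."""
    return sum(jaccard(r1[a], r2[a]) for a in attributes) / len(attributes)

THRESHOLD = 0.8  # hypothetical cut-off; would be tuned on labeled pairs

def is_match(r1, r2, attributes):
    return match_score(r1, r2, attributes) >= THRESHOLD

# Hypothetical product records.
a = {"name": "Apple iPhone 12 64GB", "brand": "Apple"}
b = {"name": "iPhone 12 64GB Apple", "brand": "apple"}
c = {"name": "Samsung Galaxy S21", "brand": "Samsung"}
attrs = ["name", "brand"]

score_ab = match_score(a, b, attrs)
score_ac = match_score(a, c, attrs)
print(is_match(a, b, attrs), is_match(a, c, attrs))
```

A learned approach would replace the hand-written `jaccard` and the fixed threshold with learned representations and a trained classifier.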
Paper - A Survey on Data Collection for Machine Learning
Key Notes
More Reads
- Data provenance, curation and quality in metrology
- CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
- Data Acquisition for Improving Machine Learning Models
- “You might also like this model”: Data Driven Approach for Recommending Deep Learning Models for Unknown Image Datasets
- Data Curation with Deep Learning
- Data Loss Prevention Samples
Keep Reading!!!