"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 21, 2018

Day #133 - Data sets, Data Challenges in Machine Learning

Google's blogs and research papers are a rich source of guidance on data analysis work. Rather than jumping straight into executing pieces of code, it is very interesting to understand the perspective and practices behind data collection and maintenance. Listed below is a summary of my readings from Google papers and blogs.

Practical advice for analysis of large, complex data sets

Technical - Ideas to Analyse Data
  • Look at distributions within data
  • Look at examples to validate your understanding
  • Consider outliers
  • Check for consistency over time (does the result stay valid across different periods?)
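
The technical ideas above can be tried out with a few lines of standard-library Python. This is only a minimal sketch under my own assumptions: the function names, the 1.5 x IQR outlier rule and the 20% tolerance on the per-period median are illustrative choices, not something prescribed by the Google post.

```python
import statistics


def summarize(values):
    """Return min, quartiles and max so the shape is visible, not just the mean."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def iqr_outliers(values, k=1.5):
    """Flag values further than k * IQR outside the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def drifting_periods(values_by_period, tolerance=0.2):
    """Flag periods whose median is more than `tolerance` (relative) away
    from the median of all periods combined."""
    all_values = [v for vs in values_by_period.values() for v in vs]
    overall = statistics.median(all_values)
    return [
        period
        for period, vs in values_by_period.items()
        if abs(statistics.median(vs) - overall) > tolerance * abs(overall)
    ]


# Made-up sample data, purely for illustration.
latencies = [10, 11, 12, 11, 10, 12, 11, 95]
by_month = {
    "2018-07": [10, 11, 12],
    "2018-08": [11, 10, 12],
    "2018-09": [20, 21, 22],
}
print(summarize(latencies))
print(iqr_outliers(latencies))
print(drifting_periods(by_month))
```

Flagged outliers and drifting periods are candidates to look at, not values to delete automatically; the point of the advice is to go and inspect the actual examples behind them.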
Process - Recommendations for Data Collection
  • Data collection setup
  • Reproducible
  • Exploratory Data Analysis
Social - Communicating your insights
  • Data Analysis starts with questions not with code or data
  • Accept ignorance and mistakes
  • Be skeptical
  • Educate Consumers
Crawling the internet: data science within a large engineering system
  • Identify and compute the refresh rate pattern (how often the data changes) and refresh the data accordingly
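
To make the refresh rate idea concrete, here is a naive sketch, assuming we keep a history of (time in days, content hash) observations per page. The function names and the clamping limits are hypothetical, and this is not the estimator described in the Google post. Note that simply counting observed changes undercounts when a page changes more than once between two crawls.

```python
def estimate_change_rate(observations):
    """Estimate changes per day from (time_in_days, content_hash) pairs
    sorted by time. Returns None when there is not enough history."""
    if len(observations) < 2:
        return None
    elapsed = observations[-1][0] - observations[0][0]
    if elapsed <= 0:
        return None
    # Count how many consecutive crawls saw different content.
    changes = sum(
        1 for (_, a), (_, b) in zip(observations, observations[1:]) if a != b
    )
    return changes / elapsed


def refresh_interval_days(rate, min_days=0.25, max_days=30.0):
    """Turn a change rate into a crawl interval, clamped to assumed limits."""
    if rate is None or rate == 0:
        return max_days  # no evidence of change: fall back to the slowest schedule
    return max(min_days, min(max_days, 1.0 / rate))


# Made-up crawl history, purely for illustration.
history = [(0, "a"), (1, "a"), (2, "b"), (3, "b"), (4, "c")]
rate = estimate_change_rate(history)
print(rate, refresh_interval_days(rate))
```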
Machine Learning: The High-Interest Credit Card of Technical Debt
A very interesting article on data-related risks and challenges:
  • Unstable Data Dependencies
  • Underutilized Data Dependencies
  • Legacy Features
  • Correction Cascades
  • When Correlations No Longer Correlate
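
One simple way to watch for the last risk is to compare how strongly two signals are correlated in an older window of data versus a recent one. The sketch below is my own illustration, not a method from the paper: the function names and the 0.3 threshold are assumptions, and Pearson correlation only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlation_shift(old_pairs, new_pairs, max_shift=0.3):
    """Compare correlation in an old window against a new window.
    Returns (shifted, old_r, new_r); `max_shift` is an assumed threshold."""
    old_r = pearson(*zip(*old_pairs))
    new_r = pearson(*zip(*new_pairs))
    return abs(old_r - new_r) > max_shift, old_r, new_r


# Made-up (feature_a, feature_b) pairs, purely for illustration.
training_window = [(1, 2), (2, 4), (3, 6), (4, 8)]
recent_window = [(1, 8), (2, 6), (3, 4), (4, 2)]
print(correlation_shift(training_window, recent_window))
```

A check like this could run on a schedule against the features a model depends on, so that a change in the outside world shows up as an alert rather than as a silent drop in prediction quality.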
Happy Learning!!!
