"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 03, 2020

Best Practices for Building and Deploying Data Pipelines in Apache Spark - Vicky Avison

Key Summary
  • Collect data from source tables
  • Perform joins and aggregations to generate feature-engineered data
  • OLTP, BI, and data science knowledge all matter when building a model


Checklist for building a pipeline
  • Handling late-arriving data
  • Handling pipeline failures
  • Handling data quality issues
  • App-level configuration
  • Maximizing performance (indexes / queries)
  • Extracting only the required data (ETL scripts)
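
A few of the checklist items above can be sketched in plain Python, independent of Spark. This is only an illustration under assumptions, not something taken from the talk: `PipelineConfig`, `partitions_to_process`, `run_with_retries`, and `split_by_quality` are hypothetical names, and the "reprocess a lookback window of daily partitions" approach is just one common way to deal with late-arriving data.

```python
from dataclasses import dataclass
from datetime import date, timedelta
import time


@dataclass(frozen=True)
class PipelineConfig:
    """App-level settings kept in one place (all names are illustrative)."""
    lookback_days: int = 3           # how far back to re-read for late-arriving data
    max_retries: int = 2             # retries allowed after the first failed attempt
    retry_wait_seconds: float = 0.0  # pause between attempts


def partitions_to_process(run_date, config):
    """Late-arriving data: reprocess a window of recent daily partitions."""
    return [run_date - timedelta(days=d) for d in range(config.lookback_days, -1, -1)]


def run_with_retries(step, config):
    """Pipeline failures: retry a step a bounded number of times, then re-raise."""
    for attempt in range(config.max_retries + 1):
        try:
            return step()
        except Exception:
            if attempt == config.max_retries:
                raise
            time.sleep(config.retry_wait_seconds)


def split_by_quality(rows, required_fields):
    """Data quality: set aside rows missing required fields instead of dropping them silently."""
    good, bad = [], []
    for row in rows:
        is_valid = all(row.get(field) is not None for field in required_fields)
        (good if is_valid else bad).append(row)
    return good, bad


# Example usage with made-up values.
config = PipelineConfig(lookback_days=2)
print(partitions_to_process(date(2020, 2, 3), config))
print(split_by_quality([{"id": 1}, {"id": None}], ["id"]))
```

Keeping the rejected rows (rather than discarding them) makes it possible to report on data quality later; whether that fits depends on the pipeline.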
Business Logic
  • Clean up data
  • Perform Aggregations / Joins
  • Generate feature-engineered data







Spark Snippets






Happy Learning!!!
