"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 04, 2020

Physical Plans in Spark SQL (continued) - David Vrba (Socialbakers) - Part II

Optimization Examples

Example I - Posts (Messages from FB)
  • 4 columns: PostId (unique ID), Id, DateDimensions, Interactions (likes)
  • Query conditions: > 100, < 20
  • Split the query, run it as two queries, and union the results
  • A union scans the same data multiple times (the computation was reused in this case)





Reused Computation
  • Cache & Persistence

Example II - Query for max column results
  • Three approaches, Three queries
  • Window
  • GroupBy + Join
  • Subquery





Three approaches, three queries (physical plan operators)
  • Window (1 Exchange + 1 Sort) - efficient
  • GroupBy + Join (HashAggregate + 2 Exchanges + 1 Sort)
  • Subquery (BroadcastHashJoin + 1 Exchange)

Join Recommendations

Example III - Sum interactions for profiles
  • 3 Exchanges (shuffles)
  • Sort
  • SortMergeJoin
  • Optimize the Exchange steps


Repartition





Happy Learning!!!
