Physical Plans in Spark SQL - David Vrba (Socialbakers)

February 04, 2020

Most execution plan techniques from an RDBMS such as SQL Server apply here as well. The strategies are the same; the patterns differ.

Summary

Theory - Query Execution

Step 1 - Logical Planning (Building the query Tree)

Step 2 - Physical Planning

Frequently used operators

Link: https://spark.apache.org/docs/2.3.1/sql-programming-guide.html

Key parameters

spark.sql.files.maxPartitionBytes - Maximum number of bytes to pack into a single partition when reading files (default 128 MB)
spark.sql.files.openCostInBytes - Estimated cost of opening a file, measured as the number of bytes that could be scanned in the same time (default 4 MB)
spark.sql.broadcastTimeout - Timeout in seconds for the broadcast wait time in broadcast joins (default 300)
spark.sql.autoBroadcastJoinThreshold - Maximum size in bytes of a table that will be broadcast to all worker nodes when performing a join (default 10 MB). Setting this value to -1 disables broadcasting
spark.sql.shuffle.partitions - Number of partitions to use when shuffling data for joins or aggregations (default 200)
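
To make the interaction between spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes concrete, here is a rough sketch in plain Python of how a file scan appears to pick its target split size. The function name `max_split_bytes` and its arguments are hypothetical, and the formula is an approximation based on the Spark 2.x file-scan logic; the exact rule can differ between versions.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB,
                    open_cost_in_bytes=4 * MB):
    """Approximate the target split size (in bytes) for a file scan.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    # Every file is padded with the open cost, so many small files
    # look "heavier" than their raw size suggests.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    # Never exceed maxPartitionBytes, never drop below openCostInBytes.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# 100 small files of 1 MB each, read with a parallelism of 8.
small_files_split = max_split_bytes([1 * MB] * 100, default_parallelism=8)

# 10 large files of 1 GB each: the split size is capped at maxPartitionBytes.
large_files_split = max_split_bytes([1024 * MB] * 10, default_parallelism=8)

print(small_files_split / MB, large_files_split / MB)
```

Under these assumptions, lowering maxPartitionBytes yields more, smaller input partitions for large files, while raising openCostInBytes makes Spark pack fewer small files into each partition.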

Exchange

Represents a shuffle
Physical data movement across the cluster
SinglePartition - All data is moved to a single partition; might result in a bottleneck
HashPartitioning - Rows are partitioned by a hash of the given columns; induced by aggregations and joins (groupBy, distinct, join)
RoundRobinPartitioning - Rows are spread evenly over a specified number of partitions; induced by repartition(n)
RangePartitioning - Rows are split into ordered key ranges; happens when sorting data (orderBy)
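
The four partitioning schemes can be compared with a toy model in plain Python. This is only a sketch of the idea, not Spark's implementation: all function names are hypothetical, Python's built-in `hash` stands in for Spark's hash function, and the range bounds are hard-coded here, whereas Spark derives them by sampling the data.

```python
from bisect import bisect_left


def single_partition(rows):
    # Everything ends up in one partition (a potential bottleneck).
    return [list(rows)]


def hash_partition(rows, key, num_partitions):
    # Rows with equal keys always land in the same partition.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts


def round_robin_partition(rows, num_partitions):
    # Rows are dealt out in turn, ignoring their content.
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def range_partition(rows, key, bounds):
    # `bounds` are sorted upper bounds; partition i holds keys <= bounds[i],
    # and the last partition holds everything above the final bound.
    parts = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        parts[bisect_left(bounds, key(row))].append(row)
    return parts


def first(row):
    return row[0]


rows = [(k, f"v{k}") for k in [7, 2, 9, 4, 2, 7, 1, 8]]

hashed = hash_partition(rows, first, 3)
dealt = round_robin_partition(rows, 3)
ranged = range_partition(rows, first, bounds=[3, 7])

# Sorting each range partition locally gives a globally sorted result.
globally_sorted = [row for part in ranged for row in sorted(part)]
print([k for k, _ in globally_sorted])
```

The last step hints at why a sort leads to RangePartitioning: once keys are split into ordered ranges, each partition can be sorted independently and the partitions simply read in order.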