"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

March 04, 2019

Spark Lessons: Applying Best Practices to Your Apache Spark Applications - Silvio Fiorito

Key Lessons
  • Spark is lazily executed
  • Transformations only build up the query plan; nothing runs until an action
  • Actions such as count, write, and foreach trigger execution
  • Reader API - spark.read.load

  • Class - InMemoryFileIndex (responsible for partition discovery on S3 / HDFS)

  • Listing more than 32 paths kicks off a distributed file-listing job (spark.sql.sources.parallelPartitionDiscovery.threshold, default 32)
  • When dealing with many partitions, avoid listing the entire table root
  • Point InMemoryFileIndex at only the paths you are interested in
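A minimal PySpark sketch of the idea above (the bucket name and partition paths are hypothetical): hand the reader only the partition directories you need, so partition discovery does not have to list the whole table root.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load only two partition directories instead of the table root,
# so InMemoryFileIndex has far fewer paths to index.
df = (spark.read
      .format("parquet")
      .load(["s3://my-bucket/events/date=2019-03-01",   # hypothetical paths
             "s3://my-bucket/events/date=2019-03-02"]))
```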
Datasource Tables
  • Managed in the Hive Metastore
  • External or unmanaged tables: a Hive schema laid over an existing dataset; Spark does not own the files
  • Managed tables: Spark SQL manages both the data and the metadata
  • The metastore also keeps track of the schema
  • Files vs. tables: for tables the schema lives in the metastore, so readers skip inference
  • For BI users, tables are the convenient interface
  • When dealing with raw CSV / JSON files, Spark scans the dataset and creates a schema (schema inference) - convenient, but slow on large datasets
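A sketch of the files-vs-tables tradeoff above (paths and table names are hypothetical): pay the schema-inference scan once, persist the result as a table, and later readers pick the schema up from the metastore.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Infer the schema once from raw CSV (a full scan - slow on big data)...
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/raw/events.csv"))        # hypothetical path

# ...then persist as a managed table: the schema now lives in the
# metastore, so later readers (including BI tools) skip inference.
df.write.saveAsTable("events")

# An external (unmanaged) table just lays a schema over existing files:
spark.sql("CREATE TABLE events_ext USING parquet LOCATION '/data/events'")
```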


Compression and Partition Scheme
  • The right choice depends on ad hoc vs. batch workloads
  • Prefer splittable compression schemes so one file can be read by many tasks
  • Avoid large gzip text files: gzip is not splittable, so each file is read by a single task
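A small sketch of the splittable-format point (output path is hypothetical, and `spark.range` stands in for real data): columnar formats like Parquet with Snappy stay splittable at the row-group level, unlike a single large gzip text file.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)   # stand-in DataFrame for illustration

# Parquet + Snappy remains splittable across row groups, so many tasks
# can read the output; a large gzip text file would pin it to one task.
df.write.option("compression", "snappy").parquet("/tmp/events_parquet")
```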


Optimization
  • Partitioning / bucketing: persist hash-partitioned data; good for joins and key lookups
  • Each task writes one file per bucket
  • Repartition by the partition column before writing
  • That yields one file per partition directory instead of one per task
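The bullets above can be sketched as follows (the source table, `date` partition column, and `user_id` bucket key are all hypothetical): repartition by the partition column first so each partition directory gets one file per bucket rather than one per task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("events")   # hypothetical source table

# Repartition by the partition column first, so each date= directory
# ends up with one file per bucket instead of one file per task.
(df.repartition("date")
   .write
   .partitionBy("date")
   .bucketBy(8, "user_id")            # hypothetical join key
   .sortBy("user_id")
   .saveAsTable("events_bucketed"))   # bucketing requires saveAsTable
```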

Query Optimization

  • spark.sql.shuffle.partitions defaults to 200; override it based on data volume
  • A self-union reads the dataset twice; cache the input to avoid the double scan
  • Cost-based optimizer: collect table statistics so it can choose better plans
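A configuration sketch for the tuning notes above (the table and column names fed to ANALYZE TABLE are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Override the shuffle-partition default (200) to match the data volume.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Enable the cost-based optimizer and feed it column statistics
# so joins and filters get better plans.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS user_id")
```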


Happy Learning!!!
