- Spark is lazily executed
- Transformations only build up the query plan
- Actions trigger execution: count, write, foreach
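A plain-Python analogy (not Spark itself) of the laziness described above: generators behave like transformations, building a pipeline that does nothing until an "action" consumes it.

```python
# Generators are lazy like Spark transformations; nothing runs
# until an "action" (here, list()) consumes the pipeline.
def build_pipeline(rows):
    doubled = (r * 2 for r in rows)           # "transformation": nothing executes yet
    filtered = (r for r in doubled if r > 4)  # still lazy, just a bigger plan
    return filtered

plan = build_pipeline([1, 2, 3, 4])  # like df.select(...).filter(...) - only a plan
result = list(plan)                  # the "action" (count/write/foreach) triggers execution
print(result)  # [6, 8]
```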
- Reader API - spark.read.load
- Class - InMemoryFileIndex (responsible for partition discovery on S3 / HDFS)
- Listing more than 32 folders kicks off a Spark job to parallelize the file listing
- Dealing with many partitions
- Pass only the paths you are interested in so InMemoryFileIndex indexes less
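A small sketch of the path-pruning idea above: instead of pointing the reader at the base directory and letting partition discovery list every folder, compute just the partition paths you need. The `partition_paths` helper and the bucket/date layout are hypothetical, for illustration only.

```python
# Hypothetical helper: build explicit partition paths so only these folders
# are indexed, instead of letting Spark discover every partition.
def partition_paths(base, dates):
    # assumes a Hive-style partition layout: <base>/date=<value>/
    return [f"{base}/date={d}" for d in dates]

paths = partition_paths("s3://bucket/events", ["2024-01-01", "2024-01-02"])
print(paths)
# In PySpark this would be passed as: spark.read.load(paths)
```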
- Table metadata is kept in the Hive Metastore
- External or unmanaged tables (a Hive schema over an existing dataset; Spark manages only the metadata)
- Managed tables (Spark SQL manages both the data and the metadata)
- The metastore also keeps track of the schema
- Files vs. tables: a table's schema lives in the metastore
- For BI users, tables are the convenient interface
- Dealing with CSV / JSON files
- Schema inference scans the dataset to build a schema - convenient, but slow on large datasets
- Whether to infer or declare the schema depends on ad hoc vs. batch use
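A simplified sketch of what schema inference does: sample rows and try progressively wider types per column. Real Spark inference samples the file and handles many more types; this toy version only distinguishes int, double, and string.

```python
import csv
import io

def infer_type(values):
    """Infer a column type from sampled string values (simplified sketch)."""
    def all_cast(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_cast(int):
        return "int"
    if all_cast(float):
        return "double"
    return "string"

sample = "id,price,city\n1,9.99,Austin\n2,12.50,Boston\n"
rows = list(csv.DictReader(io.StringIO(sample)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'id': 'int', 'price': 'double', 'city': 'string'}
```

This whole-dataset scan is the cost the note warns about: for batch jobs, declare the schema up front instead.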
- Prefer splittable compression schemes (e.g., bzip2, or block-level codecs inside Parquet/ORC)
- Avoid large gzip text files - a gzip file cannot be split across tasks
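The gzip point can be demonstrated directly: a gzip stream only decompresses from its start, so a task that tries to begin mid-file fails, which is why one large `.gz` file becomes one task.

```python
import gzip

# A gzip stream can only be decompressed from byte 0, which is why Spark
# cannot split one large .gz text file across multiple tasks.
data = b"some,log,line\n" * 10_000
blob = gzip.compress(data)

assert gzip.decompress(blob) == data  # reading the whole file from the start: fine

splittable = True
try:
    gzip.decompress(blob[len(blob) // 2:])  # a task starting mid-file fails
except gzip.BadGzipFile:
    splittable = False
print("gzip splittable:", splittable)  # gzip splittable: False
```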
Optimization
- Partitioning / Bucketing (bucketing persists hash-partitioned data; good for joins on the bucket keys)
- Each task writes one file per bucket
- Repartition by the partition column before writing
- This yields one file per partition value
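The bucketing idea above in miniature: rows are assigned to a bucket by hashing the key modulo the bucket count, so two datasets bucketed the same way co-locate matching keys and a join on that key needs no shuffle. This sketch uses `zlib.crc32` as a stand-in hash; Spark's bucketing actually uses Murmur3.

```python
import zlib

NUM_BUCKETS = 8

def bucket_for(key: str) -> int:
    # Stand-in for Spark's bucketing hash (Spark uses Murmur3, not CRC32).
    return zlib.crc32(key.encode()) % NUM_BUCKETS

# Two datasets bucketed by the same key and bucket count:
orders_buckets   = {k: bucket_for(k) for k in ["cust-1", "cust-2", "cust-3"]}
payments_buckets = {k: bucket_for(k) for k in ["cust-3", "cust-1"]}

# A shared key always lands in the same bucket in both datasets,
# so the join can proceed bucket-by-bucket with no shuffle.
for k in payments_buckets:
    assert payments_buckets[k] == orders_buckets[k]
print(orders_buckets)
```

In PySpark this corresponds to `df.write.bucketBy(8, "customer_id").saveAsTable(...)` on both sides of the join.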
Query Optimization
- spark.sql.shuffle.partitions (default 200)
- Override the default based on data volume
- A self-union reads the dataset twice; cache the DataFrame to avoid the double scan
- Cost-based optimizer (CBO) - uses table statistics to pick better plans
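One way to act on the "override based on data volume" note is a simple sizing heuristic: aim for a fixed number of bytes per shuffle partition. Both the 128 MB target and the helper below are illustrative assumptions, not a Spark API.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # ~128 MB per task, a common rule of thumb

def shuffle_partitions(shuffle_input_bytes: int) -> int:
    # Hypothetical heuristic for overriding spark.sql.shuffle.partitions
    # (default 200) based on the estimated shuffle data volume.
    return max(1, math.ceil(shuffle_input_bytes / TARGET_PARTITION_BYTES))

print(shuffle_partitions(10 * 1024**3))  # 10 GB -> 80 partitions
# In PySpark: spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions(est_bytes))
```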
Happy Learning!!!