- Spark is lazily executed
- Transformations only build up the query plan
- Actions trigger execution: count, write, foreach
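A plain-Python analogy (not Spark itself) of the laziness described above: generators behave like transformations, building a pipeline that does nothing until an "action" consumes it.

```python
# Generators are lazy like Spark transformations; nothing runs
# until an "action" (here, list()) consumes the pipeline.
def build_pipeline(rows):
    doubled = (r * 2 for r in rows)           # "transformation": nothing executes yet
    filtered = (r for r in doubled if r > 4)  # still lazy, just a bigger plan
    return filtered

plan = build_pipeline([1, 2, 3, 4])  # like df.select(...).filter(...) - only a plan
result = list(plan)                  # the "action" (count/write/foreach) triggers execution
print(result)  # [6, 8]
```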
- Reader API - spark.read.load
- Class - InMemoryFileIndex (responsible for partition discovery on S3 / HDFS)
- Listing more than 32 folders kicks off a Spark job to parallelize the file listing
- Dealing with many partitions
- Pass only the paths you are interested in so InMemoryFileIndex indexes less
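A small sketch of the path-pruning idea above: instead of pointing the reader at the base directory and letting partition discovery list every folder, compute just the partition paths you need. The `partition_paths` helper and the bucket/date layout are hypothetical, for illustration only.

```python
# Hypothetical helper: build explicit partition paths so only these folders
# are indexed, instead of letting Spark discover every partition.
def partition_paths(base, dates):
    # assumes a Hive-style partition layout: <base>/date=<value>/
    return [f"{base}/date={d}" for d in dates]

paths = partition_paths("s3://bucket/events", ["2024-01-01", "2024-01-02"])
print(paths)
# In PySpark this would be passed as: spark.read.load(paths)
```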
- Table metadata is kept in the Hive Metastore
- External or unmanaged tables (a Hive schema over an existing dataset; Spark manages only the metadata)
- Managed tables (Spark SQL manages both the data and the metadata)
- The metastore also keeps track of the schema
- Files vs. tables: a table's schema lives in the metastore
- For BI users, tables are the convenient interface
- Dealing with CSV / JSON files
- Schema inference scans the dataset to build a schema - convenient, but slow on large datasets
- Whether to infer or declare the schema depends on ad hoc vs. batch use
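A simplified sketch of what schema inference does: sample rows and try progressively wider types per column. Real Spark inference samples the file and handles many more types; this toy version only distinguishes int, double, and string.

```python
import csv
import io

def infer_type(values):
    """Infer a column type from sampled string values (simplified sketch)."""
    def all_cast(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_cast(int):
        return "int"
    if all_cast(float):
        return "double"
    return "string"

sample = "id,price,city\n1,9.99,Austin\n2,12.50,Boston\n"
rows = list(csv.DictReader(io.StringIO(sample)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'id': 'int', 'price': 'double', 'city': 'string'}
```

This whole-dataset scan is the cost the note warns about: for batch jobs, declare the schema up front instead.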
- Prefer splittable compression schemes (e.g., bzip2, or block-level codecs inside Parquet/ORC)
- Avoid large gzip text files - a gzip file cannot be split across tasks
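The gzip point can be demonstrated directly: a gzip stream only decompresses from its start, so a task that tries to begin mid-file fails, which is why one large `.gz` file becomes one task.

```python
import gzip

# A gzip stream can only be decompressed from byte 0, which is why Spark
# cannot split one large .gz text file across multiple tasks.
data = b"some,log,line\n" * 10_000
blob = gzip.compress(data)

assert gzip.decompress(blob) == data  # reading the whole file from the start: fine

splittable = True
try:
    gzip.decompress(blob[len(blob) // 2:])  # a task starting mid-file fails
except gzip.BadGzipFile:
    splittable = False
print("gzip splittable:", splittable)  # gzip splittable: False
```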
Optimization
- Partitioning / Bucketing (bucketing persists hash-partitioned data; good for joins on the bucket keys)
- Each task writes one file per bucket
- Repartition by the partition column before writing
- This yields one file per partition value
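The bucketing idea above in miniature: rows are assigned to a bucket by hashing the key modulo the bucket count, so two datasets bucketed the same way co-locate matching keys and a join on that key needs no shuffle. This sketch uses `zlib.crc32` as a stand-in hash; Spark's bucketing actually uses Murmur3.

```python
import zlib

NUM_BUCKETS = 8

def bucket_for(key: str) -> int:
    # Stand-in for Spark's bucketing hash (Spark uses Murmur3, not CRC32).
    return zlib.crc32(key.encode()) % NUM_BUCKETS

# Two datasets bucketed by the same key and bucket count:
orders_buckets   = {k: bucket_for(k) for k in ["cust-1", "cust-2", "cust-3"]}
payments_buckets = {k: bucket_for(k) for k in ["cust-3", "cust-1"]}

# A shared key always lands in the same bucket in both datasets,
# so the join can proceed bucket-by-bucket with no shuffle.
for k in payments_buckets:
    assert payments_buckets[k] == orders_buckets[k]
print(orders_buckets)
```

In PySpark this corresponds to `df.write.bucketBy(8, "customer_id").saveAsTable(...)` on both sides of the join.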
Query Optimization
- spark.sql.shuffle.partitions (default 200)
- Override the default based on data volume
- A self-union reads the dataset twice; cache the DataFrame to avoid the double scan
- Cost-based optimizer (CBO) - uses table statistics to pick better plans
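One way to act on the "override based on data volume" note is a simple sizing heuristic: aim for a fixed number of bytes per shuffle partition. Both the 128 MB target and the helper below are illustrative assumptions, not a Spark API.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # ~128 MB per task, a common rule of thumb

def shuffle_partitions(shuffle_input_bytes: int) -> int:
    # Hypothetical heuristic for overriding spark.sql.shuffle.partitions
    # (default 200) based on the estimated shuffle data volume.
    return max(1, math.ceil(shuffle_input_bytes / TARGET_PARTITION_BYTES))

print(shuffle_partitions(10 * 1024**3))  # 10 GB -> 80 partitions
# In PySpark: spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions(est_bytes))
```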
Happy Learning!!!