"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 08, 2019

Data Warehousing with Amazon Redshift

  • Redshift started from Postgres, 4 Years 150+ features added
  • Rebuilt and improved version, columnar storage, scale horizontally, OLAP functionality added
  • Wrapped in AWS system
The 5 things to look at creating Redshift 
  • Step 1 - Find Fact, Dimension Tables and the Record Count
  • Step 2 - Analyze and find the query where clause filter columns
  • Step 3 - Define the Sort Keys
  • Step 4 - Analyze the record counts, data to find out distribution strategy
  • Step 5 - Design Distribution Strategy
Summary of Lessons from Link 
Good Tutorial Table Design Tuning - Link
The Where clause columns are the sort keys - (date, partnumber, year etc). Same as TSQL index design Link -

Summary of Lessons and Key Steps

Step #1 - Demo Example Tables and Record Count

LINEORDER - 600,037,902
PART  - 1,400,000
CUSTOMER  - 3,000,000
SUPPLIER  - 1,000,000
DWDATE  - 2,556

Step #2 - Analyze and find the query where clause filter columns, Link

The Join Columns are
LINEORDER PrimaryKey - lo_orderkey, FK - lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_commdate
PART pk - p_partkey, FK- lo_partkey
CUSTOMER pk - c_custkey, FK - lo_custkey
SUPPLIER pk - s_suppkey, FK- lo_suppkey
DWDATE pk - d_datekey, FK - lo_orderdate, lo_commdate

Step #3 - Define the Sort Keys

All the primary and foreign keys have the sort order keys defined. Based on Query Where Clauses define them

Table name Sort Key
LINEORDER lo_orderdate
PART p_partkey
CUSTOMER c_custkey
SUPPLIER s_suppkey
DWDATE d_datekey

Step #4 - Analyze the record counts, data to find out distribution strategy
  • The Largest Dimension Table is PART, Each table can have only one distribution key
  • LINEORDER is the fact table, and PART is the largest dimension. PART joins LINEORDER on its primary key, p_partkey
  • Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution key for PART so that the matching values for the joining keys will be collocated 
Step #5 - Design Distribution Strategy

Based on data define the strategy for data to be collocated and records returned faster.

Table name Distribution Style
LINEORDER lo_partkey
PART p_partkey
CUSTOMER ALL
SUPPLIER ALL
DWDATE ALL

The final update schema is listed in link. Observe the addition of sortkey, distkey mentioned in the DDL


Happy Learning new things!!!

No comments: