- Redshift started from Postgres, 4 Years 150+ features added
- Rebuilt and improved version, columnar storage, scale horizontally, OLAP functionality added
- Wrapped in AWS system
The 5 things to look at creating Redshift
- Step 1 - Find Fact, Dimension Tables and the Record Count
- Step 2 - Analyze and find the query where clause filter columns
- Step 3 - Define the Sort Keys
- Step 4 - Analyze the record counts, data to find out distribution strategy
- Step 5 - Design Distribution Strategy
Summary of Lessons from Link
Good Tutorial Table Design Tuning - Link
The Where clause columns are the sort keys - (date, partnumber, year etc). Same as TSQL index design Link -
Summary of Lessons and Key Steps
Step #1 - Demo Example Tables and Record Count
LINEORDER - 600,037,902
PART - 1,400,000
CUSTOMER - 3,000,000
SUPPLIER - 1,000,000
DWDATE - 2,556
Step #2 - Analyze and find the query where clause filter columns, Link
The Join Columns are
LINEORDER PrimaryKey - lo_orderkey, FK - lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_commdate
PART pk - p_partkey, FK- lo_partkey
CUSTOMER pk - c_custkey, FK - lo_custkey
SUPPLIER pk - s_suppkey, FK- lo_suppkey
DWDATE pk - d_datekey, FK - lo_orderdate, lo_commdate
Step #3 - Define the Sort Keys
All the primary and foreign keys have the sort order keys defined. Based on Query Where Clauses define them
Table name Sort Key
LINEORDER lo_orderdate
PART p_partkey
CUSTOMER c_custkey
SUPPLIER s_suppkey
DWDATE d_datekey
Step #4 - Analyze the record counts, data to find out distribution strategy
- The Largest Dimension Table is PART, Each table can have only one distribution key
- LINEORDER is the fact table, and PART is the largest dimension. PART joins LINEORDER on its primary key, p_partkey
- Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution key for PART so that the matching values for the joining keys will be collocated
Step #5 - Design Distribution Strategy
Based on data define the strategy for data to be collocated and records returned faster.
Table name Distribution Style
LINEORDER lo_partkey
PART p_partkey
CUSTOMER ALL
SUPPLIER ALL
DWDATE ALL
The final update schema is listed in link. Observe the addition of sortkey, distkey mentioned in the DDL
Happy Learning new things!!!
No comments:
Post a Comment