- Redshift started from Postgres, 4 Years 150+ features added
- Rebuilt and improved version, columnar storage, scale horizontally, OLAP functionality added
- Wrapped in AWS system
The 5 things to look at creating Redshift
- Step 1 - Find Fact, Dimension Tables and the Record Count
- Step 2 - Analyze and find the query where clause filter columns
- Step 3 - Define the Sort Keys
- Step 4 - Analyze the record counts, data to find out distribution strategy
- Step 5 - Design Distribution Strategy
Summary of Lessons from Link
Good Tutorial Table Design Tuning - Link
The Where clause columns are the sort keys - (date, partnumber, year etc). Same as TSQL index design Link -
Summary of Lessons and Key Steps
Step #1 - Demo Example Tables and Record Count
LINEORDER - 600,037,902
PART - 1,400,000
CUSTOMER - 3,000,000
SUPPLIER - 1,000,000
DWDATE - 2,556
Step #2 - Analyze and find the query where clause filter columns, Link
The Join Columns are
LINEORDER PrimaryKey - lo_orderkey, FK - lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_commdate
PART pk - p_partkey, FK- lo_partkey
CUSTOMER pk - c_custkey, FK - lo_custkey
SUPPLIER pk - s_suppkey, FK- lo_suppkey
DWDATE pk - d_datekey, FK - lo_orderdate, lo_commdate
Step #3 - Define the Sort Keys
All the primary and foreign keys have the sort order keys defined. Based on Query Where Clauses define them
Table name Sort Key
LINEORDER lo_orderdate
PART p_partkey
CUSTOMER c_custkey
SUPPLIER s_suppkey
DWDATE d_datekey
Step #4 - Analyze the record counts, data to find out distribution strategy
- The Largest Dimension Table is PART, Each table can have only one distribution key
- LINEORDER is the fact table, and PART is the largest dimension. PART joins LINEORDER on its primary key, p_partkey
- Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution key for PART so that the matching values for the joining keys will be collocated
Step #5 - Design Distribution Strategy
Based on data define the strategy for data to be collocated and records returned faster.
Table name Distribution Style
LINEORDER lo_partkey
PART p_partkey
CUSTOMER ALL
SUPPLIER ALL
DWDATE ALL
The final update schema is listed in link. Observe the addition of sortkey, distkey mentioned in the DDL
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--Step 1 Create Tables | |
drop table demo.aggstats | |
create table demo.aggstats | |
(code varchar(11) , | |
weekid int, | |
trancount int) | |
--Step 2 Create the procedure | |
drop procedure demo.biweeklysales(); | |
create or replace procedure demo.biweeklysales() | |
language plpgsql | |
as $$ | |
declare | |
CurrId INTEGER := 0; | |
MaxId INTEGER := 100; | |
startdate date := '01/01/2021'; | |
enddate date; | |
duration int:=14; | |
begin | |
while CurrId <= MaxId | |
LOOP | |
CurrId = CurrId + 1; | |
enddate = dateadd(day,duration,startdate); | |
insert into demo.aggstats(code,weekid,trancount) | |
select product_code, CurrId, sum(units_sold) as salescount from demo.alltransactions | |
where order_time >= startdate | |
and order_time < enddate | |
group by product_code; | |
startdate = dateadd(day,duration,startdate); | |
raise info 'CurrId = %', CurrId; | |
end LOOP; | |
raise info 'Loop Statement Executed -_-||^'; | |
end; | |
$$; | |
--Step 3 Execute | |
call demo.biweeklysales() | |
select distinct(weekid) from demo.aggstats | |
select * from demo.aggstats limit 100; |
No comments:
Post a Comment