"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 08, 2019

Data Warehousing with Amazon Redshift

  • Redshift started from Postgres, 4 Years 150+ features added
  • Rebuilt and improved version, columnar storage, scale horizontally, OLAP functionality added
  • Wrapped in AWS system
The 5 things to look at creating Redshift 
  • Step 1 - Find Fact, Dimension Tables and the Record Count
  • Step 2 - Analyze and find the query where clause filter columns
  • Step 3 - Define the Sort Keys
  • Step 4 - Analyze the record counts, data to find out distribution strategy
  • Step 5 - Design Distribution Strategy
Summary of Lessons from Link 
Good Tutorial Table Design Tuning - Link
The Where clause columns are the sort keys - (date, partnumber, year etc). Same as TSQL index design Link -

Summary of Lessons and Key Steps

Step #1 - Demo Example Tables and Record Count

LINEORDER - 600,037,902
PART  - 1,400,000
CUSTOMER  - 3,000,000
SUPPLIER  - 1,000,000
DWDATE  - 2,556

Step #2 - Analyze and find the query where clause filter columns, Link

The Join Columns are
LINEORDER PrimaryKey - lo_orderkey, FK - lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_commdate
PART pk - p_partkey, FK- lo_partkey
CUSTOMER pk - c_custkey, FK - lo_custkey
SUPPLIER pk - s_suppkey, FK- lo_suppkey
DWDATE pk - d_datekey, FK - lo_orderdate, lo_commdate

Step #3 - Define the Sort Keys

All the primary and foreign keys have the sort order keys defined. Based on Query Where Clauses define them

Table name Sort Key
LINEORDER lo_orderdate
PART p_partkey
CUSTOMER c_custkey
SUPPLIER s_suppkey
DWDATE d_datekey

Step #4 - Analyze the record counts, data to find out distribution strategy
  • The Largest Dimension Table is PART, Each table can have only one distribution key
  • LINEORDER is the fact table, and PART is the largest dimension. PART joins LINEORDER on its primary key, p_partkey
  • Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution key for PART so that the matching values for the joining keys will be collocated 
Step #5 - Design Distribution Strategy

Based on data define the strategy for data to be collocated and records returned faster.

Table name Distribution Style
LINEORDER lo_partkey
PART p_partkey
CUSTOMER ALL
SUPPLIER ALL
DWDATE ALL

The final update schema is listed in link. Observe the addition of sortkey, distkey mentioned in the DDL


--Step 1 Create Tables
drop table demo.aggstats
create table demo.aggstats
(code varchar(11) ,
weekid int,
trancount int)
--Step 2 Create the procedure
drop procedure demo.biweeklysales();
create or replace procedure demo.biweeklysales()
language plpgsql
as $$
declare
CurrId INTEGER := 0;
MaxId INTEGER := 100;
startdate date := '01/01/2021';
enddate date;
duration int:=14;
begin
while CurrId <= MaxId
LOOP
CurrId = CurrId + 1;
enddate = dateadd(day,duration,startdate);
insert into demo.aggstats(code,weekid,trancount)
select product_code, CurrId, sum(units_sold) as salescount from demo.alltransactions
where order_time >= startdate
and order_time < enddate
group by product_code;
startdate = dateadd(day,duration,startdate);
raise info 'CurrId = %', CurrId;
end LOOP;
raise info 'Loop Statement Executed -_-||^';
end;
$$;
--Step 3 Execute
call demo.biweeklysales()
select distinct(weekid) from demo.aggstats
select * from demo.aggstats limit 100;
Happy Learning new things!!!

No comments: