Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Data Warehousing with Amazon Redshift

June 08, 2019

Data Warehousing with Amazon Redshift

Redshift started from Postgres, 4 Years 150+ features added
Rebuilt and improved version, columnar storage, scale horizontally, OLAP functionality added
Wrapped in AWS system

The 5 things to look at creating Redshift

Step 1 - Find Fact, Dimension Tables and the Record Count
Step 2 - Analyze and find the query where clause filter columns
Step 3 - Define the Sort Keys
Step 4 - Analyze the record counts, data to find out distribution strategy
Step 5 - Design Distribution Strategy

Summary of Lessons from Link

Good Tutorial Table Design Tuning - Link

The Where clause columns are the sort keys - (date, partnumber, year etc). Same as TSQL index design Link -

Summary of Lessons and Key Steps

Step #1 - Demo Example Tables and Record Count

LINEORDER - 600,037,902

PART - 1,400,000

CUSTOMER - 3,000,000

SUPPLIER - 1,000,000

DWDATE - 2,556

Step #2 - Analyze and find the query where clause filter columns, Link

The Join Columns are

LINEORDER PrimaryKey - lo_orderkey, FK - lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_commdate

PART pk - p_partkey, FK- lo_partkey

CUSTOMER pk - c_custkey, FK - lo_custkey

SUPPLIER pk - s_suppkey, FK- lo_suppkey

DWDATE pk - d_datekey, FK - lo_orderdate, lo_commdate

Step #3 - Define the Sort Keys

All the primary and foreign keys have the sort order keys defined. Based on Query Where Clauses define them

Table name Sort Key

LINEORDER lo_orderdate

PART p_partkey

CUSTOMER c_custkey

SUPPLIER s_suppkey

DWDATE d_datekey

Step #4 - Analyze the record counts, data to find out distribution strategy

The Largest Dimension Table is PART, Each table can have only one distribution key
LINEORDER is the fact table, and PART is the largest dimension. PART joins LINEORDER on its primary key, p_partkey
Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution key for PART so that the matching values for the joining keys will be collocated

Step #5 - Design Distribution Strategy

Based on data define the strategy for data to be collocated and records returned faster.

Table name Distribution Style

LINEORDER lo_partkey

PART p_partkey

CUSTOMER ALL

SUPPLIER ALL

DWDATE ALL

The final update schema is listed in link. Observe the addition of sortkey, distkey mentioned in the DDL

Stored_Procs

	--Step 1 Create Tables
	drop table demo.aggstats

	create table demo.aggstats
	(code varchar(11) ,
	weekid int,
	trancount int)

	--Step 2 Create the procedure
	drop procedure demo.biweeklysales();

	create or replace procedure demo.biweeklysales()
	language plpgsql
	as $$
	declare
	CurrId INTEGER := 0;
	MaxId INTEGER := 100;
	startdate date := '01/01/2021';
	enddate date;
	duration int:=14;
	begin
	while CurrId <= MaxId
	LOOP
	CurrId = CurrId + 1;
	enddate = dateadd(day,duration,startdate);
	insert into demo.aggstats(code,weekid,trancount)
	select product_code, CurrId, sum(units_sold) as salescount from demo.alltransactions
	where order_time >= startdate
	and order_time < enddate
	group by product_code;
	startdate = dateadd(day,duration,startdate);
	raise info 'CurrId = %', CurrId;
	end LOOP;
	raise info 'Loop Statement Executed -_-\|\|^';
	end;
	$$;

	--Step 3 Execute
	call demo.biweeklysales()
	select distinct(weekid) from demo.aggstats
	select * from demo.aggstats limit 100;

view raw redshiftproc.sql hosted with ❤ by GitHub

Happy Learning new things!!!

Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database)

June 08, 2019

Data Warehousing with Amazon Redshift

No comments:

Git Code Repository

About Me

What is your Expertise

Search This Blog

Translate

About Me and Disclaimer

Labels

Data Science Good Reads

Cloud, Datacentre, BigData and NOSQL Blogs

SQL Links

Archecture Blog List

Programming Problems

Startup - Reads

Perl-Python-Ruby-Linux-Oracle

Management + Leadership Blogs

Research Papers & Podcasts

My Wordpress

Interesting Reads

Useful Links - C# and .NET

Java, Selenium, QTP and Test Tools Learning

Agile Testing

Reverse Logistics Reads

Biztalk Blogs

MS BI Links

Process - Learnt it :)

Usability Guidelines - Building Better Sites

.NET Test Tools and Other Interesting Reads

Review Checklist

Blog Archive

Live Traffic

Total Pageviews

Popular Posts