Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Notes from Azure Synapse Training

June 26, 2021

Notes from Azure Synapse Training

Lesson #1 - Tables – Indexes Best Practices

Clustered Columnstore index (Default Primary) - Highest level of data compression. Best overall query performance
Clustered index (Primary) - Performant for looking up a single to few rows
Heap (Primary) - Faster loading and landing temporary data. Best for small lookup tables
Nonclustered indexes (Secondary) - Enable ordering of multiple columns in a table. Allows multiple nonclustered on a single table. Can be created on any of the above primary indexes. More performant lookup queries

Queries with the following patterns typically run faster with ordered CCI:

The queries have equality, inequality, or range predicates
The predicate columns and the ordered CCI columns are the same.
The predicate columns are used in the same order as the column ordinal of ordered CCI columns.
Caching of results, Enable caching at DB level then query level - Resultcachehit flag returns the value whether it was reused

Fact table primarily CCI as we would run large aggregations based on dimensions so CCI becomes a choice for fact tables.

Lesson #2 - Distributed table design recommendations

Hash Distribution: Large fact tables exceeding several GBs with frequent inserts should use a hash distribution.
Round Robin Distribution: Potentially useful tables created from raw input. Temporary staging tables used in data preparation.
Replicated Tables: Lookup tables that range in size from 100’s MBs to 1.5 GBs should be replicated. Works best when table size is less than 2 GB compressed.

Lesson #3 - Result-set caching

Cache the results of a query from SQL pool storage. This enables interactive response times for repetitive queries against tables with infrequent data changes. The result-set cache persists even if SQL pool is paused and resumed later.

Cache Checks

You can tell if a query was executed with a result cache hit or miss by querying sys.pdw_request_steps for commands where value is like ‘%DWResultCacheDb%’

Lesson #4 - SQL Data Classification is a new feature in the Public Preview, that:

Automatically discovers columns containing potentially sensitive data
It provides a simple way to review and apply the classification recommendations through the Azure portal.
The sensitive data labels are persisted in the database (metadata attributes) and it audits and detects access to the sensitive data.
We offer built-in set of labels and information types, however customers can chose to define custom labels across Azure tenant using Azure Security Center

Lesson #5 - Dynamic Data Masking

Prevent abuse of sensitive data by hiding it from users
Easy configuration in new Azure Portal
Policy-driven at table and column level, for a defined set of users
Data masking applied in real-time to query results based on policy
Multiple masking functions available, such as full or partial, for various sensitive data categories (credit card numbers, SSN, etc.)

Lesson #6 - Spark vs SQL Server (Memory Handling)

Keep in mind spark uses memory much in the same way as sql server uses the buffer pool by storing frequently used objects in memory it reduces overall I/O and improves performance in large joins, sort and aggregates contrast this with a traditional hadoop based architecture which relies heavily on writing data out to disk between steps.

Every concept technical maps as an advancement or some sort of limitation which existed in place. Compared to SQL 2008 where you don't have so much of these feature synapse has beautifully evolved as a good environment for real-time / ML / big data handling capability for reporting / Ml recommendations/lakehouse / real-time BI system. Gone are the days of month-end jobs or Data sync jobs.

All good lessons :) Fantastic Features!!!

Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database)

June 26, 2021

Notes from Azure Synapse Training

No comments:

Git Code Repository

About Me

What is your Expertise

Search This Blog

Translate

About Me and Disclaimer

Labels

Data Science Good Reads

Cloud, Datacentre, BigData and NOSQL Blogs

SQL Links

Archecture Blog List

Programming Problems

Startup - Reads

Perl-Python-Ruby-Linux-Oracle

Management + Leadership Blogs

Research Papers & Podcasts

My Wordpress

Interesting Reads

Useful Links - C# and .NET

Java, Selenium, QTP and Test Tools Learning

Agile Testing

Reverse Logistics Reads

Biztalk Blogs

MS BI Links

Process - Learnt it :)

Usability Guidelines - Building Better Sites

.NET Test Tools and Other Interesting Reads

Review Checklist

Blog Archive

Live Traffic

Total Pageviews

Popular Posts