"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 26, 2021

Notes from Azure Synapse Training

Lesson #1 - Tables – Indexes Best Practices

  • Clustered Columnstore index (Default Primary) - Highest level of data compression. Best overall query performance
  • Clustered index (Primary) - Performant for looking up a single to few rows
  • Heap (Primary) - Faster loading and landing temporary data. Best for small lookup tables
  • Nonclustered indexes (Secondary) - Enable ordering of multiple columns in a table. Allows multiple nonclustered on a single table. Can be created on any of the above primary indexes. More performant lookup queries
Queries with the following patterns typically run faster with ordered CCI:
  • The queries have equality, inequality, or range predicates
  • The predicate columns and the ordered CCI columns are the same.
  • The predicate columns are used in the same order as the column ordinal of ordered CCI columns.
  • Caching of results, Enable caching at DB level then query level - Resultcachehit flag returns the value whether it was reused

Fact table primarily CCI as we would run large aggregations based on dimensions so CCI becomes a choice for fact tables. 

Lesson #2 - Distributed table design recommendations

  • Hash Distribution: Large fact tables exceeding several GBs with frequent inserts should use a hash distribution.
  • Round Robin Distribution: Potentially useful tables created from raw input. Temporary staging tables used in data preparation.
  • Replicated Tables: Lookup tables that range in size from 100’s MBs to 1.5 GBs should be replicated. Works best when table size is less than 2 GB compressed.

Lesson #3 - Result-set caching

Cache the results of a query from SQL pool storage. This enables interactive response times for repetitive queries against tables with infrequent data changes. The result-set cache persists even if SQL pool is paused and resumed later. 

Cache Checks

You can tell if a query was executed with a result cache hit or miss by querying sys.pdw_request_steps for commands where value is like ‘%DWResultCacheDb%’

Lesson #4 - SQL Data Classification is a new feature in the Public Preview, that:   

  • Automatically discovers columns containing potentially sensitive data
  • It provides a simple way to review and apply the classification recommendations through the Azure portal.
  • The sensitive data labels are persisted in the database (metadata attributes) and it audits and detects access to the sensitive data.
  • We offer built-in set of labels and information types, however customers can chose to define custom labels across Azure tenant using Azure Security Center

Lesson #5 - Dynamic Data Masking

  • Prevent abuse of sensitive data by hiding it from users
  • Easy configuration in new Azure Portal
  • Policy-driven at table and column level, for a defined set of users
  • Data masking applied in real-time to query results based on policy
  • Multiple masking functions available, such as full or partial, for various sensitive data categories (credit card numbers, SSN, etc.)
Lesson #6 - Spark vs SQL Server (Memory Handling)

Keep in mind spark uses memory much in the same way as sql server uses the buffer pool by storing frequently used objects in memory it reduces overall I/O and improves performance in large joins, sort and aggregates contrast this with a traditional hadoop based architecture which relies heavily on writing data out to disk between steps.

Every concept technical maps as an advancement or some sort of limitation which existed in place. Compared to SQL 2008 where you don't have so much of these feature synapse has beautifully evolved as a good environment for real-time / ML / big data handling capability for reporting / Ml recommendations/lakehouse / real-time BI system. Gone are the days of month-end jobs or Data sync jobs. 

All good lessons :) Fantastic Features!!!

No comments: