"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

July 24, 2022

NoSQL Summary - Options

 A bit of a relook on NoSQL for a class helped me consolidate my learning.

NoSql - Not only SQL. During Engineering when it comes to Database design it is all about

  • Codd's Rule
  • Normalization Techniques 

What I thought in 2005

  • How we handle columns, data types, relationships everything is key. Handling Null, Default values, constraints, etc...

Systems data was structured in 2000

  • 20 years back there was no social media, no WhatsApp. Most of the data is structured data, transactions, automating orders, etc..
What all performance improvements/challenges came as data volumes increased?
  • Partitioning by products/duration
  • Replication to manage read / writes
  • Use of Snapshot isolation/options
  • Denomalizing few tables
  • Migrating to the latest version / Rewriting some of the slow-performing reports
  • Pagination of reports instead of fetch all approach
  • Archiving completed orders
  • Vertical Scaling- Add more RAM, CPU
Since the social media age

  • Now we have more unstructured, semi-structured data from mobile phones, social media, reviews, ratings, rankings, messages, images, and videos.

I still remember the 2010 period when Hadoop was much spoken about. Moving computation where data is available. I looked up my post in 2011 on MongoDB. 

The evolution of databases is from

  • Stage 1 - Papers, Ledgers
  • Stage 2 - Excel, Access
  • Stage 3 - Databases
  • Stage 4 - Hadoop for large-scale data
  • Stage 5 - NoSQL
  • Stage 6 - lakehouse = (Hadoop + RDBMS + NOSQL + AI for data extraction from unstructured sources)

Building a RDBMS perspective is Tables, Keys, Relationships


Ref - Link

Everything revolves around Reading Correct Data vs Dirty Data (Transactions in progress may or may not commit). 

Everything in DBMS is

  • Create
  • Read
  • Update
  • Delete

How does read/write balance, Essentially a record or row needs to be locked before update. This ensures we work in a consistent state.

CAP theorem is the Crux of Everything


Ref - Link

Now you need to choose DB based on preference (C - A - P)


Questions to ask to decide on the choice of Database?
  • Is Query pattern aggregates or select for individual records?
  • What is projected database growth?
  • Is it structured / semi-structured data?
  • What are my top 2 choices, can I do a quick prototype and performance test to validate 
  • Schema design what practices are relevant to each database type? What maps closely to the current context?
  • Is Consistency a key thing, What about Availability / Partition tolerance, Is this system queried across geography to have availability in different regions
  • If it exists how different copies will sync up, Will there be a master-slave approach / Replication / Log copy?
  • What is cost allocated considering volume, and high availability needs?
Different NoSQL Systems

  • In one of the SaaS products we worked on, the Redis Key value pair was used for session management
  • IoT platform for device management in one of my friends Team Cassandra was used to push device data / Generate reports
  • One of the big retailers I was familiar used heavily columnar database Vertica to manage all their aggregate data for BI / ML work
We need to consider the use case, data volume, velocity, type of reporting, cost, growth, security everything to decide on choosing a database.

Ref - Link

I say table in RDBMS, Collection in MongoDB, What is the conceptual mapping?


SQL vs NoSQL Design Thinking 

  • How do I design collection in document DB, nested 1 to 1, 1 to many relationships
  • What information I store in key-value pair, What key value will be unique and will not result in duplicates
  • What column family I will create, How many aggregate queries will look like
Schema / Relationships / Keys will vary based on the Database type.

Which Database for What Application Purpose?
  • High reads consistent data - RDBMS
  • High writes low reads - HBase, Cassandra
  • Document-based storage (multiple key-value pairs, or key-array pairs, or even nested documents) - Mongodb, Couchdb
  • Key-Value stores are similar to maps or dictionaries where data is addressed by a unique key - Redis

Above all cost also plays a key role. Knowing what to choose based on size, data growth, and access patterns is key to deciding the type of Database for implementation.

RDBMS, KeyValue, Columnar, Graph, Document Collection all these forms of databases will co-exist :)

Data Stack

Ref - Link

Modern Analytics Stack



Ref - Link


Source - Link

No comments: