A bit of a relook at NoSQL for a class helped me consolidate my learning.
NoSQL - Not Only SQL. Back in engineering, database design was all about
- Codd's Rule
- Normalization Techniques
What I thought in 2005
- How we handle columns, data types, and relationships is key - handling nulls, default values, constraints, etc.
Systems' data was structured around 2000
- 20 years back there was no social media, no WhatsApp. Most of the data was structured: transactions, automating orders, etc.
- Partitioning by products/duration
- Replication to manage read / writes
- Use of Snapshot isolation/options
- Denormalizing a few tables
- Migrating to the latest version / Rewriting some of the slow-performing reports
- Pagination of reports instead of fetch all approach
- Archiving completed orders
- Vertical Scaling- Add more RAM, CPU
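As a rough illustration of one of these techniques, here is a keyset-pagination sketch against SQLite, contrasted with the fetch-all approach. The `orders` table and its data are made up for the example:

```python
import sqlite3

# In-memory DB with a sample orders table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (id, item) VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 101)])

# Fetch-all approach: pulls every row at once, expensive for large tables.
all_rows = conn.execute("SELECT id, item FROM orders").fetchall()

# Keyset pagination: fetch one page at a time, resuming after the last seen id.
def fetch_page(conn, last_id, page_size=10):
    return conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

page1 = fetch_page(conn, last_id=0)
page2 = fetch_page(conn, last_id=page1[-1][0])  # resume after page 1's last id
```

Keyset pagination (`WHERE id > ?`) stays fast on deep pages, unlike `OFFSET`, which still scans the skipped rows.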
- Now we have more unstructured, semi-structured data from mobile phones, social media, reviews, ratings, rankings, messages, images, and videos.
I still remember the 2010 period when Hadoop was much talked about - moving computation to where the data lives. I looked up my 2011 post on MongoDB.
The evolution of databases is from
- Stage 1 - Papers, Ledgers
- Stage 2 - Excel, Access
- Stage 3 - Databases
- Stage 4 - Hadoop for large-scale data
- Stage 5 - NoSQL
- Stage 6 - Lakehouse = (Hadoop + RDBMS + NoSQL + AI for data extraction from unstructured sources)
Everything revolves around reading correct data vs dirty data (reads of in-progress transactions that may or may not commit).
Everything in DBMS is
- Create
- Read
- Update
- Delete
How do reads and writes balance? Essentially, a record or row needs to be locked before an update. This ensures we work on a consistent state.
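A minimal sketch of that idea with Python's built-in SQLite module - the `accounts` table and the transfer scenario are invented for the example. The transaction makes both updates commit together or roll back together, so a reader never sees a half-done transfer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

# A transfer of 30 from account 1 to account 2: both updates are one
# atomic unit, so money is never created or destroyed mid-transfer.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # rollback already happened; state stays consistent

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

If the second UPDATE fails, the `with conn:` block rolls back the first one too, which is exactly the consistency guarantee the locking discussion above is about.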
The CAP theorem is the crux of everything.
Ref - Link
Now you need to choose a DB based on your preference among C, A, and P:
- Is the query pattern aggregates or selects of individual records?
- What is the projected database growth?
- Is it structured / semi-structured data?
- What are my top 2 choices? Can I do a quick prototype and performance test to validate?
- Schema design - what practices are relevant to each database type? What maps closely to the current context?
- Is consistency a key thing? What about availability / partition tolerance? Is this system queried across geographies, needing availability in different regions?
- If so, how will the different copies sync up? Will there be a master-slave approach / replication / log shipping?
- What is the cost allocated, considering volume and high-availability needs?
- In one of the SaaS products we worked on, Redis key-value pairs were used for session management
- In an IoT platform for device management on one of my friends' teams, Cassandra was used to push device data and generate reports
- One of the big retailers I was familiar with heavily used the columnar database Vertica to manage all their aggregate data for BI / ML work
I say table in RDBMS and collection in MongoDB - what is the conceptual mapping?
SQL vs NoSQL Design Thinking
- How do I design a collection in a document DB - nested one-to-one, one-to-many relationships?
- What information do I store in a key-value pair? What key will be unique and not result in duplicates?
- What column families will I create? What will the aggregate queries look like?
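To make the relational-vs-document contrast concrete, here is a sketch (the order fields are invented for illustration) of the same one-to-many data as normalized rows versus a single embedded document, MongoDB-style:

```python
import json

# Relational shape: an order and its line items live in separate tables,
# joined by order_id - a classic normalized one-to-many.
order_row = {"order_id": 1001, "customer": "asha"}
item_rows = [
    {"order_id": 1001, "sku": "A1", "qty": 2},
    {"order_id": 1001, "sku": "B7", "qty": 1},
]

# Document shape: the one-to-many relationship is embedded, so a single
# read returns the whole aggregate - no join needed.
order_doc = {
    "_id": 1001,
    "customer": "asha",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Documents round-trip naturally through JSON (BSON in MongoDB itself).
doc = json.loads(json.dumps(order_doc))
```

Embedding pays off when the items are always read with their order; splitting into separate collections (like the relational shape) fits better when items are queried on their own.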
Which Database for What Application Purpose?
- High reads consistent data - RDBMS
- High writes low reads - HBase, Cassandra
- Document-based storage (multiple key-value pairs, key-array pairs, or even nested documents) - MongoDB, CouchDB
- Key-Value stores are similar to maps or dictionaries where data is addressed by a unique key - Redis
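As an illustration of the key-value idea, here is a toy in-memory store mimicking Redis-style `SETEX`/`GET` session handling with expiry. The class and key names are invented; real session management would use an actual Redis client:

```python
import time

class SessionStore:
    """Toy key-value store: values addressed by a unique key, with a TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp)

    def setex(self, key, ttl_seconds, value):
        # Like Redis SETEX: store the value and schedule its expiry.
        self._data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.time() >= expires:
            del self._data[key]  # lazy expiry on access, as Redis may do
            return None
        return value

store = SessionStore()
store.setex("session:u42", 1800, {"user_id": 42, "role": "admin"})
```

The unique key (here a hypothetical `session:<user>` convention) is what prevents duplicates - exactly the design question raised above for key-value schemas.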
Above all, cost also plays a key role. Knowing what to choose based on size, data growth, and access patterns is key to deciding the type of database to implement.
RDBMS, key-value, columnar, graph, document collections - all these forms of databases will co-exist :)
Data Stack
Ref - Link
Modern Analytics Stack