A bit of a relook on NoSQL for a class helped me consolidate my learning.
NoSql - Not only SQL. During Engineering when it comes to Database design it is all about
- Codd's Rule
- Normalization Techniques
What I thought in 2005
- How we handle columns, data types, relationships everything is key. Handling Null, Default values, constraints, etc...
Systems data was structured in 2000
- 20 years back there was no social media, no WhatsApp. Most of the data is structured data, transactions, automating orders, etc..
What all performance improvements/challenges came as data volumes increased?
- Partitioning by products/duration
- Replication to manage read / writes
- Use of Snapshot isolation/options
- Denomalizing few tables
- Migrating to the latest version / Rewriting some of the slow-performing reports
- Pagination of reports instead of fetch all approach
- Archiving completed orders
- Vertical Scaling- Add more RAM, CPU
Since the social media age
- Now we have more unstructured, semi-structured data from mobile phones, social media, reviews, ratings, rankings, messages, images, and videos.
I still remember the 2010 period when Hadoop was much spoken about. Moving computation where data is available. I looked up my post in 2011 on MongoDB.
The evolution of databases is from
- Stage 1 - Papers, Ledgers
- Stage 2 - Excel, Access
- Stage 3 - Databases
- Stage 4 - Hadoop for large-scale data
- Stage 5 - NoSQL
- Stage 6 - lakehouse = (Hadoop + RDBMS + NOSQL + AI for data extraction from unstructured sources)
Building a RDBMS perspective is Tables, Keys, Relationships
Everything revolves around Reading Correct Data vs Dirty Data (Transactions in progress may or may not commit).
Everything in DBMS is
How does read/write balance, Essentially a record or row needs to be locked before update. This ensures we work in a consistent state.
CAP theorem is the Crux of Everything
Ref -
LinkNow you need to choose DB based on preference (C - A - P)Questions to ask to decide on the choice of Database?
- Is Query pattern aggregates or select for individual records?
- What is projected database growth?
- Is it structured / semi-structured data?
- What are my top 2 choices, can I do a quick prototype and performance test to validate
- Schema design what practices are relevant to each database type? What maps closely to the current context?
- Is Consistency a key thing, What about Availability / Partition tolerance, Is this system queried across geography to have availability in different regions
- If it exists how different copies will sync up, Will there be a master-slave approach / Replication / Log copy?
- What is cost allocated considering volume, and high availability needs?
Different NoSQL Systems- In one of the SaaS products we worked on, the Redis Key value pair was used for session management
- IoT platform for device management in one of my friends Team Cassandra was used to push device data / Generate reports
- One of the big retailers I was familiar used heavily columnar database Vertica to manage all their aggregate data for BI / ML work
We need to consider the use case, data volume, velocity, type of reporting, cost, growth, security everything to decide on choosing a database.
Ref -
LinkI say table in RDBMS, Collection in MongoDB, What is the conceptual mapping?
SQL vs NoSQL Design Thinking
- How do I design collection in document DB, nested 1 to 1, 1 to many relationships
- What information I store in key-value pair, What key value will be unique and will not result in duplicates
- What column family I will create, How many aggregate queries will look like
Schema / Relationships / Keys will vary based on the Database type.
Which Database for What Application Purpose?- High reads consistent data - RDBMS
- High writes low reads - HBase, Cassandra
- Document-based storage (multiple key-value pairs, or key-array pairs, or even nested documents) - Mongodb, Couchdb
- Key-Value stores are similar to maps or dictionaries where data is addressed by a unique key - Redis
Above all cost also plays a key role. Knowing what to choose based on size, data growth, and access patterns is key to deciding the type of Database for implementation.
RDBMS, KeyValue, Columnar, Graph, Document Collection all these forms of databases will co-exist :)
Data Stack
Ref -
Link
Modern Analytics Stack
Ref -
Link
Source -
Link