A bit of a relook at NoSQL for a class helped me consolidate my learning.
NoSQL - Not Only SQL. Back in engineering, database design was all about
- Codd's Rule
- Normalization Techniques
What I thought in 2005
- How we handle columns, data types, and relationships is key - handling nulls, default values, constraints, etc.
Systems' data was structured around 2000
- 20 years back there was no social media, no WhatsApp. Most of the data was structured: transactions, automating orders, etc.
- Partitioning by products/duration
- Replication to manage read / writes
- Use of Snapshot isolation/options
- Denormalizing a few tables
- Migrating to the latest version / Rewriting some of the slow-performing reports
- Pagination of reports instead of fetch all approach
- Archiving completed orders
- Vertical Scaling- Add more RAM, CPU
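As a rough illustration of one of these techniques, here is a keyset-pagination sketch against SQLite, contrasted with the fetch-all approach. The `orders` table and its data are made up for the example:

```python
import sqlite3

# In-memory DB with a sample orders table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (id, item) VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 101)])

# Fetch-all approach: pulls every row at once, expensive for large tables.
all_rows = conn.execute("SELECT id, item FROM orders").fetchall()

# Keyset pagination: fetch one page at a time, resuming after the last seen id.
def fetch_page(conn, last_id, page_size=10):
    return conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

page1 = fetch_page(conn, last_id=0)
page2 = fetch_page(conn, last_id=page1[-1][0])  # resume after page 1's last id
```

Keyset pagination (`WHERE id > ?`) stays fast on deep pages, unlike `OFFSET`, which still scans the skipped rows.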
- Now we have more unstructured, semi-structured data from mobile phones, social media, reviews, ratings, rankings, messages, images, and videos.
I still remember the 2010 period when Hadoop was much talked about - moving computation to where the data lives. I looked up my 2011 post on MongoDB.
The evolution of databases is from
- Stage 1 - Papers, Ledgers
- Stage 2 - Excel, Access
- Stage 3 - Databases
- Stage 4 - Hadoop for large-scale data
- Stage 5 - NoSQL
- Stage 6 - Lakehouse = (Hadoop + RDBMS + NoSQL + AI for data extraction from unstructured sources)
Everything revolves around reading correct data vs dirty data (reads of in-progress transactions that may or may not commit).
Everything in DBMS is
- Create
- Read
- Update
- Delete
How do reads and writes balance? Essentially, a record or row needs to be locked before an update. This ensures we work on a consistent state.
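A minimal sketch of that idea with Python's built-in SQLite module - the `accounts` table and the transfer scenario are invented for the example. The transaction makes both updates commit together or roll back together, so a reader never sees a half-done transfer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

# A transfer of 30 from account 1 to account 2: both updates are one
# atomic unit, so money is never created or destroyed mid-transfer.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # rollback already happened; state stays consistent

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

If the second UPDATE fails, the `with conn:` block rolls back the first one too, which is exactly the consistency guarantee the locking discussion above is about.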
The CAP theorem is the crux of everything.
Ref - Link
Now you need to choose a DB based on your preference among C, A, and P:
- Is the query pattern aggregates or selects of individual records?
- What is the projected database growth?
- Is it structured / semi-structured data?
- What are my top 2 choices? Can I do a quick prototype and performance test to validate?
- Schema design - what practices are relevant to each database type? What maps closely to the current context?
- Is consistency a key thing? What about availability / partition tolerance? Is this system queried across geographies, needing availability in different regions?
- If so, how will the different copies sync up? Will there be a master-slave approach / replication / log shipping?
- What is the cost allocated, considering volume and high-availability needs?
- In one of the SaaS products we worked on, Redis key-value pairs were used for session management
- In an IoT platform for device management on one of my friends' teams, Cassandra was used to push device data and generate reports
- One of the big retailers I was familiar with heavily used the columnar database Vertica to manage all their aggregate data for BI / ML work
I say table in RDBMS and collection in MongoDB - what is the conceptual mapping?
SQL vs NoSQL Design Thinking
- How do I design a collection in a document DB - nested one-to-one, one-to-many relationships?
- What information do I store in a key-value pair? What key will be unique and not result in duplicates?
- What column families will I create? What will the aggregate queries look like?
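To make the relational-vs-document contrast concrete, here is a sketch (the order fields are invented for illustration) of the same one-to-many data as normalized rows versus a single embedded document, MongoDB-style:

```python
import json

# Relational shape: an order and its line items live in separate tables,
# joined by order_id - a classic normalized one-to-many.
order_row = {"order_id": 1001, "customer": "asha"}
item_rows = [
    {"order_id": 1001, "sku": "A1", "qty": 2},
    {"order_id": 1001, "sku": "B7", "qty": 1},
]

# Document shape: the one-to-many relationship is embedded, so a single
# read returns the whole aggregate - no join needed.
order_doc = {
    "_id": 1001,
    "customer": "asha",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Documents round-trip naturally through JSON (BSON in MongoDB itself).
doc = json.loads(json.dumps(order_doc))
```

Embedding pays off when the items are always read with their order; splitting into separate collections (like the relational shape) fits better when items are queried on their own.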
Which Database for What Application Purpose?
- High reads consistent data - RDBMS
- High writes low reads - HBase, Cassandra
- Document-based storage (multiple key-value pairs, key-array pairs, or even nested documents) - MongoDB, CouchDB
- Key-Value stores are similar to maps or dictionaries where data is addressed by a unique key - Redis
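As an illustration of the key-value idea, here is a toy in-memory store mimicking Redis-style `SETEX`/`GET` session handling with expiry. The class and key names are invented; real session management would use an actual Redis client:

```python
import time

class SessionStore:
    """Toy key-value store: values addressed by a unique key, with a TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp)

    def setex(self, key, ttl_seconds, value):
        # Like Redis SETEX: store the value and schedule its expiry.
        self._data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.time() >= expires:
            del self._data[key]  # lazy expiry on access, as Redis may do
            return None
        return value

store = SessionStore()
store.setex("session:u42", 1800, {"user_id": 42, "role": "admin"})
```

The unique key (here a hypothetical `session:<user>` convention) is what prevents duplicates - exactly the design question raised above for key-value schemas.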
Above all, cost also plays a key role. Knowing what to choose based on size, data growth, and access patterns is key to deciding the type of database to implement.
RDBMS, key-value, columnar, graph, document collections - all these forms of databases will co-exist :)
Data Stack
Ref - Link
Modern Analytics Stack