"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label columnar databases. Show all posts
Showing posts with label columnar databases. Show all posts

March 18, 2020

Data Perspectives

Different perspectives to decide on choosing the right database?
  • Strict data types - Schema on write
  • Schemaless data - Schema on read
  • Read-only immutable data
  • Eventually consistent data
  • Dirty read vs Committed data
  • Multi-version concurrency control
  • Replicate data based on logs
  • Replay committed logs
  • Data sharding
  • High reads consistent data - RDBMS
  • High writes low reads - HBase, Cassandra
  • Document-based storage - Mongodb, Couchdb
  • CAP, ACID Properties
Things I Wished More Developers Knew About Databases

Almost similar and deep-dive techniques from the tweet conversation
  • Read heavy vs write heavy. Insert vs updates. Vaccuuming
  • Replication or not, transaction logging, why indexes matter, performance tuning, i/o scheduler, unicode, gender isn't binary
  • Locks, cache effects, isolation levels
  • IO bound vs network bound especially in the situation of replication, scaling strayegy, concurrency vs distributed.
  • Materialized views, and the dangers of invalidating them unexpectedly.
  • Connection pool, scaling techniques to handle distributed application / system, improve performance, optimization of query etc.
  • I'd be interested in how this applies to a distributed system. Concurrency (specifically MVCC), connections, DB threading, backpressure handling
  • Disk storage implementation and optimization

Keep Thinking!!! 

January 01, 2020

NOSQL Internals and Design Practices

Objective – The objective of this paper is to analyze NoSQL internals from RDBMS developer perspective and provide design guidelines for NoSQL Applications
Analysis
RDBMS – RDBMS came into the picture to ensure the ACID properties are maintained and there is a single version of the truth. RDBMS plays a critical role in OLTP applications (Banking, Finance, and Payment) domains.
Database design– Database design is implemented to ensure it's normalized and avoid data redundancy. Primary Keys, Indexes are created to ensure query plans use the indexes to filter required rows and fetch required results within the shortest intervals.

Query Execution – Data is typically stored in a B-Tree format. The data is organized physically in the form of clustered indexes. This is the reason search based on the primary key is quick compared to any other non-indexed columns. Database Engine implements several other operations to optimize the execution plan by leveraging indexes, statistics, and partitioning, Non-clustered indexes. Depending on the query plan join operators, sort operators are applied to produce the execution plan. The execution plan is reused if it already exists in memory.
This paper was very useful to understand OLTP Internals. Reposting notes from my blog post
  • WAL – Changes are written in log and committed to disk when the checkpoint is reached
  • Buffer Manager – cache for data fetched / recently used
  • Two-Phase locking – Optimistic/pessimistic locking depending on isolation levels
  • Concurrency control – Based on isolation levels
NoSQL Databases 
Similar to above OLTP aspects, There are few papers that describe designing NOSQL apps for Read heavy / Write Heavy Apps. This paper was very useful to understand NoSQL perspective of designing apps in columnar databases

For Heavy Writes
  • Tall Skinny Tables
  • Consolidate data into single columns
For Heavy Reads
  • Fewer column families
  • Use bloom filters
There are multiple NoSQL databases (Key-Value, Document-based, Columnar Databases, etc...). 

Happy Learning!!!

June 17, 2012

Columnar Database Query Execution Basics

This post is to learn answer for How Query execution works in a Columnar database.


Row based storage - This is B-Tree based storage. If we have defined Clustered index on the table, Data would be ordered in based on the Clustered index key. There are a bunch of posts in msdn / blogs to explore more on this. Early Materialization happens in case of row based storage.


Coming to Columnar storage, How it works. From link Query execution in columnar database.

For column based Storage
  • Late Materialization (Load all required columns and then perform required join / filters)
  • Invisible join is another interesting concept
This link  was very useful about columnar databases. Below slide from tutorial is very good on fetching required columns and apply required filters for column store.


Happy Learning!!!