Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): columnar databases

Showing posts with label columnar databases. Show all posts

March 18, 2020

Data Perspectives

Different perspectives to decide on choosing the right database?

Strict data types - Schema on write
Schemaless data - Schema on read
Read-only immutable data
Eventually consistent data
Dirty read vs Committed data
Multi-version concurrency control
Replicate data based on logs
Replay committed logs
Data sharding
High reads consistent data - RDBMS
High writes low reads - HBase, Cassandra
Document-based storage - Mongodb, Couchdb
CAP, ACID Properties

Things I Wished More Developers Knew About Databases

Want to Debug Latency?

I wrote an initial draft on the things I wished more developers knew about DBs. It touches a variety of topics: write skews, external consistency, clock skews, database-generated IDs, nested transaction issues, caches & more. Is there anything you wished more devs knew about DBs?
— Jaana Dogan (@rakyll) April 13, 2020

Almost similar and deep-dive techniques from the tweet conversation

Read heavy vs write heavy. Insert vs updates. Vaccuuming
Replication or not, transaction logging, why indexes matter, performance tuning, i/o scheduler, unicode, gender isn't binary
Locks, cache effects, isolation levels
IO bound vs network bound especially in the situation of replication, scaling strayegy, concurrency vs distributed.
Materialized views, and the dangers of invalidating them unexpectedly.
Connection pool, scaling techniques to handle distributed application / system, improve performance, optimization of query etc.
I'd be interested in how this applies to a distributed system. Concurrency (specifically MVCC), connections, DB threading, backpressure handling
Disk storage implementation and optimization

Keep Thinking!!!

January 01, 2020

NOSQL Internals and Design Practices

Objective – The objective of this paper is to analyze NoSQL internals from RDBMS developer perspective and provide design guidelines for NoSQL Applications
Analysis
RDBMS – RDBMS came into the picture to ensure the ACID properties are maintained and there is a single version of the truth. RDBMS plays a critical role in OLTP applications (Banking, Finance, and Payment) domains.
Database design– Database design is implemented to ensure it's normalized and avoid data redundancy. Primary Keys, Indexes are created to ensure query plans use the indexes to filter required rows and fetch required results within the shortest intervals.

Query Execution – Data is typically stored in a B-Tree format. The data is organized physically in the form of clustered indexes. This is the reason search based on the primary key is quick compared to any other non-indexed columns. Database Engine implements several other operations to optimize the execution plan by leveraging indexes, statistics, and partitioning, Non-clustered indexes. Depending on the query plan join operators, sort operators are applied to produce the execution plan. The execution plan is reused if it already exists in memory.

This paper was very useful to understand OLTP Internals. Reposting notes from my blog post

WAL – Changes are written in log and committed to disk when the checkpoint is reached
Buffer Manager – cache for data fetched / recently used
Two-Phase locking – Optimistic/pessimistic locking depending on isolation levels
Concurrency control – Based on isolation levels

NoSQL Databases

Similar to above OLTP aspects, There are few papers that describe designing NOSQL apps for Read heavy / Write Heavy Apps. This paper was very useful to understand NoSQL perspective of designing apps in columnar databases

For Heavy Writes

Tall Skinny Tables
Consolidate data into single columns

For Heavy Reads

Fewer column families
Use bloom filters

There are multiple NoSQL databases (Key-Value, Document-based, Columnar Databases, etc...).

Happy Learning!!!

June 17, 2012

Columnar Database Query Execution Basics

This post is to learn answer for How Query execution works in a Columnar database.

Row based storage - This is B-Tree based storage. If we have defined Clustered index on the table, Data would be ordered in based on the Clustered index key. There are a bunch of posts in msdn / blogs to explore more on this. Early Materialization happens in case of row based storage.

Coming to Columnar storage, How it works. From link Query execution in columnar database.

For column based Storage

Late Materialization (Load all required columns and then perform required join / filters)
Invisible join is another interesting concept

This link was very useful about columnar databases. Below slide from tutorial is very good on fetching required columns and apply required filters for column store.

Happy Learning!!!

Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database)

March 18, 2020

Data Perspectives

January 01, 2020

NOSQL Internals and Design Practices

June 17, 2012

Columnar Database Query Execution Basics

About Me

What is your Expertise

Search This Blog

Git Code Repository

Translate

About Me and Disclaimer

Labels

Data Science Good Reads

Cloud, Datacentre, BigData and NOSQL Blogs

SQL Links

Archecture Blog List

Programming Problems

Startup - Reads

Perl-Python-Ruby-Linux-Oracle

Management + Leadership Blogs

Research Papers & Podcasts

My Wordpress

Interesting Reads

Useful Links - C# and .NET

Java, Selenium, QTP and Test Tools Learning

Agile Testing

Reverse Logistics Reads

Biztalk Blogs

MS BI Links

Process - Learnt it :)

Usability Guidelines - Building Better Sites

.NET Test Tools and Other Interesting Reads

Review Checklist

Blog Archive

Live Traffic

Total Pageviews

Popular Posts