"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 20, 2023

Why RDBMS is more CA than CP

The CAP theorem states that a distributed system can guarantee at most two of three properties at the same time: Consistency, Availability, and Partition tolerance.

My perspective with some context

Example - Why CA matters more than P for online banking

  • We cannot allow a deduction against an incorrect balance
  • Every deduction has to be made on the most recently committed data
  • It has to be real-time
  • Data may be partitioned by region/base account/branch to reduce the size of each DB
  • Reads may be pointed to some other system while writes go to the target system, to avoid overhead
  • Ideally, there is one main system and a backup copy of it
  • The main property met here is consistency (always show your most recent info)
  • The data is mainly Consistent and Available, as in an OLTP kind of setup
  • This DB may not be available across regions unless you are an international customer
  • In this setup, you can achieve consistency and availability in normal operation. But when the same person goes to another place/country, network latency may come into effect. We cannot replicate the same copy in real time if there are data constraints
  • Consider what is mostly met. You can still achieve partition tolerance with BCP, replication, or some other options, but everything you add to the system carries an overhead
  • Latency also depends on storage type, query conditions, and indexes
  • Two-phase commit or write-ahead logging can be used to recover or retry after a timeout
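The first two points (no deduction against an incorrect balance, always on committed data) can be sketched with a single transaction. This is a minimal illustration using Python's built-in sqlite3 module rather than a real banking setup; the `accounts` table and the `deduct` helper are assumptions made up for this example.

```python
import sqlite3


def deduct(conn, account_id, amount):
    """Deduct `amount` inside one transaction; return True only if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE id = ? AND balance >= ?",
                (amount, account_id, amount),
            )
            if cur.rowcount != 1:
                # No row matched: unknown account or not enough money.
                raise ValueError("insufficient funds or unknown account")
        return True
    except ValueError:
        return False


# Hypothetical schema, in memory only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

first = deduct(conn, 1, 70)   # succeeds, balance becomes 30
second = deduct(conn, 1, 70)  # refused, balance stays 30
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(first, second, balance)
```

The balance check and the update happen in one statement, so the deduction is either applied to the committed balance or not applied at all.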

A CA database can be built as a relational database (e.g. PostgreSQL), possibly deployed to multiple nodes using replication. CA systems are usually single-node systems. Databases that adhere to ACID properties focus on consistency and represent the traditional approach.

These are all points a CA system needs to consider before getting into 'P'. When you copy or divide data and store it in several places, you are accountable for managing all of it: how recent each copy is, and what to do when a copy is not available.

In practice, a distributed system always needs to be partition tolerant, which leaves us to choose one property from Consistency or Availability when a partition happens. Hence, there is a trade-off between consistency and availability.
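A toy model may make this trade-off concrete. The sketch below is not how any real database works; `Replica`, `Cluster`, and the "CP"/"AP" mode flag are hypothetical names invented for this illustration. Two replicas hold one value, and during a partition the cluster either refuses the write (stays consistent) or accepts it locally (stays available).

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = 0


class Cluster:
    """Two replicas of one value; `partitioned` means they cannot reach each other."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # Healthy network: both copies are updated together.
            replica.value = other.value = value
            return True
        if self.mode == "CP":
            # Refuse the write: copies stay identical, availability is lost.
            return False
        # AP: accept the write locally, the copies now disagree.
        replica.value = value
        return True


cp = Cluster("CP")
cp.write(cp.a, 10)
cp.partitioned = True
cp_ok = cp.write(cp.a, 20)

ap = Cluster("AP")
ap.write(ap.a, 10)
ap.partitioned = True
ap_ok = ap.write(ap.a, 20)

print(cp_ok, cp.a.value, cp.b.value)  # write refused, copies identical
print(ap_ok, ap.a.value, ap.b.value)  # write accepted, copies diverge
```

In the AP case the two copies have to be reconciled once the partition heals, which is the overhead that comes with choosing availability.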

There is much more to decide beyond CAP. Different perspectives to consider when choosing the right database:

  • Strict data types - Schema on write
  • Schemaless data - Schema on read
  • Read-only immutable data
  • Eventually consistent data
  • Dirty read vs Committed data
  • Multi-version concurrency control
  • Replicate data based on logs
  • Replay committed logs
  • Data sharding
  • Consistency options (2-phase-commit, Pessimistic locking)
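Two items in the list above, replicating data based on logs and replaying committed logs, can be sketched in a few lines. This is a simplified illustration and not any real database's log format: `LogEntry`, its `lsn` field, and the `replay` function are assumptions for this example, and an entry flagged as uncommitted is simply ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    lsn: int         # log sequence number, assumed strictly increasing
    key: str
    value: int
    committed: bool  # simplification: one flag instead of separate commit records


def replay(log, state=None, applied_lsn=0):
    """Apply committed entries newer than `applied_lsn`; return (state, last_lsn)."""
    state = dict(state or {})
    for entry in sorted(log, key=lambda e: e.lsn):
        if entry.lsn <= applied_lsn:
            continue  # already seen, so replaying the same log twice is harmless
        if entry.committed:
            state[entry.key] = entry.value
        applied_lsn = entry.lsn
    return state, applied_lsn


log = [
    LogEntry(1, "acct:1", 100, True),
    LogEntry(2, "acct:1", 30, True),
    LogEntry(3, "acct:2", 500, False),  # never committed, must not reach the replica
]
replica, lsn = replay(log)

# Later the primary ships a longer log; only the new entry is applied.
replica2, lsn2 = replay(log + [LogEntry(4, "acct:2", 50, True)], replica, lsn)
print(replica, lsn)
print(replica2, lsn2)
```

Tracking the last applied sequence number is what lets a replica catch up after being unreachable without applying the same change twice.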

Partition tolerance is understood as the ability of the system to continue operation in the presence of network partitions. These occur if two or more "islands" of network nodes arise, temporarily or permanently, which cannot connect to each other. Some also understand partition tolerance as the ability of a system to cope with the dynamic addition and removal of nodes.
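The "islands" idea can be shown as a small graph problem: given the links that still work, group the nodes that can reach each other. The node names and both helper functions below are hypothetical. The majority rule in `has_quorum` is one common way to decide which island may keep accepting writes, added here as an assumption and not something this post prescribes.

```python
def islands(nodes, links):
    """Group nodes into connected components, given the links that still work."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over reachable nodes
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


def has_quorum(group, cluster_size):
    """True if the island holds a strict majority of the whole cluster."""
    return len(group) > cluster_size // 2


nodes = ["n1", "n2", "n3", "n4", "n5"]
links = [("n1", "n2"), ("n2", "n3"), ("n4", "n5")]  # the n3-n4 link has failed

groups = islands(nodes, links)
writable = [sorted(g) for g in groups if has_quorum(g, len(nodes))]
print([sorted(g) for g in groups])
print(writable)
```

With five nodes split three and two, only the island of three has a majority, so at most one side can continue taking writes.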

Every DB is built with some tradeoffs.

Keep Thinking!!!
