"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label Big Data. Show all posts
Showing posts with label Big Data. Show all posts

March 19, 2020

Distributed Systems - Session #3 - Aurora

Sometimes I felt not connected to the session. Needs a lot of focus and patience to stay connected and focused :)


Key Summary points
  • Amazon early offering EC2
  • Rented out VMs to customers
  • VMM (Virtual Machine Monitors) that run/manage EC2 instances
  • EC2 good for stateless web servers
  • S3 - Scheme for storing large chunks of data (Periodic Snapshots)
  • Disks for EC2 instances - Fault Tolerance (EBS)
  • EBS (Elastic Block Store) - Looks for EC2 instances as it is a harddrive
  • Databases on EBS sends a large volume of data over the network
  • Amount of writes on Network Storage System
  • CPU / Disk space consumption
  • EC2 / EBS are in same availability zone
  • Transaction & Crash Recovery
  • Transaction (Sequence of operations / commands / atomic / ex- bank transfer money between accounts)
  • Reads page from disk
  • Make Changes in local cache
  • Then write changes to disk
  • Log entries describe the transaction
  • Three log records - Modify Operation, Old Value, New Value
  • Aurora is based on MySQL
  • RDS (Database replicated in multiple availability zones)
  • All the transactions mirrored to other databases (EBS Servers)
  • Multiple copies managed and updated to keep everything in sync
  • Read / Write Quorum will overlap 
  • Voting does not work to read from which server
  • These systems have version numbers
  • Readers takes the ones with highest version number
  • Split database into replicas
  • Data Sharding
  • Data across protection groups
Happy Learning!!!

March 18, 2020

Distributed Systems - Session #2




I paused it a lot as I didn't really get involved much but finally managed to complete it.

Key Lessons
  • Go lang examples for threading, locking, RPC, Typesafe and memory safe, Garbage Collected
  • Threads - Tools to manage concurrency in programs
  • Stacks are within address space of the program
  • I/O Concurrency - Overlapping of progress of different activities wait ing / executing
  • Parallelism - Parallelize CPU / IO cycles / routines
  • Process is a single program / single address space. Inside process there are multiple threads
  • Process -> memory area -> routines sit inside the process
  • Process implemented by the operating system
  • Thread challenges - Sharing data
  • Mutex / Locks for shared data
  • Data Access - Managing Locks / Deadlocks / Starvation / Blocking
  • Channels (Go Lang) - Send data between threads
  • WaitGroup, Sync.Cond
  • Webcrawlers design for parallel processing using threads
  • Handling concurrency / multiple parallel threads / optimum network capacity utilization
  • Remember doing SSIS ETL parallel tasks for Data pull
A multi-threaded Web crawler implemented in Python
Crawler
Multi-Threaded Crawler in Python

Happy Learning!!!

Data Perspectives

Different perspectives to decide on choosing the right database?
  • Strict data types - Schema on write
  • Schemaless data - Schema on read
  • Read-only immutable data
  • Eventually consistent data
  • Dirty read vs Committed data
  • Multi-version concurrency control
  • Replicate data based on logs
  • Replay committed logs
  • Data sharding
  • High reads consistent data - RDBMS
  • High writes low reads - HBase, Cassandra
  • Document-based storage - Mongodb, Couchdb
  • CAP, ACID Properties
Things I Wished More Developers Knew About Databases

Almost similar and deep-dive techniques from the tweet conversation
  • Read heavy vs write heavy. Insert vs updates. Vaccuuming
  • Replication or not, transaction logging, why indexes matter, performance tuning, i/o scheduler, unicode, gender isn't binary
  • Locks, cache effects, isolation levels
  • IO bound vs network bound especially in the situation of replication, scaling strayegy, concurrency vs distributed.
  • Materialized views, and the dangers of invalidating them unexpectedly.
  • Connection pool, scaling techniques to handle distributed application / system, improve performance, optimization of query etc.
  • I'd be interested in how this applies to a distributed system. Concurrency (specifically MVCC), connections, DB threading, backpressure handling
  • Disk storage implementation and optimization

Keep Thinking!!! 

March 12, 2020

Distributed Systems - Session #1

Key Notes
  • Storage, Big Data, File Sharing
  • The infrastructure that requires more than one computer
  • High Performance, Parallelism
  • Fault Tolerance - Two computer does the same things. One Fails another picks up. Availability / Recoverability, Replication
  • Systems are inherently physically distributed
  • To achieve security goals
  • Handle unexpected failure patterns (Partial Failures)
  • Challenges are Concurrency, Partial Failures
  • Academic Curiosity -> Real-world Examples
  • Lectures, Research papers for ideas, implementation details, labs, exams
  • Map Reduce - Map Function on each of the input files, Obvious Parallelism available. The output is a list of Key-Value Pairs. Maps -> Intermediate Output -> Reducers. Collects all instances, all maps. 

Happy Learning!!!

February 26, 2020

IoT Architecture

  • Edge - Lightweight protocols for Machine to Cloud communication - MQTT (Lightweight pub/sub) - Immutable data - Read-Only - Data cannot be modified
  • Cloud Entry - Large scale data ingestion to consume data - Kafka - (Distributed Log processing)- Immutable data - Data cannot be modified
  • Cloud Streaming - Real-time data analysis to report and alert - Spark - (RDDs / ML ) - Immutable data - Data cannot be modified
  • Store and Analyze - Reports on data, Completed transactions - Postgres, RDBMS - Do the Remaining CRUD
  • ML on Edge, ML on Streaming data, ML on Stored data (completed transaction)
Happy Learning!!!

December 23, 2019

Difference between SQL and NOSQL Systems

Reposting from my two-year-old Quora answer

The Key differences between them lies in the understanding CAP theorem
  • Consistency
  • Availability
  • Partition Tolerance
In layman terms. SQL systems ex-RDBMS will adhere ACID properties (Atomicity, Consistency, Isolation, Durability).
  • The datatypes, schema are predefined, You cannot store non-matching datatypes
  • To avoid dirty data, systems enforce isolation levels that govern only committed data is read (Consistency)
  • Only latest records are available, records at that point in time are not available
  • Banking Systems, ordering systems where data needs to consistent will be mostly SQL based systems where consistency is important
No-SQL systems (Not Only SQL)
  • The schema is not tightly governed, its flexible you can store different datatypes in same columns
  • These may be geographically distributed where data may be synced and eventually be consistent end of day not realtime
  • They also support point in time data, data values at a point in time can also be looked up
  • Where there is no requirement for consistency we can achieve other 2 Availability and partition tolerance
  • Since some of the ACID properties are compromised you will have high availability of this systems
It is more to do with business need to decided SQL or SQL based storage.

Happy Learning!!!

September 26, 2019

The Curse of Cheap Data Plans

Many time I wonder cheap data plans are a curse, not a boom. I see more often these days
  • More time I personally spend on Youtube
  • Forwards of TikTok/ Halo Status videoes
  • Rechecking same repetitive news everywhere
I have lost a lot of sleeping hours. Google Youtube recommendation is the most unfair recommendation. Providing extremely similar recommendations. There is no mix of different sources. Sometimes tailored information is not what we need, we need the raw data.

Too much of personalization is a curse. You will lose yourself biased on your perspectives. Sometimes raw information makes more sense than tailored information.

Escape the Web!!!

September 14, 2019

New Age BI = Kimball + Big Data for unstructured data + AI capabilities

Kimball and Inmon are driven off of structuring data

Inmon
  • Bill Inmon is centralized DW proponent
  • Inmon defines data warehouse as a centralized repository for the entire enterprise
  • Data warehouse is at the center of the Corporate Information Factory (CIF)
Kimball
  • Kimball defines data warehouse as “A copy of transaction data specifically structured for query and analysis”
  • Kimball kept getting more correct due to global corporations and their need for distributed DW
  • Kimball is the distributed Data Marts proponent.
  • Kimball defines business processes quite broadly.
Current Status
  • Both those concepts are currently changing based with BigData, Inmem Processing, Columnar
  • databases and machine intelligence
  • We need both tactical plus bigdata plus analytics
Today BI = DW + AI capabilities + Big Data
  • BI & Analytics is spread across multiple components; we cannot invest in centralized DW as systems need to be agile enough to accommodate changing data needs
  • Knowledge discovering is an ongoing process
  • Data is structured, unstructured. Strategically data is changing.
  • Making it smaller datamarts is more manageable
Kimball + Big Data for unstructured data + AI capabilities = New Age BI

The goal of analytics is to move away from historical to real-time recommendations

New Age #BI = #Kimball + #BigData for unstructured data + #AI capabilities
  • #Kimbal - Build Datamarts to slide and dice through data
  • #BigData - Find your digital footprint, Look at realtime trends/sentiments
  • #AI - Use AI to find insights from your customer, users, demands, patterns
Happy Learning!!!

July 12, 2019

Data Modelling for Workloads

  • Denormalized way to suit query patterns
  • Data Properties - Concurrent writes 
  • Metadata - No continuous updates (Just store data, no update involved)
  • Indexes - Single Field, Multikey indexes, Text Indexes, Column Store Indexes
  • Sharding - Shardkey could be location_id (Distributed Storage)
  • Data Partitioning
  • Timeseries Info aggregation in storage level (NoSQL)
  • JSON Modelling - Collections in MongoDB, Establishing Hierarchy and Relationships
Happy Learning!!!

Day #262 - Data Modeling for Analytics Translators


Summary Notes
  • Flexible, Extensible, Governed Effectively
  • DW - Staging - RAW - Processing - Consumption
  • Data processed multiple times
  • Aggregated at the end
  • Operational Data Store
Schema on Write
  • Write data
  • Read data
  • Same Schema. Fixed Structure
Schema on Read
  • Apply schema when you read
  • Write once read many times
  • WORM
  • Bringing data separated by different business, data, databases
Example
  • Data arrived in JSON format
  • Add time_stamp to relate data source
  • Add Source_system
Canonical Model 
  • Repeating data for right reasons
  • Enrich with meta_data, canonical elements
  • Link Canonical elements, suppliers together, Provide unique_id
Data Governance
  • Data problems comes in mass scenarios
  • Reports Data Discrepancies
  • IT framework to manage Data Governance
  • Master Data Management (MDM is a technology which provides a 360 degree view of a user data coming from different sources)
  • Data Quality
  • Data Archival
  • Data Security
MDM
  • Source -> ETL (Clean, Standardize, Transform, MDM) -> Reports, DW, EDW
  • Rules Based
  • Metadata verification
  • Data Collection -> ETL -> Data Quality -> MDM -> DW
Data Quality API / Module
  • Add / Remove Business Rules
  • Field Level Validations against messages
  • Return Error codes or log for failures
  • Auditing and Reporting failed messages automatically
Happy Learning!!!

June 08, 2019

Day #259 - Setting up Kafka on my Ubuntu - Big Data Setup - Part IV

Finally Setting up all Big Data Tools in my Linux Setup. Rough Steps and my reference notes




Happy Data Thinking!!!!

April 02, 2019

Day #232 - Kafka + Spark Integration - Big Data Setup - Part I

Experimenting with Kafka and Spark using Pyspark

Example 1 - Kafka Publish - Consume
Example 2 - Kafka Publish - Spark Consume

Happy Learning!!!

March 31, 2019

Day #230 - What is your Data Story? - Big Data Setup - Part III

Tools evolve, patterns and architecture vary as we progress. To build your data story you need to think in certain perspectives

The story of Data in Motion
  • Streaming data (Incoming data)
  • Passing data (Data between certain time interval)
  • Transactional data (Current data in operation)
  • Historical data (Transaction completed)
The most famous example is the e-commerce segment. The data story evolves as
  • Streaming data (Placing orders) - Kafka
  • Passing data (Checking orders received is past 30 minutes window) – Spark
  • Transitional data (Orders placed) - HBase / NoSQL / any RDBMS based on business need
  • Completed data - Completed orders move it to hdfs, build hive table for further analysis.
Data Science Role in the Story
  • Real Time Machine Learning on Spark (30 Minute Internal Data, Clustering Sales order to group them into similar clusters, Clustering orders based on sellers and products etc..) to understand segmentation of data at that window interval
  • Perform Machine Learning on the Historical Completed Data (Recommendations, Forecast, Predictions etc.)
The key tools summary
Spark
  • Spark Streaming – Real-time querying, Load RDD data in RAM, keep it until you are done, Data is cached in RAM from disk for iterative processing. 
  • RDD (Resilient distributed datasets). RDD - Read Only collection of objects across machines
  • Spark SQL – Schema / SQL
  • Immutable data is always safe to share across multiple processes as well as multiple threads
  • Machine Learning – ML Lib
  • Graph Processing – Graphx
Kafka
  • At the heart of Apache Kafka sits a distributed log
  • The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file.
  • When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log.
  • Data is immutable. When you read messages from Kafka, Data is copied directly from the disk buffer to the network buffer
  • Data organized in topics. Producers write data to brokers, Consumers read data from brokers
HBASE
  • Low Latency, Consistent, best suited for random read/write big data access
  • NOSQL database hosted on top of HDFS. Columnar based Database
  • HBase uses HDFS as its data storage layer, this takes care of fault tolerance, scalability aspects
Hive
  • Targeted for Analytics
  • Natural choice for SQL Developers is Hive 
  • ETL + DW (data summarization, query and analysis)
Pig
  • Scripting language for analytics queries
Considerations for RDBMS Vs NOSQL
  • Performance - Latency tolerance, how slow my queries can run for huge data sets
  • Durability - Data loss tolerance when database crashes losing in-memory or Lost transactions tolerance
  • Consistency - Weird results tolerance (Dirty data tolerance)
  • Availability - Downtime tolerance
The lambda architecture is the reference for the fast and batch layer, With Machine learning and more tools evolving it would be helpful to think in terms of Data Story in an end to end perspective and fit in tools for your need


The tools remain the same but mapping is different across different cloud providers 



Data is the same but we have progressed further to query data in motion. Tools evolve but your data story remains the same. Come out of tools let's build a data story and connect the dots.

Old Process – Model – Collect - Analyze Data
New Process – Collect – Analyze Data in Motion – Build Model (Paradigm Shift)

My Whiteboard


More Reads
Pattern: Database per service
The Hardest Part About Microservices: Your Data

Update - Oct 18th 2020

The new wave of current products, cloud providers is impressive. Reusing from link, Post

Architecture



BI Architecture



Data Processing



AI Architecture


Time to write your own Data Story!!!

October 05, 2018

Deep Dive PySpark Examples - Big Data Setup - Part II

After experimenting a bit of pyspark I feel Its much better to handle with R / Python. Most of things we can achieve are repetitive between R /Python / Spark / SQL.

  • Data Pipeline tasks at DB Level
  • One Hot Encoding also can done with basic TSQL Code
  • While working in NLP it makes sense to use TF-IDF Vectorizers

Happy Learning!!!


September 01, 2017

Exploring Analytics in Microsoft Azure

I am working on Microsoft Azure platform on a BI cloud solution. Some of the key components I worked recently are
  • Azure Data Factory
  • Azure Data Lake
  • Azure SQL Data warehouse
  • Power BI on top of Data warehouse for reporting
I had earlier compared different stacks Microsoft / Google / Amazon.

The high level workflow for cloud based BI Solution and key components are

Step #1 - Moving Data from In premises to Cloud
Here data management gateway is installed on the in-premises machines, Pipelines are created in Azure Data factory to move data from In-premises to Azure Data lake

Step #2 - Azure Data Factory
ADF provides platform for data ingestion, Consuming high volumes of data. This experience setting up pipelines has some similarities and differences compared to SSIS. The key differences are
  • Everything is JSON based
  • Setting up Connections
  • Defining input and output data formats in datsets
  • Input and Output datasets also define the storage locations
  • Defining Pipeline logic which includes, logic, input, output datasets, scheduling for pipeline
  • This is bit straight forward but there is some learning with the tool, configuration properties
Step #3 - Azure Data lake
Azure Datalake is for storing data (RDBMS / No-RDBMS) data, If we have to integrate data from MSSQL, MYSQL for a realtime processing from two sources, We can leverage data lake to store and consolidate it later. The data stored in Datalake are referenced as external tables in AZURE Sql Datawarehouse

Step #4 - Web Application
All the references of data movement from Datalake and connectivity to Datawarehouse is managed by Access control leveraged with a Azure web app. The security aspect is well managed in Azure infrastructure

Step #5 - Data Consolidation into SQL Datawarehouse
The external tables referenced in Datalake can be referenced, queried in TSQL format and data loaded in Azure Datawarehouse tables. This is the location of fact and dimension tables that would power our datawarehouse. This could be done by stored procedures.

Step #6 - Power BI reporting
We have completed Data loading, data consolidation. The next is Power BI. PowerBI has the most power offering for web / mobile platforms. This is convenient and easy to use. The extended Analytics / R Support / Machine Library support also makes it suitable to run both Business Intelligence / Machine Learning solutions.

Security aspects of this architecture is well handled with Firewall, IAM access as needed. This seems very stable even some of the components are constantly updated. This is high level architecture explanation, We will look into To-do exercises in coming weeks.

Happy Learning!!!

January 07, 2015

Databases - IOT - CES

CES notes on IOT had a interesting tag line posted in MEMSQL blogpost

Tag line copied from the post


Also, vast landscape of DB products in multiple categories (RDBMS, NOSQL, In-memory, Hadoop, Stream processing) check-out 451 research paper. Depending on the application needs you can identify top products to evaluate / get started


NoSQL LinkedIn Skills Index – December 2014

Current State

Happy Learning!!!

October 10, 2014

HBase Overview Notes

Limitations of Hadoop 1.0
  • No Random Access --> Hadoop for more batch access (OLAP)
  • Not suitable for Real-time Access
  • No Update - Access Pattern is WORM (Write Once Read Multiple Times Hadoop best suited)
Why HBase
  • Flexible Schema Design --> Add a new column when a row is added
  • Multiple versions of a single cell (Data)
  • Columnar storage
  • Cache columns at client side
  • Compression of columns
Read  v/s Write

  • For Availability (Compromise on Write) vs Consistency (Compromise on Read)
Hbase
  • NoSQL Class on Non-Relational Storage Systems
  • In RDBMS it is Rowkey based allocations, HBase it is columnar storage
  • Hbase needs HDFS for replication
  • ZooKeeper - Taking all requests from client. Client will communicate from zookeeper Client -> ZooKeeper -> HMaster
  • Region Server - It Serves the region. Region Server processor runs on slaves (Data Nodes)
Happy Learning!!!

October 09, 2014

Pig Overview Notes

Pig
  • Primarily for semi structured data
  • So called 'Pig' as it processes all kinds of data
  • Pig is data flow language not a procedural language
  • Map Reduce - Java Programmers, Hive - for TSQL folks, Pig (Rapid Prototyping & increased productivity)
  • Pig is on client side, need not be on cluster
  • Execution Sequence - Query Parser -> Semantic Checking -> Logical Optimizer (Variable level) -> Logical to physical translator -> Physical to M/R translator -> MapReduce Launcher
  • Ping Concepts - Map - array, Tuple - ordered list of data ,Bag - Unordered collection of tuple
  • Pig - for client side access, Hive will work only within cluster, semi structured data
  • Hive - Best suited for SQL style analytics, structured data
  • MR - Audio Video Analytics Map Reduce Approach is the only option

Happy Learning!!!

October 08, 2014

Hive Overview Notes

  • Data Warehousing package built on top of hadoop
  • Managing and querying structured data
  • Apache Derby embedded DB used by Hive
  • metastore_db folder for persistence of data
  • Suitable for WORM - Write Once Read Many Times Access Pattern
  • Core Components are Shell, Metastore, Execution Engine, Compiler (Parse, Plan, Optimize), Driver
  • Tables can be created as Internal Tables, External Table (Pointing to external file)
  • When Internal Tables are dropped schema + data is dropped. For external referencing tables only Schema is dropped not data. Both Internal and External tables reside in HDFS
  • Data files for created tables would be available in location /user/hive/warehouse
  • Partitioning in Hive - Hash Value % Number of buckets - that particular row will go into that bucket
  • Partition table should always be an Internal Hive Table
Happy Learning!!!

October 07, 2014

Map Reduce Internals

Client Submits Job. Job Tracker does the splitting, scheduling Job

Mapper
  • Mapper runs the business logic (ex- word counting)
  • Mapper (Maps your need from the record)
  • Record reader provides input to mapper in key value format
  • Mapper Side Join (Distributed Caching)
  • Output of mapper (list of keys and values). Output of mapper function stored in Sequence file
  • Framework does splitting based on input format, Default is new line (text format)
  • Every row / Record will go through map function
  • When there is a data split (row) is split between two 64MB Blocks. That particular row would be merged for complete record and processed
  • Default block size in Hadoop 2.0 is 128MB
Reducer
  • Reducer will poll it, job tracker will inform what all nodes to poll
  • Default number of reducer is 1. This is configurable
  • Multiple Reducers - Not possible - Multiple level MR jobs possible
  • Reduce Side join (Join @ Reducer Level)
Combiner
  • Combiner - Mini Reducer, Combiner before writing to disk, finds max value from data
  • Combiner is used when map job itself can do some preprocessing to minimize reducer workload
Partitioner
  • Hash Partitioner is default partitioner
  • Mapper -> Combiner -> Partitioner -> Reducer (For multi-dimension, 2012-max sales by product, 2013, max sales by location)
Happy Learning!!!