Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Big Data


March 19, 2020

Distributed Systems - Session #3 - Aurora

At times I did not feel connected to the session. It takes a lot of focus and patience to stay engaged :)

Key Summary points

Amazon's early offering: EC2
Rented out VMs to customers
VMM (Virtual Machine Monitors) that run/manage EC2 instances
EC2 good for stateless web servers
S3 - Scheme for storing large chunks of data (Periodic Snapshots)
Disks for EC2 instances - Fault Tolerance (EBS)
EBS (Elastic Block Store) - Looks like a hard drive to an EC2 instance
Databases on EBS send a large volume of data over the network
Amount of writes on Network Storage System
CPU / Disk space consumption
EC2 / EBS are in same availability zone
Transaction & Crash Recovery
Transaction (Sequence of operations / commands / atomic / ex- bank transfer money between accounts)
Reads page from disk
Make Changes in local cache
Then write changes to disk
Log entries describe the transaction
Three log records - Modify Operation, Old Value, New Value
Aurora is based on MySQL
RDS (Database replicated in multiple availability zones)
All the transactions mirrored to other databases (EBS Servers)
Multiple copies managed and updated to keep everything in sync
Read / Write Quorum will overlap
Voting does not work for deciding which server's copy to read
Instead, these systems keep version numbers
Readers take the copy with the highest version number
Split database into replicas
Data Sharding
Data across protection groups
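
The quorum notes above (overlapping read / write quorums, readers taking the highest version number) can be sketched roughly as below. This is a toy in-memory model, not Aurora's implementation. The replica count `N = 3`, the quorum sizes `W = 2` and `R = 2`, and the names `Replica`, `quorum_write` and `quorum_read` are all assumptions made for illustration.

```python
import random

N, W, R = 3, 2, 2  # assumed sizes; R + W > N guarantees the quorums overlap


class Replica:
    """One copy of the data, tagged with a version number."""

    def __init__(self):
        self.version = 0
        self.value = None


def quorum_write(replicas, value, version):
    # Write to any W replicas; the remaining ones stay stale for now.
    for replica in random.sample(replicas, W):
        replica.version = version
        replica.value = value


def quorum_read(replicas):
    # Read any R replicas and keep the copy with the highest version.
    contacted = random.sample(replicas, R)
    newest = max(contacted, key=lambda r: r.version)
    return newest.value


replicas = [Replica() for _ in range(N)]
quorum_write(replicas, "balance=100", version=1)
quorum_write(replicas, "balance=80", version=2)
print(quorum_read(replicas))
```

Because every read set shares at least one replica with the latest write set, the reader always sees the newest value even though some replicas are stale.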

Happy Learning!!!

March 18, 2020

Distributed Systems - Session #2

I paused it a lot as I didn't really get involved much but finally managed to complete it.

Key Lessons

Go examples for threading, locking and RPC; Go is type safe, memory safe and garbage collected
Threads - Tools to manage concurrency in programs
Stacks are within address space of the program
I/O Concurrency - Overlapping the progress of different activities (waiting / executing)
Parallelism - Parallelize CPU / IO cycles / routines
Process is a single program / single address space. Inside process there are multiple threads
Process -> memory area -> routines sit inside the process
Process implemented by the operating system
Thread challenges - Sharing data
Mutex / Locks for shared data
Data Access - Managing Locks / Deadlocks / Starvation / Blocking
Channels (Go) - Send data between threads
sync.WaitGroup, sync.Cond
Web crawler design for parallel processing using threads
Handling concurrency / multiple parallel threads / optimum network capacity utilization
Remember doing SSIS ETL parallel tasks for Data pull
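
A minimal sketch of the crawler idea above, in Python rather than Go: worker threads fetch pages concurrently and a lock guards the shared set of visited URLs. The `FAKE_WEB` link graph and the `fetch` stand-in are hypothetical, so the example runs without a network.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": page -> links found on that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def fetch(url):
    # Stand-in for a network call; a real crawler would do I/O here.
    return FAKE_WEB.get(url, [])


def crawl(start, workers=4):
    visited = {start}
    lock = threading.Lock()  # guards the shared `visited` set

    def visit(url):
        new_links = []
        for link in fetch(url):
            with lock:  # the check-and-add must be atomic
                if link not in visited:
                    visited.add(link)
                    new_links.append(link)
        return new_links

    frontier = [start]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Pages of the current level are fetched concurrently.
            results = pool.map(visit, frontier)
            frontier = [link for links in results for link in links]
    return visited


print(sorted(crawl("a")))
```

Without the lock, two threads could both see a link as unvisited and fetch it twice.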

A multi-threaded Web crawler implemented in Python
Crawler
Multi-Threaded Crawler in Python

Happy Learning!!!

Data Perspectives

Different perspectives to decide on choosing the right database?

Strict data types - Schema on write
Schemaless data - Schema on read
Read-only immutable data
Eventually consistent data
Dirty read vs Committed data
Multi-version concurrency control
Replicate data based on logs
Replay committed logs
Data sharding
High reads consistent data - RDBMS
High writes low reads - HBase, Cassandra
Document-based storage - MongoDB, CouchDB
CAP, ACID Properties

Things I Wished More Developers Knew About Databases

Want to Debug Latency?

I wrote an initial draft on the things I wished more developers knew about DBs. It touches a variety of topics: write skews, external consistency, clock skews, database-generated IDs, nested transaction issues, caches & more. Is there anything you wished more devs knew about DBs?
— Jaana Dogan (@rakyll) April 13, 2020

Similar and deeper-dive topics from the tweet conversation

Read heavy vs write heavy. Inserts vs updates. Vacuuming
Replication or not, transaction logging, why indexes matter, performance tuning, i/o scheduler, unicode, gender isn't binary
Locks, cache effects, isolation levels
IO bound vs network bound, especially in the situation of replication, scaling strategy, concurrency vs distributed.
Materialized views, and the dangers of invalidating them unexpectedly.
Connection pool, scaling techniques to handle distributed application / system, improve performance, optimization of query etc.
I'd be interested in how this applies to a distributed system. Concurrency (specifically MVCC), connections, DB threading, backpressure handling
Disk storage implementation and optimization

Keep Thinking!!!

March 12, 2020

Distributed Systems - Session #1

Key Notes

Storage, Big Data, File Sharing
The infrastructure that requires more than one computer
High Performance, Parallelism
Fault Tolerance - Two computers do the same thing; if one fails, the other picks up. Availability / Recoverability, Replication
Systems are inherently physically distributed
To achieve security goals
Handle unexpected failure patterns (Partial Failures)
Challenges are Concurrency, Partial Failures
Academic Curiosity -> Real-world Examples
Lectures, Research papers for ideas, implementation details, labs, exams
MapReduce - A map function runs on each of the input files, so the parallelism is obvious. Its output is a list of key-value pairs. Maps -> Intermediate Output -> Reducers. Each reducer collects all instances of a key from all the maps.
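
As a rough single-process illustration of the map -> intermediate output -> reduce flow, here is a word count sketch. The function names `map_fn`, `reduce_fn` and `map_reduce` and the sample file contents are assumptions; a real framework would run the map calls on different machines.

```python
from collections import defaultdict


def map_fn(filename, text):
    # Emit one (word, 1) pair per word in the input file.
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    # Combine every value that any map emitted for this key.
    return word, sum(counts)


def map_reduce(files):
    intermediate = defaultdict(list)
    for name, text in files.items():  # each map call is independent
        for key, value in map_fn(name, text):
            intermediate[key].append(value)  # group values by key
    return dict(reduce_fn(k, v) for k, v in intermediate.items())


files = {"f1.txt": "big data big ideas", "f2.txt": "data in motion"}
print(map_reduce(files))
```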

Happy Learning!!!

February 26, 2020

IoT Architecture

Edge - Lightweight protocols for Machine to Cloud communication - MQTT (Lightweight pub/sub) - Immutable data - Read-Only - Data cannot be modified
Cloud Entry - Large scale data ingestion to consume data - Kafka - (Distributed Log processing)- Immutable data - Data cannot be modified
Cloud Streaming - Real-time data analysis to report and alert - Spark - (RDDs / ML ) - Immutable data - Data cannot be modified
Store and Analyze - Reports on data, Completed transactions - Postgres, RDBMS - Do the Remaining CRUD
ML on Edge, ML on Streaming data, ML on Stored data (completed transaction)

The confluence of blockchain and IoT in Supply Chain can improve the efficiency of the operations, as per @EqualOcean's research. Link > https://t.co/5nWv9EIqNh @antgrasso via @LindaGrass0 #SupplyChain #blockchain #SmartContracts #IoT #Tech #DigitalTransformation pic.twitter.com/bDmPz7Njz0
— C-Suite Tech Point (@CsuiteTechPoint) May 25, 2020

IoT ecosystems that stitch together sensor-based data with legacy systems and newer technologies hold tremendous promise for those engaged in moving goods. Link > https://t.co/Uk0IFVrVqs @DeloitteInsight via @antgrasso @antgrasso_IT #CloudComputing #AI #IoT #IIoT pic.twitter.com/FpUDIR9kgN
— Antonio Grasso (@antgrasso) May 21, 2020

mt @Fisher85M
cc: @MikeQuindazzi @antgrasso #SupplyChain 4.0 & #4IR

[@JacBurns_Comext]#MachineLearning #IoT #IIoT #sensors #CyberSecurity #bigdata #5G #Industry40 #mobile #disruption pic.twitter.com/BCH8SnEEkF
— Digital Retweet (@DigitalRetweet) May 9, 2020

Happy Learning!!!

December 23, 2019

Difference between SQL and NOSQL Systems

Reposting from my two-year-old Quora answer

The key difference between them lies in understanding the CAP theorem

Consistency
Availability
Partition Tolerance

In layman's terms, SQL systems (e.g. RDBMS) adhere to the ACID properties (Atomicity, Consistency, Isolation, Durability).

The datatypes and schema are predefined; you cannot store non-matching datatypes
To avoid dirty data, systems enforce isolation levels that ensure only committed data is read (Consistency)
Only the latest records are available; records as of an earlier point in time are not
Banking and ordering systems, where data needs to be consistent, are mostly SQL-based
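
A small sketch of atomicity using Python's built-in `sqlite3` module: a transfer either applies both updates or neither. The `accounts` table, the `CHECK (balance >= 0)` rule and the `transfer` helper are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit to dst was rolled back as well


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer violates the balance rule on the debit, so the credit that was already applied is undone and the balances stay as they were.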

No-SQL systems (Not Only SQL)

The schema is not tightly governed; it is flexible, and you can store different datatypes in the same column
These may be geographically distributed, where data is synced and becomes eventually consistent (for example by end of day) rather than in real time
They also support point-in-time data; data values at a point in time can also be looked up
Where there is no requirement for strict consistency, we can achieve the other two: Availability and Partition Tolerance
Since some of the ACID properties are compromised, these systems offer high availability

The choice between SQL and NoSQL based storage has more to do with the business need.

Happy Learning!!!

September 26, 2019

The Curse of Cheap Data Plans

Many times I wonder whether cheap data plans are a curse, not a boon. More often these days I see

More time I personally spend on Youtube
Forwards of TikTok / Halo Status videos
Rechecking same repetitive news everywhere

I have lost a lot of sleeping hours. Google's YouTube recommendations are the most unfair: they provide extremely similar suggestions, with no mix of different sources. Sometimes tailored information is not what we need; we need the raw data.

Too much personalization is a curse. You lose yourself, biased by your own perspectives. Sometimes raw information makes more sense than tailored information.

Escape the Web!!!

September 14, 2019

New Age BI = Kimball + Big Data for unstructured data + AI capabilities

Both Kimball and Inmon are driven by structuring data

Inmon

Bill Inmon is the centralized DW proponent
Inmon defines data warehouse as a centralized repository for the entire enterprise
Data warehouse is at the center of the Corporate Information Factory (CIF)

Kimball

Kimball defines data warehouse as “A copy of transaction data specifically structured for query and analysis”
Kimball's approach kept proving more correct as global corporations needed distributed DWs
Kimball is the distributed Data Marts proponent.
Kimball defines business processes quite broadly.

Current Status

Both concepts are currently changing with Big Data, in-memory processing, columnar databases and machine intelligence
We need tactical plus big data plus analytics

Today BI = DW + AI capabilities + Big Data

BI & Analytics is spread across multiple components; we cannot invest in centralized DW as systems need to be agile enough to accommodate changing data needs
Knowledge discovery is an ongoing process
Data is both structured and unstructured. Strategically, data is changing.
Making smaller datamarts is more manageable

Kimball + Big Data for unstructured data + AI capabilities = New Age BI

The goal of analytics is to move away from historical to real-time recommendations

New Age #BI = #Kimball + #BigData for unstructured data + #AI capabilities

#Kimball - Build Datamarts to slice and dice through data
#BigData - Find your digital footprint, Look at realtime trends/sentiments
#AI - Use AI to find insights from your customer, users, demands, patterns

Happy Learning!!!

July 12, 2019

Data Modelling for Workloads

Denormalized way to suit query patterns
Data Properties - Concurrent writes
Metadata - No continuous updates (Just store data, no update involved)
Indexes - Single Field, Multikey indexes, Text Indexes, Column Store Indexes
Sharding - Shardkey could be location_id (Distributed Storage)
Data Partitioning
Timeseries Info aggregation in storage level (NoSQL)
JSON Modelling - Collections in MongoDB, Establishing Hierarchy and Relationships
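
To illustrate the shard key idea above, here is a hedged sketch that routes documents to shards by hashing `location_id`. The shard count, the `shard_for` helper and the sample documents are assumptions; real databases use their own routing schemes.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of shards


def shard_for(location_id):
    # Use a stable hash: Python's built-in hash() is salted per process.
    digest = hashlib.sha256(str(location_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


docs = [
    {"location_id": "chennai", "sales": 10},
    {"location_id": "mumbai", "sales": 7},
    {"location_id": "chennai", "sales": 3},
]

shards = {i: [] for i in range(NUM_SHARDS)}
for doc in docs:
    shards[shard_for(doc["location_id"])].append(doc)

for shard_id, stored in shards.items():
    print(shard_id, stored)
```

All documents for one location land on the same shard, which suits queries that filter by location.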

Happy Learning!!!

Day #262 - Data Modeling for Analytics Translators

Summary Notes

Flexible, Extensible, Governed Effectively
DW - Staging - RAW - Processing - Consumption
Data processed multiple times
Aggregated at the end
Operational Data Store

Schema on Write

Write data
Read data
Same Schema. Fixed Structure

Schema on Read

Apply schema when you read
Write once read many times
WORM
Bringing data separated by different business, data, databases

Example

Data arrived in JSON format
Add time_stamp to relate data source
Add Source_system
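
The example above can be sketched as follows: the JSON is stored as it arrived and only parsed when read, then tagged with `time_stamp` and `source_system`. The `enrich` helper, the sample payload and the `web_store` source name are hypothetical.

```python
import json
from datetime import datetime, timezone


def enrich(raw_json, source_system):
    record = json.loads(raw_json)  # structure is applied only now, on read
    record["time_stamp"] = datetime.now(timezone.utc).isoformat()
    record["source_system"] = source_system
    return record


raw = '{"order_id": 42, "amount": "19.99"}'
print(enrich(raw, "web_store"))
```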

Canonical Model

Repeating data for right reasons
Enrich with meta_data, canonical elements
Link Canonical elements, suppliers together, Provide unique_id

Data Governance

Data problems come up in mass scenarios
Reports Data Discrepancies
IT framework to manage Data Governance
Master Data Management (MDM is a technology which provides a 360 degree view of a user data coming from different sources)
Data Quality
Data Archival
Data Security

MDM

Source -> ETL (Clean, Standardize, Transform, MDM) -> Reports, DW, EDW
Rules Based
Metadata verification
Data Collection -> ETL -> Data Quality -> MDM -> DW

Data Quality API / Module

Add / Remove Business Rules
Field Level Validations against messages
Return Error codes or log for failures
Auditing and Reporting failed messages automatically
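
One possible shape for such a module, as a sketch only: rules can be added and removed, each message is validated field by field, error codes are returned, and failures are kept for auditing. The class name `DataQualityChecker`, the rule names and the error codes are invented for this example.

```python
class DataQualityChecker:
    def __init__(self):
        self.rules = {}     # rule name -> (field, predicate, error code)
        self.failures = []  # audit log of failed messages

    def add_rule(self, name, field, predicate, error_code):
        self.rules[name] = (field, predicate, error_code)

    def remove_rule(self, name):
        self.rules.pop(name, None)

    def validate(self, message):
        """Return the error codes for one message (empty list if valid)."""
        errors = []
        for field, predicate, code in self.rules.values():
            value = message.get(field)
            if value is None or not predicate(value):
                errors.append(code)
        if errors:
            self.failures.append({"message": message, "errors": errors})
        return errors


checker = DataQualityChecker()
checker.add_rule("amount_positive", "amount", lambda v: v > 0, "E001")
checker.add_rule("currency_known", "currency", lambda v: v in {"USD", "INR"}, "E002")

print(checker.validate({"amount": 10, "currency": "USD"}))
print(checker.validate({"amount": -5, "currency": "XYZ"}))
print(len(checker.failures))
```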

Happy Learning!!!

June 08, 2019

Day #259 - Setting up Kafka on my Ubuntu - Big Data Setup - Part IV

Finally Setting up all Big Data Tools in my Linux Setup. Rough Steps and my reference notes

Happy Data Thinking!!!!

April 02, 2019

Day #232 - Kafka + Spark Integration - Big Data Setup - Part I

Experimenting with Kafka and Spark using Pyspark

Example 1 - Kafka Publish - Consume
Example 2 - Kafka Publish - Spark Consume

Happy Learning!!!

March 31, 2019

Day #230 - What is your Data Story? - Big Data Setup - Part III

Tools evolve, patterns and architecture vary as we progress. To build your data story you need to think in certain perspectives

The story of Data in Motion

Streaming data (Incoming data)
Passing data (Data between certain time interval)
Transactional data (Current data in operation)
Historical data (Transaction completed)

The most famous example is the e-commerce segment. The data story evolves as

Streaming data (Placing orders) - Kafka
Passing data (Checking orders received in the past 30-minute window) – Spark
Transactional data (Orders placed) - HBase / NoSQL / any RDBMS based on business need
Completed data - Completed orders are moved to HDFS; build a Hive table for further analysis.
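
The "passing data" step is essentially a time-window filter. Below is a plain Python sketch of that idea, not Spark code; the `orders_in_window` helper and the sample orders are assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # window size taken from the notes above


def orders_in_window(orders, now):
    """Return the orders placed within the last WINDOW before `now`."""
    cutoff = now - WINDOW
    return [o for o in orders if cutoff <= o["placed_at"] <= now]


now = datetime(2019, 3, 31, 12, 0)
orders = [
    {"id": 1, "placed_at": datetime(2019, 3, 31, 11, 20)},  # 40 minutes ago
    {"id": 2, "placed_at": datetime(2019, 3, 31, 11, 45)},  # 15 minutes ago
    {"id": 3, "placed_at": datetime(2019, 3, 31, 11, 59)},  # 1 minute ago
]
recent = orders_in_window(orders, now)
print([o["id"] for o in recent])
```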

Data Science Role in the Story

Real-time Machine Learning on Spark (30-minute interval data: clustering sales orders into similar groups, clustering orders based on sellers and products, etc.) to understand the segmentation of data in that window interval
Perform Machine Learning on the Historical Completed Data (Recommendations, Forecast, Predictions etc.)

The key tools summary
Spark

Spark Streaming – Real-time querying, Load RDD data in RAM, keep it until you are done, Data is cached in RAM from disk for iterative processing.
RDD (Resilient distributed datasets). RDD - Read Only collection of objects across machines
Spark SQL – Schema / SQL
Immutable data is always safe to share across multiple processes as well as multiple threads
Machine Learning – ML Lib
Graph Processing – Graphx

Kafka

At the heart of Apache Kafka sits a distributed log
The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file.
When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log.
Data is immutable. When you read messages from Kafka, Data is copied directly from the disk buffer to the network buffer
Data organized in topics. Producers write data to brokers, Consumers read data from brokers
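
A toy model of the log idea described above: messages are only ever appended, and each consumer tracks its own position. This is an in-memory sketch, not the Kafka client API; the `Log` and `Consumer` classes are made up for illustration.

```python
class Log:
    """Append-only sequence of messages; existing entries never change."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Reads sequentially and records how far it has got."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)  # record the new position
        return batch


log = Log()
for order in ["order-1", "order-2", "order-3"]:
    log.append(order)

billing = Consumer(log)
shipping = Consumer(log)
print(billing.poll(2))
print(billing.poll(2))
print(shipping.poll())
```

Since reading never removes anything, two consumers can move through the same log at different speeds.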

HBASE

Low Latency, Consistent, best suited for random read/write big data access
NOSQL database hosted on top of HDFS. Columnar based Database
HBase uses HDFS as its data storage layer, this takes care of fault tolerance, scalability aspects

Hive

Targeted for Analytics
Natural choice for SQL Developers is Hive
ETL + DW (data summarization, query and analysis)

Pig

Scripting language for analytics queries

Considerations for RDBMS Vs NOSQL

Performance - Latency tolerance, how slow my queries can run for huge data sets
Durability - Data loss tolerance when database crashes losing in-memory or Lost transactions tolerance
Consistency - Weird results tolerance (Dirty data tolerance)
Availability - Downtime tolerance

The lambda architecture is the reference for the fast and batch layers. With machine learning and more tools evolving, it helps to think of the data story from an end-to-end perspective and fit in tools for your need.

The tools remain the same but mapping is different across different cloud providers

Data is the same, but we have progressed further to query data in motion. Tools evolve but your data story remains the same. Step back from the tools; let's build a data story and connect the dots.

Old Process – Model – Collect - Analyze Data
New Process – Collect – Analyze Data in Motion – Build Model (Paradigm Shift)

My Whiteboard

More Reads
Pattern: Database per service
The Hardest Part About Microservices: Your Data

Update - Oct 18th 2020

The new wave of current products, cloud providers is impressive. Reusing from link, Post

Architecture

BI Architecture

Data Processing

AI Architecture

Time to write your own Data Story!!!

October 05, 2018

Deep Dive PySpark Examples - Big Data Setup - Part II

After experimenting a bit with PySpark, I feel it is much better to handle things with R / Python. Most of the things we can achieve are repetitive between R / Python / Spark / SQL.

Data Pipeline tasks at DB Level
One Hot Encoding can also be done with basic TSQL code
While working in NLP it makes sense to use TF-IDF Vectorizers
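
One-hot encoding is simple enough to write by hand, which is why it also fits in plain SQL. Here is a minimal Python sketch; the `one_hot` helper and the colour values are assumptions for illustration.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows


categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)
for row in rows:
    print(row)
```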

Happy Learning!!!

September 01, 2017

Exploring Analytics in Microsoft Azure

I am working on the Microsoft Azure platform on a BI cloud solution. Some of the key components I worked on recently are

Azure Data Factory
Azure Data Lake
Azure SQL Data warehouse
Power BI on top of Data warehouse for reporting

I had earlier compared different stacks Microsoft / Google / Amazon.

The high level workflow for cloud based BI Solution and key components are

Step #1 - Moving Data from On-Premises to Cloud

Here the data management gateway is installed on the on-premises machines. Pipelines are created in Azure Data Factory to move data from on-premises to Azure Data Lake.

Step #2 - Azure Data Factory

ADF provides a platform for data ingestion, consuming high volumes of data. Setting up pipelines has some similarities to and differences from SSIS. The key differences are

Everything is JSON based
Setting up Connections
Defining input and output data formats in datasets
Input and Output datasets also define the storage locations
Defining pipeline logic, which includes the logic, input and output datasets, and scheduling for the pipeline
This is fairly straightforward, but there is some learning involved with the tool and its configuration properties

Step #3 - Azure Data lake

Azure Data Lake is for storing data (RDBMS / non-RDBMS). If we have to integrate data from two sources, such as MSSQL and MySQL, for real-time processing, we can leverage the data lake to store it and consolidate it later. The data stored in the Data Lake is referenced as external tables in Azure SQL Data Warehouse.

Step #4 - Web Application

All the references of data movement from Datalake and connectivity to Datawarehouse are managed by access control, leveraged with an Azure web app. The security aspect is well managed in the Azure infrastructure.

Step #5 - Data Consolidation into SQL Datawarehouse

The external tables referencing Datalake data can be queried in TSQL and the data loaded into Azure Datawarehouse tables. This is the location of the fact and dimension tables that power our data warehouse. The load could be done by stored procedures.

Step #6 - Power BI reporting

We have completed data loading and data consolidation. Next is Power BI, which has the most powerful offering for web / mobile platforms. It is convenient and easy to use. The extended analytics / R support / machine learning library support also makes it suitable for running both Business Intelligence and Machine Learning solutions.

The security aspects of this architecture are well handled with firewall and IAM access as needed. It seems very stable even though some of the components are constantly updated. This is a high-level architecture explanation; we will look into to-do exercises in the coming weeks.

Happy Learning!!!

January 07, 2015

Databases - IOT - CES

CES notes on IOT had an interesting tag line, posted in a MEMSQL blog post

Tag line copied from the post

Also, there is a vast landscape of DB products in multiple categories (RDBMS, NOSQL, In-memory, Hadoop, Stream processing); check out the 451 Research paper. Depending on the application needs, you can identify top products to evaluate / get started.

Data Platforms Landscape

Updated Data Platforms Landscape Map – October 2014 http://t.co/qhJawLQhNX Free download - now fully indexed pic.twitter.com/cH3QL5GFcm
— Matt Aslett (@maslett) November 18, 2014

NoSQL LinkedIn Skills Index – December 2014

NoSQL LinkedIn Skills Index – December 2014 http://t.co/bCvR4ipSGG
— Matt Aslett (@maslett) December 18, 2014

Current State

It is remarkable how many database vendors are struggling despite apparently pre-emptively designing their products to be perfect for IoT
— Matt Aslett (@maslett) December 11, 2014

Happy Learning!!!

October 10, 2014

HBase Overview Notes

Limitations of Hadoop 1.0

No Random Access --> Hadoop for more batch access (OLAP)
Not suitable for Real-time Access
No Update - Access pattern is WORM (Write Once Read Many times), which Hadoop is best suited for

Why HBase

Flexible Schema Design --> Add a new column when a row is added
Multiple versions of a single cell (Data)
Columnar storage
Cache columns at client side
Compression of columns
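
Two of the points above, columns added per row and multiple versions of a cell, can be sketched with a small in-memory structure. This is not the HBase client API; the `VersionedTable` class and its `max_versions` default are assumptions for illustration.

```python
from collections import defaultdict


class VersionedTable:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row key, column) -> [(timestamp, value), ...], newest first
        self.cells = defaultdict(list)

    def put(self, row, column, value, ts):
        # Any row may introduce a new column; no schema change is needed.
        versions = self.cells[(row, column)]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop the oldest versions

    def get(self, row, column, ts=None):
        # Newest value, or the newest value at or before `ts`.
        for version_ts, value in self.cells.get((row, column), []):
            if ts is None or version_ts <= ts:
                return value
        return None


table = VersionedTable()
table.put("user1", "info:city", "Chennai", ts=1)
table.put("user1", "info:city", "Bangalore", ts=2)
table.put("user2", "info:phone", "555-0100", ts=1)  # column only on this row
print(table.get("user1", "info:city"))
print(table.get("user1", "info:city", ts=1))
```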

Read v/s Write

For Availability (Compromise on Write) vs Consistency (Compromise on Read)

Hbase

NoSQL Class on Non-Relational Storage Systems
In an RDBMS storage is row based; in HBase it is columnar storage
HBase needs HDFS for replication
ZooKeeper - Takes all requests from the client. The client communicates through ZooKeeper: Client -> ZooKeeper -> HMaster
Region Server - Serves the regions. The Region Server process runs on the slaves (Data Nodes)

Happy Learning!!!

October 09, 2014

Pig Overview Notes

Pig

Primarily for semi structured data
So called 'Pig' as it processes all kinds of data
Pig is data flow language not a procedural language
Map Reduce - Java Programmers, Hive - for TSQL folks, Pig (Rapid Prototyping & increased productivity)
Pig is on client side, need not be on cluster
Execution Sequence - Query Parser -> Semantic Checking -> Logical Optimizer (Variable level) -> Logical to physical translator -> Physical to M/R translator -> MapReduce Launcher
Pig Concepts - Map - set of key-value pairs, Tuple - ordered list of fields, Bag - unordered collection of tuples
Pig - for client side access, Hive will work only within cluster, semi structured data
Hive - Best suited for SQL style analytics, structured data
MR - Audio Video Analytics Map Reduce Approach is the only option

Happy Learning!!!

October 08, 2014

Hive Overview Notes

Data Warehousing package built on top of hadoop
Managing and querying structured data
Apache Derby embedded DB used by Hive
metastore_db folder for persistence of data
Suitable for WORM - Write Once Read Many Times Access Pattern
Core Components are Shell, Metastore, Execution Engine, Compiler (Parse, Plan, Optimize), Driver
Tables can be created as Internal Tables, External Table (Pointing to external file)
When Internal Tables are dropped schema + data is dropped. For external referencing tables only Schema is dropped not data. Both Internal and External tables reside in HDFS
Data files for created tables would be available in location /user/hive/warehouse
Bucketing in Hive - Hash Value % Number of buckets - decides which bucket a particular row goes into
Partitioned tables can be internal (managed) or external Hive tables

Happy Learning!!!

October 07, 2014

Map Reduce Internals

Client Submits Job. Job Tracker does the splitting, scheduling Job

Mapper

Mapper runs the business logic (ex- word counting)
Mapper (Maps your need from the record)
Record reader provides input to mapper in key value format
Mapper Side Join (Distributed Caching)
Output of mapper (list of keys and values). Output of mapper function stored in Sequence file
Framework does splitting based on input format, Default is new line (text format)
Every row / Record will go through map function
When a record (row) is split between two 64MB blocks, the pieces are merged so that the complete record is processed
Default block size in Hadoop 2.0 is 128MB

Reducer

Reducer will poll it, job tracker will inform what all nodes to poll
Default number of reducer is 1. This is configurable
Multiple reduce stages in one job - Not possible; chain multiple MR jobs instead
Reduce Side join (Join @ Reducer Level)

Combiner

Combiner - Mini reducer; runs before map output is written to disk (e.g. finds the max value from the data)
Combiner is used when map job itself can do some preprocessing to minimize reducer workload

Partitioner

Hash Partitioner is default partitioner
Mapper -> Combiner -> Partitioner -> Reducer (For multi-dimension, 2012-max sales by product, 2013, max sales by location)
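
The Mapper -> Combiner -> Partitioner -> Reducer flow can be sketched in a single process as below, using "max sales per product" as the job. This is not Hadoop code; the reducer count, the CRC32-based `partition` function and the sample splits are assumptions.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of reduce tasks


def mapper(record):
    product, amount = record
    yield product, amount


def combiner(pairs):
    # Mini reducer: keep only the local max per key on the map side.
    local_max = {}
    for key, value in pairs:
        local_max[key] = max(value, local_max.get(key, value))
    return list(local_max.items())


def partition(key):
    # Hash partitioner: the same key always goes to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def reducer(key, values):
    return key, max(values)


def run(splits):
    reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for split in splits:  # one map task per input split
        mapped = [kv for record in split for kv in mapper(record)]
        for key, value in combiner(mapped):
            reducer_inputs[partition(key)][key].append(value)
    results = {}
    for inputs in reducer_inputs:  # one reduce task per partition
        for key, values in inputs.items():
            k, v = reducer(key, values)
            results[k] = v
    return results


splits = [
    [("pen", 5), ("book", 12), ("pen", 9)],
    [("book", 7), ("pen", 3)],
]
print(run(splits))
```

The combiner shrinks the first split from three pairs to two before anything is sent to a reducer, which is the workload saving the notes describe.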

Happy Learning!!!
