"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 28, 2019

Day #260 - Fundamentals Revisited - Detection

Summary of Notes
One stage detector
  • High recall but weaker localization ability; examples: DenseBox, YOLO, SSD, RetinaNet
  • One-stage detector example - RetinaNet
  • FPN - Feature Pyramid Network - computes multi-scale convolutional feature maps
  • Class subnet - object classification
  • Box subnet - bounding box regression
Two Stage Detector
  • FRCNN, R-FCN, Mask R-CNN
  • Two-stage detectors have strong localization ability
  • Feature Pyramid Network structure
  • RoI Align

Non-Maximum Suppression (NMS) - post-processing to eliminate duplicate responses for the same object (a minimal sketch below)
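A minimal sketch of the greedy, IoU-based form of NMS; the [x1, y1, x2, y2] box format and the 0.5 threshold are assumptions for illustration, not from these notes:

```python
# Minimal greedy NMS sketch (illustrative box format and threshold).
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```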
Metrics - Precision, Recall
  • Precision - how many selected items were relevant: TP / (TP + FP)
  • Recall - how many relevant items were selected: TP / (TP + FN)
Detection
  • Sliding window approach, Parallel Computation
Manually handcrafted features for images (a small extraction sketch follows this list)
  • Haar features
  • Histogram of Oriented Gradients (HOG)
  • Local Binary Pattern (LBP)
  • Aggregated Channel Features (ACF)
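A small sketch of extracting two of these handcrafted features with scikit-image; the sample image and parameter values are placeholders for illustration only:

```python
# Handcrafted feature extraction sketch (scikit-image; illustrative parameters).
from skimage import color, data
from skimage.feature import hog, local_binary_pattern

image = color.rgb2gray(data.astronaut())   # any grayscale image works here

# Histogram of Oriented Gradients (HOG) feature vector
hog_features = hog(image, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))

# Local Binary Pattern (LBP) map, often histogrammed into a feature vector
lbp = local_binary_pattern(image, P=8, R=1, method="uniform")

print(hog_features.shape, lbp.shape)
```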
References 
Link1
Link2
Link3

Happy Mastering Data Science!!!

June 23, 2019

Outside Retail, RFID in Smart Cities

For the past few years I have looked at RFID from a retail perspective - inventory management. We work on RFID solutions that enable real-time inventory management and improve stock efficiency by tracking:
  • Restock
  • Misplaced Items
  • Replenishment Items 
RFID is a major player in retail for inventory management; millions of items are tagged and tracked using RFID-based inventory solutions.
I had an opportunity to visit a smart city solution provider. The solutions they covered were:
  • Waste Management
  • Parking Management
  • Smart Lighting
  • Video Monitoring
  • Environment Management
  • Surveillance
  • Command Centre
Waste Management - The bins are equipped with volume sensors and RFID tags and are designed with batteries lasting 10 years. When the truck and the bin connect, the RFID event is raised and the fill details are reported.

Intrusion Detection - Motion sensors placed around the area raise an alert if someone comes close to it

Environment Management - This is basically measuring pollutants in the air and reporting the trends/patterns

Traffic - Mostly heat maps that reflect traffic patterns and crowded areas

Command Centre - All the calls are routed/tracked here. Provides a lot of data on issues, queries, complaints, accidents, and reported incidents.

Parking - This is based on sensors placed above / below parking lots

State of the Art - All of these implementations are mostly based on sensors and RFID. The visit provided a lot of insight into the data collection, framework, and real-time alerting of the end-to-end infrastructure.

Where is Video Analytics here?
Before the visit I was under the assumption that there would be a lot of video analytics and people-counting use cases. In reality, most things are achieved with RFID and sensor-based events; video analytics is yet to make a mark in these implementations. But over the next few years, we should see a lot of video analytics on the data collected:
  • Crowd Detection
  • Use of Drones for Monitoring
  • Gender Detection from Surveillance Cameras
  • Vehicle Number Detection and Reporting
  • Face Detection and Indexing at Landmarks
  • Loitering
  • Audio-based analytics to detect gunshots/sounds/ accidents
With these use cases, RFID + video analytics will further strengthen the smart city solutions portfolio.
Happy Learning!!!

June 19, 2019

The Data Around Myself

Different Data Sources that run our Knowledge and Emotions
  • LinkedIn - Primarily for professional contacts / future prospects. Connect for jobs, mentoring, professional discussions, and sharing / following industry updates
  • Quora - For interest-based contacts: deep-dive discussions, perspectives and debates, following experts in our areas of interest
  • Facebook - For friends and relationships: events, holidays, updates. I have zero input in this space, as I believe in a meaningful few contacts rather than hundreds. To date, I haven't invested my time in FB
  • Twitter - Pointers and thoughts from people whose interests and ideas we follow
  • Knowledge Sources - Reddit, hacker news, medium
  • Code Sources for Learning - Github, StackOverflow
  • Communication Channels - Skype, Whatsapp, Slack
We will observe more networks evolve, focused on:
  • Industry
  • Domain
  • Age Group
  • Relationships
  • Data Ownership
Happy Data Perspectives!!!

June 15, 2019

Lessons Learnt from Video Analytics Projects

These are lessons learned based on working on Video Analytics Projects.
  1. Customers get carried away with demos. Demos with fewer objects will look good, but you need to nail down how many object types you can train and how many you can detect. When you look at Tesla Autopilot videos, you can observe the detected entities: lanes (yellow), speed (white), type of vehicle (red), signals in that lane, alerts when the vehicle ahead has stopped, dimensions (length of vehicle), pedestrian detection/speed, and the boundaries of detected objects. You need to nail down the objects you want to detect; not everything you see can be counted.
  2. Object detection is not a photocopier kind of problem. If you spend X amount of time detecting one type of object, you need to spend roughly the same amount of time training for another object. I have observed customers assume it is a one-time training effort after which the system auto-learns.
  3. Object detection is one part of the problem. The type of camera matters: datasets from a fisheye camera and a mounted camera will differ, and you need to prepare and work on datasets for both of them.
  4. After object detection, object tracking becomes the next task. In real-time systems with people or cars, where a lot of movement is involved, tracking becomes the next important aspect after detection. Object re-identification (Re-ID) is a further challenge: recognizing a person or object when it re-appears in the frame or passes across cameras (a minimal sketch of the tracking idea follows this list).
  5. View video analytics as a form of translating video data into insights: objects counted, repeat objects, duration of objects in the frame. This data needs to be correlated with other forms of data to establish insights and correlations.
  6. Working with smaller video sets in a rapidly changing environment is a dataset-generation problem. The big tech companies - Microsoft, Amazon, Google, Tesla - have a lot of video datasets and data available. Working with smaller customers that have limited datasets and varied lighting and environmental conditions is always a challenge.
  7. Edge computing has a long way to go. With my limited experience, I feel the edge computing offerings from Intel, Google, and other providers have a long way to go.
  8. There is a lot of research activity and it is very tough to keep pace. Tons of papers and demo code are put up and added every day; it is difficult to track and keep knowledge updated on a daily basis.
  9. Video analytics will go together with other inputs: location information, coordinates, RFID information. Coordinates can be used to map the object location and to infer it in future frames.
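A minimal sketch of the frame-to-frame tracking idea from point 4, greedily matching detections to existing tracks by IoU overlap; the box format and threshold are assumptions, and real systems add motion models and re-identification features:

```python
# Minimal IoU-based tracking sketch (illustrative; not a production re-ID system).

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class IouTracker:
    """Keeps integer IDs for boxes that overlap from frame to frame."""
    def __init__(self, iou_thresh=0.3):
        self.iou_thresh = iou_thresh
        self.tracks = {}          # track id -> last seen box
        self.next_id = 1

    def update(self, detections):
        assigned = {}
        for det in detections:
            best_id, best_iou = None, self.iou_thresh
            for tid, box in self.tracks.items():
                if tid in assigned:
                    continue
                overlap = iou(det, box)
                if overlap > best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:   # no match above threshold -> start a new track
                best_id, self.next_id = self.next_id, self.next_id + 1
            assigned[best_id] = det
        self.tracks = assigned    # unmatched old tracks are dropped (kept simple)
        return assigned           # track id -> box for this frame
```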
Will Keep updating with more lessons!!!

Happy mastering Data Science!!!

June 14, 2019

9 Reasons why Business Intelligence and Database Developers should learn Machine Learning

Business Intelligence Insights can fuel Data Science Use Cases
  1. BI will give you trends on historical data; it will pinpoint your highlights and lowlights
  2. Effective data science use cases for the business come from converting these lowlights into proactive signals
  3. BI will shed light on trends by seasonality, product, location, etc. These values are effective feature variables for building your ML models
Database Developers are naturally good at Analyzing Data

I worked with terabytes of data - Microsoft Entertainment and Devices data - back in 2008. I remember all the insights we computed across supply chain, orders, returns, and warranty. When I started in data science, I realized how effectively that data could be analyzed with data science:
  • Sales Analysis - sales by product and region and units sold can be clustered to find insights (see the small sketch after this list)
  • Repair Analysis - products, repair types, and regions can be clustered to find the most frequently occurring issues
  • Sales, Repair, Warranty Renewal - from all the real-time and transactional data, we can build forecasting for repairs, sales, and warranty renewals
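A minimal sketch of the clustering idea above with scikit-learn; the column names and numbers are made up purely for illustration:

```python
# Toy sales-clustering sketch (hypothetical columns and values, illustrative only).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

sales = pd.DataFrame({
    "units_sold": [120, 95, 400, 380, 20, 15],
    "revenue":    [2400, 1900, 9000, 8700, 300, 250],
    "returns":    [3, 2, 25, 30, 1, 0],
})

X = StandardScaler().fit_transform(sales)          # put features on the same scale
sales["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(sales.groupby("cluster").mean())             # profile each cluster
```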
4. Database developers work with huge volumes of data, schema design, and index design for performance
5. The job involves aggregating data and writing procedures to implement business transactions
6. They handle load issues such as concurrency, deadlocks, and dirty data

These traits are helpful in setting up the data pipeline. Much of the work of generating insights can be achieved with TSQL itself; you do not really need to use pandas or start from scratch.

They can build pipelines, do the reporting, and talk about the numbers

7. Translate the insights/dimensions into feature variables
8. Communicate the insights to the business and frame them as data science use cases to solve
9. BI reporting + transactional reporting naturally provide a lot of visualization, which is also a must in data science for presenting your story

Today, transactional data reporting, business intelligence insights, and future predictions with data science are all needed to succeed in business. OLTP + OLAP + Data Science = all about data in business.

Note - I am not including Video, Text in this context. Data Science with Data (Numbers) is the scope of this post

#BusinessIntelligence, #MachineLearning, #Database, #TSQL, #ArtificialIntelligence, #DataPipeline

Happy Mastering Data Science!!!

June 12, 2019

Why Data Science Projects Fail ?

In many places I keep observing failed #ML implementations. Data science is here to stay, but there will be a bubble burst of "failed data science projects".

Some of the missed lessons I observed are:
  • There is no single model to solve everything - detection, classification, scoring, recommendations. It will be a mix of multiple models
  • A data science project is creative work; it needs data, fine-tuning, and re-training, doesn't fit neatly into deadlines, and is about making a system learn from data
  • The perfect model comes after iterations, not in the first iteration
  • Don't go for the AWS, Google, or Microsoft vision/NLP/AI tools in the first go. They are pay-per-use and good to start with, but build something on your own - you will have control over the internals and can improve it further
  • Production deployment - There are tons of tools out there; building the model is more important than deploying it
  • Data science projects and sales teams' visions often conflict. I have seen sales teams unable to sell AI products because they don't fully understand how AI fits into the portfolio
  • Data science is multiple perspectives = Data + Domain Knowledge + BI + Computer Vision + .... It's not just .fit and .predict. Learn the complete perspective; don't get carried away with one perspective.
Happy Mastering Data Science!!!




June 08, 2019

Day #259 - Setting up Kafka on my Ubuntu - Big Data Setup - Part IV

Finally setting up all the big data tools on my Linux machine. Rough steps and my reference notes.
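As a quick smoke test of a local install, something like the snippet below can be used; this assumes a broker already running on localhost:9092 and the kafka-python package, and is not the exact setup steps from these notes:

```python
# Quick smoke test for a local Kafka broker (assumes localhost:9092 and kafka-python).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello kafka")
producer.flush()

consumer = KafkaConsumer("test-topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)   # should print b'hello kafka'
    break
```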




Happy Data Thinking!!!!

Data Warehousing with Amazon Redshift

  • Redshift started from Postgres; 150+ features have been added over 4 years
  • Rebuilt and improved version: columnar storage, horizontal scaling, OLAP functionality added
  • Wrapped in the AWS ecosystem
The 5 steps to look at when creating a Redshift schema
  • Step 1 - Find Fact, Dimension Tables and the Record Count
  • Step 2 - Analyze and find the query where clause filter columns
  • Step 3 - Define the Sort Keys
  • Step 4 - Analyze the record counts, data to find out distribution strategy
  • Step 5 - Design Distribution Strategy
Summary of lessons from Link
Good tutorial on table design tuning - Link
The WHERE clause columns are the sort keys (date, partnumber, year, etc.) - same idea as TSQL index design - Link

Summary of Lessons and Key Steps

Step #1 - Demo Example Tables and Record Count

LINEORDER - 600,037,902
PART  - 1,400,000
CUSTOMER  - 3,000,000
SUPPLIER  - 1,000,000
DWDATE  - 2,556

Step #2 - Analyze and find the query where clause filter columns, Link

The join columns are:
LINEORDER - PK: lo_orderkey; FKs: lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_commdate
PART - PK: p_partkey (joins LINEORDER.lo_partkey)
CUSTOMER - PK: c_custkey (joins LINEORDER.lo_custkey)
SUPPLIER - PK: s_suppkey (joins LINEORDER.lo_suppkey)
DWDATE - PK: d_datekey (joins LINEORDER.lo_orderdate, lo_commdate)

Step #3 - Define the Sort Keys

Define the sort keys based on the query WHERE clauses; the primary and foreign keys used in those filters become the sort keys.

Table name Sort Key
LINEORDER lo_orderdate
PART p_partkey
CUSTOMER c_custkey
SUPPLIER s_suppkey
DWDATE d_datekey

Step #4 - Analyze the record counts, data to find out distribution strategy
  • The largest dimension table is PART; each table can have only one distribution key
  • LINEORDER is the fact table, and PART is the largest dimension; PART joins LINEORDER on its primary key, p_partkey
  • Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution key for PART so that matching values of the joining keys are collocated
Step #5 - Design Distribution Strategy

Based on the data, define the distribution strategy so that joined data is collocated and records are returned faster.

Table name Distribution Style
LINEORDER lo_partkey
PART p_partkey
CUSTOMER ALL
SUPPLIER ALL
DWDATE ALL

The final updated schema is listed in the link. Observe the SORTKEY and DISTKEY additions in the DDL (a simplified sketch below).
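A simplified sketch of what such DDL looks like, run from Python; the column lists are trimmed and the connection details are placeholders, not the tutorial's full schema:

```python
# Simplified Redshift DDL sketch executed via psycopg2 (placeholder connection details).
import psycopg2

ddl_statements = [
    # Fact table: distribute on the join key to the biggest dimension, sort on order date
    """CREATE TABLE lineorder (
           lo_orderkey  INTEGER NOT NULL,
           lo_partkey   INTEGER NOT NULL DISTKEY,
           lo_orderdate INTEGER NOT NULL SORTKEY,
           lo_revenue   INTEGER NOT NULL)""",
    # Largest dimension: same distribution key so matching rows are collocated
    """CREATE TABLE part (
           p_partkey INTEGER NOT NULL DISTKEY SORTKEY,
           p_name    VARCHAR(55) NOT NULL)""",
    # Small dimension: replicate to every node
    """CREATE TABLE dwdate (
           d_datekey INTEGER NOT NULL SORTKEY,
           d_date    VARCHAR(19) NOT NULL) DISTSTYLE ALL""",
]

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="***")
with conn, conn.cursor() as cur:
    for ddl in ddl_statements:
        cur.execute(ddl)
```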


Happy Learning new things!!!

Day #258 - Deep Learning - Non Convex Optimization

In convex optimization, any local optimum is also the global optimum, so there is effectively only one optimal solution to find. Non-convex optimization may have multiple locally optimal points; hence, finding the global minimum is very difficult.

What makes non-convex optimization hard?
  • Potentially many local minima
  • Saddle points
  • Very flat regions
  • Widely varying curvature
Examples of non-convex problems
  • Matrix completion, principal component analysis
  • Low-rank models and tensor decomposition
  • Maximum likelihood estimation with hidden variables (usually non-convex)
  • The big one: deep neural networks
How to solve non-convex problems? (a minimal SGD + momentum sketch follows this list)
  • Stochastic gradient descent
  • Mini-batching
  • SVRG
  • Momentum
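A minimal numpy sketch of SGD with momentum and mini-batching on a toy non-convex loss; the loss function and the hyperparameters are arbitrary choices for illustration:

```python
# Toy SGD-with-momentum sketch on a small non-convex loss (illustrative hyperparameters).
import numpy as np

def loss_grad(w, x):
    """Gradient in w of the per-sample loss (w*x - sin(w))^2."""
    residual = w * x - np.sin(w)
    return 2 * residual * (x - np.cos(w))

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

w, velocity = 2.0, 0.0
lr, beta, batch_size = 0.01, 0.9, 32

for step in range(500):
    batch = rng.choice(data, batch_size)        # mini-batching
    grad = np.mean(loss_grad(w, batch))         # stochastic gradient estimate
    velocity = beta * velocity - lr * grad      # momentum update
    w += velocity

print("final w:", w)
```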
One good slide from link

Ref - Link
Ref - Link
Ref - Link

Backprop - Link

Happy Mastering DL!!!

June 06, 2019

Machine Learning Chaos - Data Chaos

Often I keep hearing the responses below when building ML use cases. There are multiple missing dimensions in the judgement. Some of my observations from these discussions:
  • Data Unavailable - You cannot map a standard use case and expect those standard features to be part of your data. ML use cases today go with "I need all this data"; rather, we should look at "What model can I build with the available data?"
  • Data Insufficient - Data gets archived and deleted in most transactional systems. Building a model with the currently available data and improving it periodically is more important than waiting for 5 years of data
  • Model Accuracy - Don't compare ML with other software metrics. ML is learning from data - garbage in, garbage out - whereas software is code for functionality
  • Picking the Right Use Case - Before finding the right use cases, we need to understand the collected data and features. A gap between the use cases and the understanding of the available data will not get you the right use cases to solve
  • Data Pipeline - Building the first ML use case involves setting up a data pipeline for collecting newer features and handling changes to the transactional system. This needs a good analysis of the gap between the available features and the good-to-have features, along with the pipeline setup for adding newer feature variables
#MachineLearning #Data #Business #AI. Navigate through the chaos of Data and Features to build your ML Models.

Happy Learning!!!

June 05, 2019

Learning vs Compensation vs Titles

This question comes to me at different points in my career. The experiences we gather from different roles and domains accumulate and give us perspectives unique to each domain.

After a decade I had options to choose from:
  • Go forward in the same database space
  • Set up teams from scratch
  • Take a domain-based focus and build expertise
  • Professional services, automation, QA
  • Learn something new
Every challenge once you solve becomes boring. Everything is difficult until you find a way to do it. Passion without skills won't get you where you want to go.

Five years back I decided to reboot myself in the data science space. Every previous experience gave new perspectives to explore: building expertise in one area, solving problems from a domain perspective, working on creating unique IP, and learning the other tasks related to delivery.

Many times in my career I have received proposals and had discussions about going back to previous roles. From data science, I now get to hear about Big Data + data science skills. Even within data science, building expertise in video, numbers, or text needs deep dives and consistent focus over a certain period.

With close to two decades of experience coming up in a few years, the questions remain: how long to stay an individual contributor, how to find a role that uses all the previous experience, and how much to keep up with titles?

The primary aspect of my satisfaction is learning. After a few years, there could be something that would replace today's data science.

As I grow older, I have to pick and choose my areas of interest and focus. I have never found happiness focusing on titles or compensation alone, though they do have a priority after satisfaction. A job that gives you learning, exciting but not overwhelming challenges, a manageable work-life balance, and fair compensation is the best thing to look for and to stay satisfied with for the rest of your days. Sometimes you need titles to execute your strategy.

Compensation helps you meet your financial goals. The search is to find a role that gives you opportunities that meet your priorities and leverage your strengths. A career is a long-term thing. Keep learning until you find your dream role, be prepared for the role, and don't wait for titles to learn the role.


How to write a design doc, take feedback, and drive it to resolution, in a reasonable period of time
  • Competitive product
  • Papers Referenced
  • Potential Architecture
  • Prototype 
  • Demo
  • Next Steps
How to mentor an early-career teammate, a mid-career engineer, a new manager who needs technical advice
  • What would I do if I were in your situation? That is how I look at it and advise. We cannot learn everything; we have to pick and focus on the few things that are important to us.
How to influence another team to use your solution instead of writing their own
  • View it as a joint success. Sometimes you have to give credit to make things work. 
How to get other engineers to listen to your ideas without making them feel threatened
  • You are not selling your idea; you are conveying how it is done across other industries and companies. You need a working demo and a deep dive into its benefits. It may not work on the first go, but over a period of time trust will develop
How to craft a project proposal, socialize it, and get buy-in to execute it
  • For everything we pick, someone has already experimented with it, or it could be something ongoing. Give the context and the potential value, and take it to the next level.
Keep going!!!

June 04, 2019

Day#257 - 9 Steps to Build your first Machine Learning Use Case

1. Data Analysis from a Domain Perspective - Explore your data and understand the entities, transactions, and business flows captured
2. Data Science Use Case Review from a Customer / Sales / Business Opportunity Perspective - Read up and connect with stakeholders to find the use cases the business needs
3. Mapping of Available Data to Data Science Use Cases - Conduct a feasibility study to map the use cases to the data captured
4. Pick and Choose a Use Case Based on Data and Business Impact - With #1 and #2, find the sweet spot to hit for your first use case
5. Develop the Model, Build Features, Test and Validate Accuracy
6. Demonstrate the Model, Explain the Data and Features - Demo it to business users, sell your use case
7. 70-80% Accuracy Is Good Enough to Get Started - Quality improves over time as more features are captured; lay out a plan to add more features
8. Build the End-to-End Data Pipeline and Model Training Pipeline, Deploy as an API / Expose the Output as a Report or API Response (a minimal sketch of steps 5-8 follows this list)
9. Include Model Re-training to Keep Up with Changing Data Dynamics - Refresh the model at periodic intervals
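A minimal sketch covering steps 5-8: train a model, check accuracy, persist it, and expose it as an API. The dataset, model choice, and endpoint are assumptions for illustration, not a prescribed stack:

```python
# Minimal train-and-serve sketch for steps 5-8 (toy dataset, illustrative choices).
from flask import Flask, request, jsonify
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 5: develop the model and validate accuracy
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 8: persist the model and expose it as a small API
dump(model, "model.joblib")
served_model = load("model.joblib")
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = served_model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```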

The Journey Continues!!!

June 03, 2019

Day #256 - Experimenting with Linux on Windows

While trying to install Linux (dual boot), I corrupted my laptop setup. To recover:
  • Created a Dell Recovery USB using Service Tag
  • Using this link 
  • Used diskpart tool
  • Cleaned the entire disk
  • Used the convert command to create a basic MBR record
Happy Re-Learning!!!