"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 28, 2019

Day #260 - Fundamentals Revisited - Detection

Summary of Notes
One stage detector
  • High recall but weaker localization ability; examples: DenseBox, YOLO, SSD, RetinaNet
  • One-stage detector example - RetinaNet
  • FPN - Feature Pyramid Network - computes multi-scale convolutional feature maps
  • Class subnet - object classification
  • Box subnet - bounding box regression
Two Stage Detector
  • FRCNN, R-FCN, Mask R-CNN
  • Two-stage detectors have strong localization ability
  • Feature Pyramid Network structure
  • RoI Align

Non-Maximum Suppression (NMS) - post-processing to eliminate duplicate responses for the same object (a minimal sketch below)
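A minimal sketch of the greedy, IoU-based form of NMS; the [x1, y1, x2, y2] box format and the 0.5 threshold are assumptions for illustration, not from these notes:

```python
# Minimal greedy NMS sketch (illustrative box format and threshold).
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```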
Metrics - Precision, Recall
  • Precision - how many selected items were relevant: TP / (TP + FP)
  • Recall - how many relevant items were selected: TP / (TP + FN)
Detection
  • Sliding window approach, Parallel Computation
Manually handcrafted features for images (a small extraction sketch follows this list)
  • Haar features
  • Histogram of Oriented Gradients (HOG)
  • Local Binary Pattern (LBP)
  • Aggregated Channel Features (ACF)
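A small sketch of extracting two of these handcrafted features with scikit-image; the sample image and parameter values are placeholders for illustration only:

```python
# Handcrafted feature extraction sketch (scikit-image; illustrative parameters).
from skimage import color, data
from skimage.feature import hog, local_binary_pattern

image = color.rgb2gray(data.astronaut())   # any grayscale image works here

# Histogram of Oriented Gradients (HOG) feature vector
hog_features = hog(image, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))

# Local Binary Pattern (LBP) map, often histogrammed into a feature vector
lbp = local_binary_pattern(image, P=8, R=1, method="uniform")

print(hog_features.shape, lbp.shape)
```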
References 
Link1
Link2
Link3

Happy Mastering Data Science!!!

June 23, 2019

Outside Retail, RFID in Smart Cities

For the past few years I have looked at RFID from a retail perspective - inventory management. We work on RFID solutions that enable real-time inventory management and improve stock efficiency by tracking:
  • Restock
  • Misplaced Items
  • Replenishment Items 
RFID is a major player in retail for inventory management; millions of items are tagged and tracked using RFID-based inventory solutions.
I had an opportunity to visit a smart city solution provider. The solutions they covered were:
  • Waste Management
  • Parking Management
  • Smart Lighting
  • Video Monitoring
  • Environment Management
  • Surveillance
  • Command Centre
Waste Management - The bins are equipped with volume sensors and RFID tags and are designed with batteries lasting 10 years. When the truck and the bin connect, the RFID event is raised and the fill details are reported.

Intrusion Detection - Motion sensors placed around the area raise an alert if someone comes close to it

Environment Management - This is basically measuring pollutants in the air and reporting the trends/patterns

Traffic - Mostly heat maps that reflect traffic patterns and crowded areas

Command Centre - All the calls are routed/tracked here. Provides a lot of data on issues, queries, complaints, accidents, and reported incidents.

Parking - This is based on sensors placed above / below parking lots

State of the Art - All of these implementations are mostly based on sensors and RFID. The visit provided a lot of insight into the data collection, framework, and real-time alerting of the end-to-end infrastructure.

Where is Video Analytics here?
Before the visit I was under the assumption that there would be a lot of video analytics and people-counting use cases. In reality, most things are achieved with RFID and sensor-based events; video analytics is yet to make a mark in these implementations. But over the next few years, we should see a lot of video analytics on the data collected:
  • Crowd Detection
  • Use of Drones for Monitoring
  • Gender Detection from Surveillance Cameras
  • Vehicle Number Detection and Reporting
  • Face Detection and Indexing at Landmarks
  • Loitering
  • Audio-based analytics to detect gunshots/sounds/ accidents
With these use cases, RFID + video analytics will further strengthen the smart city solutions portfolio.
Happy Learning!!!

June 19, 2019

The Data Around Myself

Different Data Sources that run our Knowledge and Emotions
  • LinkedIn - Primarily for professional contacts / future prospects. Connect for jobs, mentoring, professional discussions, and sharing / following industry updates
  • Quora - For interest-based contacts: deep-dive discussions, perspectives and debates, following experts in our areas of interest
  • Facebook - For friends and relationships: events, holidays, updates. I have zero input in this space, as I believe in a meaningful few contacts rather than hundreds. To date, I haven't invested my time in FB
  • Twitter - Pointers and thoughts from people whose interests and ideas we follow
  • Knowledge Sources - Reddit, hacker news, medium
  • Code Sources for Learning - Github, StackOverflow
  • Communication Channels - Skype, Whatsapp, Slack
We will observe more networks evolve, focused on:
  • Industry
  • Domain
  • Age Group
  • Relationships
  • Data Ownership
Happy Data Perspectives!!!

June 15, 2019

Lessons Learnt from Video Analytics Projects

These are lessons learned based on working on Video Analytics Projects.
  1. Customers get carried away with demos. Demos with fewer objects will look good, but you need to nail down how many object types you can train and how many you can detect. When you look at Tesla Autopilot videos, you can observe the detected entities: lanes (yellow), speed (white), type of vehicle (red), signals in that lane, alerts when the vehicle ahead has stopped, dimensions (length of vehicle), pedestrian detection/speed, and the boundaries of detected objects. You need to nail down the objects you want to detect; not everything you see can be counted.
  2. Object detection is not a photocopier kind of problem. If you spend X amount of time detecting one type of object, you need to spend roughly the same amount of time training for another object. I have observed customers assume it is a one-time training effort after which the system auto-learns.
  3. Object detection is one part of the problem. The type of camera matters: datasets from a fisheye camera and a mounted camera will differ, and you need to prepare and work on datasets for both of them.
  4. After object detection, object tracking becomes the next task. In real-time systems with people or cars, where a lot of movement is involved, tracking becomes the next important aspect after detection. Object re-identification (Re-ID) is a further challenge: recognizing a person or object when it re-appears in the frame or passes across cameras (a minimal sketch of the tracking idea follows this list).
  5. View video analytics as a form of translating video data into insights: objects counted, repeat objects, duration of objects in the frame. This data needs to be correlated with other forms of data to establish insights and correlations.
  6. Working with smaller video sets in a rapidly changing environment is a dataset-generation problem. The big tech companies - Microsoft, Amazon, Google, Tesla - have a lot of video datasets and data available. Working with smaller customers that have limited datasets and varied lighting and environmental conditions is always a challenge.
  7. Edge computing has a long way to go. With my limited experience, I feel the edge computing offerings from Intel, Google, and other providers have a long way to go.
  8. There is a lot of research activity and it is very tough to keep pace. Tons of papers and demo code are put up and added every day; it is difficult to track and keep knowledge updated on a daily basis.
  9. Video analytics will go together with other inputs: location information, coordinates, RFID information. Coordinates can be used to map the object location and to infer it in future frames.
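A minimal sketch of the frame-to-frame tracking idea from point 4, greedily matching detections to existing tracks by IoU overlap; the box format and threshold are assumptions, and real systems add motion models and re-identification features:

```python
# Minimal IoU-based tracking sketch (illustrative; not a production re-ID system).

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class IouTracker:
    """Keeps integer IDs for boxes that overlap from frame to frame."""
    def __init__(self, iou_thresh=0.3):
        self.iou_thresh = iou_thresh
        self.tracks = {}          # track id -> last seen box
        self.next_id = 1

    def update(self, detections):
        assigned = {}
        for det in detections:
            best_id, best_iou = None, self.iou_thresh
            for tid, box in self.tracks.items():
                if tid in assigned:
                    continue
                overlap = iou(det, box)
                if overlap > best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:   # no match above threshold -> start a new track
                best_id, self.next_id = self.next_id, self.next_id + 1
            assigned[best_id] = det
        self.tracks = assigned    # unmatched old tracks are dropped (kept simple)
        return assigned           # track id -> box for this frame
```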
Will Keep updating with more lessons!!!

Happy mastering Data Science!!!

June 14, 2019

9 Reasons why Business Intelligence and Database Developers should learn Machine Learning

Business Intelligence Insights can fuel Data Science Use Cases
  1. BI will give you trends on historical data; it will pinpoint your highlights and lowlights
  2. Effective data science use cases for the business come from converting these lowlights into proactive signals
  3. BI will shed light on trends by seasonality, product, location, etc. These values are effective feature variables for building your ML models
Database Developers are naturally good at Analyzing Data

I worked with terabytes of data - Microsoft Entertainment and Devices data - back in 2008. I remember all the insights we computed across supply chain, orders, returns, and warranty. When I started in data science, I realized how effectively that data could be analyzed with data science:
  • Sales Analysis - sales by product and region and units sold can be clustered to find insights (see the small sketch after this list)
  • Repair Analysis - products, repair types, and regions can be clustered to find the most frequently occurring issues
  • Sales, Repair, Warranty Renewal - from all the real-time and transactional data, we can build forecasting for repairs, sales, and warranty renewals
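A minimal sketch of the clustering idea above with scikit-learn; the column names and numbers are made up purely for illustration:

```python
# Toy sales-clustering sketch (hypothetical columns and values, illustrative only).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

sales = pd.DataFrame({
    "units_sold": [120, 95, 400, 380, 20, 15],
    "revenue":    [2400, 1900, 9000, 8700, 300, 250],
    "returns":    [3, 2, 25, 30, 1, 0],
})

X = StandardScaler().fit_transform(sales)          # put features on the same scale
sales["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(sales.groupby("cluster").mean())             # profile each cluster
```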
4. Database developers work with huge volumes of data, schema design, and index design for performance
5. The job involves aggregating data and writing procedures to implement business transactions
6. They handle load issues such as concurrency, deadlocks, and dirty data

These traits are helpful in setting up the data pipeline. Much of the work of generating insights can be achieved with TSQL itself; you do not really need to use pandas or start from scratch.

They can build pipelines, do the reporting, and talk about the numbers

7. Translate the insights/dimensions into feature variables
8. Communicate the insights to the business and frame them as data science use cases to solve
9. BI reporting + transactional reporting naturally provide a lot of visualization, which is also a must in data science for presenting your story

Today, transactional data reporting, business intelligence insights, and future predictions with data science are all needed to succeed in business. OLTP + OLAP + Data Science = all about data in business.

Note - I am not including Video, Text in this context. Data Science with Data (Numbers) is the scope of this post

#BusinessIntelligence, #MachineLearning, #Database, #TSQL, #ArtificialIntelligence, #DataPipeline

Happy Mastering Data Science!!!

June 12, 2019

Why Data Science Projects Fail ?

In many places I keep observing failed #ML implementations. Data science is here to stay, but there will be a bubble burst of "failed data science projects".

Some of the missed lessons I observed are:
  • There is no single model to solve everything - detection, classification, scoring, recommendations. It will be a mix of multiple models
  • A data science project is creative work; it needs data, fine-tuning, and re-training, doesn't fit neatly into deadlines, and is about making a system learn from data
  • The perfect model comes after iterations, not in the first iteration
  • Don't go for the AWS, Google, or Microsoft vision/NLP/AI tools in the first go. They are pay-per-use and good to start with, but build something on your own - you will have control over the internals and can improve it further
  • Production deployment - There are tons of tools out there; building the model is more important than deploying it
  • Data science projects and sales teams' visions often conflict. I have seen sales teams unable to sell AI products because they don't fully understand how AI fits into the portfolio
  • Data science is multiple perspectives = Data + Domain Knowledge + BI + Computer Vision + .... It's not just .fit and .predict. Learn the complete perspective; don't get carried away with one perspective.
Happy Mastering Data Science!!!




June 08, 2019

Day #259 - Setting up Kafka on my Ubuntu - Big Data Setup - Part IV

Finally setting up all the big data tools on my Linux machine. Rough steps and my reference notes.
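As a quick smoke test of a local install, something like the snippet below can be used; this assumes a broker already running on localhost:9092 and the kafka-python package, and is not the exact setup steps from these notes:

```python
# Quick smoke test for a local Kafka broker (assumes localhost:9092 and kafka-python).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello kafka")
producer.flush()

consumer = KafkaConsumer("test-topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)   # should print b'hello kafka'
    break
```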




Happy Data Thinking!!!!

Data Warehousing with Amazon Redshift

  • Redshift started from Postgres; 150+ features have been added over 4 years
  • Rebuilt and improved version: columnar storage, horizontal scaling, OLAP functionality added
  • Wrapped in the AWS ecosystem
The 5 steps to look at when creating a Redshift schema
  • Step 1 - Find Fact, Dimension Tables and the Record Count
  • Step 2 - Analyze and find the query where clause filter columns
  • Step 3 - Define the Sort Keys
  • Step 4 - Analyze the record counts, data to find out distribution strategy
  • Step 5 - Design Distribution Strategy
Summary of lessons from Link
Good tutorial on table design tuning - Link
The WHERE clause columns are the sort keys (date, partnumber, year, etc.) - same idea as TSQL index design - Link

Summary of Lessons and Key Steps

Step #1 - Demo Example Tables and Record Count

LINEORDER - 600,037,902
PART  - 1,400,000
CUSTOMER  - 3,000,000
SUPPLIER  - 1,000,000
DWDATE  - 2,556

Step #2 - Analyze and find the query where clause filter columns, Link

The join columns are:
LINEORDER - PK: lo_orderkey; FKs: lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_commdate
PART - PK: p_partkey (joins LINEORDER.lo_partkey)
CUSTOMER - PK: c_custkey (joins LINEORDER.lo_custkey)
SUPPLIER - PK: s_suppkey (joins LINEORDER.lo_suppkey)
DWDATE - PK: d_datekey (joins LINEORDER.lo_orderdate, lo_commdate)

Step #3 - Define the Sort Keys

Define the sort keys based on the query WHERE clauses; the primary and foreign keys used in those filters become the sort keys.

Table name Sort Key
LINEORDER lo_orderdate
PART p_partkey
CUSTOMER c_custkey
SUPPLIER s_suppkey
DWDATE d_datekey

Step #4 - Analyze the record counts, data to find out distribution strategy
  • The largest dimension table is PART; each table can have only one distribution key
  • LINEORDER is the fact table, and PART is the largest dimension; PART joins LINEORDER on its primary key, p_partkey
  • Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution key for PART so that matching values of the joining keys are collocated
Step #5 - Design Distribution Strategy

Based on the data, define the distribution strategy so that joined data is collocated and records are returned faster.

Table name Distribution Style
LINEORDER lo_partkey
PART p_partkey
CUSTOMER ALL
SUPPLIER ALL
DWDATE ALL

The final updated schema is listed in the link. Observe the SORTKEY and DISTKEY additions in the DDL (a simplified sketch below).
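A simplified sketch of what such DDL looks like, run from Python; the column lists are trimmed and the connection details are placeholders, not the tutorial's full schema:

```python
# Simplified Redshift DDL sketch executed via psycopg2 (placeholder connection details).
import psycopg2

ddl_statements = [
    # Fact table: distribute on the join key to the biggest dimension, sort on order date
    """CREATE TABLE lineorder (
           lo_orderkey  INTEGER NOT NULL,
           lo_partkey   INTEGER NOT NULL DISTKEY,
           lo_orderdate INTEGER NOT NULL SORTKEY,
           lo_revenue   INTEGER NOT NULL)""",
    # Largest dimension: same distribution key so matching rows are collocated
    """CREATE TABLE part (
           p_partkey INTEGER NOT NULL DISTKEY SORTKEY,
           p_name    VARCHAR(55) NOT NULL)""",
    # Small dimension: replicate to every node
    """CREATE TABLE dwdate (
           d_datekey INTEGER NOT NULL SORTKEY,
           d_date    VARCHAR(19) NOT NULL) DISTSTYLE ALL""",
]

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="***")
with conn, conn.cursor() as cur:
    for ddl in ddl_statements:
        cur.execute(ddl)
```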


Happy Learning new things!!!

Day #258 - Deep Learning - Non Convex Optimization

In convex optimization, any local optimum is also the global optimum, so there is effectively only one optimal solution to find. Non-convex optimization may have multiple locally optimal points; hence, finding the global minimum is very difficult.

What makes non-convex optimization hard?
  • Potentially many local minima
  • Saddle points
  • Very flat regions
  • Widely varying curvature
Examples of non-convex problems
  • Matrix completion, principal component analysis
  • Low-rank models and tensor decomposition
  • Maximum likelihood estimation with hidden variables (usually non-convex)
  • The big one: deep neural networks
How to solve non-convex problems? (a minimal SGD + momentum sketch follows this list)
  • Stochastic gradient descent
  • Mini-batching
  • SVRG
  • Momentum
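A minimal numpy sketch of SGD with momentum and mini-batching on a toy non-convex loss; the loss function and the hyperparameters are arbitrary choices for illustration:

```python
# Toy SGD-with-momentum sketch on a small non-convex loss (illustrative hyperparameters).
import numpy as np

def loss_grad(w, x):
    """Gradient in w of the per-sample loss (w*x - sin(w))^2."""
    residual = w * x - np.sin(w)
    return 2 * residual * (x - np.cos(w))

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

w, velocity = 2.0, 0.0
lr, beta, batch_size = 0.01, 0.9, 32

for step in range(500):
    batch = rng.choice(data, batch_size)        # mini-batching
    grad = np.mean(loss_grad(w, batch))         # stochastic gradient estimate
    velocity = beta * velocity - lr * grad      # momentum update
    w += velocity

print("final w:", w)
```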
One good slide from link

Ref - Link
Ref - Link
Ref - Link

Backprop - Link

Happy Mastering DL!!!

June 06, 2019

Machine Learning Chaos - Data Chaos

Often I keep hearing the responses below when building ML use cases. There are multiple missing dimensions in the judgement. Some of my observations from these discussions:
  • Data Unavailable - You cannot map a standard use case and expect those standard features to be part of your data. ML use cases today go with "I need all this data"; rather, we should look at "What model can I build with the available data?"
  • Data Insufficient - Data gets archived and deleted in most transactional systems. Building a model with the currently available data and improving it periodically is more important than waiting for 5 years of data
  • Model Accuracy - Don't compare ML with other software metrics. ML is learning from data - garbage in, garbage out - whereas software is code for functionality
  • Picking the Right Use Case - Before finding the right use cases, we need to understand the collected data and features. A gap between the use cases and the understanding of the available data will not get you the right use cases to solve
  • Data Pipeline - Building the first ML use case involves setting up a data pipeline for collecting newer features and handling changes to the transactional system. This needs a good analysis of the gap between the available features and the good-to-have features, along with the pipeline setup for adding newer feature variables
#MachineLearning #Data #Business #AI. Navigate through the chaos of Data and Features to build your ML Models.

Happy Learning!!!

June 05, 2019

Learning vs Compensation vs Titles

This question comes to me at different points in my career. The experiences we gather from different roles and domains accumulate and give us perspectives unique to each domain.

After a decade I had options to choose from:
  • Go forward in the same database space
  • Set up teams from scratch
  • Take a domain-based focus and build expertise
  • Professional services, automation, QA
  • Learn something new
Every challenge once you solve becomes boring. Everything is difficult until you find a way to do it. Passion without skills won't get you where you want to go.

Five years back I decided to reboot myself in the data science space. Every previous experience gave new perspectives to explore: building expertise in one area, solving problems from a domain perspective, working on creating unique IP, and learning the other tasks related to delivery.

Many times in my career I have received proposals and had discussions about going back to previous roles. From data science, I now get to hear about Big Data + data science skills. Even within data science, building expertise in video, numbers, or text needs deep dives and consistent focus over a certain period.

With close to two decades of experience coming up in a few years, the questions remain: how long to stay an individual contributor, how to find a role that uses all the previous experience, and how much to keep up with titles?

The primary aspect of my satisfaction is learning. After a few years, there could be something that would replace today's data science.

As I grow older, I have to pick and choose my areas of interest and focus. I have never found happiness focusing on titles or compensation alone, though they do have a priority after satisfaction. A job that gives you learning, exciting but not overwhelming challenges, a manageable work-life balance, and fair compensation is the best thing to look for and to stay satisfied with for the rest of your days. Sometimes you need titles to execute your strategy.

Compensation helps you meet your financial goals. The search is to find a role that gives you opportunities that meet your priorities and leverage your strengths. A career is a long-term thing. Keep learning until you find your dream role, be prepared for the role, and don't wait for titles to learn the role.


How to write a design doc, take feedback, and drive it to resolution, in a reasonable period of time
  • Competitive product
  • Papers Referenced
  • Potential Architecture
  • Prototype 
  • Demo
  • Next Steps
How to mentor an early-career teammate, a mid-career engineer, a new manager who needs technical advice
  • What would I do if I were in your situation? That is how I look at it and advise. We cannot learn everything; we have to pick and focus on the few things that are important to us.
How to influence another team to use your solution instead of writing their own
  • View it as a joint success. Sometimes you have to give credit to make things work. 
How to get other engineers to listen to your ideas without making them feel threatened
  • You are not selling your idea; you are conveying how it is done across other industries and companies. You need a working demo and a deep dive into its benefits. It may not work on the first go, but over a period of time trust will develop
How to craft a project proposal, socialize it, and get buy-in to execute it
  • For everything we pick, someone has already experimented with it, or it could be something ongoing. Give the context and the potential value, and take it to the next level.
Keep going!!!

June 04, 2019

Day#257 - 9 Steps to Build your first Machine Learning Use Case

1. Data Analysis from a Domain Perspective - Explore your data and understand the entities, transactions, and business flows captured
2. Data Science Use Case Review from a Customer / Sales / Business Opportunity Perspective - Read up and connect with stakeholders to find the use cases the business needs
3. Mapping of Available Data to Data Science Use Cases - Conduct a feasibility study to map the use cases to the data captured
4. Pick and Choose a Use Case Based on Data and Business Impact - With #1 and #2, find the sweet spot to hit for your first use case
5. Develop the Model, Build Features, Test and Validate Accuracy
6. Demonstrate the Model, Explain the Data and Features - Demo it to business users, sell your use case
7. 70-80% Accuracy Is Good Enough to Get Started - Quality improves over time as more features are captured; lay out a plan to add more features
8. Build the End-to-End Data Pipeline and Model Training Pipeline, Deploy as an API / Expose the Output as a Report or API Response (a minimal sketch of steps 5-8 follows this list)
9. Include Model Re-training to Keep Up with Changing Data Dynamics - Refresh the model at periodic intervals
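A minimal sketch covering steps 5-8: train a model, check accuracy, persist it, and expose it as an API. The dataset, model choice, and endpoint are assumptions for illustration, not a prescribed stack:

```python
# Minimal train-and-serve sketch for steps 5-8 (toy dataset, illustrative choices).
from flask import Flask, request, jsonify
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 5: develop the model and validate accuracy
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 8: persist the model and expose it as a small API
dump(model, "model.joblib")
served_model = load("model.joblib")
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = served_model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```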

The Journey Continues!!!

June 03, 2019

Day #256 - Experimenting with Linux on Windows

While trying to install Linux (dual boot), I corrupted my laptop setup. To recover:
  • Created a Dell Recovery USB using Service Tag
  • Using this link 
  • Used diskpart tool
  • Cleaned the entire disk
  • Used the convert command to create a basic MBR record
Happy Re-Learning!!!