Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): March 2020

March 29, 2020

Corona Stats - As of March 28th

Data Source - Link (data as of March 28th)
Case Stats and Growth Trend
Start Date - 2019-12-31

Day 69 - 102133
Day 81 - 213258
Day 84 - 305270
Day 87 - 417061
Day 89 - 528019

Summary

1st 100K - 69 Days
2nd 100K - 12 days
3rd 100K - 3 days
4th 100K - 3 days
5th 100K - 2 days
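
The summary above can be reproduced from the day counts with a few lines of Python. This is only a minimal sketch: the function name days_between_milestones is my own, and the figures are copied from the list above.

```python
# (day number since the 2019-12-31 start date, cumulative cases) from the stats above
case_milestones = [(69, 102133), (81, 213258), (84, 305270), (87, 417061), (89, 528019)]


def days_between_milestones(milestones):
    """Return how many days it took to reach each successive milestone."""
    intervals = []
    previous_day = 0  # counting starts at the start date
    for day, _count in milestones:
        intervals.append(day - previous_day)
        previous_day = day
    return intervals


print(days_between_milestones(case_milestones))  # prints [69, 12, 3, 3, 2]
```

The same function works for the death stats by passing the (day, deaths) pairs instead.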

Death Stats and Trend

Day 1 - 2019-12-31
Day 76 - 5407
Day 83 - 11251
Day 86 - 16365
Day 88 - 20991

Summary

First 5K - 76 Days
Second 5K - 7 Days
Third 5K - 3 Days
Fourth 5K - 2 Days

Case Distribution by Country

Fatality

I hope we get through this challenge and recover soon. With global lockdown measures, I hope we observe a downward trend in the coming weeks.

Good Read - Response to COVID-19 in Taiwan Big Data Analytics, New Technology, and Proactive Testing

Key Points

Specific approaches for case identification, containment, and resource allocation

Databases Leveraged

Immigration and customs database for travel to Risk Areas
Health insurance database for proactively seeking out patients with severe respiratory symptoms

Risk Categorization

Low risk (no travel to level 3 alert areas)
Higher risk (recent travel to level 3 alert areas)

Inference

Real-time alerts during a clinical visit based on travel history and clinical symptoms to aid case identification
People in self-quarantine are tracked through their mobile phones

Good Read - Lessons from Italy’s Response to Coronavirus

Key Summary Points (Implementation)

Avoid partial solutions
Learning is critical

Key Summary Points (Lessons Learnt)

Extensive testing
Proactive tracing
Home diagnosis
Monitor and protect health care and other essential workers

Corona Perspectives (July 5th 2020)

Covid Cycle

Unlock Cycle

IT Impact

We need to plan carefully and bridge the gaps in the economy, unorganized sectors, and poorly performing domains. Hope the new normal provides more innovation and newer job opportunities.

From NPTEL Lecture Link

Webinar 2 - Link

Keep thinking!!!

March 21, 2020

Corona impact in Retail

Essentials, medical and food supplies, and eCommerce will see a spike in demand. Clothing / Fashion / Toys / Luxury brands / Smartphones and other non-essentials will take a hit, leading to reduced sales / temporary closure of stores.

Business Impact

Reduced Store Traffic
Revisit on Sales Forecasts
Temporary Closure of poorly performing stores
Supply chain / Manufacturing Delays / Reduced Demands

Alternatives

Omni Channel Support
Contactless delivery
Equip Store associates with Sufficient safety procedures
More Sanitizing efforts for store associates/customers
A shift for e-commerce mode
Use offline data for online personalization
Stock up / Align towards products in demand (Healthcare / Medical / Essentials etc)

It will take time to recover and reset/fix the entire supply chain and manufacturing, and to overcome the direct/indirect job losses and economic impact. Hoping it will be handled well and things will come back to normalcy soon.

7 COVID-19 crisis management tips for small businesses

1 Provide reassurance
2 Develop remote service delivery
3 Negotiate short-term relief
4 Ensure supply chain continuity
5 Provide discounts
6 Proactive hygiene practices
7 Protect employee wages
https://t.co/xNmGIA4a7r pic.twitter.com/hYwigt50Cq
— Vala Afshar (@ValaAfshar) March 23, 2020

These Struggling Retailers May Suffer Their Final Blow From The Coronavirus Lockdown

Challenges

Rent and other fixed costs of running
Employee Salaries
Cash on their balance sheet

Be Positive!!!
Keep Thinking!!!
Practice Social Distancing!!

Retail AI Landscape

Survey of Retail Landscape, Use cases, Startups

Keep Thinking!!!

March 20, 2020

Day #334 - Lessons Learnt in evaluating SQL 2016 Performance Features

Sharing my lessons on proposing SQL In-Memory table implementation for the product I worked with. I worked with Sunil Agarwal from the SQL product team to evaluate the features, benefits, migration approach, etc.

Happy Learning!!!

SQL 2019 - Interesting Features

SQL 2019 - Interesting Features (Link)

I have a lot of bias for SQL Server. Some SQL 2019 features are awesome. The things I liked are
Query heterogeneous databases with Polybase (the Polybase feature was there in 2016 too, but it supported fewer databases than I see now)

Polybase provides this in SQL Server 2019 through a concept called an EXTERNAL TABLE.
External tables are just like SQL Server tables except SQL Server only stores the metadata of the table definition.
Polybase uses ODBC drivers to connect to sources such as Oracle, Teradata, MongoDB, and SQL Server.

SQL Datalake capability

Support for SQL, NoSQL
Support for ML Engine
Support for HDFS

These are promising features. Obviously, there will be some product limitations in the early stages.
Highlights

Support for unstructured data
Heterogeneous database support
Schema on read is achieved with external tables
SQL wrapper to query both different databases and unstructured data
Integration with HDFS
ML APIs / Visualization features

Very good move to position SQL as an integration database engine for heterogeneous / unstructured / structured data

Happy Learning!!!

Interesting Product - Intelligent shopping cart

Another interesting product - an intelligent shopping cart

Key features are

instore navigation
store promotions
product suggestions
scans and weighs products
displays a running tally of purchases
pay on the spot with the cart

Technical Implementation - Link

Product Scan (Barcode / RFID)
pay/card swipe attached
UI to display items scanned / list products

AI Solutions

Object Detection
Weight + Object detection for counting

Cons / Concerns

Cost of the cart/maintenance
Accuracy of items detected
Accuracy of the count of items for smaller products

Keep Thinking!!!

March 19, 2020

Day #333 - Deep Learning Guidelines

CI / CD, DL frameworks, and Buy vs Develop are different sets of challenges. The more you learn, the more you feel you have a lot to learn :). Doing, debugging, and testing are all part of learning. Keep going!!!

Different levels of learning are required for a different set of challenges.

Mastering Keras vs Pytorch vs Tensorflow
Knowing Advanced features of Data Pipelines / Porting in Edge Devices
Building the end-to-end flow of Edge Analytics -> Data Consolidation -> Reporting
Deployment of this overall end to end solution
Accuracy / Understanding real-world challenges and next incremental steps

This link provides a good guideline

The ML tools landscape is very useful

Key Notes
Step #1 - Data

Data Storage
Data ETL Process (Workflow / Async Process)
Data Labelling (Raw Data -> Modelled)
Data Versioning

Step #2 - Development / Training

DL Frameworks
Source code management
Store & Retrieve Results
Distributed Training

Step #3 - Deployment

Build Tools
Web Deployments
Monitoring predictions
Edge Devices / Custom Hardware Deployment

DL Frameworks

Key Notes

Caffe - C++ based (Fintech used Caffe)
Tensorflow - Google (Mobile, JS, Scalable Deployment) - Abstraction - Computational Graph
Keras - Wrapper on Tensorflow
PyTorch - FB product

ML Code Management for Training / Deployment / Serving

Key Lessons

Training System (Model Development)
Production System (Ready to use Model, Setup)
Serving System (Web App or anything that serves model)

On all these three levels there is a certain set of tests run to validate every layer - Train / Model / Production Serving Tests

Infrastructure (Buy vs Build)

Deep Learning Optimization

Data Versioning

Key Lessons

Unversioned Data (file system) (L0)
Version with a snapshot - Daily data (L1), Data backup with Date
A mix of assets and code (L2), JSON or any other labeled storage
L3 - Specialized solution - DVC, Pachyderm, Quill
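
To make level L1 (version with a dated snapshot) concrete, here is a minimal sketch using only the standard library. The function name snapshot and the folder layout (one folder per date) are my own assumptions, not something taken from the course; specialized tools like DVC do far more than this.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def snapshot(data_file, snapshot_root, day=None):
    """Copy data_file into snapshot_root/<YYYY-MM-DD>/ and return (path, sha256)."""
    day = day or date.today()
    target_dir = Path(snapshot_root) / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(data_file).name
    shutil.copy2(data_file, target)
    # The content hash lets us tell later whether two snapshots hold the same data
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    return target, digest
```

Storing the hash next to the training results is one simple way to tie a model back to the exact data it was trained on.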

#fullstackdeeplearning Yangqing Jia .. The full stack pic.twitter.com/MwfTzPoWJx
— Rahel Jhirad (@RahelJhirad) August 5, 2018

Training Neural Nets: a Hacker’s Perspective
Common Coding Mistakes

The incorrect shape of tensors
Preprocessing inputs incorrectly
Incorrect loss function
Numerical computation errors (NaN)
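
Two of the mistakes above (wrong tensor shapes and NaN values) can be caught early with cheap checks before a batch reaches the model. The sketch below is framework-agnostic and uses plain nested Python lists as a stand-in for tensors; the helper names shape_of, flatten and check_batch are my own.

```python
import math


def shape_of(nested):
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


def flatten(nested):
    """Yield every scalar inside a nested list."""
    if isinstance(nested, list):
        for item in nested:
            yield from flatten(item)
    else:
        yield nested


def check_batch(batch, expected_shape):
    """Raise ValueError on a shape mismatch or on NaN values."""
    actual = shape_of(batch)
    if actual != expected_shape:
        raise ValueError(f"expected shape {expected_shape}, got {actual}")
    if any(math.isnan(value) for value in flatten(batch)):
        raise ValueError("batch contains NaN values")
```

In Keras, PyTorch or Tensorflow the same idea is usually written as assertions on the tensor's shape attribute plus the framework's own NaN check.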

Troubleshooting Deep Neural Networks

Happy Learning!!!

Distributed Systems - Session #3 - Aurora

At times I did not feel connected to the session. It needs a lot of focus and patience to stay engaged :)

Key Summary points

Amazon early offering EC2
Rented out VMs to customers
VMM (Virtual Machine Monitors) that run/manage EC2 instances
EC2 good for stateless web servers
S3 - Scheme for storing large chunks of data (Periodic Snapshots)
Disks for EC2 instances - Fault Tolerance (EBS)
EBS (Elastic Block Store) - Looks like a hard drive to EC2 instances
Databases on EBS send a large volume of data over the network
Amount of writes on Network Storage System
CPU / Disk space consumption
EC2 / EBS are in same availability zone
Transaction & Crash Recovery
Transaction (Sequence of operations / commands / atomic / ex- bank transfer money between accounts)
Reads page from disk
Make Changes in local cache
Then write changes to disk
Log entries describe the transaction
Three log records - Modify Operation, Old Value, New Value
Aurora is based on MySQL
RDS (Database replicated in multiple availability zones)
All the transactions mirrored to other databases (EBS Servers)
Multiple copies managed and updated to keep everything in sync
Read / Write Quorum will overlap
Voting does not work for deciding which server to read from
These systems have version numbers
Readers take the copy with the highest version number
Split database into replicas
Data Sharding
Data across protection groups
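
The quorum points above (read and write quorums overlap, readers pick the highest version number) can be checked with a small simulation. This is a toy sketch with my own function names and illustrative replica counts, not Aurora's actual protocol: it tries every possible write set and read set and checks whether the reader always sees the latest write.

```python
from itertools import combinations


def write(replicas, targets, value, version):
    """Store (version, value) on the replicas that acknowledged the write."""
    for i in targets:
        replicas[i] = (version, value)


def read(replicas, targets):
    """Read from the given replicas and keep the copy with the highest version."""
    version, value = max((replicas[i] for i in targets), key=lambda copy: copy[0])
    return value


def every_read_sees_latest(n, w, r):
    """Check all write sets of size w and read sets of size r over n replicas."""
    for write_set in combinations(range(n), w):
        replicas = [(0, "old")] * n
        write(replicas, write_set, "new", version=1)
        for read_set in combinations(range(n), r):
            if read(replicas, read_set) != "new":
                return False
    return True


print(every_read_sees_latest(n=3, w=2, r=2))  # prints True: r + w > n, quorums overlap
print(every_read_sees_latest(n=3, w=1, r=1))  # prints False: a read can miss the write
```

The overlap guarantees that at least one replica in every read set holds the newest version, and the version number is what lets the reader recognise it.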

Happy Learning!!!

March 18, 2020

Distributed Systems - Session #2

I paused it a lot as I didn't really get involved much but finally managed to complete it.

Key Lessons

Go lang examples for threading, locking, RPC, Typesafe and memory safe, Garbage Collected
Threads - Tools to manage concurrency in programs
Stacks are within address space of the program
I/O Concurrency - Overlapping the progress of different activities (waiting / executing)
Parallelism - Parallelize CPU / IO cycles / routines
Process is a single program / single address space. Inside process there are multiple threads
Process -> memory area -> routines sit inside the process
Process implemented by the operating system
Thread challenges - Sharing data
Mutex / Locks for shared data
Data Access - Managing Locks / Deadlocks / Starvation / Blocking
Channels (Go Lang) - Send data between threads
WaitGroup, Sync.Cond
Webcrawlers design for parallel processing using threads
Handling concurrency / multiple parallel threads / optimum network capacity utilization
Remember doing SSIS ETL parallel tasks for Data pull

A multi-threaded Web crawler implemented in Python
Crawler
Multi-Threaded Crawler in Python
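
To connect the lecture notes (threads, a lock around shared data, waiting for workers to finish), here is a minimal Python sketch of a concurrent crawler. It is not the code from the lecture or the links above: FAKE_WEB and fetch are hypothetical stand-ins so that it runs without network access.

```python
import threading

# Hypothetical in-memory "web": page -> links found on that page
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}


def fetch(url):
    """Stand-in for an HTTP fetch: return the links found on the page."""
    return FAKE_WEB.get(url, [])


def crawl(start):
    """Crawl from start with one thread per newly discovered page."""
    seen = {start}
    lock = threading.Lock()  # protects the shared `seen` set

    def visit(url):
        children = []
        for link in fetch(url):
            with lock:
                if link in seen:
                    continue
                seen.add(link)
            thread = threading.Thread(target=visit, args=(link,))
            thread.start()
            children.append(thread)
        for thread in children:
            thread.join()  # wait for the sub-crawls, similar to a WaitGroup in Go

    visit(start)
    return seen


print(sorted(crawl("a")))  # prints ['a', 'b', 'c', 'd']
```

Checking and updating seen inside the same lock is what stops two threads from crawling the same page. A real crawler would also cap the number of threads to use network capacity sensibly.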

Happy Learning!!!

Staying updated in Data science - My 5 Lessons

Follow Reddit, tweets, LinkedIn news, analytics blogs, links, Lex Fridman interviews, and updated Stanford / MIT / Cornell courses
Look at Kaggle kernels, understand feature variables and newly built features. Learn domain-specific findings
Read research papers and try to look for techniques in video/text/audio projects which you can reapply
Look at Github examples and code them in your free time. This helps to learn coding practices / best practices
A lot of industry-specific products we can find by digging deep on AI technology and product landscape. Top 100 AI companies, AI product blogs, etc..

Teach, blog in different mediums. This helps to learn, gather different perspectives. If you have observed technology and know the underlying pattern/architecture you can better connect the product, purpose, and applications of the tool.

During ML interviews, I found most of the interviewing folks were 6 to 7 years younger than me. I came from DB / BI to the AI world. It feels good to continue to code, coach, and teach a younger set of folks.

Good Read (Link)

Reading Research papers

Happy Learning!!!

Data Perspectives

Different perspectives to consider when choosing the right database

Strict data types - Schema on write
Schemaless data - Schema on read
Read-only immutable data
Eventually consistent data
Dirty read vs Committed data
Multi-version concurrency control
Replicate data based on logs
Replay committed logs
Data sharding
High reads consistent data - RDBMS
High writes low reads - HBase, Cassandra
Document-based storage - Mongodb, Couchdb
CAP, ACID Properties
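
The first two perspectives, schema on write vs schema on read, are easier to see in code. This is a minimal sketch with a made-up two-column schema; SCHEMA, insert_strict and read_with_schema are hypothetical names for illustration only.

```python
import json

SCHEMA = {"id": int, "name": str}  # hypothetical table definition


def insert_strict(table, row):
    """Schema on write: reject rows that do not match the declared columns and types."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"expected columns {sorted(SCHEMA)}, got {sorted(row)}")
    for column, expected_type in SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    table.append(row)


def read_with_schema(raw_lines):
    """Schema on read: raw JSON text is stored as-is and interpreted at query time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {"id": int(record["id"]), "name": str(record.get("name", ""))}
```

With schema on write, bad data is rejected at insert time. With schema on read, anything can be stored and the cost of interpreting it moves to every reader.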

Things I Wished More Developers Knew About Databases

Want to Debug Latency?

I wrote an initial draft on the things I wished more developers knew about DBs. It touches a variety of topics: write skews, external consistency, clock skews, database-generated IDs, nested transaction issues, caches & more. Is there anything you wished more devs knew about DBs?
— Jaana Dogan (@rakyll) April 13, 2020

Almost similar and deep-dive techniques from the tweet conversation

Read heavy vs write heavy. Insert vs updates. Vacuuming
Replication or not, transaction logging, why indexes matter, performance tuning, i/o scheduler, unicode, gender isn't binary
Locks, cache effects, isolation levels
IO bound vs network bound, especially in the situation of replication, scaling strategy, concurrency vs distributed.
Materialized views, and the dangers of invalidating them unexpectedly.
Connection pool, scaling techniques to handle distributed application / system, improve performance, optimization of query etc.
I'd be interested in how this applies to a distributed system. Concurrency (specifically MVCC), connections, DB threading, backpressure handling
Disk storage implementation and optimization

Keep Thinking!!!

Analytics Leaders

There are three types of Leaders in my perspective

Technical Leaders - Coming up with new strategies/solutions, publishing papers, case studies. They propel/push the limits of tech to the next level. It takes time, effort to analyze, perform experiments and publish the observations.
Business Leaders - Able to find business use cases that can be solved with AI. Mapping relevant AI use cases for business/domains
Practitioner Leaders - Early Adopters to apply the techniques in solving the business problems, experimenting and leveraging different techniques, papers and newer approaches to solve business problems.

Keep Thinking!!!

SQL Performance Tuning & Coding Guidelines

This vacation was useful for finding some of my prior work / presentations. Sharing some of my earlier SQL Performance Tuning slides, which I did with Balmukund from the SQL Product Support Team.

Happy Learning!!!

March 16, 2020

stitchfix Blog Post - Data Strategy for Data Science

This post provides insights into the Data Science strategy at stitchfix

Problem Solving Approach (Use Cases - Data - Models)

Step #1 - Business Use Cases -> Finding Relevant Data -> Providing Data with ETL
Step #2 - Data - Multiple Models
Step #3 - API to consume results and use data for decision making

Key Lessons

Availability of Raw Data
Building ETL for data updates
Data Pipelines for Feature Engineering
Different Data Science Algos for Algorithms
Data Science use cases driven from the business context

Data Demands

Raw Data Access (Pull Everything to a Data lake)
Data updates / Deletes (Data lake updates with events)
Feature variables (Custom ETL to select, transform data from raw data)

Experimentation

A / B Testing
Validating with real-time results
Ongoing correction of models
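
For the A / B testing point, one common way to validate a result is a two-proportion z-test on the conversion rates of the two variants. The sketch below is my own illustration using only the standard library; I am not claiming this is the test used in the stitchfix post.

```python
import math


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the assumption that both variants convert equally
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

A small p-value suggests the difference between variant A and variant B is unlikely to be noise. The test assumes large samples and breaks down when the pooled rate is 0 or 1.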

Connecting Data and Science

Overlapping functions with Domain, Data and Data Science Knowledge
A lot of Experimentation

Algorithms (Data Science use cases)

Style Recommenders (Recombining Attributes from existing styles adding feedback), Developing Design with a certain set of attributes
Warehouse Assignment (Shipping cost, shipping time, inventory match)
Inventory Forecast (Demand, Unit Price, Total Cost, Ordering Cost, Carrying cost, Season, Recently emailed etc)
Fashion Design Algorithms
Buying Algorithms
Engagement Algorithms
Messaging Algorithms
Capacity Optimization
Assignment Optimization
Network Optimization
Visitor Qual Algorithms
Latent Size Algorithms
Latent Fit Algorithms
Batch Picking Algorithm
Global Optimizations
Pick Path Algorithm
Virtual Warehouses
Sizebreak Algorithms
Planning Algorithms
Assortment Algorithms
Replenishment Algorithms

Use Case Categorization

Customer Context - Style Recommenders, Fashion Design Algorithms, Latent Size Algorithms, Latent Fit Algorithms
Retailer Context - Business Use Cases (Inventory Forecast, Replenishment Algorithms)
Warehouses Use Cases - Assignment Optimization, Allocation
Clients Use Cases - Style recommendations, Demand Predictions
Optimize Supply Chain - Warehouse Assignment, Pick Path Algorithm

Data Science - Algorithm Demands

Assortment Algorithms - Apriori / Market Basket Analysis
Targeting Algorithms - Recommendations
Replenishment Algorithms - Forecasting
Allocation Algorithms - Resource Allocation
Virtualized Warehouses - Demand Forecasting
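
For the Apriori / Market Basket Analysis item, the core idea is counting how often items are bought together. The sketch below only covers item pairs (support and confidence) and skips the candidate pruning that full Apriori does; the function name and the 0.5 default threshold are my own choices.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return {(a, b): {"support", "confidence"}} for frequent pairs, where the rule is a -> b."""
    total = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / total  # share of baskets containing both items
        if support >= min_support:
            confidence = count / item_counts[a]  # share of baskets with a that also have b
            rules[(a, b)] = {"support": support, "confidence": confidence}
    return rules
```

Pairs with high support and confidence are candidates for assortment decisions, such as which products to stock or display together.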

Key Lessons

Data Science Use cases in Retail Space
Data Science Use cases in Supply Chain
Data Science Use cases in Fashion, Ecommerce Segments
Data Lake Strategy for Data Science
Bird's Eye view for picking right use cases

Keep Thinking!!!

March 15, 2020

Interesting Product - https://www.glisten.ai/

TechCrunch posted an article on this. A great example of Computer Vision in Fashion. A very niche product idea. In fact, I am doing my prototypes for a similar idea :)

My Analysis of Models / Approach Involved

Multi-Label Detection - Clothes Combination
Landmark Detection - Bounding boxes for Upper Body, Lower Body
Pattern Detection - Use the extracted boxes and detect patterns
Color Detection - Find the maximum color present in the detected portion
Gender Detection Models - Extract Face, Identify Gender
OCR - Scan the text content, search for product attributes for Scanned product
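
The color detection step (find the maximum color present in the detected portion) can be sketched as counting pixels in coarse color buckets. This is my own simplified take with pixels as plain (r, g, b) tuples; the bucket size of 32 is an arbitrary assumption, and I am not suggesting this is how the product does it.

```python
from collections import Counter


def dominant_color(pixels, bucket=32):
    """Return the centre of the most frequent colour bucket among (r, g, b) pixels."""
    counts = Counter((r // bucket, g // bucket, b // bucket) for r, g, b in pixels)
    (r_bucket, g_bucket, b_bucket), _count = counts.most_common(1)[0]
    half = bucket // 2
    return (r_bucket * bucket + half, g_bucket * bucket + half, b_bucket * bucket + half)
```

Bucketing groups slightly different shades together, so lighting noise does not split one dress color into many tiny counts.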

A very interesting niche ML Idea :)

Happy Learning!!!

March 13, 2020

How to Track Possible Secondary and Tertiary Contacts of Infected Corona Patient

Tracking Potential Patients

Identify places visited by infected patients based on Google history, GPS tracking, and rides taken
Identify their movements mapped to mobile signals and the nearest mobile signals. This will also highlight potential mobile numbers in the vicinity
Continuously monitor the key factors, screening Secondary and Tertiary Contacts at regular intervals
Large Scale Screening / Complete Lockdown / Ban on travel are the only possible options to control the pace of the virus

Analyze COVID death rates

Death rates reported from 2019 Jan to June 2019
Death rates reported from 2020 Jan to June 2020
% increase in the death rate
Number of reported COVID deaths
Number of non-COVID deaths
Match by age factor/ gender factor
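
The comparison above boils down to simple arithmetic once the counts are available. A minimal sketch is below; the function and key names are my own, and the inputs are total deaths for the same months of each year. No real figures are included here.

```python
def death_rate_change(deaths_2019, deaths_2020, reported_covid_deaths):
    """Compare deaths for the same period in 2019 and 2020."""
    increase = deaths_2020 - deaths_2019
    return {
        "increase": increase,
        "pct_increase": 100.0 * increase / deaths_2019,
        "non_covid_deaths_2020": deaths_2020 - reported_covid_deaths,
        # Part of the increase that the reported COVID deaths do not account for
        "increase_not_reported_as_covid": increase - reported_covid_deaths,
    }
```

Running the same function per age group or gender gives the matched comparison from the last point in the list.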

To know the exact impact we need to compare, analyze by different dimensions and identify the insights

Keep Thinking!!!

Computer Vision checklist for Security Camera

Safety - Data should be safe even in case of theft. Opt for Network storage, not local storage
Sight - The angle of placement for maximum coverage is important. It should align with already developed computer vision models so they can be reused. When camera viewing angles are very different from the data that the existing models were trained on, we need more sophisticated and powerful algorithms to compensate for these shortcomings.
Availability - An alert mechanism should be in place in case of an outage due to network/power. Leverage alternate options (battery / wifi / sim card)
Models - Deploy both image processing, edge analytics, simple to complex models. Real-world situations need more than one model / algo to validate
Scalability - Throughput needs to be there, avoiding duplicate images, detecting only when there is a change of state, sending inferences of edge analytics. Send optimal data not all the data
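
The scalability point (detect only on a change of state, send optimal data) can be sketched as simple frame differencing. Frames here are flat lists of grayscale pixel values, and the threshold of 10.0 is an arbitrary assumption of mine; a real camera pipeline would tune it and likely use an image library.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def frames_to_send(frames, threshold=10.0):
    """Yield (index, frame) only when a frame differs enough from the last one sent."""
    last_sent = None
    for index, frame in enumerate(frames):
        if last_sent is None or mean_abs_diff(frame, last_sent) > threshold:
            last_sent = frame
            yield index, frame
```

Comparing against the last frame that was sent, rather than the previous frame, avoids missing slow changes that build up over many frames.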

Leverage models built for the real world - Link

Keep Thinking!!!

March 12, 2020

Distributed Systems - Session #1

Key Notes

Storage, Big Data, File Sharing
The infrastructure that requires more than one computer
High Performance, Parallelism
Fault Tolerance - Two computers do the same thing. If one fails, the other picks up. Availability / Recoverability, Replication
Systems are inherently physically distributed
To achieve security goals
Handle unexpected failure patterns (Partial Failures)
Challenges are Concurrency, Partial Failures
Academic Curiosity -> Real-world Examples
Lectures, Research papers for ideas, implementation details, labs, exams
Map Reduce - The Map function runs on each of the input files, so there is obvious parallelism available. The output is a list of Key-Value pairs. Maps -> Intermediate Output -> Reducers. A reducer collects all instances of a key across all maps.
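
The Map -> Intermediate Output -> Reduce flow is easy to see with the classic word count example. This is a single-process sketch with my own function names; a real MapReduce system runs the map calls on many machines in parallel.

```python
from collections import defaultdict


def map_fn(filename, contents):
    """Map: emit a (word, 1) pair for every word in one input file."""
    return [(word, 1) for word in contents.split()]


def reduce_fn(key, values):
    """Reduce: combine all the values collected for one key."""
    return sum(values)


def map_reduce(inputs):
    """Run map on every input, group the intermediate pairs by key, then reduce."""
    intermediate = defaultdict(list)
    for filename, contents in inputs.items():
        for key, value in map_fn(filename, contents):
            intermediate[key].append(value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


print(map_reduce({"a.txt": "x y x", "b.txt": "y z"}))  # prints {'x': 2, 'y': 2, 'z': 1}
```

Since each map call only looks at its own file, the map phase needs no coordination, which is where the obvious parallelism comes from.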

Happy Learning!!!

March 11, 2020

Day #332 - Dress Color Detection

After evaluating a few projects, this Git project was helpful - link

Approach
1. Extract RGB composition for input images
2. Use the pre-trained samples available for White, Black, Red, Green, Blue, Orange, Yellow and Violet
3. KNN to compute nearest Color Match between 1 and 2
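
The three steps can be sketched with a tiny nearest-neighbour classifier. This is not the linked project's code: the reference RGB values below are my own assumed samples (one canonical value per color), and the input is a single average RGB value rather than a full histogram.

```python
from collections import Counter

# Assumed reference samples: one canonical RGB value per colour label
TRAINING = [
    ((255, 255, 255), "white"),
    ((0, 0, 0), "black"),
    ((255, 0, 0), "red"),
    ((0, 255, 0), "green"),
    ((0, 0, 255), "blue"),
    ((255, 165, 0), "orange"),
    ((255, 255, 0), "yellow"),
    ((238, 130, 238), "violet"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_color(rgb, k=1):
    """Return the majority label among the k nearest reference samples."""
    nearest = sorted(TRAINING, key=lambda sample: squared_distance(sample[0], rgb))[:k]
    votes = Counter(label for _rgb, label in nearest)
    return votes.most_common(1)[0][0]


print("Detected color is:", classify_color((220, 20, 40)))  # prints Detected color is: red
```

With only one sample per color, k=1 is the sensible setting. With several samples per color, a larger k lets the neighbours vote.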

Input Image -

Detected color is: red

Happy Learning!!!

March 10, 2020

Amazon building Retail Stores AI Tech

Amazon Journey

Amazon used its lessons learned in building scalable infra to launch AWS
Amazon built a complete cloud stack / big data tools for different domains
Amazon leveraged AI with Alexa, Retail Stores cashier-less stores

Retail Stores with AI - Cashierless stores are a product by themselves. Amazon is now pitching this as a disruptive retail product. It will compete against traditional RFID and people-counting software providers like Sensormatic and Checkpoint. Amazon will get more insights, and this will be a good testbed to disrupt the retail sector.

Keep watching this space https://justwalkout.com/

Data Collected
We only collect the data needed to provide shoppers with an accurate receipt. Shoppers can think of this as similar to typical security camera footage. Shoppers enter the store with a credit card, grab what they want and just walk out - it's that easy.

Possibilities?

Unique Shopper Identification
Shopper identification from Credit Card / SSN
Face Identification
Product Identification / Product Tracking

Will people still be working in stores with Just Walk Out technology?
Yes. Retailers will still employ store associates to greet and answer shoppers' questions, stock the shelves, check IDs for the purchasing of certain goods, and more - their roles have simply shifted to focus on more valuable activities.

Inference

Elimination of Cashiers
Cashiers may be 10~15% of the staff per store, so this will result in cost savings
Customer Assistants for Product queries / Search (Chatbots may complement them)

What are the alternative options to challenge this move? Retailer AI To-Do List?

Build your own Retail AI portfolio
Collaborate with Other Partners (Google / Microsoft / Nvidia)
Build Data Expertise, Long term plan for Edge Analytics
Reduce Investments on RFID, Proportionately increase AI investments
Invest in AI for Inventory, Traffic, Loss Prevention Solutions

Keep Thinking!!!

AI - Social cause Use Cases - Papers / Approach / Tech Analysis

Aggression prediction at Rehabilitation Centres (Video + Audio Analytics)
War crimes Analysis from Satellite Images (Video Analytics)
Suicide hotline automated call analysis and forwarding (Audio Analytics)

AI for social cause - AGDC: Automatic Garbage Detection and Collection

Very good paper - AGDC: Automatic Garbage Detection and Collection

Key Summary