"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

November 28, 2020

Weekend Reading - Resilient Distributed Datasets

 Paper Link 

Key notes
  • Resilient Distributed Datasets (RDDs) - a distributed memory abstraction that lets programmers perform in-memory computations on large clusters
  • Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. 
  • RDDs are a good fit for many parallel applications because these applications naturally apply the same operation to multiple data items. 
  • RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory and control their partitioning to optimize data placement
  • If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition
  • Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs
  • RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage
  • The main difference between RDDs and DSM is that RDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads and writes to each memory location
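
To make the lineage idea concrete, here is a minimal sketch in plain Python. It is not Spark's API: the ToyRDD class, its compute method, and read_stable_storage are hypothetical names invented for illustration. Each dataset is read-only, is built only from stable storage or from another dataset, and can rebuild a single lost partition by replaying its lineage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToyRDD:
    """A read-only, partitioned dataset that remembers how it was derived."""
    num_partitions: int
    # Either `source` (stable storage) or `parent` + `fn` (lineage) is set.
    source: Optional[Callable[[int], List[int]]] = None
    parent: Optional["ToyRDD"] = None
    fn: Optional[Callable[[List[int]], List[int]]] = None

    def map(self, f):
        # Coarse-grained transformation: the same function applies to every record.
        return ToyRDD(self.num_partitions, parent=self,
                      fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(self.num_partitions, parent=self,
                      fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        # Rebuild partition i by walking the lineage back to stable storage.
        if self.parent is None:
            return self.source(i)
        return self.fn(self.parent.compute(i))


def read_stable_storage(i):
    # Hypothetical stand-in for a file block: partition i holds 4 integers.
    return list(range(4 * i, 4 * i + 4))


base = ToyRDD(num_partitions=3, source=read_stable_storage)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Pretend the in-memory copy of partition 1 was lost: only that partition is recomputed.
recovered = evens_squared.compute(1)
print(recovered)
```

Because a transformation is recorded once for the whole dataset rather than per record, the lineage stays small, which is what makes this cheaper than logging fine-grained writes as DSM would need to.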




Keep Thinking!!!

November 24, 2020

Interesting projects and their tech choices

Project #1 - Tech for SaaS built by One Person

  • Language - Python, TypeScript
  • Frameworks - Django, React, Next.js, Celery, Bootstrap 4
  • Databases - ClickHouse, PostgreSQL, Redis
  • Deployment - Terraform, Docker, Kubernetes, CircleCI

Related Good Read - Link

Project #2 - Raspberry Pi Webcam

Project #3 - Protecting Against DoS Attacks, Good Read - Someone attacked our company

Keep Thinking!!!

November 21, 2020

Kubeflow Pipelines - Learning Notes #1

To appreciate a tool, we need to understand the why, how, and what of it.


Key Notes

Why / Necessity
  1. Monitoring of Model
  2. Training / Serving - Differences in transformations and in handling missing data
  3. Frequency to refresh the model



Production System Components


Kubeflow Platform

Develop, Deploy, Manage
Pipelines, Data Management, Serving (Rest End Point) 



Pipeline Component
Commands
Setup cluster, permissions in yaml file


Demo with screenshots




Pipelines
  • Domain-Specific Language
  • Instantiate Components
  • Define Dependency between components
  • Compile and Deploy Pipeline
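
The steps above (instantiate components, declare dependencies, then run in dependency order) can be sketched in plain Python. This is not the Kubeflow Pipelines SDK: the component functions, the dependencies dict, and run_pipeline are hypothetical names, and a real pipeline would run each step in its own container instead of calling a local function.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical components: each one is a single step that reads upstream outputs.
def load_data(inputs):
    return [1.0, 2.0, 3.0, 4.0]


def preprocess(inputs):
    data = inputs["load_data"]
    mean = sum(data) / len(data)
    return [x - mean for x in data]


def train(inputs):
    # The "model" here is just the largest centered value.
    return max(inputs["preprocess"])


components = {"load_data": load_data, "preprocess": preprocess, "train": train}

# Dependencies: step name -> set of steps that must finish first.
dependencies = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "train": {"preprocess"},
}


def run_pipeline(components, dependencies):
    outputs = {}
    # static_order() yields every step after all of its dependencies.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: outputs[dep] for dep in dependencies[name]}
        outputs[name] = components[name](upstream)
    return outputs


results = run_pipeline(components, dependencies)
print(results["train"])
```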





Custom Components




There is a steep learning curve between plain ML code and Kubeflow code. How much time does it take to port to this infra? I need to experiment before I can comment. There are a lot of features, but we shouldn't end up rewriting ML code as pipeline code. 

Notes #2



Codify ML Workflows
Adopt pipeline mindset
Experiment, Reproduce, Share pipeline

Define Pipeline
  • The description on ML Workflow
  • Runs on Container
  • Execution vs Runtime decoupled
  • Components - one step of workflow
  • Component - Packaged as Docker image
  • Pod for Each Step
  • Pipeline SDK


More Reads
pipeline sdk key notes - Link1, Link2, Link3
SDK Summary pointers



KALE (Kubeflow Automated pipeLines Engine) is a project that aims at simplifying the Data Science experience of deploying Kubeflow Pipelines workflows.

An Argo workflow executor is a process that conforms to a specific interface that allows Argo to perform certain actions like monitoring pod logs, collecting artifacts, managing container lifecycles, etc.

Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports hyperparameter tuning, early stopping and neural architecture search (NAS). Learn more about AutoML at fast.ai, Google Cloud, Microsoft Azure or Amazon SageMaker

Katib is agnostic to machine learning (ML) frameworks.

Ray Train is an easy-to-use library for distributed deep learning.

Dask is a flexible library for parallel computing in Python.

ML Metadata (MLMD) is a library by Google. MLMD is an integral part of TensorFlow Extended (TFX) and can also be used as a stand-alone application.

The most important entities created and stored by MLMD are:
  • Artifacts that are generated by the pipeline steps (e.g., the trained model).
  • Metadata about the executions (e.g., the step itself).
  • Metadata about the context (e.g., the whole pipeline).

Keep Thinking!!!

November 15, 2020

Smartphone / Social Media Issues

Cheap internet, smartphones, and social media have caused a lot of issues. Their primary motive seems to have failed.

  • Targeted marketing to influence people
  • Shift from morality to "the majority opinion is right"
  • Crowd following mentality
  • Drive more consumeristic / purchase behavior
  • Focus more on lifestyle/luxury than principles/priorities
  • Low or no importance to education
  • Setting up wrong examples
  • Addiction to games/media / fake gurus
  • Commercialize everything, Take away all the time
  • In a way, they have created isolated pockets of people who are depressed and addicted to smartphones. Cheap internet has only made their lives worse. Apps are targeted at each age group and driven by their emotional and behavioral needs. 

Keep Questioning!!!

November 14, 2020

Interesting Research paper Read - Gender and Race Preferences in Hiring in the Age of Diversity Goals: Evidence from Silicon Valley Tech Firms

Paper - Link

Key Insights

  • Women are 9-10% more likely to receive a callback compared to men
  • Whereas Black, Hispanic, and Asian applicants are 8-13% less likely to receive a callback compared to White applicants

Key Notes

Studying hiring discrimination at the intersection of race and gender, giving primacy to both

How hiring discrimination, in particular, leads to occupational segregation.

Experiment #1 - Sent fictitious resumes with randomized white-sounding and black-sounding names to potential employers for different types of occupations and found consistent discrimination against African Americans across occupations (Bertrand and Mullainathan 2004).

Insights

Statistical discrimination - An employer who imperfectly observes an applicant's quality and productivity resorts to group-level averages to make inferences about the individual, which may lead to discrimination

Taste-based - Employers may have a prejudiced taste and animus towards a particular group, leading to discrimination (Becker 1971).

Discriminatory phenomenon - discrimination against women in male-dominated occupations and against men in female-dominated occupations

ML Approach


Keep Thinking!!!

Personal Datawarehouses

We need the ability to claim our personal data and trade it without exposing PII. Everything is now a paid service: Gmail, Google Photos, YouTube. End-user data is used without any benefit to the end user.

  • Reclaim your Google Data
  • Reclaim your Social Media Data
  • Reclaim your Amazon Data
  • Reclaim your Location Data

Hope there is some value for user data.

Keep Thinking!!!

November 10, 2020

Session - ODSC East Talk - Challenges of machine learning development

Code - Link

Key Notes
  • Automated and Scalable Infra for ML 


ML Automation Steps
  • Data Pipeline
  • ML Team
  • Production / Deploy / Feedback Mechanism

The key challenge is integrating all stages of development
  • Reproducibility via docker
  • Scaling via Kubernetes

Reproducibility
  • Share insights
  • Deploy Code

Automation process
  • Data pipeline
  • Feature generation

Production
  • Monitoring
  • Logging
  • Packaging

End to End workflow of Development to Production




Infra 
  • Deploy
  • Monitor
  • Train
  • Scale it on cloud

Data Pipeline process

Keep Thinking!!!

November 08, 2020

Weekend Reads - Advanced Models for Computer Vision

Key Notes

What Classifier will Miss - Human-level scene understanding

  • Parsing the scene
  • The angle of Bicycle (Pose, Relative pose)
  • Person on Bicycle
  • Closer Inspection

Tasks

  • Object Detection
  • Pose Estimation
  • Accuracy vs Efficiency of Models

CNN as Deep Learning Puzzle

Input-Output Node, Loss Computation and Backprop


Classification - sparse description of the image

Object Detection

  • Multi-task problem
  • Classification & Localisation
  • Object, Location, Bounding box
  • Dataset, Samples, List of Objects, Labels, Bbox for each object


Predict BBOX Coordinates

  • Continuous Output
  • Minimize MSE over samples
  • Regression for bbox prediction
  • The first step is classification
  • The second step is regression
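
The two-part objective above (classify the object, then regress its box) can be written as a single loss. This is a minimal sketch in plain Python, not the notes' or any library's implementation: the function names and the box_weight parameter are assumptions, and boxes are taken to be (x1, y1, x2, y2) tuples.

```python
import math


def cross_entropy(class_probs, true_class):
    # Classification term: negative log-probability of the correct class.
    return -math.log(class_probs[true_class])


def bbox_mse(pred_box, true_box):
    # Regression term: mean squared error over the four box coordinates.
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box)) / len(true_box)


def detection_loss(class_probs, true_class, pred_box, true_box, box_weight=1.0):
    # Multi-task loss: classification plus weighted box regression.
    return (cross_entropy(class_probs, true_class)
            + box_weight * bbox_mse(pred_box, true_box))


loss = detection_loss(
    class_probs=[0.5, 0.5], true_class=0,
    pred_box=(0.0, 0.0, 2.0, 2.0), true_box=(0.0, 0.0, 4.0, 4.0),
)
print(loss)
```

The box_weight knob is there because the two terms live on different scales; how to balance them is a design choice this sketch does not settle.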








Faster RCNN

  • Two-Stage Detector
  • Good Candidate BBOX
  • Refine through Regression
  • Discretize bbox space
  • Anchor points distributed
  • Candidate boxes of different scale and ratio
  • n candidates per anchor
  • Is there an object or not in the box
  • Refine through regression
  • We cannot backprop through the parameters of the bbox (Spatial Transformer Networks)
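
The "discretize bbox space" step can be sketched as follows: place anchor points on a regular grid, generate candidate boxes of several scales and aspect ratios at each point, and score candidates against a ground-truth box with intersection over union (IoU). This is an illustration only, not the Faster R-CNN reference code; make_anchors, iou, and the grid parameters are assumed names and values.

```python
def make_anchors(image_size, stride, scales, ratios):
    """Return (x1, y1, x2, y2) candidate boxes centered on a regular grid."""
    anchors = []
    for cy in range(stride // 2, image_size, stride):
        for cx in range(stride // 2, image_size, stride):
            for scale in scales:
                for ratio in ratios:  # ratio = width / height
                    w = scale * ratio ** 0.5
                    h = scale / ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


def iou(a, b):
    # Overlap area divided by the area of the union of the two boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


anchors = make_anchors(image_size=64, stride=16,
                       scales=(16, 32), ratios=(0.5, 1.0, 2.0))

ground_truth = (16.0, 16.0, 32.0, 32.0)
# The best-matching candidate is the one the regression stage would refine.
best = max(anchors, key=lambda a: iou(a, ground_truth))
print(len(anchors), best)
```

Here n = 6 candidates per anchor point (2 scales times 3 ratios); the first stage only has to answer "object or not" for each of them before the regression stage refines the best ones.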





One Stage Detector - Train end to end

  • Employ Hard negative mining

RetinaNet uses focal loss. (A loss function is just a mathematical way of saying how far off a guess is from the real value of a data point.) Focal loss puts more weight on the objects that are hard to classify and reduces the impact of easy, correct predictions.
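
A minimal sketch of binary focal loss in plain Python, using the form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name and argument layout are assumptions for illustration; gamma = 2 and alpha = 0.25 are the defaults reported in the RetinaNet paper.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p is the predicted probability of the positive class, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
print(easy, hard)
```

With gamma = 0 and alpha_t = 1 the expression reduces to ordinary cross-entropy, which is a quick way to check an implementation.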




Semantic Segmentation

  • Pooling - reduce the resolution of feature maps
  • Upsample based on the nearest neighbor approach
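
Both operations can be shown on a tiny feature map. This is a plain-Python sketch with assumed function names, using 2x2 max pooling and 2x nearest-neighbor upsampling on a list-of-lists "feature map" whose height and width are even.

```python
def max_pool_2x2(fmap):
    # Halve the resolution: keep the largest value in each 2x2 window.
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]), 2)]
            for r in range(0, len(fmap), 2)]


def upsample_nearest_2x(fmap):
    # Double the resolution: repeat every value twice along both axes.
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


feature_map = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 8],
]
pooled = max_pool_2x2(feature_map)
restored = upsample_nearest_2x(pooled)
print(pooled)
print(restored)
```

The restored map has the original resolution but only one distinct value per 2x2 block, which is why upsampled maps look coarse.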



U-Net

  • Segmenting medical images
  • Input Image -> Convolution -> ReLU -> Pooling
  • Encoder - Similar to Image Classifier
  • Upsampling through Decoder for same resolution output
  • Upsampling - blobby feature map
  • For every location, predict a distribution over classes
  • Cross Entropy (Avg Over all Locations)
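
The last two bullets describe the training loss: turn the scores at every pixel into a distribution over classes with softmax, take the cross-entropy against that pixel's label, and average over all locations. A minimal plain-Python sketch, with assumed function names and a list-of-lists layout:

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segmentation_loss(logits, labels):
    """logits[r][c] is a list of class scores; labels[r][c] is the true class index."""
    losses = [-math.log(softmax(logits[r][c])[labels[r][c]])
              for r in range(len(labels))
              for c in range(len(labels[0]))]
    return sum(losses) / len(losses)  # average over all locations


# 2x2 image, 2 classes, uninformative (all-zero) scores at every pixel.
logits = [[[0.0, 0.0], [0.0, 0.0]],
          [[0.0, 0.0], [0.0, 0.0]]]
labels = [[0, 1],
          [1, 0]]
print(segmentation_loss(logits, labels))
```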




Keep Thinking!!!