"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

August 31, 2018

Day #125 - Google Cloud Platform Notes

Containers
  • Built-in capability to isolate environments from one another
  • Software containers are built on this isolation
  • Containers start faster than VMs
  • A lightweight container runtime determines the image format
  • Kubernetes uses Docker containers
Kubernetes
  • Cluster networking and naming system built at Google scale
  • Runs anywhere from a laptop to cloud providers
  • Supports rolling updates (updates without downtime)
  • Built-in scaling / load-balancing capability
Kubernetes Engine
  • Managed Service
  • Built-in monitoring / logging capabilities
  • Kubernetes cluster nodes are Compute Engine VMs
Sample commands for web server deployment
kubectl run nginx --image=nginx:1.10.0
kubectl get services
kubectl get pods
kubectl scale deployment nginx --replicas 3

Data Platform
  • Apache Hadoop is based on the MapReduce programming model
  • Map runs in parallel and produces intermediate results
  • Reduce takes those intermediate results and consolidates them
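The map/shuffle/reduce phases above can be sketched in plain Python with the classic word-count example (the documents and counts here are illustrative, not tied to Hadoop itself):

```python
from collections import defaultdict

# Illustrative word-count example of the MapReduce model.
docs = ["cloud data cloud", "data platform"]

# Map phase: each document is processed independently (parallelizable),
# emitting intermediate (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: consolidate each group into a final count.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'cloud': 2, 'data': 2, 'platform': 1}
```

In a real Hadoop cluster the map and reduce steps run on different machines; the shuffle step is what moves intermediate results between them.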
Cloud Pub / Sub
  • For stream analytics
  • Publishers / Subscribers
  • Receiving messages need not be synchronous
  • At-least-once delivery with low latency
  • Works well for streaming data
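The decoupling described above can be illustrated with a conceptual sketch (this is not the google-cloud-pubsub API): a topic modeled as a queue, so the publisher emits messages without waiting for the subscriber to receive them.

```python
import queue
import threading

# Conceptual sketch only: a topic modeled as a FIFO queue, so publishing
# and receiving are decoupled and asynchronous.
topic = queue.Queue()
received = []

def subscriber():
    # The subscriber pulls messages whenever it is ready; the publisher
    # never blocks on it (receiving need not be synchronous).
    while True:
        msg = topic.get()
        if msg is None:  # sentinel to stop the worker
            break
        received.append(msg)

worker = threading.Thread(target=subscriber)
worker.start()

for event in ["click", "view", "purchase"]:  # publisher emits a stream
    topic.put(event)
topic.put(None)
worker.join()
print(received)  # ['click', 'view', 'purchase']
```

The real service adds durable storage, fan-out to many subscribers, and the at-least-once delivery guarantee noted above.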
Google ML Model
Structured data 
  • Classification and Regression
  • Recommendation
  • Anomaly Detection - Sensor Diagnostics, Log Metrics
  • Forecasting, Cross-sells, Upsells
For unstructured data
  • Image and video analytics (Damaged Shipments)
  • Text Analytics  - Call centre log analysis, Topic Identification
Cloud Vision API
  • Categorize images
  • Detect individual objects within images
  • Moderate content, Image Sentiment Analysis
Cloud Speech API (Dictate, Command through voice, transcribe audio files)
Cloud Natural Language API (Entity Recognition, Overall Sentiment, Multiple language Support)
Cloud Translation API
Cloud Video Intelligence API

Happy Learning!!!

August 30, 2018

Day #124 - Quora Answers Ranking (NLP + ML Analysis)

This post is to understand the Quora answer ranking system

Reposting some of the key lines from the article
Snippet #1
A supervised approach means having a training dataset that is used to extract features from and create a model. Item-wise regression means that the model will give us a numeric score for each answer that we can use to rank them.

Snippet #2
Features Rating
At Quora we define good answers to have the following five properties:
  • Answers the question that was asked.
  • Provides knowledge that is reusable by anyone interested in the question.
  • Answers that are supported with rationale.
  • Demonstrates credibility and is factually correct.
  • Is clear and easy to read.
Analysis #1
My understanding of their approach
  • Keywords would be extracted from the question to identify the features it talks about. These keywords may be used to weight the answers or relate them to answers
They may also leverage the following data from answers to rank and score them relative to other answers
  • Number of Views
  • Number of Upvotes
  • Context (Topic)
  • Number of Comments
  • Score based on words (features) used
  • Assign an overall score
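The item-wise regression idea from Snippet #1 combined with the feature list above can be sketched as follows. The feature names and weights are entirely illustrative assumptions, not Quora's actual model:

```python
# Hypothetical sketch of item-wise scoring: a linear model over per-answer
# features produces a numeric score, and answers are ranked by that score.
answers = [
    {"views": 1200, "upvotes": 40, "comments": 5},
    {"views": 300,  "upvotes": 90, "comments": 12},
]

# Assumed weights for illustration only; a supervised approach would
# learn these from a labeled training dataset instead.
weights = {"views": 0.001, "upvotes": 0.5, "comments": 0.2}

def score(answer):
    return sum(weights[f] * answer[f] for f in weights)

ranked = sorted(answers, key=score, reverse=True)
```

Here the second answer outranks the first because upvotes carry the most weight, even though it has fewer views.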
Analysis #2
  • For answers on the same topic with matching keywords, match against existing features and compare to provide a comparative ranking.
Analysis #3
  • I suppose they may do OCR as well to extract text from images.
Analysis #4
  • Translating Quora answers into another language might need substantial rework and building a corpus in the target language
They could have shared a sample code snippet along with the textual definitions

Happy Learning!!!

August 29, 2018

Day #123 - Feature Engineering Tools and Cloud ML Engine

Feature Engineering
  • Longest and most difficult phase (preprocessing)
  • Translates raw data into features based on domain knowledge
  • Create good features, including synthetic features
  • Form a reasonable hypothesis for the features that matter
  • Different problems in the same domain may need different features
Feature Creation
#Changing range to 0 to 1
#Re-scaling improves performance of gradient descent

features['price'] = (features['price'] - min(price)) / (max(price) - min(price))

#category columns
#one hot encoding technique
tf.feature_column.categorical_column_with_vocabulary_list('city', keys=['San Diego','Los Angeles','San Francisco','Sacramento'])

#preprocessing technique
#Provide range of values
features['capped_rooms']=tf.clip_by_value(features['rooms'],clip_value_min=0,clip_value_max=4)

#bucketize columns
lat = tf.feature_column.numeric_column('latitude')
dlat = tf.feature_column.bucketized_column(lat, boundaries=np.arange(32,42,1).tolist())
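The three transformations above (min-max rescaling, value capping, bucketizing) can be checked without TensorFlow using NumPy equivalents; the arrays below are made-up toy data:

```python
import numpy as np

# Toy data; values are illustrative only.
price = np.array([100.0, 250.0, 400.0])
rescaled = (price - price.min()) / (price.max() - price.min())  # min-max to [0, 1]

rooms = np.array([-1.0, 2.0, 7.0])
capped = np.clip(rooms, 0, 4)  # same effect as tf.clip_by_value

latitude = np.array([34.5, 40.2])
# np.digitize mirrors bucketized_column: each value maps to a bucket index
# determined by the boundaries.
buckets = np.digitize(latitude, bins=np.arange(32, 42, 1))
```

Seeing the outputs on small arrays like this is a quick way to confirm each feature column computes what you expect before wiring it into a model.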

BigQuery
  • Fully managed DW
  • Compute aggregates
  • Compute Stats
Apache Beam
  • For streaming data pipelines
  • Time-windowed stats
  • Operates on Google Cloud Storage data
Cloud Dataflow
  • Changes data from one format to another
  • BigQuery -> Cloud Storage for processed data
  • Python- and Java-based pipelines
Cloud ML Engine
  • For training at scale, use Cloud ML Engine
  • Train model
  • Monitor
  • Deploy as a microservice
  • Batching and distributed training
  • Host as a REST API
Tools
  • GCP Console based
  • Use all existing google tools in GCP
  • Specify region, bucket to host code
  • Walkthrough of commands and project structure to execute the project in Google ML
Observations (New Learning)
  • Copying data to google cloud
  • Format to specify training / testing data
  • Using Tensorboard
  • Hosting as a REST API
  • Federated Learning
Happy Learning!!!

August 24, 2018

Day #122 - Tensorflow Estimator API

  • Manages data distribution out of the box
  • Data Parallelism - Replicate your model on multiple workers
estimator = tf.estimator.LinearRegressor(...)
tf.estimator.train_and_evaluate(estimator,....)

Needed for running on multiple machines

#1. Estimator
#2. Run Config
#3. Training Spec
#4. Test Spec

estimator = tf.estimator.LinearRegressor(feature_columns=featcols,config=run_config)
..
tf.estimator.train_and_evaluate(estimator,train_spec,eval_spec)

#5. Checkpoints, Summary

run_config = tf.estimator.RunConfig(model_dir=output_dir,save_summary_steps=100,save_checkpoints_steps=2000)
estimator = tf.estimator.LinearRegressor(config=run_config,....)

#6. Using Data Sets

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,max_steps=5000)

#7. Eval Spec

tf.estimator.train_and_evaluate(estimator,train_spec,eval_spec)

#8. Evaluation Checkpoint

eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,steps=100,throttle_secs=600,exporters=...)

#9. Measure for Test data

tf.estimator.train_and_evaluate(estimator,train_spec,eval_spec)


Happy Learning!!!

Day #121 - Tensorflow Notes

Estimators
  • Provides boilerplate code
  • tf.estimator - high-level API for production-ready models
  • The Python API treats TensorFlow as a numeric processing library
Estimator API
  • Quick Model
  • Checkpointing
  • Distributed Training
  • Train / Eval / Monitor
  • Out-of-memory (larger than memory) datasets
  • Hyper parameter tuning
Base class
  • tf.estimator.Estimator
  • Linear (LinearRegressor)
  • Dense Neural Networks (DNNRegressor, DNNLinearCombinedRegressor)
  • LinearClassifier, DNNClassifier, DNNLinearCombinedClassifier


Happy Learning!!!

August 23, 2018

Day #120 - Tensorflow

Tensor (N-dimensional array of data)
  • High-performance library for numerical computation
  • Computations represented as directed graphs
  • DAG (Directed Acyclic Graph)
  • Edges - arrays of data (tensors)
DAG
  • Language-independent representation
  • Similar to the JVM
  • The TensorFlow engine is written in C++
  • TensorFlow Lite - on-device inference of ML models
API Hierarchy
  • Number of abstraction layers
  • High Level API -> tf.estimator
  • tf.layers, tf.losses, tf.metrics -> Custom NN Models
  • Core Tensorflow Python
  • Core Tensorflow C++
  • CPU / GPU / TPU / Android
Execution
  • Code
  • Define tensors
  • Create the DAG
  • Run the DAG in a session
  • Lazy evaluation model (minimizes context switches)
Graph and Session
  • Explicit edges represent dependencies
  • Helps partition the graph and run pieces in parallel
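The lazy-evaluation model above can be illustrated framework-free: building the graph records operations and their dependency edges but computes nothing; values only appear when the graph is run (the TF 1.x session step). The `Node`/`run` names are made up for this sketch:

```python
# Minimal illustration of lazy evaluation: a DAG of deferred operations.
class Node:
    def __init__(self, fn, inputs=()):
        self.fn = fn
        self.inputs = inputs  # explicit edges: dependencies between ops

def constant(value):
    return Node(lambda: value)

def add(a, b):
    return Node(lambda x, y: x + y, (a, b))

def run(node):
    # Evaluate dependencies first, then the node itself (a DAG traversal,
    # playing the role of session.run).
    args = [run(dep) for dep in node.inputs]
    return node.fn(*args)

graph = add(constant(3), add(constant(4), constant(5)))  # nothing computed yet
result = run(graph)  # → 12
```

Because the dependencies are explicit, a runtime like TensorFlow can partition independent subgraphs and evaluate them in parallel on different devices.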

 

Happy Learning!!!