"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 31, 2016

Day #39 - Useful Tool MyMediaLite for Recommendations

This post is based on learnings from the assignments: link1, link2

The input is a user-item ratings file, as listed below


Sample Execution Command


We supply the value 20 in user20.txt to request recommendations for user 20. The recommender type is specified via the --recommender parameter.
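A rough sketch of how such a call might look with MyMediaLite's item_recommendation tool (this command is an assumption reconstructed from memory, not the original one from the post; every flag except --recommender, as well as the file names, should be checked against the MyMediaLite documentation):

item_recommendation --training-file=ratings.txt --test-users=user20.txt --recommender=MostPopular --prediction-file=user20_recommendations.txt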

Happy Learning!!!

October 30, 2016

Day #38 - Python Matrix Operations Learnings

#Python Exercises
#Exercise #1
#Save it to File test1.txt
#Input
#1 0 1 0 1
#0 0 0 1 1
#1 1 1 1 0
#1 0 0 0 1
#0 0 0 0 1
#Output
#1 1
#1 3
#1 5
#2 4
#2 5
#3 1
#3 2
#3 3
#3 4
#4 1
#4 4
#5 5
import pandas as pd
import numpy as np
# Read the 0/1 matrix and write out the (row, column) index of every 1 entry
file = open('test1.txt', 'r')
filewrite = open('test1_output.txt', 'w')
a = 0
for line in file:
    a = a + 1
    values = line.strip().split('\t')   # strip the newline so the last value compares cleanly
    b = 1
    for value in values:
        if value == '1':
            filewrite.write(str(a) + '\t' + str(b) + '\n')
        b = b + 1
file.close()
filewrite.close()
#Exercise 2
#Parse test1.txt and do matrix manipulation
import pandas as pd
import numpy as np
data = pd.read_csv('test1.txt',header=None, sep = "\t")
data_matrix = np.matrix(data)
print(data_matrix.shape)
print(data_matrix.shape[0])
print(data_matrix.shape[1])
#Exercise 3
#Compute row and column sums
import pandas as pd
import numpy as np
data = pd.read_csv('test1.txt', header=None, sep="\t")
data_matrix = np.matrix(data)
print(data_matrix.shape)
print(data_matrix.shape[0])
print(data_matrix.shape[1])
#Sum of each row
for i in range(0, data_matrix.shape[0]):
    row_sum = 0
    for j in range(0, data_matrix.shape[1]):
        row_sum = row_sum + data_matrix[i, j]
    print('row -', i + 1, '-sum ', row_sum)
#Sum of each column
for i in range(0, data_matrix.shape[1]):
    col_sum = 0
    for j in range(0, data_matrix.shape[0]):
        col_sum = col_sum + data_matrix[j, i]
    print('column-', i + 1, '-sum ', col_sum)
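As a side note (not part of the original gist), the same row and column sums can be computed directly with numpy, without explicit loops:

import numpy as np
import pandas as pd
data_matrix = np.asarray(pd.read_csv('test1.txt', header=None, sep="\t"))
print(data_matrix.sum(axis=1))  # sum of each row
print(data_matrix.sum(axis=0))  # sum of each column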
Happy Learning!!!

October 12, 2016

Day #37 - Numpy Learnings - Matrices

import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1.1,2.2,3.1],'B':[5.2,6.1,7.1],'C':[5.4,6.6,7.4]})
print(df)
#Tip #1
#Create matrix of float values
data_intermediate = df.astype(float)
data_matrix = np.matrix(data_intermediate)
print('matrix')
print(data_matrix)
#Tip #2 Compute Transpose
print('Transpose matrix')
print(np.transpose(data_matrix))
#Tip #3 - Matrix Inverse
print('Inverse matrix')
print(data_matrix.I)
#Tip #4 - Identity Matrix
print('Identity Matrix')
print(np.identity(3))
print('Identity Matrix X Data ')
print(np.identity(3)*data_matrix)
#Tip #5 - Eigen Values
print('Eigen Values')
print(np.linalg.eigvals(data_matrix))
#Tip #6 - Eigen Vectors
w, v= np.linalg.eig(data_matrix)
print('Eigen Vector')
print(v)
#Tip #7 - svd
print('SVD')
print(np.linalg.svd(data_matrix))
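A small add-on (not in the original gist): np.linalg.svd returns U, the singular values s, and V^T, and multiplying the factors back together reconstructs the original matrix, which makes a quick sanity check:

U, s, Vt = np.linalg.svd(np.asarray(data_matrix))
reconstructed = np.dot(U, np.dot(np.diag(s), Vt))
print(np.allclose(reconstructed, np.asarray(data_matrix)))  # expect True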


Happy Learning!!!

October 10, 2016

Day #36 - Pandas Dataframe Learnings

import numpy as np
import pandas as pd
#Tip #1
#Create a data frame from a dictionary of lists
df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8],'C':[5,6,7,8]})
print(df)
#Tip #2
#Standardize value in columns
df["A"] = (df["A"]-df["A"].mean())/np.std(df["A"])
#Tip #3
#Dynamically standardize all columns except the last
for col in df.columns[:-1]:
    df[col] = (df[col] - df[col].mean()) / (np.std(df[col]))
print(df)
features = list(df.columns[:-1])
#Tip #4 - Replace NA values
df = df.fillna(-9999)
#Tip #5
#Dynamically add columns
for i in range(0, 2):
    colname1 = str(5 + i)
    col1 = i
    col2 = i + 1
    print('colname', colname1)
    print('col1', col1)
    print('col2', col2)
    if col2 < 2:
        df[colname1] = df[features[col1]] * df[features[col2]]
print('newly added columns')
print(df)
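A side note (not from the original gist): the per-column loop in Tip #3 can also be written as one vectorized expression on a fresh frame; std(ddof=0) is used to match np.std, which defaults to the population standard deviation:

cols = df.columns[:-1]
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)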
Happy Learning!!!

Day #35 - Bias Vs Variance


These terms come up frequently when discussing a model's performance on the training and testing data sets.

Classification error = Bias + Variance (+ irreducible noise)

Bias (Under-fitting)
  • Bias is high if the concept class cannot model the true data  distribution well, and does not depend on training set size.
  • High Bias will lead to under-fitting
How to identify High Bias
  • Training Error will be high
  • Cross Validation error also will be high (Both will be nearly the same)
Variance(Over-fitting)
  • High Variance will lead to over-fitting
How to identify High Variance
  • Training Error will be low
  • Cross Validation error will be much higher than the training error
How to Fix?
Variance decreases with more training data and increases with more complicated classifiers; the sketch below illustrates both diagnosis patterns.
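A minimal sketch (not from the original post, assuming scikit-learn is available) of how these patterns show up in practice: a deliberately shallow decision tree under-fits (both errors high and close to each other), while an unrestricted tree over-fits (training error near zero, cross-validation error much higher):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification data
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)

for depth in (1, None):  # max_depth=1 -> high bias, unrestricted depth -> high variance
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_error = 1 - clf.fit(X, y).score(X, y)
    cv_error = 1 - cross_val_score(clf, X, y, cv=5).mean()
    print('max_depth =', depth, '| training error =', round(train_error, 3),
          '| cv error =', round(cv_error, 3))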

Happy Learning!!!

October 08, 2016

Day #34 - What is the difference between Logistic Regression and Naive Bayes?

Both are probabilistic models.
Logistic Regression
  • Discriminative (the entire approach is purely discriminative)
  • Models P(Y|X) directly
  • The output lies between 0 and 1
  • Formula: exp(w0 + w1x) / (exp(w0 + w1x) + 1)
  • Equivalently: 1 / (1 + exp(-(w0 + w1x)))
Binary Logistic Regression - 2 classes
Multinomial Logistic Regression - more than 2 classes

Example - Link

import math
def sigmoid(x):
    a = []
    for item in x:
        a.append(1 / (1 + math.exp(-item)))
    return a
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-70, 70,1)
sig = sigmoid(x)
plt.plot(x,sig)
plt.title('Sigmoid Weight 1')
plt.show()
x = np.arange(-70, 70,5)
sig = sigmoid(x)
plt.title('Sigmoid Weight 5')
plt.plot(x,sig)
plt.show()
x = np.arange(-70,70,100)
sig = sigmoid(x)
plt.title('Sigmoid Weight 100')
plt.plot(x,sig)
plt.show()
#3 class classifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
iris = datasets.load_iris()
#only two features taken
X = iris.data[:,:2]
Y = iris.target
#step size in mesh
h = 0.02
logreg = linear_model.LogisticRegression(C=1e5)
#create instance of classifier and fit data
logreg.fit(X,Y)
#plot decision boundary and assign color for it
x_min, x_max = X[:,0].min()-0.5,X[:,0].max()+0.5
y_min, y_max = X[:,1].min()-0.5,X[:,1].max()+0.5
xx,yy = np.meshgrid(np.arange(x_min,x_max,h),np.arange(y_min,y_max,h))
Z = logreg.predict(np.c_[xx.ravel(),yy.ravel()])
#put the result to color plot
Z = Z.reshape(xx.shape)
plt.figure(1,figsize=(4,3))
plt.pcolormesh(xx,yy,Z,cmap=plt.cm.Paired)
#plot also training points
plt.scatter(X[:,0],X[:,1],c=Y,edgecolors='k',cmap=plt.cm.Paired)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.xticks(())
plt.yticks(())
plt.show()



Link - Ref
Logistic Regression
  • Classification model
  • Models the probability of success as a sigmoid function of a linear combination of features
  • y belongs to {0, 1} - a 2-class problem
  • p(yi = 1) = 1 / (1 + e^-(w1x1 + w2x2))
  • Linear combination of features - w1x1 + w2x2
  • w can be found with maximum likelihood estimation (a small worked example follows below)
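As a small worked example (not from the original post; the weights and feature values below are made up for illustration), evaluating this formula for one point:

import math
w1, w2 = 0.8, -0.5        # hypothetical weights
x1, x2 = 2.0, 1.0         # hypothetical feature values
z = w1 * x1 + w2 * x2     # linear combination = 1.1
p = 1.0 / (1.0 + math.exp(-z))
print(round(p, 3))        # ~0.75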
Naive Bayes
  • Generative model
  • Models P(X|Y); the Naive Bayes assumption is that the features are conditionally independent given Y
  • Learns a distribution for each class (a small comparison sketch follows below)
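A minimal sketch (not from the original post, assuming scikit-learn is available) fitting both models to the same two iris features used in the code above, just to show the discriminative and generative approaches side by side:

from sklearn import datasets, linear_model
from sklearn.naive_bayes import GaussianNB
iris = datasets.load_iris()
X, Y = iris.data[:, :2], iris.target
logreg = linear_model.LogisticRegression(C=1e5).fit(X, Y)   # discriminative: models P(Y|X)
nb = GaussianNB().fit(X, Y)                                  # generative: models P(X|Y) per class
print('Logistic Regression accuracy:', logreg.score(X, Y))
print('Gaussian Naive Bayes accuracy:', nb.score(X, Y))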
Happy Learning

October 04, 2016

Day #33 - Pandas Deep Dive

import pandas as pd
#Create Data Frame with few columns and list values
df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8]})
print(df)
#Drop a Column
df = df.drop('A',1)
print(df)
#Create column with int column names
df = pd.DataFrame({1:[1,2,3,4],2:[1,2,3,4],10:[1,2,3,4]})
print(df)
df = df.drop(1,1)
print(df)
#Sample two rows from data frame
print(df.sample(n=2))
#Create a list with values 1 to 55
a = []
for i in range(1, 56):
    a.append(i)
import random
#Sample 15 random entries
feature2 = random.sample(a, 15)
print('feature2')
print(feature2)
#Find the most frequently occurring value in each row
print(df.mode(axis=1))
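A tiny illustration (not from the original post) of what df.mode(axis=1) returns - the most frequent value in each row:

import pandas as pd
tiny = pd.DataFrame({'x': [1, 2, 3], 'y': [1, 5, 3], 'z': [2, 5, 4]})
print(tiny.mode(axis=1))
# row 0 -> 1 (appears twice), row 1 -> 5, row 2 -> 3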
Happy Learning!!!

October 02, 2016

Good Data Science Course Links


AI Lectures

Introduction to Machine Learning

Happy Learning!!!

Short Analytics Concept Videos



  • Descriptive Analytics (analysis of existing data, trends and patterns)
  • Diagnostic Analytics (reasons / patterns behind events)
  • Predictive Analytics (what the future will look like)
  • Prescriptive Analytics (how to prepare for / handle the future)

Great Compilation, Keep Learning!!!

October 01, 2016

Day #32 - Regularization in Machine Learning


Large coefficients tend to cause overfitting; regularization penalizes large coefficients to avoid this.
  • L1 - sum of absolute values of the coefficients (Lasso - Least Absolute Shrinkage and Selection Operator). The L1 constraint region has corners on the coordinate axes, so the solution often lands where some coefficients are exactly zero. This results in variable elimination: features that contribute minimally are dropped.
  • L2 - sum of squares of the coefficients (Ridge). The L2 constraint region is circular, so it shrinks all coefficients towards zero but eliminates none. (A short scikit-learn sketch at the end of this post illustrates the difference.)
  • Discriminative - in SVM we use a hyperplane to separate the classes; this is an example of a discriminative approach.
  • Probabilistic - assumes the data is generated by a Gaussian distribution. This is motivated by the Central Limit Theorem: the sum of many independent effects tends towards a normal distribution, so a Gaussian model is applied.
  • Max Likelihood - choose the parameters that maximize the probability of the observed points under the assumed distribution.
Good Read for L2 - Indeed, using the L2 loss comes from the assumption that the data is drawn from a Gaussian distribution

Another Read -

  • The L1 loss function minimizes the absolute differences between the estimated values and the existing target values; it is more robust and generally less affected by outliers.
  • The L2 loss function minimizes the squared differences between the estimated and existing target values, so the error becomes much larger in the presence of outliers.
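A minimal sketch (not from the original post, assuming scikit-learn is available) showing the practical difference between the L1 (Lasso) and L2 (Ridge) penalties discussed above: with the same regularization strength, Lasso drives some coefficients exactly to zero while Ridge only shrinks them:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Only the first two features matter; the rest are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print('Lasso coefficients:', np.round(lasso.coef_, 3))  # several exact zeros expected
print('Ridge coefficients:', np.round(ridge.coef_, 3))  # small but non-zero values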

Happy Learning!!!