"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 31, 2016

Day #39 - Useful Tool MyMediaLite for Recommendations

This post is based on learnings from the assignments: link1, link2

The input is a user-item ratings file, as listed below


Sample Execution Command


We supply the value 20 in user20.txt to request recommendations for user 20. The recommender type is specified via the --recommender parameter.
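A rough sketch of how such a call might look with MyMediaLite's item_recommendation tool (this command is an assumption reconstructed from memory, not the original one from the post; every flag except --recommender, as well as the file names, should be checked against the MyMediaLite documentation):

item_recommendation --training-file=ratings.txt --test-users=user20.txt --recommender=MostPopular --prediction-file=user20_recommendations.txt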

Happy Learning!!!

October 30, 2016

Day #38 - Python Matrix Operations Learnings

#Python Exercises
#Exercise #1
#Save it to File test1.txt
#Input
#1 0 1 0 1
#0 0 0 1 1
#1 1 1 1 0
#1 0 0 0 1
#0 0 0 0 1
#Output
#1 1
#1 3
#1 5
#2 4
#2 5
#3 1
#3 2
#3 3
#3 4
#4 1
#4 4
#5 5
import pandas as pd
import numpy as np
# Read the 0/1 matrix and write out the (row, column) index of every 1 entry
file = open('test1.txt', 'r')
filewrite = open('test1_output.txt', 'w')
a = 0
for line in file:
    a = a + 1
    values = line.strip().split('\t')   # strip the newline so the last value compares cleanly
    b = 1
    for value in values:
        if value == '1':
            filewrite.write(str(a) + '\t' + str(b) + '\n')
        b = b + 1
file.close()
filewrite.close()
#Exercise 2
#Parse test1.txt and do matrix manipulation
import pandas as pd
import numpy as np
data = pd.read_csv('test1.txt',header=None, sep = "\t")
data_matrix = np.matrix(data)
print(data_matrix.shape)
print(data_matrix.shape[0])
print(data_matrix.shape[1])
#Exercise 3
#Compute row and column sums
import pandas as pd
import numpy as np
data = pd.read_csv('test1.txt', header=None, sep="\t")
data_matrix = np.matrix(data)
print(data_matrix.shape)
print(data_matrix.shape[0])
print(data_matrix.shape[1])
#Sum of each row
for i in range(0, data_matrix.shape[0]):
    row_sum = 0
    for j in range(0, data_matrix.shape[1]):
        row_sum = row_sum + data_matrix[i, j]
    print('row -', i + 1, '-sum ', row_sum)
#Sum of each column
for i in range(0, data_matrix.shape[1]):
    col_sum = 0
    for j in range(0, data_matrix.shape[0]):
        col_sum = col_sum + data_matrix[j, i]
    print('column-', i + 1, '-sum ', col_sum)
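As a side note (not part of the original gist), the same row and column sums can be computed directly with numpy, without explicit loops:

import numpy as np
import pandas as pd
data_matrix = np.asarray(pd.read_csv('test1.txt', header=None, sep="\t"))
print(data_matrix.sum(axis=1))  # sum of each row
print(data_matrix.sum(axis=0))  # sum of each column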
Happy Learning!!!

October 12, 2016

Day #37 - Numpy Learnings - Matrices

import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1.1,2.2,3.1],'B':[5.2,6.1,7.1],'C':[5.4,6.6,7.4]})
print(df)
#Tip #1
#Create matrix of float values
data_intermediate = df.astype(float)
data_matrix = np.matrix(data_intermediate)
print('matrix')
print(data_matrix)
#Tip #2 Compute Transpose
print('Transpose matrix')
print(np.transpose(data_matrix))
#Tip #3 - Matrix Inverse
print('Inverse matrix')
print(data_matrix.I)
#Tip #4 - Identity Matrix
print('Identity Matrix')
print(np.identity(3))
print('Identity Matrix X Data ')
print(np.identity(3)*data_matrix)
#Tip #5 - Eigen Values
print('Eigen Values')
print(np.linalg.eigvals(data_matrix))
#Tip #6 - Eigen Vectors
w, v= np.linalg.eig(data_matrix)
print('Eigen Vector')
print(v)
#Tip #7 - svd
print('SVD')
print(np.linalg.svd(data_matrix))
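A small add-on (not in the original gist): np.linalg.svd returns U, the singular values s, and V^T, and multiplying the factors back together reconstructs the original matrix, which makes a quick sanity check:

U, s, Vt = np.linalg.svd(np.asarray(data_matrix))
reconstructed = np.dot(U, np.dot(np.diag(s), Vt))
print(np.allclose(reconstructed, np.asarray(data_matrix)))  # expect True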


Happy Learning!!!

October 10, 2016

Day #36 - Pandas Dataframe Learnings

import numpy as np
import pandas as pd
#Tip #1
#Create a data frame from a dictionary of lists
df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8],'C':[5,6,7,8]})
print(df)
#Tip #2
#Standardize value in columns
df["A"] = (df["A"]-df["A"].mean())/np.std(df["A"])
#Tip #3
#Dynamically standardize all columns except the last
for col in df.columns[:-1]:
    df[col] = (df[col] - df[col].mean()) / (np.std(df[col]))
print(df)
features = list(df.columns[:-1])
#Tip #4 - Replace NA values
df = df.fillna(-9999)
#Tip #5
#Dynamically add columns
for i in range(0, 2):
    colname1 = str(5 + i)
    col1 = i
    col2 = i + 1
    print('colname', colname1)
    print('col1', col1)
    print('col2', col2)
    if col2 < 2:
        df[colname1] = df[features[col1]] * df[features[col2]]
print('newly added columns')
print(df)
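A side note (not from the original gist): the per-column loop in Tip #3 can also be written as one vectorized expression on a fresh frame; std(ddof=0) is used to match np.std, which defaults to the population standard deviation:

cols = df.columns[:-1]
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)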
Happy Learning!!!

Day #35 - Bias Vs Variance


These terms come up frequently when discussing a model's performance on the training and testing data sets.

Classification error = Bias + Variance (+ irreducible noise)

Bias (Under-fitting)
  • Bias is high if the concept class cannot model the true data  distribution well, and does not depend on training set size.
  • High Bias will lead to under-fitting
How to identify High Bias
  • Training Error will be high
  • Cross Validation error also will be high (Both will be nearly the same)
Variance(Over-fitting)
  • High Variance will lead to over-fitting
How to identify High Variance
  • Training Error will be low
  • Cross Validation error will be much higher than the training error
How to Fix?
Variance decreases with more training data and increases with more complicated classifiers; the sketch below illustrates both diagnosis patterns.
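A minimal sketch (not from the original post, assuming scikit-learn is available) of how these patterns show up in practice: a deliberately shallow decision tree under-fits (both errors high and close to each other), while an unrestricted tree over-fits (training error near zero, cross-validation error much higher):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification data
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)

for depth in (1, None):  # max_depth=1 -> high bias, unrestricted depth -> high variance
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_error = 1 - clf.fit(X, y).score(X, y)
    cv_error = 1 - cross_val_score(clf, X, y, cv=5).mean()
    print('max_depth =', depth, '| training error =', round(train_error, 3),
          '| cv error =', round(cv_error, 3))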

Happy Learning!!!

October 08, 2016

Day #34 - What is the difference between Logistic Regression and Naive Bayes?

Both are probabilistic models.
Logistic Regression
  • Discriminative (the entire approach is purely discriminative)
  • Models P(Y|X) directly
  • The output lies between 0 and 1
  • Formula: exp(w0 + w1x) / (exp(w0 + w1x) + 1)
  • Equivalently: 1 / (1 + exp(-(w0 + w1x)))
Binary Logistic Regression - 2 classes
Multinomial Logistic Regression - more than 2 classes

Example - Link

import math
def sigmoid(x):
    a = []
    for item in x:
        a.append(1 / (1 + math.exp(-item)))
    return a
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-70, 70,1)
sig = sigmoid(x)
plt.plot(x,sig)
plt.title('Sigmoid Weight 1')
plt.show()
x = np.arange(-70, 70,5)
sig = sigmoid(x)
plt.title('Sigmoid Weight 5')
plt.plot(x,sig)
plt.show()
x = np.arange(-70,70,100)
sig = sigmoid(x)
plt.title('Sigmoid Weight 100')
plt.plot(x,sig)
plt.show()
#3 class classifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
iris = datasets.load_iris()
#only two features taken
X = iris.data[:,:2]
Y = iris.target
#step size in mesh
h = 0.02
logreg = linear_model.LogisticRegression(C=1e5)
#create instance of classifier and fit data
logreg.fit(X,Y)
#plot decision boundary and assign color for it
x_min, x_max = X[:,0].min()-0.5,X[:,0].max()+0.5
y_min, y_max = X[:,1].min()-0.5,X[:,1].max()+0.5
xx,yy = np.meshgrid(np.arange(x_min,x_max,h),np.arange(y_min,y_max,h))
Z = logreg.predict(np.c_[xx.ravel(),yy.ravel()])
#put the result to color plot
Z = Z.reshape(xx.shape)
plt.figure(1,figsize=(4,3))
plt.pcolormesh(xx,yy,Z,cmap=plt.cm.Paired)
#plot also training points
plt.scatter(X[:,0],X[:,1],c=Y,edgecolors='k',cmap=plt.cm.Paired)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.xticks(())
plt.yticks(())
plt.show()



Link - Ref
Logistic Regression
  • Classification model
  • Models the probability of success as a sigmoid function of a linear combination of features
  • y belongs to {0, 1} - a 2-class problem
  • p(yi = 1) = 1 / (1 + e^-(w1x1 + w2x2))
  • Linear combination of features - w1x1 + w2x2
  • w can be found with maximum likelihood estimation (a small worked example follows below)
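As a small worked example (not from the original post; the weights and feature values below are made up for illustration), evaluating this formula for one point:

import math
w1, w2 = 0.8, -0.5        # hypothetical weights
x1, x2 = 2.0, 1.0         # hypothetical feature values
z = w1 * x1 + w2 * x2     # linear combination = 1.1
p = 1.0 / (1.0 + math.exp(-z))
print(round(p, 3))        # ~0.75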
Naive Bayes
  • Generative model
  • Models P(X|Y); the Naive Bayes assumption is that the features are conditionally independent given Y
  • Learns a distribution for each class (a small comparison sketch follows below)
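A minimal sketch (not from the original post, assuming scikit-learn is available) fitting both models to the same two iris features used in the code above, just to show the discriminative and generative approaches side by side:

from sklearn import datasets, linear_model
from sklearn.naive_bayes import GaussianNB
iris = datasets.load_iris()
X, Y = iris.data[:, :2], iris.target
logreg = linear_model.LogisticRegression(C=1e5).fit(X, Y)   # discriminative: models P(Y|X)
nb = GaussianNB().fit(X, Y)                                  # generative: models P(X|Y) per class
print('Logistic Regression accuracy:', logreg.score(X, Y))
print('Gaussian Naive Bayes accuracy:', nb.score(X, Y))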
Happy Learning

October 04, 2016

Day #33 - Pandas Deep Dive

import pandas as pd
#Create Data Frame with few columns and list values
df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8]})
print(df)
#Drop a Column
df = df.drop('A',1)
print(df)
#Create column with int column names
df = pd.DataFrame({1:[1,2,3,4],2:[1,2,3,4],10:[1,2,3,4]})
print(df)
df = df.drop(1,1)
print(df)
#Sample two rows from data frame
print(df.sample(n=2))
#Create a list with values 1 to 55
a = []
for i in range(1, 56):
    a.append(i)
import random
#Sample 15 random entries
feature2 = random.sample(a, 15)
print('feature2')
print(feature2)
#Find the most frequently occurring value in each row
print(df.mode(axis=1))
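A tiny illustration (not from the original post) of what df.mode(axis=1) returns - the most frequent value in each row:

import pandas as pd
tiny = pd.DataFrame({'x': [1, 2, 3], 'y': [1, 5, 3], 'z': [2, 5, 4]})
print(tiny.mode(axis=1))
# row 0 -> 1 (appears twice), row 1 -> 5, row 2 -> 3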
Happy Learning!!!

October 02, 2016

Good Data Science Course Links


AI Lectures

Introduction to Machine Learning

Happy Learning!!!

Short Analytics Concept Videos



  • Descriptive Analytics (analysis of existing data, trends and patterns)
  • Diagnostic Analytics (reasons / patterns behind events)
  • Predictive Analytics (what the future will look like)
  • Prescriptive Analytics (how to prepare for / handle the future)

Great Compilation, Keep Learning!!!

October 01, 2016

Day #32 - Regularization in Machine Learning


Large coefficients tend to cause overfitting; regularization penalizes large coefficients to avoid this.
  • L1 - sum of absolute values of the coefficients (Lasso - Least Absolute Shrinkage and Selection Operator). The L1 constraint region has corners on the coordinate axes, so the solution often lands where some coefficients are exactly zero. This results in variable elimination: features that contribute minimally are dropped.
  • L2 - sum of squares of the coefficients (Ridge). The L2 constraint region is circular, so it shrinks all coefficients towards zero but eliminates none. (A short scikit-learn sketch at the end of this post illustrates the difference.)
  • Discriminative - in SVM we use a hyperplane to separate the classes; this is an example of a discriminative approach.
  • Probabilistic - assumes the data is generated by a Gaussian distribution. This is motivated by the Central Limit Theorem: the sum of many independent effects tends towards a normal distribution, so a Gaussian model is applied.
  • Max Likelihood - choose the parameters that maximize the probability of the observed points under the assumed distribution.
Good Read for L2 - Indeed, using the L2 loss comes from the assumption that the data is drawn from a Gaussian distribution

Another Read -

  • The L1 loss function minimizes the absolute differences between the estimated values and the existing target values; it is more robust and generally less affected by outliers.
  • The L2 loss function minimizes the squared differences between the estimated and existing target values, so the error becomes much larger in the presence of outliers.
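A minimal sketch (not from the original post, assuming scikit-learn is available) showing the practical difference between the L1 (Lasso) and L2 (Ridge) penalties discussed above: with the same regularization strength, Lasso drives some coefficients exactly to zero while Ridge only shrinks them:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Only the first two features matter; the rest are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print('Lasso coefficients:', np.round(lasso.coef_, 3))  # several exact zeros expected
print('Ridge coefficients:', np.round(ridge.coef_, 3))  # small but non-zero values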

Happy Learning!!!