"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 31, 2017

Day #94 - Integrate R and Python

#To invoke R code from Python there are two options
#Option #1 - Run the external R file using the subprocess module
#Script.R contains:
#x <- 4
#print(x)
import subprocess
subprocess.check_call(['Rscript', 'E:\\RNotes\\PetProject\\RModel\\Script.R'], shell=False)
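#If the value printed by the R script is needed back in Python, the same call can be
#made with subprocess.check_output - a minimal sketch, reusing the illustrative path above
output = subprocess.check_output(['Rscript', 'E:\\RNotes\\PetProject\\RModel\\Script.R'])
print(output.decode())   #for the script above this prints "[1] 4"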
#Option #2 - Use the rpy2 package to source the script from an embedded R session
import rpy2.robjects as robjects
r_source = robjects.r['source']
r_source("E:\\RNotes\\PetProject\\RModel\\Script.R")
Happy Learning!!!

December 08, 2017

Day #93 - Regularizations

Four methods of regularization for mean (target) encodings
  • Cross Validation inside training data
    • 4 to 5 folds of K-Fold validation
    • Split the data into K non-intersecting subsets
    • Leave-one-out is the extreme version of the scheme
    • Target variable leakage is still present in the K-Fold scheme
  • Smoothing based on the size of the category (a short sketch follows the code below)
    • Big categories have lots of data points, so their mean can be trusted
    • Formula: encoding = (mean(target) * nrows + globalmean * alpha) / (nrows + alpha)
    • alpha = the category size we are willing to trust
  • Add Random Noise (a short sketch follows the code below)
    • Unstable and hard to make work
    • Too much noise spoils the feature
    • Usually combined with LOO (leave-one-out) regularization
  • Sorting the data and calculating an expanding mean
    • Fix the sorting order of the data
    • Use rows 0 to n-1 to calculate the mean for row n
    • Introduces the least leakage
#Cross Validation inside training data
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_tr = df_tr['target'].values                 #target variable
train_new = df_tr.copy()                      #frame that will hold the encoded features
for col in cols:                              #pre-create the encoded columns
    train_new[col + '_mean_target'] = np.nan

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
#iterate over the folds in chunks of train/validation indices
for tr_ind, val_ind in skf.split(df_tr, y_tr):
    x_tr, x_val = df_tr.iloc[tr_ind], df_tr.iloc[val_ind]
    #for each column we want to encode, map the encodings estimated on the
    #training folds onto the validation fold
    for col in cols:
        means = x_val[col].map(x_tr.groupby(col)['target'].mean())
        train_new.loc[x_val.index, col + '_mean_target'] = means

#global mean
prior = df_tr['target'].mean()
#fill NaNs (categories unseen in a training fold) with the global mean
train_new.fillna(prior, inplace=True)

#Expanding Mean - rows 0..n-1 provide the estimate for row n
for col in cols:
    cumsum = df_tr.groupby(col)['target'].cumsum() - df_tr['target']
    cumcnt = df_tr.groupby(col).cumcount()
    #the first occurrence of a category yields 0/0 = NaN and can again be filled with the prior
    train_new[col + '_mean_target'] = cumsum / cumcnt
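
The smoothing formula listed above can be applied directly on top of the per-category statistics. A rough sketch, reusing df_tr, cols and train_new from the code above; alpha here is an assumed, hand-tuned smoothing strength, not a value fixed anywhere above:

#Smoothing based on size of category:
#encoding = (mean(target)*nrows + globalmean*alpha) / (nrows + alpha)
alpha = 100                                   #assumed smoothing strength, needs tuning
globalmean = df_tr['target'].mean()
for col in cols:
    agg = df_tr.groupby(col)['target'].agg(['mean', 'count'])
    smoothed = (agg['mean'] * agg['count'] + globalmean * alpha) / (agg['count'] + alpha)
    train_new[col + '_mean_target'] = df_tr[col].map(smoothed)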
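
For the add-random-noise option, one common realization (an assumption here, the notes above do not spell it out) is to add small Gaussian noise on top of whichever encoding was computed; noise_level is a made-up hyperparameter that has to be tuned:

#Add Random Noise on top of an existing mean-target encoding
import numpy as np
noise_level = 0.01                            #assumed value - unstable and hard to tune, as noted above
for col in cols:
    noise = np.random.normal(0, noise_level, size=len(train_new))
    train_new[col + '_mean_target'] = train_new[col + '_mean_target'] + noise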
Happy Learning!!!