"No one is harder on a talented person than the person themselves" - Linda Wilkinson

August 16, 2014

Hbase Primer - Loading Data in HBASE Using SQLOOP / Java Code

This post is on examples using SQOOP / custom java code to import data into HBASE, HIVE, HDFS from MSSQL DB

Tried the steps on Cloudera VM - cloudera-quickstart-vm-4.4.0-1-vmware

From Linux terminal > hbase shell
create 'Employee_Details', {NAME => 'Details'}

Example #1 (Import to Hbase from MSSQL DB Table)
sqoop import --connect  "jdbc:sqlserver://;database=TestDB" --username "sa" --password "sa" --driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" -m 1 --hbase-table "Employee_Details"  --column-family "Details" --table "dbo.employee"  --split-by "id"

Example #2 (List Tables in MSSQL DB Table)
sqoop list-tables --connect  "jdbc:sqlserver://;database=TestDB" --username "sa" --password "sa" --driver "com.microsoft.sqlserver.jdbc.SQLServerDriver"

Example #3 (Import to HDFS from MSSQL DB Table)
sqoop import --connect "jdbc:sqlserver://;database=TestDB" --username "sa" --password "sa" --driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" --table "dbo.employee"  --split-by "id"

hadoop fs -ls (List files in hadoop file system)

Example #4 (Import into Hive Table from MSSQL DB Table)
sqoop import --hive-import --create-hive-table --hive-table Employee_Hive --connect  "jdbc:sqlserver://;database=TestDB" --username "sa" --password "sa" --driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" -m 1 --table "dbo.employee" 

Example #5 - Custom Java Code
  • Add all required Jars need to be added to compile the project. This was one of challenges to get this code working 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;

public class HBaseTest
   public static void main(String[] args) throws IOException
          HBaseTest HT = new HBaseTest();

   public void DropTable()
                Configuration config = HBaseConfiguration.create();
                config.set("hbase.zookeeper.quorum", "localhost");
                config.set("hbase.zookeeper.property.clientPort", "2181");
                HBaseAdmin admin = new HBaseAdmin(config);
          catch(Exception Ex)

   public void CreateTable()
                Configuration config = HBaseConfiguration.create();
                config.set("hbase.zookeeper.quorum", "localhost");
                config.set("hbase.zookeeper.property.clientPort", "2181");
                HTableDescriptor ht = new HTableDescriptor("Employee_Details");
                ht.addFamily( new HColumnDescriptor("Id"));
                ht.addFamily( new HColumnDescriptor("Details"));
                HBaseAdmin hba = new HBaseAdmin(config);
                hba.createTable( ht );
          catch(Exception Ex)

   public void UpdateRecords()
          Configuration config = HBaseConfiguration.create();
                config.set("hbase.zookeeper.quorum", "localhost");
                config.set("hbase.zookeeper.property.clientPort", "2181");
                HTable table = new HTable(config, "Employee_Details");
             Put put = new Put(Bytes.toBytes("row1"));
          catch(Exception Ex)

This post was useful to try it out SQOOP Examples

Happy Learning!!!

August 09, 2014

HBase Primer - Querying for Seeks / Scans - Part I

This post is HBase Primer on Seeks / Scans. I recommend spending time on HBase Schema design to have a basic understanding of HBASE table structure. The motivation slide for this post is from Cloudera Session slides post.

I wanted to try out the suggested pattern of tables. This can be compared to TSQL Equivalent of DDL, DML Scripts. Querying, Select with filters is the key learning from this exercise. Key Learning's are select with filter examples. There are no aggregate functions available in Hbase. Hive and Phoenix which sits on top of Hbase Serves for this purpose of aggregations.

HBASE (One Liner)- Low Latency, Consistent, best suited for random read/write big data access

Few bookmarks provide great short intro - link1

Hbase Queries (Tried it on cloudera-quickstart-vm-4.4.0-1-vmware)
hbase shell

Disable and Drop Table
Disable 'Employee'
Drop 'Employee'

Create Table
Details is a column family. Inside this column family there are members Name, Location, DOB and Salary

create 'Employee', {NAME => 'Details'}

Insert Records
put 'Employee', 'row1', 'Details:Name', 'Raj'
put 'Employee', 'row1', 'Details:Location', 'Chennai'
put 'Employee', 'row1', 'Details:DOB', '01011990'
put 'Employee', 'row1', 'Details:Salary', '1990'

put 'Employee', 'row2', 'Details:Name', 'Raja'
put 'Employee', 'row2', 'Details:Location', 'Delhi'
put 'Employee', 'row2', 'Details:DOB', '01011991'
put 'Employee', 'row2', 'Details:Salary', '5000'

put 'Employee', 'row3', 'Details:Name', 'Kumar'
put 'Employee', 'row3', 'Details:Location', 'Mumbai'
put 'Employee', 'row3', 'Details:DOB', '01011992'
put 'Employee', 'row3', 'Details:Salary', '50000'

Select based on Column Qualifiers
scan 'Employee', {COLUMNS => ['Details:Name', 'Details:Location']}
scan 'Employee', {COLUMNS => ['Details:Location']}

Single Filter - Search by Location Column Filter
scan 'Employee' ,{ FILTER => "SingleColumnValueFilter('Details','Location',=, 'binary:Chennai')" }

ValueFilter - Search by Location Value
scan 'Employee' , { COLUMNS => 'Details:Location', FILTER => "ValueFilter (=, 'binary:Chennai')"}

Multiple Filter - Search by Name and Location
scan 'Employee' ,{ FILTER => "SingleColumnValueFilter('Details','Location',=, 'binary:Chennai') AND SingleColumnValueFilter('Details','Name',=, 'binary:Raj')"}

Timestamp and Filter - Search with Timestamp and Filter
scan 'Employee' ,{TIMERANGE=>[1407324224184,1407324224391], COLUMNS => 'Details:Location', FILTER => "ValueFilter (=, 'binary:Chennai')"}

Timestamp and Filter - Search with multiple Filter on same column (contains)
scan 'Employee' ,{ FILTER => "SingleColumnValueFilter('Details','Location',=, 'binary:Chennai') OR SingleColumnValueFilter('Details','Location',=, 'binary:Delhi')"}

Filter using regExMatch
scan 'Employee' ,{ FILTER => "SingleColumnValueFilter('Details','Location',=, 'regexstring:Che*',true,true)" }

Search using Prefix
scan 'Employee' ,{ FILTER => "SingleColumnValueFilter('Details','Location',=, 'binary:Chennai') AND PrefixFilter ('row1')"}

Return only one record
scan 'Employee', {COLUMNS => ['Details:Location'], LIMIT => 1, STARTROW => 'Chennai'}

Return two records
scan 'Employee', {COLUMNS => ['Details:Location'], LIMIT => 2, STARTROW => 'Chennai'}

Return records between range of rowkeys
scan 'Employee', {COLUMNS => ['Details:Location'], STARTROW => 'row1',STOPROW=>'row2'}

Get Specific Row with Filter
get 'Employee', 'row1', {FILTER => "ValueFilter (=, 'binary:Chennai')"}

Count records
count 'Employee'

Timestamp Range - Return within the timestamp range
scan 'Employee' ,{TIMERANGE=>[1407324224391,1407324234707]}

Hbase Table 3 Column Family
create 'Employee1', {NAME => 'Name'}, {NAME => 'Location'}, {NAME => 'DOB'}

Query with column family
scan 'Employee1', {COLUMNS => ['Name']}

Delete a Column Family
alter ‘Employee', ‘delete’ => ‘Location'

July 21, 2014

Machine Learning Notes

This post is on my learning's from Machine Learning Session conducted by my colleague Gopi. It was really a good introduction and a lot of motivation towards learning the topic.

Concepts Discussed
  • Homogeneity - Is my data homogeneous
  • Pick the odd one out (Anomaly detection)
  • Entropy Computation
Wide variety of examples to find odd sets, variations. Example from below set identify the anomaly one

The last row involving zero is a odd one. Identifying them using entropy computation was very useful

Entropy Formula

Formula detailed notes from link

For row (1,1,1,2)
 = -[((3/4)*log2(3/4)) + ((1/4)*log2(1/4))]
 = -[-0.311 -0.5]
 = .811
 For row (1,2,2,1)
 = -[((2/4)*log2(2/4)) + ((2/4)*log2(2/4))]
 = -[-.5-.5]
 = 1
 For row (1,2,1,1)
 = -[((3/4)*log2(3/4)) + ((1/4)*log2(1/4))]
 = -[-0.311 -0.5]
 = .811 
 For row (1,0,1,2)
 = -[((2/4)*log2(2/4)) + ((1/4)*log2(1/4)) + ((1/4)*log2(1/4))]
 = -[-0.5 -0.311 -0.311]
 = 1.12

By excluding the row with higher values we will have homogeneous data set, The one last row with high entropy is the anomaly  
If Data set is homogeneous after removing a particular record set then that particular record set is the anomaly one

More Concepts Introduced
  • Conditional Probability
  • ID3 Algorithm
  • Measure Entropy
  • Decision Tree
  • Random Forest
  • Bagging Technique
Happy Learning!!! 

July 17, 2014

Big data Functional Testing Tools

This post is based on my learning notes on functional test tools for Big Data ecosystem. Earlier posts, we have read the basics of big data ecosystem components. Sharing the first version of Test Tools Analysis for Functional Testing

Product / Area
Testing Tools
Test Approach
Programming Language
MQTT Testing Utility, Tsung

storm test

Query HIVE (Similar to TSQL)

Map Reduce Jobs


Lift charts, Target shuffling, Bootstrap sampling to test the consistency of the model

Junit, Mockito, Apache MRUnit

Jmeter Plugins for Hadoop, HBASE, Cassandra

Happy Learning!!!

July 06, 2014

Weekend Reading - Webinar - Performance Testing Approach for Big Data Applications.

Very good session – Webinar - Performance Testing Approach for Big Data Applications. Few interesting notes / slides from session

  • Rate of Data Ingestion - How fast system consumes data?
  • Data Processing - Speed how data is processed. Testing Data processing in isolation with data sets populated. Run specific perf tests (MR Jobs, Pig, Hive Scripts)
  • Data Persistence – I/O bound process. (Data Writes / Updates on DB, Garbage Collection, Monitoring Metrics)
  • Complete end to end time for processing (Network Connectivity, Processing, Results)

Big Data Test Challenges
  • Diverse Technologies
  • Unavailability of Test Tools for Big Data Technologies / Scenarios
  • Limited Monitoring / Diagnostic Solutions
  • Test Scripting / Environment
Perf Test Tools
  • Use cloud to simulate large infrastructure
  • Cloud orchestration scripts Puppet, Chef

  • Depending on usage in production identify patterns for production workload
  • Fault Tolerance Scenarios
  • Hadoop monitoring tools to check Map reduce jobs
  • Selecting Test Clients - Custom code
  • Performance / Failover tests to ensure scalability (Node failures during processing)
Test Parameters and Summary

Very Nice, Practical and useful webinar. There are a lot of posts / webinars. This one is very useful and practical.

More Reads
Evaluating SolrMeter for Performance Testing
Benchmarking with HTTPerf.js and NodeUnit

Happy Learning!!!

Weekend Reads - API Testing (Web Service Testing, Inspect Http request, Rest API Testing) - Free tools

Very good presentation on compiled list of free tools for API Testing. More Details - Free API debugging and testing tools you should know about

Tools List from the presentation and SO reads
While reading through the list again went back to check on SOAP, Rest Basics. API testing / Web Service testing we will be looking into only aspects Rest, SOAP based web services. Summary based on StackOverflow readings, posts. References - StackOverflow reference answers link. More details Pls check reference link. (Consolidated Answer and Detailed short summary listed below)

REST is over HTTP. REST has no WSDL interface definition
SOAP can be over any transportprotocols such HTTP, FTP, STMP, JMS etc.
REST stands for Representational State Transfer. REST approach uses the standard GET, PUT, POST, and DELETE verbs
Simple Object Access Protocol (SOAP) 
SOAP builds an XML protocol on top of HTTP  / TCP/IP.
REST is good for getting a blob of data that you don't have to work with
SOAP describes functions, and types of data. If you want to get an object, SOAP is way quicker and easier to implement
Typically uses normal HTTP methods instead of a big XML format describing everything
Has several protocols and technologies relating to it: WSDL, XSDs, SOAP, WS-Addressing
REST plays well with AJAX'y web pages. If you keep your requests simple, you can make service calls directly from your JavaScript, and that comes in very handy.
SOAP is useful from a tooling perspective because the WSDL is so easily consumed by tools. So, you can get Web Service clients generated for you in your favourite language.

More Reads
From AWS Blog - 80% REST / 20% SOAP usage pattern

Happy Learning!!!

July 05, 2014

APIs Good Read

I'm learning Big Data basics, checking real life architectures to understand the technology, implementation. Interesting slide on API's. I am sharing the same.

Happy Reading!!!

June 26, 2014

Tablediff - Quickly compare column data and schema between two Database Tables on Different Machines

Today Learning's 
  • Tablediff gets installed with replication features
  • Prepare Tablediff for multiple tables
  • Run it from a batch file and log results in text file
  • One Click Results Displayed :). Tested on SQL 2012
Example Batch File 1

cd "C:\Program Files\Microsoft SQL Server\110\COM"
tablediff.exe -sourceserver SourceDBServerIP -sourcedatabase TestDB -sourcetable Table1 -destinationserver DestDBServerIP  -destinationdatabase TestDB -destinationtable Table1 -c -dt DiffsTable1
tablediff.exe -sourceserver SourceDBServerIP -sourcedatabase TestDB -sourcetable Table2 -destinationserver DestDBServerIP  -destinationdatabase TestDB -destinationtable Table2 -c -dt DiffsTable2

  • Option –c - Perform column level comparisons
  • Option –dt – Output different records to particular table (drop and create if exists)
  • By checking the log and the diff tables created we can find error code and description of differences
  • By adding -strict option it expects the same schema (column order). I have not used it in above example
  • Add -sourceuser,  -sourcepassword , -destinationuser, -destinationpassword parameters in case of sql credentials. If not specified windows credentials will be used
Batch File 2 - Capture Log Results

C:\\TableDiff.bat > C:\\TableDiffResults.txt 2>&1

One Click RunTD.bat and check logs for results. SO link

Happy Learning!!!

June 19, 2014

Two interesting bugs - Web service changes, DB column position changes

Two interesting bugs and learning's from them

Bug#1 - Due to install / upgrade paths it resulted in different ordinal position of same tables. Due to usage of select * this resulted in wrong update on columns

Automation Solution - With SQL command (link) we can figure out ordinal position differences between two same tables. Querying two environments for each tables, dumping the results in a XML. XML comparison  between them would highlight tables with different cardinal positions for two same tables.

Bug #2 - When web services are changed, the optional parameters introduced, signatures changed often fails to support backward compatibility 

Automation Solution - SOA model is open source tool for wsdl comparison. This can be used to validate wsdl differences between versions. Example code from SO

More Reads

Happy Bugging & Learning!!!

Good QA Reads

Good QA Reads
More Reads
Happy Reading!!!

June 18, 2014

TSQL Learning's @ Work - Date Conversion yyyymmdd to mm/dd/yyyy, xp_cmdshell Tips

Tip#1 - Converting YYYYMMDD into MM/DD/YYYY Format


Tip #2 - Coverting HHMMSS into HH:MM:SS

SELECT STUFF(STUFF('211532', 3, 0, ':'), 6, 0, ':')

Tip #3 - xp_cmdshell does not work with temp tables / table variables. Use Global Temp Tables ## (Stackoverflow :) link)

Tip #4 - Enable xp_cmdshell in SQL Server - SO Link

Tip #5 - Testing Dynamic SQL Code with EXEC command - SO Link

Happy TSQL Revising!!!

June 15, 2014

Interesting Learning Notes

Note #1 - What is difference between performance and scalability problem ?
  • Performance problem - Fixing performance issue for a website example - Pageload time is very high for homepage of website
  • Scalability problem - Scaling website to support 10X user base than current user base
This post was very useful for above answer Performance v Scalability – For Employers

Note #2 - Read / Write Advantages / Disadvantages for Normalized / Denomarmalized Databases ?
  • Normalized - Insert in multiple tables, Highly consistent
  • Denormalized - Easy Insert, Consistency issue with updates (Multiple  versions of Records may exist)
I loved reading this post again and again - Data storages and read vs write controversy. Post is Very Simple, intuitive and clear.

  • Tip #1 - Changing window size during execution set_window_size
  • Tip #2 - Screenshots comparison using needle (Python based)

June 10, 2014

Interesting TSQL Question

Check the below TSQL Example involving NULL. What is the result for below three queries

Tested on SQL 2012 Setup

Create TABLE Test1
(Id           int,
 Score int not null)

Create TABLE Test2
(Id           int,
 Score2       int not null)

 insert into Test1(Id, Score) VALUES (NULL,10),(10,100),(20,200)

 insert into Test2(Id, Score2) VALUES (NULL,11),(10,110),(20,220)

Case #1 
 Select * from  Test1 JOIN Test2
 ON Test1.Id = Test2.Id

Case #2 
 Select * from  Test1 LEFT OUTER JOIN Test2
 ON Test1.Id = Test2.Id

Case #3
 Select * from  Test1 LEFT OUTER JOIN Test2
 ON Test1.Id = Test2.Id

Results are

Happy Learning!!!

June 03, 2014

SQL Server Database Restore Notes

While installing SQL 2012 on Windows Server 2008 R2
  • Enable windows 3.5 from Windows Add / Remove Features
  • While trying to restore SQL 2012 backup on SQL 2008, Error 'media family on device is incorrectly formed' for version mismatch is misleading
SQL Version to check query

SERVERPROPERTY('ProductVersion') AS ProductVersion,
SERVERPROPERTY('ProductLevel') AS ProductLevel,
SERVERPROPERTY('Edition') AS Edition,
SERVERPROPERTY('EngineEdition') AS EngineEdition;

Useful Notes
Quick Script encompassing the steps from above notes

Step 1 - Stop SQL Server

Step 2 - Remove DB Folders (MDF, LDF Files)
rmdir /s /q "C:\Databases\VOL\Data\"
rmdir /s /q "C:\Databases\VOL\Log\"

Step 3 - Recreate Same file path for Restore
mkdir  "C:\Databases\VOL\Data\"
mkdir  "C:\Databases\VOL\Log\"

Step 4 - Start SQL Server

Step 5 - Run DB Restore script
Db Restore Command

Tweak it as you need :)

Happy Learning!!!