Mapper
- Mapper runs the business logic (e.g. word counting); see the word-count mapper sketch after this list
- The mapper extracts (maps) what you need from each record
- The record reader supplies input to the mapper as (key, value) pairs
- Map-side join (via distributed caching); see the distributed-cache sketch after this list
- Output of the mapper is a list of (key, value) pairs; it is written as intermediate files on the mapper node's local disk, not to HDFS
- The framework does the splitting based on the input format; the default is text format, where records are delimited by newlines
- Every row/record goes through the map function
- When a row is split across two 64MB blocks, the record reader reads across the block boundary so the row is merged into a complete record and processed once
- Default block size in Hadoop 2.0 is 128MB
- Reducers pull (poll) the map outputs; the job tracker tells them which nodes to poll (the shuffle phase)
- Default number of reducers is 1; this is configurable (see the driver sketch after this list)
- Multiple reduce phases in a single job are not possible; instead, chain multiple levels of MR jobs
- Reduce-side join (join at the reducer level)
- Combiner: a "mini reducer" that runs on the map output before it is written to disk, e.g. finding the max value locally (see the combiner sketch after this list)
- A combiner is used when the map side itself can do some pre-aggregation to minimize the reducers' workload
- HashPartitioner is the default partitioner
- Flow: Mapper -> Combiner -> Partitioner -> (shuffle/sort) -> Reducer (e.g. multi-dimensional aggregation: max sales by product for 2012, max sales by location for 2013)
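A minimal sketch of the word-count mapper mentioned above, using the standard Hadoop 2.x mapreduce API: the record reader feeds each line as a (byte offset, line) pair and the map function emits (word, 1) pairs.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The record reader hands each line to map() as (byte offset, line text).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Business logic: emit (word, 1) for every token in the line.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}
```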
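A sketch of the map-side join via distributed caching: the small lookup table is shipped to every map task and loaded into memory in setup(). The file name lookup.txt, the CSV layout, and the join key position are assumptions for illustration, not fixed by the framework.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small lookup table is shipped to every map task via the
// distributed cache, so the join happens inside map() and no reducer is needed.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "lookup.txt" is the symlink name given in the driver; the path and
        // the key,value layout of the cached file are assumptions.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                lookup.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");   // assumed CSV input, join key first
        String joined = lookup.get(fields[0]);
        if (joined != null) {
            context.write(new Text(value.toString() + "," + joined), NullWritable.get());
        }
    }
}

// In the driver: job.addCacheFile(new URI("/data/lookup.txt#lookup.txt"));
```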
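The combiner bullet above describes a "mini reducer" that finds a max value before map output is written to disk. A sketch: because max is associative and commutative, the same class can serve as both combiner and reducer.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is just a Reducer run on the map side: here it pre-computes the
// local maximum per key, so far less data is shuffled to the real reducers.
public class MaxValueReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}
```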
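A driver sketch tying the pieces together: setting the combiner, the (already default) HashPartitioner, and the number of reducers. SalesMapper is a hypothetical mapper, assumed to emit (year, sale amount) pairs; the rest are standard Hadoop 2.x API calls.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class MaxSalesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max sales");
        job.setJarByClass(MaxSalesDriver.class);

        job.setMapperClass(SalesMapper.class);          // hypothetical mapper emitting (year, sale)
        job.setCombinerClass(MaxValueReducer.class);    // mini reducer: local max before spill
        job.setReducerClass(MaxValueReducer.class);     // global max per key

        // HashPartitioner is already the default; set it explicitly only for clarity.
        job.setPartitionerClass(HashPartitioner.class);

        // Default is 1 reducer; increase it for more parallelism.
        job.setNumReduceTasks(4);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```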