What is MapReduce
and How it Works
MapReduce is a parallel programming
technique made popular by Google. It is used for processing very large
amounts of data. Such processing can be completed in a reasonable amount of
time only by distributing the work across many machines in parallel, with each
machine processing a small subset of the data.
MapReduce is a programming model that
lets developers focus on writing the code that processes their data, without
having to worry about the details of parallel execution.
The MapReduce concept is fairly simple to
understand. MapReduce requires modeling the data to be processed as key/value pairs. The developer codes
a map function and a reduce function.
First Task - Map: The map job takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
Second Task - Reduce: The reduce job
takes the output from a map as its input and combines those data tuples into a
smaller set of tuples.
As the name MapReduce implies, the reduce job is always performed
after the map job.
Refer to the following diagram (Figure 1),
which explains Map and Reduce.
Now,
let's see in detail how exactly map() and reduce() work.
The MapReduce runtime calls the map
function for each key/value pair. The map function takes a key/value pair as
input and produces another key/value pair as output.
The MapReduce runtime then sorts and groups
the output from the map functions by key. It calls the reduce function once for
each intermediate key produced by the map functions, passing it the key and the
list of values associated with that key. The output from the reduce function is
a key/value pair, where the value is generally an aggregate computed by
processing the list of values passed in for that key. The combined output from
all the reduce calls is the required result.
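To make the flow concrete, here is a minimal sketch of that map, then sort/group, then reduce sequence in plain Python. This is only an illustration of the model, not the actual Google or Hadoop runtime; the names run_mapreduce, map_fn, and reduce_fn are invented for this sketch.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: call map_fn for every input key/value pair and
    # collect all the intermediate key/value pairs it emits.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: sort and group the intermediate pairs by key,
    # as the MapReduce runtime does between the map and reduce jobs.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: one call per distinct key, with the list of
    # values that were emitted for that key.
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        results.append(reduce_fn(key, values))
    return results
```

For example, counting occurrences of letters: `run_mapreduce([(1, "a"), (2, "b"), (3, "a")], lambda k, v: [(v, 1)], lambda k, vs: (k, sum(vs)))` returns `[("a", 2), ("b", 1)]`.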
As an example, let us say you have a
large number of files that record fruits sold to different customers
at various fruit marts across the city. You need to find out how many fruits
were sold, both in total and per fruit.
Assume each line in a file is one fruit-sold
entry, and that we process the files line by line.
The map and reduce pseudo code would look
like this:
map(key, value) {
    // key = line number, value = one line from the file
    fruit = parse fruit name from value
    emit(fruit, 1)
}

reduce(key, list_of_values) {
    // key = fruit name
    // list_of_values = {1, 1, 1, 1, ...}
    count = 0
    for each value in list_of_values
        count = count + value
    emit(key, count)
}
The map function is called for each line
in each file. The fruit name is parsed out of each relevant line and output with
a value of 1. The MapReduce runtime sorts and groups the map output by fruit
name. The reduce function is then called once for each fruit name and aggregates
the values for that fruit, which gives the required result. See Figure 2 for
understanding the algorithm.
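The fruit-counting pseudocode above can be simulated end to end in a few lines of Python. This is a self-contained sketch, not real MapReduce framework code; the sample lines and the parsing rule (fruit name is the first word of each line) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample input: one fruit-sold entry per line,
# with the fruit name as the first word.
lines = [
    "apple sold to customer 1",
    "mango sold to customer 2",
    "apple sold to customer 3",
]

def map_fn(line_no, line):
    # Parse the fruit name out of the line and emit (fruit, 1).
    fruit = line.split()[0]
    return [(fruit, 1)]

def reduce_fn(fruit, counts):
    # Aggregate the list of 1s into a total for this fruit.
    return (fruit, sum(counts))

# Simulate the runtime: map every line, group the output by key,
# then call reduce once per distinct key.
grouped = defaultdict(list)
for line_no, line in enumerate(lines):
    for key, value in map_fn(line_no, line):
        grouped[key].append(value)

totals = dict(reduce_fn(k, vs) for k, vs in sorted(grouped.items()))
print(totals)  # {'apple': 2, 'mango': 1}
```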
How MapReduce relates to multiple machines
(a cluster)
MapReduce jobs are generally executed on
a cluster of machines. Each machine executes tasks, where a task is either a map
task or a reduce task, and each task processes a subset of the data.
In the above example, let us say we start
with a set of large input files. The MapReduce runtime breaks the input data
into partitions called splits (or shards). Each split is processed by a
map task on some machine. The output of each map task is sorted and partitioned
by key, and the outputs from all the map tasks are merged to create the
partitions that are input to the reduce tasks.
There can be multiple machines, each
running a reduce task. Each reduce task gets one partition to process. A
partition can contain multiple keys, but all the data for a given key is in
exactly one partition. In other words, each key is processed by exactly one
reduce task.
Note that the
number of machines, the number of map tasks, the number of reduce tasks, and
several other things are configurable.
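A common way to guarantee that all the data for a key lands in a single partition is hash partitioning. The partitioning scheme is configurable in real frameworks; this one-liner is just a sketch of the usual default idea, with an invented function name:

```python
def partition_for(key, num_reduce_tasks):
    # Hash partitioning: every occurrence of the same key maps to the
    # same partition number, so exactly one reduce task sees that key.
    return hash(key) % num_reduce_tasks
```

Because the partition depends only on the key, map tasks running on different machines all send records for, say, "apple" to the same reduce task.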
See Figure 3 to understand the concept of
multiple machines/clusters.
Uses of MapReduce
MapReduce is useful for problems that
require processing large data sets where the algorithm can be broken into
map and reduce functions. The MapReduce runtime takes care of distributing the
processing across multiple machines and aggregating the results.


