MapReduce is a programming model used for processing and generating big data sets with a parallel, distributed algorithm on a cluster. It is a technique for distributed computing based on Java, and it is used to access big data stored in the Hadoop File System (HDFS) . The MapReduce algorithm contains two important tasks: Map and Reduce. The Map function takes input from the disk as key-value pairs, processes them, and produces another set of intermediate key-value pairs as output. The Reduce function takes inputs as key-value pairs and produces key-value pairs as output. The MapReduce program is composed of a map procedure, which performs filtering, and a set of reducers that can perform the reduction phase. The MapReduce system designates Reduce processors, assigns the key each processor should work on, and provides that processor with all the Map-generated data associated with that key. Between the map and reduce stages, the data are shuffled (parallel-sorted/exchanged between nodes) to move the data from the map node that produced them to the shard in which they will be reduced. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes.

