How does Hadoop work?

Apache Hadoop is a framework for storing and processing huge amounts of unstructured data, ranging from terabytes to petabytes. Its file system is highly available and fault-tolerant. The platform stores massive amounts of data in a distributed manner in HDFS. Hadoop MapReduce is the processing unit of Hadoop, and it processes the data in parallel. Hadoop YARN is another component of the framework; it manages resources among the applications running in a cluster and schedules their tasks.

Hadoop does not rely on hardware for high availability; instead, it detects and handles failures in the software layer itself. Hadoop has also given birth to countless innovations in the big data space. Apache Spark, one of the most talked-about technologies in this space, was born out of the Hadoop ecosystem.

How does Hadoop work?

Hadoop distributes the processing of huge data sets across a cluster of commodity servers, working on multiple servers simultaneously. To process any data, the client submits the data and the program to Hadoop. In the Hadoop ecosystem, HDFS takes care of data storage, MapReduce takes care of data processing, and YARN takes care of dividing the tasks and managing resources.

Are you new to the concept of Hadoop? Then check out our post on What is Hadoop?

How does HDFS work in Hadoop?


HDFS is a distributed file system that runs on a master-slave architecture. This component has two daemons, namely the NameNode and the DataNode.

Name Node:

The NameNode is a daemon that runs on the master machine. It is the centerpiece of the HDFS file system. The NameNode stores the directory tree of all files in the file system. The NameNode comes into the picture whenever a client wants to add, copy, move, or delete a file. Whenever a client makes a request, the NameNode returns the list of DataNode servers where the actual data resides.
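The NameNode's role described above can be sketched as a small lookup service: it holds only metadata (which blocks make up a file, and which DataNodes hold each block), never the data itself. All names below are illustrative, not Hadoop's real API.

```python
# Toy model of the NameNode: it maps files to blocks and blocks to
# DataNode replicas, and answers client lookups with DataNode addresses.

class ToyNameNode:
    def __init__(self):
        self.file_blocks = {}       # path -> list of block ids
        self.block_locations = {}   # block id -> DataNodes holding a replica

    def add_file(self, path, blocks):
        """Register a file and where each of its blocks lives."""
        self.file_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, datanodes in blocks:
            self.block_locations[block_id] = list(datanodes)

    def locate(self, path):
        """Answer a client's read request with DataNode addresses.
        The client then reads the blocks from those DataNodes directly."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

nn = ToyNameNode()
nn.add_file("/logs/2024.txt", [
    ("blk_1", ["dn1", "dn2", "dn3"]),   # replication factor 3
    ("blk_2", ["dn2", "dn4", "dn5"]),
])
print(nn.locate("/logs/2024.txt"))
```

Note that `locate` returns addresses, not data: this is why the NameNode never becomes a bottleneck for the data itself.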

Data Node:

This daemon runs on the slave nodes, where it stores the data of the Hadoop file system. In a functional file system, the data replicates across many DataNodes. On startup, a DataNode connects to the NameNode and then keeps listening for requests to access the data. Once the NameNode provides the location of the data, client applications can interact with the DataNode directly. During data replication, DataNode instances can also talk to each other.
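One consequence of this design is failure detection: DataNodes report to the NameNode periodically, and blocks whose replicas lived on a silent node must be re-replicated. The sketch below models that idea; the timeout and all names are invented for illustration (Hadoop's real heartbeat interval and timeout are configurable and much longer).

```python
# Sketch: detect DataNodes that stopped heartbeating, then find blocks
# that lost a replica on a dead node and therefore need a new copy.

HEARTBEAT_TIMEOUT = 10  # seconds; illustrative value only

def find_dead_nodes(last_heartbeat, now):
    """Nodes whose last heartbeat is older than the timeout."""
    return {dn for dn, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT}

def blocks_to_rereplicate(block_locations, dead):
    """Blocks with at least one replica on a dead node."""
    return {blk for blk, nodes in block_locations.items() if set(nodes) & dead}

last_hb = {"dn1": 100, "dn2": 95, "dn3": 82}   # dn3 stopped reporting
blocks = {"blk_1": ["dn1", "dn3"], "blk_2": ["dn1", "dn2"]}

dead = find_dead_nodes(last_hb, now=100)
print(dead)                                 # {'dn3'}
print(blocks_to_rereplicate(blocks, dead))  # {'blk_1'}
```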

Replica Placement:

Replica placement determines HDFS performance and reliability. Large HDFS instances run on clusters of computers spread across many racks, and communication between nodes on different racks has to go through switches. A rack awareness algorithm determines the rack ID of each DataNode. In a simple policy, replicas are placed on unique racks, which prevents data loss when an entire rack fails. During data retrieval, this also lets HDFS utilize the bandwidth of multiple racks.
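The rack-aware idea can be sketched for the common case of three replicas: the first replica on the writer's own node, and the remaining two on nodes in a different rack. This is a simplified model of the policy, with invented node and rack names; the real HDFS policy also balances load and handles many edge cases.

```python
# Simplified rack-aware placement for 3 replicas: first replica on the
# writer's node, second and third on two nodes in a different rack, so
# no single rack failure loses all copies.

def place_replicas(writer, racks):
    """racks: dict mapping rack_id -> list of node names."""
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    # Pick any rack other than the writer's for the remaining replicas.
    other_rack = next(r for r in racks if r != writer_rack)
    second, third = racks[other_rack][0], racks[other_rack][1]
    return [writer, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))  # ['n1', 'n3', 'n4']
```

Because the replicas span two racks, a read can pull from whichever rack has spare bandwidth, which is the retrieval benefit mentioned above.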

Map Reduce:

The MapReduce algorithm processes the data in parallel on a distributed cluster, and the intermediate results are subsequently combined into the desired output. MapReduce consists of several stages:

  • In the first step, the program locates and reads the input file containing the raw data.
  • Since the file format is arbitrary, the data needs to be converted into something the framework can process. The InputFormat and RecordReader do this job.
  • The InputFormat uses the InputSplit function to split the file into smaller pieces.
  • Then the RecordReader transforms the raw data into a form the map function can process, outputting a list of key-value pairs.
  • Once the mapper has processed these key-value pairs, the result goes to the reducer, and the framework is notified when the mapper finishes its task.
  • In the next step, the reduce function performs its task on each group of key-value pairs from the mapper.
  • Finally, the OutputFormat organizes the key-value pairs from the reducer for writing to HDFS.
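The stages above can be imitated in-process with the classic word-count example: split the input, map each piece to key-value pairs, shuffle the pairs by key, then reduce each key's values. This is a teaching sketch, not the Hadoop API.

```python
from collections import defaultdict

def input_split(text, pieces=2):
    """Crude InputSplit: cut the input into groups of lines."""
    lines = text.splitlines()
    size = len(lines) // pieces or 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def mapper(lines):
    """Map each line to (word, 1) key-value pairs."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(groups):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

text = "big data\nbig cluster\ndata data"
pairs = [kv for split in input_split(text) for kv in mapper(split)]
print(reducer(shuffle(pairs)))   # {'big': 2, 'data': 3, 'cluster': 1}
```

In real Hadoop, each `mapper` call would run on a different node near its split of the data, and the shuffle would move pairs across the network to the reducers.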

Do you want to see the practical working of MapReduce? If yes, visit Hadoop Online Training.


YARN:

YARN splits the responsibilities of job scheduling/monitoring and resource management into separate daemons. There is one global ResourceManager and a per-application ApplicationMaster, where an application can be a single job or a DAG of jobs. The ResourceManager has two components: a Scheduler and an ApplicationsManager. The Scheduler is a pure scheduler in that it does not track the status of applications, and it takes no responsibility for restarting an application after application or hardware failure. The Scheduler allocates resources based on the abstract notion of a container, which is nothing but a fraction of resources such as CPU, memory, disk, and network.
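A container, as described above, is just a bundle of resources. The toy scheduler below hands containers out while capacity remains and refuses requests that exceed it; class names and numbers are invented for illustration and do not reflect YARN's real interfaces.

```python
# Toy model of a YARN-style scheduler: track free cluster capacity and
# grant containers (memory + vcores) while capacity allows.

class ToyScheduler:
    def __init__(self, total_mem, total_vcores):
        self.free_mem = total_mem
        self.free_vcores = total_vcores
        self.containers = []

    def allocate(self, app_id, mem, vcores):
        """Grant a container if capacity allows, else refuse.
        Like YARN's Scheduler, this tracks resources only, not
        application status."""
        if mem > self.free_mem or vcores > self.free_vcores:
            return None
        self.free_mem -= mem
        self.free_vcores -= vcores
        container = {"app": app_id, "mem": mem, "vcores": vcores}
        self.containers.append(container)
        return container

sched = ToyScheduler(total_mem=8192, total_vcores=4)
print(sched.allocate("app_1", 2048, 1))   # granted
print(sched.allocate("app_2", 16384, 2))  # None: over capacity
```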

The ApplicationsManager does the following tasks:

  1. Accepts job submissions from clients.
  2. Negotiates the first container for executing the application-specific ApplicationMaster.
  3. Restarts the ApplicationMaster container on failure.

On the other side, the ApplicationMaster does the following tasks:

  1. Negotiates containers from the Scheduler.
  2. Tracks container status and monitors progress.

YARN supports the concept of resource reservation via a ReservationSystem. Users can reserve resources for the execution of a particular job over time, with temporal constraints, and the ReservationSystem ensures that the resources are available to the job until its completion. YARN can also scale beyond thousands of nodes via YARN Federation, which allows wiring multiple sub-clusters into a single massive cluster. Many independent clusters can thus be used together for a single large job, making very large-scale systems achievable.

What is Hadoop used for?

Hadoop has become a standard distributed framework for processing large amounts of structured and semi-structured data. It is not a good fit for small data sets, but with large amounts of data it suits best in the following cases:

  • This platform suits a variety of big data applications that gather data from different sources in different formats. Hadoop is very flexible in storing various data types, irrespective of the format the data arrives in, and can join data across those formats.
  • It suits large-scale enterprises that would otherwise need clusters of specialized servers, where data management and programming skills are scarce and implementation is a costly affair; Hadoop runs on commodity hardware instead.

What can Hadoop do?

Hadoop can fit into multiple roles depending on the use case. The platform suits best for product recommendations, fraud detection, disease identification, sentiment analysis, infrastructure management, and many more. Hadoop distributes a job across a cluster of commodity hardware and gets it done within a limited time. Saving time and money is the ultimate goal of any business.

This is how Hadoop works in big data. By reaching the end of this post, I hope you have gained enough knowledge of how the Hadoop ecosystem works. You can get practical knowledge of Hadoop from real-time industry professionals through the Hadoop Online Course. In an upcoming post on this blog, I'll share a detailed explanation of each component of the Hadoop file system. You can also check out our Hadoop Interview Questions prepared by experts on our website.