Big Data Hadoop


Hadoop is an open-source Apache framework written in the Java programming language. It allows distributed processing of large datasets across many computers using simple programming models. The framework provides both distributed storage and computation across clusters of computers, and it is designed to scale up from a single server to thousands of machines.

In 2006, Doug Cutting joined Yahoo and took with him the Nutch project, which was later split; Hadoop emerged as its distributed storage and processing portion. In 2008, Yahoo released Hadoop as an open-source project. Today, the framework and its ecosystem of technologies are maintained and managed by the Apache Software Foundation (ASF), a global community of software developers.

Why is Hadoop Important?

  • Scalability: You can easily grow your system to handle more data simply by adding more nodes, with little administration required.
  • Low Cost: The framework is open source and uses commodity hardware to store large quantities of data.
  • Fault Tolerance: Data and applications are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes, and multiple copies of all data are stored, so processing does not fail.
  • Computing Power: Hadoop's distributed model processes big data quickly; processing power is directly proportional to the number of computing nodes, so the more nodes you have, the more power you get.
  • Ability to process huge amounts of data of any kind, quickly: With data volumes constantly growing, especially on social media, that is a key consideration.
Hadoop Architecture


The framework includes four modules:
Hadoop Common:

These are the Java libraries and utilities that provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.

Hadoop YARN:

A framework for job scheduling and cluster resource management.

Hadoop Distributed File System(HDFS):

A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce:

A YARN-based system for parallel processing of large data sets.

Since 2012, the term “Hadoop” often refers not just to the base modules mentioned above but also to the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, and Apache Spark.


Hadoop MapReduce

MapReduce is a software framework for easily writing applications that process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. The term MapReduce refers to two distinct tasks that Hadoop programs perform:
  • Map task: This task takes the input data and converts it into another set of data, where individual elements are broken down into key/value pairs.
  • Reduce task: This task takes the output of a map task as its input and combines those key/value pairs into a smaller set of tuples (a minimal sketch follows this list).
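
For example, the classic word-count program pairs one such map task with one reduce task. The sketch below is a minimal, illustrative version written against the org.apache.hadoop.mapreduce API; the class names (WordCount, TokenizerMapper, IntSumReducer) are assumptions made for this example.

// Minimal word-count sketch; class names are illustrative assumptions.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: break each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: combine the map output for each word into a single count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}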

Both input and output are stored in the file system, and the framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. The framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The JobTracker is a single point of failure for Hadoop MapReduce: if it goes down, all running jobs are halted. The master is responsible for scheduling tasks on the trackers; the slaves execute the tasks assigned by the master and report task status periodically.

Hadoop Distributed File System (HDFS)

This is the most common file system used by Hadoop; it is based on the Google File System (GFS). It uses a master/slave architecture in which the master is a single NameNode that manages the file system metadata, and the slaves are DataNodes that store the actual data. A file in the HDFS namespace is split into several blocks, and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to DataNodes, while the DataNodes take care of read and write operations in the file system and handle block creation, deletion, and replication based on instructions from the NameNode. HDFS also provides a shell-like interface with a list of commands for interacting with the file system.
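
As an illustration of how an application talks to HDFS through the NameNode and DataNodes, the sketch below uses the HDFS Java API (org.apache.hadoop.fs.FileSystem) to write and read a small file; the NameNode address and the file path are assumptions made for this example.

// Minimal sketch of writing and reading a file through the HDFS Java API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");     // hypothetical path

        // Write a small file; larger files are split into blocks across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read it back; the NameNode maps the blocks, the DataNodes serve them.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}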

How Hadoop Works

  • Stage 1: A user/application submits a job to Hadoop (the job client) along with its specifications, such as the locations of the input and output files in the distributed file system and the Java classes, packaged as a jar file, that contain the implementation of the map and reduce functions.
  • Stage 2: The Hadoop job client submits the job to the JobTracker, which then takes on the responsibilities of distributing the software to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
  • Stage 3: The TaskTrackers on the slave nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is written to output files on the file system.
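
Putting these stages together, the driver sketch below shows how a job client might configure and submit a job (Stage 1) before the framework takes over scheduling and execution; the input/output paths and the WordCount classes from the earlier sketch are illustrative assumptions.

// Minimal driver sketch for submitting a MapReduce job to the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);          // jar containing the MapReduce classes
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in the distributed file system (hypothetical paths).
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submit the job and wait; the framework schedules, monitors,
        // and re-executes failed tasks on the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}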

Advantages of Hadoop

  • The framework allows the user to quickly write and test distributed systems. It is efficient, automatically distributes data and work across the nodes, and utilizes the underlying parallelism of the CPU cores.
  • It does not rely on hardware to provide fault tolerance and high availability; rather, the framework itself is designed to detect failures and handle them.
  • Servers can be added to or removed from the cluster dynamically, and Hadoop continues to work without interruption.
  • Another big advantage is that, apart from being open source, it is compatible with all platforms, since it is Java based.

To get help with a Big Data Hadoop project, contact us.