Hadoop
Hadoop is a collection of open-source software utilities that uses a network of many computers to solve problems involving large amounts of data and computation. Apache Hadoop provides a framework for distributed storage and processing of big data. It is designed for computer clusters built from commodity hardware, though it has also found use on clusters of higher-end hardware. Apache Hadoop consists of the sub-projects Hadoop MapReduce and HDFS (Hadoop Distributed File System). Hadoop MapReduce is the computational model and framework for applications that run on Hadoop, while HDFS takes care of the storage part; MapReduce applications read their input from and write their output to HDFS. Apache Hadoop YARN is the resource management and job scheduling technology in Hadoop. YARN is responsible for allocating system resources to the various applications running on the Hadoop cluster and scheduling tasks to be executed on different cluster nodes. Hadoop follows a master-slave architecture for data storage and distributed data processing using the MapReduce and HDFS methods.
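To make the MapReduce model concrete, a minimal word count job might look like the sketch below, using the standard org.apache.hadoop.mapreduce API; the class name and input/output arguments are only illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input lives in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output is written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}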
The name node keeps track of every file and directory used in the namespace. The data node manages the state of an HDFS node and allows clients to interact with data blocks. The master node coordinates parallel processing of data using Hadoop MapReduce. Slave nodes are the additional machines in the Hadoop cluster that store the data and carry out the complex calculations.
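As a rough sketch of how a client uses the namespace kept by the name node, the HDFS FileSystem Java API can list files and directories; the fs.defaultFS address and the path /user/demo below are assumed examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListNamespace {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS should point at the cluster's name node, e.g. hdfs://namenode:9000 (assumed address).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The listing is served from the name node's metadata; the file contents stay on the data nodes.
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println((status.isDirectory() ? "dir  " : "file ") + status.getPath());
    }
    fs.close();
  }
}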
Hadoop is used to store and process big data, helping to analyze very large amounts of data and draw meaningful conclusions. Hadoop is applied in many areas, such as security and law enforcement, understanding customer requirements, improving cities and countries, financial trading and forecasting, understanding and optimizing business processes, personal quantification and performance, improving healthcare and public health, optimizing machine performance, improving sports, and advancing science and research.
In a distributed system, the components are placed on different networked computers and communicate and coordinate their actions by passing messages to one another. These components interact with one another to achieve a common goal. Distributed systems can be divided into four architectural models: client-server, three-tier, n-tier and peer-to-peer. Significant characteristics of a distributed system are concurrency of components, lack of a global clock and independent failures of components. Given these characteristics, a distributed system should be fault tolerant, highly available, recoverable, consistent, scalable, secure and able to deliver predictable performance.
Fault Tolerance in Hadoop
Fault tolerance is the ability to keep working even under unfavorable conditions such as component failure. In ordinary systems such as a relational database, all operations are performed by the user on a single machine. If an unfavorable condition such as a machine failure, RAM crash, power outage or hard disk failure occurs, the user has to wait until the problem is corrected manually, and during the failure the data cannot be accessed. Hadoop is highly fault tolerant. Before Hadoop 3, it handled faults through replica creation: it creates replicas of the user's data on different machines in the HDFS cluster, so if any failure happens the data can still be accessed from another machine. Hadoop 3 introduced erasure coding to provide fault tolerance. Erasure coding in HDFS improves storage efficiency while providing the same level of fault tolerance and data durability as a replication-based HDFS deployment. Erasure coding works by striping a file into small parts and storing them on various disks.
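As a hedged sketch, both mechanisms can be controlled per path from a client using the HDFS APIs; the paths and the erasure coding policy name below are only examples, and the erasure coding call is available only on Hadoop 3 clusters.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FaultToleranceSettings {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Replication-based fault tolerance: keep three copies of this file on different data nodes.
    fs.setReplication(new Path("/data/important.csv"), (short) 3);

    // Erasure-coding-based fault tolerance (Hadoop 3+): stripe files under this directory
    // with a built-in policy instead of storing full replicas.
    if (fs instanceof DistributedFileSystem) {
      ((DistributedFileSystem) fs).setErasureCodingPolicy(new Path("/data/cold"), "RS-6-3-1024k");
    }
    fs.close();
  }
}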
High Availability in Hadoop
High availability is the availability of a system or its data even while components of the system fail. The high availability feature in Hadoop ensures that the Hadoop cluster stays available without any downtime when a node fails or a machine crashes. HDFS stores data files by splitting them into fixed-size blocks, which are stored on data nodes. The name node holds the metadata about the file system, and each data node continuously sends heartbeat messages to the name node. If the name node stops receiving these messages, it considers that data node dead, checks which data it held and replicates that data onto other data nodes. The name node is the only node that knows the list of files and directories in the Hadoop cluster, so for high availability of the system Hadoop runs two or more name nodes in a cluster. These features give users high availability of Hadoop.
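A small sketch of how this block placement can be observed from a client: the getFileBlockLocations call reports which data nodes hold the replicas of each block of a file (the file path is an assumed example).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/important.csv"));

    // The name node answers this from its metadata: each block and the data nodes holding its replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset() + " -> hosts " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}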
Recoverability in Hadoop
Recoverability means failed components can restart themselves after the cause of failure has been repaired and rejoin the system. When tasks run on multiple nodes concurrently, Hadoop can automatically detect task failure and restart failed tasks on other healthy nodes. When a block or node is lost, it automatically recreates the missing data from the available replicas.
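The retry behavior can be influenced per job; a minimal sketch using standard MapReduce configuration keys follows, where the specific values are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RecoverableJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Allow a failed map or reduce task to be retried on another healthy node a few times
    // before the whole job is declared failed.
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);

    Job job = Job.getInstance(conf, "recoverable job");
    // ... set mapper, reducer, input and output paths as usual ...
  }
}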
Consistency in Hadoop
Consistency means that a system coordinates the actions of multiple components, often in the presence of concurrency and failure; it reflects the ability of a distributed system to act like a single system. Apache ZooKeeper is an open source project designed to coordinate the multiple services in Hadoop. ZooKeeper is fast, especially with read-dominated workloads, and it maintains a record of all transactions.
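As a rough illustration of ZooKeeper-style coordination, the standard ZooKeeper Java client can create and read a small znode that several services agree on; the connection string and znode path below are assumed examples.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperCoordinationDemo {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (host and port are assumed examples).
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> { });

    // Write a small piece of shared state; every client that reads this path sees the same value.
    String path = "/demo-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    byte[] data = zk.getData(path, false, null);
    System.out.println("shared value: " + new String(data));
    zk.close();
  }
}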
Scalability in Hadoop
Scalability is the ability of a system to dynamically adjust its computing performance by changing the available computing resources and scheduling methods. In Hadoop, the cluster can easily be scaled by adding more nodes. There are two types of scalability in Hadoop: vertical scalability and horizontal scalability. Vertical scalability, referred to as scaling up, increases the hardware capacity of an individual machine and makes the existing system more powerful. Horizontal scalability, referred to as scaling out, basically means adding more machines or setting up a cluster. In horizontal scaling, instead of increasing hardware capacity, more nodes are added to the existing cluster and, most importantly, machines can be added without stopping the system.
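A hedged sketch of checking scale-out from a client: the DistributedFileSystem API can report how many live data nodes the name node currently sees, which is one simple way to verify that newly added nodes have joined the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterSizeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      // Ask the name node for the data nodes it currently knows about.
      DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
      System.out.println("live data nodes: " + nodes.length);
      for (DatanodeInfo node : nodes) {
        System.out.println("  " + node.getHostName());
      }
    }
    fs.close();
  }
}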
Predictable Performance in Hadoop
Predictable performance is the ability of the system to respond in a timely manner. Hadoop performance tuning helps to optimize Hadoop cluster performance so that it delivers the best results when doing Hadoop programming on big data. Some common performance tuning techniques are memory tuning, improving I/O performance, minimizing disk spill by compressing map output, tuning the number of mapper or reducer tasks, writing a combiner, using skewed joins and enabling speculative execution.
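For example, writing a combiner and compressing map output are two of these techniques; a hedged sketch of the relevant job settings follows, reusing the word count classes from the earlier sketch. The property names are standard MapReduce configuration keys, and the values are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedWordCountSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Minimize disk spill by compressing intermediate map output.
    conf.setBoolean("mapreduce.map.output.compress", true);

    // Memory and I/O tuning: give the map-side sort buffer more room (value is an example).
    conf.setInt("mapreduce.task.io.sort.mb", 256);

    // Speculative execution launches backup copies of slow tasks on other nodes.
    conf.setBoolean("mapreduce.map.speculative", true);

    Job job = Job.getInstance(conf, "tuned word count");
    job.setMapperClass(WordCount.TokenizerMapper.class);
    // A combiner pre-aggregates map output locally, so far less data is shuffled to the reducers.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    // ... set key/value classes and input/output paths as in the earlier sketch ...
  }
}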
Security in Hadoop
Hadoop is designed to manage large amounts of data in a trusted environment, so security is one of its most important concerns. Hadoop requires many of the same approaches found in traditional data management systems: authentication, authorization, auditing and data protection. Kerberos is the authentication protocol used in Hadoop. Transparent encryption is used for data protection in Hadoop; it is end-to-end encryption, which means that only clients encrypt or decrypt the data. For authorization, HDFS checks file and directory permissions after the user has been authenticated. Hadoop also has tools that support its security; Knox and Ranger are the two major open source projects that support Hadoop security.
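As a hedged sketch of Kerberos authentication from a client application, Hadoop's UserGroupInformation API can log in from a keytab before any HDFS access; the principal name and keytab path below are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client that the cluster requires Kerberos authentication.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Authenticate with a service principal and keytab (both names are hypothetical).
    UserGroupInformation.loginUserFromKeytab("analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");

    // After authentication, HDFS enforces file and directory permissions for this user.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("authenticated, home dir: " + fs.getHomeDirectory());
    fs.close();
  }
}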
In conclusion, Apache Hadoop is a popular and powerful tool for storing huge amounts of data in a distributed manner and processing it in parallel on a cluster of nodes. Visit https://hadoop.apache.org/ to get started with Apache Hadoop.