This is the fourth and final post in our series dedicated to learning resources for big data. In this post, we will introduce you to Hadoop: a framework for distributed computing and storage for handling very large datasets.
Let me be clear at the outset. This post is not going to be a tutorial on how to use Hadoop. (If you are looking for such a tutorial, click here.) The focus is mainly on providing you with a high-level overview of how Hadoop works and of the distributed computing architecture. Let’s get started.
You have a very large dataset, on the order of gigabytes or even terabytes. You can’t store it on a single physical machine. And you certainly cannot process it there! So how do we approach the problem?
One idea is to store a file across multiple systems. The data is then processed in parallel and locally, i.e., each system or node manipulates the data it has on hand. This enables the data to be processed very quickly and efficiently.
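To make the idea concrete, here is a minimal sketch of "split the data, process each piece locally, combine the results", using Python's standard `multiprocessing` module to stand in for separate nodes. The chunking scheme and the `local_sum` task are illustrative choices, not anything Hadoop-specific.

```python
from multiprocessing import Pool

def local_sum(chunk):
    # Each "node" works only on the data it holds locally.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_nodes = 4
    # Split the dataset into one chunk per node.
    chunks = [data[i::n_nodes] for i in range(n_nodes)]
    with Pool(n_nodes) as pool:
        # The chunks are processed in parallel, one process per chunk.
        partial_sums = pool.map(local_sum, chunks)
    # Combining the local results gives the same answer as sum(data).
    total = sum(partial_sums)
```

The key point is that no process ever sees the whole dataset; each computes a partial answer over its own slice, and only the small partial results are combined at the end.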
Before we get into the details of how Hadoop works, we need to understand the basics of a master-slave process. A master-slave system works roughly as follows:
A process or device acts as the master. It delegates tasks or responsibilities to other processes known as slaves. The slaves can run simultaneously or one at a time, depending on priority.
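The delegation pattern above can be sketched in a few lines. This is a toy illustration, assuming threads as "slaves" and a shared queue as the master's task list; the squaring task is a stand-in for real work.

```python
import queue
import threading

task_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def slave():
    # Each slave pulls tasks from the master's queue until it is empty.
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return
        outcome = task * task  # stand-in for real work
        with results_lock:
            results.append(outcome)

def master(num_slaves=3):
    # The master enqueues all tasks, starts the slaves,
    # and waits for them to finish before collecting results.
    for task in range(10):
        task_queue.put(task)
    workers = [threading.Thread(target=slave) for _ in range(num_slaves)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sorted(results)
```

Calling `master()` here yields the squares of 0 through 9; which slave computed which square is irrelevant, which is exactly why the master-slave split works.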
Simple enough right? (It’s not that simple to implement. There has been extensive research on discovering efficient ways to implement the process. Describing these is beyond the scope of this series.)
Hadoop consists of two core components: HDFS and MapReduce. HDFS or Hadoop Distributed File System takes care of the data storage component while MapReduce takes care of the processing.
HDFS handles distributed file storage. It is highly reliable, portable and scalable, and it ensures reliability by replicating data across multiple nodes. A single NameNode tracks where data is stored in a cluster of machines (known as DataNodes), while each file is split into chunks or data blocks (typically 64 MB in size) that are distributed among the DataNodes.
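The bookkeeping the NameNode does can be sketched roughly as follows. This is a simplified model, not real HDFS code: the 64 MB block size matches the classic default, the replication factor of 3 is the usual default, and the round-robin placement is a stand-in for HDFS's actual rack-aware placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default
REPLICATION = 3                # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # NameNode-style metadata: which byte ranges become which blocks.
    blocks = []
    offset, block_id = 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((block_id, offset, length))
        offset += length
        block_id += 1
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    # Assign each block to `replication` distinct DataNodes.
    # Real HDFS uses rack-aware placement; round-robin is a toy stand-in.
    placement = {}
    for block_id, _, _ in blocks:
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement
```

For example, a 200 MB file splits into four blocks (three full 64 MB blocks plus an 8 MB remainder), and each block lands on three different DataNodes, so losing any single machine loses no data.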
You can think of HDFS as a large storage bin. You dump in your data, and it happily resides there until you want to do some processing. This processing is taken care of by tools such as MapReduce.
MapReduce does the actual data processing in a very slick way. Traditionally, data is moved across a network to be processed by a server. You are aware that opening a large file takes time. A very long time. MapReduce speeds up the process tremendously. How? By reversing it: rather than move data across the network, MapReduce, in a sense, moves the processing software to the data.
It works by breaking down a large computation into multiple tasks. MapReduce uses two components to accomplish this: a JobTracker and a TaskTracker. The JobTracker divides the computation into smaller subtasks and hands them off to TaskTrackers on other machines in the cluster. Once the subtasks are done, the results are consolidated. This architecture is exactly the master-slave process outlined above: the JobTracker is the master and the TaskTrackers are its slaves.
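The map and reduce steps themselves are easy to demonstrate with the classic word-count example. This is a single-machine sketch of the programming model only; in real Hadoop the chunks would live on different DataNodes and the phases would run on different TaskTrackers.

```python
from collections import defaultdict

def map_phase(chunk):
    # The "map" step: emit (word, 1) pairs for the local chunk of text.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # The "reduce" step: consolidate the partial results by key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Two chunks standing in for data blocks stored on two different nodes.
chunks = ["big data big ideas", "big data tools"]

# Each chunk is mapped where it lives; the merged pairs are then reduced.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(mapped)
```

Notice that `map_phase` never needs to see more than one chunk, which is precisely what lets the work run where the data already sits.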
The future looks bright for the users of Hadoop. Hadoop innovation is happening very quickly. Hadoop is quite difficult to use, and so many in the open source community have been developing tools to make it easier. Hadoop is going to be the mainstay of Big Data processing for a long time to come.
With this we come to the end of our admittedly brief foray into the world of Big Data and analytics. We hope you enjoyed this series. Once again, special thanks to Dawar Dedmari without whom this series would not have gotten off the ground. We will have more such series in the future. Give us your feedback on what you would like to have featured next! Tell us what you think in the comment section below.