Hadoop is an open source framework for distributed storage and processing of big data sets on commodity hardware.
It consists of the Hadoop Distributed File System (HDFS) for storage and a parallel processing engine, classically MapReduce, for computation.
In this guide, we will install Hadoop on an Ubuntu virtual machine.

What is Hadoop?
Hadoop is a free, open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer.
Why use Hadoop?
Hadoop was designed to tackle problems involving large data sets that were too big to process using traditional relational database management systems.
But what exactly is “big data”? There is no definitive answer, but one common definition is data sets that are so large and complex that they are difficult to process using available technology.
Another way to think of it is data that doesn’t fit neatly into the rows and columns of a relational database.
This can include social media data, weather data, log files, sensor data, and a whole host of other unstructured data types.
Hadoop is designed to be scalable so that it can easily handle data growth. As more nodes (computers) are added to a Hadoop cluster, the system becomes more powerful and can process larger data sets.
Hadoop Installation
This quickstart assumes you have a running Ubuntu virtual machine (VM) with Internet connectivity.
If you don’t have an Ubuntu VM set up, you can create one easily using VirtualBox and Vagrant. Just follow the steps in the Getting Started with VirtualBox and Vagrant guide.
If you prefer, you can also use a cloud-based Ubuntu VM. For more information, see the Cloud Servers section of the Ubuntu Documentation.
Once you have your Ubuntu VM ready, log in and follow the steps below to install Hadoop.
- Update your package manager indexes: sudo apt-get update
- Install Java 8 (Hadoop requires a JDK):
  sudo apt-get install openjdk-8-jdk-headless -y   # for headless servers
  sudo apt-get install openjdk-8-jdk -y            # for servers with a UI
- Install Hadoop: sudo apt-get install hadoop -y (the hadoop package is not in the stock Ubuntu repositories, so a repository that provides it, such as Apache Bigtop, must already be configured)
- Start the Hadoop service: sudo systemctl start hadoop
- Check the status of the Hadoop service: sudo systemctl status hadoop
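If the installation completed without errors, a quick sanity check is worthwhile. The commands below are a minimal sketch: they assume the hadoop package put the hadoop command on your PATH and registered the hadoop systemd unit used in the steps above.

# confirm the JDK is installed
java -version

# confirm the Hadoop client is on the PATH and print its version
hadoop version

# confirm the service registered by the package is running
systemctl is-active hadoop

If hadoop version prints a version string and the service reports active, you are ready to configure Hadoop.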
Hadoop Configuration
- Before you can run Hadoop, you need to configure it by editing the files in Hadoop's configuration directory (conf/ in older releases, etc/hadoop/ in Hadoop 2.x and later). The main configuration file is core-site.xml, which contains properties common to all Hadoop components. At a minimum, set the fs.defaultFS property to your NameNode's URI.
- The next important file is mapred-site.xml, which contains properties specific to MapReduce. In particular, set the mapreduce.framework.name property to the MapReduce implementation you want to use (either “local” or “yarn”).
- Finally, hdfs-site.xml contains properties specific to HDFS. One important property is dfs.replication, which specifies how many replicas of each HDFS block to keep. (A minimal example of all three files appears after these steps.)
- Once you have edited these configuration files, you can start Hadoop by running the following command:
sbin/start-all.sh
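As a concrete illustration, here is a minimal single-node configuration. This is only a sketch: it assumes the files live under $HADOOP_HOME/etc/hadoop (a packaged install may use /etc/hadoop/conf instead), and the localhost URI and replication factor of 1 are example values for a single machine.

# core-site.xml -- point fs.defaultFS at the NameNode URI
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# mapred-site.xml -- run MapReduce jobs on YARN
cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

# hdfs-site.xml -- keep one replica of each block (enough for a single node)
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

Before starting HDFS for the first time, format the NameNode with hdfs namenode -format; after that, the start scripts above will bring up the daemons.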
Hadoop Management
Once Hadoop is installed and running, day-to-day management of the cluster comes down largely to two concerns: monitoring its health and securing the data it stores. The sections below cover the basics of each.
Hadoop Monitoring
There are numerous tools available to monitor a Hadoop installation. One widely used option is Ganglia, a scalable distributed monitoring system aimed at high-performance computing systems such as clusters and grids.
Another popular monitoring solution is Nagios, an open-source system and network monitoring tool.
If you want to monitor Hadoop directly, the simplest option is Hadoop's built-in web interfaces.
The NameNode web interface provides detailed information about the state of HDFS, including the number of DataNodes in your cluster and the capacity of each node.
To access it, point a web browser at http://<namenode_host>:9870 (Hadoop 3.x; use port 50070 for Hadoop 2.x). The YARN ResourceManager offers a similar interface for jobs on port 8088.
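The same information is also available from the command line, which is handy for scripting health checks. A small sketch, assuming HDFS is running and the hdfs command is on your PATH:

# summarize configured capacity, usage, and the list of live and dead DataNodes
hdfs dfsadmin -report

# walk the filesystem and report missing, corrupt, or under-replicated blocks
hdfs fsck /

Both commands talk to the NameNode, so they report the same cluster-wide view you see in the web interface.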
Hadoop Security
Hadoop is not secure out of the box, but it provides a number of features that can be enabled to protect user data (a brief sketch of using them follows the list):
- Authentication: Users must authenticate themselves before they are allowed to access Hadoop resources. By default Hadoop uses simple authentication, which trusts the user name supplied by the client; for strong authentication you should enable Kerberos.
- Authorization: Once a user has been authenticated, they may only access resources they have been authorized to use. In core Hadoop this is handled by HDFS file permissions and ACLs together with service-level authorization; finer-grained policies can be layered on with projects such as Apache Sentry or Apache Ranger.
- Encryption: Data stored in HDFS can be encrypted transparently using encryption zones, and data in transit between clients and the cluster can be encrypted as well.
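As a rough sketch of what enabling these features looks like in practice: the commands below use an example key name, path, and user, assume Kerberos has been turned on separately (hadoop.security.authentication set to kerberos in core-site.xml, with a working KDC and service principals), and require the Hadoop KMS to be configured for the encryption step.

# authorization: HDFS permissions and ACLs behave much like POSIX
hdfs dfs -mkdir -p /data/payroll
hdfs dfs -chmod 750 /data/payroll
hdfs dfs -setfacl -m user:alice:r-x /data/payroll   # ACLs must be enabled on the NameNode in older releases

# encryption: create a key in the KMS and turn the (empty) directory into an encryption zone
hadoop key create payroll-key
hdfs crypto -createZone -keyName payroll-key -path /data/payroll

Files written under /data/payroll are then encrypted and decrypted transparently for authorized clients.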