The goal of this blog is to automate the installation and configuration of a Hadoop multi-node cluster on Amazon EC2 instances. If that's what you are looking for, read on.
WARNING!!! This cluster is only for practice purposes!! It is not highly secure. If you want a properly secured cluster, you will have to apply much stricter security settings. The security settings here are purposely kept loose so that the cluster deploys smoothly without errors.
We will be building a 5-node cluster on Amazon EC2 t2.micro instances. The Hadoop version we will be using is 1.2.1 and the operating system will be Ubuntu Server 14.04 LTS (HVM). The username is ubuntu on all 5 machines.
The hostnames of the 5 machines will be nn for the namenode, 2nn for the secondary namenode, d1 for datanode 1, d2 for datanode 2, and d3 for datanode 3.
This is how the Hadoop daemons will run on the 5 instances:
The NameNode and JobTracker will run on nn.
The SecondaryNameNode will run on 2nn.
A DataNode and a TaskTracker will run on each of d1, d2 and d3.
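For reference, in Hadoop 1.x this layout ends up reflected in the conf/masters and conf/slaves files, which we will edit in step 3. The masters file holds just the host that runs the SecondaryNameNode:
2nn
and the slaves file lists the DataNode/TaskTracker hosts, one per line:
d1
d2
d3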
Before starting, you will have to download two tools (PuTTY and PuTTYgen), the Hadoop bash script and the EC2 hosts config file:
1. PuTTY: http://the.earth.li/~sgtatham/putty/latest/x86/putty.exe
2. PuTTYGEN: http://the.earth.li/~sgtatham/putty/latest/x86/puttygen.exe
3. Hadoop Bash Script: https://drive.google.com/file/d/0B2T8Pye0P7e5Qm93cWwwN3ZOUUk/view?usp=sharing
4. Hadoop hosts Config: https://drive.google.com/file/d/0B2T8Pye0P7e5bUVPdlFoWXhUNjQ/view?usp=sharing
Let's start with the multi-node cluster.
STEP 1: Logging in to your Amazon Web Services account
If you do not have an AWS account, create one. You will have to add a credit/debit card and fax a copy of a government ID to Amazon, and activation takes 1 or 2 days. After logging in to your Amazon Web Services account you need to do two things:
A. Create a Keypair for logging into your instances.
B. Create a security group.
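The video below shows the exact settings, but roughly the security group needs to allow SSH (port 22) so PuTTY can reach the instances, the web UI ports 50070 and 50030 so you can open the cluster summary pages from your browser, and traffic between the cluster nodes themselves. The simplest (and least secure) option is to open everything up, which is exactly the kind of loose setting the warning at the top of this post is about.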
Important!! Name your Amazon private key "Key1" so that everything goes smoothly through the tutorial. You can use another name if you want, but then carefully edit the bash script to match the name of your key.
Below is the video on how to create a key pair and a security group:
STEP 2: Creating the EC2 Instances
We require 5 instances. We will choose the t2.micro instance type with the default 8GB EBS (Elastic Block Store) volume for each instance. The operating system on all the instances will be Ubuntu Server 14.04 LTS; the bash script may not work if you choose any other operating system. The total EBS usage will be 40GB, of which 30GB is free under the free tier scheme. The remaining 10GB is charged at a negligible rate of $0.10 per GB per month. In my experience Amazon EBS is actually billed on an hourly basis rather than monthly, and these two discussions seem to confirm it: http://stackoverflow.com/questions/5468535/amazon-ebs-pricing-monthly-daily-hourly and http://serverfault.com/questions/197379/amazon-ebs-charges-calculation
If you don't know much about the free tier, please take a careful look at everything that is free under the free tier scheme (Link: http://aws.amazon.com/free/). As far as this tutorial is concerned, the t2.micro instance (up to 750 hours/month), 30GB of EBS, Ubuntu Server 14.04 and 5GB of S3 storage are free (as of January 2015).
Upload the Key1.pem file to your S3 bucket, and replace the private-key-link and private-key-name placeholders in the EC2-Launch-Config-Multinode file, as shown in the video.
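I cannot reproduce the launch config file here, but to give you an idea of what those two placeholders stand for: private-key-link is presumably the public S3 link of your uploaded Key1.pem and private-key-name its file name, so the relevant lines boil down to something like the following (bucket name and paths are purely illustrative; use your own link from the S3 console):
wget https://s3.amazonaws.com/your-bucket-name/Key1.pem -O /home/ubuntu/.ssh/Key1.pem
chmod 600 /home/ubuntu/.ssh/Key1.pem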
Here is the video on how to create the 5 instances:
Important!!! Do not forget to paste the bash script into the User Data field under Advanced Details while configuring the t2.micro instances.
Wait till you get 2/2 checks for all 5 instances in the Status Checks column. It will take around 3 to 5 minutes.
STEP 3: Logging in to the machines and configuring them
In this step we will log in to all 5 instances and add and edit the hosts, hostname, masters and slaves config files. Here is the video for steps 3 and 4:
Tip!!! Right-clicking on the PuTTY terminal pastes whatever is on the clipboard. While copy-pasting the commands, select one extra line so that an "enter" goes along with the commands. Carefully follow all the steps in the same order for a smooth deployment.
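To give you an idea of what the editing in this step amounts to, the /etc/hosts file on every machine ends up mapping the private IPs of the instances to the five hostnames, roughly like this (the IPs are placeholders; use the Private IP of each instance from the EC2 console):
172.31.x.x   nn
172.31.x.x   2nn
172.31.x.x   d1
172.31.x.x   d2
172.31.x.x   d3
and /etc/hostname on each machine holds just that machine's own name (nn, 2nn, d1, d2 or d3).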
You can check the cluster summary from your browser here:
1. Namenode summary at [Namenode(nn) Public DNS]:50070
2. Jobtracker summary at [Namenode(nn) Public DNS]:50030
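For example, if the Public DNS of nn shows up in the EC2 console as ec2-54-12-34-56.compute-1.amazonaws.com (yours will be different), the namenode summary is at http://ec2-54-12-34-56.compute-1.amazonaws.com:50070 and the jobtracker summary at http://ec2-54-12-34-56.compute-1.amazonaws.com:50030.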
STEP 4: Starting the cluster and exploring Hadoop
In this step we will fire up the Hadoop daemons and start exploring.
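Assuming the bash script has put the Hadoop 1.2.1 bin directory on the PATH for the ubuntu user (adjust the paths if your setup differs), firing up the cluster from nn looks roughly like this:
hadoop namenode -format   # only the very first time, to format HDFS
start-dfs.sh              # starts the namenode on nn, the secondary namenode on 2nn and the datanodes on d1-d3
start-mapred.sh           # starts the jobtracker on nn and the tasktrackers on d1-d3
jps                       # run on any machine to see which daemons it is running
The video above (steps 3 and 4) walks through the same thing, so treat this just as a cheat sheet.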
Hadoop Shell commands: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#cat
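To try a few of these out, and to run the word count example that ships with Hadoop 1.2.1, something along these lines works (the file name is just an example, and the examples jar is assumed to be hadoop-examples-1.2.1.jar in the Hadoop install directory, as in the stock tarball):
hadoop fs -mkdir /user/ubuntu/input
hadoop fs -put mybook.txt /user/ubuntu/input/
hadoop fs -ls /user/ubuntu/input
hadoop fs -cat /user/ubuntu/input/mybook.txt
hadoop jar /path/to/hadoop-examples-1.2.1.jar wordcount /user/ubuntu/input /user/ubuntu/output
hadoop fs -cat /user/ubuntu/output/part-*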
Google for use cases of a multi-node Hadoop cluster. Let me know in the comments if you get stuck anywhere in the tutorial and I will help you resolve the issue.
I hope this blog was informative for you, and thank you for reading it.
-Mohammad Yusuf Ghazi