This document describes how to install and configure a Hadoop cluster and Apache Flume. The project described below was done towards completion of our course, Cloud Computing Security and Privacy. We implemented an IDS-like application based on the Hadoop MapReduce framework, and we used Apache Flume to forward logs from application servers to Hadoop. Our Hadoop cluster consisted of one NameNode (master) and two DataNodes. We used Hadoop 2.7.2, Flume 1.6.0 and Ubuntu Server 16.04. A conceptual diagram of our implementation is shown below.
Data Flow to Hadoop
To forward real-time logs from the application servers to Hadoop, we used Apache Flume. Below we list the data flow between the different application servers and Hadoop.
- Configured Rsyslog service on application servers (Web, FTP and Proxy Servers) to forward logs to Apache Flume for further processing.
- Apache Flume was configured to listen on port 7777 for syslog events from the application servers.
- Apache Flume then forwards the logs received from the application servers to a Flume directory in HDFS using a memory channel; the logs are eventually written to the DataNodes.
The flow is described in the figure below.
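Each datagram rsyslog forwards carries a numeric priority prefix, `<facility * 8 + severity>`, which Flume's syslogudp source decodes into event headers. The Python sketch below illustrates only that priority encoding; it is our own illustration (the `parse_syslog` helper is not part of Flume or rsyslog):

```python
def parse_syslog(datagram: bytes) -> dict:
    """Decode the RFC 3164 priority prefix of a syslog datagram.

    The priority is facility * 8 + severity; a datagram without a
    "<PRI>" prefix gets the protocol's default priority of 13.
    """
    text = datagram.decode("utf-8", errors="replace")
    if text.startswith("<") and ">" in text:
        end = text.index(">")
        pri = int(text[1:end])
        body = text[end + 1:]
    else:
        pri, body = 13, text
    return {"facility": pri // 8, "severity": pri % 8, "body": body}

# Example: facility 4 (auth) and severity 2 (critical) encode as <34>.
event = parse_syslog(b"<34>Oct 11 22:14:15 web1 su: 'su root' failed")
```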
Data Flow from Hadoop
Data flow from Hadoop to our IDS application is shown in the figure below. We wrote a script, scheduled to execute every five minutes, that automates the complete process. We will cover this in part 2 of this tutorial.
Hadoop Installation
Update the hosts file on all nodes with the following content
hadoop@HadoopMaster:~$ sudo vi /etc/hosts

127.0.0.1       localhost
#127.0.1.1      HadoopMaster
192.168.246.130 HadoopMaster
192.168.246.131 HadoopData1
192.168.246.132 HadoopData2
192.168.246.133 Flume
Install and verify Java on all nodes
hadoop@HadoopMaster:~$ sudo add-apt-repository ppa:webupd8team/java
hadoop@HadoopMaster:~$ sudo apt-get update
hadoop@HadoopMaster:~$ sudo apt-get install oracle-java8-installer
hadoop@HadoopMaster:~$ sudo apt-get install oracle-java8-set-default
hadoop@HadoopMaster:~$ java -version
Download Hadoop, create the HDFS NameNode directory on the master, the HDFS DataNode directories on HadoopData1 and HadoopData2, and a tmp directory
hadoop@HadoopMaster:~$ cd /usr/local/
hadoop@HadoopMaster:/usr/local$ sudo wget http://apache.mirror.iweb.ca/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
hadoop@HadoopMaster:/usr/local$ sudo tar -xzf hadoop-2.7.2.tar.gz
hadoop@HadoopMaster:/usr/local$ sudo mv hadoop-2.7.2/ hadoop/
hadoop@HadoopMaster:/usr/local$ sudo mkdir -p /usr/local/hadoop_tmp
hadoop@HadoopMaster:/usr/local$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hadoop@HadoopMaster:/usr/local$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hadoop@HadoopMaster:/usr/local$ sudo chown hadoop:hadoop hadoop* -R
hadoop@HadoopData1:/usr/local$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hadoop@HadoopData1:/usr/local$ sudo chown hadoop:hadoop hadoop* -R
hadoop@HadoopData2:/usr/local$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hadoop@HadoopData2:/usr/local$ sudo chown hadoop:hadoop hadoop* -R
Create SSH keys and transfer them to the data nodes so that the nodes can access each other without a passphrase
hadoop@HadoopMaster:~$ ssh-keygen -t rsa -P ""
hadoop@HadoopMaster:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
hadoop@HadoopMaster:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@HadoopData1
hadoop@HadoopMaster:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@HadoopData2
Set the Hadoop and Java environment variables by appending the following at the end of the ~/.bashrc file on each node
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Reload the .bashrc file with the following command
hadoop@HadoopMaster:~$ source ~/.bashrc
Hadoop Configuration
Go to the Hadoop configuration directory
hadoop@HadoopMaster:~$ cd /usr/local/hadoop/etc/hadoop/
Update the Java path in the file hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
Update the content of the core-site, hdfs-site, mapred-site and yarn-site files (in Hadoop 2.7.2, mapred-site.xml must first be copied from mapred-site.xml.template)
core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop_tmp/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://HadoopMaster:54310</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
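With dfs.replication set to 2, every HDFS block is stored on both DataNodes, so the space usable for unique data is the raw DataNode capacity divided by the replication factor. A hypothetical helper (ours, not part of Hadoop) to make the arithmetic concrete:

```python
def usable_hdfs_capacity_gb(datanode_capacities_gb, replication=2):
    """Logical space for unique data: raw capacity / replication factor.

    A back-of-the-envelope estimate only; ignores non-DFS usage and
    any reserved space on the DataNodes.
    """
    return sum(datanode_capacities_gb) / replication

# Two 500 GB DataNodes with replication 2 hold ~500 GB of unique data.
estimate = usable_hdfs_capacity_gb([500, 500], replication=2)
```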
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>HadoopMaster:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>HadoopMaster:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>HadoopMaster:8050</value>
  </property>
</configuration>
Update the masters file with the hostname of the master node
hadoop@HadoopMaster:/usr/local/hadoop/etc/hadoop$ vi masters

HadoopMaster
Update the slaves file with the hostnames of the DataNodes/slaves
hadoop@HadoopMaster:/usr/local/hadoop/etc/hadoop$ vi slaves

HadoopData1
HadoopData2
Sync the configuration to both data nodes
hadoop@HadoopMaster:~$ rsync -avP /usr/local/hadoop/etc/hadoop/ hadoop@HadoopData1:/usr/local/hadoop/etc/hadoop/
hadoop@HadoopMaster:~$ rsync -avP /usr/local/hadoop/etc/hadoop/ hadoop@HadoopData2:/usr/local/hadoop/etc/hadoop/
Format the NameNode
hadoop@HadoopMaster:~$ hadoop namenode -format
Start the Hadoop cluster
hadoop@HadoopMaster:~$ start-all.sh
The following figure shows the processes running on HadoopMaster.

The following figure shows the processes running on the Hadoop DataNodes.
Create the Flume input directory on HDFS (hadoop fs -mkdir does not create parent directories unless -p is given, hence one command per level)
hadoop@HadoopMaster:~$ hadoop fs -mkdir /user
hadoop@HadoopMaster:~$ hadoop fs -mkdir /user/hadoop
hadoop@HadoopMaster:~$ hadoop fs -mkdir /user/hadoop/flume
hadoop@HadoopMaster:~$ hadoop fs -mkdir /user/hadoop/flume/input
Apache Flume Installation and Configuration
We installed Flume using Cloudera's packages. The data flow from the application servers to Hadoop is shown in the following figure.
To install Flume, issue the following commands
flume@flume:~$ sudo wget https://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/archive.key -O archive.key
flume@flume:~$ sudo apt-key add archive.key
flume@flume:~$ sudo wget 'https://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/cloudera.list' -O /etc/apt/sources.list.d/cloudera.list
flume@flume:~$ sudo apt-get update
flume@flume:~$ sudo apt-get install flume-ng
flume@flume:~$ cd /etc/flume-ng/conf
flume@flume:/etc/flume-ng/conf$ sudo cp flume-env.sh.template flume-env.sh
Edit flume-env.sh and update the Java path; to verify the Java path, run the command "update-alternatives --config java"
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Create the file flume-syslog.conf and paste the following content into it
flume@flume:~$ sudo vi /etc/flume-ng/conf/flume-syslog.conf

# Flume-syslog agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source - SyslogUDP
a1.sources.r1.type = syslogudp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 7777
a1.sources.r1.keepFields = timestamp hostname

# Sink - HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://HadoopMaster:54310/user/hadoop/flume/input
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.filePrefix = syslog
a1.sinks.k1.hdfs.minBlockReplicas = 1

# Memory Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Binding Source and Sink to Channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
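The hdfs.roll* settings decide when the HDFS sink closes the current file and opens a new one; setting a trigger to 0 disables it. The Python sketch below mirrors how the three triggers combine under this configuration's values. It is our own illustration of the logic, not Flume's code:

```python
def should_roll(age_seconds, size_bytes, event_count,
                roll_interval=600, roll_size=0, roll_count=10000):
    """Return True when any enabled roll trigger has fired.

    Defaults mirror flume-syslog.conf: roll every 10 minutes or every
    10000 events, with size-based rolling disabled (rollSize = 0).
    """
    return ((roll_interval > 0 and age_seconds >= roll_interval) or
            (roll_size > 0 and size_bytes >= roll_size) or
            (roll_count > 0 and event_count >= roll_count))

# A file that is 10 minutes old rolls even if it is small.
rolls = should_roll(age_seconds=600, size_bytes=4096, event_count=12)
```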
To start Flume on boot, create a flume-ng.service file and paste the following content into it
flume@flume:~$ sudo vi /etc/systemd/system/flume-ng.service

[Unit]
Description=Apache Flume
After=network.target

[Service]
ExecStart=/usr/bin/flume-ng agent -c /etc/flume-ng/conf -f /etc/flume-ng/conf/flume-syslog.conf --name a1

[Install]
WantedBy=multi-user.target
Set permissions and enable the flume-ng service to start on boot by issuing the following commands
flume@flume:~$ sudo chmod 664 /etc/systemd/system/flume-ng.service
flume@flume:~$ sudo systemctl daemon-reload
flume@flume:~$ sudo systemctl enable flume-ng.service
Reboot and verify that Flume is running by issuing the command "ps aux | grep flume-ng"; if everything went well you will see output similar to that shown in the following picture.
Forwarding Syslog to Flume:
On the client machines, add the following line at the end of the rsyslog configuration file
hadoop@HadoopData2:~$ sudo vi /etc/rsyslog.conf

*.* @Flume:7777
Save and restart the rsyslog service
hadoop@HadoopData2:~$ sudo service rsyslog restart
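Before relying on rsyslog end to end, a single hand-crafted datagram can confirm that the Flume port is reachable. This hypothetical Python sender mimics what the *.* @Flume:7777 rsyslog rule does (a UDP datagram with an RFC 3164 priority prefix); the helper name and its defaults are our own:

```python
import socket

def send_test_syslog(host, port=7777, facility=1, severity=6,
                     message="flume-pipeline-test"):
    """Send one RFC 3164-style syslog datagram over UDP."""
    pri = facility * 8 + severity            # e.g. user.info -> <14>
    payload = f"<{pri}>{message}".encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, port))
    return payload

# Point this at the Flume host; the event should appear in HDFS under
# /user/hadoop/flume/input once the sink rolls the current file.
```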
Running "hadoop fs -ls /user/hadoop/flume/input" will show all files created by Flume.