Listen:
Since Cloudera is building a complete stack, it has become much easier to get a cluster up and running. They build RPMs for Red Hat-based systems (I use CentOS); the newest CDH build is CDH2, which comes with a lot of nice add-ons. Thanks for the work, guys!
What hardware should we use? Simple, moderate servers are good enough for the slaves: 2 CPUs, 8 GB RAM and 4x750 GB HDD will do. The master should have 4 CPUs, 8 or more GB RAM and 100 GB on RAID 5, and we need that hardware twice. Why? The master is the most important server; it hosts the namenode, the jobtracker and, if you set it up, ganglia. The hardware you use depends on your use cases; what I describe here works for event analysis and weblogs, but it is a good start to see what power Hadoop can deliver. The systems are located in the same rack, so rack awareness can be dealt with later. Use LVM and create one large volume; I mount it at /opt/hadoop:
# df -h /opt/hadoop/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/myvg-hadoopvol
1.4T 213G 1.1T 17% /opt/hadoop
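If that volume does not exist yet, a minimal LVM sketch could look like this; the disks /dev/sdb to /dev/sde are only placeholders for whatever your boxes have, and ext3 is just one reasonable choice:
# pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
# vgcreate myvg /dev/sdb /dev/sdc /dev/sdd /dev/sde
# lvcreate -l 100%FREE -n hadoopvol myvg
# mkfs.ext3 /dev/myvg/hadoopvol
# mkdir -p /opt/hadoop && mount /dev/myvg/hadoopvol /opt/hadoop
Do not forget the matching entry in /etc/fstab.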
Important: grab the JDK RPM (1.6.x) from java.sun.com and install it on every node.
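For example (the filename is just a placeholder, use whatever 1.6.x build you downloaded):
# rpm -Uvh jdk-6u22-linux-amd64.rpm
# java -version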
Hint: it is a good idea to deploy an SSH key to all nodes so we can authenticate without a password: append the master's id_rsa.pub to /root/.ssh/authorized_keys on every node, for example as sketched below.
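A minimal sketch, assuming the key pair is generated on the master and the nodes are named as in the slaves file further down, plus namenode2:
# ssh-keygen -t rsa
# for i in datanode1 datanode2 datanode3 datanode4 datanode5 datanode6 namenode2; do ssh-copy-id -i /root/.ssh/id_rsa.pub root@$i; done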
All servers are running; first get the repo working on your boxes. Simply add a file in /etc/yum.repos.d/:
# cat /etc/yum.repos.d/cloudera-cdh2.repo
[cloudera-cdh2]
name=Cloudera's Distribution for Hadoop, Version 2
mirrorlist=http://archive.cloudera.com/redhat/cdh/2/mirrors
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
enabled=1
Deploy that file to all nodes.
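For example, with the node names used throughout this post:
# for i in datanode1 datanode2 datanode3 datanode4 datanode5 datanode6 namenode2; do scp /etc/yum.repos.d/cloudera-cdh2.repo $i:/etc/yum.repos.d/; done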
Now we raise a few limits:
# cat /etc/security/limits.conf
hdfs soft nofile 5000
hdfs hard nofile 5000
mapred hard nofile 5000
mapred soft nofile 5000
hadoop hard nofile 5000
hadoop soft nofile 5000
# cat /etc/sysctl.conf
fs.file-max=200000
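The nofile limits take effect at the next login of the hdfs/mapred/hadoop users; the sysctl setting can be applied and checked right away:
# sysctl -p
# cat /proc/sys/fs/file-max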
Set up the namenode and jobtracker on the master box:
"yum install hadoop-0.20-pipes hadoop-0.20-native hadoop-0.20-jobtracker hadoop-0.20 hadoop-zookeeper hadoop-0.20-namenode -y && mkdir -p /opt/hadoop/name && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred"
To protect the config from unwanted package updates we use alternatives here:
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.used
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.used 50
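To verify which config set is active:
# alternatives --display hadoop-0.20-conf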
I use port 9000 for hdfs, but that depends on your environment.
Disable IPv6: edit /etc/hadoop-0.20/conf.used/hadoop-env.sh and add:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Edit the main config files in /etc/hadoop-0.20/conf.used/; the snippets below go inside the <configuration> element of each file.
core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://<your-server-name>:9000</value>
</property>
hdfs-site.xml:
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>namenode2:50070</value>
</property>
mapred-site.xml:
<property>
<name>mapred.local.dir</name>
<value>/opt/hadoop/hdfs/mapred/local</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value><your-server-name>:54311</value>
</property>
Describe your cluster. Here we have to edit two files, masters and slaves:
# cat /etc/hadoop-0.20/conf.used/masters
<your-namenode-server>
# cat /etc/hadoop-0.20/conf.used/slaves
datanode1
datanode2
datanode3
datanode4
datanode5
datanode6
Set up the secondary namenode on the second master (namenode2):
# yum install hadoop-0.20-secondarynamenode -y
Format the namenode:
# sudo -u hdfs hadoop namenode -format
You have to wait a moment; simply watch the logs (tail -f /var/log/hadoop-0.20/*.log), then start the services:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done
Create the mapred dirs in HDFS:
# sudo -u hdfs hadoop fs -mkdir /mapred/system
# sudo -u hdfs hadoop fs -chown -R mapred /mapred
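A quick sanity check that HDFS answers and the directory is owned by mapred:
# sudo -u hdfs hadoop fs -ls /mapred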
Log in to the secondary namenode and restart the services there as well:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done
Watch the logs for errors.
Now let us install the datanodes and tasktrackers. Before you run the script, make sure all nodes are reachable via DNS, the SSH keys are deployed and the hosts are known to the SSH subsystem.
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'yum install hadoop-0.20-datanode.noarch hadoop-0.20-native.x86_64 hadoop-0.20-tasktracker.noarch -y && \
mkdir -p /opt/hadoop/hdfs/data && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred'; done
Copy the config:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do scp -r /etc/hadoop-0.20/conf.used $i:/etc/hadoop-0.20/ ; done
Activate it via alternatives:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); \
do ssh $i alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.used 50; done
Now it is time to get the cluster running for the first time:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x restart; done'; done
Again, watch the logs for errors; you can also check all nodes with:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do echo $i && ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x status; done'; done
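Once the datanodes are up they register with the namenode; a report run on the master shows how many are live and how much capacity the cluster offers:
# sudo -u hdfs hadoop dfsadmin -report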
If all goes well you can see your running cluster via the web interface:
http://namenode:50070/dfshealth.jsp
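The jobtracker has its own web UI on its default port 50030:
http://namenode:50030/jobtracker.jsp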
Another good idea is to put your cluster nodes into /etc/hosts on every box:
...
# hadoop nodes
IP datanode1.FQDN datanode1
IP datanode2.FQDN datanode2
IP datanode3.FQDN datanode3
IP datanode4.FQDN datanode4
IP datanode5.FQDN datanode5
IP datanode6.FQDN datanode6
#hadoop master
IP namenode1.FQDN namenode1
#hadoop secondary
IP namenode2.FQDN namenode2
Enjoy!
created: 25.November 2010