
Get a Hadoop cluster running in 20 minutes


Since Cloudera started building a complete stack, it has become much easier to get a cluster up and running. They build RPMs for Red Hat based systems (I use CentOS); the newest CDH build is CDH2, which comes with a lot of nice addons. Thanks for the work, guys!

What hardware should we use? Simple, moderate servers are fine: 2 CPUs, 8 GB RAM and 4x750 GB HDD per node will be enough. The master should have 4 CPUs, 8 or more GB RAM and 100 GB on RAID 5, and we need that hardware twice. Why? The master is the most important server, hosting the namenode, the jobtracker and, if you set it up, Ganglia. The hardware you use depends on your use cases; what I describe here works for event analysis and weblogs, but it will be a good start to see what power Hadoop can deliver. The systems are located in the same rack, so rack awareness can stay in the background for now. Use LVM and create one large volume; I mount it at /opt/hadoop:


# df -h /opt/hadoop/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/myvg-hadoopvol
                      1.4T  213G  1.1T  17% /opt/hadoop
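If you are building that volume from scratch, a minimal LVM sketch could look like the following (assuming four data disks sdb through sde and the volume group/volume names from the df output above; adjust to your hardware):

pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
vgcreate myvg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -l 100%FREE -n hadoopvol myvg
mkfs.ext3 /dev/myvg/hadoopvol
mkdir -p /opt/hadoop
mount /dev/myvg/hadoopvol /opt/hadoop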


Important: Grab the JDK RPM (1.6.x) from java.sun.com and install it.
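Assuming you downloaded a plain RPM (the file name below is only an illustration and depends on the exact version you picked), the install is a one-liner:

# rpm -Uvh jdk-1.6.0_*-linux-amd64.rpm
# java -version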

Hint: It is a good idea to deploy an SSH key to all nodes so we can authenticate without a password. Copy id_rsa.pub from the master node into /root/.ssh/authorized_keys on every slave.
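A short sketch for the key distribution from the master (it assumes root logins are allowed; the host names are the slaves used throughout this post):

# ssh-keygen -t rsa
# for i in datanode1 datanode2 datanode3 datanode4 datanode5 datanode6; do ssh-copy-id -i /root/.ssh/id_rsa.pub root@$i; done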

Once all servers are up, first get the repo working on your boxes. Simply add a file in /etc/yum.repos.d/:

# cat cloudera-cdh2.repo
[cloudera-cdh2]
name=Cloudera's Distribution for Hadoop, Version 2
mirrorlist=http://archive.cloudera.com/redhat/cdh/2/mirrors
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
enabled=1

Deploy that file to all nodes.
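For example with a small loop over the other nodes (a sketch; it relies on the passwordless SSH set up above):

# for i in namenode2 datanode1 datanode2 datanode3 datanode4 datanode5 datanode6; do scp /etc/yum.repos.d/cloudera-cdh2.repo $i:/etc/yum.repos.d/; done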

Now we raise the file-handle limits a bit:
# cat /etc/security/limits.conf
 hdfs            soft     nofile         5000
 hdfs            hard     nofile         5000
 mapred          hard     nofile         5000
 mapred          soft     nofile         5000
 hadoop          hard     nofile         5000
 hadoop          soft     nofile         5000

# cat /etc/sysctl.conf
 fs.file-max=200000
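The sysctl change can be applied without a reboot; the nofile limits take effect on the next login of the hdfs, mapred and hadoop users. To reload and verify:

# sysctl -p
# sysctl fs.file-max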

Set up the namenode and jobtracker on the master box (the name dir matches dfs.name.dir in the config below):
# yum install hadoop-0.20-pipes hadoop-0.20-native hadoop-0.20-jobtracker hadoop-0.20 hadoop-zookeeper hadoop-0.20-namenode -y && \
  mkdir -p /opt/hadoop/hdfs/name && mkdir -p /opt/hadoop/hdfs/mapred/local && \
  chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred

To protect the configs from unwanted updates, we use alternatives here:
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.used
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.used 50
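You can check which config directory is currently active at any time:

# alternatives --display hadoop-0.20-conf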

I use port 9000 for HDFS, but that depends on your environment.

Disable IPv6:
Edit hadoop-env.sh in the active config directory (conf.used) and add:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Edit the main config files:

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://<your-server-name>:9000</value>
</property>

hdfs-site.xml:
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/opt/hadoop/hdfs/name</value>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>namenode2:50070</value>
</property>

mapred-site.xml:
<property>
  <name>mapred.local.dir</name>
  <value>/opt/hadoop/hdfs/mapred/local</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value><your-server-name>:54311</value>
</property>

Describe your cluster. Here we have to edit two files, masters and slaves:
# cat /etc/hadoop-0.20/conf.used/masters
<your-namenode-server>

# cat /etc/hadoop-0.20/conf.used/slaves
datanode1
datanode2
datanode3
datanode4
datanode5
datanode6

Set up the secondary namenode (on the second master box, namenode2):
# yum install hadoop-0.20-secondarynamenode -y

Format the namenode:
# sudo -u hdfs hadoop namenode -format
You have to wait a moment; simply watch the logs (tail -f /var/log/hadoop-0.20/*.log).

Then start the daemons:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done

Create the mapred dirs in HDFS:
# sudo -u hdfs hadoop fs -mkdir /mapred/system
# sudo -u hdfs hadoop fs -chown -R mapred /mapred
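A quick check that the system directory exists and is owned by mapred:

# sudo -u hdfs hadoop fs -ls /mapred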

Log in to the secondary namenode and restart the services there as well:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done

Watch the logs for errors.

Now let us install the datanodes and tasktrackers. Before you run the script, make sure all nodes are resolvable via DNS, the SSH keys are deployed and the hosts are known to the SSH subsystem.
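A quick sanity check for all three points at once (it fails fast if DNS, host keys or the passwordless login are missing):

# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh -o BatchMode=yes $i hostname; done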

# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'yum install hadoop-0.20-datanode.noarch hadoop-0.20-native.x86_64 hadoop-0.20-tasktracker.noarch -y &&
mkdir -p /opt/hadoop/hdfs/data && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred'; done

Copy the config:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do scp -r /etc/hadoop-0.20/conf.used $i:/etc/hadoop-0.20/ ; done

Activate it via alternatives:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do \
  ssh $i alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.used 50; done

Now it is time to start the cluster for the first time:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x restart; done'; done

Again, watch the logs for errors; you can also check the status of all nodes with:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do echo $i && ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x status; done'; done

If all goes well, you can see your running cluster via the web UI:
http://namenode:50070/dfshealth.jsp
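You can also query the cluster state from the command line on the namenode; the report should list one live entry per datanode:

# sudo -u hdfs hadoop dfsadmin -report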

Another good idea is to edit your hosts file according to your cluster nodes:

# cat /etc/hosts
...

# hadoop nodes
IP datanode1.FQDN datanode1
IP datanode2.FQDN datanode2
IP datanode3.FQDN datanode3
IP datanode4.FQDN datanode4
IP datanode5.FQDN datanode5
IP datanode6.FQDN datanode6
#hadoop master
IP namenode1.FQDN namenode1
#hadoop secondary
IP namenode2.FQDN namenode2

Enjoy!

created: 25. November 2010
