Apache Tez on CDH 5.4.x

Since Cloudera doesn't support Tez in its distribution right now (but I'm pretty confident it will come), we experimented a bit with Apache Tez and CDH 5.4.

Using Tez with CDH isn't hard, and it works quite well: our ETL and Hive jobs finished around 30-50% faster.

Anyway, here's the blueprint. We use CentOS 6.7 with the EPEL repository.


1. Install Maven 3.2.5
wget http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
tar xvfz apache-maven-3.2.5-bin.tar.gz -C /usr/local/
cd /usr/local/
ln -s apache-maven-3.2.5 maven

=> In my case, compiling Tez (the protobuf step included) worked only with Maven 3.2.5

1.1 Install JDK 8u40
mkdir development && cd development (that's my dev root)

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u40-b26/jdk-8u40-linux-x64.tar.gz"
tar xvfz jdk-8u40-linux-x64.tar.gz
export JAVA_HOME=/home/alo.alt/development/jdk1.8.0_40
export JRE_HOME=${JAVA_HOME}/jre
export PATH=${PATH}:${JAVA_HOME}/bin:${JRE_HOME}/bin

2. Create a maven profile.d file
vi /etc/profile.d/maven.sh
export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}
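
A quick sanity check that Maven and the JDK are wired up (mvn also reports the Java version it runs on):

source /etc/profile.d/maven.sh
mvn -version
# should print Apache Maven 3.2.5 and, with JAVA_HOME set, Java version 1.8.0_40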

3. Get Tez
git clone https://github.com/apache/tez.git
cd tez
git checkout tags/release-0.7.0
git checkout -b tristan

Modify pom.xml to build against hadoop 2.6.0-cdh5.4.2 by adding a Cloudera profile:

<profile>
  <id>cdh5.4.2</id>
  <activation>
    <activeByDefault>false</activeByDefault>
  </activation>
  <properties>
    <hadoop.version>2.6.0-cdh5.4.2</hadoop.version>
  </properties>
  <pluginRepositories>
    <pluginRepository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </pluginRepository>
  </pluginRepositories>
  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
</profile>

And apply the patch from https://gist.github.com/killerwhile/23225004a78949d4c849#file-gistfile1-diff
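
One way to apply it (assuming the gist's raw URL follows GitHub's usual pattern; the local filename is arbitrary):

wget https://gist.githubusercontent.com/killerwhile/23225004a78949d4c849/raw/gistfile1.diff -O cdh-tez.diff
git apply --check cdh-tez.diff && git apply cdh-tez.diff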

4. Install protobuf
sudo yum -y install gcc-c++ openssl-devel glibc
wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2
tar xfvj protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0/
./configure && make && make check
sudo make install && sudo ldconfig && protoc --version

Or use the precompiled RPMs:
ftp://ftp.pbone.net/mirror/ftp5.gwdg.de/pub/opensuse/repositories/home:/kalyaka/CentOS_CentOS-6/x86_64/protobuf-2.5.0-16.1.x86_64.rpm 
ftp://ftp.pbone.net/mirror/ftp5.gwdg.de/pub/opensuse/repositories/home:/kalyaka/CentOS_CentOS-6/x86_64/protobuf-compiler-2.5.0-16.1.x86_64.rpm
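
If you go the RPM route, download the two packages and install them together so yum resolves the dependency between them (a minimal sketch):

wget ftp://ftp.pbone.net/mirror/ftp5.gwdg.de/pub/opensuse/repositories/home:/kalyaka/CentOS_CentOS-6/x86_64/protobuf-2.5.0-16.1.x86_64.rpm
wget ftp://ftp.pbone.net/mirror/ftp5.gwdg.de/pub/opensuse/repositories/home:/kalyaka/CentOS_CentOS-6/x86_64/protobuf-compiler-2.5.0-16.1.x86_64.rpm
sudo yum -y localinstall protobuf-2.5.0-16.1.x86_64.rpm protobuf-compiler-2.5.0-16.1.x86_64.rpm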

5. Build Tez against CDH 5.4.2
mvn -Pcdh5.4.2 clean package -Dtar -DskipTests=true -Dmaven.javadoc.skip=true
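
If the build succeeds, the tarball used in the next step lands in tez-dist/target (the exact listing can vary with the build flags):

ls -lh tez/tez-dist/target/
# expect tez-0.7.0.tar.gz, the archive we upload to HDFS below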

6. Install Tez
hadoop dfs -mkdir -p /apps/tez && hadoop dfs -copyFromLocal tez/tez-dist/target/tez-0.7.0.tar.gz /apps/tez/tez-0.7.0.tar.gz

sudo mkdir -p /apps/tez && tar xvfz tez/tez-dist/target/tez-0.7.0.tar.gz -C /apps/tez/

6.1 Create a tez-site.xml in /apps/tez/conf/
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.default.name}/apps/tez/tez-0.7.0.tar.gz</value>
  </property>
</configuration>
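
${fs.default.name} resolves to the cluster's default filesystem URI, so tez.lib.uris ends up pointing at the tarball uploaded in step 6. A quick check (the namenode address is just an example):

hadoop fs -ls /apps/tez/tez-0.7.0.tar.gz
# with fs.default.name = hdfs://nn01:8020 (hypothetical), tez.lib.uris
# resolves to hdfs://nn01:8020/apps/tez/tez-0.7.0.tar.gz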

7. Run Tez on YARN
export TEZ_HOME=/apps/tez
export TEZ_CONF_DIR=${TEZ_HOME}/conf
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${TEZ_CONF_DIR}:$(find ${TEZ_HOME} -name "*.jar" | paste -sd ":")"
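
With those variables exported, one of the examples bundled in the extracted dist makes a quick smoke test (directory names are placeholders; the input dir must exist in HDFS and contain some text files):

hadoop jar ${TEZ_HOME}/tez-examples-0.7.0.jar orderedwordcount /tmp/wc-in /tmp/wc-out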

hive> set hive.execution.engine=tez;
hive> SELECT s07.description, s07.salary, s08.salary, s08.salary - s07.salary FROM sample_07 s07 JOIN sample_08 s08 ON ( s07.code = s08.code) WHERE s07.salary < s08.salary ORDER BY s08.salary-s07.salary DESC LIMIT 1000;

beeline --hiveconf tez.task.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_HADOOP_COMMON_HOME/lib/native" \
--hiveconf tez.am.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_HADOOP_COMMON_HOME/lib/native"
Check that the lib*.so files are available in the native folder (or point LD_LIBRARY_PATH to the folder that contains them).
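
To see what's actually there (the path below assumes a standard CDH parcel layout; adjust for package installs):

ls /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
# expect libhadoop.so, libsnappy.so and friends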

Sources:
https://gist.github.com/killerwhile/23225004a78949d4c849#file-gistfile1-diff
http://tez.apache.org/install.html
