novatechflow

Showing posts from January, 2012

Use snappy codec with Hive

[1] Snappy is a compression and decompression library, originally developed by Google and now integrated into Hadoop. Snappy is about 10% faster than LZO. The biggest differences are in the packaging: Snappy only provides a codec and has no container spec, whereas LZO has both a file-format container and a compression codec.

Snappy ships with CDH3u2 (Cloudera's Distribution) as part of the hadoop-0.20 package, and with [2] Apache Hadoop from version 0.21.0 onward.

The example I walk through here was originally created by Esteban, a Cloudera Customer Operations Engineer.

Create a sequential test file

The two fields in each row are separated by \001 (Ctrl-A), which does not print in a terminal, so the columns appear to run together in the output.

$ seq 1 1000 | awk '{OFS="\001";print $1, $1 % 10}' > test_input.hive
$ cat test_input.hive | head -5
11
22
33
44

Upload into HDFS

$ hadoop dfs -mkdir /tmp/hivetest
$ hadoop dfs -put /home/hdfs/test_input.hive /tmp/hivetest
$ hadoop dfs -ls /tmp/hivetest
Found 1 items
-rw-r--r-- 3 hdfs supergroup 5893 2012-01-19 09:58 /tmp/hivetest/test_input.hive

...
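Before moving on to Hive, it can help to sanity-check the test file locally. The following is only a minimal sketch and not part of the original example: it assumes a GNU or BSD userland (`cat -v` is not specified by POSIX) and writes to a scratch directory created with `mktemp -d`, a location chosen here purely for illustration. It recreates the file, makes the delimiter visible, and compares the byte count with the 5893 bytes reported by `hadoop dfs -ls`.

```shell
# Recreate the test file in a scratch directory (hypothetical location,
# not the /home/hdfs path used in the post).
workdir=$(mktemp -d)
seq 1 1000 | awk '{OFS="\001"; print $1, $1 % 10}' > "$workdir/test_input.hive"

# cat -v renders the non-printing Ctrl-A (\001) delimiter as ^A,
# so the first row shows up as 1^A1 instead of 11.
head -5 "$workdir/test_input.hive" | cat -v

# The local byte count should match the size shown in the HDFS listing.
size=$(wc -c < "$workdir/test_input.hive" | tr -d ' ')
echo "bytes: $size"

# Count rows that do not split into exactly two \001-separated fields.
bad_rows=$(awk 'BEGIN { FS = "\001" } NF != 2' "$workdir/test_input.hive" | wc -l | tr -d ' ')
echo "rows without exactly two fields: $bad_rows"
```

If the byte count differs from the HDFS listing, or any row is reported as malformed, the delimiter was most likely mangled by the shell quoting around the awk program.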