
Query HBase tables with Impala


As described in other blog posts, Impala uses the Hive Metastore Service to query the underlying data. In this post I use the Hive-HBase handler to connect Hive and HBase and then query the data with Impala. In the past I've written a tutorial on how to connect HBase and Hive; please follow the instructions there first.

This approach gives data scientists a wide field of work with data stored in HDFS and/or HBase. You can run queries against your stored data regardless of which technology and database you use, simply by querying the different data sources in a fast and easy way.

I use the publicly available census data gathered in 2000 by the US government. The goal is to push this data as CSV into HBase and query the resulting table with Impala. I've made a demonstration script which is available in my git repository.

Demonstration scenario

The dataset looks pretty simple:

cat DEC_00_SF3_P077_with_ann_noheader.csv


Create the HBase table:

create 'zipcode_hive', 'id', 'zip', 'desc', 'income'

and create an external table in Hive which looks as follows:

CREATE EXTERNAL TABLE ZIPCODE_HBASE (key STRING, zip STRING, desc1 STRING, desc2 STRING, income STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,zip:zip,desc:desc1,desc:desc2,income:income")
TBLPROPERTIES ("hbase.table.name" = "zipcode_hive");

Here we map the Hive table, via the HBaseStorageHandler, to the HBase schema we created in the step above.
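The mapping string is positional: each entry corresponds to one Hive column, in declaration order, and `:key` stands for the HBase row key. A minimal Python sketch of that pairing follows; the function name `parse_columns_mapping` is hypothetical and only illustrates how the string reads, it is not part of Hive.

```python
def parse_columns_mapping(mapping, hive_columns):
    """Pair each Hive column with the HBase target named in the mapping string."""
    entries = [entry.strip() for entry in mapping.split(",")]
    if len(entries) != len(hive_columns):
        raise ValueError("the mapping needs exactly one entry per Hive column")
    result = {}
    for column, entry in zip(hive_columns, entries):
        if entry == ":key":
            # The special entry ":key" maps the column to the HBase row key.
            result[column] = ("<row key>", None)
        else:
            # Every other entry has the form "family:qualifier".
            family, _, qualifier = entry.partition(":")
            result[column] = (family, qualifier)
    return result


# Values copied from the CREATE EXTERNAL TABLE statement above.
HIVE_COLUMNS = ["key", "zip", "desc1", "desc2", "income"]
MAPPING = ":key,zip:zip,desc:desc1,desc:desc2,income:income"

for column, target in parse_columns_mapping(MAPPING, HIVE_COLUMNS).items():
    print(column, "->", target)
```

Note that the 'id' column family created in HBase is not referenced by the mapping, so it stays empty.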

After these steps have finished successfully, we need to copy the CSV data into HBase. I chose Pig for this task, but you can use a staging table in Hive, too.

Here's my Pig script:

cat PopulateData.pig

copyFromLocal DEC_00_SF3_P077_with_ann_noheader.csv ziptest.csv
A = LOAD 'ziptest.csv' USING PigStorage(',') as (id:chararray, zip:chararray, desc1:chararray, desc2:chararray, income:chararray);
STORE A INTO 'hbase://zipcode_hive' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('zip:zip,desc:desc1,desc:desc2,income:income');
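HBaseStorage treats the first field of each tuple as the row key and writes the remaining fields to the listed columns in order, which is why `id` does not appear in the column list. The Python sketch below mimics that for a single line. The helper `to_hbase_cells` is hypothetical, and the sample line is reconstructed from the scan output further down, so it is not a verbatim line of the CSV file.

```python
def to_hbase_cells(csv_line, columns):
    """Split one CSV line into an HBase row key and a column -> value dict."""
    fields = csv_line.rstrip("\n").split(",")
    # First field becomes the row key; the rest are the cell values.
    row_key, values = fields[0], fields[1:]
    if len(values) != len(columns):
        raise ValueError("expected one value per HBase column")
    return row_key, dict(zip(columns, values))


# Column list as passed to HBaseStorage in the Pig script.
COLUMNS = "zip:zip,desc:desc1,desc:desc2,income:income".split(",")

# Assumption: reconstructed from the HBase scan output, not copied from the file.
SAMPLE_LINE = "8600000US00601,00601,006015-DigitZCTA,0063-DigitZCTA,11102"

row_key, cells = to_hbase_cells(SAMPLE_LINE, COLUMNS)
print("row key:", row_key)
for column, value in cells.items():
    print(column, "=", value)
```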

The job takes a few seconds, and the data is then available in HBase:

scan 'zipcode_hive', LIMIT => 2

ROW                                    COLUMN+CELL                                                                                                    
 8600000US00601                        column=desc:desc1, timestamp=1368880594523, value=006015-DigitZCTA                                             
 8600000US00601                        column=desc:desc2, timestamp=1368880594523, value=0063-DigitZCTA                                               
 8600000US00601                        column=income:income, timestamp=1368880594523, value=11102                                                     
 8600000US00601                        column=zip:zip, timestamp=1368880594523, value=00601                                                           
 8600000US00602                        column=desc:desc1, timestamp=1368880594523, value=006025-DigitZCTA                                             
 8600000US00602                        column=desc:desc2, timestamp=1368880594523, value=0063-DigitZCTA                                               
 8600000US00602                        column=income:income, timestamp=1368880594523, value=12869                                                     
 8600000US00602                        column=zip:zip, timestamp=1368880594523, value=00602 

Now we read the same data with Impala:

select * from zipcode_hbase limit 4

Using service name 'impala' for kerberos
Connected to hadoop1:21000
Server version: impalad version 1.0 RELEASE (build d1bf0d1dac339af3692ffa17a5e3fdae0aed751f)
Query: select *
from ZIPCODE_HBASE limit 4
Query finished, fetching results ...
| key            | desc1            | desc2          | income | zip   |
| 8600000US00601 | 006015-DigitZCTA | 0063-DigitZCTA | 11102  | 00601 |
| 8600000US00602 | 006025-DigitZCTA | 0063-DigitZCTA | 12869  | 00602 |
| 8600000US00603 | 006035-DigitZCTA | 0063-DigitZCTA | 12423  | 00603 |
| 8600000US00604 | 006045-DigitZCTA | 0063-DigitZCTA | 33548  | 00604 |
Returned 4 row(s) in 0.42s

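Another query, intended to get the incomes between 1,000 and 5,000 US$, sorted by income. Note that income is a STRING column, so both the BETWEEN filter and the ORDER BY compare the values lexicographically; this is why values such as 49995 appear in the result: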

select * from zipcode_hbase where income between '1000' and '5000' order by income DESC limit 20;

| key            | desc1            | desc2          | income | zip   |
| 8600000US64138 | 641385-DigitZCTA | 6413-DigitZCTA | 49995  | 64138 |
| 8600000US12477 | 124775-DigitZCTA | 1243-DigitZCTA | 49993  | 12477 |
| 8600000US33025 | 330255-DigitZCTA | 3303-DigitZCTA | 49991  | 33025 |
| 8600000US44119 | 441195-DigitZCTA | 4413-DigitZCTA | 49988  | 44119 |
| 8600000US34997 | 349975-DigitZCTA | 3493-DigitZCTA | 49982  | 34997 |
| 8600000US70665 | 706655-DigitZCTA | 7063-DigitZCTA | 49981  | 70665 |
| 8600000US28625 | 286255-DigitZCTA | 2863-DigitZCTA | 49981  | 28625 |
| 8600000US76134 | 761345-DigitZCTA | 7613-DigitZCTA | 49979  | 76134 |
| 8600000US44618 | 446185-DigitZCTA | 4463-DigitZCTA | 49978  | 44618 |
| 8600000US65714 | 657145-DigitZCTA | 6573-DigitZCTA | 49978  | 65714 |
| 8600000US77338 | 773385-DigitZCTA | 7733-DigitZCTA | 49976  | 77338 |
| 8600000US14622 | 146225-DigitZCTA | 1463-DigitZCTA | 49972  | 14622 |
| 8600000US84339 | 843395-DigitZCTA | 8433-DigitZCTA | 49972  | 84339 |
| 8600000US85020 | 850205-DigitZCTA | 8503-DigitZCTA | 49967  | 85020 |
| 8600000US64061 | 640615-DigitZCTA | 6403-DigitZCTA | 49964  | 64061 |
| 8600000US97361 | 973615-DigitZCTA | 9733-DigitZCTA | 49961  | 97361 |
| 8600000US30008 | 300085-DigitZCTA | 3003-DigitZCTA | 49960  | 30008 |
| 8600000US48634 | 486345-DigitZCTA | 4863-DigitZCTA | 49958  | 48634 |
| 8600000US47923 | 479235-DigitZCTA | 4793-DigitZCTA | 49946  | 47923 |
| 8600000US46958 | 469585-DigitZCTA | 4693-DigitZCTA | 49946  | 46958 |
Returned 20 row(s) in 1.08s
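The result contains incomes around 49,000 because the comparison runs on strings. A small Python sketch of the difference between string and numeric range checks follows; the income values 49995, 11102, 12869 and 33548 are taken from the output above, while 4200 is a hypothetical value added for illustration.

```python
def between_as_strings(value, low, high):
    """Lexicographic range check, as applied to STRING columns."""
    return low <= value <= high


def between_as_numbers(value, low, high):
    """Numeric range check after converting the strings to integers."""
    return int(low) <= int(value) <= int(high)


# 4200 is a made-up value; the others appear in the query results above.
INCOMES = ["49995", "11102", "12869", "4200", "33548"]

string_matches = [v for v in INCOMES if between_as_strings(v, "1000", "5000")]
numeric_matches = [v for v in INCOMES if between_as_numbers(v, "1000", "5000")]

print("string comparison: ", string_matches)
print("numeric comparison:", numeric_matches)
print("string sort, descending:", sorted(INCOMES, reverse=True))
```

With string comparison every sample value falls into the range, while the numeric check keeps only 4200. In Impala, the numeric behaviour could be obtained by casting the column, for example with CAST(income AS INT) in both the WHERE and the ORDER BY clause, or by declaring the column with a numeric type in the Hive table.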

