Query an HBase table through Hive using PySpark on EMR

In this blog post, I’ll demonstrate how to access an HBase table through Hive from a PySpark script/job on an AWS EMR cluster. First, I created an EMR cluster (EMR 5.27.0, Hive 2.3.5, HBase 1.4.0). Then I connected to the master node, ran “hbase shell”, created an HBase table, and inserted a sample row.
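As a rough sketch of this step, the Python snippet below builds the HBase shell commands and can optionally pipe them into `hbase shell`. The table name `mytable`, the column family `cf`, and the sample row values are hypothetical placeholders, and the helper assumes the `hbase` command is on the PATH of the master node.

```python
import subprocess


def hbase_sample_script(table="mytable", family="cf"):
    """Return HBase shell commands that create a table and insert one row.

    The table, column family, and row values are hypothetical placeholders.
    """
    lines = [
        f"create '{table}', '{family}'",
        f"put '{table}', 'row1', '{family}:name', 'sample value'",
        f"scan '{table}'",
        "exit",
    ]
    return "\n".join(lines) + "\n"


def run_in_hbase_shell(script):
    """Pipe the script into `hbase shell` (assumed to be on PATH)."""
    return subprocess.run(["hbase", "shell"], input=script, text=True, check=True)


# Print the commands; call run_in_hbase_shell(...) on the master node to execute them.
print(hbase_sample_script())
```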

I logged in to Hive and created a Hive table that points to the HBase table.
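A Hive table backed by HBase is declared with the HBaseStorageHandler and a column mapping. The helper below is a minimal sketch that builds such a statement. The Hive table name `myhivetable` comes from this post, while the HBase table name, the column family, and the two columns are assumptions made for illustration.

```python
def hive_hbase_ddl(hive_table="myhivetable", hbase_table="mytable", family="cf"):
    """Build a CREATE EXTERNAL TABLE statement backed by HBaseStorageHandler.

    The HBase table name, column family, and columns are hypothetical.
    """
    return (
        f"CREATE EXTERNAL TABLE {hive_table} (rowkey STRING, name STRING)\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        # ':key' maps the first Hive column to the HBase row key.
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,{family}:name')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


# Paste the printed statement into the Hive CLI.
print(hive_hbase_ddl())
```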

When I tried to access the table using spark.table('myhivetable'), I got an error indicating that the org.apache.hadoop.hive.hbase.HBaseStorageHandler class was not found. I tried the “--packages” parameter to pull the required JAR from the Maven repository. It downloaded a lot of missing JARs, but it still did not work. So I downloaded the required JAR file using wget and copied it to Spark’s JAR directory.
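The download step can be approximated in Python instead of wget. This is a sketch under assumptions: it assumes the handler JAR is the `hive-hbase-handler` artifact on Maven Central in the version matching Hive (2.3.5 here), and that Spark’s JAR directory on EMR is `/usr/lib/spark/jars`. Check both on your own cluster.

```python
import os
import urllib.request

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def handler_jar_url(hive_version="2.3.5"):
    """Return the Maven Central URL of the hive-hbase-handler JAR."""
    name = f"hive-hbase-handler-{hive_version}.jar"
    return f"{MAVEN_CENTRAL}/org/apache/hive/hive-hbase-handler/{hive_version}/{name}"


def download_jar(url, dest_dir="/usr/lib/spark/jars"):
    """Download the JAR into dest_dir (assumed Spark JAR directory).

    Writing there usually requires root privileges.
    """
    target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, target)
    return target


# Only the URL is printed here; call download_jar(...) on the master node.
print(handler_jar_url())
```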

I noticed that it also requires some HBase JAR files, so I copied them into Spark’s JAR directory.
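Copying the HBase JARs can be sketched with a small helper. The source and target directories in the commented call (`/usr/lib/hbase` and `/usr/lib/spark/jars`) and the `hbase-*.jar` pattern are assumptions about a typical EMR layout, not a list of the exact files that are required.

```python
import glob
import os
import shutil


def copy_jars(src_dir, dst_dir, pattern="*.jar"):
    """Copy files matching pattern from src_dir to dst_dir; return their names."""
    copied = []
    for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        shutil.copy(path, dst_dir)
        copied.append(os.path.basename(path))
    return copied


# Assumed EMR locations; typically needs root privileges:
# copy_jars("/usr/lib/hbase", "/usr/lib/spark/jars", pattern="hbase-*.jar")
```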

After that, I queried the table again, and it worked.
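A minimal PySpark job for this query could look like the sketch below. It uses `spark.table('myhivetable')` as mentioned above. The application name is a hypothetical placeholder, and the session needs Hive support enabled so that it can see tables in the Hive metastore.

```python
def read_hive_table(spark, table_name="myhivetable"):
    """Return the Hive table (backed by HBase) as a Spark DataFrame."""
    return spark.table(table_name)


def main():
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-hbase-demo")  # hypothetical application name
        .enableHiveSupport()         # needed to see Hive metastore tables
        .getOrCreate()
    )
    read_hive_table(spark).show()


# Call main() from a PySpark job on the cluster, for example via spark-submit.
```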

I tested the same method on an earlier EMR version (5.12.x) and saw that it failed because the Spark executors tried to connect to local ZooKeeper instances (which do not exist on core/task nodes). If you get such an error, you can set “hbase.zookeeper.quorum” to your master node’s IP address (where ZooKeeper runs).
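One possible way to pass this setting from PySpark is the `spark.hadoop.` prefix, which Spark copies into the Hadoop configuration. The sketch below assumes the HBase storage handler picks the value up from there; the setting can be applied in other ways too, so treat this as one option to try rather than the definitive fix.

```python
def quorum_conf(master_ip):
    """Return Spark settings that forward hbase.zookeeper.quorum to Hadoop."""
    return {"spark.hadoop.hbase.zookeeper.quorum": master_ip}


def build_session(master_ip):
    """Create a Hive-enabled SparkSession with the ZooKeeper quorum set."""
    # Imported here so quorum_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.enableHiveSupport()
    for key, value in quorum_conf(master_ip).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example with the master node's private IP:
# spark = build_session("172.31.41.174")
```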

The IP (172.31.41.174) is the private IP of my master node. You can find the private IP of your master node in the EMR console: it’s shown in the master instance info on the Hardware page. You can also connect to the master node and look it up from the command line.
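If you prefer to look the address up from a script, the sketch below resolves the local hostname to an IP address. It assumes that the master node’s hostname resolves to its private IP, which is common on EC2 but worth verifying against the EMR console.

```python
import ipaddress
import socket


def private_ip():
    """Best-effort lookup of this host's IP address via its hostname."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        # Fallback when the hostname cannot be resolved.
        return "127.0.0.1"


ip = private_ip()
# is_private tells you whether the address is in a private range.
print(ip, ipaddress.ip_address(ip).is_private)
```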

I hope it helps. Please do not hesitate to let me know if you have any questions.

AWS Big Data Specialist. Oracle Certified Professional (OCP) for EBS R12, Oracle 10g and 11g. Co-author of "Expert Oracle Enterprise Manager 12c" book published by Apress. Awarded as Oracle ACE (in 2011) and Oracle ACE Director (in 2016) for the continuous contributions to the Oracle users community. Founding member, and vice president of Turkish Oracle User Group (TROUG). Presented at various international conferences including Oracle Open World.
