Query an HBase table through Hive using PySpark on EMR
In this blog post, I'll demonstrate how to access an HBase table through Hive from a PySpark script/job on an AWS EMR cluster. First, I created an EMR cluster (EMR 5.27.0, Hive 2.3.5, HBase 1.4.0). Then I connected to the master node, ran "hbase shell", created an HBase table, and inserted a sample row:
create 'mytable', 'f1'
put 'mytable', 'row1', 'f1:name', 'Gokhan'
I logged in to Hive and created a Hive table that points to the HBase table:
CREATE EXTERNAL TABLE myhivetable (rowkey STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f1:name')
TBLPROPERTIES ('hbase.table.name' = 'mytable');
When I tried to access the table using spark.table('myhivetable'), I got an error saying that the org.apache.hadoop.hive.hbase.HBaseStorageHandler class was not found. I tried the "--packages" parameter to fetch the required JAR from the Maven repository; it downloaded a lot of missing JARs, but it did not work. So I downloaded the required JAR file using wget and copied it to Spark's JAR directory:
wget https://repo1.maven.org/maven2/org/apache/hive/hive-hbase-handler/2.3.5/hive-hbase-handler-2.3.5.jar
sudo cp hive-hbase-handler-2.3.5.jar /usr/lib/spark/jars/