Oracle Big Data Cloud Service CE: Working with Hive, Spark and Zeppelin 0.7

In my previous post, I mentioned that Oracle Big Data Cloud Service – Compute Edition now ships with Zeppelin 0.7, and that version 0.7 does not include the Hive interpreter. This means we can no longer use “%hive” blocks to run queries against Apache Hive. Instead, we can use the JDBC interpreter (“%jdbc” blocks) or Spark SQL (“%sql” blocks).

The JDBC interpreter lets you create a JDBC connection to any data source. It has been tested with popular RDBMS and NoSQL data sources such as Postgres, MySQL, Amazon Redshift and Apache Hive. To connect to a data source, we first need to define it in the Zeppelin interpreter settings. Normally, we access Zeppelin through the Big Data Cloud – Compute Edition console, which hides the menu that leads to the interpreter settings, but we can easily bypass the console with a little trick. After opening a notebook in the console, take the URL shown in the browser, remove the “?#notebook/XXXXX” part, and add “/zeppelinui/”, so that the URL looks like “https://bigdataconsoleip:1080/zeppelinui/”. This address gives us access to Zeppelin’s native user interface.
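The URL trick can be sketched in a few lines of Python. This is only an illustration of the string manipulation described above: the function name is hypothetical, and the sample console URL is an assumption built from the placeholder host in this post.

```python
def zeppelin_native_ui_url(console_url: str) -> str:
    """Turn a console notebook URL into the native Zeppelin UI URL.

    Hypothetical helper: it drops everything from "?#notebook/" onward
    and appends "/zeppelinui/", as described in the text.
    """
    # Keep only the part of the URL that precedes the notebook fragment.
    base, _, _ = console_url.partition("?#notebook/")
    # Avoid a double slash if the base already ends with "/".
    return base.rstrip("/") + "/zeppelinui/"


# Assumed example URL; "bigdataconsoleip" and "XXXXX" are placeholders.
print(zeppelin_native_ui_url("https://bigdataconsoleip:1080/?#notebook/XXXXX"))
```

The same result can of course be obtained by editing the address bar by hand; the helper just makes the two steps explicit.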

On this page, we can use the drop-down menu in the upper-right corner to open the interpreters page. There we can search the interpreters, edit their settings and restart them. For now, we don’t need to change anything: Hive is already defined in our Cloud Service, so we can use the JDBC interpreter to connect to Hive.

Introduction to Oracle Big Data Cloud Service – Compute Edition (Part VI) – Hive

I thought I would stop writing about “Oracle Big Data Cloud Service – Compute Edition” after my fifth blog post, but then I noticed that I hadn’t mentioned Apache Hive, another important big data component. Hive is a data warehouse infrastructure built on top of Hadoop, designed to work with large datasets. Why is it so important? Because it supports SQL (SQL:2003 and SQL:2011) and lets users apply their existing SQL skills to quickly derive value from big data.

Although recent improvements to the Hive project enable sub-second query retrieval (Hive LLAP), Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.

In this blog post, I’ll demonstrate how we can import data from CSV files into Hive tables, and run SQL queries to analyze the data stored in these tables.

Introduction to Oracle Big Data Cloud Service – Compute Edition (Part III) – Ambari

This is my third blog post about Oracle Big Data Cloud Service – Compute Edition, in which I continue to walk you through the service and its components. In this post, I will introduce Ambari, the management service of our Hadoop cluster.

Apache Ambari simplifies provisioning, managing, and monitoring Apache Hadoop clusters. It’s the default management tool of Hortonworks Data Platform, but it can also be used independently of Hortonworks. After you create your big data service, SSH and port 8080 (used by Ambari) are blocked. You need to enable the access rules that allow traffic through these ports. In my first blog post about Oracle Big Data Cloud Service – Compute Edition, I showed how to enable them.
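Once the access rules are enabled, a quick way to confirm that the ports are reachable is a plain TCP connection test. The sketch below uses only Python's standard library. The function name and the host name "bigdataconsoleip" are placeholders, and it assumes SSH listens on its default port 22.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; closing it right
        # away is enough to know that the port is reachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Placeholder host; replace it with the address of your own cluster.
    for port in (22, 8080):  # 22: SSH (assumed default), 8080: Ambari
        state = "open" if is_port_open("bigdataconsoleip", port) else "blocked or unreachable"
        print(f"port {port}: {state}")
```

A successful connection only shows that the port accepts TCP traffic; it does not verify that Ambari itself is healthy.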