Oracle Big Data Cloud Service CE: Working with Hive, Spark and Zeppelin 0.7

In my previous post, I mentioned that Oracle Big Data Cloud Service – Compute Edition started to come with Zeppelin 0.7 and the version 0.7 does not have HIVE interpreter. It means we won’t be able to use “%hive” blocks to run queries for Apache Hive. Instead of “%hive” blocks, we can use JDBC interpreter (“%jdbc” blocks) or Spark SQL (“%sql” blocks).

The JDBC interpreter lets you create a JDBC connection to any data source. It has been tested with both popular RDBMS and NoSQL databases such as Postgres, MySQL, Amazon Redshift, Apache Hive. To be able to connect a data source, we first need to define it on Zeppelin interpreter settings. In normal conditions, we access Zeppelin trough Big Data Cloud – Compute Edition Console, and it prevents us to see the menu to reach the interpreter settings but we can easily bypass the console with a little trick. After we opened a notebook at the console, get the URL we connected, remove “?#notebook/XXXXX” part from the URL, and add “/zeppelinui/”, so our URL should be like this “https://bigdataconsoleip:1080/zeppelinui/”. This is the address we can access Zeppelin’s native user interface.

In this page, we can use the drop-down menu on the upper-right to access the interpreters page. We can search the interpreters, edit the settings and then restart the interpreter. For now, we don’t need to change anything. Hive is already defined in our Cloud Service so we can use JDBC interpreter to connect Hive.

Oracle BDCSCE Upgraded: Zeppelin 0.7 and Spark 2.1

Last week, Oracle Big Data Cloud Service – Compute Edition was upgraded from 17.2.5 to 17.3.1-20. I do not know if the new version is still in testing phase and available to only trial users, but sooner or later the new version will be available to all Oracle Cloud users.

The new version is still based on HDP 2.4.2 but it contains upgrades on two key components: Zeppelin and Spark. Users can now select which Spark version they will use (version 2.1 or version 1.6) when creating the service, and Zeppelin 0.7 installed instead of Zeppelin 0.6. Both of them are important changes.

Introduction to Oracle Big Data Cloud Service – Compute Edition (Part VI) – Hive

I though I would stop writing about “Oracle Big Data Cloud Service – Compute Edition” after my fifth blog post, but then I noticed that I didn’t mention about the Apache Hive, another important component of the Big Data. Hive is a data warehouse infrastructure built on top of Hadoop, designed to work with large datasets. Why is it so important? Because it includes support for SQL (SQL:2003 and SQL:2011), and helps users to utilize existing SQL skillsets to quickly derive value from big data.

Although new improvements of Hive project enables sub-second query retrieval (Hive LLAP) but it’s not designed for online transaction processing (OLTP) workloads. Hive is best used for traditional data warehousing tasks.

In this blog post, I’ll demonstrate how we can import data from CSV files into hive tables, and run SQL queries to analyze the date stored in these tables.

Introduction to Oracle Big Data Cloud Service – Compute Edition (Part V) – Pig

This is my fifth blog post of my introduction series for Oracle Big Data Cloud Service – Compute Edition. In this blog post, I’ll mention “Apache Pig”. It’s a tool/platform created by “Yahoo!” to analyze large data sets without the complexities of writing a traditional MapReduce program. It’s designed to process any kind of data (structured or unstructured) so it’s a great tool for ETL jobs. Pig comes installed and ready to use with “Oracle Big Data Cloud Service – Compute Edition”. In this blog post, I’ll show how we can write use pig to read, parse and analyze data.

Pig has a high-level SQL-like programming language called Pig Latin. We need to learn basics of this language to be able to use Pig. Each statement in a Pig script, is processed by the Pig interpreter to build a logical plan which will be used to procedure MapReduce jobs. The steps in the logical plan are not “executed” until a DUMP or STORE statement is used.

Pig scripts have generally the following structure:

  1. Data is read by using LOAD statements.
  2. Data is transformed/processed.
  3. The result is dumped (to screen) or stored to a file (or a Hive table).

Introduction to Oracle Big Data Cloud Service – Compute Edition (Part IV) – Zeppelin

This is my forth blog post about Oracle Big Data Cloud Service – Compute Edition. In my previous blog posts, I showed how we can create a big data cloud service compute edition on Oracle Cloud, which services are installed by default, ambari management service and now it’s time to write about how we can work with data using Apache Zeppelin. Apache Zeppelin is a web-based notebook that enables interactive data analytics. Zeppelin is not the only way to work data but it’s surely very friendly for end-users and (as I said before) it’s already installed to our big data cloud service compute edition.

We can create a rule to allow access to TCP port 9995 for accessing Zeppelin directly, or we can use “big data console” provided by Oracle. I’ll prefer the second one because our Ngix proxy will let only authenticated users to access Zeppelin.

After you reach the console, go to the notebooks page. Click “new note”, enter a name and then click “OK” – this will create a new empty notebook, and you’ll start editing it. My new notebook’s name is “MyFirstNote”. As you can see there are some sample notebooks, you can examine them to learn how you can use java and spark with Zeppelin.