PySpark Examples #1: Grouping Data from CSV File (Using RDDs)

During my presentation about “Spark with Python”, I said that I would share example code (with detailed explanations). So this is my first example. In this code, I read data from a CSV file to create a Spark RDD (Resilient Distributed Dataset). RDDs are the core data structures of Spark. I explained the features of RDDs in my presentation, so in this blog post, I will focus only on the example code.

For this sample code, I use the “u.user” file of the MovieLens 100K dataset. I renamed it to “users.csv”, but you can use it with its current name if you want.
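To show the basic idea up front, here is a minimal sketch of this kind of RDD code. It assumes the file has been uploaded to /tmp/users.csv (a hypothetical path, adjust it to your environment) and that it keeps the original pipe-delimited format of u.user (user id | age | gender | occupation | zip code):

from pyspark import SparkContext

# In Zeppelin or the pyspark shell, "sc" already exists; this line is only needed
# when running the script standalone.
sc = SparkContext(appName="GroupUsersByOccupation")

# Read the file as an RDD of lines and split each line on the "|" delimiter
lines = sc.textFile("/tmp/users.csv")
users = lines.map(lambda line: line.split("|"))

# Group users by occupation: build (occupation, 1) pairs and reduce by key
occupation_counts = users.map(lambda fields: (fields[3], 1)) \
                         .reduceByKey(lambda a, b: a + b)

print(occupation_counts.collect())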

Introduction to Apache Spark with Python

Today, I spoke about “Apache Spark with Python” at the Big Talk #2 meet-up in Istanbul Teknokent ARI-3, another event organized by Komtas for the big data community. We had an almost full room. Mine was the last session of the day, but the audience was still very focused and eager to listen, so for me, the event was great.

By the way, I also enjoyed the sessions of the other speakers: Zekeriya Beşioğlu spoke about data lakes and Kylo (open source data lake management software). I’ll surely test that software as soon as possible. After Zekeriya, İsmail Parsa spoke about data science in retail systems. I was very impressed by his knowledge, and happy to have the chance to join his session.

Using Spark to Process Data From Cassandra for Analytics

After my presentation about Apache Cassandra, many people asked whether they could run analytical queries on Cassandra, and how they could integrate Spark with Cassandra. So I decided to write a blog post to demonstrate how we can process data from Cassandra using Spark. In this blog post, I’ll show how to build a test environment on Oracle Cloud (Spark + Cassandra), load sample data into Cassandra, and query the data using Spark.
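To give an idea of what the query side will look like, here is a minimal sketch, assuming the spark-cassandra-connector package is available to Spark and that the keyspace, table, and column names below (“test”, “users”, “country”) are hypothetical placeholders:

from pyspark.sql import SparkSession

# Build a Spark session that knows how to reach the Cassandra cluster
# (in Zeppelin, the "spark" session already exists; the connector package must be
# added separately, e.g. via --packages for spark-cassandra-connector)
spark = SparkSession.builder \
    .appName("CassandraAnalytics") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()

# Read a Cassandra table into a Spark DataFrame
users = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="users", keyspace="test") \
    .load()

# Run an analytical query that would be hard to express in plain CQL
users.groupBy("country").count().show()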

Let me first create an Oracle Big Data Cloud instance. Instead of installing Spark manually, I’ll use the Big Data Cloud service so I’ll have both Spark and Zeppelin. Zeppelin is a web-based notebook for interactive data analytics. I’ll use Zeppelin to run Spark scripts and queries.

I log in to Oracle Cloud and start creating a Big Data Cloud service. I select “Basic” as the deployment profile because I do not need Hive, I want only one node (for testing), and I select 2.1 as the Spark version. After the service is created, I go to “Access Rules” and enable ora_p2bdcsce_ssh because I will need to connect to my server through SSH.

Using Spark to Join Data from a CSV File and a MySQL Table

Yesterday, I explained how we can access a MySQL database from Zeppelin, which comes with Oracle Big Data Cloud Service Compute Edition (BDCSCE). Although we can use Zeppelin to access MySQL, we still need something more powerful to combine data from two different sources (for example, data from a CSV file and RDBMS tables). Spark is a great choice for processing data. In this blog post, I’ll write a simple PySpark (Python for Spark) script that reads from MySQL and CSV, joins the data, and writes the output back to MySQL. Please keep in mind that I use Oracle BDCSCE, which supports Spark 2.1, so I tested my code only on Spark 2.1 in the Zeppelin environment. I expect you to run all these steps in the same environment; otherwise, you may need to modify paths and code.
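Before going step by step, here is a minimal sketch of the overall flow, assuming Zeppelin’s Spark 2.1 session (spark), a hypothetical MySQL database “airline” with a “carriers” table, a CSV file at /tmp/flights_2008.csv, and placeholder credentials and column names; the MySQL JDBC driver must also be available to Spark:

# Read a MySQL table over JDBC (URL, table name and credentials are placeholders)
jdbc_url = "jdbc:mysql://localhost:3306/airline"
carriers = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "carriers") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .load()

# Read the CSV file into a DataFrame, letting Spark infer the schema
flights = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/tmp/flights_2008.csv")

# Join the two sources on the carrier code (column names are placeholders)
joined = flights.join(carriers, flights.UniqueCarrier == carriers.code)

# Write the combined result back to MySQL
joined.write.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "flights_with_carriers") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .mode("overwrite") \
    .save()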

For my sample script, I’ll use the flight information from 2008. If you have read my blog post series about BDCSCE, you should be familiar with it. Anyway, do not worry about the data structure; I use only a few columns of the data, and you can get more information about it on the statistical computing website.

First, I’ll create a table in MySQL to store the most active carriers of 2008. I have already set up Zeppelin to access my MySQL database, so I create a new paragraph, put the following SQL commands in it, and run them.

Oracle Big Data Cloud Service CE: Working with Hive, Spark and Zeppelin 0.7

In my previous post, I mentioned that Oracle Big Data Cloud Service – Compute Edition now comes with Zeppelin 0.7, and that version 0.7 does not have a Hive interpreter. This means we can no longer use “%hive” blocks to run queries against Apache Hive. Instead of “%hive” blocks, we can use the JDBC interpreter (“%jdbc” blocks) or Spark SQL (“%sql” blocks).
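For example, the same Hive tables can also be reached from a PySpark paragraph through the Spark SQL interface. Here is a minimal sketch, assuming the Spark session is configured with Hive support and that “flights” is a hypothetical placeholder for an existing Hive table:

# List the Hive tables visible to Spark SQL
spark.sql("SHOW TABLES").show()

# Run a query against a Hive table instead of using the removed %hive interpreter
spark.sql("SELECT uniquecarrier, COUNT(*) AS flight_count "
          "FROM flights GROUP BY uniquecarrier "
          "ORDER BY flight_count DESC LIMIT 10").show()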

The JDBC interpreter lets you create a JDBC connection to any data source. It has been tested with popular RDBMS and NoSQL databases such as Postgres, MySQL, Amazon Redshift, and Apache Hive. To connect to a data source, we first need to define it in Zeppelin’s interpreter settings. Normally, we access Zeppelin through the Big Data Cloud – Compute Edition console, which hides the menu for reaching the interpreter settings, but we can easily bypass the console with a little trick. After opening a notebook in the console, take the URL we connected to, remove the “?#notebook/XXXXX” part, and add “/zeppelinui/”, so the URL looks like “https://bigdataconsoleip:1080/zeppelinui/”. This is the address where we can access Zeppelin’s native user interface.

On this page, we can use the drop-down menu in the upper right to access the interpreters page. We can search the interpreters, edit their settings, and then restart them. For now, we don’t need to change anything: Hive is already defined in our cloud service, so we can use the JDBC interpreter to connect to Hive.