Using Spark to Process Data From Cassandra for Analytics

After my presentation about Apache Cassandra, most people asked whether they can run analytical queries on Cassandra and how they can integrate Spark with it. So I decided to write a blog post to demonstrate how we can process data from Cassandra using Spark. In this blog post, I’ll show how to build a testing environment on Oracle Cloud (Spark + Cassandra), load sample data into Cassandra, and query the data using Spark.
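As a rough preview of the last step, here is a minimal sketch of how a Cassandra table might be read from PySpark. It assumes the DataStax spark-cassandra-connector is available to Spark, and that `spark` is the session Zeppelin provides. The helper names and the keyspace and table names in the usage comment are hypothetical.

```python
# Minimal sketch, assuming the spark-cassandra-connector is on Spark's classpath.
# The helper names below are illustrative, not part of any library.

def cassandra_options(keyspace, table):
    """Build the reader options the connector expects."""
    return {"keyspace": keyspace, "table": table}


def load_cassandra_table(spark, keyspace, table):
    """Return a DataFrame backed by the given Cassandra table."""
    return (
        spark.read
        .format("org.apache.spark.sql.cassandra")  # data source name used by the connector
        .options(**cassandra_options(keyspace, table))
        .load()
    )

# Usage inside a Zeppelin PySpark paragraph (hypothetical keyspace and table):
#   df = load_cassandra_table(spark, "demo_keyspace", "demo_table")
#   df.createOrReplaceTempView("demo")
#   spark.sql("SELECT COUNT(*) FROM demo").show()
```

Registering the DataFrame as a temporary view is what lets you run ordinary SQL against the Cassandra data.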

Let me first create an Oracle Big Data Cloud instance. Instead of installing Spark manually, I’ll use the Big Data Cloud service, which gives me both Spark and Zeppelin. Zeppelin is a web-based notebook for interactive data analytics. I’ll use Zeppelin to run Spark scripts and queries.

I log in to Oracle Cloud and start creating a Big Data Cloud service. I select “Basic” for the deployment profile because I do not need Hive, I want only one node (for testing), and I select version 2.1 of Spark. After the service is created, I go to “Access rules” and enable ora_p2bdcsce_ssh because I will need to connect to my server through SSH.…

Using Spark to join data from CSV and MySQL Table

Yesterday, I explained how we can access a MySQL database from Zeppelin, which comes with Oracle Big Data Cloud Service Compute Edition (BDCSCE). Although we can use Zeppelin to access MySQL, we still need something more powerful to combine data from two different sources (for example, data from a CSV file and RDBMS tables). Spark is a great choice for processing data. In this blog post, I’ll write a simple PySpark (Python for Spark) script which will read from MySQL and CSV, join the data, and write the output to MySQL again. Please keep in mind that I use Oracle BDCSCE, which supports Spark 2.1, so I tested my code only on Spark 2.1 and in the Zeppelin environment. I expect you to run all these steps in the same environment. Otherwise, you may need to modify paths and code.
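To give an idea of the overall shape of such a script, here is a minimal sketch. It is not the exact code from this post. It assumes a `carriers` lookup table in MySQL with `Code` and `Description` columns, a `UniqueCarrier` column in the CSV, a target table called `most_active_carriers`, and the MySQL Connector/J 5.x driver class. All of these names are assumptions you would adapt to your own setup.

```python
# Minimal sketch of a read-join-write flow; table and column names are assumptions.

def jdbc_properties(user, password):
    """Connection properties for Spark's JDBC reader and writer."""
    return {
        "user": user,
        "password": password,
        "driver": "com.mysql.jdbc.Driver",  # assumed Connector/J 5.x driver class
    }


def write_most_active_carriers(spark, flights_csv, jdbc_url, props, limit=10):
    """Count flights per carrier, add carrier names from MySQL, write the top rows back."""
    # Read the flight data from CSV, taking column names from the header row.
    flights = spark.read.csv(flights_csv, header=True, inferSchema=True)
    # Read the (assumed) carriers lookup table from MySQL.
    carriers = spark.read.jdbc(url=jdbc_url, table="carriers", properties=props)

    counts = (
        flights.groupBy("UniqueCarrier")
        .count()
        .withColumnRenamed("count", "flight_count")
    )
    joined = counts.join(carriers, counts["UniqueCarrier"] == carriers["Code"])
    top = (
        joined.select("Code", "Description", "flight_count")
        .orderBy("flight_count", ascending=False)
        .limit(limit)
    )
    # Write the result to the (assumed) target table, replacing existing rows.
    top.write.jdbc(url=jdbc_url, table="most_active_carriers",
                   mode="overwrite", properties=props)
    return top
```

In Zeppelin, `spark` is already defined, so the function can be called directly with a CSV path and a JDBC URL of the form `jdbc:mysql://host:3306/dbname`.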

For my sample script, I’ll use the flight information for the year 2008. If you have read my blog post series about BDCSCE, you should be familiar with it. Anyway, do not worry about the data structure: I use only a few columns of the data, and you can get more information about it on the statistical computing website.

First, I’ll create a table on MySQL to store the most active carriers (in 2008). I have already set up Zeppelin to access my MySQL database, so I create a new paragraph, enter the following SQL commands, and run them.

Python for Data Science – Importing table data from a web page

This is another blog post about using the Pandas package. This time, I’ll show you how to import table data from a web page. To be able to get table data, there should be a table defined with table tags (table, td, tr) in the web page we access. Unfortunately, most web sites do not use “tables” anymore. They usually prefer to use “div” tags, so if this code doesn’t work, check the HTML source code of the page.
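To make the idea concrete, here is a minimal sketch that uses `pandas.read_html` on a small inline HTML snippet instead of a live page. The table contents are made-up sample values, and `read_html` needs an HTML parser such as lxml installed.

```python
import io

import pandas as pd

# Made-up sample page with two tables built from table/tr/th/td tags.
html = """
<table>
  <tr><th>Currency</th><th>Rate</th></tr>
  <tr><td>EUR</td><td>0.85</td></tr>
  <tr><td>GBP</td><td>0.75</td></tr>
</table>
<table>
  <tr><th>Market</th><th>Change</th></tr>
  <tr><td>Sample Index</td><td>1.2</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element found.
tables = pd.read_html(io.StringIO(html), header=0)
rates = tables[0]
print(len(tables))
print(rates)
```

For a real page you would pass the URL instead of the `StringIO` object and pick the table you need by its position in the returned list.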

For testing purposes, I’ll try to fetch exchange rates from CNN Money International web site. There are two tables in the page, one for the exchange rates and one for the world markets.…

Python for Data Science – Importing XML to Pandas DataFrame

In my previous post, I showed how easy it is to import data from CSV, JSON, and Excel files using the Pandas package. Another popular format for exchanging data is XML. Unfortunately, the Pandas package does not have a function to import data from XML, so we need to use the standard XML package and do some extra work to convert the data to Pandas DataFrames.

Here’s a sample XML file (save it as test.xml):

We want to convert this to a dataframe which contains customer name, email, phone and street:

As you can see, we need to read an attribute of an XML tag (customer name) and the text values of sub-elements (address/street), so although we will use a very simple method, it will show you how to parse even complex XML files using Python.
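The approach can be sketched as follows, using the standard `xml.etree.ElementTree` module. The inline XML here is a hypothetical stand-in with the same kind of structure (a `name` attribute plus `email`, `phone` and `address/street` elements), not the actual test.xml, and the element names are assumptions.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical stand-in for the sample file; element names are assumptions.
xml_text = """
<customers>
  <customer name="Alice Example">
    <email>alice@example.com</email>
    <phone>555-0100</phone>
    <address><street>1 Main St</street></address>
  </customer>
  <customer name="Bob Example">
    <email>bob@example.com</email>
    <phone>555-0101</phone>
    <address><street>2 Side Ave</street></address>
  </customer>
</customers>
"""

root = ET.fromstring(xml_text)

rows = []
for customer in root.findall("customer"):
    rows.append({
        "name": customer.get("name"),                  # attribute of the tag
        "email": customer.findtext("email"),           # text of a direct child
        "phone": customer.findtext("phone"),
        "street": customer.findtext("address/street"), # text of a nested child
    })

df = pd.DataFrame(rows, columns=["name", "email", "phone", "street"])
print(df)
```

To read from a file instead, `ET.parse("test.xml").getroot()` gives the same kind of root element.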

Python for Data Science – Importing CSV, JSON, Excel Using Pandas

Although I think that R is the language for Data Scientists, I still prefer Python to work with data. In this blog post, I will show you how easy it is to import data from CSV, JSON and Excel files using the Pandas library. Pandas is a Python package designed for doing practical, real-world data analysis.
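As a quick illustration of how little code this takes, here is a minimal sketch that reads CSV and JSON from in-memory strings. The data is made up and only stands in for the sample files, and the Excel call is left as a comment because it needs a real file and an Excel reader package such as xlrd.

```python
import io

import pandas as pd

# Made-up sample data standing in for the files on disk.
csv_text = "id,name,score\n1,Ann,90\n2,Bob,85\n"
json_text = '[{"id": 1, "name": "Ann", "score": 90}, {"id": 2, "name": "Bob", "score": 85}]'

csv_df = pd.read_csv(io.StringIO(csv_text))     # same call works with a file path
json_df = pd.read_json(io.StringIO(json_text))  # a list of records becomes rows

print(csv_df)
print(json_df)

# Excel works the same way once a reader package is installed:
# excel_df = pd.read_excel("test.xls")
```

Each reader returns a regular DataFrame, so the rest of your code does not need to care which format the data came from.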

Here is the content of the sample CSV file (test.csv):

Here is the content of the sample JSON file (test.json):

I also created an Excel file (test.xls):