Using Spark to Process Data From Cassandra for Analytics

After my presentation about Apache Cassandra, most people asked whether they can run analytical queries on Cassandra, and how they can integrate Spark with Cassandra. So I decided to write a blog post to demonstrate how we can process data from Cassandra using Spark. In this blog post, I’ll show how to build a testing environment on Oracle Cloud (Spark + Cassandra), load sample data into Cassandra, and query the data using Spark.

Let me first create an Oracle Big Data Cloud instance. Instead of installing Spark manually, I’ll use the Big Data Cloud service, so I’ll have both Spark and Zeppelin. Zeppelin is a web-based notebook for interactive data analytics. I’ll use Zeppelin to run Spark scripts and queries.

I log in to Oracle Cloud and start creating a Big Data Cloud service. I select “Basic” for the deployment profile because I do not need Hive, I want only one node (for testing), and I select Spark version 2.1. After the service is created, I go to “Access rules” and enable ora_p2bdcsce_ssh because I will need to connect to my server through SSH.
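To make the goal concrete, here is a minimal sketch of what reading a Cassandra table from a Zeppelin PySpark paragraph could look like. It assumes the DataStax spark-cassandra-connector package is available to the Spark interpreter and that spark.cassandra.connection.host points at a Cassandra node. The helper names, and the keyspace and table names in the comments, are hypothetical and used only for illustration.

```python
# Sketch only: assumes the spark-cassandra-connector is on Spark's classpath
# and spark.cassandra.connection.host is set to a reachable Cassandra node.

CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"  # data source name registered by the connector


def cassandra_read_options(keyspace, table):
    """Build the options the connector needs to locate a table."""
    return {"keyspace": keyspace, "table": table}


def load_cassandra_table(spark, keyspace, table):
    """Return a DataFrame backed by a Cassandra table.

    `spark` is an existing SparkSession, passed in by the caller so this
    helper does not create a session itself.
    """
    return (
        spark.read
        .format(CASSANDRA_FORMAT)
        .options(**cassandra_read_options(keyspace, table))
        .load()
    )


# Hypothetical usage inside a notebook paragraph ("demo" and "users" are made-up names):
# df = load_cassandra_table(spark, "demo", "users")
# df.createOrReplaceTempView("users")
# spark.sql("SELECT COUNT(*) FROM users").show()
```

Registering the DataFrame as a temporary view is one way to run plain SQL against the Cassandra data, which is usually what people mean by analytical queries.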

Build a Cassandra Cluster on Docker

In this blog post, I’ll show how we can build a three-node Cassandra cluster on Docker for testing. I’ll use the official Cassandra images instead of creating my own, so the whole process will take only a few minutes (depending on your network connection). I assume that you have Docker installed on your PC, have an internet connection (I was born in 1976 so it’s normal for me to ask this kind of question) and your PC has at least 8 GB RAM. First of all, we need to assign about 5 GB RAM to Docker (in case it has less RAM assigned), because each node will require 1.5+ GB RAM to work properly, and three nodes at 1.5 GB each already add up to 4.5 GB.

Open the Docker preferences, click the advanced tab, set the memory to 5 GB or more, and click “apply and restart” to restart the Docker service. Then launch a terminal window and run the “docker pull cassandra” command to fetch the latest official Cassandra image.
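Once the image is pulled, starting the three nodes could be scripted roughly as below. This is a sketch under my own assumptions, not necessarily the exact steps of the full post: the network name and container names are hypothetical, and the containers are joined through a user-defined Docker network so that the later nodes can reach the first one by name. CASSANDRA_SEEDS is the environment variable the official image reads to find its seed node. By default the script only prints the commands (dry run).

```python
import subprocess

IMAGE = "cassandra"               # official image fetched with "docker pull cassandra"
NETWORK = "cassandra-net"         # hypothetical network name
NODES = ["cas1", "cas2", "cas3"]  # hypothetical container names


def cluster_commands(nodes=NODES, network=NETWORK, image=IMAGE):
    """Return the docker commands that would start the cluster, in order."""
    seed = nodes[0]  # the first node acts as the seed for the others
    commands = [["docker", "network", "create", network]]
    for name in nodes:
        cmd = ["docker", "run", "--name", name, "--network", network, "-d"]
        if name != seed:
            # Point every non-seed node at the first container by name.
            cmd += ["-e", f"CASSANDRA_SEEDS={seed}"]
        cmd.append(image)
        commands.append(cmd)
    return commands


def start_cluster(dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in cluster_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


start_cluster(dry_run=True)
```

In practice it is safer to start the nodes one at a time and wait until each has joined before launching the next. Running “docker exec cas1 nodetool status” is one way to check.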

Introduction to Apache Cassandra

On Friday, I gave a presentation about Apache Cassandra at the Big Talk event organized by Komtaş Information Management. Cassandra is a top-level Apache project that was born at Facebook. It is a distributed database for managing large amounts of structured data. It provides a highly available service with no single point of failure, even when running on commodity hardware.

In my previous company, we used Cassandra to store our social platform data. It performs well even on medium-size instances running on Amazon Cloud, so our development team wanted to use it on more projects. I managed both production and test environments, and I can say that it is easy to operate as long as you understand the Cassandra internals. So in this event, I wanted to give some introductory information about Apache Cassandra.

By the way, I have to say that the audience was great. The room was full. People asked lots of questions during the session, took photos of the slides, and gave great feedback. I would like to thank the people who joined my session, and Komtaş for organizing the event.