Use Snowflake and Zepl to Analyse Covid-19 (coronavirus) Data

Coronavirus has changed our lives; most of us are stuck at home, trying to follow everything about the pandemic. So I wanted to write a blog post to guide you through configuring an environment in which you can examine COVID-19 pandemic data. In this blog post, I will show you how to set up your Snowflake trial account, enable access to the COVID-19 data provided by Starschema, and use Zepl (a data science analytics platform) to analyse this data.

START YOUR SNOWFLAKE TRIAL

Let’s start with setting up a Snowflake trial account. Snowflake is a cloud data platform available on the major public cloud providers (Amazon, Azure and Google). It provides a cloud-native data warehouse. Why am I saying “cloud-native”? Because it was not ported to the cloud like many other data warehouse services; it was designed and built to run on the cloud, so it uses the underlying cloud services efficiently. I think this is enough of an introduction to Snowflake; you will be able to discover it by yourself after we set up the trial account.

To create a trial account, visit the https://trial.snowflake.com/ page and fill in a simple form. It does not require you to enter a credit card or any other payment method.
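Once the trial account is ready and the Starschema share is enabled, you can query the data from Zepl (or from any Python environment) using the Snowflake connector. Below is a minimal sketch, not part of the original setup steps: the warehouse name, the database/table (COVID19.PUBLIC.JHU_COVID_19) and its column names are assumptions based on the Starschema dataset, so check the exact names in your own account.

import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",        # placeholder, e.g. xy12345.eu-west-1
    user="your_user",              # placeholder
    password="your_password",      # placeholder
    warehouse="COMPUTE_WH",        # assumption: the default trial warehouse
)

try:
    cur = conn.cursor()
    # Daily confirmed-case totals per country
    # (table and column names are assumptions; verify them in your account)
    cur.execute("""
        SELECT country_region, date, SUM(cases) AS total_cases
        FROM COVID19.PUBLIC.JHU_COVID_19
        WHERE case_type = 'Confirmed'
        GROUP BY country_region, date
        ORDER BY date DESC, total_cases DESC
        LIMIT 20
    """)
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()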

Introduction to Apache Spark with Python

Today, I spoke about “Apache Spark with Python” at the Big Talk #2 meet-up in Istanbul Teknokent ARI-3, another event organized by Komtas for the big data community. We had an almost full room. Mine was the last session of the day, but the audience was still very focused and eager to listen to the subjects, so for me, the event was great.

By the way, I also enjoyed the sessions of the other speakers: Zekeriya Beşioğlu spoke about data lakes and Kylo (an open-source data lake management software). I’ll surely test that software as soon as possible. After Zekeriya, İsmail Parsa spoke about data science in retail systems. I was very impressed by his knowledge, and happy to have the chance to join his session.

Using Spark to Process Data From Cassandra for Analytics

After my presentation about Apache Cassandra, most people asked whether they could run analytical queries on Cassandra, and how they could integrate Spark with Cassandra. So I decided to write a blog post to demonstrate how we can process data from Cassandra using Spark. In this blog post, I’ll show how to build a testing environment on Oracle Cloud (Spark + Cassandra), load sample data into Cassandra, and query the data using Spark.

Let me first create an Oracle Big Data Cloud instance. Instead of installing Spark manually, I’ll use the Big Data Cloud service so I’ll have both Spark and Zeppelin. Zeppelin is a web-based notebook for interactive data analytics; I’ll use it to run Spark scripts and queries.

I log in to Oracle Cloud and start creating the Big Data Cloud service. I select “Basic” for the deployment profile because I do not need Hive, I want only one node (for testing), and I select Spark version 2.1. After the service is created, I go to “Access Rules” and enable ora_p2bdcsce_ssh because I will need to connect to my server through SSH.
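Once the cluster is up, querying Cassandra data from Spark can be done with the DataStax spark-cassandra-connector. The sketch below is only an illustration under a few assumptions: the connector package is available to Spark, Cassandra is reachable on localhost, and the keyspace and table names (test_ks.users) are placeholders for whatever sample data you load.

from pyspark.sql import SparkSession

# Assumption: the spark-cassandra-connector package is on the Spark classpath
# and Cassandra is reachable on 127.0.0.1.
spark = (
    SparkSession.builder
    .appName("cassandra-analytics")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Read a Cassandra table as a Spark DataFrame; keyspace and table names
# (test_ks.users) are placeholders.
users = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="test_ks", table="users")
    .load()
)

# An example analytical query: count rows per country, highest first
users.groupBy("country").count().orderBy("count", ascending=False).show()

The same snippet can be pasted into a Zeppelin notebook paragraph and run interactively against the cluster.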

Python for Data Science – Importing CSV, JSON, Excel Using Pandas

Although I think that R is the language for data scientists, I still prefer Python to work with data. In this blog post, I will show you how easy it is to import data from CSV, JSON and Excel files using the Pandas library. Pandas is a Python package designed for practical, real-world data analysis.

Here is the content of the sample CSV file (test.csv):

Here is the content of the sample JSON file (test.json):

I also created an Excel file (test.xls):
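Assuming the three sample files referenced above exist in the working directory, a minimal sketch of reading each of them with Pandas looks like this:

import pandas as pd

# CSV: Pandas infers column names from the header row by default
df_csv = pd.read_csv("test.csv")
print(df_csv.head())

# JSON: works for records-style JSON; adjust the `orient` argument for other layouts
df_json = pd.read_json("test.json")
print(df_json.head())

# Excel: reading an .xls file requires the xlrd package to be installed
df_excel = pd.read_excel("test.xls")
print(df_excel.head())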