Use Snowflake and Zepl to Analyse Covid-19 (coronavirus) Data

Coronavirus has changed our lives; most of us are stuck at home, trying to follow everything about the pandemic. So I wanted to write a blog post that guides you through configuring an environment where you can examine COVID-19 pandemic data. In this blog post, I will show you how to set up a Snowflake trial account, enable access to the COVID-19 data provided by Starschema, and use Zepl (a data science analytics platform) to analyse this data.

START YOUR SNOWFLAKE TRIAL

Let’s start by setting up a Snowflake trial account. Snowflake is a cloud data platform available on the major public cloud providers (Amazon, Azure and Google). It provides a cloud-native data warehouse. Why am I saying “cloud-native”? Because it was not ported to the cloud like many other data warehouse services; it was designed and built to run in the cloud, so it uses the underlying cloud services efficiently. I think this is enough of an introduction to Snowflake; you will be able to discover it by yourself after we set up the trial account.

To create a trial account, visit https://trial.snowflake.com/ and fill in a simple form. It does not require you to enter a credit card or any other payment method.
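
Once the trial account is ready and the Starschema COVID-19 share is enabled in your account, you can query the data from Zepl or from any Snowflake client. As a minimal sketch (not part of the original walkthrough), here is how a query could look with the Snowflake Python connector; the connection details and the database/table name (COVID19.PUBLIC.JHU_COVID_19) are assumptions that may differ in your account:

import snowflake.connector

# Connection details are placeholders; the database and table name below
# (COVID19.PUBLIC.JHU_COVID_19) are an assumption based on the Starschema
# share and may be named differently in your account.
conn = snowflake.connector.connect(
    account="your_account_identifier",
    user="your_user",
    password="your_password",
    warehouse="COMPUTE_WH",
)

cur = conn.cursor()
cur.execute("SELECT * FROM COVID19.PUBLIC.JHU_COVID_19 LIMIT 10")
for row in cur.fetchall():
    print(row)

conn.close()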

How to Use IAM authentication for RDS PostgreSQL with Glue ETL Jobs

Amazon RDS enables you to use AWS Identity and Access Management (IAM) to manage database access for Amazon RDS for PostgreSQL DB instances. It’s possible to use IAM authentication with Glue connections, but it is not well documented, so I will demonstrate how you can do it. In your Glue job, you can import the boto3 library and call the “generate_db_auth_token” method to generate a token and use it when connecting.

Here’s a simple Glue ETL script I wrote for testing. It connects to PostgreSQL using IAM authentication, reads data from a table and writes the output to S3:
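
A rough sketch of such a job, with placeholder endpoint, database, user, table and bucket names rather than the exact script, could look like this:

import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Generate a short-lived IAM authentication token instead of using a password.
rds = boto3.client("rds", region_name="eu-west-1")  # region is a placeholder
token = rds.generate_db_auth_token(
    DBHostname="mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com",  # placeholder endpoint
    Port=5432,
    DBUsername="iam_db_user",  # placeholder DB user mapped to an IAM policy
)

# Read a table over JDBC; IAM authentication requires an SSL connection.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com:5432/mydb?sslmode=require")
    .option("dbtable", "public.mytable")
    .option("user", "iam_db_user")
    .option("password", token)  # the IAM token is passed as the password
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Write the output to S3 (placeholder bucket).
df.write.mode("overwrite").parquet("s3://my-output-bucket/mytable/")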

How to Use AWS S3 bucket for Spark History Server

Since EMR version 5.25, it’s possible to debug and monitor your Apache Spark jobs by logging directly into the off-cluster, persistent Apache Spark History Server using the EMR console. You do not need to do anything extra to enable it, and you can access the Spark history even after the cluster is terminated. The logs are available for active clusters and are retained for 30 days after the cluster is terminated.

Although this is a great feature, each EMR cluster has its own logs in a different bucket, the number of active Spark History Server UIs cannot exceed 50 per AWS account, and if you want to keep the logs for more than 30 days after the cluster is terminated, you need to copy them to another bucket and then create a Spark History Server for them.

To overcome these limitations and have a more flexible way to access the Spark history, you can configure Spark to send its event logs to an S3 bucket.
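
For example, a spark-defaults configuration similar to the following (shown here as the Python structure you could pass in the Configurations parameter of an EMR API call; the bucket name is a placeholder) points both the event log location and the history server at S3:

# Sketch: spark-defaults settings that send Spark event logs to an S3 bucket.
# The bucket/prefix is a placeholder; the same properties can also be supplied
# as a JSON configuration when creating the cluster from the console or CLI.
spark_history_on_s3 = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.eventLog.enabled": "true",
            "spark.eventLog.dir": "s3://my-spark-history-bucket/eventlogs/",
            # A Spark History Server reading the same location:
            "spark.history.fs.logDirectory": "s3://my-spark-history-bucket/eventlogs/",
        },
    }
]

What matters is that the jobs and the history server point at the same S3 location, so the history survives cluster termination.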

Lambda Function to Resize EBS Volumes of EMR Nodes

I have to start by saying that you should not use EMR as a persistent Hadoop cluster. The power of EMR lies in its elasticity: you should launch an EMR cluster, process the data, write the output to S3 buckets, and terminate the cluster. However, we see a lot of AWS customers using EMR as a persistent cluster, so I was not surprised when a customer told me that they needed to resize the EBS volumes of new core nodes of their EMR cluster automatically. The core nodes are configured with 200 GB disks, but now they want 400 GB disks. It’s not possible to change the instance type or the EBS volume configuration of core nodes, so a custom solution was needed. I explained to the customer how to do it with some sample Python code, but in the end they gave up on this method (thank God).

Still, I wanted to see if it could be done, so for fun and out of curiosity I wrote a Lambda function in Java. It should be scheduled to run every 5 or 10 minutes. On every run, it checks whether there is an ongoing resizing operation. If a resizing operation has finished, it connects to the node and runs the “growpart” and “xfs_growfs” commands to grow the partition and the filesystem. If there is no resizing operation in progress, it checks all volumes of a specific cluster and starts a resizing operation on any volume that is smaller than a specific size.
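
As a simplified illustration of that flow (the actual function is written in Java, and the SSH step that runs growpart and xfs_growfs is omitted here), a Python/boto3 sketch with a placeholder cluster ID and target size could look like this:

import boto3

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder EMR cluster id
TARGET_SIZE_GB = 400            # desired size for the core-node volumes

emr = boto3.client("emr")
ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Find the EC2 instances behind the cluster's core nodes.
    core = emr.list_instances(ClusterId=CLUSTER_ID, InstanceGroupTypes=["CORE"])
    instance_ids = [i["Ec2InstanceId"] for i in core["Instances"]]
    if not instance_ids:
        return "no core instances found"

    # All EBS volumes attached to those instances.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": instance_ids}]
    )["Volumes"]
    volume_ids = [v["VolumeId"] for v in volumes]

    # Is any of these volumes still being modified?
    mods = ec2.describe_volumes_modifications(
        Filters=[
            {"Name": "volume-id", "Values": volume_ids},
            {"Name": "modification-state", "Values": ["modifying", "optimizing"]},
        ]
    )["VolumesModifications"]
    if mods:
        # A resize is still in progress; once it completes, the function would
        # connect to the node over SSH and run "growpart" and "xfs_growfs"
        # (that step is omitted in this sketch).
        return "resize in progress"

    # No resize in progress: start one on the first volume below the target size.
    for v in volumes:
        if v["Size"] < TARGET_SIZE_GB:
            ec2.modify_volume(VolumeId=v["VolumeId"], Size=TARGET_SIZE_GB)
            return "resizing " + v["VolumeId"]

    return "nothing to do"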

Here’s the main class used by the Lambda function:

Query an HBase table through Hive using PySpark on EMR

In this blog post, I’ll demonstrate how we can access an HBase table through Hive from a PySpark script/job on an AWS EMR cluster. First, I created an EMR cluster (EMR 5.27.0, Hive 2.3.5, HBase 1.4.0). Then I connected to the master node, ran “hbase shell”, created an HBase table, and inserted a sample row:

I logged in to Hive and created a Hive table that points to the HBase table:

When I tried to access the table using spark.table(‘myhivetable’), I got an error saying that the org.apache.hadoop.hive.hbase.HBaseStorageHandler class could not be found. I tried the “--packages” parameter to pull the required JAR library from the Maven repository; it downloaded a lot of missing JARs, but it did not work. So I downloaded the required JAR file using wget and copied it to Spark’s JAR directory:
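
Once the storage handler JAR is on Spark’s classpath, accessing the table from PySpark is straightforward; as a minimal sketch (reusing the table name from above):

# Minimal sketch: reading the Hive table that is backed by HBase, once the
# hive-hbase-handler JAR is available to Spark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-hbase-through-hive")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("myhivetable")
df.show()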