Lambda Function to Resize EBS Volumes of EMR Nodes

I have to start by saying that you should not use EMR as a persistent Hadoop cluster. The power of EMR lies in its elasticity. You should launch an EMR cluster, process the data, write the results to S3 buckets, and terminate the cluster. However, we see a lot of AWS customers use EMR as a persistent cluster. So I was not surprised when a customer told me that they needed to automatically resize the EBS volumes of new core nodes on their EMR cluster. The core nodes were configured with 200 GB disks, but now they wanted 400 GB disks. It’s not possible to change the instance type or EBS volume configuration of core nodes on a running cluster, so a custom solution was needed. I explained to the customer how to do it with some sample Python code, but in the end they gave up on this method (thank God).

I wanted to see if it could be done anyway. So, for fun and curiosity, I wrote a Lambda function in Java. It should be scheduled to run every 5 or 10 minutes. On every run, it checks whether there’s an ongoing resize operation. If a resize has finished, it connects to the node and runs the “growpart” and “xfs_growfs” commands to grow the partition and the filesystem. If there’s no resize operation in progress, it checks all volumes of a specific cluster and starts a resize operation on any volume that is smaller than the target size.

Here’s the main class used by the Lambda function:
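A minimal sketch of that class, assuming the AWS SDK for Java v1 (the cluster tag filter, the target size, and the growFilesystem helper are illustrative, and the SSH step is left as a stub):

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.*;
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;

    public class ResizeEmrVolumes implements RequestHandler<Object, String> {

        private static final int TARGET_SIZE_GB = 400; // desired disk size
        private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        @Override
        public String handleRequest(Object input, Context context) {
            // 1) If a resize is still running, wait for the next scheduled run.
            DescribeVolumesModificationsResult mods =
                    ec2.describeVolumesModifications(new DescribeVolumesModificationsRequest());
            for (VolumeModification m : mods.getVolumesModifications()) {
                if ("modifying".equals(m.getModificationState())) {
                    return "resize in progress: " + m.getVolumeId();
                }
                if ("optimizing".equals(m.getModificationState())
                        || "completed".equals(m.getModificationState())) {
                    // 2) The volume has its new size; grow the partition and the
                    //    filesystem on the node (e.g. over SSH):
                    //    sudo growpart /dev/xvdb 1 && sudo xfs_growfs /mnt
                    growFilesystem(m.getVolumeId()); // hypothetical helper
                    return "growing filesystem on " + m.getVolumeId();
                }
            }
            // 3) Otherwise start a resize on the first undersized volume of the
            //    cluster (assuming the cluster id tag is propagated to volumes).
            DescribeVolumesRequest req = new DescribeVolumesRequest().withFilters(
                    new Filter("tag:aws:elasticmapreduce:job-flow-id")
                            .withValues("j-XXXXXXXXXXXXX"));
            for (Volume v : ec2.describeVolumes(req).getVolumes()) {
                if (v.getSize() < TARGET_SIZE_GB) {
                    ec2.modifyVolume(new ModifyVolumeRequest()
                            .withVolumeId(v.getVolumeId())
                            .withSize(TARGET_SIZE_GB));
                    return "started resize of " + v.getVolumeId();
                }
            }
            return "nothing to do";
        }

        private void growFilesystem(String volumeId) {
            // Stub: find the instance the volume is attached to, connect over
            // SSH, and run "growpart" and "xfs_growfs" as described above.
        }
    }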

Amazon QLDB and the Missing Command Line Client

Amazon Quantum Ledger Database is a fully managed ledger database that tracks all changes of user data and maintains a verifiable history of changes over time. It was announced at AWS re:Invent 2018 and is now available in five AWS regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo).

You may ask why you would want to use QLDB (a ledger database) instead of a traditional database solution. We all know that it’s possible to create history tables for our fact tables and keep them up to date using triggers, stored procedures, or even application code (by writing changes of the main table to its history table). You can also say that your database has write-ahead/redo logs, so it’s possible to track and verify the changes to all your data as long as you keep those logs in your archive. On the other hand, it’s clear that this creates extra workload and complexity for the database administrator and the application developer, while still not guaranteeing that the data is intact and reliable. What if your DBA directly modifies the data and the history table after disabling the triggers, and even alters the archived logs? You may say it’s too hard, but you know it’s technically possible. In a legal dispute or a security compliance investigation, this might be enough to question the integrity of the data.

QLDB solves this problem with a cryptographically verifiable journal. When an application needs to modify data in a document, the changes are logged into the journal files first (the write-ahead logging concept). The difference here is that each block is hashed (SHA-256) for verification and has a sequence number to specify its address within the journal. QLDB calculates this hash value using the content of the journal block and the hash value of the previous block, so the journal blocks are chained by their hash values! QLDB users do not have access to the journal, and the journal is immutable. In any case, if someone were to modify data, they would also need to update the journal blocks related to that data. This would cause a new hash to be generated for that journal block, and all the following blocks would end up with different hash values than before.
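To make the chaining idea concrete, here’s a toy Java sketch. It’s a simplified model, not QLDB’s actual implementation (QLDB hashes structured journal blocks and provides Merkle audit proofs), but it shows why tampering with one block invalidates every block after it:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class JournalChainDemo {

        // A block's hash covers its own content AND the previous block's hash.
        static byte[] chainHash(byte[] previousHash, String content) throws Exception {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            sha256.update(previousHash);
            sha256.update(content.getBytes(StandardCharsets.UTF_8));
            return sha256.digest();
        }

        static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            String[] journal = {"INSERT person 1", "UPDATE person 1", "DELETE person 1"};
            byte[] hash = new byte[32]; // all-zero "previous" hash for the first block
            for (String block : journal) {
                hash = chainHash(hash, block);
                System.out.println(toHex(hash) + "  " + block);
            }
            // Change the content of any block and re-run: its hash changes, and
            // so does every hash after it, which makes tampering detectable.
        }
    }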

Sample AWS Lambda Function to Monitor Oracle Database

I wrote a very simple AWS Lambda function to demonstrate how to connect to an Oracle database, gather tablespace usage information, and send these metrics to CloudWatch. First, I wrote this Lambda function in Python, and then I had to rewrite it in Java. As you may know, you need the cx_Oracle module to connect to Oracle databases from Python. This extension module requires some libraries which are shipped with the Oracle Database Client (oh God!). It’s a little bit tricky to package them for AWS Lambda. The good thing is, I found a great document which explains all the necessary steps.

Here’s the main class used by the Lambda function:
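A minimal sketch of that class, assuming the AWS SDK for Java v1 and an Oracle JDBC driver (ojdbc) bundled in the deployment package; the connection details are illustrative and would normally come from environment variables or Secrets Manager:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
    import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
    import com.amazonaws.services.cloudwatch.model.*;
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;

    public class MonitorOracleTablespaces implements RequestHandler<Object, String> {

        // Illustrative connection details; read them from environment
        // variables or AWS Secrets Manager in real code.
        private static final String JDBC_URL = "jdbc:oracle:thin:@dbhost:1521/ORCL";

        @Override
        public String handleRequest(Object input, Context context) {
            AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();
            String sql = "SELECT tablespace_name, used_percent"
                       + " FROM dba_tablespace_usage_metrics";
            try (Connection conn = DriverManager.getConnection(JDBC_URL, "monitor", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                // CloudWatch accepts up to 20 metric datums per PutMetricData call.
                PutMetricDataRequest request =
                        new PutMetricDataRequest().withNamespace("OracleDB");
                while (rs.next()) {
                    request.withMetricData(new MetricDatum()
                            .withMetricName("TablespaceUsedPercent")
                            .withDimensions(new Dimension()
                                    .withName("TablespaceName")
                                    .withValue(rs.getString("tablespace_name")))
                            .withUnit(StandardUnit.Percent)
                            .withValue(rs.getDouble("used_percent")));
                }
                cloudWatch.putMetricData(request);
                return "sent metrics";
            } catch (Exception e) {
                throw new RuntimeException("failed to collect tablespace metrics", e);
            }
        }
    }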

Oracle Berkeley DB Java Edition

I was researching NoSQL databases and saw that Oracle provides a NoSQL database called Berkeley DB. I examined it and wrote a blog post with quick tips for getting started with the Java Edition of Berkeley DB. That post was published in Turkish about a year ago. I’ve decided to translate (in fact, rewrite) it and publish it in English, and that is what you are reading now.

Berkeley DB is a high-performance embedded database that originated at the University of California, Berkeley. It’s fast, reliable, and used in several applications such as Evolution (email client), OpenLDAP, RPM (the RPM Package Manager) and Postfix (an MTA). In contrast to most other database systems, Berkeley DB provides relatively simple data access services. Berkeley DB databases are B+Trees (like indexes in Oracle RDBMS) and can store only key/value pairs (there are no columns or tables). The keys and values are byte arrays. Databases are stored as files within a single directory, which is called an “environment”.

There are three versions of Berkeley DB:

  • Berkeley DB (the traditional database, written in C)
  • Berkeley DB Java Edition (native Java version)
  • Berkeley DB XML (for storing XML documents)

As a hobbyist Java programmer, I prefer Berkeley DB Java Edition (JE). Berkeley DB JE supports almost all features of the traditional Berkeley DB, such as replication, hot backups, and ACID transactions. It is written in pure Java, so it’s platform-independent.

Berkeley DB JE provides two interfaces:

  1. Traditional Berkeley DB API (with DB data abstraction of key/value pairs)
  2. Direct Persistence Layer (DPL), which works with “Plain Old Java Objects” (POJOs)

Because I’m an old-school (ex-)programmer, I’ll show how to use the traditional Berkeley DB API. It will also help you understand how Berkeley DB works.
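Here’s a minimal put/get example using the base com.sleepycat.je API (the environment directory and database name are arbitrary):

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;

    public class BerkeleyHello {

        public static void main(String[] args) {
            // The environment is just a directory; it must exist before opening.
            File envDir = new File("je-env");
            envDir.mkdirs();

            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(envDir, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "sampleDB", dbConfig);

            // Keys and values are byte arrays wrapped in DatabaseEntry objects.
            DatabaseEntry key = new DatabaseEntry("user:1".getBytes(StandardCharsets.UTF_8));
            DatabaseEntry value = new DatabaseEntry("hello".getBytes(StandardCharsets.UTF_8));
            db.put(null, key, value); // first parameter is the transaction (null = none)

            DatabaseEntry found = new DatabaseEntry();
            if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                System.out.println(new String(found.getData(), StandardCharsets.UTF_8));
            }

            db.close();
            env.close();
        }
    }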

How to Install Oracle Grid Control 11g (Step by Step Guide)

Oracle Enterprise Manager Grid Control is a tool to manage and monitor multiple instances of Oracle and non-Oracle platforms such as Microsoft .NET, Microsoft SQL Server, NetApp filers, BEA WebLogic, and others.

I’ll show how to install Oracle Grid Control 11g on Oracle Linux (32-bit). Here are the main steps:

1) Installation of Oracle Linux (How to Install Oracle Linux 5.6)
2) Installation of Oracle Database 11.2.0.2 (Repository Database)
3) Installation of Java and Weblogic Server (Middleware)
4) Installation of Grid Control
5) Installation of Grid Control Agent to a Target System

Installation of Oracle Database 11.2

After you install Oracle Linux, log in as root and create the directories for Grid Control and the other software we’ll install:
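The exact layout is up to you; assuming /u01 as the base directory (the paths below are illustrative):

    mkdir -p /u01/app/oracle        # Oracle base for the repository database
    mkdir -p /u01/app/middleware    # home for WebLogic and Grid Control
    chown -R oracle:oinstall /u01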

Because we installed the oracle-validated package (How to Install Oracle Linux 5.6), the oracle user and the related groups have already been created automatically. The password of the oracle user is “oracle”. Log in to the system as oracle and download the Oracle Database binaries. I’ll use 11.2.0.2, but the steps also apply to 11.2.0.1, which you can download from Oracle.com:

http://www.oracle.com/technetwork/database/enterprise-edition/downloads/112010-linuxsoft-085393.html

Unzip the files:

For regular 11.2.0.1 (Linux x86):
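Assuming the two-part download with the file names as published on OTN (adjust them for your version):

    unzip linux_11gR2_database_1of2.zip
    unzip linux_11gR2_database_2of2.zip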

Unzipping will create a directory named “database”; go into this directory and run “./runInstaller”:
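    cd database
    ./runInstaller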