Amazon QLDB and the Missing Command Line Client

Amazon Quantum Ledger Database is is a fully managed ledger database which tracks all changes of user data and maintains a verifiable history of changes over time. It was announced at AWS re:Invent 2018 and now available in five AWS regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo).

You may ask why you would like to use QLDB (a ledger database) instead of using your traditional database solution. We all know that it’s possible to create history tables for our fact tables, and keep them up to date using triggers, stored procedures, or even with our application code (by writing changes of the main table to its history table). You can also say that your database have write-ahead/redo logs, so it’s possible to track and verify the changes of all your data as long as you keep them in your archive. On the other hand, It’s clear that this will create an extra workload and complexity for the database administrator and the application developer while it does not guarantee that the data was intact and reliable. What if your DBA directly modifies the data and history table after disabling the triggers and even alter the archived logs? You may say it’s too hard, but you know that it’s technically possible. In a legal dispute, or a security compliance investigation, this might be enough to question to the integrity of the data.

QLDB solves this problem with cryptographically verifiable journal. When an application needs to modify data in a document, the changes are logged into the journal files first (WAL concept). The difference here is, each block is hashed (SHA-256) for “verification” and has a sequence number to specify its address within the journal. QLDB calculates this hash value using the content of the journal block and the hash value of previous block. So the journal blocks are chained by the hash values! The QLDB users do not have access to the logs and the logs are immutable. In anyway, if someone modifies data, they also need to update the journal blocks related with the data. This will cause a new hash to be generated for the journal block, and all the following blocks will have a different hash value than before.

Sample AWS Lambda Function to Monitor Oracle Database

I wrote a very simple AWS Lambda function to demonstrate how to connect an Oracle database, gather the tablespace usage information, and send these metrics to CloudWatch. First, I wrote this lambda function in Python and then I had to re-write it in Java. As you may know, you need to use cx_oracle module to connect Oracle Databases with Python. This extension module requires some libraries which are shipped by Oracle Database Client (oh God!). It’s a little bit tricky to pack it for the AWS Lambda. Good thing is, I found a great document which explains all necessary steps.

Here’s the main class which will be used by Lambda service:

An Interesting Problem with ODI: Unable to retrieve user GUID

One of my customers had a problem about logging in to Oracle Data Integrator (ODI) Studio. Their ODI implementation is configured to use external authentication (Microsoft Active Directory). The configuration was done years ago. No one modified it since it’s done, in fact most people even do not remember how it’s configured. Everything was fine until they started to get “ODI-10192: Unable to retrieve user GUID” error.

PySpark Examples #5: Discretized Streams (DStreams)

This is the fourth blog post which I share sample scripts of my presentation about “Apache Spark with Python“. Spark supports two different way for streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams is the basic abstraction in Spark Streaming. It is a continuous sequence of RDDs representing stream of data. Structured Streaming is the newer way of streaming and it’s built on the Spark SQL engine. In next blog post, I’ll also share a sample script about Structured Streaming but today, I will demonstrate how we can use DStreams.

When you search example scripts about DStreams, you find sample codes that reads data from TCP sockets. So I decided to write a different one: My sample code will read from files located in a directory. The script will check the directory every second, and process the new CSV files it finds. Here’s the code:

PySpark Examples #3-4: Spark SQL Module

In this blog post, I’ll share example #3 and #4 from my presentation to demonstrate capabilities of Spark SQL Module. As I already explained in my previous blog posts, Spark SQL Module provides DataFrames (and DataSets – but Python doesn’t support DataSets because it’s a dynamically typed language) to work with structured data.

First, let’s start creating a temporary table from a CSV file and run query on it. Like I did my previous blog posts, I use the “u.user” file file of MovieLens 100K Data (I save it as users.csv).