An Interesting Problem with ODI: Unable to retrieve user GUID

One of my customers had a problem logging in to Oracle Data Integrator (ODI) Studio. Their ODI implementation is configured to use external authentication (Microsoft Active Directory). The configuration was done years ago and has not been modified since; in fact, most people do not even remember how it was configured. Everything was fine until they started to get the “ODI-10192: Unable to retrieve user GUID” error.

PySpark Examples #5: Discretized Streams (DStreams)

This is the fourth blog post in which I share sample scripts from my presentation about “Apache Spark with Python”. Spark supports two different ways of streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams is the basic abstraction in Spark Streaming: a continuous sequence of RDDs representing a stream of data. Structured Streaming is the newer way of streaming, built on the Spark SQL engine. In the next blog post, I'll also share a sample script about Structured Streaming, but today I will demonstrate how we can use DStreams.

When you search for example scripts about DStreams, you mostly find sample code that reads data from TCP sockets, so I decided to write a different one: my sample code reads from files located in a directory. The script checks the directory every second and processes the new CSV files it finds.
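A minimal sketch of that approach could look like the following; the directory path (/tmp/stream), the pipe-delimited u.user-style column layout, and the per-occupation count are assumptions made for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# One-second batch interval: the directory is checked for new files every second
sc = SparkContext(appName="CSVDirectoryStream")
ssc = StreamingContext(sc, 1)

# Watch a directory for new text files (the path is an assumption for this sketch)
lines = ssc.textFileStream("file:///tmp/stream")

# Assume each line looks like "user id|age|gender|occupation|zip code"
# and count the users per occupation in every batch
counts = (lines.map(lambda line: line.split("|"))
               .map(lambda fields: (fields[3], 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()
```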

PySpark Examples #3-4: Spark SQL Module

In this blog post, I’ll share examples #3 and #4 from my presentation to demonstrate the capabilities of the Spark SQL module. As I already explained in my previous blog posts, the Spark SQL module provides DataFrames (and Datasets, although Python doesn’t support Datasets because it’s a dynamically typed language) to work with structured data.

First, let’s start by creating a temporary table from a CSV file and running a query on it. As in my previous blog posts, I use the “u.user” file of the MovieLens 100K dataset (I saved it as users.csv).
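A minimal sketch of this step could look like the following; the pipe separator, the column names, and the occupation-count query are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# u.user has no header and uses "|" as separator; the column names are assumptions
df = (spark.read
      .option("sep", "|")
      .csv("users.csv")
      .toDF("user_id", "age", "gender", "occupation", "zip_code"))

# Register the DataFrame as a temporary view and query it with plain SQL
df.createOrReplaceTempView("users")
spark.sql("""
    SELECT occupation, COUNT(*) AS user_count
    FROM users
    GROUP BY occupation
    ORDER BY user_count DESC
""").show()
```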

PySpark Examples #2: Grouping Data from CSV File (Using DataFrames)

I continue to share example code related to my “Spark with Python” presentation. In my last blog post, I showed how to use RDDs (the core data structures of Spark). This time, I will use DataFrames instead of RDDs. DataFrames are distributed collections of data organized into named columns (in a structured way); they are similar to tables in relational databases. They also provide a domain-specific language API to manipulate your distributed data, so they are easier to use.

DataFrames are provided by the Spark SQL module, and they are the primary API for Spark’s machine learning library (MLlib) and the Structured Streaming module. Spark developers recommend using DataFrames instead of RDDs, because Catalyst (the Spark optimizer) will optimize your execution plan and generate better code to process the data.
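A minimal sketch of grouping the users.csv data with the DataFrame API could look like this; the pipe separator, the column names, and the average-age-per-occupation aggregation are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("GroupingWithDataFrames").getOrCreate()

# Read the MovieLens u.user file; the "|" separator and column names are assumptions
df = (spark.read
      .option("sep", "|")
      .option("inferSchema", "true")
      .csv("users.csv")
      .toDF("user_id", "age", "gender", "occupation", "zip_code"))

# Group by occupation and compute the average age, using the DataFrame DSL
(df.groupBy("occupation")
   .agg(F.avg("age").alias("avg_age"))
   .orderBy("avg_age", ascending=False)
   .show())
```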

PySpark Examples #1: Grouping Data from CSV File (Using RDDs)

During my presentation about “Spark with Python”, I said that I would share example code (with detailed explanations). So this is my first example. In this code, I read data from a CSV file to create a Spark RDD (Resilient Distributed Dataset). RDDs are the core data structures of Spark. I explained the features of RDDs in my presentation, so in this blog post I will focus only on the example code.

For this sample code, I use the “u.user” file of the MovieLens 100K dataset. I renamed it to “users.csv”, but you can keep its current name if you want.
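A minimal sketch of the RDD version could look like this; the pipe separator and the per-occupation count are assumptions for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="GroupingWithRDDs")

# Read the file as an RDD of lines (the "|" separator of u.user is assumed to be kept)
lines = sc.textFile("users.csv")

# Each line looks like "user id|age|gender|occupation|zip code";
# count the users per occupation with map/reduceByKey
occupation_counts = (lines.map(lambda line: line.split("|"))
                          .map(lambda fields: (fields[3], 1))
                          .reduceByKey(lambda a, b: a + b))

for occupation, count in occupation_counts.collect():
    print(occupation, count)
```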