This is the fourth blog post which I share sample scripts of my presentation about “Apache Spark with Python“. Spark supports two different way for streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams is the basic abstraction in Spark Streaming. It is a continuous sequence of RDDs representing stream of data. Structured Streaming is the newer way of streaming and it’s built on the Spark SQL engine. In next blog post, I’ll also share a sample script about Structured Streaming but today, I will demonstrate how we can use DStreams.
When you search example scripts about DStreams, you find sample codes that reads data from TCP sockets. So I decided to write a different one: My sample code will read from files located in a directory. The script will check the directory every second, and process the new CSV files it finds. Here’s the code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, 1)
stream_data = ssc.textFileStream("file:///tmp/stream") \
.map( lambda x: x.split(","))