PySpark Examples #2: Grouping Data from CSV File (Using DataFrames)

I continue to share example codes related with my “Spark with Python” presentation. In my last blog post, I showed how we use RDDs (the core data structures of Spark). This time, I will use DataFrames instead of RDDs. DataFrames are distributed collection of data organized into named columns (in a structured way). They are similar to tables in relational databases. They also provide a domain specific language API to manipulate your distributed data, so it’s easier to use.

DataFrames are provided by Spark SQL module, and they are used as primarily API for Spark’s Machine Learning lib and structured streaming modules. Spark developers recommend to use DataFrames instead of RDDs, because the Catalyst (Spark Optimizer) will optimize your execution plan and generate better code to process the data.