How to Use an AWS S3 Bucket for the Spark History Server

Since EMR version 5.25, you can debug and monitor your Apache Spark jobs through the off-cluster, persistent Apache Spark History Server, which you open from the EMR console. You do not need to do anything extra to enable it, and you can access the Spark history even after the cluster is terminated. The logs are available for active clusters and are retained for 30 days after the cluster is terminated.

Although this is a great feature, it has limitations: each EMR cluster keeps its logs in a different bucket; the number of active Spark History Server UIs cannot exceed 50 per AWS account; and if you want to keep the logs for more than 30 days after the cluster is terminated, you need to copy them to another bucket and then create a Spark History Server for them.

To overcome these limitations and get a more flexible way to access Spark history, you can configure Spark to send its logs to an S3 bucket.
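
As a rough illustration of what such a configuration could look like, the sketch below builds an EMR "spark-defaults" classification that enables Spark event logging and points both the event log directory and the History Server log directory at the same S3 location. The function name spark_history_s3_config, the bucket name my-spark-history-bucket, and the spark-history prefix are hypothetical placeholders, not values from this article. The s3:// scheme is an assumption that fits EMR; a Spark installation outside EMR would typically use s3a:// instead.

```python
import json


def spark_history_s3_config(bucket: str, prefix: str = "spark-history") -> list:
    """Build an EMR 'spark-defaults' classification that sends Spark event logs to S3.

    `bucket` and `prefix` are placeholders: substitute your own bucket and path.
    """
    # Normalise the prefix so the resulting URI has exactly one slash on each side.
    log_dir = f"s3://{bucket}/{prefix.strip('/')}/"
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                # Turn on event logging for every Spark application.
                "spark.eventLog.enabled": "true",
                # Where running applications write their event logs.
                "spark.eventLog.dir": log_dir,
                # Where a Spark History Server reads event logs from.
                "spark.history.fs.logDirectory": log_dir,
            },
        }
    ]


if __name__ == "__main__":
    # Hypothetical bucket name, used only for illustration.
    print(json.dumps(spark_history_s3_config("my-spark-history-bucket"), indent=2))
```

The printed JSON can be supplied as the software configuration when creating a cluster. Because every cluster would write to the same bucket, retention is then governed by that bucket's own settings rather than by the 30-day limit.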