How to Use an AWS S3 Bucket for the Spark History Server

Since EMR version 5.25, it's possible to debug and monitor your Apache Spark jobs by logging directly into the off-cluster, persistent Apache Spark History Server using the EMR Console. You do not need to do anything extra to enable it, and the logs are available for active clusters and retained for 30 days after the cluster is terminated, so you can access the Spark history even after termination.

Although this is a great feature, it has some limitations: each EMR cluster keeps its logs in a different bucket, the number of active Spark History Server UIs cannot exceed 50 per AWS account, and if you want to keep the logs for more than 30 days after the cluster is terminated, you need to copy them to another bucket and then create a Spark History Server for them.

To overcome all these limitations, and to gain a more flexible way to access Spark history, you can configure Spark to send the logs to an S3 bucket.

Here is a simple configuration that you can adapt for your EMR cluster:
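A sketch of the relevant `spark-defaults` configuration classification, assuming a placeholder bucket name (`your-spark-logs-bucket`) that you would replace with your own:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.eventLog.enabled": "true",
      "spark.eventLog.dir": "s3://your-spark-logs-bucket/spark-events/",
      "spark.history.fs.logDirectory": "s3://your-spark-logs-bucket/spark-events/"
    }
  }
]
```

Pointing both `spark.eventLog.dir` (where jobs write their event logs) and `spark.history.fs.logDirectory` (where the history server reads them) at the same S3 prefix lets any cluster with this configuration share one persistent log location.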

Unfortunately, the Spark History Server requires the emrfs-hadoop-assembly JAR file to be able to access S3 buckets. If you try to launch the Spark History Server with the above configuration, it will fail with an error complaining that the EMRFS file system classes cannot be found.

So you need to create a symbolic link at /usr/lib/spark/jars/emrfs.jar pointing to the /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-X.X.X.jar file, and then restart the Spark History Server.

I wrote a simple shell script for this task.
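A sketch of such a script, assuming the upstart-style service commands used on EMR 5.x (on newer releases the history server is managed by systemd, so the restart commands may differ):

```shell
#!/bin/bash
# Find the version-specific EMRFS assembly JAR and link it into Spark's jars directory
sudo ln -s `ls /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-*` /usr/lib/spark/jars/emrfs.jar
# Restart the Spark History Server so it picks up the new JAR
sudo stop spark-history-server
sudo start spark-history-server
```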

The first line finds the correct emrfs-hadoop-assembly JAR file and creates a symbolic link to it; the second and third lines restart the Spark History Server. You need to upload this shell script to an S3 bucket, and define a CUSTOM_JAR step to launch it.
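For example, the step can be added with the AWS CLI using the region's `script-runner.jar`; the cluster ID, region, bucket, and script name below are placeholders:

```shell
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=FixSparkHistoryServer,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://your-bucket/fix-spark-history-server.sh]'
```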

If you add the above configuration and this step to your transient cluster, the logs will remain in the S3 bucket, and you will be able to view the old jobs when you create a new EMR cluster (with the same configuration).

Hope it helps. If you have any questions, please write them in the comments, and I'll try to answer.


AWS Big Data Specialist. Oracle Certified Professional (OCP) for EBS R12, Oracle 10g and 11g. Co-author of the "Expert Oracle Enterprise Manager 12c" book published by Apress. Named an Oracle ACE (in 2011) and Oracle ACE Director (in 2016) for his continuous contributions to the Oracle user community. Founding member and vice president of the Turkish Oracle User Group (TROUG). Has presented at various international conferences, including Oracle OpenWorld.


  1. Ravi

    Hi, it works except that I still have to use “s3a” instead of “s3” for the log bucket path in spark-defaults.conf. Am I missing something? Thanks.

  2. Oleksandr

    Hi Gokhan. Thanks for the post, it really helped me. Here are additional dependencies I had to add to make it work:

    sudo ln -s `ls /usr/lib/hadoop/hadoop-aws-*` /usr/lib/spark/jars/hadoop-aws.jar
    sudo ln -s `ls /usr/share/aws/aws-java-sdk/aws-java-sdk-core-*` /usr/lib/spark/jars/aws-java-sdk-core.jar
    sudo ln -s `ls /usr/share/aws/aws-java-sdk/aws-java-sdk-s3-*` /usr/lib/spark/jars/aws-java-sdk-s3.jar

    Also, it seems that the persistent Spark History Server (the EMR console "Application history" tab) does not work if we change the event log dir to S3.
