Apache Spark

Fixing Apache Spark 1.6.x false error message for slave startup

I’ve been setting up an Apache Spark standalone cluster on a bunch of Raspberry Pis for a tertiary education project.


When you’re running a slave on the same machine as the master, spinning up a slave instance works without a hitch. However, when you try to start a slave on a remote machine (even after creating a similarly named user, generating a key with ssh-keygen and copying it to the slaves with ssh-copy-id), you’ll likely run into the following error message:

node02: starting org.apache.spark.deploy.worker.Worker, logging to /srv/spark-1.6.1-bin-hadoop2.6/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-node02.out
node02: failed to launch org.apache.spark.deploy.worker.Worker:
node02: full log in /srv/spark/spark-1.6.1-bin-hadoop2.6/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-node02.out

On line two, you’ll see “failed to launch org.apache.spark.deploy.worker.Worker:”, with no error message after the colon. Even stranger, the slave/worker actually started correctly! It will show as registered on the master node (after a couple of seconds).

So, what’s going on then? There’s an error, but there isn’t an error. The truth is that there isn’t an error in starting up the slave, but there is an error in the script that starts up the slave instance.

If you open up sbin/spark-daemon.sh in your Apache Spark installation directory, you’ll find a line (167 on my installation) that says:

if [[ ! $(ps -p "$newpid" -o comm=) =~ java ]]; then

This line checks whether the slave has been started successfully on the remote node: it asks ps for the command name of the newly launched process ($newpid) and expects that name to contain java.

This is where the error lies. Java is starting up the slave instance, but through a remote command issued by the master node via ssh. This means that, when the check runs, bash is the command that’s actually executing the java command to get the slave instance up and running, so ps reports bash rather than java. The expression in the if statement above doesn’t take remote execution into account.
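To see the value that test is looking at, here is a minimal sketch, assuming a Linux machine with a procps-style ps. The helper name comm_of and the sleep commands are only for illustration and are not part of Spark. It starts one process directly and one through a bash wrapper, then prints the command name ps reports for each PID.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): print the command name
# that ps reports for a PID, the same value spark-daemon.sh inspects.
comm_of() {
  ps -p "$1" -o comm=
}

# Started directly: once running, ps reports the program itself.
sleep 2 &
direct_pid=$!

# Started through a wrapper shell: the PID we hold belongs to bash.
bash -c 'sleep 2; true' &
wrapped_pid=$!

sleep 1  # give both processes a moment to start
echo "direct:  $(comm_of "$direct_pid")"
echo "wrapped: $(comm_of "$wrapped_pid")"

wait  # let both background jobs finish
```

The directly started process should report as sleep, while the wrapped one should report as bash, which is the same kind of mismatch the java check trips over.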

A very simple solution to this problem is to modify the if statement to include bash as part of its evaluation:

if [[ ! $(ps -p "$newpid" -o comm=) =~ java|bash ]]; then
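If you want to convince yourself of what the modified expression accepts before editing the script, the following is a small sketch that applies the same regular expression to a few sample command names. The function name comm_matches and the sample names are my own assumptions for illustration; the real script feeds the expression from ps rather than from an argument.

```bash
#!/usr/bin/env bash
# Hypothetical helper: mirrors the patched test from spark-daemon.sh,
# but takes the command name as an argument instead of calling ps.
comm_matches() {
  local comm="$1"
  [[ "$comm" =~ java|bash ]]
}

# Try a few sample command names against the patched expression.
for name in java bash sshd; do
  if comm_matches "$name"; then
    echo "$name: treated as launched"
  else
    echo "$name: reported as failed to launch"
  fi
done
```

Keep in mind that the regular expression is not anchored, so any command name that merely contains java or bash passes the check as well.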

Save the file, and from now on you should get clean startup messages every time.

I’m thinking of making a pull request to the Apache Spark source to include this. I will update this post if it’s accepted.

Please leave a comment if this has helped you.

G.
