
Spark on Kubernetes First Run

Since version 2.3, Spark can run on a Kubernetes cluster. Let’s see how to do it. In this example I will use version 2.4. Prerequisites are:

  • Download and install (unzip) the corresponding Spark distribution.
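
A minimal sketch of this prerequisite step, assuming the Apache archive as download location (the exact mirror URL may differ in your case):

# Download and unpack the Spark 2.4.0 distribution built for Hadoop 2.7
$ wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
$ tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
# SPARK_HOME is used in the commands later in this post
$ export SPARK_HOME=$PWD/spark-2.4.0-bin-hadoop2.7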

For more information, there is a section on the Spark site dedicated to this use case.

Spark images

I will build and push Spark images to make them available to the K8S cluster.

The Spark distribution comes with a script (docker-image-tool.sh) that builds Spark images. You might wonder why you should build your own image rather than use one already available on a public hub? One of the main reasons is the ability to customize the image to match your Spark distribution.

Several options are available to define a repository and/or to push the images, but I will do it in two steps (a push sketch follows the image list below).

# Building the images
$ cd $SPARK_HOME
$ ./bin/docker-image-tool.sh -t v2.4.0 build
# Listing the images
$ docker images | grep spark

spark-r   v2.4.0  7257138f9086  38 hours ago  764MB
spark-py  v2.4.0  fbc77732ab07  38 hours ago  438MB
spark     v2.4.0  d952a2b3506f  38 hours ago  348MB

The 3 images built are:

  • spark: The standard Spark image
  • spark-py: The Spark image with Python bindings (including Python 2 and 3 executables)
  • spark-r: The Spark image with R bindings (including R executable)
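
If you also want the images in a remote registry (for example so that a multi-node cluster can pull them), the same script can tag and push them. A minimal sketch, assuming a hypothetical repository named myrepo:

# Build the images tagged with the repository, then push them
$ ./bin/docker-image-tool.sh -r myrepo -t v2.4.0 build
$ ./bin/docker-image-tool.sh -r myrepo -t v2.4.0 push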

Run

In this first example, I will run a Spark job in cluster mode (the driver runs on the cluster). This is a non-interactive mode, so it cannot be used to run the Spark shell or Jupyter notebooks. It is the first mode that was made available for Spark on K8S.

Master

In the following example the magic line is --master k8s://https://localhost:6443. This means that spark-submit will interact with the K8S API server.

$ k cluster-info
Kubernetes master is running at https://localhost:6443
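
Throughout this post, k is used as a shorthand for kubectl, presumably defined as a shell alias:

# Assumed alias used in the commands below
$ alias k=kubectl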

Sizing params

In the conf, I’ve changed the default settings to restrict the amount of memory (512 MB) requested for the driver and the two executors (additionally, each pod requests 1 CPU, so 3 CPUs in total).

--conf spark.executor.instances=2 \
--conf spark.driver.memory=512m \
--conf spark.executor.memory=512m \

This sizing has to be checked against the node’s allocatable resources: if the requests exceed them, the executors cannot be scheduled and the job hangs with a warning like the following.

2018-12-26 07:47:39 WARN  TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
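
If the requested resources do not fit on the node, the executor pods typically stay in Pending state (keep in mind that Spark adds a memory overhead on top of spark.executor.memory, so each pod actually requests more than the 512 MB configured). A quick way to investigate, with a placeholder pod name:

# Look for executors stuck in Pending
$ k get pods
# Check the pod's events and its actual CPU / memory requests
$ k describe pod <executor-pod-name>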

The allocatable resources can be checked as follows: here roughly 3 GB of memory and 5 CPUs.

# There is only one node in the K8S cluster shipped with Docker
$ k get node --output=json | jq ".items[0].status.allocatable"

{
  "cpu": "5",
  "ephemeral-storage": "56453061334",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "2973324Ki",
  "pods": "110"
}

Running params

Spark image

For the Spark image, I’m using the one that has been built and pushed to the repository.

--conf spark.kubernetes.container.image=spark:v2.4.0
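
If the image had been pushed to a registry instead, the image name would carry the repository prefix, e.g. with the hypothetical myrepo used above:

--conf spark.kubernetes.container.image=myrepo/spark:v2.4.0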

Executable

We can see in the Dockerfile that the examples are available in /opt/spark/examples.

COPY examples /opt/spark/examples

So I can specify the jar with the path local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar, where Spark will be able to find the class to run, org.apache.spark.examples.SparkPi.
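
To double-check that the jar really sits at that path inside the image, you can list the directory, bypassing the image entrypoint:

# Should list spark-examples_2.11-2.4.0.jar among the jars
$ docker run --rm --entrypoint ls spark:v2.4.0 /opt/spark/examples/jars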

Putting it all together

And finally here is the command to run.

$ cd $SPARK_HOME
$ ./bin/spark-submit \
    --master k8s://https://localhost:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.driver.memory=512m \
    --conf spark.executor.memory=512m \
    --conf spark.kubernetes.container.image=spark:v2.4.0 \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar

Once started, you can check the running pods (the driver and the two executors).

$ k get pods

NAME                          READY STATUS             RESTARTS  AGE
spark-pi-1545849731069-driver 1/1   Running            0         10s
spark-pi-1545849731069-exec-1 0/1   ContainerCreating  0         0s
spark-pi-1545849731069-exec-2 0/1   ContainerCreating  0         0s

At the end, once again, Pi is roughly computed :-).

$ k logs spark-pi-1545849731069-driver

2018-12-26 15:51:49 INFO  DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 1.751761 s
Pi is roughly 3.1380156900784506

Note that you can access the Driver UI by forwarding port 4040.

$ k port-forward <driver-pod-name> 4040:4040
# So, in this example
$ k port-forward spark-pi-1545849731069-driver 4040:4040
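
Finally, note that the driver pod remains in Completed state after the run (the executor pods are cleaned up automatically), so you can delete it once you no longer need its logs or UI:

$ k delete pod spark-pi-1545849731069-driver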