Spark on Kubernetes Python and R bindings

Version 2.4 of Spark introduces Python and R bindings for Kubernetes, shipped as two dedicated images:

  • spark-py: The Spark image with Python bindings (including Python 2 and 3 executables)
  • spark-r: The Spark image with R bindings (including R executable)
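
These images can be built from the Spark 2.4 distribution with the bundled docker-image-tool.sh script. A quick sketch (assuming a local Docker daemon shared with the Kubernetes cluster, so the images are visible without a registry push) that produces the base, Python, and R images in one pass:

$ cd $SPARK_HOME
$ ./bin/docker-image-tool.sh -t v2.4.0 build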

Databricks has published an article dedicated to the Kubernetes features of Spark 2.4.

The principle is exactly the same as the one explained in my previous article, but this time we use:

  • A different image: spark-py
  • Another example: local:///opt/spark/examples/src/main/python/pi.py, once again a Pi computation :-|
  • A dedicated Spark namespace: spark.kubernetes.namespace=spark

The namespace must be created first:

$ k create namespace spark
namespace "spark" created

It isolates the Spark pods from the rest of the cluster and can later be used to cap the resources available to them, as sketched below.
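
As an illustration of such a cap, a ResourceQuota applied to the namespace bounds what all the Spark pods in it can request together. A minimal sketch with arbitrary placeholder figures (note that quoting a resource requires every pod in the namespace to declare a request for it, which Spark does for CPU and memory):

$ cat <<EOF | k apply -n spark -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 4Gi
EOF

With the namespace in place, the job can be submitted: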

$ cd $SPARK_HOME
$ ./bin/spark-submit \
    --master k8s://https://localhost:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --conf spark.executor.instances=2 \
    --conf spark.driver.memory=512m \
    --conf spark.executor.memory=512m \
    --conf spark.kubernetes.container.image=spark-py:v2.4.0 \
    --conf spark.kubernetes.pyspark.pythonVersion=3 \
    --conf spark.kubernetes.namespace=spark \
    local:///opt/spark/examples/src/main/python/pi.py

spark.kubernetes.pyspark.pythonVersion is an additional (and optional) property that selects the major Python version to use (2 by default).

It sets the major Python version of the Docker image used to run the driver and executor containers, and can be either 2 or 3.
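
One way to double-check the version the containers were configured with is to look at the PySpark-related environment variables of the driver pod while it is still running (pod names are generated; take yours from the listing in the next section):

$ k exec -n spark spark-pi-1545987715677-driver -- env | grep -i pyspark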

Labels

An interesting feature that has nothing to do with Python is that Spark defines labels and applies them to the pods it creates. They make it easy to identify the role of each pod.

$ k get po -L spark-app-selector,spark-role -n spark

NAME                            READY     STATUS              RESTARTS   AGE       SPARK-APP-SELECTOR                       SPARK-ROLE
spark-pi-1545987715677-driver   1/1       Running             0          12s       spark-c4e28a2ef3d14cfda16c007383318c79   driver
spark-pi-1545987715677-exec-1   0/1       ContainerCreating   0          1s        spark-application-1545987726694          executor
spark-pi-1545987715677-exec-2   0/1       ContainerCreating   0          1s        spark-application-1545987726694          executor
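
The labels are also handy for fetching the driver logs (and the computed value of Pi) without copying the generated pod name, assuming a kubectl version that accepts a label selector for logs:

$ k logs -n spark -l spark-role=driver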

You can, for example, use the spark-role label to delete all the terminated driver pods.

# You can also switch from the default to the spark namespace:
# $ k config set-context $(kubectl config current-context) --namespace spark
$ k delete po -l spark-role=driver -n spark
# Or you can delete the whole namespace
$ k delete ns spark