Categories

Back 2 Code

Configure PySpark to connect to a Standalone Spark Cluster

In one of my previous article I talked about running a Standalone Spark Cluster inside Docker containers through the usage of docker-spark. I was using it with R Sparklyr framework. However if you want to use from a Python environment in an interactive mode (like in Jupyter notebooks where the driver runs on the local machine while the workers run in the cluster), you have several steps to follow. You need to run the same Python version on the driver and on the workers.

Spark History Server available in docker-spark

Spark comes with a history server, it provides a great UI with many information regarding Spark jobs execution (event timeline, detail of stages, etc.). Details can be found in the Spark monitoring page. I’ve modified the docker-spark to be able to run it with the docker-compose upcommand. With this implementation, its UI will be running at http://${YOUR_DOCKER_HOST}:18080. To use the Spark’s history server you have to tell your Spark driver:

Find Spark

Find Spark is an handy tool to use each time you want to switch between spark versions in Jupyter Notebooks without the need to change the SPARK_HOME environment variable.

WSJF

WSJF stands for Weighted Shortest Job First. It’s a technique used in scaled Agile framework (SAFe) to prioritise jobs—epics, features and capabilities—according to their value relative to the cost to perform it. Basically it’s a way of ranking a list of features in order to maximise the outcome—the value produced—with a constrained capacity to produce it. The job with the highest WSJF (value over the cost) is selected first for implementation.

Effective Monitoring and Alerting

A short note about this book I used in my work. First of all two good points. The first is that it deals with monitoring, alerting and reporting in general, that is to say independently of the tools used. This is both a strong point and a weak point since it could be useful to identify families of tools adapted to each use. This step back is not so common and allows to introduce higher level concepts, for example the organization of the monitoring in stacks which is absolutely crucial but also notions and general definitions applicable in all circumstances - or almost.