dplyr & sparklyr usage
In this example, I want to show how the sparklyr package makes it possible to use the same syntax for local computing and for distributed computing.
To do that, I will use the nycflights13 dataset (one of the datasets used in the sparklyr demo) to check whether the number of flights per day varies with the period of the year (the month).
Spoiler: it varies, but not by much.
Using tidyverse tools
To perform this computation on your laptop in R, the best way to go is the tidyverse packages.
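Here is a minimal sketch of the local version with dplyr and ggplot2. The column names come from nycflights13::flights; the exact aggregation (average number of flights per day, for each month) is my reading of the analysis described above, not necessarily the original code:

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Average number of flights per day, for each month of 2013
flights_per_month <- flights %>%
  group_by(month, day) %>%
  summarise(n_flights = n(), .groups = "drop") %>%
  group_by(month) %>%
  summarise(avg_flights_per_day = mean(n_flights))

ggplot(flights_per_month, aes(x = factor(month), y = avg_flights_per_day)) +
  geom_col() +
  labs(x = "Month", y = "Average flights per day")
```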
Using sparklyr
The beauty of the sparklyr package is that you can reuse (almost) the same code to scale up and run the computation on millions of rows (for example, all the flights over several years) in a Spark cluster.
I say almost because the only difference is a call to collect(). This call retrieves the computed data to the driver (which, in this case, runs on your machine) so that it can be plotted. Even with a dataset containing millions of rows this is not a problem, since the result will always contain only 12 rows (one for each month).
This is a great benefit for data scientists, who do not have to learn a new language or framework. That is not the case in Python: if you use pandas on your laptop, you will have to completely rewrite your code in PySpark to benefit from distributed computing in Spark.
In this first example, I'm running Spark locally (master = "local").
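And here is a sketch of the (almost) identical sparklyr version, under the same assumptions as the dplyr code above. Everything up to collect() is translated to Spark SQL and executed by Spark; only the 12-row result travels back to the driver. The copy_to() call is just a convenience for the demo, since nycflights13 fits in memory anyway:

```r
library(sparklyr)
library(dplyr)
library(ggplot2)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Ship the demo data into Spark (in a real use case,
# the data would already live in the cluster)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

flights_per_month <- flights_tbl %>%
  group_by(month, day) %>%
  summarise(n_flights = n()) %>%
  group_by(month) %>%
  summarise(avg_flights_per_day = mean(n_flights)) %>%
  collect()  # the only difference: bring the 12-row result back to the driver

ggplot(flights_per_month, aes(x = factor(month), y = avg_flights_per_day)) +
  geom_col() +
  labs(x = "Month", y = "Average flights per day")

spark_disconnect(sc)
```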
Spark standalone cluster
Now I'm doing the same thing on a Spark standalone cluster, in order to see the steps that are run to perform the computation. I'm using the same standalone Spark cluster described in my previous article.
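The only change is the connection: instead of master = "local", point sparklyr at the master URL of the standalone cluster. The spark://... address below is an example value (typical of a docker-spark setup); adjust it to your environment:

```r
# Connect to the standalone cluster instead of a local Spark;
# the master URL is an example, use the one exposed by your cluster
sc <- spark_connect(master = "spark://localhost:7077")
```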
In the History Server UI (cf. Spark History Server, available in docker-spark, to see how to make it work), we can see all the steps performed for the computation. We can also notice that the two workers are fired up.