I thought it would be instructive to measure the performance boost brought by Spark 2.0 using one of the Zeppelin notebooks from our big data training sessions, all of which were created with Spark 1.6.2.

This particular notebook deals with machine learning: it explores how Spark helps us tune hyperparameters using the ParamGridBuilder and CrossValidator classes.

Having Spark try and evaluate combinations of hyperparameters for us is very convenient, but it is also very CPU- and memory-intensive when learning from a 5-million-line dataset.
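To give an idea of what such a tuning step looks like, here is a minimal sketch written against the spark.ml API. It is not the code of the training notebook: the choice of a decision tree classifier, the `label` and `features` column names, the grid values, the number of folds and the `tuneTree` helper itself are all assumptions made for illustration.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: `training` is assumed to hold an indexed "label"
// column and a vector "features" column (assumed names).
def tuneTree(training: DataFrame): CrossValidatorModel = {
  val tree = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")

  // 3 depths x 2 bin counts = 6 hyperparameter combinations (illustrative values).
  val paramGrid = new ParamGridBuilder()
    .addGrid(tree.maxDepth, Array(5, 10, 15))
    .addGrid(tree.maxBins, Array(32, 64))
    .build()

  // Every combination is fitted and evaluated once per fold.
  val cv = new CrossValidator()
    .setEstimator(tree)
    .setEvaluator(new MulticlassClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  cv.fit(training)
}
```

With these illustrative settings, Spark fits 6 x 3 = 18 models during cross-validation, then one more on the full training set with the best combination. This multiplication is the reason the phase is so demanding on a large dataset.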

This workload is of course a strong reason to benchmark the new Spark version, and I also found this “suggested correction notebook” interesting for three other reasons:

  • it is written according to the good practices described in the Spark V1.6.2 ML guide,
  • it uses real-world data and addresses a real-world problem,
  • it was written without taking into account the improvements brought by Spark 2.0.

To run correctly on V2.0.0, only one line had to be changed among the 100 lines of Scala code in the Spark V1.6.2 notebook. Using the cluster configuration we gave to each trainee (4 slaves, each with 2 CPUs and 8 GB of RAM), here are the measured results:

Duration using Spark version 1.6 versus 2.0

Step                 v1.6.2      v2.0.0      Delta
Reading data         5.9 s       4.0 s       -33%
Machine learning     1308.0 s    1251.3 s    -4.3%
Testing prediction   74.3 s      58.7 s      -21%

The reading phase includes the manipulation of dataframes in order to prepare the data for the next two phases. This first phase comprises a lot of dataframe filtering and some aggregation; as expected the new version improved efficiency on this part.

Also as expected, the new version does not significantly improve the performance of the machine learning phase, in which the hyperparameters of the learning tree are tuned.

The testing phase consists of measuring prediction accuracy and displaying graphs based on data not used during the machine learning phase. Since this phase is coded in Spark SQL, it is not surprising to observe an improvement.
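For readers who want to reproduce this kind of comparison, the Delta column is simply the relative change in duration between the two versions. The sketch below shows one way to time a step and compute that change in plain Scala. The `BenchmarkDelta` object and its `timed` and `deltaPercent` helpers are hypothetical names, not taken from the notebook. The durations are the rounded values from the table, so a recomputed delta may differ slightly from the published one (the first step comes out close to -32%).

```scala
object BenchmarkDelta {
  // Relative change of `after` compared to `before`, in percent.
  // A negative value means the step got faster.
  def deltaPercent(before: Double, after: Double): Double =
    (after - before) / before * 100.0

  // Runs a block once and returns its result with the elapsed wall-clock seconds.
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    // (step, seconds on v1.6.2, seconds on v2.0.0), as rounded in the table.
    val steps = Seq(
      ("Reading data", 5.9, 4.0),
      ("Machine learning", 1308.0, 1251.3),
      ("Testing prediction", 74.3, 58.7)
    )
    for ((name, v16, v20) <- steps)
      println(f"$name%-20s ${deltaPercent(v16, v20)}%.1f%%")
  }
}
```

In a notebook, a step could be wrapped as `val (result, seconds) = BenchmarkDelta.timed { ... }`. Since Spark evaluates lazily, the wrapped block needs to end with an action (a count or a fit, for example) for the timing to be meaningful.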

To conclude on this particular example, the new Zeppelin and Spark versions give us a small speed boost, perhaps not as large as expected, which shows that not all code will get a large boost when switching to Spark 2.0.