Thanks to Reynold Xin's talk last Thursday, we now have confirmation of what the big improvements are.

Improvements are numerous in this major version, which should be available for download in early June. Let's take a look at the direct impact on the SQL query engine. Two years ago, it was just a nice feature that let you use a subset of SQL. Things have moved on: Spark 2.0 will handle most of SQL 2003 and, beyond that, will be able to run all 99 TPC-DS queries (an industry-standard SQL benchmark). As before, you can mix SQL queries with regular code (Java, Python, Scala), but as the SQL engine extends its coverage and its efficiency, Spark users will tend to write more SQL and fewer lines in other languages.
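To make that mixing concrete, here is a minimal Scala sketch using the new SparkSession entry point introduced in Spark 2.0. The input file, view name, and column names (events.json, events, country, hits) are hypothetical, chosen only for illustration:

    import org.apache.spark.sql.SparkSession

    object MixedSqlExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("mixed-sql-example")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical input file, for illustration only.
        val events = spark.read.json("events.json")
        events.createOrReplaceTempView("events")

        // Plain SQL against the registered view...
        val hitsByCountry = spark.sql(
          """SELECT country, COUNT(*) AS hits
            |FROM events
            |GROUP BY country""".stripMargin)

        // ...chained with ordinary DataFrame code in the same program.
        hitsByCountry.filter($"hits" > 100)
          .orderBy($"hits".desc)
          .show()

        spark.stop()
      }
    }

Both halves go through the same optimizer, so choosing SQL or DataFrame code is a matter of readability, not performance.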

The Spark 1.x line earned its reputation for speed from its in-memory distributed DataFrames. Spark 2.x will keep shrinking the time needed to get the result of a SQL query. The gain comes less from changes in how memory is used than from a major change in how SQL queries are compiled before the code is sent to the cluster: whole-stage code generation largely replaces the traditional iterator-based evaluation (standard even outside the Hadoop/Spark world). The effect is comparable in spirit to tail-call elimination in functional programming; the reduction impacts (see the sketch after this list):

  • the number of memory accesses,
  • the effectiveness of CPU caching,
  • the number of CPU cycles.
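
This is not Spark's actual generated code, only a hand-written Scala sketch of the principle: a chain of iterators versus the same pipeline fused into one tight loop, which is what whole-stage code generation aims for:

    // Volcano-style: each operator is an iterator pulling rows from its
    // child. Every row pays for virtual next() calls and intermediate
    // iterator objects.
    def volcanoStyle(rows: Iterator[Int]): Int =
      rows.filter(_ > 10)  // one iterator layer
        .map(_ * 2)        // wraps another iterator layer
        .sum               // pulls each row through the whole chain

    // Fused style, in the spirit of whole-stage code generation: the same
    // pipeline collapsed into a single loop over primitives.
    def fusedStyle(rows: Array[Int]): Int = {
      var acc = 0
      var i = 0
      while (i < rows.length) {
        val v = rows(i)
        if (v > 10) acc += v * 2
        i += 1
      }
      acc
    }

The fused loop keeps values in registers and cache and avoids a virtual call per row, which is exactly where the savings in memory accesses and CPU cycles come from.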

To make it more tangible, Reynold Xin presented the following comparison chart:

[Comparison chart from the BrightTALK webcast, time index 0:49:00]

So Spark SQL is moving closer to the SQL 2003 standard and is making a quantum leap in speed, roughly 5x to 20x, which is huge when you remember that Spark 1.0 was already 2x to 100x faster than HiveQL before it.

In future posts, I will write about other improvements coming with Spark 2.0.