Using Scala code in PySpark applications
While there are solid reasons to develop Spark applications using the Python API, it is undeniable that Scala is Spark’s native tongue. If you need a feature unsupported by PySpark, or just want to use a Scala library in your Python application, this post will show how to mix the two and get the best of both worlds.
Calling Scala code in PySpark applications
PySpark sets up a gateway between the Python interpreter and the JVM, Py4J, which can be used to move Java objects around. Let’s code up the simplest of Scala objects (names here are illustrative):
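```scala
// A minimal sketch; the package and object names are placeholders.
package com.example.hello

object Hello {
  def hello(name: String): String = s"Hello, $name!"
}
```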
We then build this and package it as a JAR using a tool such as Maven or sbt:
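```bash
# With sbt, for example; the artifact name and path depend on your build definition
sbt package
# => target/scala-2.11/hello_2.11-1.0.jar
```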
We are now able to launch the pyspark shell with this JAR on the --driver-class-path. Depending on the code, we may also need to submit it in the --jars argument:
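```bash
# JAR name is illustrative; use whatever your build produced
pyspark --driver-class-path hello_2.11-1.0.jar --jars hello_2.11-1.0.jar
```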
We can access our package through the _jvm attribute of the SparkContext (sc):
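```python
# Inside the pyspark shell; sc is the SparkContext the shell provides.
# The package path matches the hypothetical Hello object defined above.
greeting = sc._jvm.com.example.hello.Hello.hello("PySpark")
print(greeting)  # Hello, PySpark!
```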
Voilà, we called our first Scala method from PySpark!
Real projects are never that simple
In an actual project, a couple of things are likely to differ from the simple example above, each introducing a bit of complexity:
Scala code with dependencies on external libraries.
In this case, I couldn’t always succeed by simply packaging my Scala code and submitting the PySpark job with the dependencies in --packages. The foolproof way is to package a fat JAR that also contains your Scala dependencies, which we can accomplish with sbt-assembly.
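A minimal sketch of wiring that up (the plugin version shown is illustrative):

```scala
// project/plugins.sbt: registers the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
```

Running sbt assembly then produces a single fat JAR that can be submitted as shown above.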
Passing Spark objects around.
Spark objects must be explicitly boxed/unboxed into Java objects when passing them between the two environments. A few common examples are:
- SparkContext
If your Scala code needs access to the SparkContext (sc), your Python code must pass sc._jsc, and your Scala method should receive a JavaSparkContext parameter and unbox it to a Scala SparkContext.
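For instance, a hypothetical Scala entry point (object and method names are made up) might look like:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaSparkContext

object ContextExample {
  // Receives the boxed context from Python and unboxes it
  def countTo(jsc: JavaSparkContext, n: Int): Long = {
    val sc: SparkContext = jsc.sc
    sc.parallelize(1 to n).count()
  }
}
```

which Python code could then invoke as sc._jvm.ContextExample.countTo(sc._jsc, 100).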
- SQLContext
The Scala SQLContext can be passed from Python by sending sqlContext._ssql_ctx. This will be usable without any transformation on the Scala side.
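As a sketch, a hypothetical method that uses it directly:

```scala
import org.apache.spark.sql.SQLContext

object SqlExample {
  // The _ssql_ctx sent from Python arrives as a plain Scala SQLContext
  def tableNames(sqlContext: SQLContext): Array[String] =
    sqlContext.tableNames()
}
```

called from Python as sc._jvm.SqlExample.tableNames(sqlContext._ssql_ctx).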
- RDDs
You can pass them from Python to Scala via rdd._jrdd. On the Scala side, a JavaRDD (jrdd) can be unboxed by accessing jrdd.rdd. When converting it back to Python, one can do:
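```python
# One option is PySpark's internal helper _java2py; being internal,
# it is not a stable API and only handles types its SerDe understands.
from pyspark.mllib.common import _java2py

python_rdd = _java2py(sc, jrdd)  # jrdd: the JavaRDD returned from Scala
```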
- DataFrames
To send a DataFrame (df) from Python, one must pass the df._jdf attribute. When returning a Scala DataFrame back to Python, it can be converted on the Python side by:
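```python
from pyspark.sql import DataFrame

# jdf is the Java DataFrame handle returned by the Scala method
python_df = DataFrame(jdf, sqlContext)
```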
DataFrames can also be moved around by using registerTempTable and accessing them through the sqlContext.
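A sketch of that pattern (the table name is arbitrary):

```python
# Python side: register the DataFrame under a name visible to the shared session
df.registerTempTable("shared_df")
```

On the Scala side, the same sqlContext (passed as described above) can then retrieve it with sqlContext.table("shared_df").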