Interacting With HDFS from PySpark
One often needs to perform HDFS operations from a Spark application, whether it is to list files in HDFS or to delete data. Because accomplishing this is not immediately obvious with the Python Spark API (PySpark), a few ways to execute such commands are presented below.
Using the Java Gateway
Even with Python applications, Spark relies on the JVM, using Py4J so that Python code can interface with JVM objects. Py4J maintains a gateway between the Python interpreter and the JVM, which is accessible from your application’s SparkContext object (sc below).
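A minimal sketch follows, assuming a running SparkContext named sc; it reaches Hadoop’s FileSystem Java API through the gateway (these classes ship with Spark’s Hadoop dependencies), and the path used is a placeholder:

```python
# Reach the JVM-side Hadoop classes through the Py4J gateway.
hadoop = sc._jvm.org.apache.hadoop

# Get a FileSystem handle from the active Hadoop configuration.
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

# The path below is a placeholder; adapt it to your cluster.
path = hadoop.fs.Path('/some/path')

# List the directory's contents.
for status in fs.listStatus(path):
    print(status.getPath())

# Delete a path; the boolean flag enables recursive deletion.
# fs.delete(path, True)
```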
While this strategy doesn’t look very elegant, it is useful because it does not require any third-party libraries.
Third-party libraries
If using external libraries is not an issue, another way to interact with HDFS from PySpark is simply to use a plain Python library. Examples are the hdfs lib and snakebite from Spotify.
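With hdfs, which talks to the cluster over the WebHDFS REST API, listing and deleting paths might look like the sketch below; the endpoint URL, user, and paths are assumptions to adapt to your setup:

```python
from hdfs import InsecureClient

# WebHDFS endpoint and user are assumptions; adjust to your cluster.
client = InsecureClient('http://namenode:50070', user='hdfs')

# List the contents of a directory (the path is a placeholder).
print(client.list('/some/path'))

# Delete a file; recursive=True would remove a non-empty directory.
client.delete('/some/path/file.txt', recursive=False)
```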
or with snakebite:
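A similar sketch; snakebite speaks the NameNode’s RPC protocol directly, and the host, port, and paths below are again assumptions:

```python
from snakebite.client import Client

# NameNode host and RPC port are assumptions; adjust to your cluster.
client = Client('namenode', 8020)

# ls() takes a list of paths and yields a dict per entry.
for entry in client.ls(['/some/path']):
    print(entry['path'])

# delete() also takes a list of paths; recurse=True for directories.
for result in client.delete(['/some/path/file.txt'], recurse=False):
    print(result)
```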
Shell subprocesses
For completeness’ sake, this section shows how to accomplish HDFS interaction directly through Python’s subprocess module, which allows Python to call arbitrary shell commands.
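For instance, listing a directory by shelling out to the hdfs CLI might look like the following sketch; it assumes hdfs is on the PATH of the machine running the driver, and the path is a placeholder:

```python
import subprocess

# Build the shell command; the path is a placeholder and `hdfs` must
# be on the PATH of the machine running the Spark driver.
cmd = ['hdfs', 'dfs', '-ls', '/some/path']

# Run the command and capture its output and error streams.
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()

# Each output line describes one entry in the directory.
for line in out.decode().splitlines():
    print(line)
```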
If this was helpful to you, you might also enjoy my Data Engineering Resources post!