Interacting With HDFS from PySpark
One often needs to perform HDFS operations from a Spark application, be it to list files in HDFS or delete data. Because accomplishing this is not immediately obvious with the Python Spark API (PySpark), a few ways to execute such commands are presented below.
Using the Java Gateway
Even with Python applications, Spark relies on the JVM, using Py4J to execute Python code that can interface with JVM objects. Py4J uses a gateway between the JVM and the Python interpreter, which is accessible from your application’s SparkContext (sc below) object:
######
# Get fs handler from java gateway
######
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(URI("hdfs://somehost:8020"), sc._jsc.hadoopConfiguration())
# We can now use the Hadoop FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
fs.listStatus(Path('/user/hive/warehouse'))
# or
fs.delete(Path('some_path'))
While this approach isn’t the most elegant, it is useful because it does not require any third-party libraries.
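If the gateway is needed in several places, the boilerplate above can be wrapped in a small helper. A minimal sketch, assuming a live SparkContext is passed in (the function name get_hdfs_filesystem is illustrative, not part of any API):

```python
def get_hdfs_filesystem(sc, hdfs_uri):
    """Return a Hadoop FileSystem handle and the Path class for hdfs_uri.

    sc is a live SparkContext; hdfs_uri is e.g. 'hdfs://somehost:8020'.
    """
    # Resolve the JVM classes through the Py4J gateway
    URI = sc._gateway.jvm.java.net.URI
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    # Build the FileSystem handle against the application's Hadoop configuration
    fs = FileSystem.get(URI(hdfs_uri), sc._jsc.hadoopConfiguration())
    return fs, Path
```

With this in place, fs, Path = get_hdfs_filesystem(sc, 'hdfs://somehost:8020') keeps the Py4J plumbing in one spot.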
Third party libraries
If using external libraries is not an issue, another way to interact with HDFS from PySpark is to use a plain Python library. Examples are the hdfs library, or snakebite from Spotify:
from hdfs import Config
# The following assumes you have an hdfscli.cfg file defining a 'dev' client.
client = Config().get_client('dev')
files = client.list('the_dir_path')
or
from snakebite.client import Client
client = Client(hdfs_hostname, hdfs_port)
client.delete('/some-path', recurse=True)
Shell subprocesses
For completeness’ sake, this section shows how to interact with HDFS directly through Python’s subprocess module, which allows Python to call arbitrary shell commands.
import subprocess
cmd = 'hdfs dfs -ls /user/path'.split()  # cmd must be a list of arguments
# check_output returns bytes in Python 3, so decode before splitting
files = subprocess.check_output(cmd).decode().strip().split('\n')
for path in files:
    print(path)
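Note that hdfs dfs -ls prints full listing rows (permissions, owner, size, timestamps, path) plus a "Found N items" header, not bare paths, so a little parsing is needed to extract the path column. A sketch of that parsing, using a made-up sample of ls output for illustration (it takes the last whitespace-separated field, so paths containing spaces would need more careful handling):

```python
def parse_hdfs_ls(output):
    """Extract the path (last column) from each `hdfs dfs -ls` listing row.

    Skips the 'Found N items' header line that hdfs dfs -ls prints first.
    """
    paths = []
    for line in output.strip().split('\n'):
        if line.startswith('Found'):
            continue  # header line, e.g. 'Found 2 items'
        # The path is the last whitespace-separated field of each row
        paths.append(line.split()[-1])
    return paths

# Made-up sample of what check_output might return (after .decode())
sample = """Found 2 items
drwxr-xr-x   - hive hadoop          0 2020-01-01 00:00 /user/path/a
-rw-r--r--   3 hive hadoop       1234 2020-01-01 00:00 /user/path/b.txt"""
print(parse_hdfs_ls(sample))  # ['/user/path/a', '/user/path/b.txt']
```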
If this was helpful to you, you might also enjoy my Data Engineering Resources post!