Interacting With HDFS from PySpark

One often needs to perform HDFS operations from a Spark application, whether to list files in HDFS or to delete data. Because accomplishing this is not immediately obvious with the Python Spark API (PySpark), a few ways to execute such commands are presented below.

Using the Java Gateway

Even for Python applications, Spark relies on the JVM, using Py4J so that Python code can interface with JVM objects. Py4J provides a gateway between the Python interpreter and the JVM, accessible from your application’s SparkContext object (sc below):

######
# Get fs handler from java gateway
######
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(URI("hdfs://somehost:8020"), sc._jsc.hadoopConfiguration())

# We can now use the Hadoop FileSystem API
# (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
fs.listStatus(Path('/user/hive/warehouse'))
# or, deleting recursively (the second argument)
fs.delete(Path('some_path'), True)
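Since fs is a JVM object, the values it returns are JVM objects too, and their Java methods can be called directly from Python. As a small sketch (the paths below are placeholders), you can inspect listing results or guard a delete:

# FileStatus objects expose their Java methods (getPath, getLen, ...)
for status in fs.listStatus(Path('/user/hive/warehouse')):
    print(status.getPath().getName(), status.getLen())

# exists() avoids deleting a path blindly
if fs.exists(Path('some_path')):
    fs.delete(Path('some_path'), True)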

While this strategy doesn’t look too elegant, it is useful because it does not require any third-party libraries.

Third-party libraries

If using external libraries is not an issue, another way to interact with HDFS from PySpark is simply to use a plain Python HDFS client library, such as hdfs or Spotify’s snakebite:

from hdfs import Config

# The following assumes you have an hdfscli.cfg file defining a 'dev' client.
client = Config().get_client('dev')
files = client.list('the_dir_path')
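If you’d rather not maintain an hdfscli.cfg file, the hdfs library also lets you point a client straight at the WebHDFS endpoint. A minimal sketch, assuming WebHDFS is enabled on your cluster; the host, port, user, and paths below are placeholders:

from hdfs import InsecureClient

# Connect directly to the namenode's WebHDFS endpoint (no config file needed)
client = InsecureClient('http://somehost:50070', user='hadoop')

client.makedirs('/tmp/example_dir')           # create a directory
print(client.status('/tmp/example_dir'))      # stat a path
client.delete('/tmp/example_dir', recursive=True)  # recursive delete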

or

from snakebite.client import Client

# hdfs_hostname / hdfs_port are placeholders for your namenode's RPC endpoint
client = Client(hdfs_hostname, hdfs_port)
# snakebite expects a list of paths and returns a lazy generator,
# so consume it to actually perform the delete
list(client.delete(['/some-path'], recurse=True))
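Note that snakebite evaluates commands lazily: client methods take lists of paths and return generators, so nothing happens until the generator is consumed. A short, self-contained sketch with placeholder connection details:

from snakebite.client import Client

# hostname and port are placeholders for your namenode's RPC endpoint
client = Client('somehost', 8020)

# ls() yields one dict per entry; iterating is what triggers the RPC
for entry in client.ls(['/user']):
    print(entry['path'], entry['length'])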

Shell subprocesses

For completeness, this section shows how to accomplish HDFS interaction directly through Python’s subprocess facilities, which allow Python to call arbitrary shell commands.

import subprocess

cmd = 'hdfs dfs -ls /user/path'.split()  # cmd must be a list of arguments
# check_output returns bytes on Python 3, so decode before splitting on newlines
files = subprocess.check_output(cmd).decode().strip().split('\n')
for path in files:
    print(path)
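On Python 3.7+, a sketch using subprocess.run makes the decoding and error handling explicit (the path is a placeholder); check=True raises an exception if the hdfs command exits non-zero instead of failing silently:

import subprocess

# text=True returns str instead of bytes; check=True raises CalledProcessError
# on a non-zero exit code
result = subprocess.run(['hdfs', 'dfs', '-ls', '/user/path'],
                        capture_output=True, text=True, check=True)
for line in result.stdout.strip().split('\n'):
    print(line)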

If this was helpful to you, you might also enjoy my Data Engineering Resources post!

Diogo Franco

I love data, distributed systems, machine learning, code and science!