Avoiding Multiple Joins On Similar Columns

Let us assume that a Hive table contains various similar columns:

product_id	image_name1	image_name2	…	image_name15
1	a.jpg	b.jpg	…	c.jpg
2	d.jpg	e.jpg	…	f.jpg

In order to join on all image_names from this table, the naive approach would be to perform 15 different joins.

Various SQL dialects have non-standard ways of dealing with this - SQL server, for instance, has the unpivot operator. HiveQL does not support such operators, but it is possible to accomplish something similar by constructing a map with all image_name columns, and using the built-in UDTF explode:

SELECT x.product_id, x.image_name
FROM (
  SELECT product_id, map(
  	"image_name1", image_name1,
  	"image_name2", image_name2,
  	..., 
  	"image_name15", image_name15 ) as image_map
  FROM db.table) x
LATERAL VIEW explode(image_map) expl as image_nr, image_name

This returns a table that allows a single join on all image_names to be performed:

product_id	image_name
1	a.jpg
1	b.jpg
1	c.jpg
2	d.jpg
2	e.jpg
2	f.jpg

Diogo Franco

I love data, distributed systems, machine learning, code and science!

Avoiding Multiple Joins On Similar Columns

Diogo Franco

Recent posts

Newsletter