Avoiding Multiple Joins On Similar Columns

Let us assume that a Hive table contains various similar columns:

product_id image_name1 image_name2 image_name15
1 a.jpg b.jpg c.jpg
2 d.jpg e.jpg f.jpg

In order to join on all image_names from this table, the naive approach would be to perform 15 different joins.

Various SQL dialects have non-standard ways of dealing with this - SQL server, for instance, has the unpivot operator. HiveQL does not support such operators, but it is possible to accomplish something similar by constructing a map with all image_name columns, and using the built-in UDTF explode:

SELECT x.product_id, x.image_name
FROM (
  SELECT product_id, map(
  	"image_name1", image_name1,
  	"image_name2", image_name2,
  	..., 
  	"image_name15", image_name15 ) as image_map
  FROM db.table) x
LATERAL VIEW explode(image_map) expl as image_nr, image_name

This returns a table that allows a single join on all image_names to be performed:

product_id image_name
1 a.jpg
1 b.jpg
1 c.jpg
2 d.jpg
2 e.jpg
2 f.jpg
Diogo Franco

Diogo Franco

I work in Big Data at Farfetch, and I love data, distributed systems, machine learning, code and science!