Avoiding Multiple Joins On Similar Columns
Let us assume that a Hive table contains various similar columns:
product_id | image_name1 | image_name2 | … | image_name15 |
---|---|---|---|---|
1 | a.jpg | b.jpg | … | c.jpg |
2 | d.jpg | e.jpg | … | f.jpg |
In order to join on all image_names from this table, the naive approach would be to perform 15 different joins.
Various SQL dialects have non-standard ways of dealing with this - SQL server, for instance, has the unpivot
operator. HiveQL does not support such operators, but it is possible to accomplish something similar by constructing a map
with all image_name columns, and using the built-in UDTF explode
:
This returns a table that allows a single join on all image_names to be performed:
product_id | image_name |
---|---|
1 | a.jpg |
1 | b.jpg |
1 | c.jpg |
2 | d.jpg |
2 | e.jpg |
2 | f.jpg |