Join on element inside array

Question

I have two dataframes where I have to use a value of one dataframe to filter on the second dataframe using that value.

For example, below are the datasets

import pyspark
from pyspark.sql import Row

cust = spark.createDataFrame([Row(city='hyd',cust_id=100),
                              Row(city='blr',cust_id=101),
                              Row(city='chen',cust_id=102),
                              Row(city='mum',cust_id=103)])

item = spark.createDataFrame([Row(item='fish',geography=['london','a','b','hyd']),
                              Row(item='chicken',geography=['a','hyd','c']),
                              Row(item='rice',geography=['a','b','c','blr']),
                              Row(item='soup',geography=['a','kol','simla']),
                              Row(item='pav',geography=['a','del']),
                              Row(item='kachori',geography=['a','guj']),
                              Row(item='fries',geography=['a','chen']),
                              Row(item='noodles',geography=['a','mum'])])

cust dataset output:

+----+-------+
|city|cust_id|
+----+-------+
| hyd|    100|
| blr|    101|
|chen|    102|
| mum|    103|
+----+-------+

item dataset output:

+-------+------------------+
|   item|         geography|
+-------+------------------+
|   fish|[london, a, b,hyd]|
|chicken|       [a, hyd, c]|
|   rice|    [a, b, c, blr]|
|   soup|   [a, kol, simla]|
|    pav|          [a, del]|
|kachori|          [a, guj]|
|  fries|         [a, chen]|
|noodles|          [a, mum]|
+-------+------------------+

I need to use the city column values from cust dataframe to get the items from the item dataset. The final output should be:

+----+---------------+-------+
|city|          items|cust_id|
+----+---------------+-------+
| hyd|[fish, chicken]|    100|
| blr|         [rice]|    101|
|chen|        [fries]|    102|
| mum|      [noodles]|    103|
+----+---------------+-------+

ZygD · Accepted Answer · 2022-04-19 09:07:06Z

1

Before the join I would explode the array column. Then, collect_list aggregation can move all items to one list.

from pyspark.sql import functions as F

df = cust.join(item.withColumn('city', F.explode('geography')), 'city', 'left')
df = (df.groupBy('city', 'cust_id')
        .agg(F.collect_list('item').alias('items'))
        .select('city', 'items', 'cust_id')
)
df.show(truncate=False)
#+----+---------------+-------+
#|city|items          |cust_id|
#+----+---------------+-------+
#|blr |[rice]         |101    |
#|chen|[fries]        |102    |
#|hyd |[fish, chicken]|100    |
#|mum |[noodles]      |103    |
#+----+---------------+-------+

edited Apr 19, 2022 at 9:07

answered Apr 19, 2022 at 8:31

ZygD

24.8k41 gold badges107 silver badges144 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

wwnde · Accepted Answer · 2022-04-19 10:07:04Z

0

new = (
  #join the two columns on city
  item.withColumn('city',explode(col('geography')))
  .join(cust,how='left',on='city')
  #drop null rows and unwanted column
  .dropna().drop('geography')
  #groupby for the outcome
.groupby('city','cust_id').agg(collect_list('item').alias('items'))
)

new.show()

+----+---------------+-------+
|city|      items|cust_id|
+----+---------------+-------+
| blr|         [rice]|    101|
|chen|        [fries]|    102|
| hyd|[fish, chicken]|    100|
| mum|      [noodles]|    103|
+----+---------------+-------+

edited Apr 19, 2022 at 10:07

answered Apr 19, 2022 at 9:16

wwnde

26.7k6 gold badges22 silver badges38 bronze badges

Collectives™ on Stack Overflow

Join on element inside array

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related