Changing Multiple Values in PySpark Dataframe

Question

The sample of the dataset I am working on:

test = sqlContext.createDataFrame([(1,2),
                                   (1,3),
                                   (4,5)],
                                  ['cod_item_2','alter_cod'])

test_2 = sqlContext.createDataFrame([(1,"shamp_1"),(2,"shamp_2"),
                                     (4,"tire_1"),(5,"tire_2"),
                                     (3,"shamp_3"),(6,"cookie"),
                                     (7,"flower"),(8,"water")],
                                    ['cod_item','product_name'])

The first dataframe maps each representative item (cod_item_2) to the items that are equivalent to it (alter_cod).

The second dataframe contains all items and product names.

I want to use the first dataframe to find the items in the second dataframe that have an equivalent, and replace each one with the item that represents it (the item in the left column of the first dataframe). For example, items 2 and 3 should become item 1 (shamp_1), and item 5 should become item 4 (tire_1).

I tried doing a full join on both dataframes and using a when clause to change the values, but it did not work.

ZygD · Accepted Answer · 2022-07-01 19:26:21Z


You can do two joins: first join test_2 with test, then join the result with test_2 again (a self join). For this to work reliably, I alias the dataframes so that the two copies of test_2 can be told apart.

from pyspark.sql import functions as F

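test_3 = (
    test_2.alias('a')
    # 1st join: attach the representative code (cod_item_2) to items listed in alter_cod
    .join(test, F.col("a.cod_item") == F.col("alter_cod"), "left")
    # 2nd join (self join): look up the representative's own row in test_2
    .join(test_2.alias('b'), F.col("cod_item_2") == F.col("b.cod_item"), "left")
    .select(
        # use the representative's values when there is one, else keep the original
        F.coalesce("b.cod_item", "a.cod_item").alias("cod_item"),
        F.coalesce("b.product_name", "a.product_name").alias("product_name")
    )
)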
test_3.show()
# +--------+------------+
# |cod_item|product_name|
# +--------+------------+
# |       4|      tire_1|
# |       1|     shamp_1|
# |       1|     shamp_1|
# |       4|      tire_1|
# |       7|      flower|
# |       6|      cookie|
# |       1|     shamp_1|
# |       8|       water|
# +--------+------------+


3 Comments

Jplavorr Over a year ago

If the test_2 dataframe is a shopping table (e.g. the item shamp_1 appears in it repeatedly), the joins end up repeating the shamp_1 value even more (e.g. if it has 3 shamp_1 items, the join results in 9 items instead of 5). How can I fix this?

ZygD Over a year ago

Before creating test_3, remove duplicates from test_2: test_2 = test_2.distinct()

Jplavorr Over a year ago

I understand. I ended up specifying the problem incorrectly, sorry about that. The correct setup would have another column, customer_ID. In that case I can't drop duplicate items, because I would lose purchase information. But thanks so much for the help so far.
