0

The sample of the dataset I am working on:

test = sqlContext.createDataFrame([(1,2),
                                   (1,3),
                                   (4,5)],
                                  ['cod_item_2','alter_cod'])

enter image description here

test_2 = sqlContext.createDataFrame([(1,"shamp_1"),(2,"shamp_2"),
                                     (4,"tire_1"),(5,"tire_2"),
                                     (3,"shamp_3"),(6,"cookie"),
                                     (7,"flower"),(8,"water")],
                                    ['cod_item','product_name'])

enter image description here

The first dataframe contains items and items that are equivalent to them.

The second dataframe contains all items and product names.

I want to use the first dataframe to pull out the items that are equivalent to the second dataframe and replace with the item that represents them (the item on the left side of the first table), where the result is as follows:

enter image description here

I tried doing a full join on both dataframes and using the when clause to change the values. But it ended up not working.

1 Answer 1

2

You can do 2 joins. test_2 with test and then again with test_2 (self join). For this to work reliably I use alias on dataframes.

from pyspark.sql import functions as F

test_3 = (
    test_2.alias('a')
    .join(test, F.col("a.cod_item") == F.col("alter_cod"), "left")
    .join(test_2.alias('b'), F.col("cod_item_2") == F.col("b.cod_item"), "left")
    .select(
        F.coalesce("b.cod_item", "a.cod_item").alias("cod_item"),
        F.coalesce("b.product_name", "a.product_name").alias("product_name")
    )
)
test_3.show()
# +--------+------------+
# |cod_item|product_name|
# +--------+------------+
# |       4|      tire_1|
# |       1|     shamp_1|
# |       1|     shamp_1|
# |       4|      tire_1|
# |       7|      flower|
# |       6|      cookie|
# |       1|     shamp_1|
# |       8|       water|
# +--------+------------+

Sign up to request clarification or add additional context in comments.

3 Comments

If the test_2 data frame is a shopping table (eg item shamp_1 appears repeatedly in it) the result of the joins ends up repeating the shamp_1 value even more (eg if has 3 items shamp_1, the join will results in 9 items instead of 5). How to fix this error?
Before creating test_3, remove duplicates from test_2: test_2 = test_2.distinct()
I understood. I ended up specifying the problem wrong, sorry for that. The correct thing would be to have another column: customer_ID. That way, it wouldn't be possible to drop duplicate items because I lose purchase information. But thanks so much for the help so far.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.