1

I have a dataframe like this

+---+-------------------------+-------+
id  |     sentence            | value |
+---+-------------------------+-------+
  1 |dogs love fruits         |     50|
  1 |Mary is in the kitchen   |     45|
  2 |John eats fruits         |     56|
+---+-------------------------+-------+

I have a list of words, and I would like to create multiple columns based on sting matching, the output would look like this

List = ["fruits", "kitchen"]
+---+-------------------------+-------+-----------+------------+------------+-------------+
id  |     sentence            | value |fruits_bool|fruits_value|kitchen_bool|kitchen_value
+---+-------------------------+-------+-----------+------------+------------+-------------+
  1 |dogs love fruits         |     50|1          |50          |0           |0 
  1 |Mary is in the kitchen   |     45|0          |0           |1           |45
  2 |John eats fruits         |     56|1          |56          |0           |0
+---+-------------------------+-------+-----------+------------+------------+-------------+

I can easily create these columns word by word, but I would like to do automatically based on the list. Here's the code I tried

groupby("id"). \
    pivot('sentence', List). \
    agg(
       F.count('sentence').alias('bool')
       F.max('value').alias('value')
     ). \
    fillna(0)

I get an empty dataframe. What is the best way to do it ?

2 Answers 2

1

You can use F.contains and cast it to integer

df\
.withColumn('fruits_bool',F.col('sentence').contains(F.lit('fruits')).cast('integer'))\
.withColumn('fruits_value',F.col('fruits_bool')*F.col('value'))\
.withColumn('kitchen_bool',F.col('sentence').contains(F.lit('kitchen')).cast('integer'))\
.withColumn('kitchen_value',F.col('kitchen_bool')*F.col('value')).show()

output :

+---+--------------------+-----+-----------+------------+------------+-------------+
| id|            sentence|value|fruits_bool|fruits_value|kitchen_bool|kitchen_value|
+---+--------------------+-----+-----------+------------+------------+-------------+
|  1|    dogs love fruits|   50|          1|          50|           0|            0|
|  2|Mary is in the ki...|   45|          0|           0|           1|           45|
|  3|    John eats fruits|   56|          1|          56|           0|            0|
+---+--------------------+-----+-----------+------------+------------+-------------+
Sign up to request clarification or add additional context in comments.

Comments

0

Something like this probably :

from pyspark.sql import functions as F

df.select(
    "*",
    *(
        F.when(F.lit(val).isin(F.col("sentence")), 1).oterhwise(0).alias(f"{val}_bool")
        for val in List
    ),
    *(
        F.when(F.lit(val).isin(F.col("sentence")), F.col("value"))
        .oterhwise(0)
        .alias(f"{val}_value")
        for val in List
    ),
)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.