
I have a PySpark DataFrame with a string column text and a separate list word_list, and I need to count how many times the values in word_list appear in each text row (a value can be counted more than once).

df = spark.createDataFrame(
  [(1,'Hello my name is John'), 
   (2,'Yo go Bengals'), 
   (3,'this is a text')
  ]
  , ['id','text']
)

word_list = ['is', 'm', 'o', 'my']

The result would be:

| text                  | list_count |
|-----------------------|------------|
| Hello my name is John | 6          |
| Yo go Bengals         | 2          |
| this is a text        | 2          |

For text's first value, "is" occurs once, "m" occurs twice, "o" occurs twice, and "my" occurs once. In the second row, the only value from word_list that appears is "o" and it appears twice. In the third value for text, the only value from word_list that appears is "is" and it appears twice.

The result doesn't necessarily have to be PySpark-based either, it could be in Pandas if that's easier.

  • Are you sure your logic is consistent? You say that for text's first value "m" occurs twice, but that in the third value for text the only value from word_list that appears is "is" and it appears once. In the same way "m" appears twice in the first row, doesn't "is" appear twice in the third row? Commented Feb 10, 2022 at 5:00
  • Yes you're correct... I'll edit that to make the logic right. Commented Feb 10, 2022 at 5:13

2 Answers


You can do this with a UDF, as below.

UDF

import re
from functools import partial

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame(
  [(1,'Hello my name is John'), 
   (2,'Yo go Bengals'), 
   (3,'this is a text')
  ]
  , ['id','text']
)

word_list = ['is', 'm', 'o', 'my']

def count_values(inp, map_list=None):
    # Add up every match of every pattern, so repeated occurrences are counted.
    count = 0
    for pattern in map_list:
        count += len(re.findall(pattern, inp))
    return count

count_values_udf = F.udf(partial(count_values, map_list=word_list), IntegerType())

df.select(
    count_values_udf(F.col('text')).alias('count_values')
).show()

+------------+
|count_values|
+------------+
|           6|
|           2|
|           2|
+------------+
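The counting logic can be sanity-checked on the driver in plain Python, without Spark (a minimal sketch of the same function):

```python
import re

word_list = ['is', 'm', 'o', 'my']

def count_values(inp, map_list=None):
    # Total matches across all patterns; each occurrence counts separately.
    return sum(len(re.findall(pattern, inp)) for pattern in map_list)

print(count_values('Hello my name is John', word_list))  # 6
print(count_values('Yo go Bengals', word_list))          # 2
print(count_values('this is a text', word_list))         # 2
```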



To count the number of occurrences of a substring in a string column, you can split that column by the substring; the count is then the size of the resulting array minus 1.
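The same identity holds for plain Python strings, which makes the trick easy to verify:

```python
text = 'this is a text'
# Splitting on 'is' yields ['th', ' ', ' a text']: 3 pieces, i.e. 2 occurrences.
print(len(text.split('is')) - 1)  # 2
```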

So in your case, you can use aggregate function on the word_list array column and for each element, split the text column and get the size - 1:

from pyspark.sql import functions as F

result = df.withColumn(
    "word_list",
    F.array(*[F.lit(x) for x in word_list])
).withColumn(
    "list_count",
    F.expr("aggregate(word_list, 0, (acc, x) -> acc + size(split(text, x)) - 1)")
).drop("word_list")

result.show(truncate=False)
#+---+---------------------+----------+
#|id |text                 |list_count|
#+---+---------------------+----------+
#|1  |Hello my name is John|6         |
#|2  |Yo go Bengals        |2         |
#|3  |this is a text       |2         |
#+---+---------------------+----------+
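Since the question also allows a Pandas-based answer, a minimal sketch using Series.str.count (which counts non-overlapping matches of each pattern per row) would be:

```python
import pandas as pd

pdf = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['Hello my name is John', 'Yo go Bengals', 'this is a text'],
})
word_list = ['is', 'm', 'o', 'my']

# Sum the per-word occurrence counts across the whole list.
pdf['list_count'] = sum(pdf['text'].str.count(w) for w in word_list)
print(pdf['list_count'].tolist())  # [6, 2, 2]
```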

