
I have a PySpark DataFrame with a string column text and a separate list word_list, and I need to count how many times the values in word_list appear in each text row (a value can be counted more than once).

df = spark.createDataFrame(
  [(1,'Hello my name is John'), 
   (2,'Yo go Bengals'), 
   (3,'this is a text')
  ]
  , ['id','text']
)

word_list = ['is', 'm', 'o', 'my']

The result would be:

| text                  | list_count |
|-----------------------|------------|
| Hello my name is John | 6          |
| Yo go Bengals         | 2          |
| this is a text        | 2          |

For text's first value, "is" occurs once, "m" occurs twice, "o" occurs twice, and "my" occurs once. In the second row, the only value from word_list that appears is "o" and it appears twice. In the third value for text, the only value from word_list that appears is "is" and it appears twice.

The result doesn't necessarily have to be PySpark-based either, it could be in Pandas if that's easier.

  • Are you sure your logic is consistent? You say that for text's first value "m" occurs twice, but that in the third value for text the only value from word_list that appears is "is" and it appears once. In the same way "m" appears twice in the first row, doesn't "is" appear twice in the third row? Commented Feb 10, 2022 at 5:00
  • Yes you're correct... I'll edit that to make the logic right. Commented Feb 10, 2022 at 5:13

2 Answers


You can do this with a UDF, as below.

UDF

import re
from functools import partial

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame(
  [(1,'Hello my name is John'), 
   (2,'Yo go Bengals'), 
   (3,'this is a text')
  ]
  , ['id','text']
)

word_list = ['is', 'm', 'o', 'my']

def count_values(inp, map_list=None):
    # Add up every match of every pattern, so repeated occurrences are counted.
    count = 0
    for pattern in map_list:
        count += len(re.findall(pattern, inp))
    return count

count_values_udf = F.udf(partial(count_values, map_list=word_list), IntegerType())

df.select(
    count_values_udf(F.col('text')).alias('count_values')
).show()

+------------+
|count_values|
+------------+
|           6|
|           2|
|           2|
+------------+
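The counting logic can be sanity-checked on the driver in plain Python, without Spark (a minimal sketch of the same function):

```python
import re

word_list = ['is', 'm', 'o', 'my']

def count_values(inp, map_list=None):
    # Total matches across all patterns; each occurrence counts separately.
    return sum(len(re.findall(pattern, inp)) for pattern in map_list)

print(count_values('Hello my name is John', word_list))  # 6
print(count_values('Yo go Bengals', word_list))          # 2
print(count_values('this is a text', word_list))         # 2
```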



To count the number of occurrences of a substring in a string column, you can split that column by the substring; the count is then the size of the resulting array minus 1.
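The same identity holds for plain Python strings, which makes the trick easy to verify:

```python
text = 'this is a text'
# Splitting on 'is' yields ['th', ' ', ' a text']: 3 pieces, i.e. 2 occurrences.
print(len(text.split('is')) - 1)  # 2
```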

So in your case, you can use aggregate function on the word_list array column and for each element, split the text column and get the size - 1:

from pyspark.sql import functions as F

result = df.withColumn(
    "word_list",
    F.array(*[F.lit(x) for x in word_list])
).withColumn(
    "list_count",
    F.expr("aggregate(word_list, 0, (acc, x) -> acc + size(split(text, x)) - 1)")
).drop("word_list")

result.show(truncate=False)
#+---+---------------------+----------+
#|id |text                 |list_count|
#+---+---------------------+----------+
#|1  |Hello my name is John|6         |
#|2  |Yo go Bengals        |2         |
#|3  |this is a text       |2         |
#+---+---------------------+----------+
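Since the question also allows a Pandas-based answer, a minimal sketch using Series.str.count (which counts non-overlapping matches of each pattern per row) would be:

```python
import pandas as pd

pdf = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['Hello my name is John', 'Yo go Bengals', 'this is a text'],
})
word_list = ['is', 'm', 'o', 'my']

# Sum the per-word occurrence counts across the whole list.
pdf['list_count'] = sum(pdf['text'].str.count(w) for w in word_list)
print(pdf['list_count'].tolist())  # [6, 2, 2]
```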

