I have a PySpark DataFrame with a string column text and a separate list word_list and I need to count how many of the word_list values appear in each text row (can be counted more than once).
df = spark.createDataFrame(
[(1,'Hello my name is John'),
(2,'Yo go Bengals'),
(3,'this is a text')
]
, ['id','text']
)
word_list = ['is', 'm', 'o', 'my']
The result would be:
| text | list_count |
| Hello my name is John | 6 |
| Yo go Bengals | 2 |
| this is a text | 2 |
For text's first value, "is" occurs once, "m" occurs twice, "o" occurs twice, and "my" occurs once. In the second row, the only value from word_list that appears is "o" and it appears twice. In the third value for text, the only value from word_list that appears is "is" and it appears twice.
The result doesn't necessarily have to be PySpark-based either, it could be in Pandas if that's easier.
For text's first value..."m" occurs twice ... In the third value for [text], the only value from [word_list] that appears is "is" and it appears once.In the same waymappears twice in the first row, doesisnot appear twice in the third row?