I am brand new to PySpark and want to translate my existing pandas/Python code to PySpark.

I want to subset my DataFrame so that only rows whose 'original_problem' field contains specific keywords I'm looking for are returned.

Below is the pandas-style code I tried in PySpark:

def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    df = df[df['original_problem'].str.contains('|'.join(searchfor))]
    return df

When I try to run the above, I get the following error:

AnalysisException: u"Can't extract value from original_problem#207: need struct type but got string;"


A PySpark Column has no pandas-style `.str` accessor; attribute access like `.str` is interpreted as extracting a field from a struct column, which is why you get the "need struct type but got string" error. In PySpark, use `rlike` instead:

df = df[df['original_problem'].rlike('|'.join(searchfor))]

Or equivalently:

import pyspark.sql.functions as F
df = df.where(F.col('original_problem').rlike('|'.join(searchfor)))
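Since `rlike` takes a regular expression, joining the keywords with `|` builds an OR pattern that matches anywhere in the string. A minimal plain-Python sketch of the same pattern using `re.search` (sample row strings are hypothetical), which mirrors `rlike`'s match-anywhere semantics:

```python
import re

searchfor = ['cat', 'dog', 'frog', 'fleece']
pattern = '|'.join(searchfor)  # 'cat|dog|frog|fleece'

# re.search matches anywhere in the string, just as rlike does
rows = ['my cat is sick', 'engine failure', 'torn fleece cover']
matches = [s for s in rows if re.search(pattern, s)]
# keeps 'my cat is sick' and 'torn fleece cover'
```

Note that if a keyword ever contains regex metacharacters (e.g. `.` or `+`), it should be escaped with `re.escape` before joining.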

Alternatively, you could use a UDF:

import pyspark.sql.functions as F

searchfor = ['cat', 'dog', 'frog', 'fleece']
# Use a substring test so rows that merely contain a keyword are kept;
# a bare `x in searchfor` would only match exact values
check_udf = F.udf(lambda x: x if any(s in x for s in searchfor) else 'Not_present')

df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
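The distinction between exact membership and substring containment is worth checking in plain Python first: `x in searchfor` is True only when the whole value equals a keyword, while `any(s in x for s in searchfor)` is True when any keyword appears inside the value. A quick comparison (sample strings are hypothetical):

```python
searchfor = ['cat', 'dog', 'frog', 'fleece']

def contains_keyword(x):
    # Substring test: True if any keyword appears anywhere inside x
    return any(s in x for s in searchfor)

print('my cat is sick' in searchfor)   # → False (exact membership only)
print(contains_keyword('my cat is sick'))  # → True (substring match)
print(contains_keyword('engine failure'))  # → False (no keyword present)
```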

But the native DataFrame methods are preferred, because they avoid the serialization overhead of a UDF and will be faster.


2 Comments

Change `like` to `rlike`.
@PineNuts0 see the edited answer: pyspark.sql.Column.rlike() supports regular expression patterns.
