I am brand new to PySpark and want to translate my existing pandas/Python code to PySpark.

I want to subset my DataFrame so that only rows whose 'original_problem' field contains specific keywords I'm looking for are returned.

Below is the pandas-style code I tried in PySpark:

def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    df = df[df['original_problem'].str.contains('|'.join(searchfor))]
    return df

When I try to run the above, I get the following error:

AnalysisException: u"Can't extract value from original_problem#207: need struct type but got string;"


A PySpark Column has no pandas-style `.str` accessor; attribute access like `.str` is interpreted as extracting a field from a struct column, which is why you get the "need struct type but got string" error. In PySpark, use `rlike` instead:

df = df[df['original_problem'].rlike('|'.join(searchfor))]

Or equivalently:

import pyspark.sql.functions as F
df = df.where(F.col('original_problem').rlike('|'.join(searchfor)))
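Since `rlike` takes a regular expression, joining the keywords with `|` builds an OR pattern that matches anywhere in the string. A minimal plain-Python sketch of the same pattern using `re.search` (sample row strings are hypothetical), which mirrors `rlike`'s match-anywhere semantics:

```python
import re

searchfor = ['cat', 'dog', 'frog', 'fleece']
pattern = '|'.join(searchfor)  # 'cat|dog|frog|fleece'

# re.search matches anywhere in the string, just as rlike does
rows = ['my cat is sick', 'engine failure', 'torn fleece cover']
matches = [s for s in rows if re.search(pattern, s)]
# keeps 'my cat is sick' and 'torn fleece cover'
```

Note that if a keyword ever contains regex metacharacters (e.g. `.` or `+`), it should be escaped with `re.escape` before joining.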

Alternatively, you could use a UDF:

import pyspark.sql.functions as F

searchfor = ['cat', 'dog', 'frog', 'fleece']
# Use a substring test so rows that merely contain a keyword are kept;
# a bare `x in searchfor` would only match exact values
check_udf = F.udf(lambda x: x if any(s in x for s in searchfor) else 'Not_present')

df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
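The distinction between exact membership and substring containment is worth checking in plain Python first: `x in searchfor` is True only when the whole value equals a keyword, while `any(s in x for s in searchfor)` is True when any keyword appears inside the value. A quick comparison (sample strings are hypothetical):

```python
searchfor = ['cat', 'dog', 'frog', 'fleece']

def contains_keyword(x):
    # Substring test: True if any keyword appears anywhere inside x
    return any(s in x for s in searchfor)

print('my cat is sick' in searchfor)   # → False (exact membership only)
print(contains_keyword('my cat is sick'))  # → True (substring match)
print(contains_keyword('engine failure'))  # → False (no keyword present)
```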

But the native DataFrame methods are preferred, because they avoid the serialization overhead of a UDF and will be faster.


2 Comments

Change `like` to `rlike`.
@PineNuts0 see the edited answer: pyspark.sql.Column.rlike() supports regular expression patterns.
