I am trying to understand how regexp_extract works in PySpark, so I set up a small test to get a better feel for it.

Below is my DataFrame:

data = [('2345', 'Checked|by John|for kamal'),
('2398', 'Checked|by John|for kamal '),
('2328', 'Verified|by Srinivas|for kamal than some random text'),        
('3983', 'Verified|for Stacy|by John')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+-----------------------------------------------------+
|  ID|               Notes                                 |
+----+-----------------------------------------------------+
|2345|Checked|by John|for kamal                            |
|2398|Checked|by John|for kamal                            |
|2328|Verified|by Srinivas|for kamal than some random text |
|3983|Verified|for Stacy|by John                           |
+----+-----------------------------------------------------+

Here I am trying to identify whether an ID was checked or verified by John.

With the help of SO members I worked out how to use regexp_extract and came up with the solution below:

from pyspark.sql.functions import col, regexp_extract
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))

result.show()

+----+----------------------------------------------------+--------+
|  ID|Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           | Checked|
|2398|Checked|by John|for kamal                           | Checked|
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+----------------------------------------------------+--------+

For some IDs this gives the expected result, but for the last ID it did not print Verified. Could someone tell me whether the regular expression needs to be changed?

My feeling is that (Checked|Verified)(\\|by John) only matches when the two parts are adjacent. I tried * and $, but it still didn't print Verified for ID 3983.
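That intuition can be checked outside Spark. The sketch below uses Python's re module as a stand-in for the Java regex engine behind regexp_extract (an assumption, though both treat \| as a literal pipe for a pattern this simple). The names notes and extract are made up for illustration. Nothing in the pattern is allowed to sit between the first group and |by John, so the last note cannot match.

```python
import re

# Same four Notes values as in the DataFrame above.
notes = [
    'Checked|by John|for kamal',
    'Checked|by John|for kamal ',
    'Verified|by Srinivas|for kamal than some random text',
    'Verified|for Stacy|by John',
]

# Pattern from the question: group 1 must be followed immediately by "|by John".
pattern = re.compile(r'(Checked|Verified)(\|by John)')

def extract(note):
    """Mimic regexp_extract(..., 1): return group 1, or '' when nothing matches."""
    m = pattern.search(note)
    return m.group(1) if m else ''

results = [extract(n) for n in notes]
print(results)  # ['Checked', 'Checked', '', '']
```

For ID 3983 the text "for Stacy|" sits between Verified and by John, so the pattern needs something like .* between the two parts to skip over it.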

3 Answers

I would have phrased the regex as:

(Checked|Verified)\b.*\bby John

This pattern finds Checked or Verified followed by "by John", with any amount of text between the two. Note that I use word boundaries here instead of matching the pipes.

Updated code:

# Raw string, so that \b reaches the regex engine as a word boundary instead of a Python backspace character
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'\b(Checked|Verified)\b.*\bby John', 1))

1 Comment

Thanks, Tim. But did that give you the results? When I replaced my regular expression with yours, I didn't get any results.
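One possible explanation for the empty result, assuming the pattern was pasted as an ordinary (non-raw) Python string: Python turns '\b' into a backspace character before the regex engine ever sees it, so the pattern looks for literal backspaces rather than word boundaries. The sketch below uses Python's re module as a stand-in for Spark's Java regex engine, and the variable names are only for illustration.

```python
import re

note = 'Verified|for Stacy|by John'

# In a plain string literal, '\b' is the backspace character (\x08).
plain = '\b(Checked|Verified)\b.*\bby John'
# In a raw string, backslash + b survives and reaches the regex engine.
raw = r'\b(Checked|Verified)\b.*\bby John'

print('\x08' in plain)  # True: the pattern contains backspace characters
print('\x08' in raw)    # False

plain_match = re.search(plain, note)
raw_match = re.search(raw, note)

print(plain_match)         # None: the note has no backspace characters to match
print(raw_match.group(1))  # Verified
```

Writing the boundaries as '\\b' in a plain string should work as well, in the same way the question writes '\\|' for the literal pipe.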

You can try this regex:

import pyspark.sql.functions as F

result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))

result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...|        |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
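One thing to be aware of with this pattern: by John is not anchored on the right, so a longer name that starts with John would also match. The sketch below shows this with Python's re module standing in for Spark's Java regex engine. The "by Johnson" note is made-up data, not from the question, and the trailing \b is a suggested tightening rather than part of the answer above.

```python
import re

# Pattern from the answer above.
loose = re.compile(r'(Checked|Verified)\|.*by John')
# Assumed variant: a trailing word boundary so "John" must end there.
strict = re.compile(r'(Checked|Verified)\|.*by John\b')

def group1(pattern, note):
    """Return capture group 1, or '' when the pattern does not match."""
    m = pattern.search(note)
    return m.group(1) if m else ''

real_note = 'Verified|for Stacy|by John'        # from the question
made_up_note = 'Verified|by Johnson|for kamal'  # hypothetical, for illustration

loose_results = [group1(loose, real_note), group1(loose, made_up_note)]
strict_results = [group1(strict, real_note), group1(strict, made_up_note)]

print(loose_results)   # ['Verified', 'Verified']
print(strict_results)  # ['Verified', '']
```

Whether the stricter form is worth it depends on the names that actually occur in the Notes column.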


Another way is to check whether the Notes column contains the string "by John":

df.withColumn('Employee', F.when(F.col('Notes').like('Checked|%by John%'), 'Checked').when(F.col('Notes').like('Verified|%by John%'), 'Verified').otherwise(' ')).show(truncate=False)

+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |Verified|
+----+----------------------------------------------------+--------+

2 Comments

Thanks! I am expecting Verified as the result for the last ID.
My bad. It required multiple when clauses. See my edits.
