How to regexp_extract if a matching pattern resides anywhere in the string - pyspark

Question

I am trying to understand regexp_extract in PySpark, so I ran a small experiment with it.

Below is my DataFrame:

data = [('2345', 'Checked|by John|for kamal'),
        ('2398', 'Checked|by John|for kamal '),
        ('2328', 'Verified|by Srinivas|for kamal than some random text'),
        ('3983', 'Verified|for Stacy|by John')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+----------------------------------------------------+
|ID  |Notes                                               |
+----+----------------------------------------------------+
|2345|Checked|by John|for kamal                           |
|2398|Checked|by John|for kamal                           |
|2328|Verified|by Srinivas|for kamal than some random text|
|3983|Verified|for Stacy|by John                          |
+----+----------------------------------------------------+

What I want to identify is whether each ID was checked or verified by John.

With the help of SO members I worked out how to use regexp_extract and came up with the solution below:

from pyspark.sql.functions import col, regexp_extract

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))

result.show()

+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+----------------------------------------------------+--------+

For some IDs this gives the expected result, but for the last ID it does not print Verified. Could someone tell me whether the regular expression needs to be changed?

My guess is that (Checked|Verified)(\\|by John) only matches when the two parts are adjacent. I tried adding * and $, but it still did not print Verified for ID 3983.
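The adjacency guess can be checked outside Spark. Below is a minimal sketch using Python's re module as a stand-in for the Java regex engine that Spark uses; I am assuming the two behave the same for this simple pattern. The variable names are only for illustration.

```python
import re

# Same pattern as in the question. In a normal Python string, '\\|' is the
# two characters \| , which the regex engine reads as a literal pipe.
pattern = '(Checked|Verified)(\\|by John)'

adjacent = 'Checked|by John|for kamal'
separated = 'Verified|for Stacy|by John'

# Matches: '|by John' comes immediately after 'Checked'.
print(re.search(pattern, adjacent).group(1))   # Checked

# No match: 'Verified' is followed by '|for Stacy', not by '|by John'.
print(re.search(pattern, separated))           # None
```

So the pattern has no way to skip over the text between the keyword and "by John"; something like .* between the two parts is needed for that.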

Tim Biegeleisen · Accepted Answer · 2021-04-23 03:59:22Z


I would have phrased the regex as:

(Checked|Verified)\b.*\bby John


This pattern finds Checked or Verified followed by "by John", with any amount of text allowed between the two. Note that I use word boundaries here instead of matching the pipes.

Updated code:

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'\b(Checked|Verified)\b.*\bby John', 1))
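A caveat when writing \b in a Python string literal: in a normal (non-raw) string, Python converts \b into a backspace character before any regex engine sees it, so a word-boundary pattern needs a raw string or doubled backslashes. The sketch below illustrates this with Python's re module; I am assuming Spark's Java regex engine treats \b the same way when the pattern is passed through the Python function API rather than inside a SQL string.

```python
import re

text = 'Verified|for Stacy|by John'

plain = '\b(Checked|Verified)\b.*\bby John'   # each '\b' becomes a backspace (0x08)
raw = r'\b(Checked|Verified)\b.*\bby John'    # each '\b' stays backslash + b

print('\b' == '\x08')          # True
print(len('\b'), len(r'\b'))   # 1 2

# The non-raw pattern looks for literal backspace characters and finds none.
print(re.search(plain, text))            # None

# The raw pattern uses real word boundaries and matches.
print(re.search(raw, text).group(1))     # Verified
```

This would explain getting empty results when the pattern is pasted into a non-raw string.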


1 Comment

USB Over a year ago

Thanks, Tim. But did that give you the results? When I replaced my regular expression with yours, I did not get any results.

mck · Accepted Answer · 2021-04-23 07:48:51Z


You can try this regex:

import pyspark.sql.functions as F

result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))

result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...|        |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
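To see why this pattern produces the column above, here is a rough pure-Python sketch. The extract helper is hypothetical: it only imitates the documented behaviour of regexp_extract of returning an empty string when the regex does not match, and it uses Python's re module on the assumption that it agrees with Spark's Java regex for this pattern.

```python
import re

def extract(text, pattern, idx):
    """Hypothetical stand-in for regexp_extract: group idx, or '' on no match."""
    m = re.search(pattern, text)
    if m is None:
        return ''
    return m.group(idx) or ''

# Keyword, a literal pipe, then anything, then 'by John'.
pattern = '(Checked|Verified)\\|.*by John'

notes = {
    '2345': 'Checked|by John|for kamal',
    '2398': 'Checked|by John|for kamal ',
    '2328': 'Verified|by Srinivas|for kamal than some random text',
    '3983': 'Verified|for Stacy|by John',
}

result = {id_: extract(text, pattern, 1) for id_, text in notes.items()}
print(result)
# {'2345': 'Checked', '2398': 'Checked', '2328': '', '3983': 'Verified'}
```

The .* can match zero characters, which is why the rows where "by John" directly follows the pipe still match.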


wwnde · Accepted Answer · 2021-04-23 10:39:31Z


Another way is to check whether the Notes column contains the string "by John":

import pyspark.sql.functions as F

df.withColumn(
    'Employee',
    F.when(F.col('Notes').like('%Checked|by John%'), 'Checked')
     .when(F.col('Notes').like('%by John'), 'Verified')
     .otherwise(' ')
).show(truncate=False)

+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |Verified|
+----+----------------------------------------------------+--------+
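One thing worth knowing about the two LIKE patterns: '%Checked|by John%' matches the text anywhere, while '%by John' has no trailing %, so the string must end with "by John". The sketch below mirrors the when chain in plain Python to show the consequences. The label function and the 'Rejected' row are hypothetical and only for illustration.

```python
def label(notes):
    # Mirrors like('%Checked|by John%'): the text may appear anywhere.
    if 'Checked|by John' in notes:
        return 'Checked'
    # Mirrors like('%by John'): the string must END with 'by John'.
    if notes.endswith('by John'):
        return 'Verified'
    # Mirrors otherwise(' ').
    return ' '

print(repr(label('Verified|for Stacy|by John')))    # 'Verified'
print(repr(label('Verified|for Stacy|by John ')))   # ' ' (trailing space breaks the match)
print(repr(label('Rejected|for Stacy|by John')))    # 'Verified' (keyword is never checked)
```

If either edge case matters for your data, the regex approach is probably the safer choice, since it captures the keyword itself.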

edited Apr 23, 2021 at 10:39

answered Apr 23, 2021 at 4:07


2 Comments

USB Over a year ago

Thanks! I am expecting Verified as the result for the last ID.

wwnde Over a year ago

My bad. It needed multiple when conditions. See my edits.
