How to find a string in each row in a dataframe in pyspark

Question

Here is the data frame available:

+--------------------+
|                Name|
+--------------------+
|Braund, Mr. Owen ...|
|Cumings, Mrs. Joh...|
|Heikkinen, Miss. ...|
|Futrelle, Mrs. Ja...|
|Allen, Mr. Willia...|
|    Moran, Mr. James|
|McCarthy, Mr. Tim...|
|Palsson, Master. ...|
|Johnson, Mrs. Osc...|
+--------------------+

I want to find the first occurrence of the title and surname in each row of the DataFrame using PySpark (the pandas library is not available on my cluster).

import re

pattern = re.compile(r'(Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.')
# this does not work: re expects a Python string, not a Spark Column
pattern.match(df['Name'])

Maybe something like this that uses regex, or this, or with a udf.
– mkaran, Commented Jan 11, 2018 at 7:57
@mkaran I tried this code but it was not successful: def findTitle(df): rdd=df.select("Name").flatMap(lambda x: x).map(lambda x:x).collect() for f in rdd: title=re.search('(Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)',f).group()
– shalu, Commented Jan 11, 2018 at 12:48
Can you try df = df.filter(df["Name"].rlike(r'(Ms|Miss)')) (and df.show())? I couldn't get it to work with your regex, but it definitely works with this simpler expression.
– mkaran, Commented Jan 11, 2018 at 13:10
Btw, if you want the surname, your regex should be modified to something that uses lookbehind, e.g. (?<=Mr\.\s)\w+ will match Owen from the first line, etc.
– mkaran, Commented Jan 11, 2018 at 13:54
@mkaran I tried df = df.filter(df["Name"].rlike(r'(Ms|Miss)')). I have a pattern for more than 15 surnames. It's not working for me either.
– shalu, Commented Jan 18, 2018 at 12:18
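
The lookbehind idea from the comments can be tried on plain Python strings before running anything on the cluster. The following is only a local sketch with Python's re module, not Spark code; `first_match` is a made-up helper name, and the two strings are taken from the sample data used in the answer below. Note that Python only accepts fixed-width lookbehinds, so alternatives of different lengths (such as Mr. and Mrs.) cannot be combined in a single Python lookbehind, which is one reason a capture group is often simpler for a long list of titles.

```python
import re

# Lookbehind from the comment: match a word only if "Mr. " comes right before it.
lookbehind = r"(?<=Mr\.\s)\w+"

def first_match(regex, text):
    """Return the first whole match, or '' when nothing matches."""
    m = re.search(regex, text)
    return m.group(0) if m else ""

mr_row = first_match(lookbehind, "Braund, Mr. Owen other stuff")
mrs_row = first_match(lookbehind, "Cumings, Mrs. Joh some details")

# The "Mrs." row yields nothing because "Mrs. " does not contain "Mr. ".
print(repr(mr_row), repr(mrs_row))
```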

mkaran · Accepted Answer · 2018-01-23 09:28:44Z

0

You can use regexp_extract as @Prem suggested but with a different regex pattern, depending on what you need:

from pyspark.sql.functions import regexp_extract

# skip the title (non-capturing groups) and capture only the word that follows it:
pattern = r'(?:(?:Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.?\s?)(\w+)'

# or match both the title and the word that follows it
pattern_with_title = r'((Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.?\s?)(\w+)'

#sample data
df = spark.createDataFrame([["Braund, Mr. Owen other stuff"], 
                       ["Cumings, Mrs. Joh some details"], 
                       ["Heikkinen, Miss. Hellen blah"], 
                       ["Futrelle, Mrs. Ja .... .... "]], ["Name"])

df.show()

+--------------------+
|                Name|
+--------------------+
|Braund, Mr. Owen ...|
|Cumings, Mrs. Joh...|
|Heikkinen, Miss. ...|
|Futrelle, Mrs. Ja...|
+--------------------+

# create a column with what matches the pattern
df = df.withColumn("Surname", regexp_extract("Name", pattern, 1))

df.show()
# keeps only the Surname
+--------------------+-------+
|                Name|Surname|
+--------------------+-------+
|Braund, Mr. Owen ...|   Owen|
|Cumings, Mrs. Joh...|    Joh|
|Heikkinen, Miss. ...| Hellen|
|Futrelle, Mrs. Ja...|     Ja|
+--------------------+-------+

# in case you want both title and Surname, take group 0 (the whole match);
# group 1 of pattern_with_title is only the title with its dot and space
df = df.withColumn("Surname with title", regexp_extract("Name", pattern_with_title, 0))

df.show()

+--------------------+-------+------------------+
|                Name|Surname|Surname with title|
+--------------------+-------+------------------+
|Braund, Mr. Owen ...|   Owen|          Mr. Owen|
|Cumings, Mrs. Joh...|    Joh|          Mrs. Joh|
|Heikkinen, Miss. ...| Hellen|      Miss. Hellen|
|Futrelle, Mrs. Ja...|     Ja|           Mrs. Ja|
+--------------------+-------+------------------+
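
The last argument of regexp_extract selects a capture group: 0 is the whole match, 1 is the first parenthesised group, and so on. To see which index returns what, the two patterns can be tried on a plain Python string. This is a local sketch with Python's re module, not Spark code; Spark evaluates regexp_extract with Java regular expressions, which treat the constructs used here the same way. `groups` is a made-up helper name.

```python
import re

titles = r"Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don"
pattern = rf"(?:(?:{titles})\.?\s?)(\w+)"
pattern_with_title = rf"(({titles})\.?\s?)(\w+)"

def groups(regex, text):
    """Return the whole match followed by each capture group, or None if nothing matches."""
    m = re.search(regex, text)
    return None if m is None else (m.group(0),) + m.groups()

sample = "Braund, Mr. Owen other stuff"
only_name = groups(pattern, sample)
with_title = groups(pattern_with_title, sample)

print(only_name)   # index 0 = whole match, index 1 = word after the title
print(with_title)  # index 0 = title + word, 1 = title with dot and space, 2 = bare title, 3 = word
```

Reading the tuples by position gives the index to pass to regexp_extract for each piece.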

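If you need the full name, title and surname as separate columns, change the pattern slightly to include those too. Note that in this data the surname is the word before the comma, and the word after the title is the first name, e.g.: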

main_pattern = r'Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don'

# every literal is a raw string, so the backslashes reach the regex engine unchanged
pattern_full = r'(\w+,?\s(' + main_pattern + r')\.?\s?\w+)'
pattern_name = r'(?:(?:' + main_pattern + r')\.?\s?)(\w+)'
pattern_title = r'(?:(' + main_pattern + r')\.?\s?)'
pattern_surname = r'(\w+)(?:\,\s?(?:' + main_pattern + r')\.?\s?)'

df = df.withColumn("Full Name", regexp_extract("Name", pattern_full, 1))
df = df.withColumn("Surname", regexp_extract("Name", pattern_surname, 1))
df = df.withColumn("First Name", regexp_extract("Name", pattern_name, 1))
df = df.withColumn("Title", regexp_extract("Name", pattern_title, 1))

df.show(10, False)

+------------------------------+-----------------------+----------+------------+-----+
|Name                          |Full Name              |Surname   |First Name  |Title|
+------------------------------+-----------------------+----------+------------+-----+
|Braund, Mr. Owen other stuff  |Braund, Mr. Owen       |Braund    |Owen        |Mr   |
|Cumings, Mrs. Joh some details|Cumings, Mrs. Joh      |Cumings   |Joh         |Mrs  |
|Heikkinen, Miss. Hellen blah  |Heikkinen, Miss. Hellen|Heikkinen |Hellen      |Miss |
|Futrelle, Mrs. Ja .... ....   |Futrelle, Mrs. Ja      |Futrelle  |Ja          |Mrs  |
+------------------------------+-----------------------+----------+------------+-----+

It is all about which part to ignore and which part to select in the regex. Hope this helps, good luck!

Note: this is not an optimal regex; it has room for improvement.
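
One possible improvement, offered as a sketch and not as part of the original answer: the title alternatives have no word boundaries, so a title can match inside a longer word. Wrapping the group in \b (supported by both Python and Java regex) forces a whole-word match. The row "Drake, Mr. John", the names `loose_title`, `strict_title`, `first_group` and `add_title_column` are all made up for illustration. The pattern checks run locally with Python's re module; the Spark function imports pyspark lazily so the checks work without a cluster.

```python
import re

TITLES = r"Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don"

# Pattern from the answer: a title may match inside a longer word.
loose_title = rf"(?:({TITLES})\.?\s?)"
# Assumed refinement: \b on both sides forces the title to be a whole word.
strict_title = rf"\b({TITLES})\b\.?"

def first_group(regex, text):
    """Return capture group 1 of the first match, or '' when there is none."""
    m = re.search(regex, text)
    return m.group(1) if m else ""

name = "Drake, Mr. John"  # made-up row, not from the question's data
loose = first_group(loose_title, name)    # picks up the "Dr" at the start of "Drake"
strict = first_group(strict_title, name)  # skips "Drake" and finds the real title
print(loose, strict)

def add_title_column(df, column="Name"):
    """Sketch: add a Title column to a Spark DataFrame using the stricter pattern."""
    # Imported here so the local checks above run without Spark installed.
    from pyspark.sql.functions import regexp_extract
    return df.withColumn("Title", regexp_extract(column, strict_title, 1))
```

Like regexp_extract, the `first_group` helper returns an empty string when the pattern does not match, so the local results mirror what the Spark column would contain.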

edited Jan 23, 2018 at 9:28

answered Jan 18, 2018 at 14:00

mkaran


6 Comments

shalu Over a year ago

mkaran Over a year ago

@shalu, that was not clear in your question, could you update it please? I have updated the answer to include many different combinations, let me know if it covers your question. Thanks

shalu Over a year ago

I updated it. I want the first occurrence of the title and surname in each row. The pattern for the surname and title is provided explicitly.

mkaran Over a year ago

@shalu thank you! I see that I had name-surname the other way. I've updated my answer. Does it cover your needs?

shalu Over a year ago

Thanks, that is what I wanted: the Title column. Yes, it covers my requirement.

Prem · Accepted Answer · 2018-01-12 19:07:41Z

0

If the Name column has the surname as its first word, you can try this; otherwise the regex would need a little tweaking.

from pyspark.sql.functions import regexp_extract, col

#sample data
df= sc.parallelize([["Braund, Mr. Owen"], 
                    ["Cumings, Mrs. Joh"], 
                    ["Heikkinen, Miss."], 
                    ["Futrelle, Mrs. Ja"]]).toDF(["Name"])

# raw string so that \S reaches the regex engine unchanged
df = df.withColumn('Surname', regexp_extract(col('Name'), r'(\S+),.*', 1))
df.show()

Sample data:

+-----------------+
|             Name|
+-----------------+
| Braund, Mr. Owen|
|Cumings, Mrs. Joh|
| Heikkinen, Miss.|
|Futrelle, Mrs. Ja|
+-----------------+

Output is:

+-----------------+---------+
|             Name|  Surname|
+-----------------+---------+
| Braund, Mr. Owen|   Braund|
|Cumings, Mrs. Joh|  Cumings|
| Heikkinen, Miss.|Heikkinen|
|Futrelle, Mrs. Ja| Futrelle|
+-----------------+---------+
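
As an illustration of the kind of tweak mentioned above: (\S+), only captures the last whitespace-free token before the comma, so a surname made of two words would be cut short. A hedged alternative is to capture everything up to the first comma. The sketch below checks both patterns locally with Python's re module; the row "Vander Planke, Mrs. Julius" and the names `first_token`, `before_comma` and `extract` are assumptions for illustration only.

```python
import re

first_token = r"(\S+),.*"     # pattern from the answer
before_comma = r"^([^,]+),"   # assumed alternative: everything up to the first comma

def extract(regex, text):
    """Return capture group 1 of the first match, or '' when there is none."""
    m = re.search(regex, text)
    return m.group(1) if m else ""

two_word = "Vander Planke, Mrs. Julius"  # hypothetical two-word surname
short = extract(first_token, two_word)   # only the last word before the comma
full = extract(before_comma, two_word)   # the whole surname
print(repr(short), repr(full))

# In Spark the alternative pattern would be used the same way as in the answer:
# df.withColumn('Surname', regexp_extract(col('Name'), r'^([^,]+),', 1))
```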

answered Jan 12, 2018 at 19:07

Prem


1 Comment

shalu Over a year ago

I want the title, like Mr. or Miss., and sure, I will close the question as soon as I find an answer.
