0

Here is the data frame available:

+--------------------+
|                Name|
+--------------------+
|Braund, Mr. Owen ...|
|Cumings, Mrs. Joh...|
|Heikkinen, Miss. ...|
|Futrelle, Mrs. Ja...|
|Allen, Mr. Willia...|
|Moran, Mr. James|
|McCarthy, Mr. Tim...|
|Palsson, Master. ...|
|Johnson, Mrs. Osc...|
+--------------------+

I want to find the first occurrence of Title and Surname in each row in DATA FRAME using Pyspark (Pandas lib is not available in my cluster).

pattern=re.compile(r'(Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\\.'
pattern.match(df['Name'])
5
  • Maybe something like this that uses regex, or this or with a udf. Commented Jan 11, 2018 at 7:57
  • @mkaran Tried this code but not successful def findTitle(df): rdd=df.select("Name").flatMap(lambda x: x).map(lambda x:x).collect() for f in rdd: title=re.search('(Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)',f).group() Commented Jan 11, 2018 at 12:48
  • Can you try df = df.filter(df["Name"].rlike(r'(Ms|Miss)')) (and df.show())? I couldn't get it to work with your regex but it definitely works with this simpler expression. Commented Jan 11, 2018 at 13:10
  • Btw, if you want the surname your regex should be modified to something that uses lookbehind, e.g. (?<=Mr\.\s)\w+ will match Owen from the first line etc. Commented Jan 11, 2018 at 13:54
  • @mkaran df = df.filter(df["Name"].rlike(r'(Ms|Miss)')) .I have pattern for more than 15 surname.Its not working for me either. Commented Jan 18, 2018 at 12:18

2 Answers 2

0

You can use regexp_extract as @Prem suggested but with a different regex pattern, depending on what you need:

# do not keep the first two groups, just what follows, the surname:
pattern = r'(?:(?:Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.?\s?)(\w+)'

# or keep both title and surname
pattern_with_title = r'((Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don)\.?\s?)(\w+)'

#sample data
df = spark.createDataFrame([["Braund, Mr. Owen other stuff"], 
                       ["Cumings, Mrs. Joh some details"], 
                       ["Heikkinen, Miss. Hellen blah"], 
                       ["Futrelle, Mrs. Ja .... .... "]], ["Name"])

df.show()

+-----------------+
|             Name|
+-----------------+
| Braund, Mr. Owen|
|Cumings, Mrs. Joh|
| Heikkinen, Miss.|
|Futrelle, Mrs. Ja|
+-----------------+

# create a column with what matches the pattern
df = df.withColumn("Surname", regexp_extract("Name", pattern, 1))

df.show()
# keeps only the Surname
+-----------------+---------+
|             Name|  Surname|
+-----------------+---------+
| Braund, Mr. Owen| Owen    |
|Cumings, Mrs. Joh| Joh     |
| Heikkinen, Miss.| Hellen  |
|Futrelle, Mrs. Ja| Ja      |
+-----------------+---------+

 # in case you want both title and Surname:
 df = df.withColumn("Surname with title", regexp_extract("Name", pattern_with_title, 1))

+-----------------+---------+--------------------+
|             Name|  Surname|  Surname with title|
+-----------------+---------+--------------------+
|Braund, Mr. Owen | Owen    | Mr. Ownen          |
|Cumings, Mrs. Joh| Joh     | Mrs. Joh           |
|Heikkinen, Miss..| Hellen  | Miss. Hellen       |
|Futrelle, Mrs. Ja| Ja      | Mrs. Ja            |
+-----------------+---------+--------------------+

If you need the full Name, title Surname, then slightly change the pattern to include those too, e.g.:

main_pattern = r'Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Lady|Mme|Don'

pattern_full = r'(\w+,?\s('+ main_pattern+')\.?\s?\w+)'
pattern_name = r'(?:(?:'+ main_pattern+')\.?\s?)(\w+)'
pattern_title = r'(?:('+ main_pattern+')\.?\s?)'
pattern_surname = r'(\w+)(?:\,\s?(?:'+ main_pattern+')\.?\s?)'

df = df.withColumn("Full Name", regexp_extract("Name", pattern_full, 1))
df = df.withColumn("First Name", regexp_extract("Name", pattern_name, 1))
df = df.withColumn("Surname", regexp_extract("Name", pattern_surname, 1))
df = df.withColumn("Title", regexp_extract("Name", pattern_title, 1))

df.show(10, False)

+------------------------------+-----------------------+----------+------------+-----+
|Name                          |Full Name              |Surname   |First Name  |Title|
+------------------------------+-----------------------+----------+------------+-----+
|Braund, Mr. Owen other stuff  |Braund, Mr. Owen       |Braund    |Owen        |Mr   |
|Cumings, Mrs. Joh some details|Cumings, Mrs. Joh      |Cumings   |Joh         |Mrs  |
|Heikkinen, Miss. Hellen blah  |Heikkinen, Miss. Hellen|Heikkinen |Hellen      |Miss |
|Futrelle, Mrs. Ja .... ....   |Futrelle, Mrs. Ja      |Futrelle  |Ja          |Mrs  |
+------------------------------+-----------------------+----------+------------+-----+

It is all about which part to ignore and which part to select in the regex. Hope this helps, good luck!

Note: not the optimal regex and it has room to improve.

Sign up to request clarification or add additional context in comments.

6 Comments

but I required +-----------------+---------+--------------------+ | Name| Surname| title| +-----------------+---------+--------------------+ |Braund, Mr. Owen | Owen | Mr. | |Cumings, Mrs. Joh| Joh | Mrs. | |Heikkinen, Miss..| Hellen | Miss. | |Futrelle, Mrs. Ja| Ja | Mrs | +-----------------+---------
@shalu, that was not clear in your question, could you update it please? I have updated the answer to include many different combinations, let me know if it covers your question. Thanks
I updated it.I want first occurrence of title and surname in each row.The pattern of surname and title is provide explicitly.
@shalu thank you! I see that I had name-surname the other way. I've updated my answer. Does it cover your needs?
thanks, that what I wanted the Title column.Yes It covers my requirement.
|
0

If Name column has 'Surname' as the first word then you can try this else regex would need a little bit of tweaking.

from pyspark.sql.functions import regexp_extract, col

#sample data
df= sc.parallelize([["Braund, Mr. Owen"], 
                    ["Cumings, Mrs. Joh"], 
                    ["Heikkinen, Miss."], 
                    ["Futrelle, Mrs. Ja"]]).toDF(["Name"])

df = df.withColumn('Surname', regexp_extract(col('Name'), '(\S+),.*', 1))
df.show()

Sample data:

+-----------------+
|             Name|
+-----------------+
| Braund, Mr. Owen|
|Cumings, Mrs. Joh|
| Heikkinen, Miss.|
|Futrelle, Mrs. Ja|
+-----------------+

Output is:

+-----------------+---------+
|             Name|  Surname|
+-----------------+---------+
| Braund, Mr. Owen|   Braund|
|Cumings, Mrs. Joh|  Cumings|
| Heikkinen, Miss.|Heikkinen|
|Futrelle, Mrs. Ja| Futrelle|
+-----------------+---------+

1 Comment

I want the title name like Mr. Miss. and sure I will close as soon as I find answer.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.