Regular expression for matching different name formats in Python

Question

I need a regular expression in Python that can match different formats of the same person's name. For example, I have four name formats for one person:

R. K. Goyal
Raj K. Goyal
Raj Kumar Goyal
R. Goyal

What single regular expression would pick out all of these names from a list of thousands?

PS: My list has thousands of such names, so I need a generic solution that lets me group the variants of each name together. In the example above, R and Goyal could be used to write the regex.

Thanks

R.*Goyal is a regular expression that will match all of those names. However, it seems unlikely that you really want to use regular expressions to solve the general problem of grouping names that are likely the same person.
– Wooble, Commented May 15, 2013 at 15:40
R.*Goyal will also match other names such as Raj Anything Goyal.
– joon, Commented May 15, 2013 at 15:41
Or indeed "Rock Hudson and V. S. Goyal" (the R from Rock, then everything up to Goyal is skipped).
– tripleee, Commented May 15, 2013 at 15:52
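
To make the over-matching described in these comments concrete, here is a small Python sketch. The sample strings come from the question and the comments; the variable names are illustrative only.

```python
import re

# The loose pattern suggested in the first comment.
loose = re.compile(r"R.*Goyal")

wanted = ["R. K. Goyal", "Raj K. Goyal", "Raj Kumar Goyal", "R. Goyal"]
unwanted = ["Raj Anything Goyal", "Rock Hudson and V. S. Goyal"]

# Every intended variant matches...
wanted_hits = [bool(loose.search(s)) for s in wanted]
# ...but so do strings that are clearly not the same person.
unwanted_hits = [bool(loose.search(s)) for s in unwanted]

print(wanted_hits)
print(unwanted_hits)
```

Both lists contain only matches, which is why a pattern this loose is not enough to group names reliably.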

joon · Accepted Answer · 2013-05-15 15:40:05Z

1

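"R(\.|aj)? (K(\.|umar)? )?Goyal" will match all four of those cases. Because the groups are optional it also accepts a few close variants such as "Raj Goyal" or "R Goyal", but not unrelated names. You can modify this for other names as well.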
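
As a quick check, the pattern can be tried against the four forms from the question and against the false positives raised in the comments. This is only a sketch: the use of `re.fullmatch` (so that the whole string has to fit) and the variable names are my assumptions, not part of the answer.

```python
import re

# Pattern from the answer: R, optionally followed by "." or "aj",
# then an optional middle name given as "K", "K." or "Kumar", then the surname.
pattern = re.compile(r"R(\.|aj)? (K(\.|umar)? )?Goyal")

variants = ["R. K. Goyal", "Raj K. Goyal", "Raj Kumar Goyal", "R. Goyal"]
others = ["Raj Anything Goyal", "Rock Hudson and V. S. Goyal", "V. S. Goyal"]

# fullmatch requires the entire string to fit the pattern.
variant_results = {s: bool(pattern.fullmatch(s)) for s in variants}
other_results = {s: bool(pattern.fullmatch(s)) for s in others}

print(variant_results)
print(other_results)
```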


Tin Wizard · Accepted Answer · 2013-05-15 15:52:32Z

Fair warning: I haven't used Python in a while, so I won't be giving you specific function names.

If you're looking for a generic solution that will apply to any possible name, you're going to have to construct it dynamically.

ASSUMING that the first name is always the one that won't be dropped (I know people whose names follow the format "John David Smith" and go by David) you should be able to grab the first letter of the string and call that the first initial.

Next, you need to grab the last name. If you have no Jr.'s or Sr.'s or the like, you can just take the last word (find the last occurrence of ' ', then take everything after it).

From there, "<firstInitial>.* <lastName>" is a good start (in regex syntax "any run of characters" is ".*"; a bare "*" would only repeat the initial). If you bother to grab the whole first name as well, you can reduce your false positive matches further with "<firstInitial>(\.|<restOfFirstName>).* <lastName>", in the spirit of joon's answer.

If you want to get really fancy, detecting the presence of a middle name could reduce false positives even more.
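
A minimal Python sketch of this dynamic construction follows. It rests on the assumptions stated above: the first word is the first name and is never dropped, and the last word is the surname (no "Jr." or "Sr."). The helper name `build_name_pattern` is hypothetical, and the optional middle-name groups are one possible way to tighten the pattern, borrowed from the shape of joon's regex.

```python
import re

def build_name_pattern(full_name):
    """Build a regex for common abbreviations of full_name (hypothetical helper)."""
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError("expected at least a first and a last name")
    first, last = parts[0], parts[-1]
    # First initial, optionally followed by "." or the rest of the first name.
    pattern = re.escape(first[0]) + r"(\.|" + re.escape(first[1:]) + r")?"
    # Each middle name is optional and may appear as "X", "X." or in full.
    for middle in parts[1:-1]:
        pattern += r"( " + re.escape(middle[0]) + r"(\.|" + re.escape(middle[1:]) + r")?)?"
    # The surname must always be present.
    pattern += " " + re.escape(last)
    return re.compile(pattern)

goyal = build_name_pattern("Raj Kumar Goyal")
variants = ["R. K. Goyal", "Raj K. Goyal", "Raj Kumar Goyal", "R. Goyal"]
matches = [bool(goyal.fullmatch(v)) for v in variants]
rejected = bool(goyal.fullmatch("Rock Hudson and V. S. Goyal"))

print(matches, rejected)
```

`re.escape` is used so that a name containing regex metacharacters (a "." in an initial, for instance) is treated literally.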

qwwqwwq · Accepted Answer · 2013-05-15 16:35:38Z

I may be misunderstanding the problem, but I'm envisioning a solution where you iterate over the list of names and dynamically construct a new regexp for each name, and then store all of these regexps in a dictionary to use later:

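import re

names = ['John Kelly Smith', 'Billy Bob Jones', 'Joe James', 'Kim Smith']
regexps = {}
for name in names:
    elements = name.split()
    if len(elements) == 3:
        first, middle, last = elements
        # Optional first name or initial, optional middle name or initial, then the surname.
        # re.escape keeps any regex metacharacters in a name literal.
        pattern = r'(%s(\.|%s)?)?(\ )?(%s(\.|%s)? )?%s$' % (
            re.escape(first[0]), re.escape(first[1:]),
            re.escape(middle[0]), re.escape(middle[1:]),
            re.escape(last))
    elif len(elements) == 2:
        first, last = elements
        # First name or initial, then the surname.
        pattern = r'%s(\.|%s)? %s$' % (
            re.escape(first[0]), re.escape(first[1:]), re.escape(last))
    else:
        continue  # skip names that are not two or three words long

    regexps[name] = re.compile(pattern)

jksmith_regexp = regexps['John Kelly Smith']
print(bool(jksmith_regexp.match('K. Smith')))       # True
print(bool(jksmith_regexp.match('John Smith')))     # True
print(bool(jksmith_regexp.match('John K. Smith')))  # True
print(bool(jksmith_regexp.match('J. Smith')))       # True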

This way you can easily keep track of which regexp will find which name in your text.

And you can also do handy things like this:

if sum(bool(reg.match('K. Smith')) for reg in regexps.values()) > 1:
    print("This string matches multiple names!")

This checks whether a name in your text is ambiguous, that is, whether it matches the regexp of more than one person.
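
Building on that idea, the same dictionary of regexps could be used to group the variants found in a list under one canonical name, which is what the question is ultimately after. The sketch below is an illustration only: the patterns are written out by hand for brevity, and the helper `candidates` and the sample list `seen` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical canonical names and their patterns, written by hand for brevity.
regexps = {
    "John Kelly Smith": re.compile(r"(J(\.|ohn)?)?( )?(K(\.|elly)? )?Smith$"),
    "Kim Smith": re.compile(r"K(\.|im)? Smith$"),
    "Joe James": re.compile(r"J(\.|oe)? James$"),
}

def candidates(text, regexps):
    """Return every canonical name whose pattern matches text."""
    return [name for name, reg in regexps.items() if reg.match(text)]

seen = ["J. K. Smith", "K. Smith", "Joe James", "J. James", "Ann Lee"]
groups = defaultdict(list)  # canonical name -> variants assigned to it
ambiguous = []              # variants that fit more than one person
unmatched = []              # variants that fit nobody

for text in seen:
    hits = candidates(text, regexps)
    if len(hits) == 1:
        groups[hits[0]].append(text)
    elif hits:
        ambiguous.append(text)
    else:
        unmatched.append(text)

print(dict(groups))
print(ambiguous, unmatched)
```

Keeping ambiguous and unmatched strings in separate lists leaves them available for manual review rather than silently assigning them to the wrong person.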
