1

I need a regular expression in Python that can match different formats of the same name. For example, I have four different name formats for the same person:

R. K. Goyal, Raj K. Goyal, Raj Kumar Goyal, R. Goyal

What single regular expression would pick out all of these names from a list of thousands?

PS: My list has thousands of such names, so I need a generic solution that lets me combine the variants of each name. In the above example, R and Goyal could be used to write the regex.

Thanks

3
  • 1
    R.*Goyal is a regular expression that will match all of those names. However, it seems unlikely that you really want to use regular expressions to solve the general problem of grouping names that are likely the same person. Commented May 15, 2013 at 15:40
  • R.*Goyal will also match other names such as Raj Anything Goyal Commented May 15, 2013 at 15:41
  • Or indeed "Rock Hudson and V. S. Goyal" (the R from Rock and then skip everything before Goyal). Commented May 15, 2013 at 15:52
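The false positives described in these comments are easy to reproduce. A minimal check, where the candidate strings are simply the examples from the question and the comments:

```python
import re

# The loose pattern suggested in the first comment.
loose = re.compile(r"R.*Goyal")

candidates = [
    "R. K. Goyal",
    "Raj K. Goyal",
    "Raj Kumar Goyal",
    "R. Goyal",
    "Raj Anything Goyal",           # different middle name
    "Rock Hudson and V. S. Goyal",  # two different people
]

# search() finds the pattern anywhere in the string.
loose_results = {name: bool(loose.search(name)) for name in candidates}
for name, matched in loose_results.items():
    print("%r: %s" % (name, matched))
```

Every candidate matches, including the last two, which is why the pattern is too loose for grouping.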

3 Answers

1

"R(\.|aj)? (K(\.|umar)? )?Goyal" will only match those four cases. You can modify this for other names as well.

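A quick way to check the pattern above against the names from the question. This is only a sketch: the `wanted` and `unwanted` lists are test data assembled from the question and comments, and `fullmatch` is used on the assumption that each list entry holds exactly one name.

```python
import re

strict = re.compile(r"R(\.|aj)? (K(\.|umar)? )?Goyal")

wanted = ["R. K. Goyal", "Raj K. Goyal", "Raj Kumar Goyal", "R. Goyal"]
unwanted = ["Raj Anything Goyal", "Rock Hudson and V. S. Goyal"]

# fullmatch() requires the whole string to match, not just a part of it.
wanted_results = [bool(strict.fullmatch(name)) for name in wanted]
unwanted_results = [bool(strict.fullmatch(name)) for name in unwanted]

print(wanted_results)
print(unwanted_results)

# Initials written without a period are accepted too.
no_period = bool(strict.fullmatch("R K Goyal"))
print(no_period)
```

Because the period after each initial is optional, the pattern accepts slightly more than the four listed forms, for example "R K Goyal".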
1

Fair warning: I haven't used Python in a while, so I won't be giving you specific function names.

If you're looking for a generic solution that will apply to any possible name, you're going to have to construct it dynamically.

ASSUMING that the first name is always the one that won't be dropped (I know people whose names follow the format "John David Smith" and go by David) you should be able to grab the first letter of the string and call that the first initial.

Next, you need to grab the last name. If you have no suffixes such as Jr. or Sr., you can just take the last word (find the last occurrence of ' ', then take everything after that).

From there, "<firstInitial>.* <lastName>" is a good start (note the ".*" wildcard; a bare "*" would only repeat the initial itself). If you bother to grab the whole first name as well, you can reduce your false positive matches further with "<firstInitial>(\.|<restOfFirstName>)? <lastName>" as in joon's answer.

If you want to get really fancy, detecting the presence of a middle name could reduce false positives even more.
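The steps above could be sketched in Python roughly as follows. The function name `build_name_pattern` and the helper `initial_or_full` are hypothetical, and the sketch assumes that the first word is the first name, the last word is the last name, and everything in between is an optional middle name.

```python
import re

def build_name_pattern(full_name):
    """Build a regex matching common abbreviations of full_name.

    Assumption: first word = first name, last word = last name,
    no suffixes such as Jr. or Sr.
    """
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError("need at least a first and a last name")
    first, middles, last = parts[0], parts[1:-1], parts[-1]

    def initial_or_full(word):
        # Matches "R", "R." or the whole word, e.g. "Raj".
        return r"%s(?:\.|%s)?" % (re.escape(word[0]), re.escape(word[1:]))

    pattern = initial_or_full(first) + " "
    for middle in middles:
        # Each middle name may be abbreviated or dropped entirely.
        pattern += "(?:%s )?" % initial_or_full(middle)
    pattern += re.escape(last)
    return re.compile(pattern)

goyal = build_name_pattern("Raj Kumar Goyal")
variants = ["R. K. Goyal", "Raj K. Goyal", "Raj Kumar Goyal", "R. Goyal"]
variant_results = [bool(goyal.fullmatch(v)) for v in variants]
print(variant_results)
```

`re.escape` keeps characters such as a period in a name from being interpreted as regex syntax. The sketch deliberately ignores people who go by their middle name, as noted above.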

1

I may be misunderstanding the problem, but I'm envisioning a solution where you iterate over the list of names and dynamically construct a new regexp for each name, and then store all of these regexps in a dictionary to use later:

import re

names = ['John Kelly Smith', 'Billy Bob Jones', 'Joe James', 'Kim Smith']
regexps = {}
for name in names:
    elements = name.split()
    if len(elements) == 3:
        # Optional first name or initial, optional middle name or initial,
        # then the last name. re.escape keeps name parts literal.
        first, middle, last = elements
        pattern = r'(%s(\.|%s)?)?(\ )?(%s(\.|%s)? )?%s$' % (
            re.escape(first[0]), re.escape(first[1:]),
            re.escape(middle[0]), re.escape(middle[1:]),
            re.escape(last))
    elif len(elements) == 2:
        # First name or initial, then the last name.
        first, last = elements
        pattern = r'%s(\.|%s)? %s$' % (
            re.escape(first[0]), re.escape(first[1:]), re.escape(last))
    else:
        # Names with one word or more than three words are skipped.
        continue

    regexps[name] = re.compile(pattern)

jksmith_regexp = regexps['John Kelly Smith']
print(bool(jksmith_regexp.match('K. Smith')))
print(bool(jksmith_regexp.match('John Smith')))
print(bool(jksmith_regexp.match('John K. Smith')))
print(bool(jksmith_regexp.match('J. Smith')))

This way you can easily keep track of which regexp will find which name in your text.

And you can also do handy things like this:

if sum(bool(reg.match('K. Smith')) for reg in regexps.values()) > 1:
    print("This string matches multiple names!")

This checks whether a name in your text is ambiguous, that is, whether it matches more than one of the known names.
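Extending that idea, the same dictionary could be used to group the strings found in a text under their full names while setting ambiguous ones aside. This is only a sketch: the two hand-written patterns stand in for the dictionary built above, and `observed`, `groups`, `ambiguous` and `unmatched` are hypothetical names.

```python
import re
from collections import defaultdict

# Stand-ins for the dictionary built above (full name -> compiled regex).
regexps = {
    "John Kelly Smith": re.compile(r"(J(\.|ohn)?)?(\ )?(K(\.|elly)? )?Smith$"),
    "Kim Smith": re.compile(r"K(\.|im)? Smith$"),
}

# Hypothetical name strings as they might appear in the text.
observed = ["J. Smith", "John K. Smith", "Kim Smith", "K. Smith", "Joe James"]

groups = defaultdict(list)  # full name -> variants that match only that name
ambiguous = []              # variants that match more than one full name
unmatched = []              # variants that match no known name

for text in observed:
    hits = [name for name, reg in regexps.items() if reg.match(text)]
    if len(hits) == 1:
        groups[hits[0]].append(text)
    elif hits:
        ambiguous.append(text)
    else:
        unmatched.append(text)

print(dict(groups))
print(ambiguous)
print(unmatched)
```

Here "K. Smith" lands in `ambiguous` because it fits both John Kelly Smith and Kim Smith, so it would need a manual decision or extra context.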
