Python RegEx for name standardization

Question

I am trying to write a regular expression to standardize names.

Use case:

J. J. Abrams -> JJ Abrams
J J Abrams -> JJ Abrams
J.J Abrams -> JJ Abrams
J.J. Abrams -> JJ Abrams
J J  Abrams -> JJ Abrams (multiple spaces)

The initials can appear at the end or in the middle of the name. In general an initial can have spaces or a '.' or a word boundary before or after it.

So I came up with the following:

p = re.compile(r'((\b|\s+|\.)[a-z](\.|\s+|\b))', re.I)

When I try to match and print the result, it looks wrong:

p.subn(lambda g: g.groups()[0].strip().strip('.'), "J J Abrams")
('JJAbrams', 2)

How do I retain the space before (or after) the non-initial part?

Edit: Also, I should have made it clear that there can be more than just 2 initials in the name. The above was just one random use case. Thanks

Might be hard for some names like D'J M O'Brien, Doris Di-O Y.
– wp78de, Commented May 31, 2018 at 7:32
@SvenKrüger it's starting to look like a plain string manipulation job now.
– sisanared, Commented May 31, 2018 at 7:59
@wp78de, they will stay as is. Only the initials with '.'s will get normalized
– sisanared, Commented May 31, 2018 at 7:59
sisanared: can you check my answer and let me know if it is working for you?
– Allan, Commented May 31, 2018 at 8:59

SamWhan · Accepted Answer · 2018-05-31 10:47:44Z

3

For the cases given, replacing

(?<=\b[A-Z]\b)[. ]+(?=[A-Z]\b)|\.|(\s)\s+

with

$1

should do it.

Using alternation, it matches spaces and dots between initials, dots anywhere, or runs of more than one whitespace character. The last alternative captures the first whitespace character.

Replacing this with $1 removes matches from the first two alternations and in the third case (several spaces) replaces them with a single one (the first that's captured).

See it here at regex101.
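
In Python, the `$1` replacement is written `\1` or expressed as a replacement function. The following is a minimal sketch, assuming Python's `re` module; the name `standardize` is only illustrative. A function is used because group 1 is unset for the first two alternatives, and Python versions before 3.5 raise an error when `\1` refers to an unmatched group.

```python
import re

# Same pattern as above: dots/spaces between initials, any other dot,
# or a run of whitespace whose first character is captured.
PATTERN = re.compile(r'(?<=\b[A-Z]\b)[. ]+(?=[A-Z]\b)|\.|(\s)\s+')

def standardize(name):
    # m.group(1) is None unless the third alternative matched,
    # so fall back to an empty string in that case.
    return PATTERN.sub(lambda m: m.group(1) or '', name)

for raw in ["J. J. Abrams", "J J Abrams", "J.J Abrams", "J.J. Abrams", "J J  Abrams"]:
    print(standardize(raw))
```

Each of the five inputs from the question should print as `JJ Abrams`. Because the pattern only looks at one pair of neighbouring initials at a time, it should also cope with longer runs such as `J. R. R. Tolkien`.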

1 Comment

Allan Over a year ago

Great answer +1! It's great that we can process it in one pass.

Allan · Accepted Answer · 2018-05-31 07:22:09Z

1

I think you could do it in 2 steps by using regex:

step 1:

regex:

 +|\. *

and replacement (a single space)

step 1 demo

step 2:

regex:

\b([a-z]) ([a-z])\b

replacement: \1\2

step 2 demo

By combining everything you have:

Input file:

$ cat names
J. J. Abrams
J J Abrams
J.J Abrams
J.J. Abrams
J J  Abrams
J  Abrams J.
Abrams J. J.
Abrams J J

python code:

$ cat names_norm.py 
import re
import sys

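with open("names") as file:
    for line in file:
        # Step 1: collapse runs of spaces, and dots with any trailing spaces, into one space
        line = re.sub(r" +|\. *", " ", line)
        # Step 2: join two adjacent single-letter initials (raw string avoids an invalid "\g" escape)
        line = re.sub(r"\b([a-zA-Z]) ([a-zA-Z])\b", r"\g<1>\g<2>", line)
        sys.stdout.write(line)
sys.stdout.flush()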

Output:

$ python names_norm.py
JJ Abrams
JJ Abrams
JJ Abrams
JJ Abrams
JJ Abrams
J Abrams J 
Abrams JJ 
Abrams JJ
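
The question's edit mentions that a name can carry more than two initials, while step 2 above joins only one pair per match (`J R R Tolkien` would become `JR R Tolkien`). One possible adjustment, offered as a sketch rather than as part of the original answer, is to consume only the first initial and the space, and to check the second initial with a lookahead so that it can start the next match. The function name `normalize` is hypothetical, and the final `strip()` is an added assumption that trailing spaces and newlines are unwanted.

```python
import re

def normalize(line):
    # Step 1 (unchanged): runs of spaces, or a dot plus trailing spaces -> one space
    line = re.sub(r" +|\. *", " ", line)
    # Step 2 (modified): drop the space after an initial when another
    # single-letter word follows; the lookahead leaves that letter unconsumed
    line = re.sub(r"\b([a-zA-Z]) (?=[a-zA-Z]\b)", r"\1", line)
    # Assumption: surrounding whitespace (including the newline) is not wanted
    return line.strip()

print(normalize("J. R. R. Tolkien"))
print(normalize("Abrams J. J."))
```

With this change the two calls should print `JRR Tolkien` and `Abrams JJ`.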

1 Comment

sisanared Over a year ago

I ended up writing it without using any regex. Your solution does work. Thanks

Austin · Accepted Answer · 2018-05-31 06:59:44Z

0

Use:

re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)

Code:

>>> s = 'J. J. Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

>>> s = 'J J Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

>>> s = 'J.J Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

>>> s = 'J.J.  Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

>>> s = 'J J      Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

edited May 31, 2018 at 6:59

answered May 31, 2018 at 6:54

3 Comments

sisanared Over a year ago

this looks good, just one nitpick. Try "Abrams J J": there is an extra space at the end. I can always strip the string, but just thought I'd point it out

Austin Over a year ago

@sisanared that's working for me. You are expecting "Abrams JJ" right?

sisanared Over a year ago

Correct, but I'm seeing "Abrams JJ ". I'm using Py 2.7. Also, please see the edit, there can be cases with more than 2 initials in the name.

NeverHopeless · Accepted Answer · 2018-05-31 07:52:24Z

0

You could find all runs of consecutive letters and join them, inserting a space before the last one:

import re

if __name__ == '__main__':
    names = ["J. J. Abrams", "J J Abrams", "J.J Abrams", "J.J. Abrams", "J J  Abrams", "J J J  Abrams"]
    for name in names:
        res = re.findall("([a-z]+)", name, re.I)       # Find all runs of consecutive letters
        res.insert(len(res) - 1, " ")                  # Insert a space before the last element (assumes the surname comes last)
        print("res : %s" % ("".join(map(str, res))))   # Join the parts and display the formatted name

Result:

res : JJ Abrams
res : JJ Abrams
res : JJ Abrams
res : JJ Abrams
res : JJ Abrams
res : JJJ Abrams

Sven-Eric Krüger · Accepted Answer · 2018-06-01 10:59:35Z

Because you only want to filter out the "." and the whitespace in certain positions, I would suggest using only standard string methods.

Replace all dot characters with one space.
Iterate over all substrings separated by whitespace, each stripped of leading and trailing whitespace.
Add one leading and one trailing space to the elements longer than one character.
Put the parts back together again without spaces.
Strip leading and trailing space characters from the whole string.

For your pleasure I wrote it as a completely unreadable one-liner.

names = ['J. R. R. Tolkien',  # "." and " "
         'Abrams  J J',       # " "
         'J.J Abrams',        # "." in between
         'J.R.R. Tolkien',    # "."
         'J R.R Tolkien']     # mixed

for name in names:
    name = "".join([(" {} ".format(elem)) if len(elem) > 1 else elem for elem in name.replace('.', ' ').split()]).strip()

    print(name)

This leads to this output.

JRR Tolkien
Abrams JJ
JJ Abrams
JRR Tolkien
JRR Tolkien
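
The listed steps can also be written as a more readable function. The sketch below is a variation, not the author's code: instead of padding long words with spaces, it groups consecutive single-character tokens and joins the result with single spaces, which also avoids the double space the one-liner would leave between two adjacent long words. The name `normalize_name` is made up for illustration.

```python
def normalize_name(name):
    """Merge runs of single-letter initials; keep other words space-separated."""
    parts = []      # finished output words
    initials = []   # current run of single-character tokens
    # Dots become spaces, then split() discards all surplus whitespace
    for token in name.replace('.', ' ').split():
        if len(token) == 1:
            initials.append(token)
        else:
            if initials:
                # A longer word ends the current run of initials
                parts.append(''.join(initials))
                initials = []
            parts.append(token)
    if initials:
        # Flush initials that appear at the end of the name
        parts.append(''.join(initials))
    return ' '.join(parts)

for name in ['J. R. R. Tolkien', 'Abrams  J J', 'J R.R Tolkien']:
    print(normalize_name(name))
```

No regular expression is involved, and names such as `D'J M O'Brien` pass through unchanged because their tokens are either longer than one character or not adjacent to another initial.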

EDIT

@ClasG's solution might be unreadable, too, but my solution takes about twice as long to compute.

Elapsed time, mean value: 6.92432632016e-06
Elapsed time, mean value: 1.5598555044e-05
