2

I am trying to write a regular expression to standardize names.

Use case:

J. J. Abrams -> JJ Abrams
J J Abrams -> JJ Abrams
J.J Abrams -> JJ Abrams
J.J. Abrams -> JJ Abrams
J J  Abrams -> JJ Abrams (multiple spaces)

The initials can appear at the end or in the middle of the name. In general an initial can have spaces or a '.' or a word boundary before or after it.

So I came up with the following:

p = re.compile(r'((\b|\s+|\.)[a-z](\.|\s+|\b))', re.I)

When I try to match and print the result, it looks wrong:

p.subn(lambda g: g.groups()[0].strip().strip('.'), "J J Abrams")
('JJAbrams', 2)

How do I retain the space before (or after) the non-initial part?

Edit: I should also have made it clear that there can be more than two initials in the name. The above was just one example use case. Thanks.

7
  • Does it necessarily have to be a regular expression? Commented May 31, 2018 at 7:24
  • Might be hard for some names like D'J M O'Brien, Doris Di-O Y. Commented May 31, 2018 at 7:32
  • @SvenKrüger it's starting to look like a plain string manipulation job now. Commented May 31, 2018 at 7:59
  • @wp78de, they will stay as is. Only the initials with '.'s will get normalized Commented May 31, 2018 at 7:59
  • sisanared: can you check my answer and let me know if it is working for you? Commented May 31, 2018 at 8:59

5 Answers

3

For the cases given, replacing

(?<=\b[A-Z]\b)[. ]+(?=[A-Z]\b)|\.|(\s)\s+

with

$1

should do it.

Using alternation, it matches three things: runs of spaces and dots between initials, a dot anywhere, or a run of two or more whitespace characters. The last alternative captures the first whitespace character.

Replacing this with $1 removes matches from the first two alternations and in the third case (several spaces) replaces them with a single one (the first that's captured).

See it here at regex101.
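
A minimal Python sketch of the same substitution, not part of the original answer (the names `PATTERN` and `normalize` are made up for illustration). It assumes the pattern behaves under Python's `re` module as it does on regex101, and it passes a function as the replacement so that matches where group 1 did not participate are replaced with an empty string instead of relying on how a `\1` template treats an unmatched group:

```python
import re

# Same three alternatives as in the answer:
#   1. dots/spaces between two single-letter initials -> removed
#   2. any remaining dot                               -> removed
#   3. two or more whitespace characters               -> first one kept (group 1)
PATTERN = re.compile(r"(?<=\b[A-Z]\b)[. ]+(?=[A-Z]\b)|\.|(\s)\s+")


def normalize(name):
    """Return the name with initials joined and extra dots/spaces removed."""
    # group(1) is None for the first two alternatives, so fall back to "".
    return PATTERN.sub(lambda m: m.group(1) or "", name)


if __name__ == "__main__":
    for raw in ["J. J. Abrams", "J J  Abrams", "J. R. R. Tolkien", "Abrams J J"]:
        print(normalize(raw))
```

Because the lookahead does not consume the next initial, the first alternative can match repeatedly, so names with more than two initials should be handled as well.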

1 Comment

Great answer +1! It's great that we can process it in one pass.
1

I think you could do it in 2 steps by using regex:

step 1:

regex:

 +|\. *

and replacement (a single space)

step 1 demo

step 2:

regex:

\b([a-z]) ([a-z])\b

replacement: \1\2

step 2 demo

By combining everything you have:

Input file:

$ cat names
J. J. Abrams
J J Abrams
J.J Abrams
J.J. Abrams
J J  Abrams
J  Abrams J.
Abrams J. J.
Abrams J J

python code:

$ cat names_norm.py 
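import re
import sys

with open("names") as names_file:
    for line in names_file:
        # Step 1: collapse runs of spaces, and dots plus any following spaces, into one space.
        line = re.sub(r" +|\. *", " ", line)
        # Step 2: join two adjacent single-letter initials (raw string keeps \g<1> intact).
        line = re.sub(r"\b([a-zA-Z]) ([a-zA-Z])\b", r"\g<1>\g<2>", line)
        sys.stdout.write(line)
sys.stdout.flush()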

Output:

$ python names_norm.py
JJ Abrams
JJ Abrams
JJ Abrams
JJ Abrams
JJ Abrams
J Abrams J 
Abrams JJ 
Abrams JJ
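
The question's edit mentions names with more than two initials. As a hedged variation on step 2 (not the code from this answer), the second initial can be checked with a lookahead instead of being consumed, so the same substitution applies to each initial in a chain. The function name `normalize` is hypothetical, and the final `strip()` is an added assumption to drop the trailing space seen in the output above:

```python
import re


def normalize(name):
    """Two-step normalization, with step 2 adapted for chains of initials."""
    # Step 1 (unchanged): runs of spaces, or a dot plus following spaces, become one space.
    name = re.sub(r" +|\. *", " ", name)
    # Step 2 (variation): remove the space after an initial whenever the next
    # token is also a single letter. The lookahead leaves that letter unconsumed,
    # so it can start the next match.
    name = re.sub(r"\b([a-zA-Z]) (?=[a-zA-Z]\b)", r"\1", name)
    # Assumption: leading and trailing whitespace is not wanted in the result.
    return name.strip()


if __name__ == "__main__":
    for raw in ["J. R. R. Tolkien", "Abrams J. J.", "J  Abrams J."]:
        print(normalize(raw))
```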

1 Comment

I ended up writing it without using any regex. Your solution does work. Thanks
0

Use:

re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)

Code:

>>> s = 'J. J. Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

>>> s = 'J J Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

>>> s = 'J.J Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

>>> s = 'J.J.  Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

>>> s = 'J J      Abrams'
>>> re.sub(r'(?<!\w)([A-Z])\.*\s*(?<!\w)([A-Z])\.*\s*([A-Za-z]*)', r'\1\2 \3', s)
JJ Abrams

3 Comments

This looks good, just one nitpick. Try 'Abrams J J': there is an extra space at the end. I can always strip the string, but just thought I'd point it out.
@sisanared that's working for me. You are expecting "Abrams JJ" right?
Correct, but I'm seeing "Abrams JJ ". I'm using Py 2.7. Also, please see the edit: there can be cases with more than 2 initials in the name.
0

You could find all contiguous runs of letters and join them, inserting a space before the last one:

import re

if __name__ == '__main__':
    names = ["J. J. Abrams", "J J Abrams", "J.J Abrams", "J.J. Abrams", "J J  Abrams", "J J J  Abrams"]
    for name in names:
        res = re.findall("([a-z]+)", name, re.I)   # Find all contiguous runs of letters
        res.insert(len(res) - 1, " ")              # Insert a space before the last element
        print("res : %s" % "".join(res))           # Join the pieces and print the result

Result:

res : JJ Abrams
res : JJ Abrams
res : JJ Abrams
res : JJ Abrams
res : JJ Abrams
res : JJJ Abrams

0

Because you only want to filter out the "." and the white space in certain positions, I would suggest using only standard string methods.

  1. Replace all dot characters with one space.
  2. Iterate over all substrings separated by white space, each stripped of leading and trailing white space.
  3. Add one leading and one trailing space to the elements longer than one character.
  4. Put the parts back together again without spaces.
  5. Strip the whole string from leading and trailing space characters.
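
The steps above can be sketched as a readable function. This is an illustrative sketch only; the name `normalize_name` is hypothetical and does not come from the answer:

```python
def normalize_name(name):
    """Join single-letter initials and keep longer words separated by spaces."""
    parts = []
    # Steps 1-2: turn dots into spaces, then split on any run of whitespace.
    for elem in name.replace(".", " ").split():
        # Step 3: pad words longer than one character; initials stay bare.
        parts.append(" {} ".format(elem) if len(elem) > 1 else elem)
    # Steps 4-5: join without a separator and trim the outer spaces.
    return "".join(parts).strip()


if __name__ == "__main__":
    for raw in ["J. R. R. Tolkien", "Abrams  J J", "J R.R Tolkien"]:
        print(normalize_name(raw))
```

One consequence of step 3 is that two adjacent words longer than one character end up separated by two spaces, since each contributes its own padding.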

For your pleasure, I wrote it as a completely unreadable one-liner.

names = ['J. R. R. Tolkien',  # "." and " "
         'Abrams  J J',       # " "
         'J.J Abrams',        # "." in between
         'J.R.R. Tolkien',    # "."
         'J R.R Tolkien']     # mixed

for name in names:
    name = "".join([(" {} ".format(elem)) if len(elem) > 1 else elem for elem in name.replace('.', ' ').split()]).strip()

    print(name)

This produces the following output.

JRR Tolkien
Abrams JJ
JJ Abrams
JRR Tolkien
JRR Tolkien

EDIT

@ClasG's solution might be unreadable too, but mine takes about twice as long to run.

Elapsed time, mean value: 6.92432632016e-06
Elapsed time, mean value: 1.5598555044e-05
