I would like to reconstruct full names from photo captions using a regex in Python, by appending the last name to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.

For example,

'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'

'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'

I need to avoid matching patterns like 'Jay Smith and Albert Diamond', which would generate the non-existent name 'Smith Diamond'.

The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'

This is the code I have so far:

import re

s = 'John and Albert McDonald'
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    # First name from group 1 plus the last name taken from group 2
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))

This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.

One idea is to check whether the pattern is preceded by a capitalized word, with something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately Python's re module only supports fixed-width lookbehind, and I don't know the length of the preceding word in advance.
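A minimal sketch of that limitation, assuming the standard-library re module (the variable name lookbehind_error is only for illustration). Compiling a pattern whose lookbehind contains a quantifier such as + is rejected before any matching happens, typically with a message saying that look-behind requires a fixed-width pattern:

```python
import re

# The lookbehind contains `+`, so its width is not fixed.
pattern = r'(?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+)'

try:
    re.compile(pattern)
    lookbehind_error = None
except re.error as exc:
    # The re module refuses variable-width lookbehind at compile time.
    lookbehind_error = str(exc)

print(lookbehind_error)
```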

Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!

  • Can you rely on the names starting with a capital letter? Commented Apr 2, 2017 at 6:39
  • For a problem like this, you should write the test cases first. Commented Apr 2, 2017 at 6:41
  • @Vallentin Yes, we can rely on names starting with a capital letter. Commented Apr 2, 2017 at 6:52
  • @AshishNitinPatil Sorry, I'm new to Python and not really familiar with the unittest module for writing test cases... Commented Apr 2, 2017 at 6:59

1 Answer

Since you can rely on names starting with a capital letter, you could use something like:

((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)

Swapping this pattern in for your current one should work with your existing Python code. You do need to strip() the captured groups though, because they can include trailing whitespace.

For your examples, the current code would then yield:

| Input | First print | Second print |
| --- | --- | --- |
| John and Albert McDonald | John McDonald | Albert McDonald |
| Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond |
| It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond |
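
Putting the pieces together, here is a rough sketch of how the pattern could be used, rather than a definitive implementation. The helper name expand_names is hypothetical. It also rests on one assumption that goes beyond the pattern itself: when group 1 captures more than one capitalized word (as in 'Jay Smith and Albert Diamond'), those words are treated as an already complete name and left alone, which avoids producing 'Smith Diamond'.

```python
import re

# Pattern from the answer: group 1 is the run of capitalized words before
# "and", group 2 is the run of capitalized words after it.
PATTERN = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')


def expand_names(caption):
    """Return the two full names around "and", or [] if nothing usable matches.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.search(caption)
    if not match:
        return []
    # split() also discards the surrounding whitespace, so no strip() is needed.
    before = match.group(1).split()
    after = match.group(2).split()
    if len(after) < 2:
        # No last name follows the second first name, so nothing can be shared.
        return []
    second = ' '.join(after)
    if len(before) > 1:
        # Assumption: several capitalized words before "and" are a full name.
        return [' '.join(before), second]
    # Assumption: everything after the first word of group 2 is the last name.
    return [before[0] + ' ' + ' '.join(after[1:]), second]


for caption in ('John and Albert McDonald', 'Jay Smith and Albert Diamond'):
    print(expand_names(caption))
```

One limitation to keep in mind: a capitalized non-name word directly before the first name (for example a sentence starting with 'Yesterday John and Albert Diamond') would be mistaken for part of a full name, so captions like that would need extra handling.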