1

I have some sentences:

text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"

I want to wrap the untagged "Obama" in <PERSON></PERSON> tags, so that the result looks like this:
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

In other words, I want to find a substring (for example, Obama) that has no <PERSON> tag before it and no </PERSON> tag after it, but I don't know the right regex syntax in Python.
I'm new to Python.

A simple re.sub(namedEntity, "<PERSON>"+namedEntity+"</PERSON>", text) gives this output:
The president of America is <PERSON>Barack <PERSON>Obama</PERSON></PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

This is my code (using Python 2.7):

import re

result=re.sub(r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)

print "result: "+result

The output:
result: <PERSON>Obama</PERSON>
I can't tell whether that is the first "Obama" or the second one.

Thanks in advance for your help.

2
  • Did you copy the code from somewhere? Do you understand what that regex is doing? Commented Mar 6, 2016 at 18:18
  • I tried the regex on regex101.com/#python after learning from this answer: stackoverflow.com/questions/6259443/… . Maybe I'm wrong, because I assumed that (?!regex) means "does not contain regex". Commented Mar 6, 2016 at 18:34

2 Answers

2

You are very close. In your new regex r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))", the .* before and after the name are greedy, so the match covers 'Obama' plus everything before and after it. The lookaheads are then only checked at the very start and very end of the text, where they trivially succeed, and the existing tags end up inside the matched group. If you remove the .* parts, you get the result you're after.

>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> namedEntity = 'Obama'
>>> result = re.sub(r"((?!<PERSON>)"+namedEntity+"(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)
>>> print result
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012

For future regex testing, regex101 works well: it shows live how the matches change as you edit the pattern, including what is happening in your case.
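The explanation above about the greedy .* can be checked directly. The following is a minimal sketch, not part of the original answer; the variable names `named_entity`, `original` and `match` are chosen here for illustration, and `print(...)` with a single argument is used so it runs on both Python 2.7 and Python 3.

```python
import re

text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")
named_entity = "Obama"

# The pattern from the question, with greedy .* on both sides of the name.
original = r"((?!<PERSON>).*" + named_entity + r".*(?!</PERSON>))"
match = re.search(original, text)

# The single match spans the entire text, so re.sub replaces all of it
# and only the replacement string is left.
print(match.group(0) == text)
print(re.sub(original, "<PERSON>" + named_entity + "</PERSON>", text))
```

Because one match consumes the whole string, the question of whether the "first" or "second" Obama was replaced does not really apply: neither was matched on its own.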


7 Comments

Shouldn't it be (?<!<PERSON>), i.e. a negative lookbehind? I actually got confused there.
@noob, I don't think so. You want to ignore the matches that already have the tags around them.
Just an additional ideone.com demo to show that this is the correct answer (+1).
Do ?! and ?<! only look ahead and behind at the positions immediately after and before the substring? Because when I try it on "The president of America is <PERSON>Barack Obama Aaa</PERSON>. He was born on August 4, 1961. Obama Aaa was reelected president in November 2012", it gives the result "The president of America is <PERSON>Barack <PERSON>Obama</PERSON> Aaa</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> Aaa was reelected president in November 2012" cc @Holloway
@KhusnaNadia, that's true, it relies on the tags being around the name.
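For the case raised in these comments, where the name sits in the middle of an already tagged span, one possible workaround is to match whole tagged spans first and leave them untouched. This is a sketch under assumptions, not the answerer's method: the helper name `tag_untagged` and its parameters are hypothetical, and it assumes that tags are never nested and are always properly closed.

```python
import re

def tag_untagged(text, name, tag="PERSON"):
    """Wrap occurrences of `name` in <tag>...</tag>, skipping tagged spans.

    Hypothetical helper: assumes tags are not nested and always closed.
    """
    open_tag = "<%s>" % tag
    close_tag = "</%s>" % tag
    # Alternation: an entire existing tagged span, OR the bare name.
    # Tagged spans are consumed whole, so names inside them are never
    # seen by the second alternative.
    pattern = re.compile(
        re.escape(open_tag) + r".*?" + re.escape(close_tag)
        + "|" + re.escape(name),
        re.DOTALL,
    )

    def repl(m):
        found = m.group(0)
        if found.startswith(open_tag):
            return found  # already tagged: keep unchanged
        return open_tag + found + close_tag

    return pattern.sub(repl, text)

text = ("The president of America is <PERSON>Barack Obama Aaa</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama Aaa was reelected president in November 2012")
print(tag_untagged(text, "Obama"))
```

With this input, the "Obama" inside <PERSON>Barack Obama Aaa</PERSON> is left alone and only the later, untagged occurrence is wrapped. Using re.escape on the name also keeps names containing regex metacharacters from being interpreted as patterns.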
1

Just remove the .* parts from your regex:

>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> surname = re.search(r'<PERSON>(.*)</PERSON>', text).group(1).split()[1]
>>> print surname
Obama
>>> re.sub(r'(?<!<PERSON>)'+surname+'(?!</PERSON>)', '<PERSON>'+surname+'</PERSON>', text)
'The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012'

Note: you can also extract the person's surname with a regex capture group, which is how I filled the surname variable. Use (?<!regex) for a negative lookbehind and (?!regex) for a negative lookahead.
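To illustrate the difference between the two assertions mentioned in the note, here is a small sketch (the variable names `tagged`, `ahead` and `behind` are made up for this example). A negative lookahead placed before the name inspects the text starting at the name itself, so (?!<PERSON>) can never fail there; a negative lookbehind inspects the characters just before the name.

```python
import re

tagged = "<PERSON>Obama</PERSON>"

# Lookahead in front of the name: at that position the upcoming text is
# "Obama...", which is never "<PERSON>", so the assertion always passes.
ahead = re.sub(r"(?!<PERSON>)Obama", "X", tagged)

# Lookbehind in front of the name: the preceding text is "<PERSON>",
# so the assertion fails and nothing is replaced.
behind = re.sub(r"(?<!<PERSON>)Obama", "X", tagged)

print(ahead)   # <PERSON>X</PERSON>
print(behind)  # <PERSON>Obama</PERSON>
```

Both assertions only examine the text directly adjacent to the match position, which is why a name in the middle of a tagged span is not protected by either of them.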
