
I have some user reviews that were previously scraped from a website, and I am trying to clean up the text for text analysis. There are several a href tags in the text that I would like to remove. For example, here is a portion of text contained in a paragraph:

'We had a <a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow"&gt;restaurants.com</a&gt; $25 gift certificate, so we visited this restaurant.'

I would like to remove this portion from the string:

<a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow"&gt;restaurants.com</a&gt;

I am not an expert on regex, so the best I could do so far is:

import re
mytext = re.sub(r'<a href\S+', '', mytext)

But this removes only part of what I want to get rid of, as shown below:

print(mytext)
'We had a  target="_blank" rel="nofollow"&gt;restaurants.com</a&gt; $25 gift certificate, so we visited this restaurant.'

I searched a lot for a solution but could only find one for JavaScript, along with several posts that warn against using regex to parse HTML, which I guess does not apply to my case as I am processing a string. I guess if I read more about regex I could get this done, but I am looking for a quick solution. I would really appreciate any help.

  • Does the text really have &gt; instead of >? That's strange...wonder why that's escaped, but not the <. Commented Jan 25, 2022 at 19:19
  • @David784 Yes, that's right. Someone else scraped the content from a website, so I don't know why those characters are in there. Commented Jan 25, 2022 at 19:27
  • See this answer: You can't parse [X]HTML with regex Commented Jan 25, 2022 at 19:34

2 Answers

import re
# st is the review text; joining the captured groups gives back the full matched link
''.join(re.findall(r'(<a href)(.+?)(/a&gt;)', st)[0])

That'll work for your example. If you have multiple href links, you could use:

[''.join(entry) for entry in re.findall('(<a href)(.+?)(/a&gt;)', st)]

1 Comment

Thank you! I am accepting this answer as it helped solve my problem. I used it slightly differently though. I just used re.sub(r'(<a href)(.+?)(/a&gt;)','',mytext) to remove the href links. Using this code I was able to remove multiple links in a string.
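A minimal sketch of that re.sub approach, assuming the links always end with the escaped closing tag '</a&gt;' exactly as in the question. The names remove_links, LINK_PATTERN, and the sample string are hypothetical, made up for illustration.

```python
import re

# Matches from '<a href' through the escaped closing tag '</a&gt;'.
# The non-greedy '.+?' stops at the first closing tag, so each link
# is matched separately instead of swallowing the text between links.
LINK_PATTERN = re.compile(r'<a href.+?</a&gt;')

def remove_links(text):
    """Return text with every '<a href ... </a&gt;' span removed."""
    return LINK_PATTERN.sub('', text)

# Hypothetical sample with two links in one string.
sample = ('Try <a href="/x" rel="nofollow"&gt;one</a&gt; and '
          '<a href="/y"&gt;two</a&gt; today.')
print(remove_links(sample))  # Try  and  today.
```

Note that '.' does not match a newline by default, so a link that spans several lines would need the re.DOTALL flag.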

Since you are looking for a quick solution, keep it basic and use string manipulation.

input_string = 'We had a <a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow"&gt;restaurants.com</a&gt; $25 gift certificate, so we visited this restaurant.'
# Everything before the first '<a href' is kept as-is
parts = input_string.split('<a href')
first_part = parts[0]
# Everything after the last '</a&gt;' is kept as-is
second_part = parts[-1].split('</a&gt;')[-1]
# Note: with several links, any text between them would be dropped too
new_string = first_part + second_part
print(new_string)  # We had a  $25 gift certificate, so we visited this restaurant.
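If a review may contain more than one link, the same string-only idea can be applied in a loop. This is only a sketch, assuming the literal markers '<a href' and '</a&gt;' from the example; strip_links and its parameters are hypothetical names.

```python
def strip_links(text, start='<a href', end='</a&gt;'):
    """Remove every span from `start` through `end` using str.find only."""
    pieces = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no more links
        stop = text.find(end, begin)
        if stop == -1:
            break  # unterminated link: leave the rest untouched
        pieces.append(text[pos:begin])  # keep the text before this link
        pos = stop + len(end)           # resume after the closing marker
    pieces.append(text[pos:])           # keep whatever follows the last link
    return ''.join(pieces)

print(strip_links('a <a href="/x"&gt;x</a&gt; b <a href="/y"&gt;y</a&gt; c'))  # a  b  c
```

Unlike the split-based version, this keeps the text that sits between two links.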
