How to remove a href tags from a string?

Question

I have some user reviews that were previously scraped from a website, and I am trying to clean up the text for text analysis. The text contains several a href tags that I would like to remove. For example, here is a portion of text from one paragraph:

'We had a <a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow"&gt;restaurants.com</a&gt; $25 gift certificate, so we visited this restaurant.'

I would like to remove this portion from the string:

<a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow"&gt;restaurants.com</a&gt;

I am not an expert on regex, so the best I could do so far is:

import re
re.sub(r'<a href\S+', '', mytext)

But this removes only part of what I want to get rid of, as shown below:

print(mytext)
'We had a  target="_blank" rel="nofollow"&gt;restaurants.com</a&gt; $25 gift certificate, so we visited this restaurant.'

I searched a lot for a solution but could only find one for JavaScript, plus several posts that warn against using regex for parsing HTML, which I guess does not apply to my case as I am processing a string. I guess if I read more about regex I could get this done, but I am looking for a quick solution. I would really appreciate any help.

Does the text really have &gt; instead of >? That's strange... I wonder why that's escaped, but not the <.
– David784, Commented Jan 25, 2022 at 19:19
@David784 Yes, that's right. Someone else scraped the content from a website, so I don't know why those characters are in there.
– Rnovice, Commented Jan 25, 2022 at 19:27

AlecZ · Accepted Answer · 2022-01-25 19:27:26Z

1

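import re
# Returns the first matched link, from '<a href' through '/a&gt;'
''.join(re.findall('(<a href)(.+?)(/a&gt;)', st)[0])

That'll match the link in your example. If you have multiple href links, you could use:

# Returns a list with every matched link
[''.join(entry) for entry in re.findall('(<a href)(.+?)(/a&gt;)', st)]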


1 Comment

Rnovice Over a year ago

Thank you! I am accepting this answer as it helped solve my problem, though I used it slightly differently: I used re.sub(r'(<a href)(.+?)(/a&gt;)','',mytext) to remove the href links. With this code I was able to remove multiple links in a string.
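
For anyone who wants to try the re.sub approach from this comment, here is a minimal sketch. It assumes the closing tag is stored in its escaped form, '</a&gt;', as in the question; the names LINK_PATTERN and remove_links and the shortened sample string are made up for illustration.

```python
import re

# Matches from '<a href' through the escaped closing tag '</a&gt;'.
# The non-greedy '.+?' stops at the first closing tag, so each link
# is matched on its own instead of swallowing the text between links.
LINK_PATTERN = re.compile(r'<a href.+?</a&gt;')

def remove_links(text):
    """Return text with every '<a href ... </a&gt;' span removed."""
    return LINK_PATTERN.sub('', text)

# Shortened, hypothetical sample with two links
sample = ('We had a <a href="/redir?url=x" rel="nofollow"&gt;restaurants.com</a&gt; '
          '$25 gift certificate from <a href="/redir?url=y"&gt;example.com</a&gt; today.')
cleaned = remove_links(sample)
print(cleaned)  # We had a  $25 gift certificate from  today.
```

Note that removing a link leaves the spaces on both sides of it behind, so the result contains double spaces. Also, '.' does not match newlines by default, so a link that spans several lines would need the re.DOTALL flag.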

shivankgtm · Answer · 2022-01-25 19:23:39Z

0

Since you are looking for a quick solution, just go basic and use string manipulation.

input_string = 'We had a <a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow"&gt;restaurants.com</a&gt; $25 gift certificate, so we visited this restaurant.'
# Note: this handles a single link; with several links, the text between them is dropped too
parts = input_string.split('<a href')
# Everything before the link
first_part = parts[0]
# Everything after the escaped closing tag
second_part = parts[-1].split('</a&gt;')[-1]
new_string = first_part + second_part
print(new_string)  # We had a  $25 gift certificate, so we visited this restaurant.
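
If a review contains more than one link, the same string-manipulation idea can be applied in a loop. This is only a sketch under the same assumption as above (links start with '<a href' and end with the escaped '</a&gt;'); the function name strip_links and the sample string are hypothetical.

```python
def strip_links(text, start='<a href', end='</a&gt;'):
    """Remove every start...end span using only str methods."""
    pieces = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no more links
        finish = text.find(end, begin)
        if finish == -1:
            break  # unterminated link: leave the rest untouched
        pieces.append(text[pos:begin])  # keep the text before the link
        pos = finish + len(end)         # continue after the closing tag
    pieces.append(text[pos:])           # keep whatever follows the last link
    return ''.join(pieces)

# Shortened, hypothetical sample with two links
sample = ('We had a <a href="/redir?url=x"&gt;restaurants.com</a&gt; $25 gift '
          'certificate from <a href="/redir?url=y"&gt;example.com</a&gt; today.')
result = strip_links(sample)
print(result)  # We had a  $25 gift certificate from  today.
```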
