I need to remove all the HTML tags from the content of a given web page. I tried this using regular expressions:

import urllib2
import re
page = urllib2.urlopen("http://www.frugalrules.com")
from bs4 import BeautifulSoup, NavigableString, Comment
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print link['href']
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
description_tag = souprss.find_all('description')
content_tag = souprss.find_all('content:encoded')
print re.sub('<[^>]*>', '', content_tag)

But the syntax of the re.sub is:

re.sub(pattern, repl, string, count=0)

So I replaced the print statement above with:

for row in content_tag:
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)

But it gives the following error:

Traceback (most recent call last):
  File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module>
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

What am I doing wrong?

  • Can you find a minimal code example that also fails? For example, remove all non-stdlib dependencies such as bs4 unless they are crucial. If they are, then add a tag for them. This makes the question easier to answer and more useful. Commented Nov 13, 2013 at 15:46
  • Have you seen this answer? Commented Nov 13, 2013 at 15:47
  • I know parsing HTML with regex is a sin, but I really couldn't remove the tags any other way. Could you please suggest a working method instead? :) Commented Nov 13, 2013 at 16:08

1 Answer

Replace the last line of your code with:

print(re.sub('<[^>]*>', '', str(content_tag)))
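
The TypeError appears because re.sub expects a string, while find_all returns a list of Tag objects, so each item has to be converted with str() first. Note also that the fourth positional argument of re.sub is count, so re.UNICODE has to be passed as flags= (or compiled into the pattern). A minimal sketch of both points, using a made-up string in place of the real feed content (the name strip_tags and the sample markup are assumptions for illustration only):

```python
import re

# Hypothetical stand-in for str(row); the real feed content is not shown here.
row = u"<p>Hello, <b>world</b>!</p>"

# Compile the flag into the pattern so it cannot be mistaken for `count`.
TAG_RE = re.compile(r"<[^>]*>", re.UNICODE)

def strip_tags(markup):
    """Remove everything that looks like a tag from a string."""
    return TAG_RE.sub("", markup)

print(strip_tags(row))
```

Inside the loop from the question this would be called as strip_tags(str(row)).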
1 Comment

Sorry, my code is written for Python 3. In Python 2, try: print re.sub('<[^>]*>', '', str(content_tag))
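
If you would rather avoid regex for this, one possible alternative is to let an HTML parser collect the text nodes. Since Beautiful Soup is already in use, calling get_text() on each tag is probably the shortest route. The sketch below does the same with only the standard library; TextExtractor and html_to_text are hypothetical names chosen for this example, and the sample markup is made up.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of tags.
        self.chunks.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

print(html_to_text("<p>Tips for <a href='#'>saving</a> money</p>"))
```

Unlike the regex, a parser is not confused by a ">" character inside an attribute value.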
