Frustrated to say I'm stumped on this one. I'm extracting text from a paragraph:

    paragraphs = re.findall(r'(<p(.*?)</p>)', html)

Then I want to strip the tags and just keep the paragraph text, word by word:

    paragraphs = re.sub(r'\<.*?\>', '', paragraphs)

The problem is that re.sub() expects a string. If I understand it right, I have to turn "paragraphs" into a string first. But when I do:

    paragraphs = str(paragraphs)

…I get the text letter by letter, with the words broken apart. I'm new to Python and this confuses me.

1st question: Why isn't "paragraphs" a string to begin with?

2nd question: How do I convert "paragraphs" into a string, keeping it word by word, such as:

    paragraphs = ['Two', 'words']

1 Answer

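re.findall() returns a list of matches, not a string, and because your pattern has two capture groups, each match is a tuple. That is why re.sub() rejects it. If you only need the first paragraph, use re.search() instead and take the matched text from the match object it returns.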
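If you do want to stay with regular expressions for every paragraph, here is a minimal sketch of one way to do it. The sample `html` string and the names `bodies` and `words_per_paragraph` are made up for illustration, and the pattern `<p[^>]*>` is a simplification that would also match tags such as `<pre>`. Calling str() on the list only gives you its printed representation as one long string, so looping over that yields single characters; applying re.sub() to each match separately avoids this.

```python
import re

# Hypothetical sample input, not taken from the question.
html = '<p>Two words</p><p class="x">More <b>text</b> here</p>'

# The original pattern has two capture groups, so each match is a tuple.
matches = re.findall(r'(<p(.*?)</p>)', html)

# A single group around the paragraph body gives a list of plain strings.
bodies = re.findall(r'<p[^>]*>(.*?)</p>', html, flags=re.DOTALL)

# Strip nested tags from each body, then split each one into words.
words_per_paragraph = [re.sub(r'<.*?>', '', body).split() for body in bodies]

print(words_per_paragraph)
```

Each entry in `words_per_paragraph` is then a list of words for one paragraph.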

A better option though would be to use an HTML Parser, like BeautifulSoup:

    >>> from bs4 import BeautifulSoup
    >>>
    >>> data = '<p>some text here</p>'
    >>> soup = BeautifulSoup(data, "html.parser")
    >>> soup.p.get_text().split()
    [u'some', u'text', u'here']
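
If installing BeautifulSoup is not an option, roughly the same result can be had from the standard library's html.parser module. This is only a sketch of an alternative, not what BeautifulSoup does internally; the class name `ParagraphText` and the sample input are invented for the example, and it assumes `<p>` elements are not nested.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside each <p> element, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # one string per <p> element
        self._in_p = False     # True while we are inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._in_p = False

    def handle_data(self, data):
        # Text between tags; keep it only when inside a paragraph.
        if self._in_p:
            self.paragraphs[-1] += data


parser = ParagraphText()
parser.feed('<p>some <b>text</b> here</p><p>Two words</p>')
parser.close()

# Split each paragraph's text into words.
word_lists = [p.split() for p in parser.paragraphs]
print(word_lists)
```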