Converting list into string while keeping it word by word

Question

Frustrated to say I'm stumped on this one. I'm extracting text from a paragraph:

    paragraphs = re.findall(r'(<p(.*?)</p>)', html)

Then I want to strip the tags and just keep the paragraph text, word by word:

    paragraphs = re.sub(r'\<.*?\>', '', paragraphs)

Problem is that Python expects a string. If I understand it right I have to turn "paragraphs" into a string first. But, when I do:

    paragraphs = str(paragraphs)

…I get the text letter by letter, with the words broken apart. I'm new to Python and this confuses me.

1st question: Why isn't "paragraphs" a string to begin with?

2nd question: How do I convert "paragraphs" into a string, keeping it word by word, such as:

    paragraph = ['Two', 'words']

alecxe · Accepted Answer · 2016-04-18 17:29:52Z

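re.findall() returns a list of matches, not a string, which is why re.sub() rejects it; because your pattern has two capture groups, each item in that list is itself a tuple. If you only want the first paragraph, use re.search() instead and read the text from the match object it returns.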
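To make the difference concrete, here is a minimal sketch. The sample `html` string and the simplified pattern with a single capture group are assumptions for illustration, not the asker's actual data. It shows that re.findall() gives back a list, so re.sub() has to be applied to each item, and that re.search() gives a single match object instead.

```python
import re

# Hypothetical sample input; the asker's real HTML is not shown.
html = '<p class="intro">Two <b>words</b></p><p>some text here</p>'

# With one capture group, findall() returns a list of strings
# (with two groups, as in the question, it returns a list of tuples).
pattern = r'<p\b[^>]*>(.*?)</p>'
paragraphs = re.findall(pattern, html, flags=re.DOTALL)

# re.sub() needs a string, so strip the tags from each match in turn,
# then split on whitespace to get the words.
words_per_paragraph = [re.sub(r'<.*?>', '', p).split() for p in paragraphs]

# re.search() stops at the first match and returns a match object.
first = re.search(pattern, html, flags=re.DOTALL)
first_words = re.sub(r'<.*?>', '', first.group(1)).split()

print(words_per_paragraph)
print(first_words)
```

Calling str() on the list only produces its printed representation as one string, and iterating over any string yields single characters, which is where the letter-by-letter output comes from.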

A better option though would be to use an HTML Parser, like BeautifulSoup:

    >>> from bs4 import BeautifulSoup
    >>>
    >>> data = '<p>some text here</p>'
    >>> soup = BeautifulSoup(data, "html.parser")
    >>> soup.p.get_text().split()
    [u'some', u'text', u'here']
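If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This swaps in a different parser from the one recommended above, and the `ParagraphText` class and the sample markup are hypothetical names made up for illustration. It collects the text of every `<p>` element and splits each one into words.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Hypothetical helper: collects the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # one string per <p> element
        self._in_p = False     # True while the parser is inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here,
        # so it is appended to the current paragraph.
        if self._in_p:
            self.paragraphs[-1] += data


# Assumed sample markup, not taken from the question.
parser = ParagraphText()
parser.feed('<p>Two <b>words</b></p><div>skip me</div><p>some text here</p>')
parser.close()

words = [p.split() for p in parser.paragraphs]
print(words)
```

This is more code than the BeautifulSoup version, so it is mainly worth considering when a third-party dependency has to be avoided.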
