I'm an amateur programmer and new to this site. I searched for this question but could not find it anywhere else on this site or elsewhere online.

I'm trying to grab all of the words between the opening and closing HTML paragraph tags (<p> and </p>). My findall statement works for the paragraphs in the online articles I've tried, except where a paragraph contains a single or double quotation mark. There may well be a much better way to do this, or the statement may be easy to tweak so that it includes paragraphs with quotes. Any advice is greatly appreciated!

findall statement:

aText = findall("<p>[A-Za-z0-9<>=\"\:/\.\-,\+\?#@'<>;%&\$\*\^\(\)\[\]\{\}\|\\!_`~ ]+</p>",text) 
Step 1) Search for "Beautiful Soup" in your favorite search engine. Step 2) Follow one of its clear examples for extracting text from HTML elements. There is no step 3; it's actually a rather elegant library for just this purpose :) Commented Jul 7, 2013 at 3:21

2 Answers

>>> import re
>>> t = "<p>there isn't much here</p>"
>>> re.findall(r'<p>(.+?)</p>',t)
["there isn't much here"]

Example with double quotes embedded:

>>> t = r"<p>there isn't much \"to go by\" here</p>"
>>> re.findall(r'<p>(.+?)</p>',t)
['there isn\'t much \\"to go by\\" here']

Normally + is a greedy quantifier. Adding ? after it makes it non-greedy, so it tries for the shortest possible match: it consumes only as much of the string as it needs before </p> can be matched.
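
To make the greedy versus non-greedy difference concrete, here is a small sketch. The sample string is made up for illustration; it puts two paragraphs on one line so the two patterns give visibly different results.

```python
import re

# Hypothetical sample: two paragraphs on one line.
html = "<p>first</p><p>second</p>"

# Greedy: .+ runs to the LAST </p>, swallowing the tags in between.
greedy = re.findall(r"<p>(.+)</p>", html)

# Non-greedy: .+? stops at the first </p> that lets the pattern match.
lazy = re.findall(r"<p>(.+?)</p>", html)

print(greedy)  # ['first</p><p>second']
print(lazy)    # ['first', 'second']
```

Note that quotation marks play no special role here: . matches any character except a newline, so quotes inside a paragraph are captured like any other text.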

1 Comment

This will work (and I voted it up for correctness), but be cautious of its limitations. Closing </p> tags are optional in HTML (though required in XHTML), and <p> elements can have attributes such as id and class, which will break this regex.
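
One possible way to soften two of those limitations is sketched below. The sample string and the name P_TAG are made up for illustration. The pattern tolerates attributes on the opening tag and paragraphs that span several lines, but it still cannot cope with omitted closing tags, which is where a real HTML parser becomes the safer choice.

```python
import re

# Hypothetical sample: one <p> has an attribute, the other spans several lines.
sample = '<p class="lead">He said "hi"</p>\n<p>\nit\'s fine\n</p>'

# <p\b[^>]*> allows optional attributes (and avoids matching <pre>);
# re.DOTALL lets . cross newlines; re.IGNORECASE also accepts <P>.
P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL | re.IGNORECASE)

# Strip the surrounding whitespace left over from multi-line paragraphs.
paragraphs = [m.strip() for m in P_TAG.findall(sample)]
print(paragraphs)  # ['He said "hi"', "it's fine"]
```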

To do this with an HTML parsing library like Beautiful Soup:

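# Beautiful Soup 4 (package "bs4"); the 3.x import was
# "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup

html_doc = """
<p>
paragraph 1
</p>

<p>
paragraph 2
</p>

<p>
paragraph 3
</p>
"""

# Parse using the standard library's parser as the backend
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every <p> element; get_text() gives the text inside it
for p in soup.find_all('p'):
    print(p.get_text().strip())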
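
If installing a third-party library is not an option, a similar result can be sketched with the standard library's html.parser module. This is not the approach from the answer above, only a rough stand-in. The class name ParagraphCollector and the sample input are made up for illustration, and the sketch assumes paragraphs are not nested.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an open one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Text inside nested tags such as <b> is kept; the tags are dropped.
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def close(self):
        super().close()
        self._flush()  # handle a final <p> with no closing tag


collector = ParagraphCollector()
# The second paragraph has no closing tag, which is legal HTML.
collector.feed('<p class="x">He said "hi"</p><p>it\'s <b>fine</b>')
collector.close()
print(collector.paragraphs)  # ['He said "hi"', "it's fine"]
```

Unlike the regex, this handles attributes, quotes, and a missing closing tag without any special cases.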
