Text Extraction from HTML Data

Question

I have a problem extracting information from messy HTML data. Basically, I want to extract only the words that are actually displayed from a given piece of HTML code. Here is an example of the raw HTML data I am given:

<p>I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data</p>

<p>String recepientEmail = "[email protected]"; </p>

<p>// either set to destination email or leave empty</p>

<pre><code>    Intent intent = new Intent(Intent.ACTION_SENDTO);

    intent.setData(Uri.parse("mailto:" + recepientEmail));

    startActivity(intent);
</code></pre>

<p>but on submit it opens gmail or chooser email client view but i dont want to show gmail view</p>

I want to transform it into this:

I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "[email protected]"; // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view

So basically, I just want to retrieve everything within each of the <p> tags and concatenate the results. I am using Python, so I think BeautifulSoup is probably the best way to do this, but I can't figure out how. I also want to repeat this over many such examples (millions, actually); each example should have at least one <p> tag.

Vaibs_Cool · Accepted Answer · 2013-09-16 16:40:35Z

3

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

<span id="midArticle_1"></span><p>Here is the First Paragraph.</p><span id="midArticle_2"></span><p>Here is the second Paragraph.</p><span id="midArticle_3"></span><p>Paragraph Three."</p>

from lxml import html  # assumption: the "html" object is lxml.html; url is the page to parse
print(html.parse(url).xpath('//p/text()'))

OUTPUT

['Here is the First Paragraph.', 'Here is the second Paragraph.',
'Paragraph Three."']

edited Sep 16, 2013 at 16:40

answered Sep 16, 2013 at 16:09

Vaibs_Cool


3 Comments

user1893354 Over a year ago

Cool find! So this removes everything in all tags except <p> tags?

user1893354 Over a year ago

I wanted the output to be just one big string. So I guess I could just join the output that you provided.

user1893354 Over a year ago

Sorry I'm not sure what the "html" object is. Are you using html2text in this example?
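
Regarding the comments above: the "html" object in the answer appears to be lxml.html (an assumption, since the answer does not show its import). A minimal sketch of joining the paragraph texts into one big string, using the sample markup from the answer; the names `snippet`, `root`, and `joined` are made up for illustration:

```python
from lxml import html  # assumption: the answer's "html" is lxml.html

snippet = (
    '<span id="midArticle_1"></span><p>Here is the First Paragraph.</p>'
    '<span id="midArticle_2"></span><p>Here is the second Paragraph.</p>'
    '<span id="midArticle_3"></span><p>Paragraph Three."</p>'
)

# fromstring() parses markup held in a string (html.parse() expects a file or URL)
root = html.fromstring(snippet)

# text_content() also picks up text inside nested tags such as <b> or <a>
joined = ' '.join(p.text_content() for p in root.xpath('//p'))
print(joined)
```

Note that '//p/text()' returns only the text nodes directly under each <p>, so text inside nested tags would be skipped; text_content() is used here to avoid that.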

Birei · Accepted Answer · 2013-09-16 16:35:57Z

2

One way is to use the BeautifulSoup module to extract all the text from the <p> tags.

Content of script.py:

from bs4 import BeautifulSoup
import sys 

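# Parse the input file with the standard-library parser and close the file afterwards
with open(sys.argv[1], 'r') as infile:
    soup = BeautifulSoup(infile, 'html.parser')

# Note: e.string is None for a <p> that contains nested tags, which makes join() fail
print(' '.join(map(lambda e: e.string, soup.find_all('p'))))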

Run it like:

python3 script.py infile

That yields:

I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "[email protected]";  // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view

answered Sep 16, 2013 at 16:35

Birei


2 Comments

user1893354 Over a year ago

Thanks. Can anyone tell me which of these two solutions is the fastest? I have a lot of examples to run through. Thanks!

user1893354 Over a year ago

Sorry. For some examples I am getting the error "sequence item 1: expected string or Unicode, NoneType found" for the join line. Could you tell me how to get around this?
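
The join error in the last comment is consistent with how .string behaves in BeautifulSoup: it is None whenever a <p> contains nested tags (a link, bold text, a line break), and join() rejects None. A hedged sketch of one workaround using get_text(); the helper name `paragraphs_to_text` and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup


def paragraphs_to_text(markup):
    """Join the visible text of every <p> tag into one string."""
    soup = BeautifulSoup(markup, "html.parser")
    # get_text() returns a string even when the tag has nested children;
    # split()/join() collapses runs of whitespace inside each paragraph
    return " ".join(" ".join(p.get_text().split()) for p in soup.find_all("p"))


# Hypothetical sample: the first <p> contains a nested <b> tag
sample = "<p>Send to <b>me</b> now</p><pre><code>x = 1</code></pre><p>Second paragraph</p>"

first_p = BeautifulSoup(sample, "html.parser").find("p")
print(first_p.string)              # None, because the <p> has more than one child
print(paragraphs_to_text(sample))  # text of the <p> tags only
```

The <pre><code> block is left out because only <p> tags are selected.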

Greg · Accepted Answer · 2016-08-07 05:56:51Z

1

I recently started playing around with Beautiful Soup and found a line of code that was extremely helpful. Here is my entire example to show you.

import requests
from bs4 import BeautifulSoup

r = requests.get("your url")

html_text = r.text

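soup = BeautifulSoup(html_text, "html.parser")  # name the parser explicitly

# collect every text node in the document and join them
clean_html = ''.join(soup.find_all(string=True))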

print(clean_html)

Hopefully this works for you and answers your question.
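
One thing to keep in mind with this approach: joining every text node returns all the text in the document, including the contents of <pre><code> blocks, whereas the question asks for the <p> tags only. A small sketch of the difference, using made-up sample markup and variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical sample with a code block between two paragraphs
sample = "<p>First</p><pre><code>x = 1</code></pre><p>Second</p>"
soup = BeautifulSoup(sample, "html.parser")

# Every text node in the document, code block included
all_text = "".join(soup.find_all(string=True))

# Only the text that sits inside <p> tags
p_text = " ".join(p.get_text() for p in soup.find_all("p"))

print(all_text)
print(p_text)
```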

answered Aug 7, 2016 at 5:56

Greg

