I have a problem extracting information from messy HTML data. Basically, I want to extract only the actual displayed words from a given piece of HTML code. Here is an example of the raw HTML data I am given:

<p>I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data</p>

<p>String recepientEmail = "[email protected]"; </p>

<p>// either set to destination email or leave empty</p>

<pre><code>    Intent intent = new Intent(Intent.ACTION_SENDTO);

    intent.setData(Uri.parse("mailto:" + recepientEmail));

    startActivity(intent);
</code></pre>

<p>but on submit it opens gmail or chooser email client view but i dont want to show gmail view</p>

and I want to transform it into this

I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "[email protected]"; // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view 

So basically, I just want to retrieve everything within each of the <p> tags and concatenate the results. I am using Python, so I think BeautifulSoup is probably the best way to do this, but I can't figure out how. I also want to repeat this over many such examples (actually millions); each example should have at least one <p> tag.

3 Answers

3

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

<span id="midArticle_1"></span><p>Here is the First Paragraph.</p><span id="midArticle_2"></span><p>Here is the second Paragraph.</p><span id="midArticle_3"></span><p>Paragraph Three."</p>

from lxml import html  # assumption: "html" here is the lxml.html module
print(html.parse(url).xpath('//p/text()'))

OUTPUT

['Here is the First Paragraph.', 'Here is the second Paragraph.',
'Paragraph Three."']

3 Comments

Cool find! So this removes everything in all tags except <p> tags?
I wanted the output to be just one big string. So I guess I could just join the output that you provided.
Sorry I'm not sure what the "html" object is. Are you using html2text in this example?
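
A minimal sketch of the join mentioned in the comments above, assuming the `html` object in the answer is `lxml.html` (the answer does not say so). The `raw` string and the variable names are illustrative only. It uses `html.fromstring()` because the markup is already in a string, and `text_content()` so that text inside tags nested in a <p> is not dropped.

```python
from lxml import html  # third-party: pip install lxml

raw = (
    '<span id="midArticle_1"></span><p>Here is the First Paragraph.</p>'
    '<span id="midArticle_2"></span><p>Here is the second Paragraph.</p>'
)

# fromstring() parses markup held in a string; html.parse() expects a file or URL.
tree = html.fromstring(raw)

# One string per <p>, including any text inside nested tags.
paragraphs = [p.text_content().strip() for p in tree.xpath('//p')]

# Concatenate the paragraphs into one big string.
joined = ' '.join(paragraphs)
print(joined)
```

If only the direct text of each <p> is wanted, `' '.join(tree.xpath('//p/text()'))` would be the closer equivalent of the answer's expression.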
2

One way is to use the BeautifulSoup module to extract all the text from <p> tags.

Content of script.py:

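from bs4 import BeautifulSoup
import sys

# Parse the file named on the command line with the built-in parser.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Note: e.string is None when a <p> is empty or has more than one child node.
print(' '.join(map(lambda e: e.string, soup.find_all('p'))))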

Run it like:

python3 script.py infile

That yields:

I have an app which send mail to my defined mail address "[email protected]". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "[email protected]";  // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view

2 Comments

Thanks. Can anyone tell me which of these two solutions is the fastest? I have a lot of examples to run through. Thanks!
Sorry. For some examples I am getting the error "sequence item 1: expected string or Unicode, NoneType found" for the join line. Could you tell me how to get around this?
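
The error in the last comment most likely comes from `.string`, which BeautifulSoup returns as None when a tag is empty or has more than one child node (for example, text mixed with a nested tag), so `join` receives a None. A minimal sketch of one way around it, using `get_text()` instead; `paragraphs_to_text` and `sample` are hypothetical names, not part of the answer.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def paragraphs_to_text(raw_html):
    """Join the visible text of every <p> tag into one string."""
    soup = BeautifulSoup(raw_html, 'html.parser')
    parts = []
    for p in soup.find_all('p'):
        # get_text() always returns a str, even for nested or empty tags;
        # split()/join() collapses runs of whitespace into single spaces.
        text = ' '.join(p.get_text().split())
        if text:
            parts.append(text)
    return ' '.join(parts)


sample = (
    '<p>Plain text.</p>'
    '<p>Text with <b>bold</b> inside.</p>'
    '<pre><code>x = 1</code></pre>'
)
print(paragraphs_to_text(sample))
```

For many inputs, the same function can be called once per example, e.g. `[paragraphs_to_text(h) for h in examples]`. Note that the whitespace normalization also collapses double spaces such as the one before "//" in the output above.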
1

I recently started playing around with Beautiful Soup and found a line of code that was extremely helpful. Here is my entire example:

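import requests
from bs4 import BeautifulSoup

r = requests.get("your url")

html_text = r.text

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html_text, 'html.parser')

# Collect every text node in the document, not only those inside <p> tags.
clean_html = ''.join(soup.find_all(string=True))

print(clean_html)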

Hopefully this works for you and answers your question.
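
If installing third-party packages is not an option, the <p>-only extraction from the question can also be sketched with the standard library's `html.parser`. This is an alternative to the answers in this thread, not something proposed in it, and it has not been benchmarked against them. `ParagraphText` and `extract` are hypothetical names, and the sketch assumes every <p> is properly closed.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect only the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_p = False   # True while the parser is inside a <p>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False
            self.chunks.append(' ')  # keep consecutive paragraphs apart

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def extract(raw_html):
    """Return the text of all <p> tags as one whitespace-normalized string."""
    parser = ParagraphText()
    parser.feed(raw_html)
    parser.close()
    return ' '.join(''.join(parser.chunks).split())


print(extract('<p>First.</p><pre><code>x = 1</code></pre><p>Second &amp; last.</p>'))
```

Text inside tags nested in a <p> (such as <b>) is kept, while text outside any <p>, like the <pre><code> block, is skipped.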
