Extracting text from HTML?

Question

I have a string like the one below:

<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>

I want to extract the text from the HTML above as: Hello World, this is StackOverflow's question details page. Notice that I want to remove the &nbsp; as well.

How can we achieve this in PHP? I tried a few functions (strip_tags, html_entity_decode, etc.), but they all fail under some conditions.

Please help, Thanks!

Edit: the code I am trying is below, but it is not working :( It leaves the &nbsp; and &#39; characters in place.

$TMP_DESCR = trim(strip_tags($rs['description']));

As @jakenoble says, it would help if you posted your sample code, output and errors.
– diagonalbatman, Commented Feb 2, 2011 at 11:46
If the shown string is part of a full HTML page or a larger snippet containing additional markup, please see Best Methods to parse HTML
– Gordon, Commented Feb 2, 2011 at 11:47
@Gordon it's not a big HTML document, I just want to do it with simple methods :(
– djmzfKnm, Commented Feb 2, 2011 at 11:54

Aaron W. · Accepted Answer · 2011-02-02 12:06:41Z

1

The code below worked for me, although I had to do a str_replace on the non-breaking space.

$string = "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>";
echo htmlspecialchars_decode(trim(strip_tags(str_replace('&nbsp;', '', $string))), ENT_QUOTES);
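A slightly more general variant of the same idea, as a rough sketch rather than a definitive fix: decode all entities with html_entity_decode and then turn the resulting non-breaking space character (U+00A0) into an ordinary space, instead of removing the literal '&nbsp;' text up front. The function name html_to_text is made up for illustration, and the input is assumed to be a small UTF-8 fragment (PHP 7.0+ for the "\u{...}" escape).

```php
<?php
// Hypothetical helper, not part of PHP; assumes $html is a small UTF-8 fragment.
function html_to_text(string $html): string
{
    // Remove the tags first, so a decoded "<" or ">" is never mistaken for markup.
    $text = strip_tags($html);
    // Decode named and numeric entities (&nbsp;, &#39;, &amp;, ...).
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // &nbsp; decodes to U+00A0, which trim() does not strip by default.
    $text = str_replace("\u{00A0}", ' ', $text);
    return trim($text);
}

echo html_to_text("<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>"), PHP_EOL;
// Hello World, this is StackOverflow's question details page
```

Stripping tags before decoding is a deliberate choice here: text such as &lt;b&gt; in the source stays as literal characters in the output instead of being removed as a tag.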


1 Comment

djmzfKnm Over a year ago

Yes, that's working for me as well. If there is no built-in solution for &nbsp; then that's fine, we can go with the replace. Thanks for the help!

sevenseacat · Answer · 2011-02-02 11:45:54Z

0

strip_tags() will get rid of the tags, and trim() should get rid of the whitespace. I'm not sure if it will work with non-breaking spaces though.
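On the non-breaking space doubt, a quick check (a sketch, assuming UTF-8 text): once the entity has been decoded, &nbsp; becomes the character U+00A0, which is not in trim()'s default character list, so it survives. One way to trim it only at the ends of the string is a UTF-8 aware regular expression. The helper name trim_nbsp is hypothetical.

```php
<?php
$decoded = html_entity_decode('&nbsp;Hello&nbsp;', ENT_QUOTES, 'UTF-8');

// trim() strips only ASCII whitespace and NUL by default, so U+00A0 is left alone.
var_dump(trim($decoded) === $decoded); // bool(true)

// Hypothetical helper: strip ordinary and non-breaking whitespace from both ends.
function trim_nbsp(string $text): string
{
    // The /u modifier makes the pattern UTF-8 aware; preg_replace() returns
    // null on invalid UTF-8, in which case the input is returned unchanged.
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text) ?? $text;
}

var_dump(trim_nbsp($decoded)); // string(5) "Hello"
```

Passing "\xC2\xA0" as trim()'s second argument is tempting, but trim() works byte by byte, so it can also cut a trailing 0xA0 byte out of another multi-byte character such as "à".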


Rui Jiang · Answer · 2011-02-02 11:46 (edited by djmzfKnm, 2011-02-02 11:57:45Z)

0

First, call trim() on the HTML to remove the surrounding white space: http://php.net/manual/en/function.trim.php

Then call strip_tags, then html_entity_decode.

So: html_entity_decode(strip_tags(trim($html)));

Comments

lonesomeday · Answer · 2011-02-02 12:01:46Z

0

Probably the nicest and most reliable way to do this is with genuine (X|HT)ML parsing functions like the DOMDocument class:

<?php

$str = "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>";

$dom = new DOMDocument;
$dom->loadXML(str_replace('&nbsp;', ' ', $str));

echo trim($dom->firstChild->nodeValue);
// "Hello World, this is StackOverflow's question details page"

This is probably slight overkill for this problem, but using the proper parsing functionality is a good habit to get into.

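Edit: You can reuse the DOMDocument object, so you only need two lines within the loop:

$dom = new DOMDocument;
while ($rs = mysql_fetch_assoc($result)) { // or whatever
    $dom->loadHTML(str_replace('&nbsp;', ' ', $rs['description']));
    // loadHTML() wraps the fragment in <html><body> and may add a doctype,
    // so read the text from the root element rather than from firstChild
    $TMP_DESCR = trim($dom->documentElement->textContent);

    // do something with $TMP_DESCR
}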


1 Comment

djmzfKnm Over a year ago

This seems like a long method, and since I am running it in a loop, I think it will be expensive.
