Consider the following HTML code:

<strong>title</strong>
Hello World
<strong>Sub-Title</strong>
<div>This is just stuff</div>

How can I clean up the string so that it returns only the text that is not inside any tag, i.e. 'Hello World'? I presume this can be done with the DOM. I would prefer a non-regex answer if anyone has one, and one that does not use JavaScript or jQuery.

[EDIT] The code it fails on:

<span style="color: #677b8d"><strong>Short Description</strong><br/>Microsoft Office Home and Business 2013-Word, Excel, PowerPoint, OneNote and Outlook(Does not include Publisher or Access), DSP , No Warranty on Software <br/><br/><strong>Long description<br/></strong><div>Microsoft Office Home and Business 2013 32-bit/x64 DSP No Warranty on Software </div> <font face="Arial"> <div><br/><strong>Product Overview </strong> <div><font face="Arial">The New Microsoft Office Home &amp; Business 2013 is designed to help you create and communicate faster with new, time-saving features and a clean, modern look. Plus, you can save your documents in the cloud on SkyDrive and access them virtually anywhere. The latest versions of Word, Excel, PowerPoint, OneNote plus Outlook on 1 PC.</font></div> </div> <div><strong><br/> Features<br/></strong><font face="Arial">•One time purchase for the life of your PC; non-transferrable.<br/> •Office on one PC for business and household use.<br/> •The latest versions of Word, Excel, PowerPoint, OneNote, and Outlook.<br/> •7 GB of online storage in SkyDrive.<br/> •Free Office Web Apps* for accessing, editing, and sharing documents.<br/> •An improved user interface optimized for touch, pen, and keyboard.</font> <div> </div> <div><font face="Arial"><strong>Specifications<br/></strong>Operating System Windows <br/> Office/Productivity Software Office Suites &amp; Tools <br/> Purchase Method Boxed <br/> Users/Devices per License 1-User <br/></font></div> </div> <div><font face="Arial"><strong>System Requirements:<br/></strong>Computer and Processor 1 GHz or faster x86 or 64-bit processor with SSE2 instruction set</font></div> <div> <p><font face="Arial"><strong>Memory<br/></strong>1 GB RAM (32-bit); 2 GB RAM (64-bit) recommended for graphics features, Outlook Instant Search, and certain advanced functionality**</font></p> <p><font face="Arial"><strong>Hard Disk<br/></strong>3.0 GB available disk space</font></p> <p><font 
face="Arial"><strong>Display<br/></strong>1366 x 768 resolution</font></p> <p><font face="Arial"><strong>Operating System<br/></strong>Windows® 7, Windows 8, Windows Server 2008 R2 with .NET 3.5 or later</font></p> <p><font face="Arial"><strong>Graphics<br/></strong>Graphics hardware acceleration requires a DirectX10 graphics card</font></p> <p><font face="Arial"><strong>Additional Requirements<br/></strong>Internet connection. Fees may apply.</font></p> <p><font face="Arial">Microsoft Internet Explorer 8, 9, or 10; Mozilla Firefox 10.x or a later version; Apple Safari 5; or Google Chrome 17.x.</font></p> <p><font face="Arial">A touch-enabled device is required to use any multi-touch functionality. However, all features and functionality are always available by using a keyboard, mouse, or other standard or accessible input device. New touch features are optimized for use with Windows 8.</font></p> <p><font face="Arial">Information Rights Management features require access to a Windows 2003 Server with SP1 or later running Windows Rights Management Services.</font></p> <p><font face="Arial">Microsoft and Skype accounts.</font></p> <p><font face="Arial"><strong>Other<br/></strong>Product functionality and graphics may vary based on your system configuration. Some features may require additional or advanced hardware or server connectivity.</font></p> <p><font face="Arial">*An appropriate device, Internet connection and Internet Explorer, Firefox or Safari browser are required.<br/> **512 MB RAM recommended for accessing Outlook data files larger than 1GB<br/></font></p> </div> </font></span>
  • php.net/manual/en/function.strip-tags.php Commented Nov 24, 2014 at 12:41
  • @mudasobwa, nope, because strip_tags will leave 'title', 'Sub-Title', etc. Commented Nov 24, 2014 at 12:43
  • Ah, I see, you need only the standalone text nodes, right? Commented Nov 24, 2014 at 12:44
  • Yes, that's it. The bottom line is if I can remove all tags plus their nodeValue I should be left with a result. Commented Nov 24, 2014 at 12:46

1 Answer

I would suggest wrapping the code in a tag that definitely does not occur in the code itself, for example:

 $a="<body><strong>title</strong>\nHello World\n<strong>Sub-Title</strong>\n<div>This is just stuff</div></body>";

Then use DOM:

$doc = new DOMDocument();
$doc->loadHTML($a);
$xpath = new DOMXPath($doc);
// Double quotes outside, so the empty string '' inside the XPath does not end the PHP string.
$textnodes = $xpath->evaluate("//body/text()[not(normalize-space() = '')]");

Now you can read whatever you need from each text node:

foreach( $textnodes as $el ) {
  print_r($el);
}

/*
DOMText Object
(
    [wholeText] => 
Hello World

    [data] => 
Hello World

    [length] => 13
    [nodeName] => #text
    [nodeValue] => 
Hello World

    [nodeType] => 3
    [parentNode] => (object value omitted)
    [childNodes] => 
    [firstChild] => 
    [lastChild] => 
    [previousSibling] => (object value omitted)
    [nextSibling] => (object value omitted)
    [attributes] => 
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => 
    [baseURI] => 
    [textContent] => 
Hello World
*/
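
If only the text itself is needed rather than the full DOMText dump, the steps above can be collected into a small function. This is just a sketch: topLevelText is a made-up name, and it assumes the input fragment has no <body> tag of its own, so the function adds the wrapper itself.

```php
<?php
// Hypothetical helper (not part of any library): returns the trimmed text of
// every non-blank text node that is a direct child of <body>.
// Assumption: $html is a fragment without its own <body> element.
function topLevelText(string $html): array
{
    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is often sloppy.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Same query as above: direct text children of body, skipping blank ones.
    $nodes = $xpath->query("//body/text()[not(normalize-space() = '')]");

    $result = [];
    foreach ($nodes as $node) {
        // nodeValue still carries the surrounding newlines, so trim it.
        $result[] = trim($node->nodeValue);
    }
    return $result;
}

$a = "<strong>title</strong>\nHello World\n<strong>Sub-Title</strong>\n<div>This is just stuff</div>";
print_r(topLevelText($a));
```

With the sample string from the question, the only entry returned should be 'Hello World', since every other piece of text sits inside a tag.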

8 Comments

Odd one - when using your sample data it works fine, but when I use my html loaded via file_get_contents it returns a blank. Same with Ghost's answer below. I have added the <body> tags. I am still looking into it but just wanted to update and thank you both for your help.
Close, but no (I think). I think you both made a typo, as I do get all the text from all tags if the query contains two slashes, i.e. $textnodes = $xpath->evaluate("//body//text()[not(normalize-space() = '')]");, but I want the text that is inside tags removed as well.
I can't speak for the other answer's author, but I definitely did not make a typo. //body/text() selects the text nodes that are direct children of any body element. Would you mind printing out the data from file_get_contents?
Ok, thanks. The other guy deleted his answer, for whatever reason. I cannot get it to return any data with one slash, but if I add two I get the same results as your example. It still does not help me remove the text that is inside tags. It would help if I could be sure that all the HTML input had the same format, but I cannot, so I cannot just use array indices to decide what to use. I must remove the text that is inside tags.
Of course it fails on this code. This code has no top-level text nodes: the whole thing is wrapped in a span. So, if you want to find text nodes in this portion of code, use "//span/text()[not(normalize-space() = '')]". It will return each text node nested directly inside the top-level span.
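
To illustrate the last comment, here is a rough sketch using a shortened, made-up fragment shaped like the HTML in the question's edit (one wrapping span, with the tagged headings and a div inside it). The fragment and the variable names are assumptions for the example only. The query is narrowed to //body/span so that only spans sitting directly under body are considered, which assumes the wrapper is always a single top-level span.

```php
<?php
// Shortened, hypothetical fragment modelled on the question's edit:
// everything is wrapped in a single top-level <span>.
$html = '<span style="color: #677b8d"><strong>Short Description</strong><br/>'
      . 'Microsoft Office Home and Business 2013<br/><br/>'
      . '<strong>Long description<br/></strong>'
      . '<div>DSP No Warranty on Software</div></span>';

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// loadHTML adds the missing html/body wrappers around the fragment.
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Direct text children of the top-level span, skipping blank nodes.
$texts = [];
foreach ($xpath->query("//body/span/text()[not(normalize-space() = '')]") as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

Here the text inside strong and div is skipped, and only the loose text directly inside the span is kept. If the wrapper element is not known in advance, the element name in the query would have to change accordingly.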