2

I don't care which library is used, but I need a way to extract <script> elements from the <body> of a page (held as a string). I then want to insert the extracted <script>s just before </body>.

Ideally, I'd like to sort the extracted <script>s into 2 types:
1) External (those that have the src attribute)
2) Embedded (those with code between <script> and </script>)

So far I've tried with phpDOM, Simple HTML DOM and Ganon.
I've had no luck with any of them (I can find links and remove/print them - but fail with scripts every time!).

Alternative to
https://stackoverflow.com/questions/23414887/php-simple-html-dom-strip-scripts-and-append-to-bottom-of-body
(Sorry to repost, but it's been 24 Hours of trying and failing, using alternative libs, failing more etc.).


Based on the lovely RegEx answer from @alreadycoded.com, I managed to botch together the following;

$output = "<html><head></head><body><!-- Your stuff --></body></html>";
$content = '';
$js = '';

// 1) Grab <body>
preg_match_all('#(<body[^>]*>.*?<\/body>)#ims', $output, $body);
$content = implode('',$body[0]);

// 2) Find <script>s in <body>
preg_match_all('#<script(.*?)<\/script>#is', $content, $matches);
foreach ($matches[0] as $value) {
    $js .= '<!-- Moved from [body] --> '.$value;
}

// 3) Remove <script>s from <body>
$content2 = preg_replace('#<script(.*?)<\/script>#is', '<!-- Moved to [/body] -->', $content); 

// 4) Add <script>s to bottom of <body>
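// (a callback rather than a replacement string, so "$1" or "\1" inside $js is not read as a backreference)
$content2 = preg_replace_callback('#<body(.*?)</body>#is', function ($m) use ($js) {
    return '<body'.$m[1].$js.'</body>';
}, $content2);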

// 5) Replace <body> with new <body>
$output = str_replace($content, $content2, $output);

That does the job, and it isn't slow (a fraction of a second).
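
The question also asks for the scripts to be sorted into external and embedded ones, which the snippet above does not do. A minimal sketch of one way to add that with the same regex approach follows; `splitScripts` is a hypothetical helper name, not something from the answers, and it assumes no attribute value contains a literal `>`.

```php
<?php
// Hypothetical helper (not from the original post): sorts the <script> blocks
// found in a chunk of HTML into "external" (has src) and "embedded" (no src).
function splitScripts(string $bodyHtml): array
{
    $external = [];
    $embedded = [];

    // Group 1 is the attribute text of the opening tag, group 2 the script body.
    preg_match_all(
        '#<script\b([^>]*)>(.*?)</script>#is',
        $bodyHtml,
        $matches,
        PREG_SET_ORDER
    );

    foreach ($matches as $m) {
        // A src attribute marks the script as external; the lookbehind keeps
        // attributes such as data-src from counting as src.
        if (preg_match('#(?<![\w-])src\s*=#i', $m[1])) {
            $external[] = $m[0];
        } else {
            $embedded[] = $m[0];
        }
    }

    return ['external' => $external, 'embedded' => $embedded];
}
```

Calling `splitScripts($content)` after step 1 would give two arrays of full `<script>...</script>` strings that can be concatenated in whichever order is wanted before step 4.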

Shame none of the DOM stuff was working (or I wasn't up to wading through naffed objects and manipulating).

8
  • 1
    "... This question may already have an answer here: ..." NO It doesn't! Thus Why I posted THIS ONE! (Maybe if you focused more on answering than policing, things would be better???) Commented May 2, 2014 at 13:34
  • 1
    If you are going to DownVote, at least have the stones to leave a comment explaining the reason. Commented May 2, 2014 at 18:43
  • Related: meta.stackoverflow.com/a/253857 Commented May 7, 2014 at 18:35
  • This is NOT a duplicate. // This is a post about "any" php library/method, where as the "other" post was about a specific library being used at that time. // Unfortunately, as the title was changed........ :sigh: Commented May 8, 2014 at 14:17
  • Because it had been around a day, in which I'd tried various snippets etc. Then I opted to consider >>different<< libraries. The other post about [Specific], this post about [Any]. // Worse, it got pointed to a topic with No Answers (hardly helpful to anyone). Commented May 8, 2014 at 14:30

3 Answers

8

To select all script nodes with a src attribute:

$xpathWithSrc = '//script[@src]';

To select all script nodes with content:

$xpathWithBody = '//script[string-length(text()) > 1]';

Basic usage (replace the query with your actual XPath query):

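$doc = new DOMDocument();
// Buffer libxml parse warnings so malformed HTML does not flood the output
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);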

foreach($xpath->query('//body//script[string-length(text()) > 1]') as $queryResult) {
    // access the element here. Documentation:
    // http://www.php.net/manual/de/class.domelement.php
}
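
To connect the queries above back to the question, here is a hedged sketch that uses PHP's bundled DOM extension to move every `<script>` inside `<body>` to the end of `<body>` and to record which ones were external or embedded. The function name `moveBodyScriptsToEnd` and the shape of the returned array are assumptions made for illustration, not part of this answer.

```php
<?php
// Hypothetical helper (not from the original answer): moves all <script>
// elements found inside <body> to the end of <body>.
function moveBodyScriptsToEnd(string $html): array
{
    $doc = new DOMDocument();

    // Buffer libxml parse warnings instead of printing them for sloppy HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $external = [];
    $embedded = [];
    $body = $doc->getElementsByTagName('body')->item(0);

    if ($body !== null) {
        $xpath = new DOMXPath($doc);
        // Copy the result into an array before moving nodes around.
        $scripts = iterator_to_array($xpath->query('//body//script'));

        foreach ($scripts as $script) {
            if ($script->hasAttribute('src')) {
                $external[] = $script->getAttribute('src');
            } else {
                $embedded[] = $script->textContent;
            }
            // appendChild() on a node already in the tree moves it.
            $body->appendChild($script);
        }
    }

    return [
        'html'     => $doc->saveHTML(),
        'external' => $external,
        'embedded' => $embedded,
    ];
}
```

Scripts in `<head>` are left alone because the query is scoped to `//body//script`. Note that `loadHTML()` assumes ISO-8859-1 unless the markup declares a charset, and `saveHTML()` serializes the whole document (including an added doctype), so the output will not be byte-identical to the input.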

3 Comments

And the library is? (I'm going to assume "XPath"???). How does it handle possibly malformed HTML? (Thanks for the answer Amal Murali - just a bit peeved with the Overflow-Police and stressed with wasting 24 hours on parsing that doesn't do squat with Script tags.)
It's just PHP's default DOM representation. It should be present in almost any PHP 5 installation (as long as libxml was present in any form at compile time). Handling of malformed HTML is possible, but it depends. If possible you should avoid it, or sanitize your HTML beforehand.
LOL - I cannot even get it to run without throwing errors. Whereas the RegEx (I know, "yuk") above actually Works!
5
$js = "";
$content = file_get_contents("http://website.com");
preg_match_all('#<script(.*?)</script>#is', $content, $matches);
foreach ($matches[0] as $value) {
    $js .= $value;
}
$content = preg_replace('#<script(.*?)</script>#is', '', $content); 
echo $content = preg_replace('#<body(.*?)</body>#is', '<body$1'.$js.'</body>', $content);
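
One thing to watch in the last line above: `preg_replace()` interprets `$1`, `\1` and `${1}` inside the replacement string as backreferences, so collected script text that happens to contain such sequences would be altered when it is re-inserted. A small sketch of a safer append step using `preg_replace_callback()` follows; `appendBeforeBodyClose` is a hypothetical name chosen for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): inserts $js just
// before the first closing </body> tag.
function appendBeforeBodyClose(string $html, string $js): string
{
    // The callback's return value is inserted as-is, so "$1" or "\1" inside
    // $js is not mistaken for a backreference.
    return preg_replace_callback(
        '#</body\s*>#i',
        function (array $m) use ($js): string {
            return $js . $m[0];
        },
        $html,
        1 // only the first closing tag is touched
    );
}
```

With this, the final step could become `echo appendBeforeBodyClose($content, $js);`. If the input has no `</body>` tag, the string is returned unchanged.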

4 Comments

That looks like it will grab the JS from the entire document, rather than only the scripts contained in the <body>?
I've accepted this as the answer as it was the only "complete" provision, and the only thing I managed to get working. I've appended the "working" version (limited to the <body> only) to the bottom of my question. // Thank You!
It's messy, and it parses HTML with regex (which we all know is a no-no).
@pguardiario - Yes, it's messy ... but It Works!!! That's more than I can say for my attempts with the DOM Libraries, not to mention it doesn't involve includes and additional code etc. You don't like it? Then SHOW ME a library being included and doing the same job as that code does!
1

If you're really looking for an easy lib for this, I can recommend this one:

$dom = str_get_html($html);
$scripts = $dom->find('script')->remove;
$dom->find('body', 0)->after($scripts);
echo $dom;

There's really no easier way to do things like this in PHP.
