Get the DOM from an HTML string without executing it

Question

I need to parse an HTML string to get one specific node and discard the others, such as script tags.

For example, I use this code:

// I get the htmlCode from a textarea
var htmlCode = '<video>'+
               '<source src="/media/video.oga">'+
               '<source src="/media/video.m4v">'+
               '<script src="evilscript.js"></script>'+
           '</video>';
var div = document.createElement('div');
div.innerHTML = htmlCode;

From there I can access the nodes of the div and discard the unnecessary ones, but I noticed in the Network tab that the assignment triggers requests for the video sources. I don't want any request to be made, because the htmlCode could contain a malicious script. How can I work with the htmlCode without triggering HTTP requests?

It's for use in a Chrome extension, to hide the code inside a GIF and be able to show it.
– jscripter, Commented Mar 24, 2014 at 2:29
"because any malicious script can be in the htmlCode". Note that the HTML DOM specification explicitly says that <script> elements are not evaluated when set via innerHTML. w3.org/TR/2008/WD-html5-20080610/dom.html#innerhtml0 — Felix Kling
– Felix Kling, Commented Mar 24, 2014 at 3:24
@FelixKling: I think that might be because HTML5 just documents how innerHTML was implemented, first by Microsoft and then by others. I'm not sure it was specifically because of security implications (though it might well have been; it came from a time when MS wasn't particularly concerned with things like that). It might just be for convenience, to store and re-use HTML fragments without worrying about stripping out script elements.
– RobG, Commented Mar 24, 2014 at 6:22
By the way, I found another answer which answers my question: http://stackoverflow.com/a/11530238/2359536
– jscripter, Commented Mar 25, 2014 at 9:54

jscripter · Accepted Answer · 2014-03-24 02:39:20Z

2

I thought of using a DocumentFragment, but it has no innerHTML property; nodes are added to it with appendChild.

Then document.implementation.createHTMLDocument() came to mind.

I tested it and it works: it doesn't make any HTTP requests for the sources.

This is my code now:

// The empty title argument is optional in current browsers; some older ones required it
var dom = document.implementation.createHTMLDocument('');
dom.body.innerHTML = '<video>'+
           '<source src="/media/video.oga">'+
           '<source src="/media/video.m4v">'+
           '<script src="evilscript.js"></script>'+
       '</video>';

From here I can access the DOM.
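To illustrate the "discard the unnecessary nodes" step, here is a minimal sketch built on the same createHTMLDocument approach. The function names (parseInert, removeAll, extractVideo) and the optional liveDocument parameter are hypothetical helpers introduced for this example, not part of the original code.

```javascript
// Sketch only: these helper names are hypothetical, not from the answer above.

// Parse an HTML string into a separate document created with createHTMLDocument.
// `liveDocument` is optional and falls back to the page's own `document`.
function parseInert(html, liveDocument) {
  var doc = (liveDocument || document).implementation.createHTMLDocument('');
  doc.body.innerHTML = html;
  return doc;
}

// Remove every element under `root` that matches `selector`.
// Returns the number of matched elements.
function removeAll(root, selector) {
  var matches = root.querySelectorAll(selector);
  for (var i = 0; i < matches.length; i++) {
    var node = matches[i];
    if (node.parentNode) {
      node.parentNode.removeChild(node);
    }
  }
  return matches.length;
}

// Parse the string, drop all <script> elements, and return the first <video>
// element (or null if there is none).
function extractVideo(html, liveDocument) {
  var doc = parseInert(html, liveDocument);
  removeAll(doc.body, 'script');
  return doc.querySelector('video');
}
```

In a browser this could be called as var video = extractVideo(htmlCode); and the sources can then be inspected with video.querySelectorAll('source'). Two caveats: this sketch only removes script elements, so inline handler attributes such as onerror are left in place, and appending the returned node to the live page is the point at which the browser would typically start loading the sources.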


3 Comments

RobG Over a year ago

Do you have a specification reference for why innerHTML works like this on a document created by document.implementation.createHTMLDocument, but not in any other case? I have tracked down various references, but none mention the different behaviour of innerHTML.

jscripter Over a year ago

Well, here it says that createHTMLDocument returns an HTMLDocument object. And here it says that applying innerHTML to an element sets the new children's ownerDocument to the current document. So I think that is the difference between the two behaviours: when the new nodes get the same ownerDocument as the page, HTTP requests are made.

RobG Over a year ago

Perhaps it's because the domain is null (empty string really), and therefore resources aren't retrieved to prevent cross site issues?
