I need to parse an HTML string to extract one particular node and discard the others, such as script tags.

For example, I use this code:

// htmlCode comes from a textarea
var htmlCode = '<video>' +
                   '<source src="/media/video.oga">' +
                   '<source src="/media/video.m4v">' +
                   '<script src="evilscript.js"></script>' +
               '</video>';
// Parse the string by assigning it to a detached element
var div = document.createElement('div');
div.innerHTML = htmlCode;

From there I can access the div's nodes and discard the unnecessary ones. However, the network tab shows that the innerHTML assignment triggers requests for the video sources. I don't want any requests to be made, because htmlCode could contain something malicious. How can I modify the htmlCode without triggering HTTP requests?

  • What is the original task? Commented Mar 24, 2014 at 2:26
  • It's for use in a Chrome extension, to hide the code inside a GIF and be able to show it. Commented Mar 24, 2014 at 2:29
  • "because any malicious script can be in the htmlCode". Note that the HTML DOM specification explicitly says that <script> elements are not evaluated when set via innerHTML. w3.org/TR/2008/WD-html5-20080610/dom.html#innerhtml0 Commented Mar 24, 2014 at 3:24
  • @FelixKling: I think that might be because HTML5 just documents how innerHTML was implemented, first by Microsoft and then by others. I'm not sure it was specifically because of security implications (though it might well have been, just that it came from a time when MS wasn't particularly concerned with things like that); it might be just for convenience, to store and re-use HTML fragments without worrying about stripping out script elements. Commented Mar 24, 2014 at 6:22
  • By the way, I found another answer which answers my question: http://stackoverflow.com/a/11530238/2359536 Commented Mar 25, 2014 at 9:54

1 Answer

I first thought of a DocumentFragment, but it has no innerHTML; nodes have to be added with appendChild.

Then document.implementation.createHTMLDocument() came to mind.

I tested it and it works: assigning innerHTML inside that document does not make any HTTP requests for the sources.

This is my code now:

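// Create a separate, inert document; the title argument is passed
// explicitly because some older browsers require it
var dom = document.implementation.createHTMLDocument('');
dom.body.innerHTML = '<video>' +
                         '<source src="/media/video.oga">' +
                         '<source src="/media/video.m4v">' +
                         '<script src="evilscript.js"></script>' +
                     '</video>';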

From here I can access the DOM of that document and pick out the nodes I need.
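
As a rough sketch of that last step, the parsing and the cleanup could be wrapped in one helper. The name extractNode, its selector argument and the optional doc parameter are my own assumptions for illustration, not part of any API; doc exists only so the helper can be tried outside a browser.

```javascript
// Hypothetical helper: parse untrusted HTML in a separate document and
// return the first element matching `selector`, with <script> nodes removed.
function extractNode(htmlCode, selector, doc) {
  // Fall back to the page's own document when none is supplied.
  doc = doc || document;

  // Parse inside a document created via createHTMLDocument, as in the answer.
  var inert = doc.implementation.createHTMLDocument('');
  inert.body.innerHTML = htmlCode;

  // Discard every script element before handing anything back.
  var unwanted = inert.body.querySelectorAll('script');
  for (var i = 0; i < unwanted.length; i++) {
    unwanted[i].parentNode.removeChild(unwanted[i]);
  }

  // May be null if nothing matches the selector.
  return inert.body.querySelector(selector);
}

// Usage in a browser (skipped where no DOM is available).
if (typeof document !== 'undefined') {
  var video = extractNode(
    '<video>' +
      '<source src="/media/video.oga">' +
      '<script src="evilscript.js"></script>' +
    '</video>',
    'video'
  );
  console.log(video ? video.outerHTML : 'no match');
}
```

Keep in mind that the returned node still belongs to the separate document. If it is later inserted into the live page, the browser may start loading its sources at that point, so any attributes you don't trust should be checked before doing that.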


3 Comments

Do you have a specification reference for why innerHTML works like this on a document created by document.implementation.createHTMLDocument, but not in any other case? I have tracked down various references, but none mention the different behaviour of innerHTML.
Well, here it says that createHTMLDocument returns an HTMLDocument object. And here it says that setting innerHTML on an element sets the new children's ownerDocument to the current document. So I think that is the difference in behaviour: when the nodes get the page's own document as their ownerDocument, they make HTTP requests.
Perhaps it's because the domain is null (an empty string, really), and therefore resources aren't retrieved, to prevent cross-site issues?
