javascript HTML from document.body.innerHTML

Question

I am trying to build a string from the contents of a web page, without the HTML syntax (probably replacing it with a space, so that words do not run together) or punctuation.

So, say you have the code:

    <body>
    <h1>Content:</h1>
    <p>paragraph 1</p>
    <p>paragraph 2</p>

    <script> alert("blah blah blah"); </script>

    This is some text<br />
    ....and some more
    </body>

I want to return the string:

    var content = "Content paragraph 1 paragraph 2 this is some text and this is some more";

Any idea how to do this? Thanks.

James Allardice · Accepted Answer · 2011-07-14 00:19:51Z

3

You can use the innerText property (instead of innerHTML, which returns the HTML tags as well):

    var content = document.getElementsByTagName("body")[0].innerText;

However, note that this will also include new lines, so if you are after exactly what you specified in your question, you would need to remove them.
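
As a rough sketch of that clean-up step, the helper below strips punctuation and collapses runs of whitespace, including new lines, into single spaces. The name `normalizeText` is hypothetical and not part of the answer, and the sketch assumes ASCII text: `\w` only covers `A-Z`, `a-z`, `0-9` and `_`, so accented letters would be removed as well.

```javascript
// Hypothetical helper: tidy up text read from innerText (or textContent).
function normalizeText(text) {
  return text
    .replace(/[^\w\s]/g, ' ') // replace anything that is not a word character or whitespace
    .replace(/\s+/g, ' ')     // collapse runs of whitespace, including new lines
    .trim();                  // drop leading and trailing spaces
}

// In a browser this could be used as:
// var content = normalizeText(document.body.innerText);
```

Replacing punctuation with a space rather than deleting it keeps words apart when the only separator between them was a punctuation mark.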

6 Comments

Stoive Over a year ago

To get rid of the whitespace also: var content = document.getElementsByTagName("body")[0].innerText.replace(/\s*/g, ' ')

user113716 Over a year ago

Only problem would be that Firefox doesn't support innerText.

Griffin Over a year ago

You don't need a RegEx for that, a simple split -> join will do the job more efficiently.

James Allardice Over a year ago

@patrick dw - Very good point that somehow slipped my mind. The textContent property can go some way to solving this. @Stoive - Your regex looks like it will add a space between every character...

user113716 Over a year ago

@James: Yeah, I started a solution using (document.body.textContent || document.body.innerText).replace(..., but the textContent seems to give you the content of the <script> as well. Lost interest after that. :o)

RobG · Accepted Answer · 2011-07-14 00:42:30Z

2

There is the W3C DOM 3 Core textContent property, supported by some browsers, and the MS/HTML5 innerText property, supported by others (some support both). The content of the script element is probably unwanted, so a recursive traversal of the relevant part of the DOM tree seems best:

    // Get the text within an element
    // Doesn't do any normalising, returns a string
    // of text as found.
    function getTextRecursive(element) {
      var text = [];
      var self = arguments.callee;
      var el, els = element.childNodes;

      for (var i=0, iLen=els.length; i<iLen; i++) {
        el = els[i];

        // May need to add other node types here
        // Exclude script element content
        if (el.nodeType == 1 && el.tagName && el.tagName.toLowerCase() != 'script') {
          text.push(self(el));

        // If working with XML, add nodeType 4 to get text from CDATA nodes
        } else if (el.nodeType == 3) {

          // Deal with extra whitespace and returns in text here.
          text.push(el.data);
        }
      }
      return text.join('');
    }
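
Because `arguments.callee` is not available in strict mode, one possible variant recurses through the function's own name instead. This is a sketch, not the code from the answer: the name `getText` is hypothetical, and it only relies on the `nodeType`, `tagName`, `childNodes` and `data` properties of DOM nodes.

```javascript
// Sketch of a strict-mode-friendly variant of the recursive traversal.
// `getText` is a hypothetical name, not part of the original answer.
function getText(node) {
  'use strict';
  var parts = [];
  var children = node.childNodes;

  for (var i = 0; i < children.length; i++) {
    var child = children[i];

    if (child.nodeType === 1) {
      // Element node: descend into it unless it is a <script>
      if (child.tagName.toLowerCase() !== 'script') {
        parts.push(getText(child));
      }
    } else if (child.nodeType === 3) {
      // Text node: keep its text exactly as found
      parts.push(child.data);
    }
  }
  return parts.join('');
}

// In a browser this could be used as:
// var content = getText(document.body).replace(/\s+/g, ' ');
```

Like the original, it does no normalising, so whitespace still has to be collapsed afterwards.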

5 Comments

user113716 Over a year ago

I don't know, can I upvote an answer that doesn't have a jsFiddle attached? ;o) Here's the live example for those who are interested. Only thing I added was: .replace(/\s+/g, ' ') to give the output OP wanted. I'd also note that arguments.callee is deprecated, and currently unavailable in "strict mode". +1

RobG Over a year ago

@patrick - arguments.callee is not deprecated in ES5 (where deprecated means marked for deletion in future editions), however its use is restricted in that it is not available in strict mode.

user113716 Over a year ago

My understanding is that today's "strict mode" will be standard in the next version of ECMAScript. Is that not correct?

RobG Over a year ago

I have no idea. ES5 strict mode code will possibly not run without error in an ES 3 environment, and vice versa. I think it's a practical impossibility to remove ES 3 features restricted in ES 5 without a long period of deprecation first and clear statement of that intent. I haven't seen evidence of that.

user113716 Over a year ago

Yeah, I may be wrong about that. I thought I read it in the Wiki for Harmony, but now I can't find it. Closest thing I can find is from this MDN article "Future ECMAScript versions will likely introduce new syntax, and strict mode in ECMAScript 5 applies some restrictions to ease the transition..." Suggesting that at least some features in future versions will require some enforcement of strict mode rules, but certainly does not suggest full deprecation of strict mode violations.

ChrisR · Accepted Answer · 2011-07-14 00:20:29Z

0

You'll need a striptags function in JavaScript for that, plus a regex to replace consecutive newlines with a single space.

2 Comments

Griffin Over a year ago

-1 Everyone is too hasty to resort to regular expressions (computationally expensive) when more efficient solutions exist. The thing that "just works" isn't always the best.

ChrisR Over a year ago

If you know of a more efficient way to replace consecutive whitespace and newline characters with a single space, I'm sure the OP would appreciate you providing it.

Jophin Joseph · Accepted Answer · 2011-07-14 06:29:08Z

0

You can try using the replace statement below:

    var str = "..your HTML..";
    // The forward slashes inside the pattern must be escaped in a regex literal
    var content = str.replace(/<\/?[a-zA-Z0-9]+>|<[a-zA-Z0-9]+\s*\/>|\r?\n/g, " ");

For the HTML that you have provided above, this will give you the following string in content:

   Content:   paragraph 1   paragraph 2    alert("blah blah blah");   This is some text  ....and some more
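
The output above still contains the script body, and the pattern only matches tags that have no attributes. One possible regex-only refinement, offered as a sketch (the helper name `stripTags` is hypothetical), removes whole script elements first and then any remaining tags. Regex-based tag stripping is fragile on real-world HTML, for example with comments or a `>` inside an attribute value, so a DOM-based approach is usually safer where one is available.

```javascript
// Hypothetical helper: regex-only tag stripping, for simple markup only.
function stripTags(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, ' ') // drop script elements with their content
    .replace(/<[^>]+>/g, ' ')                            // replace every remaining tag with a space
    .replace(/\s+/g, ' ')                                // collapse whitespace and new lines
    .trim();
}
```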
