
I am trying to build a string of the contents of a webpage, without HTML syntax (probably replace it with a space, so words are not all conjoined) or punctuation.

So say you have this markup:

    <body>
    <h1>Content:</h1>
    <p>paragraph 1</p>
    <p>paragraph 2</p>

    <script> alert("blah blah blah"); </script>

    This is some text<br />
    ....and some more
    </body>

I want to return the string:

    var content = "Content paragraph 1 paragraph 2 this is some text and this is some more";

Any idea how to do this? Thanks.

4 Answers

3

You can use the innerText property (instead of innerHTML, which returns the HTML tags as well):

    var content = document.getElementsByTagName("body")[0].innerText;

However, note that this will also include new lines, so if you are after exactly what you specified in your question, you would need to remove them.
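As a rough sketch of that clean-up step: the helper below collapses newlines and other whitespace runs into single spaces and replaces punctuation with a space. The name `normalizeText` is made up for illustration, and the punctuation pattern relies on `\w`, which only covers ASCII letters, digits and underscore, so accented or non-Latin text would need a different pattern.

```javascript
// Hypothetical helper, not part of any browser API.
// Takes the raw string from innerText (or any other source) and
// returns it with punctuation removed and whitespace collapsed.
function normalizeText(raw) {
  return raw
    .replace(/[^\w\s]|_/g, ' ') // anything that is not a word char or whitespace becomes a space
    .replace(/\s+/g, ' ')       // newlines, tabs and runs of spaces become one space
    .trim();                    // drop leading/trailing space
}

// In a browser this would be used roughly like:
// var content = normalizeText(document.body.innerText);
```

Note that this only tidies the string; it does nothing about which text `innerText` chooses to return in the first place.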


6 Comments

To get rid of the whitespace also: var content = document.getElementsByTagName("body")[0].innerText.replace(/\s*/g, ' ')
Only problem would be that Firefox doesn't support innerText.
You don't need a RegEx for that, a simple split -> join will do the job more efficiently.
@patrick dw - Very good point that somehow slipped my mind. The textContent property can go some way to solving this. @Stoive - Your regex looks like it will add a space between every character...
@James: Yeah, I started a solution using (document.body.textContent || document.body.innerText).replace(..., but the textContent seems to give you the content of the <script> as well. Lost interest after that. :o)
2

There is the W3C DOM 3 Core textContent property supported by some browsers, or the MS/HTML5 innerText property supported by other browsers (some support both). The content of the script element is likely unwanted, so a recursive traversal of the relevant part of the DOM tree seems best:

    // Get the text within an element
    // Doesn't do any normalising, returns a string
    // of text as found.
    function getTextRecursive(element) {
      var text = [];
      var self = arguments.callee;
      var el, els = element.childNodes;

      for (var i=0, iLen=els.length; i<iLen; i++) {
        el = els[i];

        // May need to add other node types here
        // Exclude script element content
        if (el.nodeType == 1 && el.tagName && el.tagName.toLowerCase() != 'script') {
          text.push(self(el));

        // If working with XML, add nodeType 4 to get text from CDATA nodes
        } else if (el.nodeType == 3) {

          // Deal with extra whitespace and returns in text here.
          text.push(el.data);
        }
      }
      return text.join('');
    }
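Since `arguments.callee` throws in strict mode, here is a sketch of the same traversal that recurses by function name instead. It is a variation on the function above, not a replacement for it. Two things are my own assumptions rather than part of the answer: `style` is skipped along with `script`, and pieces are joined with a space so that text from neighbouring elements is not run together (which also means `<b>foo</b>bar` comes out as "foo bar"). The name `getText` is made up for the example.

```javascript
// Sketch only: collects text from text nodes under `node`,
// skipping elements listed in SKIP (an assumed list, adjust as needed).
var SKIP = { script: true, style: true };

function getText(node) {
  'use strict';
  var parts = [];
  var children = node.childNodes;

  for (var i = 0; i < children.length; i++) {
    var child = children[i];

    if (child.nodeType === 1) {              // element node
      if (!SKIP[child.tagName.toLowerCase()]) {
        parts.push(getText(child));          // recurse by name, no arguments.callee
      }
    } else if (child.nodeType === 3) {       // text node
      parts.push(child.data);
    }
  }
  // Join with a space so words from adjacent elements stay separate
  return parts.join(' ');
}

// In a browser, with whitespace collapsed afterwards:
// var content = getText(document.body).replace(/\s+/g, ' ').trim();
```

The function only reads `nodeType`, `tagName`, `childNodes` and `data`, so it can be tried against plain objects with those properties as well as real DOM nodes.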

5 Comments

I don't know, can I upvote an answer that doesn't have a jsFiddle attached? ;o) Here's the live example for those who are interested. Only thing I added was: .replace(/\s+/g, ' ') to give the output OP wanted. I'd also note that arguments.callee is deprecated, and currently unavailable in "strict mode". +1
@patrick - arguments.callee is not deprecated in ES5 (where deprecated means marked for deletion in future editions), however its use is restricted in that it is not available in strict mode.
My understanding is that today's "strict mode" will be standard in the next version of ECMAScript. Is that not correct?
I have no idea. ES5 strict mode code will possibly not run without error in an ES 3 environment, and vice versa. I think it's a practical impossibility to remove ES 3 features restricted in ES 5 without a long period of deprecation first and clear statement of that intent. I haven't seen evidence of that.
Yeah, I may be wrong about that. I thought I read it in the Wiki for Harmony, but now I can't find it. Closest thing I can find is from this MDN article "Future ECMAScript versions will likely introduce new syntax, and strict mode in ECMAScript 5 applies some restrictions to ease the transition..." Suggesting that at least some features in future versions will require some enforcement of strict mode rules, but certainly does not suggest full deprecation of strict mode violations.
0

You'll need a striptags function in JavaScript for that, and a regex to replace consecutive newlines with a single space.

2 Comments

-1 Everyone is too hasty to resort to regular expressions (computationally expensive) when more efficient solutions exist. The thing that "just works" isn't always the best.
If you know of a more efficient way to replace consecutive whitespace and newline characters with a single space, I'm sure the OP would appreciate you providing it.
0

You can try using the replace statement below

    var str = "..your HTML..";
    var content = str.replace(/<\/?[a-zA-Z0-9]+>|<[a-zA-Z0-9]+\s*\/>|\r?\n/g, " ");

For the HTML that you have provided above, this will give you the following string in content

   Content:   paragraph 1   paragraph 2    alert("blah blah blah");   This is some text  ....and some more  
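As the output shows, the script body and the punctuation are still there. If you want to stay with a string-based approach, one possible sketch is to drop whole script blocks first and then strip the remaining tags. The function name `stripHtml` is made up for the example, and this is regex matching on HTML, so treat it as fragile: it does not decode entities such as `&amp;`, it will trip over a `>` inside an attribute value, and the punctuation pattern is ASCII-only.

```javascript
// Sketch only: string-based stripping, not a real HTML parser.
function stripHtml(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, ' ') // script blocks, including their content
    .replace(/<[^>]+>/g, ' ')                               // any remaining tags
    .replace(/[^\w\s]|_/g, ' ')                             // ASCII punctuation
    .replace(/\s+/g, ' ')                                   // collapse whitespace and newlines
    .trim();
}
```

Run against the markup from the question, this should give "Content paragraph 1 paragraph 2 This is some text and some more".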
