
I want to retrieve the text of a webpage as a string. Is this possible? I am new to JavaScript.

For example:

var url = "http://en.wikipedia.org/wiki/Programming";
var result = url.getText();  // should store the page text as a string
document.write(result);

How do I write the getText method? I need either the entire HTML source (from which I can extract the text) or just the text. I would like to do this from within a web browser.

I tried this and I am able to get an index number:

var url = "http://www.youtube.com/results?search_query=cat&page=2";
var result;
function go(){
    result = url.search(/cat/i);
    document.write(result);
}

This gives me an index of 44. That means that reading a page is possible. Can I do the opposite and enter the index to retrieve the text?

  • You mean the entire HTML source? Commented Nov 3, 2012 at 2:04
  • Are you looking to do this inside a web browser or from a server-side JS engine like Node.js or Rhino? Commented Nov 3, 2012 at 2:07
  • In order to get around the cross-domain issue, is running a proxy service a possibility? Commented Nov 3, 2012 at 2:31

3 Answers


If the Ajax/Cross-Domain situation is not an issue for you, you can extract the text of a web page with

var el = document.body; // or some other element reference
var text = el.innerText || el.textContent;

If you need to read text from pages in the same domain as your application, you can use Ajax directly.

If you need to read text from pages outside of your domain, you'll have to jump through a few extra hoops, such as setting up a proxy server or relying on CORS (Cross-Origin Resource Sharing; see http://en.wikipedia.org/wiki/Cross-origin_resource_sharing).
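
As a rough sketch of the Ajax route, something like the following could stand in for the `getText` method from the question. It assumes a browser that provides `fetch` and `DOMParser`, and a URL that is either same-origin or served with CORS headers that allow your page. The names `getPageSource`, `htmlToText` and `sliceAt` are made up for illustration. Note that `url.search(/cat/i)` in the question scans the URL string itself, not the page behind it, which is why it returns 44; once you have the page text as a string, the same `search` call and a `slice` work on that text.

```javascript
// Sketch only: getPageSource, htmlToText and sliceAt are hypothetical helper names.

// Fetch a page and resolve with its HTML source as a string.
// In a browser this succeeds only for same-origin URLs, or for cross-origin
// URLs whose server sends CORS headers that allow your page.
// fetchImpl defaults to the global fetch; it is a parameter so a stub can be passed in.
async function getPageSource(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return response.text();
}

// Turn HTML source into plain text. Browser only: relies on DOMParser.
function htmlToText(html) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  // Drop script and style elements so their contents do not end up in the text.
  doc.querySelectorAll("script, style").forEach((el) => el.remove());
  return doc.body.textContent;
}

// Return `length` characters of `text` starting at `index`.
function sliceAt(text, index, length) {
  return text.slice(index, index + length);
}

// Example usage in a browser console (the path is an assumption):
// getPageSource("/some/same-origin/page.html")
//   .then((html) => {
//     const text = htmlToText(html);
//     const index = text.search(/cat/i); // -1 if "cat" does not occur
//     console.log(index === -1 ? "not found" : sliceAt(text, index, 40));
//   })
//   .catch((err) => console.error(err));
```

For a cross-origin page that does not send CORS headers, the `fetch` call is rejected by the browser, so the request would have to go through a proxy on your own domain instead.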



Ajax won't work across domains unless the target server allows it through CORS. Otherwise you need a server-side language.



You would be better off using a more powerful server-side language to do that, not JavaScript. Python or PHP would be decent choices.

4 Comments

JavaScript is also a server side language; see also en.wikipedia.org/wiki/…
Yes, but that's not the best option for parsing the HTML, Python would be much better, IMHO.
I used to do this in Perl, now I do it in Node.js - NPM has plenty of modules that are relevant. One day I'll actually sit down and learn Python :)
I really want to do it within a browser. Would a browser extension work?
