I'm working on a web-based application that loads the HTML content of a URL through a call to http://www.whateverorigin.org/. This avoids violating the same-origin policy.

var url = 'http://' + document.getElementById("urlText").value;
$.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent(url) + '&callback=?', function(data){
    // Parse the returned HTML string into a document
    var doc = new DOMParser().parseFromString(data.contents, 'text/html');
    // (the visible text should be extracted from doc here)
});

If I need to extract the meaningful visible text from this HTML string, is there a way to do it the way BeautifulSoup would in Python? I'm a beginner in JavaScript.

2 Answers

Use jQuery to find and iterate over the appropriate elements. Then you can decide what to print out, for example the text of the visible items. Here is a jsfiddle with a working example: http://jsfiddle.net/w147o9f6/1/

<body>
    <div id="outputTexts">OUTPUT:</div>
</body>

JavaScript:

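var parser = new DOMParser();
var doc;
var meaningfulTexts = [];
$.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent('https://www.facebook.com') + '&callback=?', function(data){
    // Parse the fetched HTML string into a detached document
    doc = parser.parseFromString(data.contents, "text/html");

    // Candidate elements that usually carry visible text
    var ELMS = $(doc).find("div, p, a, span");
    ELMS.each(function(index, element) {
        var text = $(element).text();
        // Skip elements hidden by an inline style and elements without text
        if(element.style.display !== "none" && text !== "") {
            // Append as a text node so the fetched text is not interpreted as HTML
            $("#outputTexts").append('<br>', document.createTextNode(element.tagName + ' - ' + text));
            meaningfulTexts.push(text);
        }
    });
});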

3 Comments

I see CSS styling info as part of the meaningful text. Is there a way to remove it?
I checked my code against Facebook and some other websites and it worked very well. When I checked it against Google, it showed those CSS rules (they sit inside a span tag). I don't know whether it's a problem with my code or with Google's site. Is google.com the website you intend to work with?
The web-based application would fetch the visible text from any site. I changed the selector to $(doc).find("p, a"); and that seemed to work better.
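
For the CSS text showing up in the output, here is a minimal sketch of one possible approach, not the answerer's method: walk the parsed document and collect only text nodes, skipping `script`, `style`, `noscript` and `template` elements as well as elements hidden by the `hidden` attribute or an inline `display: none`. The name `extractVisibleText` and the list of skipped tags are assumptions made for illustration. As far as I know, a document produced by `DOMParser` is never rendered, so hiding done through stylesheet rules is not detected here.

```javascript
// Hypothetical helper, not part of jQuery or the DOM API.
// Tags whose text content is never shown to the user (assumed list).
var SKIPPED_TAGS = ['SCRIPT', 'STYLE', 'NOSCRIPT', 'TEMPLATE'];
var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
var TEXT_NODE = 3;    // same value as Node.TEXT_NODE

// Recursively collects the trimmed, non-empty text nodes below `node`.
function extractVisibleText(node, out) {
    out = out || [];
    if (node.nodeType === TEXT_NODE) {
        // Collapse whitespace runs and drop whitespace-only nodes
        var text = (node.nodeValue || '').replace(/\s+/g, ' ').trim();
        if (text !== '') out.push(text);
        return out;
    }
    if (node.nodeType === ELEMENT_NODE) {
        var tag = String(node.tagName).toUpperCase();
        if (SKIPPED_TAGS.indexOf(tag) !== -1) return out;
        // Only the hidden attribute and inline styles are checked
        if (node.hidden || (node.style && node.style.display === 'none')) return out;
    }
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        extractVisibleText(children[i], out);
    }
    return out;
}
```

Inside the `$.getJSON` callback this could be called as `var strings = extractVisibleText(doc.body);`, which yields one string per text node, so nested elements do not repeat their children's text the way `$(element).text()` on both a `div` and its inner `p` does.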

Is this what you need? The code below fetches google.nl through whateverorigin.org and adds the result to a div. If not, please explain what else you need.

jQuery:

$(document).ready(function() {
    $.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent('http://www.google.nl') + '&callback=?', function(data){
        $('.result').html(data.contents);
    });
});

HTML:

<div class="result"></div>

Example: http://jsfiddle.net/qddekhnc/1/

1 Comment

Thanks a lot, Jeffrey. I need the meaningful text as raw strings.
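
On the raw-strings point, a minimal sketch, assuming the text has already been read from the page as one large string (for example through the standard `textContent` property). The helper name `toRawStrings` is hypothetical. It only tidies whitespace and does not decide what counts as meaningful.

```javascript
// Hypothetical helper: turns one block of extracted text into clean lines.
function toRawStrings(text) {
    return text
        .split(/\r?\n/)                              // one candidate string per line
        .map(function (line) {
            return line.replace(/\s+/g, ' ').trim(); // collapse runs of whitespace
        })
        .filter(function (line) {
            return line.length > 0;                  // drop lines that are now empty
        });
}
```

The result is a plain array of strings, which can be joined with `'\n'` or processed further without touching the DOM again.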
