Get the FULL HTML content of a web page (including JavaScript-generated content)

Question

After some hours of trying and reading, I'm a bit lost on the subject in the title.

My problem: I am trying to get the full HTML content of a single web page, including the HTML that JavaScript appends or adds. What I have already tried:

I used Jsoup, but I had to drop it because Jsoup doesn't execute JavaScript, so script-generated content is missing.
I used HtmlUnit, but I get many errors while loading the target page (CSS errors, runtimeError, EcmaError, etc.).
I used Chrome's built-in "Save as" feature to save the complete page and then used the Jsoup library to find the content I wanted. This is the only way I have managed to get the content I'm after.

So the question is: how can I imitate a browser's "Save as" function, or more generally, how can I first get the full HTML content and then use Jsoup to scan the final, static HTML?

Thanks a lot for your advice and your help!

You can also try content = $("html").html(); and add the JS call at the end of the page... – crisu, May 11, 2015 at 12:47
I suppose that I have to replace $('html') with $('myurl.com/myPage')? – Samuel Private, May 11, 2015 at 12:48
I wrote a class to do this: it grabs the full HTML page and all scripts, etc. See: github.com/JonasCz/save-for-offline/blob/master/app/src/main/… – Jonas Czech, May 11, 2015 at 17:56

Samuel Private · Accepted Answer · 2015-05-11 14:43:04Z

I finally got what I wanted. I will try to explain it for those who need some help!

The process consists of two steps:

First, get the final HTML content (including the HTML generated by JavaScript), as if you were visiting the web page, and save it to a simple file, file.html.
Then, use the Jsoup library to extract the wanted content from the saved file, file.html.

1 - Get HTML content and save it

For this step, you need to download PhantomJS and use it to fetch the content. Here is the code that fetches the target page. Just replace myTargetedPage.com with the URL of the page you want and mySaveFile.html with the name of the output file.

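// Run with: phantomjs <this script>
var page = require('webpage').create();
var fs = require('fs');

page.open('http://myTargetedPage.com', function (status) {
    if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
    }
    // page.content holds the DOM serialized after the page's scripts have run
    fs.write('mySaveFile.html', page.content, 'w');
    phantom.exit();
});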

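The saved file contains the same HTML that your browser holds after loading the page, including the content added by JavaScript. Content that scripts insert some time after the load event may still be missing; in that case, delay the write, for example with setTimeout.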
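
Since the goal is to do everything programmatically, the PhantomJS script can also be launched from Java before the Jsoup step. The following is only a minimal sketch under two assumptions that are not part of the answer: the phantomjs executable is on the PATH, and the script above has been saved as save_page.js (a hypothetical file name). PhantomRunner, buildCommand and run are illustrative names.

```java
import java.io.IOException;
import java.util.List;

public class PhantomRunner {

    // Builds the command line; assumes "phantomjs" can be found on the PATH.
    static List<String> buildCommand(String scriptPath) {
        return List.of("phantomjs", scriptPath);
    }

    // Starts PhantomJS, waits for it to finish and returns its exit code.
    static int run(String scriptPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(scriptPath))
                .inheritIO() // forward PhantomJS console output to this process
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // "save_page.js" is a hypothetical name for the script shown above.
        int exitCode = run("save_page.js");
        System.out.println("phantomjs exited with code " + exitCode);
    }
}
```

Once the process has finished, the saved HTML file should exist on disk and can be handed to Jsoup in step 2.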

2 - Extract the content you wanted

Now, we will use Java and the Jsoup library to extract our specific content. In my example, I want to get this part of the web page:

/* HTML CONTENT */
<span class="my class" data="data1"></span>
/* HTML CONTENT */
<span class="my class" data="data2"></span>
/* HTML CONTENT */

To get this, the following code will do (don't forget to edit thePathToYourSavedFile.html):

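import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractData {
    public static void main(String[] args) throws Exception {
        // Jsoup.connect() only accepts http(s) URLs, so the saved file is read with Jsoup.parse()
        File input = new File("thePathToYourSavedFile.html");
        Document document = Jsoup.parse(input, "UTF-8");

        Elements spanList = document.select("span");
        for (Element span : spanList) {
            // Compare the raw class attribute exactly as it is written in the page
            if (span.attr("class").equals("my class")) {
                String data = span.attr("data");
                System.out.println("data : " + data);
            }
        }
    }
}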
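
One caveat about the comparison with "my class": an HTML class attribute is a space-separated list of class names, so the sample spans carry two classes, my and class. An exact string comparison stops matching if the page lists them in another order or adds a third class. Below is a minimal plain-Java sketch of an order-insensitive check; ClassTokens and hasAllClasses are hypothetical names, not part of Jsoup. With Jsoup itself, the selector span.my.class or Element.hasClass should express the same idea.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ClassTokens {

    // Returns true when every wanted class name appears in the class attribute,
    // regardless of order and of any extra class names.
    static boolean hasAllClasses(String classAttribute, String... wanted) {
        Set<String> tokens = new HashSet<>(Arrays.asList(classAttribute.trim().split("\\s+")));
        return tokens.containsAll(Arrays.asList(wanted));
    }

    public static void main(String[] args) {
        System.out.println(hasAllClasses("my class", "my", "class"));       // true
        System.out.println(hasAllClasses("class extra my", "my", "class")); // true
        System.out.println(hasAllClasses("my", "my", "class"));             // false
    }
}
```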

Enjoy!

Sari Rahal · Answer · 2015-05-11 12:50:13Z

There is a nice browser extension that gives you what you are looking for. It lets you inspect a page and its functionality. It is available for some browsers, but not all. Here is the link: http://chrispederick.com/work/web-developer/

P.S. After you install it, a little gear icon appears at the top right of the toolbar. That is where all the functions are.

Thank you for sharing and helping, but I am trying to do this programmatically, so I don't think this plugin will help me :-( – Samuel Private, over a year ago
