After several hours of trying and reading, I'm still a bit lost on the subject in the title.

My problem: I am trying to get the full HTML content of a single web page, including the HTML that JavaScript appends or adds after the page loads. What I have already tried:

  • I used Jsoup, but had to give it up because Jsoup doesn't execute JavaScript, so the dynamically added content is missing.
  • I used HtmlUnit, but I get many errors while loading the target page (CSS errors, runtimeError, EcmaError, etc.).
  • I used Chrome's built-in "save as" feature to save the complete web page, then used the Jsoup library to extract the content I wanted. So far this is the only approach that has given me the content I am after.

So the question is: how can I imitate a browser's "save as" function, or, more generally, how can I first get the full rendered HTML and then use Jsoup to scan that final, static HTML?

Thanks a lot for your advice and your help!

5
  • Did you try $('<div/>').html( $('html').clone() ).html(); Commented May 11, 2015 at 12:45
  • You can also try with content = $("html").html(); and add the js call at the end of the page... Commented May 11, 2015 at 12:47
  • I suppose that I have to replace $('html') with $('myurl.com/myPage')? Commented May 11, 2015 at 12:48
  • use a headless browser Commented May 11, 2015 at 12:48
  • I wrote a class to do this: It grabs the full HTML page and all scripts, etc. See: github.com/JonasCz/save-for-offline/blob/master/app/src/main/… Commented May 11, 2015 at 17:56

2 Answers

2

I finally got what I wanted. I will try to explain it for those who need some help!

The process has two steps:

  • First, get the final HTML content (including the HTML generated by JavaScript), as if you were visiting the web page, and save it to a plain file, file.html.
  • Then, use the Jsoup library to extract the wanted content from the saved file, file.html.

1 - Get HTML content and save it

For this step, you need to download PhantomJS and use it to fetch the content. Here is the script that fetches the target page. Just replace myTargetedPage.com with the URL of the page you want, and mySaveFile.html with the name of the output file.

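var page = require('webpage').create();
var fs = require('fs');

page.open('http://myTargetedPage.com', function (status) {
    // Stop early if the page could not be loaded.
    if (status !== 'success') {
        console.log('Unable to load the page');
        phantom.exit(1);
        return;
    }
    // page.content holds the HTML of the page as it stands when this callback fires.
    fs.write('mySaveFile.html', page.content, 'w');
    phantom.exit();
});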

The saved file should contain the same content as the page loaded in your browser, including the HTML added by JavaScript.
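If the rest of the pipeline is written in Java, the PhantomJS step can be launched from the same program instead of by hand. The following is only a minimal sketch, assuming that the script above has been saved as save.js (a hypothetical file name) and that the phantomjs executable is on the PATH. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class PhantomRunner {

    // Builds the command line "phantomjs <script>".
    // Assumption: the "phantomjs" executable can be found on the PATH.
    static List<String> buildCommand(Path script) {
        return Arrays.asList("phantomjs", script.toString());
    }

    // Starts PhantomJS, waits for it to finish and returns its exit code.
    static int run(Path script) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(script))
                .inheritIO() // forward PhantomJS console output to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // "save.js" is a hypothetical name for the PhantomJS script shown above.
        int exitCode = run(Paths.get("save.js"));
        if (exitCode != 0) {
            System.err.println("PhantomJS exited with code " + exitCode);
        }
    }
}
```

Once run returns 0, the saved HTML file should exist and can be handed to the Jsoup step.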

2 - Extract the content you wanted

Now we will use Java and the Jsoup library to get our specific content. In my example, I want to get this part of the web page:

/* HTML CONTENT */
<span class="my class" data="data1"></span>
/* HTML CONTENT */
<span class="my class" data="data2"></span>
/* HTML CONTENT */

To get this, the following code works (don't forget to edit thePathToYourSavedFile.html):

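import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractData {
    public static void main(String[] args) throws Exception {
        // The page was saved locally, so parse the file instead of calling Jsoup.connect().
        File input = new File("thePathToYourSavedFile.html");
        Document document = Jsoup.parse(input, "UTF-8");

        Elements spanList = document.select("span");

        for (Element span : spanList) {
            // Keep only the spans whose class attribute is exactly "my class".
            if (span.attr("class").equals("my class")) {
                String data = span.attr("data");
                System.out.println("data : " + data);
            }
        }
    }
}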
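Note that the loop above compares the class attribute with the exact string "my class", so a span written as class="class my" or class="my class extra" would be skipped. As a possible alternative, here is a small sketch that lets a Jsoup CSS selector do the filtering. It assumes Jsoup is on the classpath, and the class and method names (SpanSelector, extractData) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SpanSelector {

    // Returns the "data" attribute of every span that carries both the
    // "my" and the "class" CSS classes and has a "data" attribute.
    static List<String> extractData(String html) {
        Document document = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        for (Element span : document.select("span.my.class[data]")) {
            values.add(span.attr("data"));
        }
        return values;
    }

    public static void main(String[] args) {
        // Inline sample markup standing in for the saved file.
        String html = "<span class=\"my class\" data=\"data1\"></span>"
                + "<span class=\"other\" data=\"ignored\"></span>"
                + "<span class=\"my class\" data=\"data2\"></span>";
        for (String data : extractData(html)) {
            System.out.println("data : " + data);
        }
    }
}
```

The selector matches on the individual class names rather than on the raw attribute string, which makes it less sensitive to class order or extra classes.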

Enjoy!

0

There is a nice plugin that gives you what you are looking for. It offers a way to inspect a page and its functionality. It is available for some browsers, but not all. Here is the link: http://chrispederick.com/work/web-developer/

P.S. After you install it, there is a little gear icon at the top right of the toolbar. That is where all the functions are.

1 Comment

Thank you for sharing and helping, but I am trying to do this programmatically, so I think this plugin won't help me :-(
