39

I have spent a day researching a library that can do the following:

  • Retrieve the full contents of a web page in the background, without rendering the result to a view.
  • Support pages that fire off AJAX requests to load additional data after the initial HTML has loaded.
  • Let me select elements from the resulting HTML using XPath or CSS selectors.
  • In future, possibly navigate to the next page (firing events, clicking buttons or links, etc.).

Here is what I have tried without success:

  • Jsoup: Works great, but has no support for JavaScript/AJAX, so it does not load the full page.
  • Android's built-in HttpEntity: Same JavaScript/AJAX problem as Jsoup.
  • HtmlUnit: Looks like exactly what I need, but after hours of trying I cannot get it to work on Android. (Other users failed when trying to load the 12 MB+ of JAR files. I loaded the full source code and referenced it as a project library, only to find that things HtmlUnit uses, such as applets and java.awt, do not exist on Android.)
  • Rhino: I find this very confusing; I don't know how to get it working on Android, or even whether it is what I am looking for.
  • Selenium driver: Looks like it could work, but there is no straightforward way to run it headless, so that the HTML is not actually displayed in a view.

I really want HtmlUnit to work, as it seems best suited to my needs. Is there any way to make it work on Android, or is there another suitable library that I have missed?

I am currently using Android Studio 0.1.7 and can move to Eclipse if needed.

Thanks in advance!

8
  • 1
    It seems that there is nothing that can be used for my scenario. I have started working on an Android port of HtmlUnit and hope to have something working soon. I will post here as soon as I have checked in an HtmlUnit branch that anyone can download. Hopefully I can get the HtmlUnit developers involved, as there seems to be a lot of interest in an Android port. Commented Jul 3, 2013 at 7:14
  • 4
    It's been 4 YEARS AND WE'RE STILL HERE! I'M FACING THE SAME PROBLEM! Commented Mar 26, 2017 at 15:14
  • Given the current answers, this should be reworded to not be a library request. It could then be reopened. If you do reword it, please ping me @Makyen, so I can help in getting it reopened. Commented Oct 14, 2019 at 18:10
  • 3
    The link to htmlunit android port: github.com/HtmlUnit/htmlunit-android Commented Aug 4, 2022 at 13:04
  • 2
    !!!!!!!!!!!! HTMLUNIT IS NOW ON ANDROID: github.com/HtmlUnit/htmlunit-android !!!!!!!!!!!! Commented Oct 2, 2022 at 4:38

2 Answers

37

OK, after 2 weeks I admit defeat and am using a workaround, which works great for me at the moment.

The problem:
Porting HtmlUnit to Android is too difficult (at least at my level of expertise). I am sure it is a worthwhile project, and probably not that time-consuming for an experienced Java programmer. I emailed the HtmlUnit developers; they replied that they are not looking into a port and have not assessed the effort involved, but suggested that anyone who wants to start such a project should send a message to their mailing list to get more developers involved (http://htmlunit.sourceforge.net/mail-lists.html).

The workaround:
I used Android's built-in WebView and overrode the onPageFinished method of WebViewClient to inject JavaScript that grabs all the HTML after the page has fully loaded. WebView can also be used to run further JavaScript actions: clicking buttons, filling in forms, etc.

Code:

webView.getSettings().setJavaScriptEnabled(true);
// Bridge object that page JavaScript can reach as window.HtmlViewer
MyJavaScriptInterface jInterface = new MyJavaScriptInterface();
webView.addJavascriptInterface(jInterface, "HtmlViewer");

webView.setWebViewClient(new WebViewClient() {

    @Override
    public void onPageFinished(WebView view, String url) {
        // Page has finished loading: hand the current DOM back to Java
        view.loadUrl("javascript:window.HtmlViewer.showHTML('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");
    }

});

// Loading is asynchronous; the HTML only becomes available in showHTML()
webView.loadUrl(StartURL);

public class MyJavaScriptInterface {

    public String html;

    @JavascriptInterface
    public void showHTML(String _html) {
        // Invoked from JavaScript on a background thread, not the UI thread
        html = _html;
        ParseHtml(html);
    }
}
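
The follow-up actions mentioned above (clicking buttons, filling in forms) can be driven the same way, by loading further `javascript:` URLs into the WebView. Below is a minimal sketch of how such URLs might be built. `WebViewScripts`, `jsQuote`, `click` and `setValue` are hypothetical names introduced for illustration and are not part of the Android SDK. The escaping covers only backslashes, single quotes and line breaks, so treat it as a starting point rather than a complete solution.

```java
/**
 * Hypothetical helper (not part of the Android SDK): builds "javascript:" URLs
 * that could be passed to WebView.loadUrl() to act on the currently loaded page.
 */
final class WebViewScripts {

    private WebViewScripts() {}

    /** Wraps text in single quotes, escaping it for use as a JavaScript string literal. */
    public static String jsQuote(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.append("'").toString();
    }

    /** URL whose script clicks the first element matching the CSS selector, if any. */
    public static String click(String cssSelector) {
        return "javascript:(function(){var e=document.querySelector("
                + jsQuote(cssSelector) + ");if(e){e.click();}})()";
    }

    /** URL whose script sets the value of the first matching form field, if any. */
    public static String setValue(String cssSelector, String value) {
        return "javascript:(function(){var e=document.querySelector("
                + jsQuote(cssSelector) + ");if(e){e.value=" + jsQuote(value) + ";}})()";
    }
}
```

Usage would look something like `webView.loadUrl(WebViewScripts.setValue("input[name=q]", "android"));` followed by `webView.loadUrl(WebViewScripts.click("#next"));`, where both selectors are made-up examples. If the click triggers a full page navigation, onPageFinished should fire again for the new page, so the same HTML-grabbing step can be reused. Content loaded purely via AJAX will not trigger it.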

5 Comments

I am also trying to create an Android app, but I need to scrape a website first in order to proceed, and that site also relies on JavaScript (its content is loaded dynamically). Any suggestions? Thanks!
This problem is still not solved. An HtmlUnit port for Android would be a dream, as you can pick elements from the page and call a .click() method to generate a new page. Is there any way to do that using the Android WebView?
Can this work while the phone is in standby?
@LUKER Did you find the answer?
What about Retrofit? Has anyone tried? github.com/square/retrofit
0

I used the approach described above (injecting JavaScript) and it works for me. I simply hide the WebView beneath other UI elements. I was also thinking of doing the same with Selenium. I have used Selenium with Chrome in Python and it's great, but as you mentioned, it is not easy to hide the browser window. I think it might be possible to simply not show the component on Android, though. I'll have to try.
