
I need to copy all the HTML code of a page.

This is what I do:

URL url = new URL(testurl);
URLConnection connection = url.openConnection();
connection.connect();

String htmlText = "";
Scanner in = new Scanner(connection.getInputStream());
while (in.hasNextLine()) {
    htmlText = htmlText + in.nextLine();
}
in.close();

But if the page is large, it takes a lot of time.

Is there a faster method?

  • Did you try the Jsoup library? Commented Apr 29, 2014 at 14:33
  • How do I keep the HTML code? Jsoup parses just the text. Commented Apr 29, 2014 at 14:47

2 Answers


Have you tried a different method of reading the page, like a BufferedReader? See Reading the content of web page or Reading entire html file to String.

I'm just thinking Scanner may be a little slow.
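For illustration, here is a minimal sketch of that approach using a BufferedReader and a StringBuilder (the repeated String concatenation in your loop is usually the real bottleneck); the class and method names are just placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class PageDownloader {
    public static String download(String testurl) throws IOException {
        URL url = new URL(testurl);
        URLConnection connection = url.openConnection();
        connection.connect();

        // Accumulate into a StringBuilder instead of concatenating Strings
        StringBuilder htmlText = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                htmlText.append(line).append('\n');
            }
        }
        return htmlText.toString();
    }
}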

Tim


1 Comment

Thanks, the first link is helpful!

Try using JSoup (http://jsoup.org) to download and parse the HTML from the URL.

You can get the HTML as a Document and read either the full markup or the text of each element.

new AsyncTask<Void, Integer, String>() {
    @Override
    protected String doInBackground(Void... params) {
        try {
            // Download and parse the page off the UI thread
            // (needs org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.IOException)
            Document doc = Jsoup.connect("http://yoururl.com").get();
            // outerHtml() returns the full markup of the downloaded page
            return doc.outerHtml();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }

    @Override
    protected void onPostExecute(String html) {
        // Runs on the UI thread once the download has finished;
        // use the HTML here (e.g. display it or extract elements)
    }
}.execute();
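doc.outerHtml() keeps the complete markup of the page, which also addresses the comment above: use doc.body().text() (or getElementsByTag(...).text()) only if you want the visible text rather than the HTML.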
