Read online document in Java

Question

I am using

URL url = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt"); 
Scanner s = new Scanner(url.openStream());

to read my document. But when I output the strings, I get some unnecessary tags. My requirement is to read the document as it is (i.e. without any unnecessary tags).

Below is the piece of code I have written:

URL url = new URL("Link");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String str;
String[] wordArray;
int wordCount;

while ((str = in.readLine()) != null) {
System.out.println(str);
wordArray = str.split("\\s+");
wordCount = wordArray.length;
System.out.println("Word count is = " + wordCount);
}

in.close();

Below is the output I am getting. I do not want any of the unnecessary tags you can see in it; by unnecessary tags I mean the HTML markup in the output snippet below. I just want the actual text between the tags, such as 'UNITED STATES SECURITIES AND EXCHANGE', in my string.

Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"></P>
Word count is = 12

Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B>UNITED STATES SECURITIES AND EXCHANGE
Word count is = 16
COMMISSION</B></P>
Word count is = 1

Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B>Washington, D.C. 20549</B></P>
Word count is = 14

Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B>&nbsp;</B></P>
Word count is = 12

possible duplicate of stackoverflow.com/questions/240546/… – Afshin Moazami, commented Dec 30, 2013 at 20:59

SergeyB · Accepted Answer · 2013-12-30 20:58:39Z

5

What you are getting back is the HTML source code. If you are interested in just the contents of the body tag, for example, you can use jsoup to extract it and to remove all the tags within the body too.

Franz Ebner · Accepted Answer · 2013-12-30 21:10:18Z

2

jsoup does stuff like this:

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ParagraphText {
        public static void main(String[] args) {
            String url = "http://www.google.com";
            Document doc;
            try {
                // Fetch the page and parse it into a DOM
                doc = Jsoup.connect(url).get();
            } catch (IOException e) {
                e.printStackTrace();
                return;
            }

            // Start browsing the document, for example:
            // print the text of every <p> element, with the tags removed
            for (Element p : doc.getElementsByTag("p")) {
                System.out.println(p.text());
            }
        }
    }

arsingh1212 · Accepted Answer · 2013-12-31 13:42:35Z

1

You can use the jsoup library for this job [ http://www.jsoup.org ]. See the code below; I have used the URL you mentioned in your question and extracted the text.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) throws IOException {
        String url = "http://www.puzzlers.org/pub/wordlists/pocket.txt";
        // Fetch the URL and parse the response into a DOM
        Document doc = Jsoup.connect(url).get();
        // html() prints the body's inner HTML, so any nested tags are kept;
        // call text() instead to get the text with all tags removed
        System.out.println(doc.getElementsByTag("body").html());
    }
}

Afshin Moazami · Accepted Answer · 2013-12-30 20:59:45Z

0

Try removing the tags from each line with a regular expression:

str = str.replaceAll("\\<.*?>","");
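
For reference, here is a minimal sketch of how that replacement might fit into the word-counting loop from the question, using only the JDK. The class and method names (TagStripper, stripTags, countWords) are made up for illustration, the sample markup is a shortened version of the output in the question, and the handling of &nbsp; is an added assumption about what the asker wants.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Anything shaped like <...>; [^>]* also matches across line breaks.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Removes tag-like spans, turns &nbsp; into a space, and trims the result. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("").replace("&nbsp;", " ").trim();
    }

    /** Counts whitespace-separated words; blank text counts as zero words. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Shortened sample of the markup shown in the question (hypothetical input).
        String html = "<P STYLE=\"margin: 0\"><B>UNITED STATES SECURITIES AND EXCHANGE\n"
                + "COMMISSION</B></P>\n"
                + "<P STYLE=\"margin: 0\"><B>&nbsp;</B></P>";
        String text = stripTags(html);
        System.out.println(text);
        System.out.println("Word count is = " + countWords(text));
    }
}
```

Note that "".split("\\s+") returns a one-element array, which is why the loop in the question reports a word count of 1 for empty lines; the sketch checks for blank text first. A regular expression is not an HTML parser, so comments, scripts, attribute values containing ">" and entities other than &nbsp; are not handled here. For anything beyond simple markup, a parser such as jsoup is likely the safer choice.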
