0

I am using

URL url = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt"); 
Scanner s = new Scanner(url.openStream());

to read my document, but when I print the string I get some unnecessary tags. I need to read the document as it is, i.e. without any unnecessary tags.

Below is the piece of code I have written:

URL url = new URL("Link");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String str;
String[] wordArray;
int wordCount;

while ((str = in.readLine()) != null) {
System.out.println(str);
wordArray = str.split("\\s+");
wordCount = wordArray.length;
System.out.println("Word count is = " + wordCount);
}

in.close();

Below is the output I am getting. I do not want the unnecessary tags you can see in it; I just want the actual text between the tags. By unnecessary tags I mean the HTML markup (the P and B tags and their attributes) that appears in the output snippet below. I just want text like 'UNITED STATES SECURITIES AND EXCHANGE' in my string.

Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"></P>
Word count is = 12

Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B>UNITED STATES SECURITIES AND EXCHANGE
Word count is = 16
COMMISSION</B></P>
Word count is = 1

Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B>Washington, D.C. 20549</B></P>
Word count is = 14

Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B>&nbsp;</B></P>
Word count is = 12
  • Which unnecessary tags are you getting? Commented Dec 30, 2013 at 20:56
  • Define "unnecessary tags"... Commented Dec 30, 2013 at 20:56
  • What is the output? Please post the first 5 lines. Commented Dec 30, 2013 at 20:58
  • Possible duplicate of stackoverflow.com/questions/240546/… Commented Dec 30, 2013 at 20:59

4 Answers

5

What you are getting back is the HTML source of the page. If you are only interested in, for example, the contents of the body tag, you can use jsoup to extract it and also remove all the tags inside the body.


2

jsoup can do this. For example, to print the text of every <p> element on a page:

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ParagraphText {
        public static void main(String[] args) {
            String url = "http://www.google.com";
            Document doc;
            try {
                // Fetch the page and parse it into a DOM
                doc = Jsoup.connect(url).get();
            } catch (IOException e) {
                e.printStackTrace();
                return;
            }

            // Print the text of every <p> element, with all tags stripped
            for (Element p : doc.getElementsByTag("p")) {
                System.out.println(p.text());
            }
        }
    }


1

You can use the jsoup library for this job [ http://www.jsoup.org ]. In the code below I have used the URL you mentioned in your question and extracted the text.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) throws IOException {

        String url = "http://www.puzzlers.org/pub/wordlists/pocket.txt";
        // Fetch the URL and parse the response into a DOM
        Document doc = Jsoup.connect(url).get();
        // html() returns the inner HTML of <body>, including any nested tags;
        // call text() instead to get only the text with all tags removed
        System.out.println(doc.getElementsByTag("body").html());
    }
}


0

Try removing the tags with a regular expression:

str = str.replaceAll("\\<.*?>","");
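For illustration, here is a minimal sketch of how the regex approach could be applied to the whole document instead of line by line, using only the JDK. The class name `TagStripper` and the helpers `stripTags` and `countWords` are hypothetical names chosen for this example, and the sample HTML is a shortened version of the output in the question. It also handles two things the one-liner does not: the `&nbsp;` entity, and the fact that `"".split("\\s+")` returns an array of length 1, which is probably why empty lines were reported as one word.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Matches anything that looks like a tag: '<', then any characters up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Replaces tag-like spans and &nbsp; with spaces, then collapses whitespace. */
    static String stripTags(String html) {
        String text = TAG.matcher(html).replaceAll(" ");
        text = text.replace("&nbsp;", " ");
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Counts whitespace-separated words; an empty string has zero words. */
    static int countWords(String text) {
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Shortened sample of the markup shown in the question (an assumption for this demo).
        String html = "<P STYLE=\"text-align: center\"><B>UNITED STATES SECURITIES AND EXCHANGE\n"
                + "COMMISSION</B></P>\n"
                + "<P><B>Washington, D.C. 20549</B></P>\n"
                + "<P><B>&nbsp;</B></P>";
        String text = stripTags(html);
        System.out.println(text);
        System.out.println("Word count is = " + countWords(text));
    }
}
```

A regular expression like this is fragile on real-world HTML (comments, script blocks, a `>` inside an attribute value, other entities), so a real parser such as jsoup is usually the safer choice; the sketch only covers simple markup like the sample above.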
