I am using
URL url = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt");
Scanner s = new Scanner(url.openStream());
to read my document. However, when I print the string, I get some unnecessary tags. My requirement is to read the document as it is (i.e., without any unnecessary tags).
Below is the piece of code I have written:
URL url = new URL("Link");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String str;
String[] wordArray;
int wordCount;
while ((str = in.readLine()) != null) {
    System.out.println(str);
    wordArray = str.split("\\s+");
    wordCount = wordArray.length;
    System.out.println("Word count is = " + wordCount);
}
in.close();
Below is the output I am getting. I do not want any of the unnecessary tags that you can see in it; I just want the actual text that appears between the tags. By unnecessary tags I mean <P>, <B>, etc., which you can see in the output snippet below. I just want text like 'UNITED STATES SECURITIES AND EXCHANGE' in my string.
Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"></P>
Word count is = 12
Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B>UNITED STATES SECURITIES AND EXCHANGE
Word count is = 16
COMMISSION</B></P>
Word count is = 1
Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B>Washington, D.C. 20549</B></P>
Word count is = 14
Word count is = 1
<P STYLE="font: 10pt/normal Arial, Helvetica, Sans-Serif; margin: 0; padding: 0; text-align: center"><B> </B></P>
Word count is = 12
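
To make the requirement concrete, here is a minimal sketch of the behaviour I am after, using only the JDK. The class name TagStripper and the methods stripTags and countWords are hypothetical names invented for this illustration, and the regex is a naive assumption about what a tag looks like; it would not cope with comments, scripts, or a '>' inside an attribute value, so a real HTML parser such as jsoup is probably the more robust route.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the code above.
final class TagStripper {
    // Naive assumption: a tag is '<', then any characters except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replaces every tag with a space, then collapses runs of whitespace
    // (including line breaks) into single spaces and trims the ends.
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    // Counts whitespace-separated words; empty text counts as zero words.
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Sample markup modelled on the output snippet above.
        String sample = "<P STYLE=\"text-align: center\"><B>UNITED STATES SECURITIES AND EXCHANGE\n"
                + "COMMISSION</B></P>";
        String text = stripTags(sample);
        System.out.println(text);
        System.out.println("Word count is = " + countWords(text));
    }
}
```

In this sketch the whole document would be read into one string before stripping, so text that is split across lines is counted together. The explicit empty check in countWords is there because "".split("\\s+") returns a one-element array, which appears to be why the blank lines above report a word count of 1.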