I want to extract the text from an HTML file that sits between paragraph (p) and link (a href) tags. I want to do it without Java regex or HTML parsers. I thought of this:

while ((word = reader.readLine()) !=null) { //iterate to the end of the file
    if(word.contains("<p>")) { //catching p tag
        while(!word.contains("</p>")) { //iterate to the end of that tag
            try { //start writing
                out.write(word);
            } catch (IOException e) {
            }
        }
    }
}

But it is not working. The code seems pretty valid to me. How can the reader catch the "p" and "a href" tags?

    1) Always handle your exceptions; never leave that catch block empty, or who knows what could be going wrong in the try. 2) Put in println's or use a debugger to test the state of your variables inside of your while loop. To treat a problem, first you must diagnose the cause. 3) For my money, I'd use an HTML parser like JSoup to make my life easier. Why re-invent the wheel with a solution that is almost guaranteed to be kludgy? Commented May 18, 2013 at 14:03

2 Answers

The problems start when you have something like <p>blah</p> on a single line. One simple solution is to replace every < with \n<, so that each tag starts its own piece of the line. Something like this:

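boolean insidePar = false;
String line;
while ((line = reader.readLine()) != null) {
    // Break the line so that every tag starts its own piece, then check each piece
    for (String word : line.replaceAll("<", "\n<").split("\n")) {
        if (word.contains("<p>")) {
            insidePar = true;   // entered a paragraph
        } else if (word.contains("</p>")) {
            insidePar = false;  // left the paragraph
        }
        if (insidePar) {
            // write the word
        }
    }
}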

Still, I'd also recommend using a parser library, as @HovercraftFullOfEels suggested.

Edit: I've updated the code so it's a bit closer to a working version, but probably there will be more problems along the way.
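
Note that String.replaceAll and String.split both take regular expressions, so the snippet above does not strictly meet the "no regex" constraint from the question. A minimal sketch that relies only on String.indexOf might look like the following. The class name TagTextExtractor and the method textBetween are hypothetical names chosen for illustration. The sketch assumes lowercase tag names, no nested tags of the same kind, and no > character inside attribute values; any inner markup (for example a link inside a paragraph) is returned as-is rather than stripped.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTextExtractor {

    /**
     * Collects the text between each opening tag (with or without attributes)
     * and its closing tag, e.g. tag "p" for paragraphs or "a" for links.
     * Hypothetical helper; not a full HTML parser.
     */
    public static List<String> textBetween(String html, String tag) {
        List<String> result = new ArrayList<>();
        String open = "<" + tag;
        String close = "</" + tag + ">";
        int pos = 0;
        while (true) {
            int start = html.indexOf(open, pos);
            if (start < 0) {
                break; // no more opening tags
            }
            int afterName = start + open.length();
            if (afterName >= html.length()) {
                break; // input ends right after the tag name
            }
            char c = html.charAt(afterName);
            // Skip lookalikes such as <pre> when searching for <p>
            if (c != '>' && !Character.isWhitespace(c)) {
                pos = afterName;
                continue;
            }
            int contentStart = html.indexOf('>', afterName);
            if (contentStart < 0) {
                break; // opening tag is never closed
            }
            contentStart++; // move past the '>'
            int end = html.indexOf(close, contentStart);
            if (end < 0) {
                break; // no matching closing tag
            }
            result.add(html.substring(contentStart, end).trim());
            pos = end + close.length();
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input: tags on one line, and a paragraph spanning two lines
        String html = "<p>first</p>\n<p>second\nline</p> <a href=\"http://example.com\">a link</a>";
        System.out.println(textBetween(html, "p"));
        System.out.println(textBetween(html, "a"));
    }
}
```

Because the search runs over one String rather than line by line, it does not matter whether the opening and closing tags share a line. For a real file, that means reading the whole content into a single String first, for example with Files.readString on Java 11 or later.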

6 Comments

Why does <p>blah</p> cause problems with that code? I don't understand.
@cane-r If the opening and closing tags are on the same line, then both conditions (word.contains("<p>") and word.contains("</p>")) are true, so out.write(word); never gets called.
@cane-r Apart from that problem you need to place reader.readLine() somewhere in the inner loop - otherwise it will be writing out the same word over and over again till the world ends.
@cane-r I've edited the code so in my answer you can see what I mean.
One note: Java has no in keyword. The for-each loop is written as: for (Type item : iterableCollection)
I think using a library for this will be easier. Use jsoup: http://jsoup.org/ . It can also parse HTML from a String.
