Extracting certain text from html file

Question

I want to extract the text in an HTML file that is placed between paragraph (p) and link (a href) tags. I want to do it without Java regex or HTML parsers. I thought of this:

while ((word = reader.readLine()) !=null) { //iterate to the end of the file
    if(word.contains("<p>")) { //catching p tag
        while(!word.contains("</p>")) { //iterate to the end of that tag
            try { //start writing
                out.write(word);
            } catch (IOException e) {
            }
        }
    }
}

But it is not working. The code seems valid to me. How can the reader catch the "p" and "a href" tags?

1) Always handle your exceptions: never leave the catch block empty, or you won't know what went wrong inside the try. 2) Put in println's or use a debugger to test the state of your variables inside of your while loop. To treat a problem, first you must diagnose the cause. 3) For my money, I'd use an HTML parser like JSoup to make my life easier. Why re-invent the wheel with a solution that is almost guaranteed to be kludgy?
– Hovercraft Full Of Eels, Commented May 18, 2013 at 14:03

dratewka · Accepted Answer · 2013-05-18 14:29:35Z

3

The problems start when you have something like <p>blah</p> on a single line. One simple solution would be to change every < to \n<, something like this:

boolean insidePar = false;
String line;
while ((line = reader.readLine()) != null) {
    // Start a new chunk before every '<', then inspect each chunk on its own.
    for (String word : line.replaceAll("<", "\n<").split("\n")) {
        if (word.contains("<p>")) {
            insidePar = true;
        } else if (word.contains("</p>")) {
            insidePar = false;
        }
        if (insidePar) {
            // write the word
        }
    }
}

Still, I'd also recommend using a parser library, as @HovercraftFullOfEels suggested.

Edit: I've updated the code so it's a bit closer to a working version, but probably there will be more problems along the way.
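Note that replaceAll takes a regular expression, so the snippet above is not strictly regex-free. As a rough sketch of a regex-free and parser-free alternative, the whole file content can be scanned with indexOf and substring. The class name TagTextExtractor and the method extract are made up for illustration, and the sketch assumes the file has already been read into a single String. It does not handle nested tags, a '>' inside an attribute value, or variants such as <P> or <p class="...">.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTextExtractor {

    // Hypothetical helper: collects the text between each opening tag that
    // starts with openPrefix and the next occurrence of closeTag.
    static List<String> extract(String html, String openPrefix, String closeTag) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = html.indexOf(openPrefix, pos);
            if (open < 0) break;                      // no more opening tags
            int openEnd = html.indexOf('>', open);    // end of the opening tag
            if (openEnd < 0) break;                   // malformed: tag never closes
            int close = html.indexOf(closeTag, openEnd + 1);
            if (close < 0) break;                     // malformed: no closing tag
            result.add(html.substring(openEnd + 1, close).trim());
            pos = close + closeTag.length();          // continue after this element
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>first</p> <a href=\"x.html\">link</a> <p>second</p>";
        System.out.println(extract(html, "<p>", "</p>"));     // prints [first, second]
        System.out.println(extract(html, "<a href", "</a>")); // prints [link]
    }
}
```

Because the scan works on the whole content rather than line by line, it does not matter whether the opening and closing tags share a line.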

edited May 18, 2013 at 14:29

answered May 18, 2013 at 14:03


6 Comments

rena-c Over a year ago

Why does <p>blah</p> cause problems with that code? I don't understand.

dratewka Over a year ago

@cane-r if you have the opening and closing tags on the same line, then both conditions (word.contains("<p>") and word.contains("</p>")) are true, so out.write(word); never gets called.

dratewka Over a year ago

@cane-r Apart from that problem you need to place reader.readLine() somewhere in the inner loop - otherwise it will be writing out the same word over and over again till the world ends.
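A minimal sketch of that point, assuming the goal is only to stop the endless loop in the question's code. The class name ParagraphLines and the method copyParagraphLines are hypothetical, and the input is read from a String through a StringReader only to keep the example self-contained. Like the original, it writes the line containing <p> itself, and it still skips a paragraph whose opening and closing tags are on the same line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ParagraphLines {

    // Copies every line from one containing "<p>" up to, but not including,
    // the line containing "</p>".
    static String copyParagraphLines(String html) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(html))) {
            String word;
            while ((word = reader.readLine()) != null) {
                if (word.contains("<p>")) {
                    while (word != null && !word.contains("</p>")) {
                        out.append(word).append('\n');
                        word = reader.readLine(); // advance, or the loop never ends
                    }
                }
            }
        } catch (IOException e) {
            // Report the failure instead of leaving the catch block empty.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(copyParagraphLines("<p>\nhello\nworld\n</p>\n"));
    }
}
```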

dratewka Over a year ago

@cane-r I've edited the code so in my answer you can see what I mean.

informatik01 Over a year ago

One note: in Java there is no in keyword. The foreach loop is: for (type item : iterableCollection)


bmavus · Accepted Answer · 2013-05-18 14:07:54Z

0

I think using a library for this will be easier. Use jsoup: http://jsoup.org/. It can also parse HTML from a String.

answered May 18, 2013 at 14:07
