For a small university project, I need to extract code samples from HTML given as a string. To be more precise, I need to get everything between <code> and </code> in that HTML string.

I'm writing in Java and using String.matches to do that.

My code:

public static ArrayList<String> extractByHTMLtagDelimiters(String source, String startDelimiter, String endDelimiter) {
    ArrayList<String> results = new ArrayList<String>();
    if (source.matches("([\t\n\r]|.)*" + startDelimiter + "([\t\n\r]|.)*" + endDelimiter)) {
        // source has some code samples in it
        // get array entries of the form: {some code}endDelimiter{something else}
        String[] splittedSource = source.split(startDelimiter);
        for (String sourceMatch : splittedSource) {
            if (sourceMatch.matches("([\t\n\r]|.)*" + endDelimiter + "([\t\n\r]|.)*")) {
                // current string has a code sample in it (with some body leftovers)
                // the code sample is located before the endDelimiter - extract it
                String codeSample = (sourceMatch.split(endDelimiter))[0];
                // add the code sample to results
                results.add(codeSample);
            }
        }
    }
    return results;
}

I tried to extract the samples from an HTML string of ~1300 chars and got a pretty massive exception (it goes on like this for a few dozen lines):

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)

I found the following bug report: https://bugs.java.com/bugdatabase/view_bug?bug_id=5050507

Is there anything I can do to keep using String.matches? If not, can you please recommend another way to do it without implementing HTML parsing myself?

Thanks a lot, Dub.

  • See "What HTML parsing libraries do you recommend in Java". Commented Apr 1, 2011 at 20:08
  • @khachik, if you bothered to look at the bug, you would realize it was closed as "Will not fix", as it's pretty fundamental to the way the regex library was written. So upgrading won't make any difference. Commented Apr 1, 2011 at 20:09
  • @Matthew: you are right. Commented Apr 1, 2011 at 20:10
  • I'm using the newest Java (I think; I updated a few months ago). I just mentioned that I came across this problem on the web, and it looks like it still exists in my Java version. Commented Apr 1, 2011 at 20:11
  • Don't use regex to parse HTML :) Commented Apr 1, 2011 at 20:19

2 Answers

You can just walk through the input string manually, using String's indexOf() method to find the start and end delimiters and extract the bits in between yourself.

public static void main(String[] args) {
    String source = "<html>blah<code>this is awesome</code>more junk</html>";

    String startDelim = "<code>";
    String endDelim = "</code>";
    int start = source.indexOf(startDelim);
    int end = source.indexOf(endDelim);

    String code = source.substring(start + startDelim.length(), end);
    System.out.println(code);
}

If you need to find more than one, just call indexOf again, starting from the point where you finished:

int nextStart = source.indexOf(startDelim, end + endDelim.length());
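
Putting the two snippets together, a minimal sketch of the full loop might look like the following. The class name IndexOfExtractor and the method name extractBetween are made up for illustration, and the choice to stop at an opening delimiter that has no matching close is an assumption, not something the answer specifies.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfExtractor {

    // Collects every substring found between startDelim and endDelim.
    // (Hypothetical helper; name and signature are illustrative only.)
    static List<String> extractBetween(String source, String startDelim, String endDelim) {
        List<String> results = new ArrayList<>();
        int start = source.indexOf(startDelim);
        while (start != -1) {
            int contentStart = start + startDelim.length();
            int end = source.indexOf(endDelim, contentStart);
            if (end == -1) {
                // Assumption: an unclosed opening delimiter ends the scan.
                break;
            }
            results.add(source.substring(contentStart, end));
            // Resume the search just past the closing delimiter.
            start = source.indexOf(startDelim, end + endDelim.length());
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<html>blah<code>first</code>junk<code>second</code></html>";
        for (String sample : extractBetween(source, "<code>", "</code>")) {
            System.out.println(sample);
        }
    }
}
```

Since indexOf does a plain substring search, line breaks inside the code sample need no special handling, and no regex is involved at all.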
1 Comment

Thanks, it did the job! somehow i always forget that the simplest solution might be the best.
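Simply replace each "([\t\n\r]|.)*" in your regex with "(?s).*"

With the (?s) flag, "." matches any character, including line terminators, which is what you intended. It also gets rid of the alternation inside a repeated group, which java.util.regex evaluates recursively (once per matched character); that recursion is what overflows the stack on longer inputs.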
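
If you want to stay with regex, a variant of this idea is to let a Matcher pull the samples out directly instead of combining matches and split. This is only a sketch and goes beyond what the answer proposes: it uses Pattern.DOTALL (the same thing as the inline (?s) flag), a reluctant ".*?" so each match stops at the first closing delimiter, and Pattern.quote so the delimiters are treated as literal text. The class name RegexExtractor and the method name extractBetween are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractor {

    // Returns the text captured between each startDelim/endDelim pair.
    // (Hypothetical helper; name and signature are illustrative only.)
    static List<String> extractBetween(String source, String startDelim, String endDelim) {
        // DOTALL lets "." match line terminators; quote() escapes regex metacharacters.
        Pattern pattern = Pattern.compile(
                Pattern.quote(startDelim) + "(.*?)" + Pattern.quote(endDelim),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(source);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1)); // group 1 is the text between the delimiters
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<html>blah<code>line one\nline two</code>junk<code>x</code></html>";
        for (String sample : extractBetween(source, "<code>", "</code>")) {
            System.out.println(sample);
        }
    }
}
```

As the comments on the question point out, this still only handles the simple case of literal <code> and </code> tags; for anything more involved, an HTML parsing library is the safer choice.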

1 Comment

Personally, I prefer the non-regex solution from wolfcastle.
