6

I want a method in the following format:

public boolean isValidHtmlEscapeCode(String string);

Usage would be:

isValidHtmlEscapeCode("A") == false
isValidHtmlEscapeCode("&#1513;") == true // Valid unicode character
isValidHtmlEscapeCode("&#x5E9;") == true // same as 1513 but in HEX
isValidHtmlEscapeCode("�") == false // Invalid unicode character

I wasn't able to find anything that does that - is there any utility that does that? If not, is there any smart way to do it?

4
  • What about &amp;, &auml; and &customEntity;? Commented Dec 20, 2012 at 15:22
  • I don't mind a function that deals with those - but it is not my requirement (in other words - I'm impartial regarding it) Commented Dec 20, 2012 at 15:25
  • Why can't you just check if it starts with &, ends with ; and the middle portion consists of (i) a-z, 0-9 (ii) # followed by digits (iii) #x followed by hex digits? Commented Dec 20, 2012 at 15:43
  • @SalmanA I was hoping for a smarter way to do it - I don't like reinventing wheels Commented Dec 20, 2012 at 15:49

5 Answers

3
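import java.util.regex.Matcher;
import java.util.regex.Pattern;

// namedEntities: a Set<String> of known entity names, defined elsewhere.
public static boolean isValidHtmlEscapeCode(String string) {
    if (string == null) {
        return false;
    }
    // Group 1: hex digits, group 2: decimal digits, group 3: entity name.
    Pattern p = Pattern
            .compile("&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([0-9A-Za-z]+));");
    Matcher m = p.matcher(string);

    // Note: find() accepts an entity anywhere in the input.
    if (m.find()) {
        int codePoint = -1;
        String entity = null;
        try {
            if ((entity = m.group(1)) != null) {
                // Reject overly long input before parsing; 0x10FFFF has 6 hex digits.
                if (entity.length() > 6) {
                    return false;
                }
                codePoint = Integer.parseInt(entity, 16);
            } else if ((entity = m.group(2)) != null) {
                // Reject overly long input before parsing; 1114111 has 7 decimal digits.
                if (entity.length() > 7) {
                    return false;
                }
                codePoint = Integer.parseInt(entity, 10);
            } else if ((entity = m.group(3)) != null) {
                return namedEntities.contains(entity);
            }
            // Any code point up to 0x10FFFF except the surrogate range 0xD800-0xDFFF.
            return 0x00 <= codePoint && codePoint < 0xd800
                    || 0xdfff < codePoint && codePoint <= 0x10FFFF;
        } catch (NumberFormatException e) {
            return false;
        }
    } else {
        return false;
    }
}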

Here's the set of named entities http://pastebin.com/XzzMYDjF


3 Comments

Thanks, this was the best solution for my issue - I slightly modified it by adding a ^ at the beginning of the pattern and a $ at the end so for strings like hello &#123; world it will not return true
@RonK cheers, yeah I use this regex for extracting html entities and forgot to modify it ^^
named entities paste bin needs updating..could you update please
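
The anchoring issue raised in the comments above can be illustrated with a small sketch. It reuses the answer's expression unchanged; the class and method names (EntityAnchoring, containsEntity, isWholeEntity) are made up for illustration. Using matches() instead of find() has much the same effect as adding ^ and $ to the pattern. Note that this only checks the syntax, not the code point range or the entity name.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class EntityAnchoring {
    // Same expression as in the answer, compiled once and reused.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([0-9A-Za-z]+));");

    // find() succeeds if an entity occurs anywhere in the input.
    static boolean containsEntity(String s) {
        return s != null && ENTITY.matcher(s).find();
    }

    // matches() succeeds only if the whole input is a single entity.
    static boolean isWholeEntity(String s) {
        return s != null && ENTITY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsEntity("hello &#123; world")); // true
        System.out.println(isWholeEntity("hello &#123; world"));  // false
        System.out.println(isWholeEntity("&#123;"));              // true
    }
}
```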
2

You might want to have a look at StringEscapeUtils from Apache Commons Lang: http://commons.apache.org/lang/api-2.3/org/apache/commons/lang/StringEscapeUtils.html#unescapeHtml(java.lang.String)

With unescapeHtml you could do something like:

String input = "A";
String unescaped = StringEscapeUtils.unescapeHtml(input);
// If unescaping changed the string, it contained at least one recognized escape.
boolean containsValidEscape = !input.equals(unescaped);


2

Not sure if this is a perfect solution, but you can use Apache Commons Lang:

try {
    return StringEscapeUtils.unescapeHtml4(code).length() < code.length();
} catch (IllegalArgumentException e) {
    return false;
}

3 Comments

StringEscapeUtils doesn't handle anything in the &#xxx; format. So basing my code on it will not work
It does indeed, unescapeHtml4 contains NumericEntityUnescaper, so it should handle them
looks like it throws IllegalArgumentException when you pass invalid entity, so I've updated my solution a bit
1

This should be the method you wanted:

public static boolean isValidHtmlEscapeCode(String string) {
    String temp = "";
    try {
        temp = StringEscapeUtils.unescapeHtml3(string);
    } catch (IllegalArgumentException e) {
        return false;
    }
    return !string.equals(temp);
}


0

Try matching using a regular expression:

public boolean isValidHtmlEscapeCode(String string) {
    return string.matches("&#([0-9]{1,4}|x[0-9a-fA-F]{1,4});");
}

Or, to save some processing cycles, you can compile the regex once and reuse it for multiple comparisons:

Pattern pattern = Pattern.compile("&#([0-9]{1,4}|x[0-9a-fA-F]{1,4});");

public boolean isValidHtmlEscapeCode(String string) {
    // Pattern has no instance matches(String); go through a Matcher.
    return pattern.matcher(string).matches();
}

The source of the regex can be found at RexLib.com

5 Comments

Not bad - but isValidHtmlEscapeCode("&#99999;") will return true
@RonK the regex has been changed to correct the length constraint.
@RonK &#99999; is valid... in decimal, up to &#1114111; is valid.
@Esailija - thanks, so length validation is not enough - the value of the number also needs to be validated.
@RonK among other things, such as character references that would amount to lone surrogates, which I don't think htmlUnescape accounts for. Even Chrome and Firefox inconsistently treat &#xd801; as a valid entity, transforming it to &#xFFFD;
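
The points made in these comments (values up to &#1114111; are valid, lone surrogates are not) can be sketched as a value check rather than a length check. This is only a rough illustration, not any library's behaviour: the class and method names are hypothetical, it handles numeric references only, it accepts only a lowercase x prefix as the regexes in this thread do, and like the accepted answer it lets &#0; through.

```java
// Hypothetical helper class, for illustration only.
final class NumericReference {
    // Returns true if s is a decimal (&#NNN;) or hexadecimal (&#xHHH;)
    // reference whose value is at most 0x10FFFF and not a surrogate.
    static boolean isValidNumericReference(String s) {
        if (s == null || !s.startsWith("&#") || !s.endsWith(";")) {
            return false;
        }
        String body = s.substring(2, s.length() - 1);
        int radix = 10;
        if (body.startsWith("x")) {
            radix = 16;
            body = body.substring(1);
        }
        if (body.isEmpty()) {
            return false;
        }
        int codePoint = 0;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            int digit;
            if (c >= '0' && c <= '9') {
                digit = c - '0';
            } else if (radix == 16 && c >= 'a' && c <= 'f') {
                digit = c - 'a' + 10;
            } else if (radix == 16 && c >= 'A' && c <= 'F') {
                digit = c - 'A' + 10;
            } else {
                return false; // signs, spaces and other characters are rejected
            }
            codePoint = codePoint * radix + digit;
            // Stopping here also keeps the accumulator from overflowing.
            if (codePoint > Character.MAX_CODE_POINT) {
                return false;
            }
        }
        // Exclude the surrogate range 0xD800-0xDFFF.
        return codePoint < Character.MIN_SURROGATE
                || codePoint > Character.MAX_SURROGATE;
    }

    public static void main(String[] args) {
        System.out.println(isValidNumericReference("&#99999;"));   // true
        System.out.println(isValidNumericReference("&#1114111;")); // true
        System.out.println(isValidNumericReference("&#1114112;")); // false
        System.out.println(isValidNumericReference("&#xd801;"));   // false
    }
}
```

Parsing the digits by hand avoids Integer.parseInt, which would also accept a leading sign such as "+65".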
