java replace HTML_Escapecodes

Question

I need to develop a new method that replaces all umlauts (ä, ö, ü) in an input string with the corresponding HTML escape codes, and it has to be fast. According to statistics, only 5% of all input strings contain umlauts. Since the method is expected to be called very often, any unnecessary instantiation should be avoided. Could someone show me a way to do it?

So...what encoding are you using?

– Makoto, Commented Apr 14, 2014 at 18:33

Ghostkeeper · Accepted Answer · 2014-04-14 18:53:35Z


These are the HTML escape codes. Additionally, HTML features arbitrary escaping with numeric codes of the format &#nnnn; (decimal) and equivalently &#xhhhh; (hexadecimal).

A simple string-replace is not going to be efficient with so many strings to replace. I suggest you split the string by entity matches, such as this:

String[] parts = str.split("&(#[0-9]+|#x[A-Fa-f0-9]+|[A-Za-z]+);", -1); //Limit -1 keeps the empty trailing part when the string ends with an entity.
if(parts.length <= 1) return str; //No matched entities.

Then you can re-build the string with the replaced parts inserted.

StringBuilder result = new StringBuilder(str.length());
result.append(parts[0]); //First part always exists.
int pos = parts[0].length() + 1; //Skip past the first part and the ampersand.
for(int i = 1;i < parts.length;i++) {
    String entityName = str.substring(pos,str.indexOf(';',pos));
    if(entityName.matches("#x[A-Fa-f0-9]{1,4}")) {
        //Hexadecimal code such as &#xE4;. Longer codes fall through and are left unchanged.
        result.append((char)Integer.parseInt(entityName.substring(2),16));
    } else if(entityName.matches("#[0-9]{1,4}")) {
        //Decimal code such as &#228;. parseInt avoids the octal reading that Integer.decode gives to a leading zero.
        result.append((char)Integer.parseInt(entityName.substring(1)));
    } else {
        switch(entityName) {
            case "euml": result.append('ë'); break;
            case "auml": result.append('ä'); break;
            ...
            default: result.append("&" + entityName + ";"); //Unknown entity. Give the original string.
        }
    }
    result.append(parts[i]); //Append the text after the entity.
    pos += entityName.length() + parts[i].length() + 2; //Skip past the entity name, the semicolon, the following part and the next ampersand.
}
return result.toString();

Rather than copy-pasting this code, type it in your own project by hand. This gives you the opportunity to look at how the code actually works. I didn't run this code myself, so I can't guarantee it being correct. It can also be made slightly more efficient by pre-compiling the regular expressions.
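As a rough illustration of the pre-compiling remark, here is a minimal sketch that compiles the pattern once and walks the matches with a Matcher instead of calling split. The class name EntityDecoder and the method name decode are made up for this example, the pattern assumes the standard &#nnnn; and &#xhhhh; numeric forms, and only three named entities are listed. Treat it as a starting point under those assumptions, not a complete decoder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
public class EntityDecoder {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[0-9]{1,7}|#[xX][0-9A-Fa-f]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    public static String decode(String str) {
        if (str.indexOf('&') < 0) return str;      // Cheap exit: no ampersand, no entity.
        Matcher m = ENTITY.matcher(str);
        if (!m.find()) return str;                 // No matched entities.
        StringBuilder result = new StringBuilder(str.length());
        int last = 0;
        do {
            result.append(str, last, m.start());   // Text before the entity.
            String name = m.group(1);
            if (name.charAt(0) == '#') {
                boolean hex = name.charAt(1) == 'x' || name.charAt(1) == 'X';
                int cp = hex ? Integer.parseInt(name.substring(2), 16)
                             : Integer.parseInt(name.substring(1));
                if (Character.isValidCodePoint(cp)) result.appendCodePoint(cp);
                else result.append(m.group());     // Out of range: keep the original text.
            } else {
                switch (name) {
                    case "auml": result.append('\u00E4'); break; // ä
                    case "ouml": result.append('\u00F6'); break; // ö
                    case "uuml": result.append('\u00FC'); break; // ü
                    default: result.append(m.group());           // Unknown entity: keep it.
                }
            }
            last = m.end();
        } while (m.find());
        result.append(str, last, str.length());    // Text after the last entity.
        return result.toString();
    }
}
```

Because the Matcher reports the start and end of each match, there is no need to track the position by hand or to re-match the entity name with a second regular expression.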

edited Apr 14, 2014 at 18:53

answered Apr 14, 2014 at 18:46


2 Comments

tlq Over a year ago

Actually, isn't the question asking to replace umlauts (ä, ö, ü) with the HTML escape codes? As far as I can see, this method does the opposite.

Ghostkeeper Over a year ago

Oh, ah, yes, I see I read your question wrong. A similar technique can be used to do the reverse though, and it doesn't need to skip past the ampersands and semicolons or parse integers.
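For the direction the question actually asks about, a sketch along these lines might work. It assumes only the German umlauts need escaping (the upper-case variants are my addition; the question lists only ä, ö, ü), and the class name UmlautEscaper and the method name escape are invented for this example. The string is scanned once, and if no umlaut is found the original instance is returned, so the roughly 95% of inputs without umlauts cause no allocation at all.

```java
// Hypothetical helper class; the names are illustrative only.
public class UmlautEscaper {
    // Returns the entity for c, or null if c needs no escaping.
    // The returned values are string constants, so nothing is allocated here.
    private static String entityFor(char c) {
        switch (c) {
            case '\u00E4': return "&auml;"; // ä
            case '\u00F6': return "&ouml;"; // ö
            case '\u00FC': return "&uuml;"; // ü
            case '\u00C4': return "&Auml;"; // Ä (assumed to be wanted as well)
            case '\u00D6': return "&Ouml;"; // Ö
            case '\u00DC': return "&Uuml;"; // Ü
            default: return null;
        }
    }

    public static String escape(String str) {
        final int len = str.length();
        int i = 0;
        // Find the first umlaut, if any, without allocating.
        while (i < len && entityFor(str.charAt(i)) == null) i++;
        if (i == len) return str;                  // Common case: nothing to replace.

        StringBuilder result = new StringBuilder(len + 16);
        result.append(str, 0, i);                  // Copy the untouched prefix in one go.
        for (; i < len; i++) {
            char c = str.charAt(i);
            String entity = entityFor(c);
            if (entity == null) result.append(c);
            else result.append(entity);
        }
        return result.toString();
    }
}
```

For example, UmlautEscaper.escape("Müller") gives "M&uuml;ller", while UmlautEscaper.escape("Miller") returns the very same String object that was passed in.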
