I need to develop a new method that replaces all umlauts (ä, ö, ü) in an input string with the corresponding HTML escape codes, and it has to be fast. According to statistics, only 5% of all input strings contain umlauts. Since the method is expected to be used extensively, any unnecessary instantiation should be avoided. Could someone show me a way to do it?

  • So...what encoding are you using? Commented Apr 14, 2014 at 18:33

1 Answer

These are the HTML escape codes. Additionally, HTML features arbitrary escaping with numeric codes of the format &#DDDD; (decimal) and, equivalently, &#xHHHH; (hexadecimal).

A simple string-replace is not going to be efficient with so many strings to replace. I suggest you split the string by entity matches, such as this:

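// Numeric references start with '#'. The limit -1 keeps a trailing empty part,
// so an entity at the very end of the string is not silently dropped.
String[] parts = str.split("&(#[0-9]+|#[xX][A-Fa-f0-9]+|[A-Za-z]+);", -1);
if (parts.length <= 1) return str; // No matched entities.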

Then you can re-build the string with the replaced parts inserted.

StringBuilder result = new StringBuilder(str.length());
result.append(parts[0]); // First part always exists.
int pos = parts[0].length() + 1; // Index just past the first ampersand.
for (int i = 1; i < parts.length; i++) {
    String entityName = str.substring(pos, str.indexOf(';', pos));
    if (entityName.matches("#[xX][A-Fa-f0-9]{1,4}")) {
        // Hexadecimal reference: parse the digits after "#x" in base 16.
        result.appendCodePoint(Integer.parseInt(entityName.substring(2), 16));
    } else if (entityName.matches("#[0-9]{1,5}")) {
        // Decimal reference: parseInt does not treat a leading zero as octal, unlike decode.
        result.appendCodePoint(Integer.parseInt(entityName.substring(1)));
    } else {
        switch (entityName) { // Switching on strings requires Java 7 or later.
            case "euml": result.append('ë'); break;
            case "auml": result.append('ä'); break;
            ...
            default: result.append("&" + entityName + ";"); // Unknown entity. Give the original string.
        }
    }
    result.append(parts[i]); // Append the text after the entity.
    pos += entityName.length() + parts[i].length() + 2; // Skip past the entity name, the semicolon, the following part and the next ampersand.
}
return result.toString();

Rather than copy-pasting this code, type it into your own project by hand. This gives you the opportunity to look at how the code actually works. I didn't run this code myself, so I can't guarantee that it is correct. It can also be made slightly more efficient by pre-compiling the regular expressions.
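As a rough illustration of the pre-compilation idea, here is a minimal sketch that compiles the entity pattern once and walks the matches with a Matcher instead of calling split. The class name EntityDecoder and the method name decode are made up for this example, the pattern assumes numeric references are written with a leading '#', and only two named entities are handled. Treat it as a starting point under those assumptions, not as a tested implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Compiled once and reused for every call; Pattern objects are immutable and thread-safe.
    // Assumption: decimal references have at most 5 digits, hexadecimal ones at most 4.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,5}|#[xX][A-Fa-f0-9]{1,4}|[A-Za-z]+);");

    public static String decode(String str) {
        Matcher m = ENTITY.matcher(str);
        if (!m.find()) {
            return str; // No entity found: return the input unchanged.
        }
        StringBuilder result = new StringBuilder(str.length());
        int last = 0; // End of the previous match.
        do {
            result.append(str, last, m.start()); // Text between two entities.
            String name = m.group(1);
            if (name.startsWith("#x") || name.startsWith("#X")) {
                result.appendCodePoint(Integer.parseInt(name.substring(2), 16));
            } else if (name.startsWith("#")) {
                result.appendCodePoint(Integer.parseInt(name.substring(1)));
            } else {
                switch (name) {
                    case "auml": result.append('\u00e4'); break; // ä
                    case "euml": result.append('\u00eb'); break; // ë
                    default: result.append(m.group()); // Unknown entity: keep the original text.
                }
            }
            last = m.end();
        } while (m.find());
        result.append(str, last, str.length()); // Text after the last entity.
        return result.toString();
    }
}
```

Because the Matcher reports the start and end of every match, this variant does not need the manual position arithmetic used with split.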


2 Comments

Actually, isn't the question asking to replace umlauts (ä, ö, ü) with the HTML escape codes? As far as I can see, that method does the opposite.
Oh, ah, yes, I see I read your question wrong. A similar technique can be used to do the reverse though, and it doesn't need to skip past the ampersands and semicolons or parse integers.
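For the direction the question actually asks about, a minimal sketch could look like the following. It is only an illustration: the class name UmlautEscaper and the method names are invented here, and it covers just the three lowercase umlauts named in the question. The idea is to scan for the first umlaut without allocating anything, so that the roughly 95% of strings without umlauts are returned as-is, and to create a StringBuilder only when a replacement is really needed.

```java
public class UmlautEscaper {

    public static String escapeUmlauts(String str) {
        int first = indexOfUmlaut(str);
        if (first < 0) {
            return str; // Common case: nothing to replace, nothing allocated.
        }
        // Each replacement adds 5 characters; 16 extra is only a guess at typical growth.
        StringBuilder result = new StringBuilder(str.length() + 16);
        result.append(str, 0, first); // Copy the umlaut-free prefix in one go.
        for (int i = first; i < str.length(); i++) {
            char c = str.charAt(i);
            switch (c) {
                case '\u00e4': result.append("&auml;"); break; // ä
                case '\u00f6': result.append("&ouml;"); break; // ö
                case '\u00fc': result.append("&uuml;"); break; // ü
                default: result.append(c);
            }
        }
        return result.toString();
    }

    // Returns the index of the first ä, ö or ü, or -1 if there is none.
    private static int indexOfUmlaut(String str) {
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == '\u00e4' || c == '\u00f6' || c == '\u00fc') {
                return i;
            }
        }
        return -1;
    }
}
```

Uppercase umlauts or ß could be handled by adding more cases to both the switch and the scan. No regular expressions are involved, so there is no Pattern or Matcher to instantiate either.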
