
I run a social network that requires Unicode usernames to be unique (as expected).

Some creative users have started using Cyrillic (and other) Unicode characters to create usernames that look identical to existing ones but are distinct at the code point level.

For example, they'll use the Cyrillic small letter a 'а' (U+0430), which looks identical to the Latin 'a' (U+0061).

Does anyone know of a way to convert these optically equivalent characters automatically in Java? I'd rather not have to create a conversion table by hand if a mechanism already exists.

  • stackoverflow.com/questions/2096667/… Commented Nov 24, 2013 at 2:08
  • This might depend on what font is used. Tough problem. Commented Nov 24, 2013 at 2:12
  • The referenced answer doesn't solve the problem at hand. The first answer simply removes diacritical marks and converts the remaining non-ASCII characters to '?'s. The second answer regarding Normalizer.Form.NFD does not affect the Cyrillic letter 'a' at all. Commented Nov 24, 2013 at 2:17
  • unicode.org/reports/tr39/#Confusable_Detection Commented May 12, 2014 at 19:44

2 Answers


You can try Unicode normalization. Unicode designates a canonical form for sequences it defines as equivalent (for example, a precomposed accented letter and the same letter followed by a combining mark), and normalization is the process of replacing each such sequence with its canonical form.

Java supports Unicode normalization via java.text.Normalizer.

However, I'm not sure that Latin a and Cyrillic а are marked as equivalent in Unicode; you'd have to try.
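A quick way to try it is sketched below. The class name `NormalizerCheck` and the helper `equalAfter` are made up for illustration; only `java.text.Normalizer` comes from the JDK. The check compares the two letters under all four normalization forms, and for contrast compares a precomposed é with e plus a combining acute accent, which Unicode does treat as canonically equivalent.

```java
import java.text.Normalizer;

class NormalizerCheck {
    // Hypothetical helper: true if both strings are equal after
    // normalizing each of them to the given form.
    static boolean equalAfter(String a, String b, Normalizer.Form form) {
        return Normalizer.normalize(a, form).equals(Normalizer.normalize(b, form));
    }

    public static void main(String[] args) {
        String latin = "a";          // U+0061 LATIN SMALL LETTER A
        String cyrillic = "\u0430";  // U+0430 CYRILLIC SMALL LETTER A

        // Try every form (NFD, NFC, NFKD, NFKC); none of them folds
        // the Cyrillic letter into the Latin one.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + ": " + equalAfter(latin, cyrillic, form));
        }

        // Contrast: precomposed U+00E9 vs. 'e' + U+0301 are canonically equivalent.
        System.out.println("e-acute: " + equalAfter("\u00e9", "e\u0301", Normalizer.Form.NFC));
    }
}
```

Normalization is still worth applying before a uniqueness check, since it removes the accent-composition variants, but it does not address cross-script lookalikes.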

This also won't help when your users start using characters that are very similar rather than identical. Humans are very inventive, and a technical solution might not work 100% here, so you will probably have to resort to human moderation anyway.

There are also some other solutions, such as limiting usernames to Latin alphanumerics.
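As a rough sketch of what such a restriction could look like: the class `UsernamePolicy` and its method names are hypothetical, and the second check (rejecting names that mix scripts) is a looser variant that is my assumption rather than something proposed above. It relies on `Character.UnicodeScript`, which is part of the JDK.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

class UsernamePolicy {
    // Strict policy: only ASCII Latin letters and digits, at least one character.
    static boolean isAsciiAlphanumeric(String name) {
        return name.matches("[A-Za-z0-9]+");
    }

    // Collect the scripts used in the name, ignoring COMMON (digits,
    // punctuation) and INHERITED (combining marks).
    static Set<UnicodeScript> scriptsOf(String name) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        name.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    // Looser policy: any script is allowed, but a single name may not mix scripts.
    static boolean isSingleScript(String name) {
        return scriptsOf(name).size() <= 1;
    }
}
```

The single-script check catches a Latin name with one Cyrillic letter swapped in, but a name written entirely in Cyrillic lookalikes still passes, so it narrows the problem rather than solving it.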


2 Comments

Yeah... I tried the Normalizer approach, and it looks like Latin a and Cyrillic а are not marked as equivalent. Looks like I may just have to build a conversion table by hand. Bummer.
@OnesAndZeroes Did you expect that they would be?

Why don't you try applying an OCR library?

2 Comments

Yeah, one could even run the OCR once, offline, and build the desired translation tables from the results, rather than doing the OCR analysis on the fly.
I considered writing something to compare the pixels between characters, but decided just to go through the Unicode tables by hand. The Cyrillic, Greek and Latin sets seemed to have the most offenders. It wasn't too bad in the end.
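For anyone taking the same route, a hand-built table could be wired in roughly like this. It is a minimal sketch: the class `ConfusableFolder`, the method `skeleton`, and the handful of entries are illustrative assumptions, not the table actually built here, and a real table would be far larger. `Map.of` requires Java 9 or later.

```java
import java.util.Map;

class ConfusableFolder {
    // Illustrative, deliberately tiny table: lookalike -> Latin letter.
    private static final Map<Character, Character> CONFUSABLES = Map.of(
        '\u0430', 'a',  // U+0430 CYRILLIC SMALL LETTER A
        '\u0435', 'e',  // U+0435 CYRILLIC SMALL LETTER IE
        '\u043e', 'o',  // U+043E CYRILLIC SMALL LETTER O
        '\u0440', 'p',  // U+0440 CYRILLIC SMALL LETTER ER
        '\u0441', 'c',  // U+0441 CYRILLIC SMALL LETTER ES
        '\u0443', 'y',  // U+0443 CYRILLIC SMALL LETTER U
        '\u0445', 'x',  // U+0445 CYRILLIC SMALL LETTER HA
        '\u03bf', 'o'   // U+03BF GREEK SMALL LETTER OMICRON
    );

    // Replace every character found in the table with its Latin
    // counterpart and leave all other characters unchanged.
    static String skeleton(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (char c : name.toCharArray()) {
            char mapped = CONFUSABLES.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }
}
```

The idea is to store the folded form alongside each username and enforce uniqueness on that column, so the user still sees the name as typed while a lookalike collides with the existing account.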
