Convert optically equivalent unicode strings to ASCII in Java?

Question

I run a social network that requires Unicode usernames to be unique (as expected).

Some creative users have started using Cyrillic (and other) Unicode characters to create optically equivalent (but distinct in Unicode) usernames.

For example, they'll use the Cyrillic small letter a 'а', which looks identical to the Latin one.

Does anyone know of a way to convert these optically equivalent characters automatically in Java? I'd rather not have to create a conversion table by hand if a mechanism already exists.

The referenced answer doesn't solve the problem at hand. The first answer simply removes diacritical marks and converts the remaining non-ASCII characters to '?'s. The second answer regarding Normalizer.Form.NFD does not affect the Cyrillic letter 'a' at all.
– OnesAndZeroes, Commented Nov 24, 2013 at 2:17

Jakub Wasilewski · Accepted Answer · 2013-11-24 02:13:30Z

1

You can try Unicode normalization. Basically, code points that Unicode treats as equivalent have a 'canonical' form designated, and normalization is the process of replacing each character with its canonical form.

Java supports Unicode normalization via java.text.Normalizer.

However, I'm not sure that Latin a and Cyrillic a are marked as equivalent in Unicode, so you'd have to try.
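A quick way to try it is a small check like the one below. This is only a minimal sketch: the class name NormalizerCheck and the helper nfkc are made up for illustration, and NFKC is chosen here because it is the most aggressive of the standard forms. It compares the Cyrillic letter against a fullwidth Latin 'a', which normalization does fold to ASCII.

```java
import java.text.Normalizer;

public class NormalizerCheck {
    // Hypothetical helper: applies NFKC (compatibility decomposition, then composition).
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String cyrillicA = "\u0430";  // CYRILLIC SMALL LETTER A
        String fullwidthA = "\uFF41"; // FULLWIDTH LATIN SMALL LETTER A

        // The Cyrillic letter is left untouched by normalization.
        System.out.println(nfkc(cyrillicA).equals("a"));  // false
        // The fullwidth letter has a compatibility mapping to ASCII 'a'.
        System.out.println(nfkc(fullwidthA).equals("a")); // true
    }
}
```

So normalization is still worth applying for compatibility variants such as fullwidth forms, but it does not treat look-alike letters from different scripts as the same character.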

This will also not help you when your users start using characters that are very similar rather than identical. Humans are very inventive, and a technical solution might not work 100% here, so you will probably have to resort to human moderation anyway.

There are also some other solutions, such as limiting the usernames to Latin alphanumerics.
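As a rough sketch of that last option, a whitelist check could look like the following. The class name UsernamePolicy and the method isAllowed are hypothetical, and the exact character set (here ASCII letters and digits only) is an assumption you would adjust to your own rules.

```java
import java.util.regex.Pattern;

public class UsernamePolicy {
    // Assumed policy: one or more ASCII letters or digits, nothing else.
    private static final Pattern LATIN_ALNUM = Pattern.compile("[A-Za-z0-9]+");

    // Returns true only if every character is an ASCII letter or digit.
    static boolean isAllowed(String username) {
        return username != null && LATIN_ALNUM.matcher(username).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("admin"));       // true
        System.out.println(isAllowed("\u0430dmin"));  // false: first letter is Cyrillic
    }
}
```

This sidesteps the look-alike problem entirely, at the cost of rejecting legitimate non-Latin names.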


2 Comments

OnesAndZeroes Over a year ago

Yeah... I tried the Normalizer approach, and it looks like Latin a and Cyrillic a are not marked as equivalent. Looks like I may just have to build a conversion table by hand. Bummer.

Andyz Smith Over a year ago

@OnesAndZeroes Did you expect that they would be?

Andyz Smith · Accepted Answer · 2013-11-24 02:49:08Z

1

Why don't you try applying an OCR library?


2 Comments

Hot Licks Over a year ago

Yeah, one could even perform the OCR statically and build up the desired translation tables, rather than having to do the OCR analysis on the fly.

OnesAndZeroes Over a year ago

I considered writing something to compare the pixels between characters, but decided just to go through the Unicode tables by hand. The Cyrillic, Greek and Latin sets seemed to have the most offenders. It wasn't too bad in the end.
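For anyone wanting a starting point, a hand-built table of this kind might be sketched as follows. This is not the table described in the comment above: the class name ConfusableFolder, the method fold, and the handful of entries are illustrative assumptions only, and a real table would need many more mappings. The sketch also only handles characters in the Basic Multilingual Plane.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfusableFolder {
    // Illustrative, deliberately incomplete table of look-alike characters.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0430', 'a'); // CYRILLIC SMALL LETTER A
        TABLE.put('\u0435', 'e'); // CYRILLIC SMALL LETTER IE
        TABLE.put('\u043E', 'o'); // CYRILLIC SMALL LETTER O
        TABLE.put('\u0440', 'p'); // CYRILLIC SMALL LETTER ER
        TABLE.put('\u0441', 'c'); // CYRILLIC SMALL LETTER ES
        TABLE.put('\u0445', 'x'); // CYRILLIC SMALL LETTER HA
        TABLE.put('\u03BF', 'o'); // GREEK SMALL LETTER OMICRON
        TABLE.put('\u0391', 'A'); // GREEK CAPITAL LETTER ALPHA
    }

    // Replaces each character found in the table with its ASCII look-alike;
    // characters not in the table are copied through unchanged.
    static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Cyrillic 'а' followed by ASCII "dmin" folds to plain "admin".
        System.out.println(fold("\u0430dmin").equals("admin")); // true
    }
}
```

One way to use it is to keep the username as typed for display, and enforce uniqueness on the folded form, so that two names that fold to the same string cannot both be registered.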
