Convert Unicode to ASCII without changing the string length (in Java)

Question

What is the best way to convert a string from Unicode to ASCII without changing its length (that is very important in my case)? Also, the characters that convert without any problems must stay at the same positions as in the original string. So an "Ä" must be converted to "A" and not to something cryptic that has more characters.

Edit:
@novalis - Such symbols (for example, from Asian languages) should just be converted to some placeholders. I am not too interested in those words or what they mean.

@MtnViewMark - I must preserve the total number of characters and the positions of the characters that are available in ASCII under all circumstances.

Here is some more info: I have some text mining tools that can only process ASCII strings. Most of the text to be processed is in English, but some of it contains non-ASCII characters. I am not interested in those words, but I must be sure that the words I am interested in (those that contain only ASCII characters) are at the same positions after the string conversion.

What do you intend to convert 口水雞 to? I don't know how one could express the concept of saliva chicken in three ASCII characters.
– novalis, Commented Jan 19, 2010 at 20:12
It isn't clear: are you trying to preserve the number of characters or the number of bytes… or perhaps the width of the string when displayed?
– MtnViewMark, Commented Jan 19, 2010 at 20:36

Community · Accepted Answer · 2017-05-23 11:46:50Z

14

As stated in this answer, the following code should work:

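    import java.nio.charset.StandardCharsets;
    import java.text.Normalizer;

    String s = "口水雞 hello Ä";

    // Decompose: "Ä" becomes "A" followed by a combining diaeresis (U+0308).
    String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
    String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";

    // Strip the marks, then encode to ASCII; unmappable characters become '?'.
    String s2 = new String(
            s1.replaceAll(regex, "").getBytes(StandardCharsets.US_ASCII),
            StandardCharsets.US_ASCII);

    System.out.println(s2);
    System.out.println(s.length() == s2.length());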

Output is

??? hello A
true

So you first remove the diacritical marks, then convert to ASCII. Non-ASCII characters will become question marks.
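One caveat: NFKD can expand a single character into several (the "fi" ligature U+FB01 becomes the two letters "fi", for example), so the length check above is not guaranteed for every input. If the positions must never shift, a per-character variant is one option. The following is only a sketch, not the answer's method: the class and method names are made up for illustration, it uses NFD instead of NFKD, and it treats each UTF-16 code unit (including each half of a surrogate pair) as one position.

```java
import java.text.Normalizer;

class AsciiFold {
    // Hypothetical helper: maps every UTF-16 code unit to exactly one ASCII char.
    static String toAsciiSameLength(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                out.append(c);          // already ASCII: keep it untouched
            } else if (Character.isSurrogate(c)) {
                out.append('?');        // half of a supplementary character
            } else {
                // Decompose this single character and keep its base letter
                // only if that base letter is ASCII; otherwise use a placeholder.
                String decomposed =
                        Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                char base = decomposed.charAt(0);
                out.append(base < 128 ? base : '?');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three CJK characters, " hello ", A with diaeresis, then " he^^o".
        String s = "\u53E3\u6C34\u96DE hello \u00C4 he^^o";
        String folded = toAsciiSameLength(s);
        System.out.println(folded);
        System.out.println(s.length() == folded.length());
    }
}
```

Because exactly one character is appended per input code unit, the output length always equals `input.length()`, and ASCII characters such as '^' are never touched.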

edited May 23, 2017 at 11:46

CommunityBot


answered Jan 19, 2010 at 21:27

Denis Tulskiy


3 Comments

medihack Over a year ago

Thanks ... seems to work almost fine. But there is a problem with the '^' character. When it is inside a string (like "he^^o") it fails (simply gets deleted).

Denis Tulskiy Over a year ago

Just remove \\p{IsLm}\\p{IsSk} from the regex.
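For illustration, here is roughly what that narrower regex looks like in use. This is a sketch with made-up class and method names; the reason '^' disappeared is that U+005E belongs to the Unicode category Sk (modifier symbol), which the original pattern also matched.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class StripMarksOnly {
    // Hypothetical helper: same idea as the answer, but only combining
    // diacritical marks (U+0300..U+036F) are removed.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Encoding to US-ASCII replaces every unmappable character with '?'.
        return new String(stripped.getBytes(StandardCharsets.US_ASCII),
                StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "he^^o " followed by A with diaeresis.
        System.out.println(fold("he^^o \u00C4"));
    }
}
```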

RedYeti Over a year ago

If anyone wants to remove question marks and fully reduce the text to basic letters, try: "[\\P{InBasicLatin}]+" (note that the upper-case P means "not in"). Tested using: r̀r̂r̃r̈rʼŕřt̀t̂ẗţỳỹẙyʼy̎ýÿŷp̂p̈s̀s̃s̈s̊sʼs̸śŝŞşšd̂d̃d̈ďdʼḑf̈f̸g̀g̃g̈gʼģq‌́ĝǧḧĥj̈jʼḱk̂k̈k̸ǩl̂l̃l̈Łłẅẍc̃c̈c̊cʼc̸Çççćĉčv̂v̈vʼv̸b́b̧ǹn̂n̈n̊nʼńņňñm̀m̂m̃m̈‌m̊m̌ǵß

Ignacio Vazquez-Abrams · Accepted Answer · 2010-01-19 20:07:58Z

8

Use java.text.Normalizer.normalize() with Normalizer.Form.NFD, then filter out the non-ASCII characters.
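A minimal sketch of that suggestion (the class and method names are only illustrative). Note that filtering drops every character that has no ASCII base letter, so the string gets shorter in that case; to keep the length, one would replace such characters with a placeholder instead of removing them.

```java
import java.text.Normalizer;

class NfdFilter {
    // Hypothetical helper: decompose, then delete everything outside ASCII.
    static String stripNonAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{ASCII} is the POSIX class for U+0000..U+007F in java.util.regex.
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "G", o with diaeresis, "teborg"
        System.out.println(stripNonAscii("G\u00F6teborg"));
    }
}
```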

answered Jan 19, 2010 at 20:07

Ignacio Vazquez-Abrams


3 Comments

Paul Clapham Over a year ago

This is probably what Zardoz actually wanted, although it is going to be ineffective for characters which aren't in the Latin pages.

Pekka Over a year ago

+1 this looks like the best solution to the problem (as far as can be told from the question).

jarnbjo Over a year ago

Unicode normalization only works for characters that can be decomposed into a plain Latin character from the ASCII charset plus a diacritical mark.

Community · Accepted Answer · 2017-05-23 11:46:50Z

As Paul Taylor mentioned, there is an issue with using Normalizer if the project needs to compile and run on both pre-1.6 Java and Java 1.6 or higher. You will run into trouble because Normalizer lives in different packages (sun.text.Normalizer before 1.6, java.text.Normalizer from 1.6 on) and has a different method signature.

Usually it is recommended to use reflection to invoke the appropriate Normalizer.normalize() method. (An example can be found here.)
But if you don't want to put a reflection mess in your code, you can use the icu4j library. It contains the com.ibm.icu.text.Normalizer class with a normalize() method that performs the same job as java.text.Normalizer/sun.text.Normalizer. The ICU library has (or should have) its own implementation of Normalizer, so you can ship your project with the library and it should be independent of the Java version.
The disadvantage is that the ICU library is quite big.
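As a rough idea of what the reflection approach mentioned above can look like, here is a sketch that looks up java.text.Normalizer at runtime and falls back to returning the input unchanged when the class is missing. The class and method names are invented for this example, and the legacy sun.text.Normalizer branch is deliberately left as a comment rather than guessed at.

```java
import java.lang.reflect.Method;

class ReflectiveNormalizer {
    // Hypothetical helper: NFD-decomposes s if java.text.Normalizer exists.
    static String decompose(String s) {
        try {
            Class<?> normalizer = Class.forName("java.text.Normalizer");
            Class<?> formClass = Class.forName("java.text.Normalizer$Form");
            // Enum constants are public static fields, so NFD can be read by name.
            Object nfd = formClass.getField("NFD").get(null);
            Method normalize =
                    normalizer.getMethod("normalize", CharSequence.class, formClass);
            return (String) normalize.invoke(null, s, nfd);
        } catch (Exception e) {
            // Assumption: on an older runtime, the legacy sun.text class would
            // be looked up and invoked here instead. This sketch gives up.
            return s;
        }
    }

    public static void main(String[] args) {
        // A with diaeresis decomposes into "A" plus U+0308, so this prints 2.
        System.out.println(decompose("\u00C4").length());
    }
}
```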

If you are using the Normalizer class just to remove accents/diacritics from strings, there is also another way. You can use the Apache Commons Lang library (version 3), which contains StringUtils with the method stripAccents():

    String noAccentsString = org.apache.commons.lang3.StringUtils.stripAccents(s);

The Lang3 library probably uses reflection to invoke the appropriate Normalizer according to the Java version. So the advantage is that you don't have a reflection mess in your code.

Pekka · Accepted Answer · 2010-01-19 20:13:43Z

2

Caveat: I don't know Java. Just a bit about character sets.

You are not stating which character set you are using exactly.

But no matter which you use, it's impossible to convert a Unicode string to ASCII and retain the original length and character positions, simply because a Unicode character set will use multiple bytes for some characters (obviously).

The only exception I know of would be a UTF-8 string that contains only ASCII characters: This string will already be identical in both UTF-8 and ASCII, because UTF-8 uses multibyte characters only when necessary. (I don't know about the other Unicode flavours, there may be other dynamic ones).

The only workaround I can see is adding a space after any special character that was replaced by an ASCII one, but that will screw up the string (Göteborg in UTF-8 would have to become "Go teborg" to keep the length).

Maybe you want to elaborate on what you want to / need to achieve, so people here can suggest workarounds.

edited Jan 19, 2010 at 20:13

answered Jan 19, 2010 at 20:08

Pekka


1 Comment

Ignacio Vazquez-Abrams Over a year ago

Java uses UTF-16 for strings internally, so for most common "Western" languages the original text and the "ASCII-reduced" text will have the same length (save the occasional odd punctuation).
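To make the difference between "length in characters" and "length in bytes" concrete, here is a small sketch (helper names are made up). `String.length()` counts UTF-16 code units, not bytes, so "Göteborg" has length 8 both before and after the accent is stripped, even though its UTF-8 encoding takes 9 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class LengthDemo {
    static int utf16Units(String s) {
        return s.length();                                  // UTF-16 code units
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;   // encoded byte count
    }

    static String stripMarks(String s) {
        // Decompose, then drop the combining marks (U+0300..U+036F).
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String s = "G\u00F6teborg";                         // o with diaeresis
        System.out.println(utf16Units(s));                  // 8
        System.out.println(utf8Bytes(s));                   // 9
        System.out.println(stripMarks(s));                  // Goteborg
        System.out.println(utf16Units(stripMarks(s)));      // 8
    }
}
```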

KRP · Accepted Answer · 2014-03-26 09:28:05Z

2

One issue with Normalizer is that before Java 1.6 it is in the sun.text package, whereas in 1.6 it is in the java.text package, and its method signature has changed. So if your application needs to run on both platforms, you'll have to use reflection.

An alternative custom solution is described as technique 3 here

edited Mar 26, 2014 at 9:28

KRP


answered Jun 3, 2010 at 10:40

Paul Taylor

