I am new to Objective-C and am trying to convert a malformed UTF-8 encoded NSString to a well-formed one using the example in Apple's docs.

NSString *theString = @"LÃ¼gen"; // should be "ü"
NSData *asciiData = [theString dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *asciiString = [[NSString alloc] initWithData:asciiData encoding:NSASCIIStringEncoding];

NSLog(@"Original: %@ (length %lu)", theString, (unsigned long)[theString length]);
NSLog(@"Converted: %@ (length %lu)", asciiString, (unsigned long)[asciiString length]);

Result:

Original: LÃ¼gen (length 6)
Converted: LA1/4gen (length 8)

This does nothing:

NSString* str = [NSString stringWithUTF8String:
                 [theString cStringUsingEncoding:NSASCIIStringEncoding]];

This crashes my app:

NSString* str = [NSString stringWithUTF8String:
                 [theString cStringUsingEncoding:NSUTF8StringEncoding]];

Does anyone have an idea what I am doing wrong?

  • Could you dump the strings as hex? I don't read malformed UTF8 fluently :) Commented Jan 13, 2012 at 11:47
  • this is an "ü" don't know how to get the hex value ;) Commented Jan 13, 2012 at 11:52
  • Please post details of the crash in any question that involves a crash. Commented Jan 13, 2012 at 12:41
  • @Jano: You should add that as answer. Commented Jan 13, 2012 at 14:46

1 Answer

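NSString *string = @"Ã¼"; // UTF-8 bytes of "ü" that were decoded as Latin-1
// Re-encode as Latin-1 to recover the original bytes (C3 BC)
const char *c = [string cStringUsingEncoding:NSISOLatin1StringEncoding];
// Decode those bytes with the encoding they were written in
NSString *newString = [[NSString alloc] initWithCString:c encoding:NSUTF8StringEncoding];
NSLog(@"%@", newString); // ü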

"Malformed UTF-8 sequence" means a sequence of bytes that is invalid in UTF-8. Your problem is different: you are getting unexpected results after parsing the data with a different encoding from the one used by the original author of the string.

Hexadecimal data C3 BC parsed with UTF-8 encoding is the character ü. Instead you used Latin-1 encoding, which results in Ã¼. Then you created an NSString from the Latin-1 parsed string, which means you converted the Latin-1 string to a UTF-16 string (the native format of NSString).
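
The mix-up can be reproduced directly from the two bytes. This is a minimal sketch, assuming a Foundation target built with ARC; `StringFromBytes` is a hypothetical helper name used only for illustration, not part of NSString.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decode the same raw bytes with a caller-chosen encoding.
// Returns nil if the bytes are not valid in that encoding.
static NSString *StringFromBytes(const void *bytes, NSUInteger length,
                                 NSStringEncoding encoding) {
    NSData *data = [NSData dataWithBytes:bytes length:length];
    return [[NSString alloc] initWithData:data encoding:encoding];
}
```

Calling `StringFromBytes("\xC3\xBC", 2, NSUTF8StringEncoding)` gives the one-character string ü, while passing `NSISOLatin1StringEncoding` for the same two bytes gives the two-character string Ã¼. The bytes are identical in both calls; only the interpretation differs.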

Representing the same data in different encodings shows up as different characters, but doesn't change the data. Converting to a different encoding does change the data, in an attempt to reproduce the same characters. Example: the characters Ã¼ are C3 83 C2 BC in UTF-8, but C3 BC in Latin-1. So I converted to the same characters in Latin-1 to get the original data back, and then parsed that data as UTF-8.
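
To check the byte values for yourself, and to make the repair fail gracefully instead of raising on unexpected input, something along these lines may help. It is a sketch under assumptions: ARC, a Foundation target, and two hypothetical helper names (`HexBytes`, `RepairUTF8ReadAsLatin1`) that do not exist in Foundation. It uses `dataUsingEncoding:`, which returns nil when the string cannot be converted without loss.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: the bytes of `string` in `encoding` as space-separated hex.
// Returns nil if the string cannot be represented losslessly in that encoding.
static NSString *HexBytes(NSString *string, NSStringEncoding encoding) {
    NSData *data = [string dataUsingEncoding:encoding];
    if (data == nil) return nil;
    const unsigned char *bytes = data.bytes;
    NSMutableArray<NSString *> *parts = [NSMutableArray arrayWithCapacity:data.length];
    for (NSUInteger i = 0; i < data.length; i++) {
        [parts addObject:[NSString stringWithFormat:@"%02X", bytes[i]]];
    }
    return [parts componentsJoinedByString:@" "];
}

// Hypothetical helper: undo a "UTF-8 bytes decoded as Latin-1" mix-up.
// Returns nil if the input contains characters outside Latin-1, or if the
// recovered bytes are not valid UTF-8.
static NSString *RepairUTF8ReadAsLatin1(NSString *garbled) {
    // Step 1: convert back to Latin-1 to recover the original bytes.
    NSData *original = [garbled dataUsingEncoding:NSISOLatin1StringEncoding];
    if (original == nil) return nil;
    // Step 2: parse those bytes with the encoding they were written in.
    return [[NSString alloc] initWithData:original encoding:NSUTF8StringEncoding];
}
```

With these, `HexBytes(@"Ã¼", NSUTF8StringEncoding)` yields `C3 83 C2 BC`, `HexBytes(@"Ã¼", NSISOLatin1StringEncoding)` yields `C3 BC`, and `RepairUTF8ReadAsLatin1(@"LÃ¼gen")` yields Lügen. Going through NSData rather than a C string means a failed conversion shows up as nil that the caller can test for.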
