UTF8 Decoding with NSString

Question

I am new to Objective-C and am trying to convert a malformed UTF-8 encoded NSString to a well-formed one, using the example in Apple's docs.

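NSString *theString = @"LÃ¼gen"; // should read "Lügen"
// Assumed missing step (asciiData was not declared in the post): a lossy
// conversion to ASCII, which matches the output shown below.
NSData *asciiData = [theString dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *asciiString = [[NSString alloc] initWithData:asciiData encoding:NSASCIIStringEncoding];

// -length returns NSUInteger, so cast and print with %lu rather than %d.
NSLog(@"Original: %@ (length %lu)", theString, (unsigned long)[theString length]);
NSLog(@"Converted: %@ (length %lu)", asciiString, (unsigned long)[asciiString length]);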

Result:

Original: LÃ¼gen (length 6)
Converted: LA1/4gen (length 8)

This does nothing:

NSString* str = [NSString stringWithUTF8String:
                 [theString cStringUsingEncoding:NSASCIIStringEncoding]];

This crashes my app:

NSString* str = [NSString stringWithUTF8String:
                 [theString cStringUsingEncoding:NSUTF8StringEncoding]];

Does anyone have an idea what I am doing wrong?

Could you dump the strings as hex? I don't read malformed UTF8 fluently :)
– Joachim Isaksson, Commented Jan 13, 2012 at 11:47
Please post details of the crash in any question that involves a crash.
– jrturton, Commented Jan 13, 2012 at 12:41

Jano · Accepted Answer · 2012-01-15 11:56:26Z

NSString *string = @"Ã¼";
// Convert the characters back to Latin-1 bytes; this recovers the original data (C3 BC).
const char *c = [string cStringUsingEncoding:NSISOLatin1StringEncoding];
// Parse those bytes as UTF-8, the encoding they were written in.
NSString *newString = [[NSString alloc] initWithCString:c encoding:NSUTF8StringEncoding];
NSLog(@"%@", newString); // ü

"Malformed UTF-8 sequence" means a sequence of bytes that is invalid in UTF-8. That is not what you have. Your problem is the unexpected result of parsing a string with a different encoding than the one its original author used.

The bytes C3 BC parsed as UTF-8 are the single character ü. Instead you used Latin-1, which turns the same two bytes into the two characters Ã¼. Then you created an NSString from the Latin-1 parsed text, which means you converted the Latin-1 string to a UTF-16 string (the native format of NSString).
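To make the C3 BC example concrete, here is a minimal sketch that decodes the same two bytes once as UTF-8 and once as Latin-1. The helper name DecodeBytes is hypothetical and not part of the answer; the sketch assumes ARC and a Foundation command-line target.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the answer): decode a raw byte buffer with
// the given encoding. Returns nil if the bytes are not valid in that encoding.
static NSString *DecodeBytes(const unsigned char *bytes, NSUInteger length,
                             NSStringEncoding encoding) {
    NSData *data = [NSData dataWithBytes:bytes length:length];
    return [[NSString alloc] initWithData:data encoding:encoding];
}

int main(void) {
    @autoreleasepool {
        // The two bytes discussed above.
        const unsigned char bytes[] = {0xC3, 0xBC};

        // Same data, two interpretations.
        NSString *asUTF8 = DecodeBytes(bytes, sizeof bytes, NSUTF8StringEncoding);
        NSString *asLatin1 = DecodeBytes(bytes, sizeof bytes, NSISOLatin1StringEncoding);

        NSLog(@"UTF-8:   %@ (length %lu)", asUTF8, (unsigned long)asUTF8.length);
        NSLog(@"Latin-1: %@ (length %lu)", asLatin1, (unsigned long)asLatin1.length);
    }
    return 0;
}
```

The UTF-8 string has length 1 and the Latin-1 string has length 2, even though both were built from identical bytes.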

Representing the same data in different encodings shows up as different characters, but doesn't change the data. Converting to a different encoding does change the data, in an attempt to reproduce the same characters. Example: the characters Ã¼ are C3 83 C2 BC in UTF-8, but C3 BC in Latin-1. So I converted the string to Latin-1 to get the original data back, and then I parsed that data as UTF-8.
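The same round trip can be sketched with NSData instead of a C string, which also makes the failure cases explicit. The helper name RepairMisdecodedUTF8 is hypothetical; the sketch assumes ARC and that the text really was UTF-8 mis-decoded as Latin-1, so it is not a general repair for every kind of garbled text.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: turn the characters back into Latin-1 bytes, then
// parse those bytes as UTF-8. Returns nil if the text has characters outside
// Latin-1, or if the recovered bytes are not valid UTF-8.
static NSString *RepairMisdecodedUTF8(NSString *garbled) {
    NSData *original = [garbled dataUsingEncoding:NSISOLatin1StringEncoding];
    if (original == nil) {
        return nil;
    }
    return [[NSString alloc] initWithData:original encoding:NSUTF8StringEncoding];
}

int main(void) {
    @autoreleasepool {
        // "LÃ¼gen", written with escapes so the source file encoding does not matter.
        NSString *garbled = @"L\u00C3\u00BCgen";

        // Converting to an encoding changes the bytes, not the characters.
        NSData *utf8 = [garbled dataUsingEncoding:NSUTF8StringEncoding];
        NSData *latin1 = [garbled dataUsingEncoding:NSISOLatin1StringEncoding];
        NSLog(@"UTF-8 bytes: %lu, Latin-1 bytes: %lu",
              (unsigned long)utf8.length, (unsigned long)latin1.length);

        NSString *repaired = RepairMisdecodedUTF8(garbled);
        NSLog(@"Repaired: %@ (length %lu)", repaired, (unsigned long)repaired.length);
    }
    return 0;
}
```

Checking for nil matters here: a string that was never mis-decoded may contain characters Latin-1 cannot represent, or may produce bytes that are not valid UTF-8.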
