
I have a small problem with text encodings.

I have two strings that I'm loading from a SQL Server 2008 database (nvarchar field).

After loading them from the database Visual Studio 2010 displays them as follows in the watch window:

str1 = "Test"
str2 = "Test"

But the comparison str1 = str2 returns False.

If I write those strings to a file with UTF8 Encoding the result is as expected:

Test
Test

If I write those strings to a file with ANSI (Default) Encoding the result is NOT as expected:

?Test
Test

Converting the strings to bytes:

System.Text.Encoding.Default.GetBytes(str1) 'Returns ByteArray {63, 84, 101, 115, 116}
System.Text.Encoding.Default.GetBytes(str2) 'Returns ByteArray {84, 101, 115, 116}

System.Text.Encoding.UTF8.GetBytes(str1) 'Returns ByteArray {239, 187, 191, 84, 101, 115, 116}
System.Text.Encoding.UTF8.GetBytes(str2) 'Returns ByteArray {84, 101, 115, 116}

Where is the Byte 63 in case of ANSI Encoding OR Bytes 239, 187, 191 in case of UTF8 Encoding for str1 coming from?

Well, Bytes 239, 187, 191 are the BOM for UTF8. The question here would more likely be: Why do I get the BOM for str1 but not for str2?

(Well, the values are values passed to a webservice which inserts them into the database, the initial values are passed to this webservice by a client I've no control over)

8
  • Please decide - C# or VB.NET. Your code sample suggests VB.NET and would be a coding error in C#. Commented Apr 11, 2012 at 11:13
  • 1
    The example is VB.NET, but since this is a question about encoding rather than about VB.NET or C#, I selected both tags (whether an answer is in C# or VB.NET doesn't really matter to me). Commented Apr 11, 2012 at 11:19
  • Then don't tag it with these languages (might as well tag with F# or Cobol.NET) if it isn't relevant. Commented Apr 11, 2012 at 11:21
  • 1
    @Oded Encoding.Default returns an encoding for the operating system's current "ANSI" code page, not UTF16. See docs. Not to be confused with the "default" encoding UTF8, using "default" to mean "what you get if you don't specify the encoding explicitly". I don't know why Microsoft decided to call it Encoding.Default when it is not the default!! Commented Apr 11, 2012 at 12:10
  • 3
    @Ramhound - System.String overrides Equals and ==. Same contents will produce true. And then there is interning. Commented Apr 11, 2012 at 12:25

3 Answers

3

Just so I'm clear, you do read the two strings from two different records in the database, right? Not from one record in two different ways?

Well then, someone has stored a BOM in one of the records. Since a BOM is invisible when printed, you won't see a visual difference unless you convert the string to an encoding that can't represent the BOM character.
That's what happens above.
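A minimal sketch of the effect, written in Python only for illustration (the question itself is VB.NET). It assumes that str1 holds the character U+FEFF in front of "Test" and that the machine's ANSI code page is Windows-1252; both are assumptions, not something the question states.

```python
# Assumption: the first record starts with U+FEFF (the BOM character).
BOM = "\ufeff"

str1 = BOM + "Test"
str2 = "Test"

# The strings look identical when displayed, but they are not equal.
print(str1 == str2)  # False

# UTF-8 can represent U+FEFF; it encodes to the bytes EF BB BF.
utf8_bytes = list(str1.encode("utf-8"))
print(utf8_bytes)  # [239, 187, 191, 84, 101, 115, 116]

# Windows-1252 (assumed ANSI code page) has no such character,
# so the encoder substitutes "?" (byte 63).
ansi_bytes = list(str1.encode("cp1252", errors="replace"))
print(ansi_bytes)  # [63, 84, 101, 115, 116]
```

The byte arrays match the ones in the question, which is consistent with the first record carrying a leading U+FEFF.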

To solve this, you will need to clean up the database. Read every record, see if it starts with a BOM and, if so, write the content (without the BOM) back.
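The per-value check could look roughly like this. It is a sketch in Python for illustration; `strip_leading_bom` is a hypothetical helper name, and the database read and write-back are left out because they depend on your schema. In .NET the equivalent test would be whether the string starts with ChrW(&HFEFF).

```python
BOM = "\ufeff"  # U+FEFF, the BOM character


def strip_leading_bom(value: str) -> str:
    """Return value without a single leading BOM; other text is untouched."""
    if value.startswith(BOM):
        return value[len(BOM):]
    return value


# Hypothetical values standing in for the two records from the question.
records = ["\ufeffTest", "Test"]
cleaned = [strip_leading_bom(r) for r in records]

print(cleaned[0] == cleaned[1])  # True
```

Only records whose value actually changes need to be written back.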

Edit: I only noticed later that you said this database was created on-the-fly by the webservice. In that case, the solution is to contact the author of the webservice and tell them they've got a bug in their routine.


3 Comments

Yes, those are different records. Forgot to mention that ;) Which "feature" creates then the byte 63 with default encoding?
Like I said, if you convert a string to an encoding that can't store a BOM. Or in general, if you have a string with any character that can't be converted to the destination character set, the system translates it into a ?. The fact that it then becomes visible is just a side effect!
Thanks. That explains the unexpected question mark ;) So I'll need to clean up those values.
1

You answered it yourself: "the values are values passed to a webservice which inserts them into the database, the initial values are passed to this webservice by a client I've no control over"

The BOM is inserted there. Check how the data is inserted, and why it was inserted with a BOM for str1 and without one for str2.


1

I've seen this happen before when importing data into SQL. In my case it was a bulk import from a CSV file. The first column of the first row ended up containing the BOM, which effectively invalidated that value.

The solution is to clean the database, but also ensure that all new imports from files are cleaned before insert.
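As an illustration of cleaning at import time, here is a small Python sketch (not the bulk import tooling I used). The file content and the `read_rows` helper are made up for the example. The point is that a decoder which knows about the UTF-8 signature drops it, while a plain UTF-8 decode leaves it attached to the first cell.

```python
import csv
import io

# Hypothetical CSV content that begins with a UTF-8 BOM (EF BB BF).
raw = b"\xef\xbb\xbfname,value\r\nTest,1\r\n"


def read_rows(data: bytes, encoding: str) -> list:
    """Decode the bytes with the given encoding and parse them as CSV."""
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.reader(text))


rows_plain = read_rows(raw, "utf-8")      # BOM stays in the first cell
rows_clean = read_rows(raw, "utf-8-sig")  # BOM is removed while decoding

print(repr(rows_plain[0][0]))  # '\ufeffname'
print(repr(rows_clean[0][0]))  # 'name'
```

In .NET, reading the file through a StreamReader with BOM detection enabled should have a similar effect, but check that against whatever import path you actually use.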
