4

I've run into a problem in a project I'm working on: some of the string values in a specific SQL Server 2008 table column contain Unicode characters. For example, instead of a dash some strings will instead contain an EM DASH (http://www.fileformat.info/info/unicode/char/2014/index.htm).

The column values that contain Unicode characters are causing problems when I send HTTP requests to a third-party server. Is there a way to query what rows contain one-or-more Unicode characters, so I can at least begin to identify how many rows need to be fixed?

2
  • 1
    Well if your data type is unicode then all of the characters will be Unicode. Do you want to find all values which can't be round tripped to single byte in your default collation without data loss? Commented Mar 3, 2011 at 17:54
  • I think single-byte items may still include invisible control characters that I don't care about. It would be more accurate to say that I want to find all strings that contain one or more characters outside ASCII characters 32-126. Commented Mar 9, 2011 at 19:47

3 Answers 3

18

You want to find all strings that contain one or more characters outside ASCII characters 32-126.

I think this should do the job.

SELECT *
FROM your_table
WHERE your_column LIKE N'%[^ -~]%' collate Latin1_General_BIN
Sign up to request clarification or add additional context in comments.

Comments

2

One way you can do it is to see which rows no longer equal themselves when converted to a datatype that doesn't support unicode.

CREATE TABLE myStrings (
    string nvarchar(max) not null
)

INSERT INTO myStrings (string)
SELECT 'This is not unicode' union all
SELECT 'This has '+nchar(500)+' unicode' union all
SELECT 'This also does not have unicode' union all
SELECT 'This has lots of unicode '+nchar(600)+nchar(700)+nchar(800)+'!'

SELECT cast(string as varchar)
FROM myStrings

SELECT *
FROM myStrings
WHERE cast(cast(string as varchar(max)) as nvarchar(max)) <> string

1 Comment

what did your results look like? I tried something similar and it failed. I think it had something to do with DASH and HYPHEN
-1
SELECT *
FROM your_table
WHERE your_column LIKE N'%[^ -~]%' collate Latin1_General_BIN

finds all strings that contain one or more characters within ASCII characters 32-126.

I thought the purpose was to find strings where ASCII characters are not in the range 32-126?

NOT is possible with LIKE. Wouldn't this work?

SELECT *
FROM your_table
WHERE your_column NOT LIKE N'%[^ -~]%'

No collate required.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.