1

I have a table called airports in a SQL Server database, with a column declared as nvarchar(255). I had to declare it as nvarchar otherwise SSIS failed to import the data from a .csv file generated by an API.

I have approx 25k records in this table, where by from what I can tell 763 have Unicode characters in them, by running this query:

select cast(name as varchar), name 
from airports 
where cast(name as varchar) <> name

The first row shows the following two values returned in column 1 and 2

Harrisburg Capital City Airpor  
Harrisburg Capital City Airport

The first value from column 1 has had the last t stripped off it, which I assume means there is one unicode character in the string. Please let me know if I am wrong, as I am a bit useless with unicode characters.

My question is: how can I find the unicode characters in the column, and is there a safe / recommended way to remove them?

I did try this to see if I could find it, but it didn't do what I thought it would do.

set nocount on
DECLARE @nstring NVARCHAR(100)
SET @nstring =(select name from airports where fs = 'HAR')
DECLARE @position INT
SET @position = 1

DECLARE @CharList TABLE (Position INT,UnicodeChar NVARCHAR(1),UnicodeValue INT)

WHILE @position <= DATALENGTH(@nstring)
BEGIN
    INSERT @CharList
       SELECT 
           @position as Position,
           CONVERT(nchar(1),SUBSTRING(@nstring, @position, 1)) as UnicodeChar,
           UNICODE(SUBSTRING(@nstring, @position, 1)) as UnicodeValue
    SET @position = @position + 1
END

SELECT * 
FROM @CharList[/sql]
ORDER BY unicodevalue

The output is as follows

32      NULL
33      NULL
34      NULL
35      NULL
36      NULL
37      NULL
38      NULL
39      NULL
40      NULL
41      NULL
42      NULL
43      NULL
44      NULL
45      NULL
46      NULL
47      NULL
48      NULL
49      NULL
50      NULL
51      NULL
52      NULL
53      NULL
54      NULL
55      NULL
56      NULL
57      NULL
58      NULL
59      NULL
60      NULL
61      NULL
62      NULL
11      32
19      32
24      32
25  A   65
20  C   67
12  C   67
1   H   72
2   a   97
13  a   97
17  a   97
7   b   98
10  g   103
15  i   105
5   i   105
21  i   105
26  i   105
18  l   108
29  o   111
28  p   112
14  p   112
9   r   114
3   r   114
4   r   114
30  r   114
27  r   114
6   s   115
16  t   116
22  t   116
31  t   116
8   u   117
23  y   121
4
  • 3
    first off, try cast (name as varchar(255)) instead of cast (name as varchar). If you don't specify a length, it will default to 30, which could explain the truncation you are seeing. Commented Nov 10, 2017 at 2:29
  • thanks, good pickup. Personally think that is big ignorant of the database to decide that... Commented Nov 10, 2017 at 2:35
  • turns out i have 14 bad rows, which need further investigation but you pretty much solved the problem, as there were no unicode characters Commented Nov 10, 2017 at 2:47
  • also, if you need to "see" exactly what is in the field, you can do something like: CAST(name as varbinary(max)). This will show you the content as hex bytes. Commented Nov 10, 2017 at 2:54

1 Answer 1

1

However, if you want to first find the records which have some unicode chars then follow below approach with help of case expression

;WITH CTE
 AS (
 SELECT DATA, 
        CASE
            WHEN(CAST(DATA AS VARCHAR(MAX)) COLLATE SQL_Latin1_General_Cp1251_CS_AS) = DATA
            THEN 0
            ELSE 1
        END HasUnicodeChars,
       ROW_NUMBER() OVER (ORDER BY (SELECT 1)) RN
 FROM <table_name>)

SELECT * FROM CTE where HasUnicodeChars = 1
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.