I have a varchar(4000) column containing text such as:

'aaabbaaacbaaaccc'

and I need to collapse every run of repeated characters so that only one character from each run is left:

'abacbac'

It should not be a function, a procedure, or a CLR/regex solution. Only a pure SQL SELECT.

Currently I am thinking about using a recursive WITH clause that applies the replacements 'aa'->'a', 'bb'->'b', 'cc'->'c'.

The recursion would repeat until every duplicated sequence of those characters has been replaced.
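
A minimal sketch of that recursive idea, assuming a hypothetical table myTable(rowID, dupedchars) and assuming the text contains only the characters a, b and c. Each pass roughly halves every run, so a 4000-character run should need about a dozen passes, which stays under the default recursion limit of 100:

```sql
-- Hypothetical table: myTable(rowID INT, dupedchars NVARCHAR(4000))
WITH Collapse(rowID, txt) AS
(
    -- anchor: start from the original text
    SELECT rowID, CAST(dupedchars AS NVARCHAR(4000))
      FROM myTable
    UNION ALL
    -- recursive step: shorten every run of a, b and c, while any doubled character remains
    SELECT rowID,
           CAST(REPLACE(REPLACE(REPLACE(txt, 'aa', 'a'), 'bb', 'b'), 'cc', 'c') AS NVARCHAR(4000))
      FROM Collapse
     WHERE txt LIKE '%aa%' OR txt LIKE '%bb%' OR txt LIKE '%cc%'
)
-- keep only the final row of each chain: the one with no doubled character left
SELECT rowID, txt AS deduped
  FROM Collapse
 WHERE txt NOT LIKE '%aa%'
   AND txt NOT LIKE '%bb%'
   AND txt NOT LIKE '%cc%';
```

The obvious limitation is that every character has to be listed explicitly, so this only works for a small, known alphabet.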

Do you have another solution, perhaps a more performant one?

PS: I searched this site for other replace examples, but none of them fit this case.

  • This sounds like a homework question. Why no functions? Commented Mar 17, 2010 at 23:03
  • Yeah, it's a kind of test. But I want to check whether the recursive "WITH" variant is OK. No functions, because I already know how to implement this with a function. It's interesting to find the best SQL-native approach. Commented Mar 17, 2010 at 23:29
  • Can I add a helper table? It will be very small but have, oh I don't know, 4000 rows :) Commented Mar 17, 2010 at 23:30
  • No problem if it is inside the query. CREATE TABLE is not allowed. To clarify: the column may contain a sequence up to 4000 characters long. Commented Mar 17, 2010 at 23:32

1 Answer

Assuming a table definition of

    CREATE TABLE myTable(rowID INT IDENTITY(1,1), dupedchars NVARCHAR(4000))

and this sample data

    INSERT INTO myTable (dupedchars)
    SELECT 'aaabbaaacbaaaccc'
     UNION
    SELECT 'abcdeeeeeffgghhaaabbbjdduuueueu999whwhwwwwwww'

this query meets your criteria

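    -- Numbers: recursive CTE producing 1..4000, one row per character position
    WITH Numbers(n)
      AS
       (   SELECT 1 AS n
            UNION ALL
           SELECT n + 1
             FROM Numbers
            WHERE n < 4000
       )
    SELECT t.rowID,
           (   -- emit a character only when it differs from the character after it
               SELECT CASE
                        WHEN SUBSTRING(t2.dupedchars, n2.n, 1) = SUBSTRING(t2.dupedchars + ' ', n2.n + 1, 1) THEN ''
                        ELSE SUBSTRING(t2.dupedchars, n2.n, 1)
                      END AS [text()]
                 FROM myTable t2
                 JOIN Numbers n2
                   ON n2.n <= LEN(t2.dupedchars)
                WHERE t2.rowID = t.rowID
                ORDER BY n2.n          -- keep the characters in their original order
                  FOR XML PATH('')     -- concatenate the kept characters into one string
           ) AS deduped
      FROM myTable t
    OPTION (MAXRECURSION 4000)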

Output

rowid   deduped
   1    abacbac
   2    abcdefghabjdueueu9whwhw
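
One caveat with FOR XML PATH('') concatenation is that characters such as <, > and & come back XML-escaped (for example & becomes &amp;). A hedged variant, assuming the same myTable definition as above, adds the TYPE directive and reads the text back with .value(); it also orders by character position explicitly. Treat it as a sketch rather than a tested replacement:

```sql
WITH Numbers(n) AS
(
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 4000
)
SELECT t.rowID,
       (   SELECT CASE
                    -- drop a character when the next one is the same
                    WHEN SUBSTRING(t.dupedchars, n2.n, 1) = SUBSTRING(t.dupedchars + ' ', n2.n + 1, 1) THEN ''
                    ELSE SUBSTRING(t.dupedchars, n2.n, 1)
                  END AS [text()]
             FROM Numbers n2
            WHERE n2.n <= LEN(t.dupedchars)
            ORDER BY n2.n
              FOR XML PATH(''), TYPE               -- return typed XML instead of an escaped string
       ).value('.', 'NVARCHAR(4000)') AS deduped   -- read the text back without entity escapes
  FROM myTable t
OPTION (MAXRECURSION 4000);
```

Note that the = comparison follows the column's collation, so under a case-insensitive collation 'aA' counts as a repeated character.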

3 Comments

CResults: it's fantastic! I was thinking along almost the same lines, but with a different approach. Yours is more universal. Thanks! What about performance for a table with 100,000+ rows? And am I right that this is the only option for doing this in native SQL?
For that many rows you're looking at an execution time of around 10 seconds. The alternative (which I was looking at originally) would be a physical, indexed table in place of the Numbers CTE. You may get some improvement from that, but the slow part of the query is the de-duping; any string manipulation of this type will have a speed overhead.
Note that the 10 seconds is based on string lengths similar to those above. As mentioned, most of the time is spent on de-duping. Set all your fields to 4000 characters and you're looking at around 1000 results per minute. If your fields contain duplicate values, you can optimise by supplying only the unique values to this query.
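
The "unique values only" idea from the last comment could be sketched as follows, assuming the same myTable definition as in the answer. The CTE names DistinctValues and Collapsed are made up for illustration, and whether the optimizer really evaluates the subquery once per distinct value is not guaranteed, so the execution plan is worth checking:

```sql
WITH Numbers(n) AS
(
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 4000
),
DistinctValues AS
(
    -- hypothetical helper: each distinct string once
    SELECT DISTINCT dupedchars
      FROM myTable
),
Collapsed AS
(
    -- de-dupe each distinct string with the same per-character comparison
    SELECT d.dupedchars,
           (   SELECT CASE
                        WHEN SUBSTRING(d.dupedchars, n2.n, 1) = SUBSTRING(d.dupedchars + ' ', n2.n + 1, 1) THEN ''
                        ELSE SUBSTRING(d.dupedchars, n2.n, 1)
                      END AS [text()]
                 FROM Numbers n2
                WHERE n2.n <= LEN(d.dupedchars)
                ORDER BY n2.n
                  FOR XML PATH('')
           ) AS deduped
      FROM DistinctValues d
)
-- map every original row back to the result for its string
SELECT t.rowID, c.deduped
  FROM myTable t
  JOIN Collapsed c
    ON c.dupedchars = t.dupedchars
OPTION (MAXRECURSION 4000);
```

Rows where dupedchars is NULL drop out of the join, so they would need separate handling if they matter.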
