SQL Server: replace sequence of same characters inside Text Field (TSQL only)

Question

I have a varchar(4000) column containing text such as:

'aaabbaaacbaaaccc'

and I need to collapse every run of repeated characters so that only one character from each run is left:

'abacbac'

It should not be a function, a stored procedure, or a CLR/regex solution. Only a true SQL SELECT.

Currently I am thinking about using a recursive WITH clause with REPLACE: 'aa'->'a', 'bb'->'b', 'cc'->'c'.

The recursion would cycle until all duplicated sequences of those characters have been replaced.
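A minimal sketch of that recursive REPLACE idea, assuming the alphabet is limited to 'a', 'b' and 'c' as in the sample. The variable @s and the CTE name Collapse are illustrative and not part of the question. Each pass roughly halves every run, so even a 4000-character run needs only about 12 passes, which stays under the default MAXRECURSION limit of 100.

```sql
-- Assumption: only 'a', 'b' and 'c' can repeat; @s stands in for the column value.
DECLARE @s VARCHAR(4000);
SET @s = 'aaabbaaacbaaaccc';

WITH Collapse(s, pass) AS
(
    -- Anchor: the original string (cast so both CTE branches have identical types)
    SELECT CAST(@s AS VARCHAR(4000)), 0
    UNION ALL
    -- Recursive step: collapse each doubled character once per pass
    SELECT CAST(REPLACE(REPLACE(REPLACE(s, 'aa', 'a'), 'bb', 'b'), 'cc', 'c') AS VARCHAR(4000)),
           pass + 1
      FROM Collapse
     WHERE s LIKE '%aa%' OR s LIKE '%bb%' OR s LIKE '%cc%'
)
-- The row from the last pass holds the fully collapsed string
SELECT TOP (1) s AS deduped
  FROM Collapse
 ORDER BY pass DESC;
```

The obvious limitation is that every character that may repeat needs its own REPLACE and its own LIKE test, so this only works for a small, known alphabet.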

Do you have another solution, perhaps a more performant one?

PS: I searched this site for other REPLACE examples, but they didn't fit this case.

Yeah, kind of a test. But I want to check whether the recursive "WITH" variant is OK. No functions, because I already know how to implement this with a function. It's interesting to find the best SQL-native approach.
– zmische, Commented Mar 17, 2010 at 23:29
Can I add a helper table? It will be very small but have, oh I don't know, 4000 rows :)
– CResults, Commented Mar 17, 2010 at 23:30
No problem if it is inside the query. CREATE TABLE is not allowed. To clarify: the column may contain a sequence up to 4000 characters wide.
– zmische, Commented Mar 17, 2010 at 23:32

CResults · Accepted Answer · 2010-03-18 00:37:48Z


Assuming a table definition of

CREATE TABLE myTable(rowID INT IDENTITY(1,1), dupedchars NVARCHAR(4000))

and data:

    INSERT INTO myTable (dupedchars)
         SELECT 'aaabbaaacbaaaccc'
          UNION ALL
         SELECT 'abcdeeeeeffgghhaaabbbjdduuueueu999whwhwwwwwww'

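this query meets your criteria:

    -- Recursive CTE that generates the numbers 1..4000, one per character position
    WITH Numbers(n)
      AS
       (   SELECT 1 AS n
            UNION ALL
           SELECT (n + 1) AS n
             FROM Numbers
            WHERE n < 4000
       )
    SELECT t.rowID,
           (   -- Emit a character only when it differs from the character after it
               SELECT CASE
                      WHEN SUBSTRING(t2.dupedchars, n2.n, 1) = SUBSTRING(t2.dupedchars + ' ', n2.n + 1, 1) THEN ''
                      ELSE SUBSTRING(t2.dupedchars, n2.n, 1)
                       END AS [text()]
                 FROM myTable t2
                CROSS JOIN Numbers n2
                WHERE n2.n <= LEN(t2.dupedchars)
                  AND t.rowID = t2.rowID
                ORDER BY n2.n   -- concatenate in position order; not guaranteed without this
                  FOR XML PATH('')
           ) AS deduped
      FROM myTable t
    OPTION (MAXRECURSION 4000)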

Output

rowid   deduped
   1    abacbac
   2    abcdefghabjdueueu9whwhw

3 Comments

zmische Over a year ago

CResults: it's fantastic! :) I was thinking of almost the same thing, but with a different approach. Yours is more universal. Thanks! And what about performance for a table with 100,000+ rows? Am I right that this is the only option for doing it in native SQL?

CResults Over a year ago

For that many rows you're looking at an execution time of around 10 seconds. The alternative (which I was looking at originally) would be a physical, indexed table in place of the Numbers CTE. You may get some improvement from that, but the slow part of the query is the de-duping: any string manipulation of this type will have a speed overhead.
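A sketch of the physical-table alternative mentioned above, assuming you are allowed to create objects (the question rules this out, so treat it as an option for other setups). The table name dbo.Numbers is hypothetical. The dedupe query then reads from the table, so the recursive CTE and the MAXRECURSION hint are no longer needed at query time.

```sql
-- Hypothetical helper table: one row per character position, clustered on n.
CREATE TABLE dbo.Numbers (n INT NOT NULL PRIMARY KEY CLUSTERED);

-- One-off population with the same recursive CTE used in the answer.
WITH Seq(n) AS
(
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM Seq WHERE n < 4000
)
INSERT INTO dbo.Numbers (n)
SELECT n FROM Seq
OPTION (MAXRECURSION 4000);

-- Same de-duping logic, reading positions from the physical table.
SELECT t.rowID,
       (   SELECT CASE
                  WHEN SUBSTRING(t.dupedchars, num.n, 1) = SUBSTRING(t.dupedchars + ' ', num.n + 1, 1) THEN ''
                  ELSE SUBSTRING(t.dupedchars, num.n, 1)
                   END AS [text()]
             FROM dbo.Numbers AS num
            WHERE num.n <= LEN(t.dupedchars)
            ORDER BY num.n
              FOR XML PATH('')
       ) AS deduped
  FROM myTable AS t;
```

Whether this is measurably faster depends on the data; as the comment says, most of the time goes into the per-character string work rather than into generating the numbers.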

CResults Over a year ago

Note that the 10 seconds is based on string lengths similar to the ones above. As suggested, the time is spent in de-duping. Set all your fields to 4000 characters and you're looking at around 1000 results per minute. If you have duplicate values in your fields, you will get an optimisation by supplying only the unique values to this query.
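One way to feed only the unique values to the query, sketched under a few assumptions: the CTE names DistinctValues and Deduped are illustrative, SQL Server does not necessarily materialize a CTE (so the gain is not guaranteed), and the join back uses the column's collation, meaning values that compare equal (for example, differing only in case under a case-insensitive collation) share one result.

```sql
WITH Numbers(n) AS
(
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 4000
),
DistinctValues(dupedchars) AS
(
    -- Each distinct string is listed once
    SELECT DISTINCT dupedchars FROM myTable
),
Deduped(dupedchars, deduped) AS
(
    SELECT d.dupedchars,
           (   SELECT CASE
                      WHEN SUBSTRING(d.dupedchars, num.n, 1) = SUBSTRING(d.dupedchars + ' ', num.n + 1, 1) THEN ''
                      ELSE SUBSTRING(d.dupedchars, num.n, 1)
                       END AS [text()]
                 FROM Numbers AS num
                WHERE num.n <= LEN(d.dupedchars)
                ORDER BY num.n
                  FOR XML PATH('')
           )
      FROM DistinctValues AS d
)
-- Map every original row back to the result for its value
SELECT t.rowID, x.deduped
  FROM myTable AS t
  LEFT JOIN Deduped AS x
    ON x.dupedchars = t.dupedchars
OPTION (MAXRECURSION 4000);
```

It is worth comparing the execution plan and timings against the original query before relying on this, since the benefit depends on how many duplicate values the column actually holds.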
