I'm using a SQL query to clean saved text before applying post-processing in C#. The text contains embedded links, and sometimes there is no space between a link and the surrounding text. The code below gets rid of the link, but not when there is no space between the link and the next word, or when the link is at the beginning of the sentence.

if CHARINDEX(N'http',@SelectCol1) > 0
    set @link = SUBSTRING(@SelectCol1, 
                          CHARINDEX('http', @SelectCol1), 
                          LEN(@SelectCol1))

update @StringToFix 
set [links] = @link,
    [text] = REPLACE(@SelectCol1, SUBSTRING(@SelectCol1, 
                                            CHARINDEX('http', @SelectCol1), 
                                            LEN(@SelectCol1)), ' ') 
where RowID = @CurrentRow 

Original example

🔴 test test”http://t.co/pGRj7mxt6n#test#test

link extracted

http://t.co/pGRj7mxt6n#test #test

It does not work because I have not found a way to tell where the link ends when there is no space after it.

  • Without a predictable delimiter I don't see how you will ever be able to parse the link out, even with regex. Commented Aug 29, 2015 at 14:19
  • It is working properly in the answer below. Commented Aug 31, 2015 at 9:03

2 Answers

The most powerful tool for processing text is regular expressions. You can find out how to use them to process the data in your app. Here is the link

2 Comments

Thanks, but I'm actually trying to do this phase in SQL Server, not in C#. Is it possible to use regex in SQL?
@Feras Actually, I have never tried to do it, but I found some links which can help you: Link1 Link2 Link3. But remember, if they don't help, you can try to solve this problem on the client side. Every programming language has a regex library.

Here is the code, if anyone is interested. It can certainly be optimized. It uses the SQL# (SQLsharp) library.

DECLARE @RowsToProcess  int
DECLARE @CurrentRow     int
DECLARE @SelectCol1     nvarchar(max)
declare @link nvarchar(max)
DECLARE @regEx nvarchar(max)
DECLARE @FtestReg nvarchar(max)

set @regex= '(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:''.,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@
)))'
--'http://(linkd\.in|t\.co|bitly\.co|tcrn\.ch).*?(\s|$)'
declare @StringToFix table
(   
    RowID int not null primary key identity(1,1),
    text nvarchar(max),
    statusID nvarchar(50),
    links nvarchar(max) default ((0))
)
insert into  @StringToFix 
select [text],min([statusID]),'' FROM [DB].[dbo].[Statuses] 
where inResponsetoSearchID =21
group by [text]
SET @RowsToProcess=@@ROWCOUNT


SET @CurrentRow = 0
WHILE @CurrentRow < @RowsToProcess
BEGIN
    SET @CurrentRow = @CurrentRow + 1

    SELECT @SelectCol1 = [text]
    FROM @StringToFix
    WHERE RowID = @CurrentRow

    -- First link in the text, if any
    SET @FtestReg = SQL#.RegEx_MatchSimple(@SelectCol1, @regEx, 1, 'IgnoreCase')
    SET @link = @FtestReg

    -- Strip each link from the text and collect it, until no match is left
    WHILE LEN(@FtestReg) > 0
    BEGIN
        SET @SelectCol1 = REPLACE(@SelectCol1, @FtestReg, '')
        SET @FtestReg = SQL#.RegEx_MatchSimple(@SelectCol1, @regEx, 1, 'IgnoreCase')
        -- Append only real matches, so the list gets no trailing space
        IF LEN(@FtestReg) > 0
            SET @link = CONCAT(@link, ' ', @FtestReg)
    END

    UPDATE @StringToFix
    SET [links] = @link,
        [text] = @SelectCol1
    WHERE RowID = @CurrentRow
END
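
If SQL# is not available, here is a rough plain T-SQL sketch of the same idea for a single link. It is not the approach used above and rests on two assumptions of mine: there is at most one link per string, and the link ends at the first whitespace character or '#' (so a real URL fragment such as #section would be cut off). The variable names @rest, @start and @stop are made up for this illustration.

```sql
-- Sketch only: plain T-SQL, no SQL#. Assumes one link per string and that the
-- link ends at the first whitespace character or '#', or at the end of the string.
DECLARE @text  nvarchar(max) = N'test test”http://t.co/pGRj7mxt6n#test#test';
DECLARE @link  nvarchar(max);
DECLARE @rest  nvarchar(max);
DECLARE @start int = CHARINDEX(N'http', @text);
DECLARE @stop  int;

IF @start > 0
BEGIN
    -- Everything from the start of the link to the end of the string
    SET @rest = SUBSTRING(@text, @start, LEN(@text));

    -- First assumed delimiter after the link: space, tab, LF, CR or '#'
    SET @stop = PATINDEX(N'%[ #' + NCHAR(9) + NCHAR(10) + NCHAR(13) + N']%', @rest);

    -- No delimiter found: the link runs to the end of the string
    SET @link = CASE WHEN @stop > 0 THEN LEFT(@rest, @stop - 1) ELSE @rest END;

    -- Replace only the link itself with a single space
    SET @text = STUFF(@text, @start, LEN(@link), N' ');
END

SELECT @link AS link, @text AS cleaned_text;
```

Unlike REPLACE on everything from 'http' to the end of the string, STUFF removes only the link, so the text after it is kept. For the example string in the question this should give http://t.co/pGRj7mxt6n as the link and leave the hashtags in the text.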

1 Comment

If this is the solution you're using, you should mark it as answered.
