1

I work as an ASP.NET developer using C#, and I receive text like this from the client:

> <p><a
> href="http://www.vogue.co.uk/person/kate-winslet">KATE
> WINSLET</a> has given birth to a 9lb baby boy. The
> Oscar-winning actress welcomed the baby with her husband Ned Rocknroll
> at a hospital in Sussex.</p>
> 
> <p>"Kate had 'Baby Boy Winslet' on
> Saturday at an NHS Hospital," Winslet's spokeswoman
> said, adding that the family were "thrilled to
> bits".</p>
> 
> <p>The announcement suggests that the child might bear his
> mother's surname, rather than his father's slightly
> more unusual moniker.</p>
> 
> <p>The baby is Winslet's third - she is already mother
> to Mia, 13, and Joe, eight,  from previous relationships -
> and her husband's first. They met on Necker Island, owned by
> Rocknroll's uncle, Richard Branson, and<a
> href="http://www.vogue.co.uk/news/2013/kate-winslet-married-to-ned-rocknroller---wedding-details">married almost a year ago</a> in New York.</p>

I need a way to extract the plain text, without tags and special characters, using SQL Server 2008 or above.

3
  • Are you constrained to use SQL for this? It would probably be more suited to be handled in the application. Commented Dec 11, 2013 at 10:12
  • @bendataclear If there is a way to handle it in the application layer, I would like to use it. Commented Dec 11, 2013 at 10:14
  • @bendataclear I think you are being too defensive. SQL is unsuitable for this, it must be done in the application layer. Commented Dec 11, 2013 at 10:48

3 Answers

1

The best I can suggest is to use a .NET HTML parser (or similar) wrapped in a SQL CLR function. Alternatively, wrap a regex in a SQL CLR function if you prefer.

Note regex limitations: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

Raw SQL won't do it: it is not a string (or HTML) processing language.

1 Comment

Since he is developing in C# there is no need for a CLR wrapper.
1

I recently had the same requirement (to remove HTML tags and entities), so I developed this function in SQL Server.

CREATE FUNCTION CTU_FN_StripHTML (@dirtyText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    -- Cleaned text
    DECLARE @cleanText NVARCHAR(MAX) = RTRIM(LTRIM(@dirtyText));
    -- HTML tags (BIGINT: positions in NVARCHAR(MAX) can exceed the SMALLINT range)
    DECLARE @tagStart BIGINT = PATINDEX('%<%>%', @cleanText);
    DECLARE @tagEnd BIGINT;
    DECLARE @tagLength BIGINT;
    -- HTML entities
    DECLARE @entityStart BIGINT = PATINDEX('%&%;%', @cleanText);
    DECLARE @entityEnd BIGINT;
    DECLARE @entityLength BIGINT;

    -- Each pass removes at most one tag and one entity; stop when neither is left
    WHILE @tagStart > 0
        OR @entityStart > 0
    BEGIN
        -- Remove HTML tag: from the first '<' through the next '>'
        SET @tagStart = PATINDEX('%<%>%', @cleanText);
        IF @tagStart > 0
        BEGIN
            SET @tagEnd = CHARINDEX('>', @cleanText, @tagStart);
            SET @tagLength = (@tagEnd - @tagStart) + 1;
            SET @cleanText = STUFF(@cleanText, @tagStart, @tagLength, '');
        END;
        -- Remove HTML entity: from the first '&' through the next ';'
        -- (the entity is deleted, not decoded)
        SET @entityStart = PATINDEX('%&%;%', @cleanText);
        IF @entityStart > 0
        BEGIN
            SET @entityEnd = CHARINDEX(';', @cleanText, @entityStart);
            SET @entityLength = (@entityEnd - @entityStart) + 1;
            SET @cleanText = STUFF(@cleanText, @entityStart, @entityLength, '');
        END;
    END;

    SET @cleanText = RTRIM(LTRIM(@cleanText));
    RETURN @cleanText;
END;
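
As a rough usage sketch, the function can be called like any scalar function. This assumes it was created in the dbo schema (scalar functions need at least a two-part name when called); the sample strings are made up for illustration. The calls also show two limitations of the approach: entities are deleted rather than decoded, and any plain text that happens to sit between an ampersand and a later semicolon is removed as well.

```sql
-- Assumption: the function lives in the dbo schema.
-- Tags and entities are both deleted, so "&nbsp;" and "&amp;" simply vanish.
SELECT dbo.CTU_FN_StripHTML(N'<p>Hello&nbsp;<b>world</b> &amp; friends</p>');
-- returns: Helloworld  friends

-- False positive: a literal '&' followed later by ';' is treated as an entity.
SELECT dbo.CTU_FN_StripHTML(N'Tom & Jerry; the end');
-- returns: Tom  the end
```

If the text can contain ampersands and semicolons in ordinary prose, decoding entities in the application layer is probably the safer route.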


0

HTML is so complex that it's a very bad idea to do this without an HTML parser.

You might be interested in This Question. The answer that's accepted there is to just use Lynx via the command line and dump the output to a file. If you can do it outside the user's page load, it might be the best option.

5 Comments

Umm, I think I need a simpler solution.
@dotWasim The problem is, it will always be a sliding scale between simple and reliable, there won't be a simple solution that can take everything into account. You can make a basic function to remove everything between tags but this won't take into account the codes.
Thanks, I don't need any code inside the tags. Anyway, I coded it using regex for now.
@dotWasim I am reminded of this answer: stackoverflow.com/a/1732454/1281901
@bendataclear I just take the first 400 characters from the text to display as a sample, and then users can click a "read more" button to read the whole news item. So what I need is simple, and I don't need to make it complex.
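
For the 400-character preview described in the last comment, a minimal sketch in T-SQL could combine a tag-stripping function such as the CTU_FN_StripHTML function from the answer above with LEFT. The table dbo.News and its columns NewsId and Body are hypothetical names used only for illustration, and the function is assumed to live in the dbo schema.

```sql
-- Hypothetical table and column names; adjust to the real schema.
-- Strip the markup first, then cut, so the 400 characters are all visible text.
SELECT
    n.NewsId,
    LEFT(dbo.CTU_FN_StripHTML(n.Body), 400) AS Preview
FROM dbo.News AS n;
```

Stripping before truncating matters: cutting the raw HTML at 400 characters could split a tag in half. Scalar functions like this run once per row, so for large tables it may be worth computing the preview once when the article is saved rather than on every page load.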
