Remove redundant html tags from string

Question

I have various string objects containing HTML-formatted text. Some of these strings end with tags that I want to remove programmatically, like the line break and paragraph tags at the end of this one:

<li><ol>  **Text/List**  </li></ol><p><br></p><br><br>

I need to check the string from its end, but I can't figure out where to cut, or how to find the cutting point. I just need to get rid of these redundant tags.

I tried to build a function that checks the string. I know it doesn't work properly, but it's my starting point:

public static String RemoveRedundantTags(this String baseString, String html)
{
    if (html.Contains("<"))
    {
        for (Int32 i = html.Length - 1; i >= 1; i--)
        {
            if (html[i] == '<' && html[i - 1] != '>' && html[i + 1] != '/')
            {
                redundantTags = html.Substring(html[i], html.Length - i);

                html = html.Replace(redundantTags, String.Empty);

                return html;
            }
        }
    }

    return html;
}

If you don't state any conditions, there is no way to help you. The easiest way is to remove all of the HTML tags and leave only the text.
– mybirthname, Commented Sep 26, 2016 at 10:58
What does redundant mean? What makes them redundant? What have you tried so far?
– Tim Schmelter, Commented Sep 26, 2016 at 10:58
I get them from TFS. They are leftover tags from when whoever created the text added a few line breaks and didn't delete them. They create empty space in my output. I need the other HTML tags because the whole thing is inserted into Word.
– tweedledum11, Commented Sep 26, 2016 at 10:59
What about htmlstring.Replace("<br>","").Replace("<p>","").Replace("</p>",""); or something like that?
– Pikoh, Commented Sep 26, 2016 at 11:01
That would remove every paragraph and every line break in the whole string, though. These tags are not redundant in general, only the ones at the end that have no actual content.
– tweedledum11, Commented Sep 26, 2016 at 11:03

Community · Accepted Answer · 2017-05-23 12:19:29Z

If I needed to manipulate HTML, I'd use an HTML parser like HtmlAgilityPack rather than string methods or regex. Here is an example that removes all br elements from the end:

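using System.IO;
using System.Linq;
using HtmlAgilityPack;

string html = "<li><ol>  **Text/List**  </li></ol><p><br></p><br><br>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Walk the nodes from the end and collect the trailing run of <br> elements.
// ToList() materializes the matches before any node is removed.
var brToRemove = doc.DocumentNode.Descendants()
    .Reverse()
    .TakeWhile(n => n.Name == "br")
    .ToList();
foreach (HtmlNode node in brToRemove)
    node.Remove();

// Serialize the modified document back to a string.
string result;
using (StringWriter writer = new StringWriter())
{
    doc.Save(writer);
    result = writer.ToString();
}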

The result is:

<li><ol>  **Text/List**  </ol></li><p>

As you can see, it fixes parse errors by default. There was one:

Start tag <ol> was not found

If the html was

html = "<ol><li>TEXT</li></ol><p><br></p><p><br></p>&nbsp;";

and you wanted to remove all br and p tags, but also the &nbsp; from the end as commented, you could use the following approach. It uses a dictionary where the key is the tag name and the value is the set of inner-text strings that qualify a node with that name for removal, so a sub-selector. If the value is an empty sequence, the tag is removed no matter what inner text it has. Here is a dictionary for your requirement:

var tagsToRemove = new Dictionary<string, IEnumerable<string>>
{
    { "br", Enumerable.Empty<string>() },
    { "p", Enumerable.Empty<string>() },
    { "#text", new[] { "&nbsp;" } }
};

Now the LINQ query to find all tags to remove is:

var brToRemove = doc.DocumentNode.Descendants()
    .Reverse()
    .TakeWhile(n => tagsToRemove.ContainsKey(n.Name) 
                 && tagsToRemove[n.Name].DefaultIfEmpty(n.InnerText).Contains(n.InnerText));

The (desired) result is:

<ol><li>TEXT</li></ol>
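
The DefaultIfEmpty(...).Contains(...) part of the predicate is the least obvious step, so here is a minimal sketch of it in isolation, without HtmlAgilityPack. The class name TrailingNodeRule and the helper ShouldRemove are hypothetical names introduced only for this illustration; the node is replaced by two plain strings (its name and its inner text), which is an assumption made to keep the example dependency-free.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrailingNodeRule
{
    // Same shape as the dictionary in the answer:
    // tag name -> inner texts that qualify a node for removal.
    // An empty sequence means "remove regardless of inner text".
    private static readonly Dictionary<string, IEnumerable<string>> TagsToRemove =
        new Dictionary<string, IEnumerable<string>>
        {
            { "br", Enumerable.Empty<string>() },
            { "p", Enumerable.Empty<string>() },
            { "#text", new[] { "&nbsp;" } }
        };

    // Hypothetical helper: the TakeWhile predicate, with the node
    // reduced to its name and inner text.
    public static bool ShouldRemove(string name, string innerText)
    {
        // DefaultIfEmpty(innerText) turns an empty sequence into a sequence
        // holding only innerText, so Contains(innerText) is then always true.
        // A non-empty sequence is returned unchanged and acts as a whitelist.
        return TagsToRemove.ContainsKey(name)
            && TagsToRemove[name].DefaultIfEmpty(innerText).Contains(innerText);
    }

    public static void Main()
    {
        Console.WriteLine(ShouldRemove("br", ""));          // True
        Console.WriteLine(ShouldRemove("p", "anything"));   // True
        Console.WriteLine(ShouldRemove("#text", "&nbsp;")); // True
        Console.WriteLine(ShouldRemove("#text", "TEXT"));   // False
        Console.WriteLine(ShouldRemove("li", "TEXT"));      // False
    }
}
```

Because TakeWhile stops at the first node for which this check fails, a text node with real content ends the trailing run, and everything before it is kept.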

edited May 23, 2017 at 12:19

answered Sep 26, 2016 at 11:19

Tim Schmelter

7 Comments

tweedledum11 Over a year ago

That works for this specific string, thank you, it's much more than I have now, but where it doesn't work is for this string:

tweedledum11 Over a year ago

TEXT</li></ol>

Tim Schmelter Over a year ago

@tweedledum11: what is the desired result?

tweedledum11 Over a year ago

The desired result would be that there is text and used html tags (like li and ol in this example) without these linebreaking and paragraph tags that are at the end of the actual content. They are leftovers that I don't want in the resulting output.

Tim Schmelter Over a year ago

@tweedledum11: but that isn't valid HTML anyway; the li and ol tags have no start tag. If you wanted to remove the &nbsp; and the p + br tags, the only valid HTML remaining would be "TEXT". Do you want to keep that invalid HTML or do you also want to remove it?
