I have various string objects containing HTML-formatted text. Some of these strings end with certain tags that I want to remove programmatically, like the line-break and paragraph tags at the end of this one:

<li><ol>  **Text/List**  </li></ol><p><br></p><br><br>

I need to check the string from its endpoint, but I can't figure out where to cut the end off, or where to look for the cutting point. I just need to get rid of these redundant tags.

I tried to build a function that checks the string. I know it doesn't work properly, but it's my starting point:

public static String RemoveRedundantTags(this String baseString, String html)
    {
        if (html.Contains("<"))
        {
            for (Int32 i = html.Length - 1; i >= 1; i--)
            {
                if (html[i] == '<' && html[i - 1] != '>' && html[i + 1] != '/')
                {
                    redundantTags = html.Substring(html[i], html.Length - i);

                    html = html.Replace(redundantTags, String.Empty);

                    return html;
                }
            }
        }

        return html;
    }
  • If you don't say any condition, there is no way to help you. The easier way is to remove all of the HTML tags and leave only the text. Commented Sep 26, 2016 at 10:58
  • What does redundant mean? What makes them redundant? What have you tried so far? Commented Sep 26, 2016 at 10:58
  • I get them from the TFS, and they are left over tags when someone creating the whole thing made a few line breaks and didn't delete them. They create empty space in my output. I need the other html tags, because the whole thing is inserted into Word. Commented Sep 26, 2016 at 10:59
  • What about htmlstring.Replace("<br>","").Replace("<p>","").Replace("</p>",""); or something like that? Commented Sep 26, 2016 at 11:01
  • That would remove every paragraph and every linebreak in the whole string though, these types of tags are not generally redundant, just the ones at the end that have no actual content. Commented Sep 26, 2016 at 11:03

1 Answer

If I needed to manipulate HTML, I'd use an HTML parser like HtmlAgilityPack, not string methods or regex. Here is an example that removes all br tags from the end:

string html = "<li><ol>  **Text/List**  </li></ol><p><br></p><br><br>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var brToRemove = doc.DocumentNode.Descendants().Reverse().TakeWhile(n => n.Name == "br");
foreach (HtmlNode node in brToRemove)
    node.Remove();

using (StringWriter writer = new StringWriter())
{
    doc.Save(writer);
    string result = writer.ToString();
}

The result is:

<li><ol>  **Text/List**  </ol></li><p>

As you can see, by default it fixes parse errors by itself. There was one:

Start tag <ol> was not found


If the html was

html = "<ol><li>TEXT</li></ol><p><br></p><p><br></p>&nbsp;";

and you wanted to remove all <p> and <br> tags, and also the &nbsp;, from the end (as mentioned in the comments), you could use the following approach. It uses a dictionary where the key is the tag name and the value is the list of inner-text strings that tag must match, acting as a sub-selector. If the value is an empty sequence, the tag is removed no matter what inner text it has. Here is a dictionary for your requirement:

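var tagsToRemove = new Dictionary<string, IEnumerable<string>>
{
    { "br", Enumerable.Empty<string>() },    // remove regardless of inner text
    { "p", Enumerable.Empty<string>() },     // remove regardless of inner text
    { "#text", new[] { "&nbsp;" } }          // remove text nodes only if they are exactly "&nbsp;"
};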

Now the LINQ query to find all tags to remove is:

var brToRemove = doc.DocumentNode.Descendants()
    .Reverse()
    .TakeWhile(n => tagsToRemove.ContainsKey(n.Name) 
                 && tagsToRemove[n.Name].DefaultIfEmpty(n.InnerText).Contains(n.InnerText));

The (desired) result is:

<ol><li>TEXT</li></ol>
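
The DefaultIfEmpty(...).Contains(...) part of the query above is easy to misread, so here is a minimal sketch that isolates it. It needs no HtmlAgilityPack: the helper ShouldRemove and the class name are hypothetical, and the node name and inner text are passed as plain strings instead of being read from an HtmlNode. It assumes C# 7 or later for the out variable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrailingNodeFilterDemo
{
    // Hypothetical helper mirroring the TakeWhile predicate:
    // true when a node with this name and inner text qualifies for removal.
    public static bool ShouldRemove(
        IDictionary<string, IEnumerable<string>> tagsToRemove, string name, string innerText)
    {
        // Empty sequence: DefaultIfEmpty yields { innerText }, so Contains is always true.
        // Non-empty sequence: it is left unchanged, so innerText must be one of the listed values.
        return tagsToRemove.TryGetValue(name, out var allowedTexts)
            && allowedTexts.DefaultIfEmpty(innerText).Contains(innerText);
    }

    public static void Main()
    {
        var tagsToRemove = new Dictionary<string, IEnumerable<string>>
        {
            { "br", Enumerable.Empty<string>() },
            { "p", Enumerable.Empty<string>() },
            { "#text", new[] { "&nbsp;" } }
        };

        Console.WriteLine(ShouldRemove(tagsToRemove, "p", "anything"));   // True
        Console.WriteLine(ShouldRemove(tagsToRemove, "#text", "&nbsp;")); // True
        Console.WriteLine(ShouldRemove(tagsToRemove, "#text", "TEXT"));   // False
        Console.WriteLine(ShouldRemove(tagsToRemove, "li", ""));          // False
    }
}
```

Because TakeWhile stops at the first node for which this check is false, only the unbroken run of matching nodes at the very end of the document is selected.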

7 Comments

That works for this specific string, thank you, it's much more than I have now, but where it doesn't work is for this string:
TEXT</li></ol><p><br></p><p><br></p>&nbsp;
@tweedledum11: what is the desired result?
The desired result would be that there is text and used html tags (like li and ol in this example) without these linebreaking and paragraph tags that are at the end of the actual content. They are leftovers that I don't want in the resulting output.
@tweedledum11: but that isn't valid HTML anyway, the li and ol tags have no start tag. If you wanted to remove the &nbsp; and the <br> + <p> tags, the only valid HTML remaining would be "TEXT". Do you want to keep that invalid HTML or do you also want to remove it?