0

Possible Duplicate:
Using C# regular expressions to remove HTML tags

I'm trying to write code that returns only the text content of an HTML file. The best approaches I've come up with are either to remove everything inside <...> brackets, or to collect all the text between >...< brackets. I'm pretty new to regular expressions, but I'm pretty sure they're the way to go.

Here's the code I've tried:

    Regex reg = new Regex(@"<.*>");
    file = reg.Replace(file, "");

This works as long as there is only one <...> before a block of text. In any file that has two or more of those elements in sequence, like <...><...>, it just starts deleting any text it finds. Can someone tell me what I'm doing wrong?

2 Comments
  • Just try the test string in the comment. stackoverflow.com/a/12510496/932418 Commented Sep 25, 2012 at 19:15
  • .*? will work like a charm, unless you want something else to be removed. Commented Sep 25, 2012 at 19:23

2 Answers

0

Regular expression quantifiers are greedy by default (they match the longest string they can). Depending on the regex flavor you are using, look for the +? or *? operators, which try for the shortest match instead. Otherwise you will have to build a different regex.
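
To make the difference concrete, here is a minimal C# sketch that runs both patterns over the same input. The class name, the Strip helper, and the sample string are made up for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: removes everything the pattern matches.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        string sample = "<p><b>Hello</b> world</p>";

        // Greedy: .* runs from the first '<' to the LAST '>' on the line,
        // so the text in between is deleted along with the tags.
        Console.WriteLine("[" + Strip(sample, @"<.*>") + "]");   // prints []

        // Lazy: .*? stops at the first '>' after each '<',
        // so each tag is matched and removed separately.
        Console.WriteLine("[" + Strip(sample, @"<.*?>") + "]");  // prints [Hello world]
    }
}
```

The only difference between the two calls is the ? after the *, which switches the quantifier from longest match to shortest match.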

1 Comment

Thanks, I'll read up more on greediness.
0

Well, the unexpected behavior you're getting is because your regular expression is greedy.

If you change your regex to

    Regex reg = new Regex(@"<.*?>");
    file = reg.Replace(file, ""); 

you'll get what you expect.

Also, know that regex doesn't handle nesting, which HTML has a lot of. I'd avoid using regex to parse HTML unless you're trying to match a very specific thing in a specifically formed piece of HTML.
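
As a rough illustration of why the lazy pattern is only safe on well-behaved input, the sketch below runs it on one simple string and on one where an attribute value contains a '>'. This is just one way the pattern can go wrong, not an exhaustive list. The class name, the StripTags helper, and both sample strings are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyTagStripLimit
{
    // Hypothetical helper: strips anything that looks like <...> using the lazy pattern.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, @"<.*?>", "");
    }

    public static void Main()
    {
        // Simple markup: only the text is left.
        Console.WriteLine("[" + StripTags("<p>Hello <b>world</b></p>") + "]");  // prints [Hello world]

        // A '>' inside an attribute value ends the match too early,
        // so the tail of the attribute leaks into the output.
        Console.WriteLine("[" + StripTags("<a title='a > b'>link</a>") + "]");  // prints [ b'>link]
    }
}
```

The regex has no notion of quoted attribute values, so it cannot tell the '>' inside title='a > b' from the one that closes the tag. An HTML parser tracks that context, which is why it is the safer choice for arbitrary input.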

8 Comments

Thanks. Should I use HTML agility pack instead? I saw that referenced in a comment.
@Sam what does your code give for <h4 title='e>Sh<opping'>it happens</h4> ?
@GeorgeAbrahamSiegelDuffy I've never really had to parse HTML myself, but I would definitely have a look at it if you need to parse HTML.
@L.B It renders "shit happens". You've made that comment/linked to that comment at least 3 times in this single thread. Stop being a broken record, and also read the answer you're replying to.
@Sam When you stop trying to parse HTML with regex, I will stop giving the same reference.