regular expression to eliminate text inside < and > [duplicate]

Question

Possible Duplicate:
Using C# regular expressions to remove HTML tags

I'm trying to write code that returns only the text content of an HTML file. The best approach I've come up with is either to eliminate everything inside <...> brackets, or to collect all the text between >...< brackets. I'm pretty new to regular expressions, but I'm pretty sure they're the way to go.

Here's the code I've tried:

        Regex reg = new Regex(@"<.*>");
        file = reg.Replace(file, "");

This works as long as there is only one <...> before a block of text. With any file that has two or more of those elements in sequence, like <...><...>, it just starts deleting any text it finds. Can someone tell me what I'm doing wrong?

Just try the test string in the comment: stackoverflow.com/a/12510496/932418
– L.B, Commented Sep 25, 2012 at 19:15
.*? will work like a charm, unless you want something else to be removed.
– Pradip, Commented Sep 25, 2012 at 19:23

Enoban · Accepted Answer · 2012-09-25 19:18:14Z

0

Regex quantifiers are greedy by default (they match the longest string they can). Depending on the language you are using, look for the lazy +? or *? quantifiers, which try the shortest match instead. Otherwise you must build a different regex.
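
To make the difference concrete, here is a minimal sketch in C# (the language used in the question). The names `GreedyVsLazy`, `StripGreedy`, and `StripLazy` are made up for illustration; only `Regex.Replace` comes from .NET.

```csharp
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Greedy: ".*" takes as much as it can, so a single match can run
    // from the first '<' to the last '>' on the same line.
    public static string StripGreedy(string input) =>
        Regex.Replace(input, @"<.*>", "");

    // Lazy: ".*?" takes as little as it can, so each match ends at the
    // first '>' that follows its '<'.
    public static string StripLazy(string input) =>
        Regex.Replace(input, @"<.*?>", "");
}
```

For the input `<b>Hello</b> <i>World</i>`, `StripGreedy` returns an empty string because one match swallows the whole line, while `StripLazy` returns `Hello World`.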



1 Comment

George Abraham Siegel Duffy Over a year ago

Thanks, I'll read up more on greediness.

Sam I am says Reinstate Monica · Accepted Answer · 2012-09-25 19:18:45Z

0

Well, the unexpected behavior you're getting is because your regular expression is greedy.

If you change your regex to

    Regex reg = new Regex(@"<.*?>");
    file = reg.Replace(file, "");

you'll get what you expect.

Also, be aware that regex doesn't handle nesting, which HTML has a lot of. I'd avoid using regex to parse HTML unless you're trying to match a very specific thing in a piece of HTML whose form you know in advance.
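
As a rough illustration of that limit, the sketch below runs the lazy pattern on the test string from the comments under this answer, next to a pattern that skips over quoted attribute values. The names `TagStripper`, `StripLazy`, and `StripQuoteAware` are hypothetical, and the quote-aware pattern is only an assumption about one way to tighten things. It is not an HTML parser: it still knows nothing about comments, `<script>` blocks, or malformed markup.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Lazy pattern from the answer: each match ends at the first '>',
    // even when that '>' sits inside a quoted attribute value.
    public static string StripLazy(string html) =>
        Regex.Replace(html, @"<.*?>", "");

    // Illustrative alternative: inside a tag, consume either a character
    // that is not '>' or a quote, or a whole quoted string, so a '>'
    // inside quotes does not end the tag.
    public static string StripQuoteAware(string html) =>
        Regex.Replace(html, @"<(?:[^>""']|""[^""]*""|'[^']*')*>", "");
}
```

For `<h4 title='e>Sh<opping'>it happens</h4>`, `StripLazy` returns `Shit happens` because it cuts the tag at the `>` inside the attribute, while `StripQuoteAware` returns `it happens`.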



8 Comments

George Abraham Siegel Duffy Over a year ago

Thanks. Should I use HTML Agility Pack instead? I saw that referenced in a comment.

L.B Over a year ago

@Sam what does your code give for <h4 title='e>Sh<opping'>it happens</h4> ?

Sam I am says Reinstate Monica Over a year ago

@GeorgeAbrahamSiegelDuffy I've never really had to parse HTML myself, but I would definitely have a look at it if you need to parse HTML.

Sam I am says Reinstate Monica Over a year ago

@L.B It renders "shit happens". You've made that comment, or linked to it, at least 3 times in this single thread. Stop being a broken record, and also read the answer you're replying to.

L.B Over a year ago

@Sam When you stop trying to parse HTML with regex, I will stop giving the same reference.
