Problem With Regular Expression to Remove HTML Tags

Question

In my Ruby app, I've used the following method and regular expression to remove all HTML tags from a string:

str.gsub(/<\/?[^>]*>/,"")

This regular expression did just about all I was expecting it to, except it caused all quotation marks to be transformed into &#8220; and all single quotes to be changed to &#8221;.

What's the obvious thing I'm missing to convert the messy codes back into their proper characters?

Edit: The problem occurs with or without the Regular Expression, so it's clear my problem has nothing to do with it. My question now is how to deal with this formatting error and correct it. Thanks!

Are you using some kind of “quote beautifier”?

– Gumbo, Commented Feb 13, 2009 at 0:15

vladr · Accepted Answer · 2009-02-14 23:21:26Z

5

Use CGI::unescapeHTML after you perform your regular expression substitution:

CGI::unescapeHTML(str.gsub(/<\/?[^>]*>/,""))

See http://www.ruby-doc.org/core/classes/CGI.html#M000547

In the above code snippet, gsub removes all HTML tags. Then unescapeHTML() converts all HTML entities (such as &lt; and &#8220;) back to their actual characters (<, quotation marks, etc.).

With respect to another post on this page, note that you will never ever be passed HTML such as

<tag attribute="<value>">2 + 3 < 6</tag>

(which is invalid HTML); what you may receive is, instead:

<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>

The call to gsub will transform the above to:

2 + 3 &lt; 6

And unescapeHTML will finish the job:

2 + 3 < 6
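The two steps described above can be sketched as a single helper. This is a minimal illustration, not a definitive implementation: the method name strip_tags_and_unescape and the sample string are made up for the example, and it assumes the input is a UTF-8 string so that numeric entities such as &#8220; can be decoded.

```ruby
require 'cgi'

# Hypothetical helper name, for illustration only.
# Step 1: remove anything that looks like an HTML tag.
# Step 2: turn HTML entities back into their characters.
def strip_tags_and_unescape(html)
  text = html.gsub(/<\/?[^>]*>/, "")
  CGI.unescapeHTML(text)
end

sample = '<p>&#8220;2 + 3 &lt; 6&#8221;</p>'
puts strip_tags_and_unescape(sample)
# prints: “2 + 3 < 6”
```

The order matters: unescaping first would turn &lt; into a literal <, which the tag-stripping pattern could then mistake for the start of a tag.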

edited Feb 14, 2009 at 23:21

answered Feb 14, 2009 at 23:04


Sniggerfardimungus · Accepted Answer · 2009-02-12 23:45:50Z

2

You're going to run into more trouble when you see something like:

<doohickey name="<foobar>">

You'll want to apply something like:

gsub(/<[^<>]*>/, "")

...for as long as the pattern matches.
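One way to read "for as long as the pattern matches" is a loop around gsub!, which returns nil once nothing is left to replace. The sketch below assumes that reading; the method name strip_tags_repeatedly and the trailing text in the sample are hypothetical.

```ruby
# Hypothetical helper name, for illustration only.
# Repeatedly removes the innermost <...> groups until none remain.
def strip_tags_repeatedly(str)
  result = str.dup
  # gsub! returns nil when no substitution was made, which ends the loop.
  nil while result.gsub!(/<[^<>]*>/, "")
  result
end

input = '<doohickey name="<foobar>">text'

# A single pass with the pattern from the question leaves debris behind,
# because the match ends at the first ">".
puts input.gsub(/<\/?[^>]*>/, "")
# prints: ">text

puts strip_tags_repeatedly(input)
# prints: text
```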

answered Feb 12, 2009 at 23:45


1 Comment

R.. GitHub STOP HELPING ICE Over a year ago

Unless you meant that the OP should be prepared to deal with bogus/invalid HTML, you're wrong. This form will never appear in correct HTML.

Community · Accepted Answer · 2017-05-23 12:19:25Z

2

This regular expression did just about all I was expecting it to, except it caused all quotation marks to be transformed into &#8220; and all single quotes to be changed to &#8221;.

This doesn't sound like something the regular expression would do. Are you sure the text is any different before the substitution?

See this question for more information about the problem; it has an excellent answer:
Get non UTF-8 form fields as UTF-8 in php.

edited May 23, 2017 at 12:19

CommunityBot


answered Feb 13, 2009 at 0:10

Georg Schölly


1 Comment

btw Over a year ago

Holy cow, you're right. I noticed it after adding the Regex, but the effect on characters happens either way. I just hadn't noticed due to it being less obvious. So my question becomes: How do I fix this formatting?

lazyfly · Accepted Answer · 2009-02-13 21:15:00Z

0

I've run into a similar problem with character changes. It happened when my code ran through another module that enforced UTF-8 encoding; when the data came back, I had a different file (a slurped array of lines) on my hands.

answered Feb 13, 2009 at 21:15


Tim · Accepted Answer · 2009-02-12 23:40:29Z

-3

You could use a multi-pass system to get the results you are looking for.

After running your regular expression, run one expression to convert &#8220; to quotes and another to convert &#8221; to single quotes.
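A rough sketch of such an entity-replacement pass is shown below. The answer does not say exactly which characters should be produced, so the mapping here (both entities become a plain ASCII double quote) is an assumption, as are the names QUOTE_ENTITIES and replace_quote_entities. It uses the form of gsub that takes a hash, so both entities are handled in one call.

```ruby
# Assumed mapping: both curly-quote entities become a plain ASCII double quote.
QUOTE_ENTITIES = {
  "&#8220;" => '"',
  "&#8221;" => '"'
}.freeze

# Hypothetical helper name, for illustration only.
# Each match is looked up in the hash and replaced by its value.
def replace_quote_entities(str)
  str.gsub(/&#822[01];/, QUOTE_ENTITIES)
end

puts replace_quote_entities('He said &#8220;hi&#8221;')
# prints: He said "hi"
```

This only covers the two entities listed; any other entity in the text is left untouched.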

answered Feb 12, 2009 at 23:40
