skip over HTML tags in Regular Expression patterns

Question

I'm trying to write a regular expression pattern (in Python) for reformatting these template engine files.

Basically the scheme looks like this:

[$$price$$]
{
    <h3 class="price">
    $12.99
    </h3>
}

I want it to remove any extra tabs, spaces, and newlines, so the result should look like this:

[$$price$$]{<h3 class="price">$12.99</h3>}

I wrote this: (\t|\s)+? which works, except that it also matches inside the HTML tags, so h3 class becomes h3class. I can't figure out how to make it ignore anything inside the tags.
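For reference, a minimal reproduction of the problem described above. The `sample` string is the template from the question, and `strip_all_whitespace` is a hypothetical helper name used only for illustration.

```python
import re

# The template from the question, with the original line breaks and indentation.
sample = '[$$price$$]\n{\n    <h3 class="price">\n    $12.99\n    </h3>\n}'


def strip_all_whitespace(text):
    # \s already includes \t, and the lazy +? matches one character at a time,
    # so this removes every whitespace character, inside tags or not.
    return re.sub(r"(\t|\s)+?", "", text)


print(strip_all_whitespace(sample))
# prints: [$$price$$]{<h3class="price">$12.99</h3>}
```

The space between h3 and class is whitespace like any other, so the pattern has no way to treat it differently.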

You can always hash all the HTML tags (put a marker in place of the HTML), strip the spaces, then unhash the tags...
– mpen, Commented Apr 9, 2009 at 1:37
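A rough sketch of the placeholder idea from the comment above, not the commenter's own code. It assumes tags never contain a literal > inside attribute values and that the input contains no NUL characters, which are used here as the marker. `compact_with_placeholders` is a hypothetical name.

```python
import re

TAG = re.compile(r"<[^>]*>")


def compact_with_placeholders(text):
    tags = []

    def stash(match):
        # Remember the tag and leave a numbered marker in its place.
        tags.append(match.group(0))
        return f"\x00{len(tags) - 1}\x00"

    masked = TAG.sub(stash, text)
    # With the tags hidden, it is safe to strip every whitespace character.
    masked = re.sub(r"\s+", "", masked)
    # Put the original tags back, untouched.
    return re.sub(r"\x00(\d+)\x00", lambda m: tags[int(m.group(1))], masked)


sample = '[$$price$$]\n{\n    <h3 class="price">\n    $12.99\n    </h3>\n}'
print(compact_with_placeholders(sample))
# prints: [$$price$$]{<h3 class="price">$12.99</h3>}
```

Note that this also removes spaces between words in the text itself, so it only fits templates where the text has no meaningful internal whitespace.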

Charles Duffy · Accepted Answer · 2009-04-09 01:33:31Z

5

Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.

Instead, use an HTML/XML-aware library (such as lxml) to build a DOM-style object tree; modify the text segments within the tree in-place, and generate your output again using said library.
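As an illustration of the parser-based approach, here is a sketch that swaps in the standard library's html.parser instead of lxml, so it runs without extra dependencies. It is a streaming parser rather than a DOM tree, but the principle is the same: tags are passed through verbatim and only text segments are modified. The class and function names are hypothetical, and the whitespace rule (drop each newline plus the indentation after it) is an assumption about what the template needs.

```python
import re
from html.parser import HTMLParser

# Assumed cleanup rule: a newline plus any spaces/tabs that follow it.
NEWLINE_INDENT = re.compile(r"\r?\n[ \t]*")


class TextOnlyStripper(HTMLParser):
    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Emit the start tag exactly as it appeared in the input.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text outside tags is ever cleaned up.
        self.parts.append(NEWLINE_INDENT.sub("", data))

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")


def compact(template):
    parser = TextOnlyStripper()
    parser.feed(template)
    parser.close()
    return "".join(parser.parts)


sample = '[$$price$$]\n{\n    <h3 class="price">\n    $12.99\n    </h3>\n}'
print(compact(sample))
```

The template wrapper ([$$price$$] and the braces) contains no < characters, so the parser sees it as ordinary text. Doctype declarations and processing instructions are not handled in this sketch and would be dropped.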

2 Comments

Alan Moore Over a year ago

The question isn't really about HTML, it's about whitespace, and it's well within the capabilities of regexes.

Charles Duffy Over a year ago

Alan - it's about doing whitespace removal in a context-sensitive manner; handling the general case calls for something with the expressiveness of a recursive descent parser.

Alan Moore · Accepted Answer · 2009-04-09 04:50:30Z

0

Try this:

\r?\n[ \t]*

EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follows them.
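Applied in Python, the pattern above could be used like this. `strip_line_breaks` is a hypothetical helper name, and `sample` is the template from the question.

```python
import re

# A newline (Unix or Windows) plus any spaces or tabs right after it.
NEWLINE_INDENT = re.compile(r"\r?\n[ \t]*")


def strip_line_breaks(text):
    return NEWLINE_INDENT.sub("", text)


sample = '[$$price$$]\n{\n    <h3 class="price">\n    $12.99\n    </h3>\n}'
print(strip_line_breaks(sample))
# prints: [$$price$$]{<h3 class="price">$12.99</h3>}
```

Spaces inside a line, such as the one in h3 class, are left alone. A tag that is itself split across lines would still be joined without a space, so this relies on each tag staying on one line.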

edited Apr 9, 2009 at 4:50

answered Apr 9, 2009 at 1:36

2 Comments

Charles Duffy Over a year ago

That works for the example given -- but we haven't been given a formal definition for the template syntax, and so don't know if it works in the general case.

Alan Moore Over a year ago

And we probably never will be given one; I've never seen any follow-up from anyone posting as "unknown (whatever)".

Alexis Wilke · Accepted Answer · 2009-05-31 21:04:54Z

Alan,

I have to agree with Charles that the safest way is to parse the HTML and then work on the text nodes only. It sounds like overkill, but it is the safest approach.

On the other hand, there is a way to do this with regexes, as long as you trust that the HTML is well formed (i.e. it does not include stray < and > characters inside the tags, as in: <a title="<this is a test>" href="look here">...).

Then you know that any text has to sit between > and <, except at the very beginning and end (if you only have a fragment of the page; otherwise there is at least the enclosing HTML tag).

So you still need two regexes: find the text with '>[^<]+<', then apply the other regex, as you mentioned, to each match.
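A sketch of the two-regex approach, under the same assumption that no tag contains a stray < or >. The outer pattern here adds ^ and $ alternatives to the '>[^<]+<' pattern so that text at the very beginning and end is covered too; that extension, the inner whitespace rule, and the function name are assumptions, not part of the answer.

```python
import re

# Text between tags; ^ and $ also cover text before the first tag
# and after the last one (an addition to the pattern in the answer).
TEXT_RUN = re.compile(r"(?:^|>)[^<]+(?:<|$)")
# Inner cleanup rule: a newline plus the indentation that follows it.
NEWLINE_INDENT = re.compile(r"\r?\n[ \t]*")


def compact_between_tags(text):
    # The second regex is applied only to what the first one matched.
    return TEXT_RUN.sub(lambda m: NEWLINE_INDENT.sub("", m.group(0)), text)


sample = '[$$price$$]\n{\n    <h3 class="price">\n    $12.99\n    </h3>\n}'
print(compact_between_tags(sample))
# prints: [$$price$$]{<h3 class="price">$12.99</h3>}
```

Because the outer match always starts at a > (or at the start of the input), whitespace inside a tag is never passed to the inner regex, even when a tag spans several lines.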

The other way is to use an alternation, something like this (not tested!):

'(<[^>]*>)|([\r\n\f ]+)'

This will match either a tag or a run of whitespace. When the match is a tag, leave it unchanged; otherwise, replace it with an empty string.
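In Python, that tag-or-whitespace alternation could be driven by a replacement function, roughly as follows. This is a sketch, not tested against real templates; \t has been added to the character class because the question mentions tabs, and the function name is hypothetical.

```python
import re

# Group 1: a whole tag. Group 2: a run of whitespace outside any tag.
# \t is an addition to the character class given in the answer.
TAG_OR_SPACE = re.compile(r"(<[^>]*>)|([\r\n\f\t ]+)")


def strip_outside_tags(text):
    # If group 1 matched, put the tag back unchanged; otherwise the match
    # was whitespace and is replaced with an empty string.
    return TAG_OR_SPACE.sub(lambda m: m.group(1) or "", text)


sample = '[$$price$$]\n{\n    <h3 class="price">\n    $12.99\n    </h3>\n}'
print(strip_outside_tags(sample))
# prints: [$$price$$]{<h3 class="price">$12.99</h3>}
```

The tag alternative is tried first at each position, so a tag is consumed whole and the whitespace inside it is never seen by the second alternative. Spaces between words in the text are removed as well, so "two words" would become "twowords".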
