0

I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.

Basically the scheme looks like this:

[$$price$$]
{
    <h3 class="price">
    $12.99
    </h3>
}

I'm trying to remove any extra tabs, spaces, and newlines so that it looks like this:

[$$price$$]{<h3 class="price">$12.99</h3>}

I wrote this: (\t|\s)+? It works, except that it also matches whitespace inside the HTML tags, so "h3 class" becomes "h3class", and I can't figure out how to make it ignore anything inside the tags.

1
  • you can always hash all the html tags (put a marker in place of the html), strip the spaces, then unhash the tags... Commented Apr 9, 2009 at 1:37

3 Answers

5

Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.

Instead, use an HTML/XML-aware library (such as lxml) to build a DOM-style object tree, modify the text segments within the tree in place, and generate your output again using that library.


2 Comments

The question isn't really about HTML, it's about whitespace, and it's well within the capabilities of regexes.
Alan - it's about doing whitespace removal in a context-sensitive manner; handling the general case calls for something with the expressiveness of a recursive descent parser.
0

Try this:

\r?\n[ \t]*

EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
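
A minimal sketch of applying this pattern with Python's `re.sub`, using the template from the question as input. The variable names are illustrative and not from the original post.

```python
import re

# The template from the question, with newlines and four-space indentation.
template = (
    '[$$price$$]\n'
    '{\n'
    '    <h3 class="price">\n'
    '    $12.99\n'
    '    </h3>\n'
    '}\n'
)

# Remove each newline (Unix or Windows) plus any tabs/spaces right after it.
collapsed = re.sub(r'\r?\n[ \t]*', '', template)
print(collapsed)
```

The space inside `<h3 class="price">` survives because it does not follow a newline. Note that whitespace sitting at the end of a line, before the newline, is not touched by this pattern.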

2 Comments

That works for the example given -- but we haven't been given a formal definition for the template syntax, and so don't know if it works in the general case.
And we probably never will be given one; I've never seen any follow-up from anyone posting as "unknown (whatever)".
0

Alan,

I have to agree with Charles that the safest way is to parse the HTML, then work on the text nodes only. It sounds like overkill, but it's the safest option.

On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)

Then you know that any text has to be between > and <, except at the very beginning and end (that is, if you only have a fragment of the page; a full document is wrapped in at least the HTML tag).

So... you still need two regexes: find the text with '>[^<]+<', then apply the other regex, as you mentioned, to each match.
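
A rough sketch of that two-regex idea in Python, under the same assumption that no stray < or > appears inside a tag. The name `strip_between_tags` is hypothetical, and `\s+` stands in for the whitespace pattern from the question.

```python
import re

# First regex: a run of text sitting between the end of one tag
# and the start of the next.
BETWEEN_TAGS = re.compile(r'>[^<]+<')

def strip_between_tags(text):
    # Second regex: applied only to each matched text run,
    # so whitespace inside the tags themselves is never seen.
    return BETWEEN_TAGS.sub(lambda m: re.sub(r'\s+', '', m.group(0)), text)

fragment = '<h3 class="price">\n    $12.99\n    </h3>'
print(strip_between_tags(fragment))
```

Text before the first tag or after the last one is not enclosed by > and <, so this sketch leaves it alone; it would need separate handling. It also removes spaces between words in the text, which may or may not be wanted.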

The other way is to use an alternation, with something like this (not tested!):

'(<[^>]*>)|([\r\n\f ]+)'

This matches either a tag or a run of whitespace. When the match is a tag, leave it unchanged; otherwise, replace it with an empty string.
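
One way this could look in Python is to pass a function as the replacement to `re.sub`. This is a sketch, not a tested general solution: the name `strip_outside_tags` is hypothetical, and `\t` has been added to the character class as an assumption, since the pattern above does not match tabs.

```python
import re

# Group 1 captures a whole tag; group 2 captures a run of whitespace.
# Because the tag alternative comes first, whitespace inside a tag is
# consumed as part of the tag and never matched on its own.
TAG_OR_SPACE = re.compile(r'(<[^>]*>)|([\r\n\f\t ]+)')

def strip_outside_tags(text):
    # Keep tags as they are; replace whitespace runs with nothing.
    return TAG_OR_SPACE.sub(lambda m: m.group(1) or '', text)

template = (
    '[$$price$$]\n'
    '{\n'
    '    <h3 class="price">\n'
    '    $12.99\n'
    '    </h3>\n'
    '}\n'
)
print(strip_outside_tags(template))
```

As written, this also removes spaces between words in the text outside tags, so it only suits templates where that is acceptable.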
