The following is the target string.

July 17, 2007 –<br> September 25, 2009 <br> June 2007 - July 2010

I am trying to add a newline before each <br> tag that does NOT follow a -. Thus, the resulting string should be:

July 17, 2007 –<br> September 25, 2009 \n<br> June 2007 - July 2010

I tried the following regular expression to no avail.

re.sub(r'([^-])(\s*<br)',r'\1\n\2', astring)

gives me

July 17, 2007 –\n<br> September 25, 2009\n <br> June 2007 - July 2010

What is the solution?

UPDATE:

I am not actually parsing the HTML with regular expressions. I know that the HTML + regex combo will drive me to insanity. I am already using lxml to parse the HTML. However, what I am not able to understand is why the regex doesn't catch the -\s*< pattern.

2 Answers

4

The dash character in your text is an EN DASH (U+2013), not the hyphen-minus (U+002D) used in your regex. That is why ([^-]) matches the EN DASH and a replacement occurs.
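One way to confirm which dash a string actually contains is to print the code point and Unicode name of each dash-like character. This is a minimal sketch using the sample string from the question; `describe_dashes` is a hypothetical helper name, not part of any library.

```python
import unicodedata

def describe_dashes(text):
    """Return (char, code point, Unicode name) for every dash-like character."""
    # Category "Pd" is Unicode's "Punctuation, dash" class; it covers
    # HYPHEN-MINUS (U+002D) as well as EN DASH (U+2013).
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch))
        for ch in text
        if unicodedata.category(ch) == "Pd"
    ]

# Sample string from the question, with the EN DASH written as an escape.
astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

for entry in describe_dashes(astring):
    print(entry)
```

The first entry reported is the EN DASH before the first <br>, the second is the ordinary hyphen in the last date range.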

You need to add the EN DASH to the negated character class, move the \s* into the first capturing group, and add \s to the negated character class so that it works as you want:

re.sub(r'([^\s–-]\s*)(<br)',r'\1\n\2', astring)

Note that while the code above works, it is not maintainable, since it is very hard to spot the EN DASH in the character class.

Since Python 3.3, the re module recognizes the \u and \U Unicode escape sequences in patterns, so you can write the regex as:

re.sub(r'([^\s\u2013-]\s*)(<br)',r'\1\n\2', astring)
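For anyone who wants to check this against the string from the question, here is a small self-contained sketch (Python 3.3 or later assumed; the variable names are only illustrative):

```python
import re

# Sample string from the question; \u2013 is the EN DASH before the first <br>.
astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

# The negated class excludes whitespace, EN DASH and hyphen;
# the trailing "-" inside the brackets is a literal hyphen.
pattern = re.compile(r'([^\s\u2013-]\s*)(<br)')

result = pattern.sub(r'\1\n\2', astring)
print(repr(result))
```

Only the second <br> gets a newline in front of it, because the first one is preceded by the EN DASH.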

It is arguably less clear what \u2013 is, but at least readers of the code won't get tripped up.

For Python 3.2 and below, you can use a normal string literal instead of a raw string literal for the regex to achieve the same effect:

re.sub('([^\\s\u2013-]\\s*)(<br)',r'\1\n\2', astring)

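Technically, because of the way Python parses string literals (a \ is preserved if it does not form a valid escape sequence), '([^\s\u2013-]\s*)(<br)' also works (compare \\s and \s), but I double the backslashes just to be safe. Newer Python versions (3.6 and later) also emit a warning for such unrecognized escapes, which is another reason to double them.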
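To see how the raw and the non-raw spelling relate, the sketch below compares the two pattern strings. They are different strings (in the raw one, \u2013 is six characters that re interprets; in the normal one, Python itself turns it into a single EN DASH), yet they produce the same substitution. The variable names are only illustrative.

```python
import re

astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

# Raw literal: the backslashes reach the re module untouched,
# and re itself decodes \u2013 (Python 3.3+).
raw_pattern = r'([^\s\u2013-]\s*)(<br)'

# Normal literal: Python decodes \u2013 while parsing the string,
# so \s has to be written as \\s to survive as a regex escape.
escaped_pattern = '([^\\s\u2013-]\\s*)(<br)'

print(raw_pattern == escaped_pattern)  # the strings themselves differ
print(re.sub(raw_pattern, r'\1\n\2', astring)
      == re.sub(escaped_pattern, r'\1\n\2', astring))  # the results agree
```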


5 Comments

This gets me every time, dammit! Instead of playing with those two characters, I have decided to first replace all the EN DASH with hyphen -. Then, I would require only hyphen - in the regex.
This doesn't work when I have whitespace between - and <br>. Apparently, [^-] matches whitespace too, which breaks the regexp.
I improved the regexp thus - ([^-\s]\s*)(<br). It seems to have solved the problem. Tell me if there is a catch in this.
@MisterBhoot: It seems that we arrived at the same solution.
@nhahtdh - Yepp. Thanks for your quick effort anyways.
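The normalize-first approach mentioned in the comments could look roughly like the sketch below. It assumes that turning every EN DASH in the text into a plain hyphen is acceptable, which changes the displayed dash and may not suit every use case.

```python
import re

astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

# Step 1: replace every EN DASH (U+2013) with a plain hyphen-minus.
normalized = astring.replace("\u2013", "-")

# Step 2: now only the ASCII hyphen (and whitespace) has to be excluded.
result = re.sub(r'([^-\s]\s*)(<br)', r'\1\n\2', normalized)
print(repr(result))
```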
1

The – in your string and the - in your regex are not the same character. Try:

re.sub(r'([^–])(\s*<br)',r'\1\n\2', astring)

1 Comment

Though I understood the issue already from your answer (which was the first to appear in my feed), @nhahtdh's answer is more comprehensive. I will accept his answer.
