Failure in finding '-<' substring using regex

Question

The following is the target string.

July 17, 2007 –<br> September 25, 2009 <br> June 2007 - July 2010

I am trying to add a newline before every <br> tag that does NOT follow a -. The resulting string should therefore be:

July 17, 2007 –<br> September 25, 2009 \n<br> June 2007 - July 2010

I tried the following regular expression to no avail.

re.sub(r'([^-])(\s*<br)',r'\1\n\2', astring)

gives me

July 17, 2007 –\n<br> September 25, 2009\n <br> June 2007 - July 2010

What is the solution?

UPDATE:

I am not actually parsing the HTML with regular expressions. I know that combining HTML with regex is a road to insanity, and I am already using lxml to parse the HTML. What I cannot understand is why the regex fails to catch the -\s*< pattern.

nhahtdh · Accepted Answer · 2013-04-24 16:54:16Z

4

The dash character in your text is an EN DASH (U+2013), not the ASCII hyphen-minus (U+002D) used in your regex. That is why ([^-]) matches the EN DASH and a replacement occurs.
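One way to confirm which dash a string really contains is to ask Python for the code point and Unicode name of each suspicious character. The snippet below is a minimal sketch; the variable name `suspects` is only illustrative, and the sample string is rebuilt here with an explicit `\u2013` escape on the assumption that it matches the text in the question.

```python
import unicodedata

# The sample string from the question; the first dash is U+2013, the last is U+002D.
astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

# Collect every non-ASCII character together with its code point and Unicode name.
suspects = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in astring if ord(ch) > 127]
print(suspects)  # [('U+2013', 'EN DASH')]

# The ASCII hyphen is a different character, so [^-] does not exclude the EN DASH.
print(unicodedata.name("-"))  # HYPHEN-MINUS
```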

You need to add that character to your character class, move the \s* into the first capturing group, and add \s to the negated character class so that it works as you want:

re.sub(r'([^\s–-]\s*)(<br)',r'\1\n\2', astring)

Note that while the code above works, it is hard to maintain, since the EN DASH in the character class is very easy to overlook.

From Python 3.3 onward, the re module understands the \u and \U Unicode escape sequences in patterns, so you can write the regex as:

re.sub(r'([^\s\u2013-]\s*)(<br)',r'\1\n\2', astring)
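As a quick check, the escaped pattern can be run against the sample string from the question. This is a sketch for Python 3.3 or later; the names `pattern` and `result` are just for illustration.

```python
import re

astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

# Group 1: one character that is not whitespace, EN DASH (\u2013) or hyphen,
#          followed by any whitespace.
# Group 2: the start of the <br> tag.
pattern = re.compile(r'([^\s\u2013-]\s*)(<br)')

# Insert the newline between the two groups, i.e. directly before <br.
result = pattern.sub(r'\1\n\2', astring)
print(repr(result))
# 'July 17, 2007 –<br> September 25, 2009 \n<br> June 2007 - July 2010'
```

The first <br> is left alone because the only character directly in front of it is the EN DASH, which the negated class rejects.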

It is arguably less clear what \u2013 is, but at least readers of the code won't get tripped up.

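For Python 3.2 and below, you can use a normal string literal instead of a raw string literal for the regex to achieve the same effect, because Python itself then converts \u2013 into the character before the pattern reaches re (in Python 2 this requires a u'' unicode literal):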

re.sub('([^\\s\u2013-]\\s*)(<br)',r'\1\n\2', astring)

Technically, because of the way Python parses string literals (a \ is preserved if it does not form a valid escape sequence), '([^\s\u2013-]\s*)(<br)' also works (compare \\s and \s), but I double the backslashes just to be safe. Python 3.6 and later also emit a warning for unrecognized escapes such as \s in a normal literal.
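The difference between the two spellings can be seen by comparing the pattern strings themselves and then the results they produce. A small sketch, assuming Python 3.3 or later so that both forms are accepted; the variable names are illustrative.

```python
import re

astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

# Raw literal: the backslash-u sequence reaches re untouched, and re decodes it.
raw_pattern = r'([^\s\u2013-]\s*)(<br)'
# Normal literal: Python decodes \u2013 itself, and \\s reaches re as \s.
plain_pattern = '([^\\s\u2013-]\\s*)(<br)'

# The pattern strings are not equal, yet both describe the same regex.
same_text = raw_pattern == plain_pattern
same_result = (re.sub(raw_pattern, r'\1\n\2', astring)
               == re.sub(plain_pattern, r'\1\n\2', astring))
print(same_text, same_result)  # False True
```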

edited Apr 24, 2013 at 16:54

answered Apr 24, 2013 at 15:13

nhahtdh


5 Comments

Bhoot Over a year ago

This gets me every time, dammit! Instead of playing with those two characters, I have decided to first replace all the EN DASH – with hyphen -. Then, I would require only hyphen - in the regex.

Bhoot Over a year ago

This doesn't work when there is whitespace between - and <br>. Apparently, [^-] matches whitespace too, which breaks the regexp.

Bhoot Over a year ago

I improved the regexp to ([^-\s]\s*)(<br). It seems to have solved the problem. Tell me if there is a catch in this.

nhahtdh Over a year ago

@MisterBhoot: It seems that we arrived at the same solution.

Bhoot Over a year ago

@nhahtdh - Yepp. Thanks for your quick effort anyways.
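The approach described in the comments above (first replace every EN DASH with a hyphen, then use the hyphen-only pattern) could look roughly like the sketch below. Note that this changes the dashes in the output text, which may or may not be acceptable; the name `normalized` is illustrative.

```python
import re

astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

# Step 1: replace every EN DASH with an ASCII hyphen (this alters the text itself).
normalized = astring.replace("\u2013", "-")

# Step 2: the hyphen-only pattern from the comments now covers both kinds of dash.
result = re.sub(r'([^-\s]\s*)(<br)', r'\1\n\2', normalized)
print(repr(result))
# 'July 17, 2007 -<br> September 25, 2009 \n<br> June 2007 - July 2010'
```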

Loamhoof · Answer · 2013-04-24 15:10:31Z

1

The – in your string and the - in your regex are not the same characters. Try

re.sub(r'([^–])(\s*<br)',r'\1\n\2', astring)
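A quick check of this pattern against the sample string, as a sketch: the EN DASH is written as a `\u2013` escape here only to make it visible. The first <br> is now left alone, but because \s* sits in the second group, the newline lands before the space rather than after it, so the result still differs slightly from the output requested in the question.

```python
import re

astring = "July 17, 2007 \u2013<br> September 25, 2009 <br> June 2007 - July 2010"

# Same pattern as above, with the EN DASH spelled as an escape in a normal literal.
result = re.sub('([^\u2013])(\\s*<br)', r'\1\n\2', astring)
print(repr(result))
# 'July 17, 2007 –<br> September 25, 2009\n <br> June 2007 - July 2010'
```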

answered Apr 24, 2013 at 15:10

Loamhoof


1 Comment

Bhoot Over a year ago

Though I understood the issue already from your answer (which was the first to appear in my feed), @nhahtdh's answer is more comprehensive. I will accept his answer.
