python regex sub compile with flags

Question

I want to run a regex substitution with re.sub using a compiled pattern. The substitution logic already works (see the first example below): the function prints ", peanut, " as desired. The pattern matches emphasis HTML tags of any length, with or without attributes, and replaces each one with ", ". However, I want to get the second version working. It is more verbose but easier to modify, because to add a new emphasis tag I only add "tagname" to the list instead of "|tagname|/tagname" for the open and close versions respectively. I believe the answer involves re.compile somehow, but I searched and could not find it.

Works:

def cut_out_emphasis():
    str1 = "<b class='boldclass'>peanut</b>
    str1 = re.sub(r'<(b|\/b|i|\/i|ul|\/ul)[^>]*>', ', ', str1, flags=re.I)
    print str1

Doesn't work:

def cut_out_emphasis():
    str1 = "<b class='boldclass'>peanut</b>
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = '%s|%s|/%s' % (str2, x, x)
    str2 = "r'<(%s)[^>]*>'" % (str2, )
    str1 = re.sub(re.compile(str2, re.IGNORECASE), ', ', str1, flags=re.I)
    print str1

Add quotes after html strings

– Alex Shkop, Commented May 26, 2014 at 13:04

Alex Shkop · Accepted Answer · 2014-05-26 13:16:02Z

2

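A compiled regex doesn't work like that: it already carries its flags, so call its own sub() method instead of passing it to re.sub() together with flags. You should do it like this: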

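import re

def cut_out_emphasis(str1):
    list1 = ["b", "i", "ul"]
    # Join the open and close form of each tag with "|". Starting from an
    # empty string would leave a leading empty alternative that matches any tag.
    str2 = '|'.join(r'%s|\/%s' % (x, x) for x in list1)
    str2 = r'<(%s)[^>]*>' % (str2, )
    # The flag is given once, at compile time.
    re_compiled = re.compile(str2, re.IGNORECASE)
    str1 = re_compiled.sub(', ', str1)
    return str1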

But it's optional to compile a regex. Compiling improves performance if you use the same regex multiple times. In your case you can stick with this:

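def cut_out_emphasis(str1):
    list1 = ["b", "i", "ul"]
    # Join the open and close form of each tag with "|". Starting from an
    # empty string would leave a leading empty alternative that matches any tag.
    str2 = '|'.join(r'%s|\/%s' % (x, x) for x in list1)
    str2 = r'<(%s)[^>]*>' % (str2, )
    # With a plain pattern string, flags can be passed to re.sub directly.
    str1 = re.sub(str2, ', ', str1, flags=re.I)
    return str1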
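
For illustration, here is a minimal sketch of the two things that go wrong in the question's second version, followed by one possible list-driven variant. The helper name build_tag_pattern is hypothetical and not part of the original code, and the sketch assumes Python 3 style print calls.

```python
import re

# 1. The question's pattern string: the r prefix and the quotes sit inside
#    the string, so they become literal characters of the regex.
broken = "r'<(%s)[^>]*>'" % ("b|/b",)
print(re.search(broken, "<b>peanut</b>", re.I))  # no match object is returned

# 2. A compiled pattern already carries its flags, so re.sub() rejects
#    a second set of flags passed alongside it.
compiled = re.compile(r"<(b|/b)[^>]*>", re.IGNORECASE)
try:
    re.sub(compiled, ", ", "<b>peanut</b>", flags=re.I)
    flags_rejected = False
except ValueError:
    flags_rejected = True
print(flags_rejected)

# 3. Hypothetical helper: build the pattern from a list of tag names.
#    "/?" covers the closing tags, re.escape() guards against special
#    characters in a name, and \b keeps "b" from also matching "<body>".
def build_tag_pattern(tag_names):
    alternatives = "|".join(re.escape(name) for name in tag_names)
    return re.compile(r"</?(?:%s)\b[^>]*>" % alternatives, re.IGNORECASE)

emphasis_re = build_tag_pattern(["b", "i", "ul"])
print(emphasis_re.sub(", ", "<b class='boldclass'>peanut</b>"))
```

With this variant, adding a tag means adding one name to the list, which is what the question was after.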

edited May 26, 2014 at 13:16

answered May 26, 2014 at 12:50


4 Comments

user2104778 Over a year ago

Thanks, but neither function works for me. For some reason Stack Overflow didn't print the HTML tags in the test string I wanted to process. See the edited question above, thanks.

Alex Shkop Over a year ago

I just copied your code and changed only the re.sub part. It looks like you forgot to escape / in the second example. I fixed it in my answer, so you can try it.

user2104778 Over a year ago

Alex, thanks, but it still doesn't work. Try: print cut_out_emphasis("<i csdf>peanut</i>") - this should return ", peanut, " but it doesn't.

Alex Shkop Over a year ago

I found it. The r is not part of the string; it is a prefix that tells Python you are supplying a raw string, i.e. it should not apply escaping rules. Try my code now.

ridgerunner · Accepted Answer · 2014-05-26 16:13:05Z

First, all non-trivial regexes should be written in free-spacing mode with proper indentation and lots of comments. Doing it this way lets you easily see and edit the list of tags to be stripped (i.e. there is no need to place the tags in a fixed, constant list; it's just as easy to add one line to the regex).

import re
re_strip_tags = re.compile(r"""
    # Match open or close tag from list of tag names.
    <                # Tag opening "<" delimiter.
    /?               # Match close tags too (which begin with "</").
    (?:              # Group list of tag names.
      b              # Bold tags.
    | i              # Italic tags.
    | em             # Emphasis tags.
#   | othertag       # Add additional tags here...
    )                # End list of tag names.
    (?:              # Non-capture group for optional attribute(s).
      \s+            # Attributes must be separated by whitespace.
      [\w\-.:]+      # Attribute name is required for attr=value pair.
      (?:            # Non-capture group for optional attribute value.
        \s*=\s*      # Name and value separated by "=" and optional ws.
        (?:          # Non-capture group for attrib value alternatives.
          "[^"]*"    # Double quoted string (Note: may contain "&<>").
        | '[^']*'    # Single quoted string (Note: may contain "&<>").
        | [\w\-.:]+  # Non-quoted attrib value can be A-Z0-9-._:
        )            # End of attribute value
      )?             # Attribute value is optional.
    )*               # Zero or more tag attributes.
    \s* /?           # Optional whitespace and "/" before ">".
    >                # Tag closing ">" delimiter.
    """, re.VERBOSE | re.IGNORECASE)

def cut_out_emphasis(text):
    # Call sub() on the compiled pattern; "text" avoids shadowing the built-in str.
    return re_strip_tags.sub(', ', text)

print (cut_out_emphasis("<b class='boldclass'>peanut</b>"))

When writing Python regexes, to avoid issues with escapes and backslashes, always use either the r"raw string" or the r"""raw multi-line string""" syntax. Note that in the above script the regex is compiled only once but can be used many times. (However, this is not much of an advantage, as Python internally caches compiled regular expressions.)
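
As a small illustration of the raw string point, the sketch below compares the same pattern written with and without the r prefix. The variable names and the sample text are made up for this example.

```python
import re

# A raw string keeps the backslash as written; a normal string
# interprets "\n" as a single newline character.
print(len(r"\n"))  # backslash plus "n"
print(len("\n"))   # one newline character

# In a normal string "\b" is a backspace character, so the regex engine
# never sees the word-boundary escape and the search finds nothing.
plain_match = re.search("\bcat\b", "a cat sat")
raw_match = re.search(r"\bcat\b", "a cat sat")

print(plain_match)
print(raw_match.group())
```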

It should also be mentioned that parsing HTML using regex is, let's just say, generally frowned upon.
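
For comparison, here is a rough sketch of the same replacement done with a parser instead of a regex. It assumes Python 3 and the standard library's html.parser module. The class name EmphasisStripper and the function name cut_out_emphasis_parsed are hypothetical, and this is only one of several ways to approach it.

```python
from html.parser import HTMLParser

class EmphasisStripper(HTMLParser):
    """Replace selected tags with ', ' and pass everything else through."""

    def __init__(self, tag_names):
        super().__init__(convert_charrefs=True)
        self.tag_names = set(tag_names)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in self.tag_names:
            self.parts.append(", ")
        else:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.tag_names:
            self.parts.append(", ")
        else:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Note: with convert_charrefs=True, entities such as &amp; arrive
        # here already decoded, so they are not reproduced verbatim.
        self.parts.append(data)

def cut_out_emphasis_parsed(text, tag_names=("b", "i", "em")):
    parser = EmphasisStripper(tag_names)
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

print(cut_out_emphasis_parsed("<b class='boldclass'>peanut</b>"))
```

The parser handles quoted attribute values and case differences itself, at the cost of more code than the one-line substitution.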
