I want to run re.sub using a compiled pattern. I have the substitution logic working (see the first example below), so that the function prints ", peanut, " as desired. Basically, the sub takes emphasis HTML tags of any length, with or without attributes, and replaces each one with ", ". However, I want to get the second version working: although it is more verbose, it is easier to modify, because to add a new emphasis tag I add "tagname" instead of "|tagname|/tagname" for the open and close versions respectively. I know the answer involves compile somehow. I searched and could not find it.

Works:

def cut_out_emphasis():
    str1 = "<b class='boldclass'>peanut</b>"
    str1 = re.sub(r'<(b|\/b|i|\/i|ul|\/ul)[^>]*>', ', ', str1, flags=re.I)
    print str1

Doesn't work:

def cut_out_emphasis():
    str1 = "<b class='boldclass'>peanut</b>"
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = '%s|%s|/%s' % (str2, x, x)
    str2 = "r'<(%s)[^>]*>'" % (str2, )
    str1 = re.sub(re.compile(str2, re.IGNORECASE), ', ', str1, flags=re.I)
    print str1
  • Add quotes after html strings Commented May 26, 2014 at 13:04

2 Answers

A compiled RE doesn't work like that: you call sub on the compiled object itself, and the flags go to re.compile rather than to sub. You should do it like this:

def cut_out_emphasis(str1):
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = r'%s|%s|\/%s' % (str2, x, x)
    str2 = r'<(%s)[^>]*>' % (str2, )
    re_compiled = re.compile(str2, re.IGNORECASE)
    str1 = re_compiled.sub(', ', str1)
    return str1

But compiling a regex is optional. It improves performance if you use the same regex multiple times. In your case you can stick with this:

def cut_out_emphasis(str1):
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = r'%s|%s|\/%s' % (str2, x, x)
    str2 = r'<(%s)[^>]*>' % (str2, )
    str1 = re.sub(str2, ', ', str1, flags=re.I)
    return str1
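
One thing to watch in the loop above: because str2 starts out empty, the finished alternation begins with a "|", which leaves an empty alternative that lets the pattern match any tag at all. A minimal sketch of one way around that is to build the alternation with "|".join instead. The helper name build_tag_pattern, the use of re.escape, and the lookahead that keeps "b" from matching "<body>" are my own assumptions, not part of the original code:

```python
import re

def build_tag_pattern(tags):
    # Hypothetical helper: join the names so there is no leading "|".
    # re.escape guards against names containing regex metacharacters.
    names = "|".join(re.escape(t) for t in tags)
    # "/?" covers close tags; the lookahead (an added assumption) requires
    # whitespace, "/" or ">" after the name, so "b" does not match "<body>".
    return re.compile(r"</?(?:%s)(?=[\s/>])[^>]*>" % names, re.IGNORECASE)

EMPHASIS_RE = build_tag_pattern(["b", "i", "ul"])

def cut_out_emphasis(text):
    return EMPHASIS_RE.sub(", ", text)

print(cut_out_emphasis("<i csdf>peanut</i>"))  # , peanut, 
```

Adding a tag is still a matter of adding one name to the list.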

4 Comments

Thanks, but neither function works for me. For some reason Stack Overflow didn't print the HTML tags in the test string I wanted to process. See the edited question above, thanks.
I just copied your code and changed only the re.sub part. Looks like you forgot to escape / in the second example. I fixed it in my answer, you can try it.
Akex, thanks, but it still doesn't work. Try: print cut_out_emphasis("<i csdf>peanut</i>") - this should return ", peanut, " but it doesn't.
I found it. The r is not part of the string; it is a prefix that tells Python you are supplying a raw string, i.e. it should not apply escaping rules. In your code the r and the quotes were inside the string, so they became part of the pattern. Try my code now.
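
To make the last comment concrete, here is a small sketch of the difference between putting r'...' inside the string and using it as a prefix on the literal. The variable names are only for illustration:

```python
import re

alternatives = "b|/b|i|/i|ul|/ul"

# The question's version: the r and the single quotes sit inside the
# string, so they become literal characters of the pattern.
wrong = "r'<(%s)[^>]*>'" % alternatives
# The r prefix belongs outside the quotes, on the literal itself.
right = r'<(%s)[^>]*>' % alternatives

print(wrong)  # r'<(b|/b|i|/i|ul|/ul)[^>]*>'
print(right)  # <(b|/b|i|/i|ul|/ul)[^>]*>

html = "<i csdf>peanut</i>"
print(re.sub(wrong, ", ", html, flags=re.I))  # <i csdf>peanut</i>
print(re.sub(right, ", ", html, flags=re.I))  # , peanut, 
```

The first pattern can only match text that literally contains r'< ... >', which is why nothing was replaced.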

First, all non-trivial regexes should be written in free-spacing mode with proper indentation and lots of comments. Doing it this way lets you easily see and edit the list of tags to be stripped (i.e. there is no need to keep the tags in a separate, constant list; it's just as easy to add one line to the regex).

import re
re_strip_tags = re.compile(r"""
    # Match open or close tag from list of tag names.
    <                # Tag opening "<" delimiter.
    /?               # Match close tags too (which begin with "</").
    (?:              # Group list of tag names.
      b              # Bold tags.
    | i              # Italic tags.
    | em             # Emphasis tags.
#   | othertag       # Add additional tags here...
    )                # End list of tag names.
    (?:              # Non-capture group for optional attribute(s).
      \s+            # Attributes must be separated by whitespace.
      [\w\-.:]+      # Attribute name is required for attr=value pair.
      (?:            # Non-capture group for optional attribute value.
        \s*=\s*      # Name and value separated by "=" and optional ws.
        (?:          # Non-capture group for attrib value alternatives.
          "[^"]*"    # Double quoted string (Note: may contain "&<>").
        | '[^']*'    # Single quoted string (Note: may contain "&<>").
        | [\w\-.:]+  # Non-quoted attrib value can be A-Z0-9-._:
        )            # End of attribute value
      )?             # Attribute value is optional.
    )*               # Zero or more tag attributes.
    \s* /?           # Optional whitespace and "/" before ">".
    >                # Tag closing ">" delimiter.
    """, re.VERBOSE | re.IGNORECASE)

def cut_out_emphasis(text):
    # Call sub on the precompiled pattern; "text" avoids shadowing the built-in str.
    return re_strip_tags.sub(', ', text)

print (cut_out_emphasis("<b class='boldclass'>peanut</b>"))

When writing Python regexes, always use either the r"raw string" or the r"""raw multi-line string""" syntax to avoid issues with escapes and backslashes. Note that in the above script the regex is compiled only once but can be used many times. (However, this is not much of an advantage, as Python internally caches compiled regular expressions.)
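
As a quick illustration of why the raw-string prefix matters (a sketch of my own, not taken from the script above), compare how "\b" is read with and without it:

```python
import re

# In a normal string literal "\b" is a backspace character (0x08);
# in a raw string it stays as backslash + "b", which re reads as a
# word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))            # 5 7
print(re.findall(plain, "a cat sat"))  # []
print(re.findall(raw, "a cat sat"))    # ['cat']
```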

It should also be mentioned that parsing HTML using regex is, let's just say, generally frowned upon.
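
For comparison, here is a rough sketch of the same replacement done with a parser rather than a regex, using html.parser from the Python 3 standard library. The class name EmphasisStripper and the function name cut_out_emphasis_parsed are hypothetical, and the sketch assumes it is acceptable for character references in the text to come out decoded:

```python
from html.parser import HTMLParser

class EmphasisStripper(HTMLParser):
    """Replace selected tags with ', ' and pass other markup through."""

    def __init__(self, tags=("b", "i", "em")):
        # convert_charrefs=True means references such as &amp; reach
        # handle_data already decoded.
        super().__init__(convert_charrefs=True)
        self.tags = set(tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling the handlers.
        self.parts.append(", " if tag in self.tags else self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.parts.append(", " if tag in self.tags else self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(", " if tag in self.tags else "</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

def cut_out_emphasis_parsed(text):
    parser = EmphasisStripper()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(cut_out_emphasis_parsed("<b class='boldclass'>peanut</b>"))  # , peanut, 
```

Because the parser compares whole tag names, a tag like body is left alone without any extra lookahead.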
