Summary: What regex would I use to remove tags from an HTML document? This may duplicate previous questions (How to remove only html tags in a string? and Remove HTML tags in String), but I can't yet program fully in the languages used there, which is why I am asking.

I am completing a Python exercise by Google: https://developers.google.com/edu/python/exercises/baby-names. It requires you to parse HTML data using regex (the HTML is structured to make this easier). I've been having problems removing the tags surrounding the data:

def extract_names(filename):
  """
  Given a file name for baby.html, returns a list starting with the year string
  followed by the name-rank strings in alphabetical order.
  ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
  """
  # +++your code here+++
  #open and read file
  file = open(filename,'r')
  HTML = file.read()
  #html file
  #print(HTML)

  #extract date
  date = re.search(r'(Popularity in )([\d]+)',HTML)
  print('Date: ',date.group(2))

  #find rank and name remove html tags
  ranking_tags = re.findall(r'<td>[\d]</td>',HTML)
  rankings = []
  name_tags = re.findall(r'<td>[a-z]</td>',HTML,re.IGNORECASE)
  names = []

  for value in ranking_tags:
      rankings.append(re.sub('[<td></td>]','',value))

  for value in name_tags:
    names.append(re.sub('[<td></td>]','',value))
  print(rankings)
  print(names)

Currently, my regexes do not remove the tags correctly because the patterns are wrong. I have already tried teaching myself how to remove the tags, to no avail, using http://www.cbs.dtu.dk/courses/27610/regular-expressions-cheat-sheet-v2.pdf and https://www.tutorialspoint.com/python/python_reg_expressions.htm, as well as looking at other sites before writing this.

Any suggestions would be much appreciated.

1 Answer

If regex is not required and you just want to get the job done, you can look at existing implementations.

Django's strip_tags:

https://github.com/django/django/blob/master/django/utils/html.py#L183

def _strip_once(value):
    """
    Internal tag stripping utility used by strip_tags.
    """
    s = MLStripper()
    s.feed(value)
    s.close()
    return s.get_data()


@keep_lazy_text
def strip_tags(value):
    """Return the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    value = str(value)
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value

You can modify that implementation.
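The Django snippet above depends on Django internals (MLStripper, keep_lazy_text), so it will not run on its own. As a rough standalone sketch of the same idea, assuming only the standard library's html.parser module, something like the following collects the text between tags. The names TextCollector and strip_tags_once are made up for this illustration and are not part of Django.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves never reach here.
        self.chunks.append(data)

    def get_data(self):
        return ''.join(self.chunks)


def strip_tags_once(value):
    """Return the text content of an HTML string with tags dropped."""
    parser = TextCollector()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_data()


print(strip_tags_once('<td>1</td><td>Michael</td>'))
```

Note that adjacent cells are joined without a separator, so for the baby-names exercise you would still need to handle each cell separately.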

Python Standard Library and its xml module

https://docs.python.org/3/library/xml.etree.elementtree.html

The documentation contains examples of how to use it properly.
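As a small sketch of what that could look like, assuming the markup is well-formed XML (real-world HTML often is not, in which case ElementTree raises a ParseError), the text of each cell can be read without touching the tags. The sample row below is invented for illustration and only guesses at the layout of the exercise file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample row; the real baby.html may differ.
row = '<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td></tr>'

root = ET.fromstring(row)

# Text of each <td> element, one entry per cell.
cells = [td.text for td in root.iter('td')]

# All text under the root, concatenated with tags dropped.
text = ''.join(root.itertext())

print(cells)
print(text)
```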

Use lxml package

https://lxml.de/api/lxml.etree-module.html#strip_tags

Example usage:

strip_tags(some_element,
    'simpletagname',             # non-namespaced tag
    '{http://some/ns}tagname',   # namespaced tag
    '{http://some/other/ns}*',   # any tag from a namespace
    Comment                      # comments (including their text!)
    )
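If the exercise does require regex, two things in the question's patterns are worth noting. First, [\d] and [a-z] each match exactly one character, so a + quantifier is needed to match whole ranks and names. Second, '[<td></td>]' is a character class: it matches any single <, t, d, > or / character, so it also deletes the letters t and d from names. The sketch below shows one possible fix, using an invented sample row that assumes the rank/name cells look like <td>1</td><td>Michael</td>; the helper name strip_td is hypothetical.

```python
import re

# Hypothetical sample row; the real baby.html may differ.
sample = '<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td></tr>'


def strip_td(cell):
    """Remove literal <td> and </td> tags, leaving the cell text."""
    return re.sub(r'</?td>', '', cell)


# Option 1: match whole cells, then strip the tags from each match.
ranks = [strip_td(c) for c in re.findall(r'<td>\d+</td>', sample)]
names = [strip_td(c) for c in re.findall(r'<td>[a-z]+</td>', sample, re.IGNORECASE)]

# Option 2: capture groups return the inner text directly, no stripping needed.
rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', sample)

print(ranks, names, rows)
```

The capture-group version is usually the simpler choice here, since re.findall returns only the grouped text when the pattern contains groups.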

1 Comment

I could use this. But the point of the exercise was to try and use regexes.
