How can I remove HTML tags in Python from an HTML file?

Question

Summary: What regex would I use to remove tags from an HTML document? This may duplicate earlier questions ("How to remove only html tags in a string?" and "Remove HTML tags in String"), but I cannot yet program fluently in the languages those answers use, which is why I am asking.

I am completing a Python exercise by Google: https://developers.google.com/edu/python/exercises/baby-names. It requires you to parse HTML data using regex (the HTML is structured to make this easier). I've been having problems removing the tags surrounding the data:

import re

def extract_names(filename):
  """
  Given a file name for baby.html, returns a list starting with the year string
  followed by the name-rank strings in alphabetical order.
  ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
  """
  # +++your code here+++
  # open and read file
  file = open(filename, 'r')
  HTML = file.read()
  # html file
  # print(HTML)

  # extract date
  date = re.search(r'(Popularity in )([\d]+)', HTML)
  print('Date: ', date.group(2))

  # find rank and name, remove html tags
  ranking_tags = re.findall(r'<td>[\d]</td>', HTML)
  rankings = []
  name_tags = re.findall(r'<td>[a-z]</td>', HTML, re.IGNORECASE)
  names = []

  for value in ranking_tags:
    rankings.append(re.sub('[<td></td>]', '', value))

  for value in name_tags:
    names.append(re.sub('[<td></td>]', '', value))
  print(rankings)
  print(names)

Currently, my regexes do not remove the tags, because the patterns are wrong. I have already tried teaching myself how to remove the tags, to no avail, using http://www.cbs.dtu.dk/courses/27610/regular-expressions-cheat-sheet-v2.pdf and https://www.tutorialspoint.com/python/python_reg_expressions.htm, and I looked at other sites before writing this.

Any suggestions would be much appreciated.

shalakhin · Accepted Answer · 2018-11-08 15:16:46Z

If regex is not required and you just want to get the job done, you can look at existing implementations.

Django's `strip_tags`:

https://github.com/django/django/blob/master/django/utils/html.py#L183

def _strip_once(value):
    """
    Internal tag stripping utility used by strip_tags.
    """
    s = MLStripper()
    s.feed(value)
    s.close()
    return s.get_data()


@keep_lazy_text
def strip_tags(value):
    """Return the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    value = str(value)
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value

You can modify that implementation.
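
Django's `MLStripper` is built on the standard library's `html.parser.HTMLParser`, so a stripped-down version can run without Django. The following is only a minimal sketch of that idea, not Django's actual code: the names `TagStripper` and `strip_tags_once` are made up for illustration, and the retry loop from `strip_tags` is left out.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: collects text content and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self.chunks.append(data)

    def get_data(self):
        return ''.join(self.chunks)


def strip_tags_once(value):
    """Return value with tags removed (single pass, no retry loop)."""
    stripper = TagStripper()
    stripper.feed(value)
    stripper.close()
    return stripper.get_data()


print(strip_tags_once('<td>Aaliyah</td><td>91</td>'))  # prints "Aaliyah91"
```

Note that adjacent cells are joined without a separator, so for table data you would probably strip each cell on its own.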

Python Standard Library and its `xml` module

https://docs.python.org/3/library/xml.etree.elementtree.html

The documentation contains examples of how to use it properly.
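
As a rough illustration of the `xml.etree.ElementTree` route, the sketch below parses a single table row and reads the text of its cells. The row is a hypothetical, well-formed sample invented for this example. `ElementTree` is an XML parser, so it will raise an error on HTML that is not well-formed XML (unclosed tags, for instance), and real pages may need cleaning first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample row (not taken from the exercise file).
row = '<tr><td>1</td><td>Michael</td><td>Jessica</td></tr>'

element = ET.fromstring(row)

# Text of each <td> cell, in document order.
cells = [td.text for td in element.iter('td')]
print(cells)  # prints "['1', 'Michael', 'Jessica']"

# All text inside the element with the tags dropped.
text = ''.join(element.itertext())
print(text)  # prints "1MichaelJessica"
```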

Use `lxml` package

https://lxml.de/api/lxml.etree-module.html#strip_tags

Example usage:

strip_tags(some_element,
    'simpletagname',             # non-namespaced tag
    '{http://some/ns}tagname',   # namespaced tag
    '{http://some/other/ns}*',   # any tag from a namespace
    Comment                      # comments (including their text!)
    )

1 Comment

Please don't judge me. Over a year ago

I could use this. But the point of the exercise was to try and use regexes.
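
For the regex route the exercise asks for, the patterns in the question fail for two reasons. `[\d]` and `[a-z]` each match exactly one character, so multi-digit ranks and whole names never match. And `'[<td></td>]'` is a character class, so `re.sub` deletes every individual `<`, `t`, `d`, `>` and `/` character rather than the tag strings. Capture groups avoid the stripping step altogether. Below is a minimal sketch under assumptions: `SAMPLE_HTML` is a made-up fragment that assumes each row holds a rank, a boy's name and a girl's name, and `extract_names_from_text` is a hypothetical helper that takes the HTML text instead of a filename.

```python
import re

# Made-up fragment; the row layout is an assumption about the exercise file.
SAMPLE_HTML = """
<h3 align="center">Popularity in 2006</h3>
<tr align="right"><td>1</td><td>Jacob</td><td>Emily</td>
<tr align="right"><td>2</td><td>Michael</td><td>Emma</td>
"""


def extract_names_from_text(html):
    """Return [year, 'Name rank', ...] with the name entries sorted."""
    # \d+ matches one or more digits; the group captures only the year.
    year = re.search(r'Popularity in (\d+)', html).group(1)

    # Each group captures the text between a pair of <td> tags,
    # so the tags never end up in the results.
    rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', html)

    ranks = {}
    for rank, boy, girl in rows:
        # Keep the first rank seen if a name appears more than once.
        ranks.setdefault(boy, rank)
        ranks.setdefault(girl, rank)

    return [year] + sorted(f'{name} {rank}' for name, rank in ranks.items())


print(extract_names_from_text(SAMPLE_HTML))
# prints "['2006', 'Emily 1', 'Emma 2', 'Jacob 1', 'Michael 2']"

# The character-class pitfall from the question: letters are deleted too.
print(re.sub('[<td></td>]', '', '<td>Ted</td>'))  # prints "Te"
```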
