How can I extract substrings between two characters in Python?

Question

I have a nasty string that I converted from HTML code that looks like this:

<p><topic url="car-colours">Toyota Camry</topic> has <a href="/colours/dark-red">Dark Red</a><span> (2020)</span>, <a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)

I want to extract the names of the colours from this string and put them in a list. Every colour name sits between a ">" and a "<" character, so I was thinking of extracting all substrings between those two characters, but I don't know how.

My goal is a list that stores all colours for the Toyota Camry, like: toyota_camry_colours = ["Dark Red", "Pearl White"]

Any ideas how I can do this? In bash I would use something like grep or awk, but I don't know the equivalent in Python.

Tim Roberts · Accepted Answer · 2021-11-04 19:52:15Z

3

The BeautifulSoup module was designed to parse HTML.

from bs4 import BeautifulSoup

# Named html_text rather than str so the built-in str type is not shadowed.
html_text = """\
<p><topic url="car-colours">Toyota Camry</topic> has <a href="/colours/dark-red">Dark Red</a><span> (2020)</span>, <a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)"""

soup = BeautifulSoup(html_text, 'html.parser')
# Each colour name is the text of an <a> element.
for link in soup.find_all('a'):
    print(link.text)

Output:

Dark Red
Pearl White
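If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not part of the answer above: the class name LinkTextCollector and its attributes are made up for illustration, and the sketch assumes every colour name is the text of an <a> element, as in the sample string. It also collects the names into the list the question asks for instead of printing them one by one.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Hypothetical helper: collect the text inside every <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True while the parser is inside an <a> element
        self.texts = []       # one entry per <a> element seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Text outside links (model name, years) is ignored.
        if self.in_link:
            self.texts[-1] += data


html_text = (
    '<p><topic url="car-colours">Toyota Camry</topic> has '
    '<a href="/colours/dark-red">Dark Red</a><span> (2020)</span>, '
    '<a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)'
)

collector = LinkTextCollector()
collector.feed(html_text)
collector.close()

toyota_camry_colours = [text.strip() for text in collector.texts]
print(toyota_camry_colours)  # ['Dark Red', 'Pearl White']
```

With BeautifulSoup available, the equivalent list would come from a list comprehension over soup.find_all('a').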


azro · Answer · 2021-11-04 19:46:56Z

0

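A simple regex would do it: /colours/([\w-]+). It captures the slug from each href (for example dark-red), which is then turned back into a display name by splitting on "-" and capitalizing each word.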

import re

txt = '<p><topic url="car-colours">Toyota Camry</topic> has <a href="/colours/dark-red">Dark Red</a><span>' \
      ' (2020)</span>, <a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)'
colors = re.findall(r"/colours/([\w-]+)", txt)
print(colors)  # ['dark-red', 'pearl-white']

colors = [" ".join(word.capitalize() for word in color.split("-")) for color in colors]
print(colors)  # ['Dark Red', 'Pearl White']
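The question's original idea, taking the text between ">" and "<", can also be sketched directly. Matching every such substring would pick up "Toyota Camry" and the year spans as well, so this variant is anchored to the <a> tags. It is a minimal sketch that assumes the link text contains no nested tags; for arbitrary HTML a real parser is the safer choice.

```python
import re

txt = (
    '<p><topic url="car-colours">Toyota Camry</topic> has '
    '<a href="/colours/dark-red">Dark Red</a><span> (2020)</span>, '
    '<a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)'
)

# <a\b[^>]*>  matches the opening <a ...> tag up to its closing ">"
# ([^<]+)     captures everything up to the next "<", i.e. the link text
# </a>        requires the capture to be followed by the closing tag
toyota_camry_colours = re.findall(r"<a\b[^>]*>([^<]+)</a>", txt)
print(toyota_camry_colours)  # ['Dark Red', 'Pearl White']
```

Unlike the slug-based pattern, this returns the names exactly as they are displayed, so no capitalization step is needed.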
