0

I have a nasty string that I converted from HTML code that looks like this:

<p><topic url="car-colours">Toyota Camry</topic> has <a href="/colours/dark-red">Dark Red</a><span> (2020)</span>, <a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)

I want to extract the colour names from this string and put them in a list. Since every colour is wrapped in a tag, I was thinking of extracting all substrings between the ">" and "<" characters, but I don't know how.

My goal is a list that stores all colours for the Toyota Camry, like: toyota_camry_colours = ["Dark Red", "Pearl White"]

Any ideas how I can do this? In bash I would use something like grep or awk, but I don't know the equivalent in Python.

2 Answers

3

The BeautifulSoup module was designed to parse HTML.

from bs4 import BeautifulSoup

# Named html_text rather than str, so the built-in str type is not shadowed.
html_text = """\
<p><topic url="car-colours">Toyota Camry</topic> has <a href="/colours/dark-red">Dark Red</a><span> (2020)</span>, <a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)"""

soup = BeautifulSoup(html_text, 'html.parser')
# Each colour name is the text of an <a> element.
for link in soup.find_all('a'):
    print(link.text)

Output:

Dark Red
Pearl White
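
If installing bs4 is not an option, the standard library's html.parser can probably do the same job and put the names straight into the list the question asks for. This is only a minimal sketch: the class name LinkTextCollector and its attributes are my own invention, and it assumes the colour names are always the text of the <a> elements, as in the sample string.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.texts.append("")  # start a new entry for this link

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside a link.
        if self.in_link:
            self.texts[-1] += data


html_text = (
    '<p><topic url="car-colours">Toyota Camry</topic> has '
    '<a href="/colours/dark-red">Dark Red</a><span> (2020)</span>, '
    '<a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)'
)

collector = LinkTextCollector()
collector.feed(html_text)
collector.close()

toyota_camry_colours = [text.strip() for text in collector.texts]
print(toyota_camry_colours)  # ['Dark Red', 'Pearl White']
```

The parser tolerates the unclosed <p> and <span> in the sample, which is the main reason to prefer it over plain string slicing.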

0

A simple regex over the href slugs would do it: /colours/([\w-]+)

import re

txt = '<p><topic url="car-colours">Toyota Camry</topic> has <a href="/colours/dark-red">Dark Red</a><span>' \
      ' (2020)</span>, <a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)'
colors = re.findall(r"/colours/([\w-]+)", txt)
print(colors)  # ['dark-red', 'pearl-white']

colors = [" ".join(word.capitalize() for word in color.split("-")) for color in colors]
print(colors)  # ['Dark Red', 'Pearl White']
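
Closer to the idea in the question (grab what sits between ">" and "<"), a regex can also capture the link text directly instead of rebuilding it from the slug. A rough sketch, assuming each colour is the plain text of an <a> element with no nested tags inside it; the pattern is my own and would need adjusting for messier HTML:

```python
import re

txt = (
    '<p><topic url="car-colours">Toyota Camry</topic> has '
    '<a href="/colours/dark-red">Dark Red</a><span> (2020)</span>, '
    '<a href="/colours/pearl-white">Pearl White</a><span> (2016 - 2017)'
)

# Match an opening <a ...> tag, capture everything up to the next "<",
# and require that the capture is followed by the closing </a>.
toyota_camry_colours = re.findall(r"<a\b[^>]*>([^<]+)</a>", txt)
print(toyota_camry_colours)  # ['Dark Red', 'Pearl White']
```

This keeps the original capitalisation of the names, but it will miss links that contain other tags, so a real HTML parser is the safer choice for anything beyond a flat string like this one.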
