0

I have a string like this:


string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title"  width="600"...
title="one more title"...> '''

I am trying to extract every title value, i.e. whatever appears between the quotes in title="Anything here". I have already tried the following, but it does not work correctly:

re.findall(r'title=\"(.*)\"',string)

  • Regex is not a nice way to parse HTML. Use an HTML parser. Commented Feb 12, 2020 at 15:47
  • The requests-html library with XPath is probably the way to go: pypi.org/project/requests-html Commented Feb 12, 2020 at 15:50
  • Parsing HTML with regex is a hard job; HTML and regex are not good friends. Use a parser: it is simpler, faster, and much more maintainable. Commented Feb 12, 2020 at 18:03

3 Answers

2

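I think your regex is too greedy: .* matches as many characters as it can, quotes included, so each match runs on to the last " on the line instead of stopping at the quote that closes the title. You can try something like this: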

re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
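To make the difference concrete, here is a minimal sketch that runs the pattern from the question and the restricted pattern above on the same input. The variable names greedy and restricted are only illustrative.

```python
import re

# The string from the question, unchanged.
string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title"  width="600"...
title="one more title"...> '''

# Greedy: (.*) keeps going until the LAST double quote on each line.
greedy = re.findall(r'title="(.*)"', string)

# Restricted: only word characters and whitespace are allowed inside the
# quotes, so the match has to stop at the closing quote of the title.
restricted = re.findall(r'title="(?P<title>[\w\s]+)"', string)

print(greedy)      # the first two entries spill over into the width attribute
print(restricted)  # ['email example', 'second example title', 'one more title']
```

The first two greedy matches swallow the following width attribute, while the restricted pattern returns only the three titles.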

As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python than regex. See other SO answers for more context; there are a few common tools for this.

If you would like to read more on performance testing of different python HTML parsers you can learn more here
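As a rough illustration of the parser route, here is a sketch using only the standard library's html.parser module. The TitleCollector class and the sample markup are hypothetical, and the sketch assumes well-formed tags, unlike the elided string in the question.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the value of every title attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


# Hypothetical, well-formed sample input (an assumption for this sketch).
markup = (
    '<img src="a.png" title="email example" width="500">'
    '<img src="b.png" title="second, punctuated title!" width="600">'
)

collector = TitleCollector()
collector.feed(markup)
collector.close()
print(collector.titles)  # ['email example', 'second, punctuated title!']
```

Because the parser handles attribute quoting itself, titles containing punctuation need no special treatment.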

2 Comments

Thanks this works fine!
@mahhos, I'm glad this answer was useful. Please accept an answer as correct once your issue has been solved.
0

As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python, and I stand by that too. If you still want to get it done with a regex, this may help:

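# Allow only letters, digits, and whitespace between the two quotes
c = re.finditer(r'title="([a-zA-Z0-9\s]+)"', string)

for i in c:
    print(i.group(1))  # group(1) is the text inside the quotes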

0

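The problem here is that . matches any character, including ", so the closing quote of the title is consumed as part of the (.*) group in your RE, and the match only ends at the last " on the line. For your use case, you can restrict the group to letters, numbers, and whitespace.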
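If titles may also contain punctuation, one alternative (a sketch, not something proposed in the answers above) is to match anything except a double quote. The sample string below is hypothetical.

```python
import re

# Hypothetical input whose titles contain punctuation.
sample = 'title="e-mail: example!" width="500" title="second, title"'

# [^"]* matches any run of characters that are not a double quote,
# so the group always ends at the quote that closes the attribute.
negated = re.findall(r'title="([^"]*)"', sample)

# A lazy quantifier stops at the first closing quote as well.
lazy = re.findall(r'title="(.*?)"', sample)

# A letters/digits/whitespace class finds nothing in this sample,
# because both titles contain punctuation.
alnum_only = re.findall(r'title="([\w\s]+)"', sample)

print(negated)     # ['e-mail: example!', 'second, title']
print(lazy)        # ['e-mail: example!', 'second, title']
print(alnum_only)  # []
```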
