Find substrings between two strings [duplicate]

Question

I have a string like this:


string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title"  width="600"...
title="one more title"...> '''

I am trying to get anything that appears as a title (title="Anything here"). I have already tried this, but it does not work correctly:

re.findall(r'title=\"(.*)\"',string)

The requests-html library with XPath is probably the way to go: pypi.org/project/requests-html
– Plato77, Commented Feb 12, 2020 at 15:50
Parsing HTML with regex is a hard job; HTML and regex are not good friends. Use a parser: it is simpler, faster, and much more maintainable.
– Toto, Commented Feb 12, 2020 at 18:03

PatientOtter · Accepted Answer · 2020-02-12 17:35:38Z

2

I think your regex is too greedy. You can try something like this:

re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
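To see the difference, the two patterns can be run side by side. This is only a minimal sketch: the sample string is a shortened stand-in for the one in the question (the elided `..` parts are left out), and the names `greedy` and `restricted` are just for illustration.

```python
import re

# Shortened sample modelled on the question's string (elided parts omitted).
string = '''<img height="233" src="monline/" title="email example" width="500"
title="second example title" width="600"
title="one more title"> '''

# Greedy: .* runs on to the last double quote on each line.
greedy = re.findall(r'title="(.*)"', string)

# Restricted: [\w\s]+ cannot cross a double quote, so each match
# stops at the closing quote of the title.
restricted = re.findall(r'title="(?P<title>[\w\s]+)"', string)

print(greedy)
print(restricted)  # ['email example', 'second example title', 'one more title']
```

The first list contains entries such as `email example" width="500`, which is the symptom described in the question. The restricted pattern only accepts word characters and whitespace, so a title containing punctuation would not be matched at all.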

As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. See other SO answers for more context. There are a few common tools for this.

If you would like to read more about performance testing of different Python HTML parsers, you can learn more here
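As a rough illustration of the parser route, the sketch below uses `html.parser` from the Python standard library, which is not necessarily one of the tools this answer had in mind. The class name `TitleCollector` is hypothetical, and the markup is an assumed, well-formed stand-in for the question's string with one `title` attribute per tag.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the value of every title attribute seen in a start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


# Stand-in markup (an assumption, not the question's exact string).
html = '''<img src="monline/" title="email example" width="500">
<img src="monline/" title="second example title" width="600">
<img src="monline/" title="one more title, with punctuation!">'''

collector = TitleCollector()
collector.feed(html)
collector.close()
print(collector.titles)
```

Because the parser handles the quoting itself, titles containing punctuation or other attributes on the same line do not need special treatment.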

edited Feb 12, 2020 at 17:35

answered Feb 12, 2020 at 16:00


2 Comments

Mahhos Over a year ago

Thanks this works fine!

PatientOtter Over a year ago

@mahhos, I'm glad this answer was useful. Please accept answers as correct once your issue has been solved.

ibrahim · Accepted Answer · 2020-02-12 16:39:39Z

0

As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. I stand by this too, but if you want to get it done with a regex, this may help:

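import re

# Capture letters, digits, and whitespace between the quotes after title=.
# The closing delimiter is a quote only, so a space cannot end the match early.
for match in re.finditer(r'title="([a-zA-Z0-9\s]+)"', string):
    print(match.group(1))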

answered Feb 12, 2020 at 16:39


Almostapha · Accepted Answer · 2020-02-12 16:42:23Z

0

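The problem here is that .* is greedy: the next " symbol is treated as an ordinary character and becomes part of the (.*) group of your RE, so the match runs on to the last " on the line. For your use case, you can restrict the group to letters, numbers, and whitespace.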
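If titles may contain punctuation, a letters-and-digits group would miss them. Two common alternatives, sketched here on a made-up sample string, are a lazy quantifier (`.*?`) and a negated character class (`[^"]*`); both stop at the first closing quote.

```python
import re

# Made-up sample for illustration; the second title contains punctuation.
sample = 'title="email example" width="500" title="re: hello, world!" width="600"'

# Lazy quantifier: .*? expands only as far as needed to reach the next quote.
lazy = re.findall(r'title="(.*?)"', sample)

# Negated class: [^"]* matches any run of characters that are not a double quote.
negated = re.findall(r'title="([^"]*)"', sample)

print(lazy)     # ['email example', 're: hello, world!']
print(negated)  # ['email example', 're: hello, world!']
```

One difference worth knowing: `.` does not match a newline unless `re.DOTALL` is set, whereas `[^"]*` does, so the negated class also handles a title that wraps across lines.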

answered Feb 12, 2020 at 16:42
