In the following string:

s = '>foo</a> Start >bar</a> >baz</a>'

I want to extract the first value that comes between > and </a> after Start, which is bar.

The following snippets each do half of the job, but I don't know how to combine them.

import re

regexp = re.compile(r"Start(.*)$")
output = regexp.search(s).group(1)  # everything after Start

output = re.search(r">(.*?)</a>", s).group(1)  # first value between > and </a>
  • Use r"Start[^>]*>(.*?)</a>" or r"(?s)Start.*?>([^<]*)</a>" Commented Jan 5, 2021 at 0:30
  • This looks like HTML. Is it possible to use a DOM parser instead? Commented Jan 5, 2021 at 0:48
  • Are you trying to parse HTML? Commented Jan 5, 2021 at 2:03
  • Yes, it is a very long HTML document, and since I couldn't get BeautifulSoup to work, I thought regex was the way to go. Commented Jan 5, 2021 at 7:27

2 Answers

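You can use either of these patterns:

r"Start[^>]*>(.*?)</a>"
r"(?s)Start.*?>([^<]*)</a>"

Details for the first pattern (the second uses .*? with the (?s) flag to reach the next >, and [^<]* inside the group):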

  • Start - a literal string
  • [^>]* - zero or more chars other than >
  • > - a > char
  • (.*?) - Group 1: any zero or more chars, as few as possible
  • </a> - a literal string.

See the Python demo:

import re
s = '>foo</a> Start >bar</a> >baz</a>'
regexp = re.compile(r"Start.*?>([^<]*)</a>", re.DOTALL)
m = regexp.search(s)
if m:
    print(m.group(1)) # => bar
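
The two patterns behave the same on the sample string, but they can differ when a line break falls between Start and the next >. The following is a minimal sketch of that difference; the multi-line input is a made-up variation of the original string, not data from the question.

```python
import re

# Hypothetical input: a newline sits between "Start" and the next ">".
s = '>foo</a> Start\n>bar</a> >baz</a>'

# [^>]* is a negated character class, so it crosses newlines without any flag.
m1 = re.search(r"Start[^>]*>(.*?)</a>", s)

# .*? cannot cross a newline unless DOTALL, written inline as (?s), is enabled.
m2 = re.search(r"Start.*?>([^<]*)</a>", s)
m3 = re.search(r"(?s)Start.*?>([^<]*)</a>", s)

first = m1.group(1) if m1 else None
no_flag = m2.group(1) if m2 else None
with_flag = m3.group(1) if m3 else None
print(first, no_flag, with_flag)
```

This is why the demo above passes re.DOTALL when it uses the .*? form.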

I don't know what you want to do with the result, but a much simpler approach is:

s = '>foo</a> Start >bar</a> >baz</a>'

print(s.split("</a>")[1].split(">")[-1])

Output:

bar
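
Note that the one-liner above picks the second </a>-delimited segment by position and never looks for Start. If the position of Start can vary, a variant that anchors on it may be safer. This is only a sketch using str.partition and str.rpartition, and the variable names are illustrative.

```python
s = '>foo</a> Start >bar</a> >baz</a>'

# Keep only the text after the first "Start" (found is empty if it is absent).
_, found, tail = s.partition("Start")

value = None
if found:
    # Take everything before the first "</a>", then the text after its last ">".
    before_close, close, _ = tail.partition("</a>")
    if close and ">" in before_close:
        value = before_close.rpartition(">")[2]

print(value)
```

If Start or a following >...</a> pair is missing, value stays None instead of raising an IndexError.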
