I've written a script (see here) to get all the URLs from within a template directory. However, some of the hrefs contain two URLs, and which one is used depends on the language the app runs in.

My script currently gives me a list of whatever is in href='here', but now I also want to collect the URLs from an href that looks like this:

href="{{ 'http://www.link.com/blah/page.htm'|cy:'http://www.link.com/welsh/blah/page.htm' }}"

What regular expression would I need to return those? (Like so many people, I'm awful at regex!)

1 Answer

Something like:

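import re

# The href value from the question, stored as a plain string
href = "{{ 'http://www.link.com/blah/page.htm'|cy:'http://www.link.com/welsh/blah/page.htm' }}"

# Non-greedy match: each URL stops at its own closing apostrophe
print(re.findall(r"'(http://.*?)'", href))
# ['http://www.link.com/blah/page.htm', 'http://www.link.com/welsh/blah/page.htm']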

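This captures anything starting with http:// that sits between apostrophes. The .*? is non-greedy, so each match ends at the nearest closing apostrophe instead of running on to the last one in the string.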
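To collect both kinds of href in one pass, something along these lines might work. It is only a sketch: the sample template text, the collect_urls helper, and the assumption that href values are double-quoted are all mine, not from the question's script.

```python
import re

# Hypothetical template text; only the second href comes from the question.
template = '''
<a href="http://www.link.com/plain/page.htm">plain</a>
<a href="{{ 'http://www.link.com/blah/page.htm'|cy:'http://www.link.com/welsh/blah/page.htm' }}">both</a>
'''

# Assumption: href values are wrapped in double quotes.
HREF_RE = re.compile(r'href="([^"]*)"')
# URLs wrapped in apostrophes inside an href value.
QUOTED_URL_RE = re.compile(r"'(http://.*?)'")

def collect_urls(text):
    """Return every URL found in href attributes (hypothetical helper)."""
    urls = []
    for value in HREF_RE.findall(text):
        quoted = QUOTED_URL_RE.findall(value)
        # Use the quoted URLs if there are any, else the raw href value.
        urls.extend(quoted if quoted else [value])
    return urls

urls = collect_urls(template)
print(urls)
```

A plain href contributes its value unchanged, while a two-language href contributes both of its quoted URLs.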

4 Comments

+1 You can also add http(s)? to handle both http and https.
@AshwiniChaudhary yup, or just s? will do it... Suppose it should be up to the OP if they want to handle that/any other protocols...
Wonderful. I was trying to find by start and end characters. Does re.findall("'(http[s]:// work to match http and https? I've seen the [s] used in an example, but don't fully understand it.
@marksweb Sorry, was out for a moment. [s] means "match one of the characters inside the []", so http[s] would only match https. Using s? means "match zero or one s", so it matches both http and https.
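A quick way to see the difference between the variants discussed in these comments is to run them side by side. The sample string and its example hostnames are made up for illustration. One thing worth noting: with re.findall, writing the optional s as a capturing group, (s)?, changes the result to a list of tuples, so s? or (?:s)? is probably the better fit here.

```python
import re

# Made-up sample containing one http and one https URL.
text = "'http://a.example/x' 'https://b.example/y'"

# [s] requires exactly one "s", so only the https URL matches.
only_https = re.findall(r"'(http[s]://.*?)'", text)

# s? makes the "s" optional, so both URLs match.
both = re.findall(r"'(https?://.*?)'", text)

# (s)? adds a second capturing group, so findall returns tuples.
tuples = re.findall(r"'(http(s)?://.*?)'", text)

print(only_https)  # ['https://b.example/y']
print(both)        # ['http://a.example/x', 'https://b.example/y']
print(tuples)      # [('http://a.example/x', ''), ('https://b.example/y', 's')]
```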
