This is part of my HTML code:

<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />

I need to find the hrefs of all stylesheets.

I tried to use a regular expression like

 <link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>

The full code is

import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />'''

real_viraz = r'''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''
r = re.findall(real_viraz, body, re.I|re.DOTALL)
print r

But the problem is that rel='stylesheet' and href='...' can appear in any order inside <link ...>, and almost anything can come between them.

Please help me to find the right regular expression. Thanks.
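For what it's worth, one regex-based workaround (a sketch, not an endorsement) is to match each <link ...> tag as a whole first, then test rel and pull href in separate passes, so attribute order stops mattering:

```python
import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />'''

# Grab each <link ...> tag, then inspect its attributes individually.
hrefs = []
for tag in re.findall(r'<link\b[^>]*>', body, re.I):
    if re.search(r'rel\s*=\s*["\']stylesheet["\']', tag, re.I):
        m = re.search(r'href\s*=\s*["\']([^"\']*)["\']', tag, re.I)
        if m:
            hrefs.append(m.group(1))

print(hrefs)
```

This still breaks on pathological markup (e.g. a > inside an attribute value), which is why the answers below recommend a real parser.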

  • I guess someone is going to paste here a very famous link... Commented Oct 27, 2013 at 16:00

3 Answers


Somehow, your name looks like the power automation tool Sikuli :)

If you are trying to parse HTML/XML-based text in Python, BeautifulSoup (see its documentation) is an extremely powerful library to help you with that. Otherwise, you are indeed reinventing the wheel (an interesting story from Randy Sargent).

from bs4 import BeautifulSoup
# in case you need to get the page first:
#import urllib2
#url = "http://selenium-python.readthedocs.org/en/latest/"
#text = urllib2.urlopen(url).read()
text = """<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" /><link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  /><link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />"""
soup = BeautifulSoup(text)
links = soup.find_all("link", {"rel": "stylesheet"})
for link in links:
    try:
        print link['href']
    except KeyError:
        pass

The output is:

catalog/view/theme/default/stylesheet/stylesheet.css
http://1
http://2

Learn BeautifulSoup well and you are ready to parse anything in HTML or XML. (You might also want to add Selenium and Scrapy to your toolbox in the future.)


2 Comments

The BeautifulSoup parser has been integrated into lxml, and it is much slower than lxml's own HTML parser. So unless you know for sure you have to deal with broken HTML, you should try the stricter, faster parsers first.
@LukasGraf You can do BeautifulSoup(text, 'lxml') to use whatever parser you want, and lxml is one of the options.

Short answer: don't use regular expressions to parse (X)HTML; use an (X)HTML parser.

In Python, this would be lxml. You could parse the HTML using lxml's HTML parser, run an XPath query to get all the link elements, and collect their href attributes:

from lxml import etree

parser = etree.HTMLParser()

# sample.html contains the markup from the question
doc = etree.parse(open('sample.html'), parser)
links = doc.xpath("//head/link[@rel='stylesheet']")
hrefs = [l.attrib['href'] for l in links]

print hrefs

Output:

['catalog/view/theme/default/stylesheet/stylesheet.css', 'http://1', 'http://2']

Comments


I'm amazed by the many developers here on Stack Exchange who insist on using outside modules over the re module for obtaining data and parsing strings, HTML, and CSS. Nothing works more efficiently or faster than re.

These two lines not only grab the CSS stylesheet path, but grab several paths if there is more than one stylesheet, placing them into a Python list for processing or for a urllib request method.

import re

a = re.findall('link rel="stylesheet" href=".*?"', t)  # t holds the HTML source
a = str(a)

Also, for those unaware, there is what most developers know as the HTML comment markup:

<!-- stuff here -->

This lets re process and grab data at will from HTML or CSS, or remove chunks of pesky JavaScript for testing browser capabilities in a single iteration, as shown below.

txt = re.sub('<script>', '<!--', txt)
txt = re.sub('</script>', '-->', txt)
txt = re.sub('<!--.*?-->', '', txt, flags=re.S)
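For instance, on a small made-up snippet (stdlib re only; re.S is added so the comment stripping also works when the script body spans newlines):

```python
import re

txt = '<p>keep</p><script>alert("x");\nmore();</script><p>also keep</p>'
# Turn script tags into comment markers, then strip the comments.
txt = re.sub('<script>', '<!--', txt)
txt = re.sub('</script>', '-->', txt)
txt = re.sub('<!--.*?-->', '', txt, flags=re.S)
print(txt)  # <p>keep</p><p>also keep</p>
```

Note this only works for bare <script> tags; a tag with attributes like <script type="text/javascript"> would slip through this pattern.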

Python retains all the regular expressions from native C, so use them, people. That's what they're for, and nothing is as slow as BeautifulSoup and HTMLParser. Use the re module to grab all your data from HTML tags as well as CSS, or from anything a string can contain. And if you have a problem with a variable not being of type string, then make it a string with a single tiny line of code.

var=str(var)

Comments
