This is part of my HTML code:

<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />

I need to find the hrefs of all stylesheets.

I tried to use a regular expression like

 <link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>

The full code is

import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />'''

real_viraz = r'''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''
r = re.findall(real_viraz, body, re.I|re.DOTALL)
print r

But the problem is that rel='stylesheet' and href='...' can appear in any order inside <link ...>, and almost anything can come between them.

Please help me to find the right regular expression. Thanks.
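For what it's worth, one regex-based workaround (a sketch, not an endorsement) is to match each <link ...> tag as a whole first, then test rel and pull href in separate passes, so attribute order stops mattering:

```python
import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />'''

# Grab each <link ...> tag, then inspect its attributes individually.
hrefs = []
for tag in re.findall(r'<link\b[^>]*>', body, re.I):
    if re.search(r'rel\s*=\s*["\']stylesheet["\']', tag, re.I):
        m = re.search(r'href\s*=\s*["\']([^"\']*)["\']', tag, re.I)
        if m:
            hrefs.append(m.group(1))

print(hrefs)
```

This still breaks on pathological markup (e.g. a > inside an attribute value), which is why the answers below recommend a real parser.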

  • I guess someone is going to paste here a very famous link... Commented Oct 27, 2013 at 16:00

3 Answers


Somehow, your name looks like the power automation tool Sikuli :)

If you are trying to parse HTML/XML-based text in Python, BeautifulSoup (see its documentation) is an extremely powerful library to help you with that. Otherwise, you are indeed reinventing the wheel (an interesting story from Randy Sargent).

from bs4 import BeautifulSoup
# in case you need to get the page first:
#import urllib2
#url = "http://selenium-python.readthedocs.org/en/latest/"
#text = urllib2.urlopen(url).read()
text = """<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" /><link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  /><link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />"""
soup = BeautifulSoup(text)
links = soup.find_all("link", {"rel": "stylesheet"})
for link in links:
    try:
        print link['href']
    except KeyError:
        pass

The output is:

catalog/view/theme/default/stylesheet/stylesheet.css
http://1
http://2

Learn BeautifulSoup well and you are ready to parse anything in HTML or XML. (You might also want to add Selenium and Scrapy to your toolbox in the future.)


2 Comments

The BeautifulSoup parser has been integrated into lxml, and it is much slower than lxml's own HTML parser. So unless you know for sure you have to deal with broken HTML, you should try the stricter, faster parsers first.
@LukasGraf You can do BeautifulSoup(text, 'lxml') to use whatever parser you want, and lxml is one of the options.

Short answer: don't use regular expressions to parse (X)HTML; use an (X)HTML parser.

In Python, this would be lxml. You could parse the HTML using lxml's HTML parser, run an XPath query to get all the link elements, and collect their href attributes:

from lxml import etree

parser = etree.HTMLParser()

# sample.html contains the markup from the question
doc = etree.parse(open('sample.html'), parser)
links = doc.xpath("//head/link[@rel='stylesheet']")
hrefs = [l.attrib['href'] for l in links]

print hrefs

Output:

['catalog/view/theme/default/stylesheet/stylesheet.css', 'http://1', 'http://2']

Comments


I'm amazed by the many developers here on Stack Exchange who insist on using outside modules over the re module for obtaining data and parsing strings, HTML, and CSS. Nothing works more efficiently or faster than re.

These two lines not only grab the CSS stylesheet path, but grab several paths if there is more than one stylesheet, placing them into a Python list for processing or for a urllib request method.

import re

a = re.findall('link rel="stylesheet" href=".*?"', t)  # t holds the HTML source
a = str(a)

Also, for those unaware, there is what most developers know as the HTML comment markup:

<!-- stuff here -->

This lets re process and grab data at will from HTML or CSS, or remove chunks of pesky JavaScript for testing browser capabilities in a single iteration, as shown below.

txt = re.sub('<script>', '<!--', txt)
txt = re.sub('</script>', '-->', txt)
txt = re.sub('<!--.*?-->', '', txt, flags=re.S)
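For instance, on a small made-up snippet (stdlib re only; re.S is added so the comment stripping also works when the script body spans newlines):

```python
import re

txt = '<p>keep</p><script>alert("x");\nmore();</script><p>also keep</p>'
# Turn script tags into comment markers, then strip the comments.
txt = re.sub('<script>', '<!--', txt)
txt = re.sub('</script>', '-->', txt)
txt = re.sub('<!--.*?-->', '', txt, flags=re.S)
print(txt)  # <p>keep</p><p>also keep</p>
```

Note this only works for bare <script> tags; a tag with attributes like <script type="text/javascript"> would slip through this pattern.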

Python retains all the regular expressions from native C, so use them, people. That's what they're for, and nothing is as slow as BeautifulSoup and HTMLParser. Use the re module to grab all your data from HTML tags as well as CSS, or from anything a string can contain. And if you have a problem with a variable not being of type string, then make it a string with a single tiny line of code.

var=str(var)

Comments
