(Scrapy) How to get the CSS rule for an HTML element?

Question

I am building a crawler using Scrapy. I need to get the font-family assigned to a particular HTML element.

Let's say there is a css file, styles.css, which contains the following:

p {
    font-family: "Times New Roman", Georgia, Serif;
}

And in the HTML page there is text as follows:

<p>Hello how are you?</p>

It's easy for me to extract the text using Scrapy; however, I would also like to know the font-family applied to "Hello how are you?"

I am hoping it is simply a case of (imaginary XPath) /p[font-family] or something like that.

Do you know how I can do this?

Thanks for your thoughts.

Personally I don't think that's something that could be handled by Scrapy :( You might need to look into something like an HTML renderer.
– starrify, Commented Sep 20, 2016 at 7:47

Granitosaurus · Accepted Answer · 2016-09-20 08:58:59Z


You need to download and parse the CSS separately. For CSS parsing you can use tinycss or even regex:

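import tinycss
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = [
        'http://some.url.com'
    ]
    css_rules = {}  # maps selector text -> {property: value}

    def parse(self, response):
        # find css url and parse it; this XPath is the one suggested in the
        # comments below and may need adjusting for the target site
        css_url = response.xpath("//link[@rel='stylesheet']/@href").extract_first()
        yield Request(response.urljoin(css_url), self.parse_css)

    def parse_css(self, response):
        parser = tinycss.make_parser()
        # parse_stylesheet expects text, so use response.text rather than the raw bytes
        stylesheet = parser.parse_stylesheet(response.text)
        for rule in stylesheet.rules:
            # at-rules such as @import have no selector; skip them
            if not getattr(rule, 'selector', None):
                continue
            path = rule.selector.as_css()
            css = {d.name: d.value.as_css() for d in rule.declarations}
            self.css_rules.setdefault(path, {}).update(css)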

Now you have a dictionary that maps CSS selectors to their declarations, which you can use later in your spider's request chain to assign some values:

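# another method of MySpider
def parse_item(self, response):
    item = {}
    item['name'] = response.css('div.name').extract_first()
    # collect the declarations of every rule whose selector mentions div and .name
    name_css = []
    for k, v in self.css_rules.items():
        if 'div' in k and '.name' in k:
            name_css.append(v)
    item['name_css'] = name_css
    yield item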
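
The answer mentions plain regex as an alternative to tinycss. A minimal sketch of that route follows, using only the standard library and the stylesheet from the question. The function name `parse_css_rules` is hypothetical, and the pattern assumes flat rule sets: it ignores the context of nested blocks such as @media and does not handle braces or semicolons inside quoted strings.

```python
import re

# Matches "selector(s) { declarations }" for rule sets without nested braces.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def parse_css_rules(css_text):
    """Return {selector: {property: value}} for flat CSS rule sets."""
    # Drop /* ... */ comments before matching.
    css_text = re.sub(r"/\*.*?\*/", "", css_text, flags=re.S)
    rules = {}
    for selectors, body in RULE_RE.findall(css_text):
        declarations = {}
        for decl in body.split(";"):
            name, sep, value = decl.partition(":")
            if sep:  # skip empty fragments that contain no "property: value"
                declarations[name.strip().lower()] = value.strip()
        # "h1, h2 { ... }" applies the same declarations to each selector.
        for selector in selectors.split(","):
            rules.setdefault(selector.strip(), {}).update(declarations)
    return rules


css = 'p {\n    font-family: "Times New Roman", Georgia, Serif;\n}'
rules = parse_css_rules(css)
print(rules["p"]["font-family"])
```

For anything beyond simple stylesheets a real parser such as tinycss is the safer choice; the regex only works while the CSS stays this regular.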

edited Sep 20, 2016 at 8:58

answered Sep 20, 2016 at 8:42

Granitosaurus


3 Comments

Tom Brock Over a year ago

Thanks for your reply. If the page has multiple CSS files (e.g. Bootstrap, Normalize, etc.) and these files (for example) contain multiple styling for p elements, would your code select the actual p CSS styling being used by the p elements on the page, or would it also select the unused p CSS stylings in the CSS files? As an example, I could create many CSS files and have multiple style entries for p in them, but the HTML on my page might only use one of the p styles due to nesting or some other CSS rule.

Granitosaurus Over a year ago

AFAIK the HTML has to specify which CSS it's using, so you can just select that and parse it. E.g. for Stack Overflow you can find it via response.xpath("//link[@rel='stylesheet']/@href"). If the page has multiple CSS files, then it will use all of them, so you need to parse each one to build yourself a dictionary or a tree of sorts.

Tom Brock Over a year ago

Thanks. I need to think about your solution to ensure I understand it. I will get back to you either way!
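
Following the suggestion above to parse every linked stylesheet into one dictionary, a rough sketch of the merge step might look like this. It assumes each stylesheet has already been parsed into a `{selector: {property: value}}` dict and that the sheets are passed in the order the page links them. The function name `merge_stylesheets` and the sample sheet contents are hypothetical. Later declarations simply overwrite earlier ones for the same selector text, so this is only an approximation: it ignores selector specificity, `!important`, inheritance, and whether a selector actually matches the element, all of which a browser takes into account.

```python
def merge_stylesheets(parsed_sheets):
    """Merge {selector: {property: value}} dicts in document order.

    For identical selector text, a declaration from a later sheet
    overrides the same property from an earlier sheet.
    """
    merged = {}
    for sheet in parsed_sheets:
        for selector, declarations in sheet.items():
            merged.setdefault(selector, {}).update(declarations)
    return merged


# Hypothetical parsed stylesheets, in the order the page links them.
normalize = {"p": {"font-family": "sans-serif", "margin": "0"}}
site = {"p": {"font-family": '"Times New Roman", Georgia, Serif'}}

merged = merge_stylesheets([normalize, site])
print(merged["p"])
```

Knowing the font a browser really applies to a given element still requires rendering the page, as the first comment on the question points out.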
