I am building a crawler using Scrapy. I need to get the font-family assigned to a particular HTML element.

Let's say there is a CSS file, styles.css, which contains the following:

p {
    font-family: "Times New Roman", Georgia, Serif;
}

And in the HTML page there is text as follows:

<p>Hello how are you?</p>

It's easy for me to extract the text using Scrapy; however, I would also like to know the font-family applied to "Hello how are you?"

I am hoping it is simply a case of (imaginary XPath) /p[font-family] or something like that.

Do you know how I can do this?

Thanks for your thoughts.

  • Personally I don't think that's something that could be handled by Scrapy :( You might need to look into something like an HTML renderer. Commented Sep 20, 2016 at 7:47
  • You can have a look at pythonhosted.org/tinycss Commented Sep 20, 2016 at 8:35

1 Answer

You need to download and parse the CSS separately. For CSS parsing you can use tinycss or even regex:

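import tinycss
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = [
        'http://some.url.com'
    ]
    # maps each CSS selector to the list of values declared for it
    css_rules = {}

    def parse(self, response):
        # find css url and parse it; this XPath is the one suggested in the
        # comments below and takes only the first linked stylesheet
        css_url = response.xpath("//link[@rel='stylesheet']/@href").extract_first()
        if css_url:
            # the href may be relative, so resolve it against the page URL
            yield Request(response.urljoin(css_url), self.parse_css)

    def parse_css(self, response):
        parser = tinycss.make_parser()
        # parse_stylesheet expects text, so pass the decoded body
        stylesheet = parser.parse_stylesheet(response.text)
        for rule in stylesheet.rules:
            # at-rules such as @import have no selector; skip them
            if not getattr(rule, 'selector', None):
                continue
            path = rule.selector.as_css()
            css = [d.value.as_css() for d in rule.declarations]
            self.css_rules[path] = css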

Now you have a dictionary that maps CSS selectors to their declared values, which you can use later in your spider's request chain to populate item fields:

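    # another method of MySpider
    def parse_item(self, response):
        item = {}
        item['name'] = response.css('div.name').extract_first()
        name_css = []
        # crude match: collect rules whose selector mentions both div and .name
        for k, v in self.css_rules.items():
            if 'div' in k and '.name' in k:
                name_css.append(v)
        item['name_css'] = name_css
        return item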
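
As a rough illustration of the regex route mentioned above, the sketch below parses simple rules into a dictionary keyed by selector and then looks up font-family for p, using the stylesheet from the question. The names parse_simple_css, RULE_RE and COMMENT_RE are made up for this example, and the pattern is an assumption that only covers flat rules: it is not aware of at-rules such as @media, and it will mis-split values that contain a semicolon or selectors that contain a comma inside parentheses.

```python
import re

# Assumed pattern: "selector(s) { declarations }" with no nested braces.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
COMMENT_RE = re.compile(r"/\*.*?\*/", re.S)


def parse_simple_css(text):
    """Return {selector: {property: value}} for simple, un-nested rules."""
    rules = {}
    for selectors, body in RULE_RE.findall(COMMENT_RE.sub("", text)):
        declarations = {}
        for declaration in body.split(";"):
            # split at the first colon only, so values may contain colons
            name, sep, value = declaration.partition(":")
            if sep:
                declarations[name.strip().lower()] = value.strip()
        # a group such as "h1, p { ... }" applies to each selector in it
        for selector in selectors.split(","):
            selector = selector.strip()
            if selector:
                rules.setdefault(selector, {}).update(declarations)
    return rules


css = '''
p {
    font-family: "Times New Roman", Georgia, Serif;
}
'''
rules = parse_simple_css(css)
print(rules["p"]["font-family"])
```

Unlike the spider above, this keeps the property names, so the font-family value can be looked up directly instead of being one entry in a list of values.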

3 Comments

Thanks for your reply. If the page has multiple CSS files (e.g. Bootstrap, Normalize, etc.) and these files (for example) contain multiple styling for p elements, would your code select the actual p CSS styling being used by the p elements on the page, or would it also select the unused p CSS stylings in the CSS files? As an example, I could create many CSS files and have multiple style entries for p in them, but the HTML on my page might only use one of the p styles due to nesting or some other CSS rule.
AFAIK the HTML has to declare which CSS files it uses, so you can just select those and parse them. For example, on Stack Overflow you can find them via response.xpath("//link[@rel='stylesheet']/@href"). If the page links multiple CSS files, then it uses all of them, so you need to parse every one to build yourself a dictionary or a tree of sorts.
Thanks. I need to think about your solution to ensure I understand it. I will get back to you either way!
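
On the question of which p rule actually wins when several stylesheets are linked, one simplified approach is to merge the parsed rules in the order the files appear in the page, so that later declarations override earlier ones for the same selector. The sketch below assumes each stylesheet has already been parsed into a {selector: {property: value}} dictionary; merge_stylesheets and the sample data are hypothetical. It ignores specificity, !important, inheritance and media queries, so it only approximates the cascade. Resolving the style a browser would really apply likely needs an HTML renderer, as the first comment under the question suggests.

```python
def merge_stylesheets(parsed_sheets):
    """Merge {selector: {property: value}} dicts; later sheets win."""
    merged = {}
    for sheet in parsed_sheets:
        for selector, declarations in sheet.items():
            # update() overwrites properties set by earlier sheets
            merged.setdefault(selector, {}).update(declarations)
    return merged


# Hypothetical data standing in for two parsed files, in page order
normalize = {"p": {"font-family": "sans-serif", "margin": "0"}}
site = {"p": {"font-family": '"Times New Roman", Georgia, Serif'}}

merged = merge_stylesheets([normalize, site])
print(merged["p"]["font-family"])  # value from the later sheet
print(merged["p"]["margin"])       # untouched value from the earlier sheet
```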
