
I want to extract URLs from a particular website using Scrapy in Python. The site has the following HTML structure:

<div class="comic-table">
<div id="comic">
		<img src="http://demowebsite.com/uploads/image1" alt="" title="">
		<img src="http://demowebsite.com/uploads/image2" alt="" title="">
</div>
</div>

Here is the Scrapy code I have written:

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from Pencils.items import PencilsItem

class Spider(CrawlSpider):
    name = 'pencil'
    allowed_domains = ['demowebsite.com']
    start_urls = ['http://demowebsite.com']
    rules = [Rule(LinkExtractor(allow=['/uploads/.*']), 'parse_pencil')]

    def parse_pencil(self, response):
        image = PencilsItem()
        rel = response.xpath("WHAT_SHOULD_I_PUT_HERE").extract()
        image['image_urls'] = ['http:' + rel[0]]
        return image

What should I put in the response.xpath call?

P.S. I'm a beginner in HTML and Python.

2 Answers


Try this:

    '//div[@id="comic"]/img'

//   =>  search the whole html page
@    =>  attribute 

That XPath matches every <div> tag whose id attribute equals "comic" (there should be only one such <div>, since an id must be unique on a page) and selects the <img> tags inside it.
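To see what that XPath matches without running a spider, here is a small stand-alone sketch using the standard library's xml.etree.ElementTree, which supports a subset of XPath. Note that ElementTree needs well-formed XML, so the <img> tags are self-closed here, and descendant search is spelled .// rather than //:

```python
import xml.etree.ElementTree as ET

# The sample markup from the question, with <img> tags self-closed
# so that it parses as XML.
html = """<div class="comic-table">
<div id="comic">
    <img src="http://demowebsite.com/uploads/image1" alt="" title=""/>
    <img src="http://demowebsite.com/uploads/image2" alt="" title=""/>
</div>
</div>"""

root = ET.fromstring(html)
# .//div[@id='comic']/img : any descendant <div> with id="comic",
# then its <img> children.
imgs = root.findall(".//div[@id='comic']/img")
srcs = [img.get("src") for img in imgs]
print(srcs)
```

Scrapy's selectors use full XPath (via lxml), so the leading // form in the answer works as-is there.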

With scrapy you can do something like the following to get all the <img> tags:

import scrapy

class TestSpider(scrapy.Spider):
    name = "my_spider"

    start_urls = [
        "file:///Users/7stud/python_programs/scrapy_stuff/html_files/html.html"
    ]

    def parse(self, response):
        for selector in response.xpath('//div[@id="comic"]/img'):
            src = selector.xpath('@src').extract()
            print src[0]



--output:--
(scrapy_env)~/python_programs/scrapy_stuff$ scrapy crawl my_spider
2016-03-29 02:19:09 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapy_stuff)
2016-03-29 02:19:09 [scrapy] INFO: Optional features available: ssl, http11
2016-03-29 02:19:09 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_stuff.spiders', 'SPIDER_MODULES': ['scrapy_stuff.spiders'], 'BOT_NAME': 'scrapy_stuff'}
2016-03-29 02:19:09 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-29 02:19:09 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-29 02:19:09 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-29 02:19:09 [scrapy] INFO: Enabled item pipelines: 
2016-03-29 02:19:09 [scrapy] INFO: Spider opened
2016-03-29 02:19:09 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-29 02:19:09 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-29 02:19:09 [scrapy] DEBUG: Crawled (200) <GET file:///Users/7stud/python_programs/scrapy_stuff/html_files/html.html> (referer: None)
http://demowebsite.com/uploads/image1
http://demowebsite.com/uploads/image2
2016-03-29 02:19:09 [scrapy] INFO: Closing spider (finished)
2016-03-29 02:19:09 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 263,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 243,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 29, 8, 19, 9, 251971),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 3, 29, 8, 19, 9, 139531)}
2016-03-29 02:19:09 [scrapy] INFO: Spider closed (finished)
(scrapy_env)~/python_programs/scrapy_stuff$ 

And in fact, if all you want is the src attribute from the <img> tags, you can get the src attributes directly using the following xpath:

def parse(self, response):
    for selector in response.xpath('//div[@id="comic"]/img/@src'):
        print selector.extract()

--output:--
...
2016-03-29 02:33:56 [scrapy] DEBUG: Crawled (200) <GET file:///Users/7stud/python_programs/scrapy_stuff/html_files/html.html> (referer: None)
http://demowebsite.com/uploads/image1
http://demowebsite.com/uploads/image2
2016-03-29 02:33:57 [scrapy] INFO: Closing spider (finished)
...

    P.S I'm a beginner in HTML and Python

What about XML and XPath? The subject you really need to explore is XPath. But as a beginner to HTML and XPath, I would suggest you start with BeautifulSoup for scraping web pages.
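For comparison, a minimal BeautifulSoup sketch of the same extraction might look like the following (this assumes the beautifulsoup4 package is installed, and uses the sample HTML from the question):

```python
from bs4 import BeautifulSoup

# The sample markup from the question.
html = """
<div class="comic-table">
<div id="comic">
    <img src="http://demowebsite.com/uploads/image1" alt="" title="">
    <img src="http://demowebsite.com/uploads/image2" alt="" title="">
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
comic = soup.find("div", id="comic")  # the unique id="comic" div
srcs = [img["src"] for img in comic.find_all("img")]
print(srcs)
```

BeautifulSoup is forgiving about malformed HTML and its find/find_all API reads closer to plain English than XPath, which is why it is often recommended to beginners.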



In order to get all links you should use

response.xpath("//div[@id='comic']/img/@src").extract()

and your code will look like:

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from Pencils.items import PencilsItem

class Spider(CrawlSpider):
    name = 'pencil'
    allowed_domains = ['demowebsite.com']
    start_urls = ['http://demowebsite.com']
    rules = [Rule(LinkExtractor(allow=['/uploads/.*']), 'parse_pencil')]

    def parse_pencil(self, response):
        item = PencilsItem()
        item['image_urls'] = response.xpath("//div[@id='comic']/img/@src").extract()
        yield item

Use this code if the img src doesn't contain the domain:

from urlparse import urlparse
parsed_uri = urlparse(response.url)
domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
links = [domain+link for link in response.xpath("//div[@id='comic']/img/@src").extract()]
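The snippet above is Python 2; under Python 3, urlparse lives in urllib.parse. A sketch of the same domain-joining idea (the demowebsite.com URL is just the question's placeholder):

```python
from urllib.parse import urlparse


def absolutize(page_url, paths):
    """Prefix relative paths with the scheme and domain of page_url."""
    uri = urlparse(page_url)
    # Build "scheme://netloc/" from the URL the response came from.
    domain = '{0}://{1}/'.format(uri.scheme, uri.netloc)
    return [domain + path.lstrip('/') for path in paths]


links = absolutize('http://demowebsite.com/some/page',
                   ['/uploads/image1', 'uploads/image2'])
print(links)
```

Scrapy responses also provide response.urljoin(link), which resolves a relative link against the response URL for you and is usually the simpler choice inside a spider.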

