I am using a while loop to scrape several fields on a webpage. I want to save the output of every iteration of the loop as an individual JSON object.

This works perfectly on my machine (Scrapy 0.24.6, Python 2.7.5), but not on an SSH server (Scrapy 1.0.1, Python 2.7.6). I now want to write an item pipeline or an item exporter to ensure that every iteration of the loop is saved as a single JSON object, even when the script runs on the SSH server.

This is my Python code:

from scrapy.spiders import Spider
from blogtexts.items import BlogItem


class BlogText1Spider(Spider):
    name = "texts1"
    allowed_domains = ["blogger.ba"]

    start_urls = ["http://www.blogger.ba/profil/SOKO/blogovi/str1"]

    def parse(self, response):
        position = 1

        while response.xpath(''.join(["//a[@class='blog'][", str(position), "]/@href"])).extract():
            item = BlogItem()
            item["blog"] = response.xpath(''.join(["//a[@class='blog'][", str(position), "]/@href"])).extract()
            item["blogfavoritemarkings"] = response.xpath(''.join(["//a[@class='broj'][", str(position), "]/text()"])).extract()
            item["blogger"] = response.url.split("/")[-3]
            yield item
            position = position + 1

I DON'T want the output to look like this:

{'blog': [u'http://emirnisic.blogger.ba', u'http://soko.blogger.ba'],
 'blogfavoritemarkings': [u'180', u'128'],
 'blogger': 'SOKO'}

The output should instead look like this:

{'blog': [u'http://emirnisic.blogger.ba'],
 'blogfavoritemarkings': [u'180'],
 'blogger': 'SOKO'}
{'blog': [u'http://soko.blogger.ba'],
 'blogfavoritemarkings': [u'128'],
 'blogger': 'SOKO'}

Do you have any recommendations on how I can make sure the output looks as I want? Should I use an item pipeline or item exporter, or instead change the while-loop? Any help is appreciated.

  • A pipeline lets you change the contents of your item, but the item is still a Python object such as scrapy.Item or dict. You want to control how the JSON string is written, so you will need an exporter for that. Commented Aug 14, 2016 at 1:35

1 Answer

Changing the while loop is an option as long as the spider stays this simple. If it gets more complex, I would switch to a custom item exporter that writes the items in the expected form, which keeps the spider independent of the output format.

With this in mind (and to prepare for future changes), I'd say create your own item exporter and build the resulting JSON elements there, possibly with the help of itertools.cycle.
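As a rough illustration of the exporter idea, the sketch below mimics the `export_item` method that Scrapy exporters expose, but in plain Python 3 so that it runs without Scrapy installed. The class name `SplitJsonLinesExporter` and its `list_fields` parameter are hypothetical; in a real project the class would presumably subclass `scrapy.exporters.BaseItemExporter` and be registered through the feed exporter settings. It assumes the list-valued fields have the same length and are in matching order.

```python
import io
import json


class SplitJsonLinesExporter(object):
    """Hypothetical exporter-like class: writes one JSON object per line.

    Not a real Scrapy class; it only mirrors the export_item() interface.
    """

    def __init__(self, stream, list_fields=("blog", "blogfavoritemarkings")):
        self.stream = stream            # any text stream with a write() method
        self.list_fields = list_fields  # fields whose lists are split up

    def export_item(self, item):
        data = dict(item)
        columns = [data[name] for name in self.list_fields]
        # zip(*columns) pairs the n-th entry of every list-valued field.
        for values in zip(*columns):
            row = dict(data)  # copies scalar fields such as "blogger"
            for name, value in zip(self.list_fields, values):
                row[name] = [value]
            self.stream.write(json.dumps(row, sort_keys=True) + "\n")


# Usage with the combined item from the question.
buffer = io.StringIO()
exporter = SplitJsonLinesExporter(buffer)
exporter.export_item({
    "blog": ["http://emirnisic.blogger.ba", "http://soko.blogger.ba"],
    "blogfavoritemarkings": ["180", "128"],
    "blogger": "SOKO",
})
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Each line of the output is then a separate JSON object holding one blog, one marking count, and the shared blogger name.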

2 Comments

I am however wondering which change the while loop would exactly need. I have tried experimenting with it and couldn't come up with a workable solution.
Using the same logic as in the item exporter would do the trick: simply create multiple items from the resulting blogs and blogger. Loop through the blogs result, or `zip` it together with the favorite markings, and for each pair create a new item that also carries the blogger.
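The `zip` suggestion could be sketched roughly as follows. The helper `split_into_items` is a hypothetical name, and plain dicts stand in for `BlogItem` so the example runs without Scrapy. Inside `parse`, the two lists would presumably come from XPath queries without the positional predicate, for example `//a[@class='blog']/@href` and `//a[@class='broj']/text()`; that the two result lists line up one-to-one is an assumption about the page.

```python
def split_into_items(blogs, markings, blogger):
    """Yield one dict per blog, pairing each URL with its marking count."""
    # zip() stops at the shorter list, so unmatched entries are dropped.
    for blog, marking in zip(blogs, markings):
        yield {
            "blog": [blog],
            "blogfavoritemarkings": [marking],
            "blogger": blogger,
        }


# Sample values taken from the question's output.
blogs = ["http://emirnisic.blogger.ba", "http://soko.blogger.ba"]
markings = ["180", "128"]

items = list(split_into_items(blogs, markings, "SOKO"))
for item in items:
    print(item)
```

In the spider, the body of `parse` would then yield one `BlogItem` per pair instead of calling a helper, with `blogger` taken from `response.url` as before.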
