I am using a while loop to scrape several fields on a webpage, and I want to save the output of every iteration of the loop as an individual JSON object.
This works perfectly on my machine (Scrapy 0.24.6, Python 2.7.5), but not on an SSH server (Scrapy 1.0.1, Python 2.7.6). I now want to write an item pipeline or an item exporter to ensure that every iteration of the loop is saved as a single JSON object, even when the script runs on the SSH server.
This is my Python code:
from scrapy.spiders import Spider
from blogtexts.items import BlogItem

class BlogText1Spider(Spider):
    name = "texts1"
    allowed_domains = ["blogger.ba"]
    start_urls = ["http://www.blogger.ba/profil/SOKO/blogovi/str1"]

    def parse(self, response):
        position = 1
        while response.xpath(''.join(["//a[@class='blog'][", str(position), "]/@href"])).extract():
            item = BlogItem()
            item["blog"] = response.xpath(''.join(["//a[@class='blog'][", str(position), "]/@href"])).extract()
            item["blogfavoritemarkings"] = response.xpath(''.join(["//a[@class='broj'][", str(position), "]/text()"])).extract()
            item["blogger"] = response.url.split("/")[-3]
            yield item
            position = position + 1
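As an aside, the repeated positional XPath lookups can be avoided by extracting both lists once and pairing them with zip(), so each pair becomes its own item. A minimal sketch on plain lists (outside Scrapy, so the response object is left out; the values are taken from the desired output shown in the question):

```python
# Stand-ins for the two .extract() results on one page.
blogs = [u'http://emirnisic.blogger.ba', u'http://soko.blogger.ba']
markings = [u'180', u'128']

# zip() pairs the n-th blog link with the n-th favourite count,
# yielding one dict per pair instead of one dict holding both lists.
items = [
    {'blog': [blog], 'blogfavoritemarkings': [mark], 'blogger': 'SOKO'}
    for blog, mark in zip(blogs, markings)
]

for item in items:
    print(item)
```

Inside the spider, the same pattern would iterate over the two extracted lists and yield a BlogItem per pair.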
I DON'T want the output to look like this:
{'blog': [u'http://emirnisic.blogger.ba', u'http://soko.blogger.ba'],
'blogfavoritemarkings': [u'180', u'128'],
'blogger': 'SOKO'}
The output should instead look like this:
{'blog': [u'http://emirnisic.blogger.ba'],
'blogfavoritemarkings': [u'180'],
'blogger': 'SOKO'}
{'blog': [u'http://soko.blogger.ba'],
'blogfavoritemarkings': [u'128'],
'blogger': 'SOKO'}
Do you have any recommendations on how I can make the output look the way I want? Should I use an item pipeline or an item exporter, or change the while loop instead? Any help is appreciated.
Each iteration of your loop already yields a single scrapy.Item or dict. But since you want to control how the JSON string output is formatted, you will need to use an exporter (or a pipeline) for that.
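A minimal pipeline sketch along those lines, assuming one JSON object per line (JSON Lines) is acceptable; the class name JsonLinesPipeline and the file name items.jl are my own choices, and the class would still need to be enabled via ITEM_PIPELINES in settings.py:

```python
import json

class JsonLinesPipeline(object):
    """Write every scraped item as one self-contained JSON object per line."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item instances and plain dicts.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
```

Alternatively, Scrapy can do this without any custom code: running `scrapy crawl texts1 -o items.jl` selects the built-in JSON Lines exporter from the .jl extension and writes exactly one JSON object per item per line.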