
I am using Scrapy for one of my projects. The data gets scraped by the spider and is passed to a pipeline for insertion into the database. Here is my database class code:

import MySQLdb


class Database:

    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self):
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db,use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query,params):
        try:
            self.cursor.execute(query,params)
            self.connection.commit()
        except Exception as ex:
            self.connection.rollback()


    def __del__(self):
        self.connection.close()

Here is my pipeline code that processes scraped items and saves them into the MySQL database.

from con import Database 

class LinkPipeline(object):

    def __init__(self):
        self.db=Database()

    def process_item(self, item, spider):
        query="""INSERT INTO links (title, location,company_name,posted_date,status,company_id,scraped_link,content,detail_link,job_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s,%s)"""
        params=(item['title'], item['location'], item['company_name'], item['posted_date'], item['status'], item['company_id'], item['scraped_link'], item['content'], item['detail_link'],item['job_id'])
        self.db.insert(query,params)
        return item

From the above flow, I feel that whenever an item is processed via the pipeline, a database connection is opened and then closed when process_item completes. This would open too many database connections. I want a way for the database connection to be opened only once during the whole life cycle of the spider and closed when the spider is closed.

I read that there are open_spider and close_spider methods in the Spider class; if I use them, how can I pass the reference to the database connection from the spider's start_requests method to the pipeline class?

Are there any better approaches to go about it?

1 Answer

import MySQLdb
import scrapy


class MySpider(scrapy.Spider):
    name = "myspidername"

    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # One connection and cursor for the entire life of the spider.
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db,
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            self.connection.rollback()

    def __del__(self):
        self.connection.close()

Then in your pipeline you can access the cursor as spider.cursor (or call the spider's insert helper) to perform any MySQL operation.

class LinkPipeline(object):

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date, status, company_id, scraped_link, content, detail_link, job_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'], item['posted_date'], item['status'], item['company_id'], item['scraped_link'], item['content'], item['detail_link'], item['job_id'])
        # The cursor itself has no insert() method; call the spider's insert()
        # helper, which executes the statement and commits on the shared connection.
        spider.insert(query, params)
        return item
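
If you would rather keep the database code out of the spider entirely, an item pipeline can also own the connection itself through Scrapy's open_spider and close_spider hooks, which run once per crawl. A minimal sketch, assuming the same MySQL credentials and links table as in the question:

import MySQLdb


class LinkPipeline(object):
    # Sketch: the pipeline owns a single connection for the whole crawl.

    def open_spider(self, spider):
        # Called once when the spider is opened.
        self.connection = MySQLdb.connect('localhost', 'root', 'test123', 'scraping_db',
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        # Called once when the spider is closed.
        self.connection.close()

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date, status, company_id, scraped_link, content, detail_link, job_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'], item['posted_date'], item['status'], item['company_id'], item['scraped_link'], item['content'], item['detail_link'], item['job_id'])
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            self.connection.rollback()
        return item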

1 Comment

Thank you very much Umair. I was unaware that process_item already has a reference to the spider object. I'm just new to Scrapy. Thanks for your help.
