
Hi, I'm writing a web crawler in Python to extract news articles from news websites like nytimes.com. What would be a good DB to use as a backend for this project?

Thanks in advance!

2 Comments

  • How many records do you expect the database to hold? What will the fields be? How big will the database be? What type of searches do you wish to perform? Will there be multiple users accessing the DB? Commented Jan 27, 2010 at 0:21
  • Well, as to how many records: right now only very few, but basically the idea is to index all the news articles on a particular news website, and there won't be multiple users accessing the DB. Commented Jan 27, 2010 at 0:24

4 Answers

7

This could be a great project to use a document database like CouchDB, MongoDB, or SimpleDB.

MongoDB has a hosted solution: http://mongohq.com. There is also a binding for Python (PyMongo).

SimpleDB is a great choice if you are hosting this on Amazon Web Services.

CouchDB is an open source package from the Apache Foundation.
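
As a concrete illustration of the document-store approach, here is a minimal PyMongo sketch. It assumes a MongoDB instance running locally; the database and collection names ("newscrawler", "articles") and the article fields are placeholders for illustration, not anything prescribed by the answer:

    # Minimal sketch, assuming MongoDB on localhost and the pymongo package.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    db = client["newscrawler"]     # hypothetical database name
    articles = db["articles"]      # hypothetical collection name

    # Each crawled article is just a dict; a document store needs no fixed schema.
    article = {
        "url": "https://www.nytimes.com/example-article",
        "title": "Example headline",
        "body": "Full article text goes here...",
        "crawled_at": "2010-01-27T00:21:00Z",
    }

    # Upsert keyed on the URL so re-crawling the same page doesn't create duplicates.
    articles.update_one({"url": article["url"]}, {"$set": article}, upsert=True)

Because each article is stored as a plain document, you can add or drop fields as the crawler evolves without migrating a schema.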


2 Comments

If the number of records increases, will these DBs be able to cope?
That is part of why I think a crawler would be well suited to these DBs. Google's underlying database is BigTable which is similar in design to the databases I mentioned. SimpleDB has a 10GB limit per domain and a 2500 result limit on SELECT statements. I don't know of any size limitations for CouchDB or MongoDB (doesn't mean they aren't there, just that I couldn't find them with a Google search).
3

Personally, I love PostgreSQL -- but other free DBs such as MySQL (or, if you have reasonably small amounts of data -- a few GB at most -- even the SQLite that comes with Python) will be fine too.
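
For the SQLite route, a rough sketch using the sqlite3 module that ships with Python could look like this; the file name and column layout are assumptions for illustration:

    # Minimal sketch using the sqlite3 module bundled with Python.
    import sqlite3

    conn = sqlite3.connect("articles.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url        TEXT PRIMARY KEY,
            title      TEXT,
            body       TEXT,
            crawled_at TEXT
        )
        """
    )

    # INSERT OR REPLACE keyed on the URL avoids duplicate rows when re-crawling a page.
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, body, crawled_at) VALUES (?, ?, ?, ?)",
        ("https://www.nytimes.com/example-article", "Example headline", "Full text...", "2010-01-27"),
    )
    conn.commit()
    conn.close()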

2 Comments

+1 Beat me to it. I would personally go with MySQL over Postgres, but that's just because I'm already familiar with it.
Don't use a hammer when you have no nails! For this specific use case, document databases are pretty much in the sweet spot: they are scalable and fast, and when you don't have to worry about transactions, why would you choose an SQL database?
1

I think the database itself will probably be one of the easier aspects of a web crawler like this.

If you expect a high read or write load on the database (for example, if you intend to run many crawlers at the same time), then you will want to steer in the direction of MySQL; otherwise something like SQLite will probably do you just fine.
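
If you do end up on the MySQL side, a connection sketch with a driver such as PyMySQL might look roughly like this; the driver choice, server, credentials, and table layout are all assumptions, not part of the answer:

    # Rough sketch only; assumes the PyMySQL driver and a MySQL server that
    # already has a "newscrawler" database. Names and credentials are placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="crawler", password="secret", database="newscrawler")
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                CREATE TABLE IF NOT EXISTS articles (
                    url        VARCHAR(512) PRIMARY KEY,
                    title      TEXT,
                    body       MEDIUMTEXT,
                    crawled_at DATETIME
                )
                """
            )
            # REPLACE INTO keyed on the URL keeps re-crawled pages from duplicating rows.
            cur.execute(
                "REPLACE INTO articles (url, title, body, crawled_at) VALUES (%s, %s, %s, NOW())",
                ("https://www.nytimes.com/example-article", "Example headline", "Full text..."),
            )
        conn.commit()
    finally:
        conn.close()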


0

You can take a look at Firebird.

The Firebird Python driver is developed by the core team.
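
For completeness, a sketch with the fdb Python driver could look roughly like this; the DSN, credentials, and table are assumptions, and the database file is presumed to already exist:

    # Rough sketch only; assumes the fdb driver and an existing Firebird database.
    import fdb

    con = fdb.connect(
        dsn="localhost:/var/lib/firebird/articles.fdb",  # hypothetical database path
        user="SYSDBA",
        password="masterkey",
    )
    cur = con.cursor()
    cur.execute(
        "INSERT INTO articles (url, title, body) VALUES (?, ?, ?)",
        ("https://www.nytimes.com/example-article", "Example headline", "Full text..."),
    )
    con.commit()
    con.close()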

