
Hi, I'm writing a web crawler in Python to extract news articles from news websites like nytimes.com. What would be a good DB to use as a backend for this project?

Thanks in advance!

2 Comments

  • How many records do you expect the database to hold? What will the fields be? How big will the database be? What type of searches do you wish to perform? Will there be multiple users accessing the DB? Commented Jan 27, 2010 at 0:21
  • Well, as to how many records: right now only very few, but basically the idea is to index all the news articles on a particular news website, and there won't be multiple users accessing the DB. Commented Jan 27, 2010 at 0:24

4 Answers

7

This could be a great project to use a document database like CouchDB, MongoDB, or SimpleDB.

MongoDB has a hosted solution: http://mongohq.com. There is also a binding for Python (PyMongo).

SimpleDB is a great choice if you are hosting this on Amazon Web Services.

CouchDB is an open source package from the Apache Foundation.
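
As a concrete illustration of the document-store approach, here is a minimal PyMongo sketch. It assumes a MongoDB instance running locally; the database and collection names ("newscrawler", "articles") and the article fields are placeholders for illustration, not anything prescribed by the answer:

    # Minimal sketch, assuming MongoDB on localhost and the pymongo package.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    db = client["newscrawler"]     # hypothetical database name
    articles = db["articles"]      # hypothetical collection name

    # Each crawled article is just a dict; a document store needs no fixed schema.
    article = {
        "url": "https://www.nytimes.com/example-article",
        "title": "Example headline",
        "body": "Full article text goes here...",
        "crawled_at": "2010-01-27T00:21:00Z",
    }

    # Upsert keyed on the URL so re-crawling the same page doesn't create duplicates.
    articles.update_one({"url": article["url"]}, {"$set": article}, upsert=True)

Because each article is stored as a plain document, you can add or drop fields as the crawler evolves without migrating a schema.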


2 Comments

If the number of records increases, will these DBs be able to cope?
That is part of why I think a crawler would be well suited to these DBs. Google's underlying database is BigTable which is similar in design to the databases I mentioned. SimpleDB has a 10GB limit per domain and a 2500 result limit on SELECT statements. I don't know of any size limitations for CouchDB or MongoDB (doesn't mean they aren't there, just that I couldn't find them with a Google search).
3

Personally, I love PostgreSQL -- but other free DBs such as MySQL (or, if you have reasonably small amounts of data -- a few GB at most -- even the SQLite that comes with Python) will be fine too.
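
For the SQLite route, a rough sketch using the sqlite3 module that ships with Python could look like this; the file name and column layout are assumptions for illustration:

    # Minimal sketch using the sqlite3 module bundled with Python.
    import sqlite3

    conn = sqlite3.connect("articles.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url        TEXT PRIMARY KEY,
            title      TEXT,
            body       TEXT,
            crawled_at TEXT
        )
        """
    )

    # INSERT OR REPLACE keyed on the URL avoids duplicate rows when re-crawling a page.
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, body, crawled_at) VALUES (?, ?, ?, ?)",
        ("https://www.nytimes.com/example-article", "Example headline", "Full text...", "2010-01-27"),
    )
    conn.commit()
    conn.close()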

2 Comments

+1 Beat me to it. I would personally go with MySQL over Postgres, but that's just because I'm already familiar with it.
Don't use a hammer when you have no nails! For this specific use case, document databases are pretty much in the sweet spot: they are scalable and fast, and when you don't have to worry about transactions, why would you choose an SQL database?
1

I think the database itself will probably be one of the easier aspects of a web crawler like this.

If you expect a high read or write load on the database (for example, if you intend to run many crawlers at the same time), then you will want to steer in the direction of MySQL; otherwise something like SQLite will probably do you just fine.
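
If you do end up on the MySQL side, a connection sketch with a driver such as PyMySQL might look roughly like this; the driver choice, server, credentials, and table layout are all assumptions, not part of the answer:

    # Rough sketch only; assumes the PyMySQL driver and a MySQL server that
    # already has a "newscrawler" database. Names and credentials are placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="crawler", password="secret", database="newscrawler")
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                CREATE TABLE IF NOT EXISTS articles (
                    url        VARCHAR(512) PRIMARY KEY,
                    title      TEXT,
                    body       MEDIUMTEXT,
                    crawled_at DATETIME
                )
                """
            )
            # REPLACE INTO keyed on the URL keeps re-crawled pages from duplicating rows.
            cur.execute(
                "REPLACE INTO articles (url, title, body, crawled_at) VALUES (%s, %s, %s, NOW())",
                ("https://www.nytimes.com/example-article", "Example headline", "Full text..."),
            )
        conn.commit()
    finally:
        conn.close()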


0

You can take a look at Firebird.

The Firebird Python driver is developed by the core team.
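
For completeness, a sketch with the fdb Python driver could look roughly like this; the DSN, credentials, and table are assumptions, and the database file is presumed to already exist:

    # Rough sketch only; assumes the fdb driver and an existing Firebird database.
    import fdb

    con = fdb.connect(
        dsn="localhost:/var/lib/firebird/articles.fdb",  # hypothetical database path
        user="SYSDBA",
        password="masterkey",
    )
    cur = con.cursor()
    cur.execute(
        "INSERT INTO articles (url, title, body) VALUES (?, ?, ?)",
        ("https://www.nytimes.com/example-article", "Example headline", "Full text..."),
    )
    con.commit()
    con.close()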

