
Hopefully this question is not too open-ended. In a few words: I'm looking for a scripting or programming language with fast but easy access to a PostgreSQL database.

I want to use the results of queries on some tables in a PostgreSQL database as input for some R analysis. The queries are simple SELECT statements (there may be room to improve the queries themselves, but I'm not looking in that direction for now; I already did that a while ago), except that they run inside a loop over the results of a first query. The tables contain both numbers and strings and are thousands, if not hundreds of thousands, of rows long, so the total number of queries can be quite large.
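For reference, my current code is structured roughly like this (a minimal RPostgreSQL sketch; the table and column names are made up):

    library(RPostgreSQL)

    drv <- dbDriver("PostgreSQL")
    con <- dbConnect(drv, dbname = "mydb")  # connection details omitted

    # First query: the list of elements to process
    elements <- dbGetQuery(con, "SELECT id FROM elements")

    # One query per element inside the loop; this is where the time goes
    for (i in seq_len(nrow(elements))) {
      res <- dbGetQuery(con,
        paste0("SELECT name, value FROM measurements WHERE element_id = ",
               elements$id[i]))
      # ... R analysis on res ...
    }

    dbDisconnect(con)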

Obviously, I first wrote an R script using RPostgreSQL. However, it takes too long to be comfortable to use (I'd like to be able to modify and rerun it at any time). I have already optimized this script quite a bit, and system.time() shows me that most of the time is spent on the DB queries inside the loop.

Then, as I figured it would be much faster to use a text file as input for R, I decided to translate the R script into Python, using psycopg2. Unfortunately, the Python script is not much faster than the R one.

Finally, I started to write a C++ program using libpq, but I stopped when I realized it was not flexible enough (I mean, I would have to multiply the number of lines of my code by at least 3 or 4 in order to process the queries).

Thus I'm wondering which language (or maybe other R or Python libraries?) would provide the best compromise between speed and flexibility for PostgreSQL access, in terms of handling query results: lists, arrays, string manipulation, and so on. That is, it needs to be much faster than R + RPostgreSQL or Python + psycopg2, while remaining almost as "flexible".

Thanks for any suggestions (the language has to be Linux-friendly).


Update: here is a typical timing of the old versus the new code, using only the first 500 retrieved elements, after correcting the code for the N+1 issue as suggested by Ryan and others in the comments:

> system.time(source("oldcode.R"));
   user      system      elapsed  
  3.825       0.052      49.363 

> system.time(source("newcode.R"));
   user      system      elapsed 
  1.920       0.140       3.551 

The same for the first 1000 retrieved elements:

> system.time(source("oldcode.R"));
   user      system      elapsed  
  9.816       0.092     100.340 

> system.time(source("newcode.R"));
   user      system      elapsed 
  5.040       0.072       6.695 

Probably worth a change indeed. ;-)
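Roughly, the rewrite follows the pattern suggested in the comments: one large query up front, then the loop runs over the in-memory result. A sketch, with the same made-up table names as in the first snippet (the real code differs):

    # One big query instead of N+1 small ones
    all_data <- dbGetQuery(con, "
      SELECT e.id, m.name, m.value
      FROM elements e
      JOIN measurements m ON m.element_id = e.id")

    # Loop in R over the in-memory result, grouped by element
    for (chunk in split(all_data, all_data$id)) {
      # ... same R analysis as before, on one element's rows ...
    }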

5 Comments

  • Search for and read about "the N+1 selects problem". Commented Aug 8, 2012 at 23:33
  • Indeed, it seems I've fallen into this pitfall. So if I understand correctly, I should assume that the DB is much slower than the programming language I use to analyze the data I get from it. That is, instead of N+1 requests to the DB, I should make only one large request that dumps all the data I need, then loop within the code over the object that now contains all this data. Commented Aug 19, 2012 at 14:49
  • Correct. More precisely, assume that the DB has high latency (relative to your programming language); that's why N one-row queries are more expensive than one N-row query. Commented Aug 19, 2012 at 18:31
  • I've updated the question with estimated performance before/after the improvement you suggested. Thanks again, it was definitely worth it. Commented Sep 18, 2012 at 20:47
  • Thanks for the update. Those numbers are pretty persuasive. Commented Sep 18, 2012 at 23:53

1 Answer


To make any interface to the database go fast, optimize your database queries. As you discovered, even with your optimized R code, the majority of the time was spent in the DB. So pick the programming language you are most familiar and comfortable with; that will be the fastest you can go as far as the front end is concerned.

However, the overall result (in terms of perceived performance) will be the same no matter which programming language you use. There is no library that can increase the speed of your queries, as that is purely a function of the database. All the library/language lets you do is combine multiple queries into a single transaction, but the results of the queries still depend on your database layout and optimization.
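For example, with RPostgreSQL, batching statements into a single transaction is just a matter of sending BEGIN and COMMIT through the driver (a rough sketch with a placeholder table; this saves per-statement commit overhead on writes, it does not speed up the queries themselves):

    # Wrap several statements in one transaction so the commit
    # overhead is paid once, not once per statement
    dbGetQuery(con, "BEGIN")
    dbGetQuery(con, "INSERT INTO results (id, score) VALUES (1, 0.5)")
    dbGetQuery(con, "INSERT INTO results (id, score) VALUES (2, 0.7)")
    dbGetQuery(con, "COMMIT")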

Simple things such as missing indexes on columns can have a big impact.

Start by running EXPLAIN ANALYZE on your query, and paste the result into a query-plan visualization tool to see what the database is doing, so you know where to start optimizing.
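For instance (a sketch reusing the placeholder names from the question; the index name is made up):

    # Look at the actual plan and timings for one of the inner queries
    plan <- dbGetQuery(con,
      "EXPLAIN ANALYZE SELECT name, value FROM measurements WHERE element_id = 42")
    cat(plan[[1]], sep = "\n")

    # If the plan shows a sequential scan on the filtered column,
    # a simple index often fixes it:
    dbGetQuery(con,
      "CREATE INDEX measurements_element_id_idx ON measurements (element_id)")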


2 Comments

Well, you are assuming that the interfaces between programming languages and the database are optimal, and that their speed is so much higher than the DB's that time differences between languages are negligible compared with the time spent in the DB. I trust you on the second point, but I actually asked the question because I had read about performance issues with some R DB interfaces, so the first point may not always hold (however, that was an old thread, so I guess it is obsolete by now).
Anyway, thanks for your reply; you're probably right that I was looking in the wrong direction (as suggested by Ryan, the N+1 problem is probably the real one). I also didn't know about EXPLAIN ANALYZE, and although I cannot use it on the current queries (it would make no sense, as they are simple SELECTs), I'll keep it in mind for later.
