
I need to make a huge number of SQL queries that update or insert rows using Psycopg2. There are no other queries being run intermediately. Example with a table A having columns name and value:

-- Basically models a list of strings and how many times they "appear"
-- 'foo' is some random value each time, sometimes repeating
insert into A (name, value) select 'foo', 0
    where not exists (select 1 from A where name = 'foo' limit 1);
update A set value = value + 1 where name = 'foo';
-- ... and many more just like this
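For reference, a minimal sketch of issuing these statements one at a time with Psycopg2, which is the baseline being described; the connection parameters and the names being counted are placeholders:

import psycopg2

conn = psycopg2.connect(host="db-server", dbname="mydb", user="me")  # assumed
cur = conn.cursor()

def bump(name):
    # Parameterized versions of the two statements above.
    cur.execute(
        "insert into A (name, value) select %s, 0 "
        "where not exists (select 1 from A where name = %s limit 1)",
        (name, name))
    cur.execute("update A set value = value + 1 where name = %s", (name,))

for name in ("foo", "bar", "foo"):   # many more in practice
    bump(name)
conn.commit()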

This is just an example of one type of query I'm running; I'm doing other things too. I'm not looking for a solution that involves reworking my SQL queries.

It's really slow, with Postgres (which is running on another server) bottlenecking it. I've tried various things to make it faster.

  • It was unbearably slow if I committed after every query.
  • It was a bit faster if I didn't connection.commit() until the end. This seems to be what the Psycopg2 documentation suggests I do. Postgres was still bottlenecking horribly on disk access.
  • It was much faster if I used cursor.mogrify() instead of cursor.execute(), stored all the queries in a big list, joined them at the end into one massive query (literally ";".join(qs)), and ran it. Postgres was using 100% CPU, a good sign, since it means there was essentially no disk bottleneck. But that sometimes caused the postgres process to use up all my RAM, start page faulting, and then get bottlenecked on disk access forever, which was a disaster. I've set all the memory limits for Postgres to reasonable values using pgtune, but I'm guessing Postgres is allocating a bunch of work buffers with no limit and going over them.
  • I've tried the above solution, except committing every 100,000 or so queries to avoid overloading the server, but that's not a perfect solution. It's what I've got for now; it seems like a ridiculous hack and is still slower than I'd like. (A rough sketch of this batched approach follows this list.)
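A minimal sketch of the mogrify/join/periodic-commit approach described above, assuming the same A(name, value) table; the batch size, connection details, and the names iterable are placeholders:

import psycopg2

conn = psycopg2.connect(host="db-server", dbname="mydb", user="me")  # assumed
cur = conn.cursor()

BATCH = 100_000   # commit every ~100,000 statements (assumed threshold)
qs = []

def queue(sql, params):
    # mogrify() returns the statement as bytes with the parameters bound in.
    qs.append(cur.mogrify(sql, params))
    if len(qs) >= BATCH:
        flush()

def flush():
    global qs
    if qs:
        cur.execute(b";".join(qs))   # one big multi-statement query
        conn.commit()
        qs = []

for name in names:   # names: the stream of values being counted (placeholder)
    queue("insert into A (name, value) select %s, 0 "
          "where not exists (select 1 from A where name = %s limit 1)",
          (name, name))
    queue("update A set value = value + 1 where name = %s", (name,))
flush()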

Is there some other way I should try involving Psycopg2?

1 Answer


Sounds like you have a lot of issues here. The first is that Postgres should not page fault unless it is improperly configured or you are running other services on the same machine. A properly configured Postgres instance will use your memory, but it won't page fault.

If you need to insert or update hundreds of thousands of things at a time, you definitely do not want to do that one transaction at a time; as you noted, that will be very slow. In your first example, you are sending each query to the database over the network, waiting for the result, then committing and waiting for that result, once again over the network.

Stringing multiple statements together saves you the one commit per statement and the back-and-forth network traffic, which is why you saw significantly faster performance.

You can take the stringing together a step further: use COPY if you are doing inserts, or use a multi-row VALUES list instead of individual INSERT or UPDATE statements.
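For example, a minimal sketch of both options with Psycopg2; the table A(name, value), the connection details, and the rows data are assumptions:

import io
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(host="db-server", dbname="mydb", user="me")  # assumed
cur = conn.cursor()
rows = [("foo", 1), ("bar", 3)]   # placeholder data

# Option 1: COPY from an in-memory tab-separated buffer
buf = io.StringIO("".join("%s\t%d\n" % r for r in rows))
cur.copy_from(buf, "a", columns=("name", "value"))

# Option 2: one INSERT with a multi-row VALUES list
execute_values(cur, "insert into A (name, value) values %s", rows)

conn.commit()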

The real problem is a design flaw in what you are doing. From the looks of your query, you are implementing a counter in your database. If you are only going to count a few hundred things here and there, no big deal, but once you get into the hundreds of thousands or more, it won't work well.

This is where tools like memcached and Redis come in. Both offer excellent, very fast in-memory counters. (If you only have one server, you could just implement a counter in your code.) Once you have things counted, just create a process that saves the counts to the database and clears the in-memory counters.
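As an illustration, a minimal sketch of the counter-in-your-code variant, flushed to the database periodically; the flush threshold, connection details, and the names stream are assumptions:

from collections import Counter
import psycopg2

conn = psycopg2.connect(host="db-server", dbname="mydb", user="me")  # assumed
counts = Counter()

def flush_counts():
    # Push accumulated counts to the table, then reset the in-memory counter.
    with conn.cursor() as cur:
        for name, n in counts.items():
            cur.execute(
                "insert into A (name, value) select %s, 0 "
                "where not exists (select 1 from A where name = %s)",
                (name, name))
            cur.execute(
                "update A set value = value + %s where name = %s",
                (n, name))
    conn.commit()
    counts.clear()

processed = 0
for name in names:   # names: the stream being counted (placeholder)
    counts[name] += 1
    processed += 1
    if processed >= 100_000:   # assumed flush threshold
        flush_counts()
        processed = 0
flush_counts()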


2 Comments

Forgot that I had COPY at my disposal! I'm working on doing this right now. I'm not using memcached, but I'm building up lots of data in Python dicts (until RAM gets short), COPYing them into a temporary table in my database, then merging the temporary table with the permanent one in a single UPDATE query. I also have some more complex aggregate functions, so I had to do some math to figure out how to merge.
Also, this wouldn't work if my data couldn't be merged, maybe because of some recursive function. For example, if I were keeping an average of the difference between the new value and the last value... Luckily, this isn't the case.
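A minimal sketch of the COPY-into-a-temp-table-then-merge approach the first comment describes, assuming a permanent table A(name, value); the staging table, sample data, and connection details are placeholders, and an extra INSERT handles names not yet present in A:

import io
import psycopg2

conn = psycopg2.connect(host="db-server", dbname="mydb", user="me")  # assumed
counts = {"foo": 7, "bar": 2}   # placeholder for the dict built up in Python

with conn.cursor() as cur:
    # Stage the counts in a temporary table via COPY
    cur.execute("create temp table staging (name text, value bigint) on commit drop")
    buf = io.StringIO("".join("%s\t%d\n" % kv for kv in counts.items()))
    cur.copy_from(buf, "staging", columns=("name", "value"))

    # Merge: add staged counts to existing rows...
    cur.execute("""
        update A set value = A.value + s.value
        from staging s where A.name = s.name
    """)
    # ...and insert rows that don't exist yet
    cur.execute("""
        insert into A (name, value)
        select s.name, s.value from staging s
        where not exists (select 1 from A where A.name = s.name)
    """)
conn.commit()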
