
Using PostgreSQL 9.4, we have a simple contacts table with (id text not null primary key, blob json) to experiment with porting a CouchDB CRM database. We will eventually split the data out into more columns and handle it more idiomatically for an RDBMS, but that's beside the point for the time being.

There are approximately 100k rows.
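
The exact DDL isn't in the question, but a minimal sketch of the setup described above might look like this (the synthetic data generation is an assumption for illustration):

CREATE TABLE couchcontacts (
  id   text NOT NULL PRIMARY KEY,
  blob json
);

-- Roughly 100k rows of synthetic CouchDB-style documents:
INSERT INTO couchcontacts (id, blob)
SELECT i::text, json_build_object('firstName', 'name-' || i)
FROM generate_series(1, 100000) AS i;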

I am aware that hardcore PostgreSQL performance experts advise against using OFFSET; however, I can accept a small performance penalty (happy with anything under 100 ms).

SELECT id FROM couchcontacts OFFSET 10000 LIMIT 10 

As expected, this takes < 10 ms.

SELECT blob->>'firstName' FROM couchcontacts LIMIT 10 

Also takes < 10 ms (presumably 10 JSON decode operations on the blob column here).

SELECT blob->>'firstName' FROM couchcontacts OFFSET 10000 LIMIT 10 

Takes upwards of 10 seconds! The noted inefficiencies of OFFSET aside, why is this presumably causing 10,010 JSON decode operations? Since the projection has no side effects, I don't understand why this can't be fast.

Is this a limitation of the JSON functionality being relatively new to Postgres, such that the planner can't determine that the ->> operator doesn't yield side effects?

Interestingly, rewriting the query as follows brings it back under 10 milliseconds:

SELECT blob->>'firstName' FROM couchcontacts WHERE id IN (SELECT id FROM couchcontacts OFFSET 10000 LIMIT 10)

Is there a way to ensure OFFSET doesn't JSON-decode the skipped records (i.e. doesn't evaluate the SELECT projection for them)?
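
Update: EXPLAIN ANALYZE output for the slow query, i.e.

EXPLAIN ANALYZE SELECT blob->>'firstName' FROM couchcontacts OFFSET 10000 LIMIT 10;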

"Limit  (cost=1680.31..1681.99 rows=10 width=32) (actual time=12634.674..12634.842 rows=10 loops=1)"
"  ->  Seq Scan on couchcontacts  (cost=0.00..17186.53 rows=102282 width=32) (actual time=0.088..12629.401 rows=10010 loops=1)"
"Planning time: 0.194 ms"
"Execution time: 12634.895 ms"
  • That does look like a rough corner of this new feature. Please report it as a bug to PostgreSQL. I guess the workaround is manual paging, just like you showed yourself. Commented Dec 11, 2014 at 4:27
  • EXPLAIN ANALYZE please? I'm not totally convinced by the explanation of the discrepancy. Did you profile / perf top / etc. to see if your hypothesized explanation fits the observed behaviour? Though on second thoughts... I think that if you request a result set with an offset, PostgreSQL should evaluate expressions in discarded rows unless it can prove they have no side effects. So maybe it is evaluating the json expressions... and arguably it should be, unless it can prove they can't abort the query with an ERROR or change database state. Commented Dec 11, 2014 at 6:40
  • Given ->> is a built-in operator, shouldn't PostgreSQL know it isn't causing side effects? Is there a way to hint that it is side-effect-free? Updated with EXPLAIN ANALYZE. Commented Dec 11, 2014 at 6:57
  • @CraigRinger: I just ran a few tests, and I actually found similar effects. If you select a field, or if you select *, the performance is roughly the same. But when you select foo->>'bar', it becomes slightly slower with empty json data ('{}'), slower when the field actually exists ('{"bar":0}'), and increasingly slower as you make the json larger. It's basically behaving as if it's unserializing the json for each row when the operator is used. Commented Dec 11, 2014 at 8:11
  • Someone asked (and then removed the answer, possibly @Denis?) whether a non-indexed, non-JSON field behaves the same. There is another non-indexed field on the table called version (int not null): SELECT couchcontacts."version" FROM couchcontacts OFFSET 10000 LIMIT 10 returns in 14 ms, vs. blob->>'field' taking 4,158 ms to 12,000 ms. Commented Dec 11, 2014 at 8:15

1 Answer


I ran a few tests, and I'm seeing similar behavior. Each of these has an immaterial difference in performance:

  • select id ...
  • select indexed_field ...
  • select unindexed_field ...
  • select json_field ...
  • select * ...

This one, however, does show a difference in performance:

  • select json_field->>'key' ...

When the json_field is null, the performance impact is negligible. When it's empty, it degrades things very slightly. When it's filled in, it degrades noticeably. And when the field is loaded with larger data, it degrades materially.
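
As a rough sketch of the kind of test that reproduces this (the table name t here is hypothetical; foo and 'bar' mirror the examples in the comments above):

-- hypothetical test table; only the json payload matters
create table t (id serial primary key, foo json);

-- ~100k rows; swap the literal for null, '{}', or a larger document to compare
insert into t (foo)
select '{"bar":0}'::json
from generate_series(1, 100000);

-- fast: no json operator in the projection
explain analyze select foo from t offset 10000 limit 10;

-- slows down as the json grows once ->> is applied
explain analyze select foo->>'bar' from t offset 10000 limit 10;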

In other words, Postgres seems to want to unserialize the json data for every row it visits. (Which is probably a bug, and one that massively affects RoR developers, given how they use json.)

FWIW, I noted that rearranging the query to use a CTE works around the problem (my_table and json_field being placeholders for your table and json column):

with data as (
  select * from my_table offset 10000 limit 10
)
select json_field->>'key' from data;

(It might get an only-very-slightly better plan than the id IN (...) query you highlighted. On 9.4 a CTE is an optimization fence, so the ->> projection is only evaluated on the ten rows the CTE emits.)


3 Comments

Thanks for the tests. Just to add: we are not using RoR, but Clojure. As many developers are now pushing Postgres as a NoSQL replacement (blogs, conferences, microbenchmarks), I'm sure these scenarios will become more common across a multitude of stacks.
Yeah. Just in case, please take a minute to report the issue to the pg-performance or the pg-bugs list. My guess is this is an oversight in the code base: the various json functions related to reading are immutable insofar as I can tell, so I fail to see any reason to bother with evaluating them when discarding rows.
I have reported it to pg-bugs and will wait to see what happens (it's a moderated list). I'll leave this marked unanswered for a day or two in case, for some odd reason, it's by design.
