
I have a large table with millions of rows and a column of type JSONB/HSTORE that contains many fields (hundreds). For illustration, I use the following smaller and less complex tables:

-- table with HSTORE column
CREATE TABLE test_hstore (id BIGSERIAL PRIMARY KEY, data HSTORE);
INSERT INTO test_hstore (data)
SELECT hstore(
    '  key_1=>' || trunc(2 * random()) ||
    ', key_2=>' || trunc(2 * random()) ||
    ', key_3=>' || trunc(2 * random()))
FROM generate_series(0, 9999999) i;

-- table with JSONB column
CREATE TABLE test_jsonb (id BIGSERIAL PRIMARY KEY, data JSONB);
INSERT INTO test_jsonb (data)
SELECT (
    '{ "key_1":' || trunc(2 * random()) ||
    ', "key_2":' || trunc(2 * random()) ||
    ', "key_3":' || trunc(2 * random()) || '}')::JSONB
FROM generate_series(0, 9999999) i;
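
For reference, a row of each table looks like this (illustrative; the actual values vary because of random()):

SELECT * FROM test_hstore LIMIT 1;
-- id |                   data
------+-------------------------------------------
--  1 | "key_1"=>"0", "key_2"=>"1", "key_3"=>"0"

SELECT * FROM test_jsonb LIMIT 1;
-- id |                data
------+---------------------------------------
--  1 | {"key_1": 0, "key_2": 1, "key_3": 0}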

I would like to simply SELECT one or more fields within the data column, without any WHERE clause. Performance decreases as the number of selected fields increases:

EXPLAIN ANALYSE
SELECT id FROM test_hstore;
--Seq Scan on test_hstore  (cost=0.00..213637.56 rows=10000056 width=8) (actual time=0.049..3705.852 rows=10000000 loops=1)
--Planning time: 0.419 ms
--Execution time: 5445.654 ms

EXPLAIN ANALYSE
SELECT data FROM test_hstore;
--Seq Scan on test_hstore  (cost=0.00..213637.56 rows=10000056 width=56) (actual time=0.083..2424.334 rows=10000000 loops=1)
--Planning time: 0.082 ms
--Execution time: 3856.972 ms

EXPLAIN ANALYSE
SELECT data->'key_1' FROM test_hstore;
--Seq Scan on test_hstore  (cost=0.00..238637.70 rows=10000056 width=32) (actual time=0.122..3263.937 rows=10000000 loops=1)
--Planning time: 0.052 ms
--Execution time: 5390.803 ms


EXPLAIN ANALYSE
SELECT data->'key_1', data->'key_2' FROM test_hstore;
--Seq Scan on test_hstore  (cost=0.00..263637.84 rows=10000056 width=64) (actual time=0.089..3621.768 rows=10000000 loops=1)
--Planning time: 0.051 ms
--Execution time: 5334.452 ms

EXPLAIN ANALYSE
SELECT data->'key_1', data->'key_2', data->'key_3' FROM test_hstore;
--Seq Scan on test_hstore  (cost=0.00..288637.98 rows=10000056 width=96) (actual time=0.086..4291.111 rows=10000000 loops=1)
--Planning time: 0.067 ms
--Execution time: 6375.229 ms

The same trend, even more pronounced, appears for the JSONB column type:

EXPLAIN ANALYSE
SELECT id FROM test_jsonb;
--Seq Scan on test_jsonb  (cost=0.00..233332.28 rows=9999828 width=8) (actual time=0.028..4009.841 rows=10000000 loops=1)
--Planning time: 0.878 ms
--Execution time: 5867.604 ms

EXPLAIN ANALYSE
SELECT data FROM test_jsonb;
--Seq Scan on test_jsonb  (cost=0.00..233332.28 rows=9999828 width=68) (actual time=0.074..2371.212 rows=10000000 loops=1)
--Planning time: 0.061 ms
--Execution time: 3787.308 ms

EXPLAIN ANALYSE
SELECT data->'key_1' FROM test_jsonb;
--Seq Scan on test_jsonb  (cost=0.00..258331.85 rows=9999828 width=32) (actual time=0.106..4677.026 rows=10000000 loops=1)
--Planning time: 0.066 ms
--Execution time: 6382.469 ms

EXPLAIN ANALYSE
SELECT data->'key_1', data->'key_2' FROM test_jsonb;
--Seq Scan on test_jsonb  (cost=0.00..283331.42 rows=9999828 width=64) (actual time=0.094..6888.904 rows=10000000 loops=1)
--Planning time: 0.047 ms
--Execution time: 8593.060 ms

EXPLAIN ANALYSE
SELECT data->'key_1', data->'key_2', data->'key_3' FROM test_jsonb;
--Seq Scan on test_jsonb  (cost=0.00..308330.99 rows=9999828 width=96) (actual time=0.173..9567.699 rows=10000000 loops=1)
--Planning time: 0.171 ms
--Execution time: 11262.135 ms

This becomes even more pronounced when the table contains many more fields. Is there a workaround?

Adding a GIN index doesn't seem to help:

CREATE INDEX ix_test_hstore ON test_hstore USING GIN (data);
EXPLAIN ANALYSE
SELECT data->'key_1', data->'key_2', data->'key_3' FROM test_hstore;
--Seq Scan on test_hstore  (cost=0.00..288637.00 rows=10000000 width=96) (actual time=0.045..4650.447 rows=10000000 loops=1)
--Planning time: 2.100 ms
--Execution time: 6746.631 ms

CREATE INDEX ix_test_jsonb ON test_jsonb USING GIN (data);
EXPLAIN ANALYSE
SELECT data->'key_1', data->'key_2', data->'key_3' FROM test_jsonb;
--Seq Scan on test_jsonb  (cost=0.00..308334.00 rows=10000000 width=96) (actual time=0.149..9807.012 rows=10000000 loops=1)
--Planning time: 0.131 ms
--Execution time: 11739.948 ms
  • An index will never be used if you don't have a WHERE clause and retrieve all rows of the table. Commented Jan 18, 2017 at 18:56
  • I was hoping that the index would also be used to access fields within the column. Do you know how access to key-value pairs in JSONB/HSTORE columns is implemented internally? Is the content of each column scanned to retrieve the values of particular keys? Commented Jan 19, 2017 at 9:37

1 Answer


There's actually not much you can do to improve access to one key within an hstore value, or to a property of a JSONB document (which could be an array, a string, or a number; that extra flexibility may be why retrieving a property from JSONB is more expensive than retrieving a value from an hstore).

An index could help you if you need to use data->'key_1' in a WHERE clause, but it will not make retrieving the property from data any cheaper.
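
For instance (a sketch, not measured output from the original post; it assumes the ix_test_jsonb index created in the question, and with only two distinct values per key the planner may well still choose a sequential scan):

EXPLAIN ANALYSE
SELECT id FROM test_jsonb WHERE data @> '{"key_1": 1}';
-- Plan shape one would hope for (illustrative):
-- Bitmap Heap Scan on test_jsonb
--   Recheck Cond: (data @> '{"key_1": 1}'::jsonb)
--   ->  Bitmap Index Scan on ix_test_jsonb
--         Index Cond: (data @> '{"key_1": 1}'::jsonb)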

The best course of action, if you always (or frequently) use a certain key_1, is to normalise your data and add a dedicated column key_1. If your data source makes it easy to store data but not so easy to store key_1 separately, you can have a trigger populate the key_1 column from the value of data on INSERT or UPDATE:

CREATE TABLE test_jsonb 
(
    id BIGSERIAL PRIMARY KEY, 
    data JSONB, 
    key_1 integer
);

-- Trigger function: extracts key_1 from the JSONB document
-- and stores it in the plain key_1 column.
CREATE OR REPLACE FUNCTION ins_upd_test_data()
RETURNS trigger AS
$$
BEGIN
    new.key_1 := (new.data->>'key_1')::integer;
    RETURN new;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER ins_upd_test_jsonb_trigger 
    BEFORE INSERT OR UPDATE OF data
    ON test_jsonb FOR EACH ROW
    EXECUTE PROCEDURE ins_upd_test_data();

This way, you can retrieve key_1 with the same efficiency as id.
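
As a quick check (illustrative, not from the original answer; it assumes the table and trigger defined above), the trigger fills key_1 automatically:

INSERT INTO test_jsonb (data) VALUES ('{"key_1": 1, "key_2": 0, "key_3": 1}');

SELECT id, key_1 FROM test_jsonb ORDER BY id DESC LIMIT 1;
-- key_1 comes back as a plain integer column; no JSONB decoding per row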


2 Comments

Keys are determined from user input and are not known beforehand. It is also not known which keys might be accessed more frequently than others. In addition, the number of accessed keys varies greatly between queries. I'm afraid this approach won't help me.
@mdh: Then, I'm afraid, there isn't much to be done. You can use GIN indices to speed up some searches, but you won't improve access to specific pieces of the data once the row has been accessed.
